Forecasting hurricane movements - A machine learning approach
Felix Kroeber - Introduction to data science & machine learning (winter term 2021/22)
# general setup
import geopandas as gpd
import matplotlib.pyplot as plt
import multiprocess
import numpy as np
import os
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import warnings
from datetime import datetime, timedelta
from dtreeviz.trees import dtreeviz
from geographiclib.geodesic import Geodesic
from IPython.core.interactiveshell import InteractiveShell
from IPython.display import Image, HTML
from ipywidgets import widgets, Layout
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn import tree
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import RandomizedSearchCV
%matplotlib inline
InteractiveShell.ast_node_interactivity = "all"
plt.rcParams['figure.figsize'] = [12, 7.5]
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
pd.set_option('display.precision', 3)
pd.options.mode.chained_assignment = None
# read hurricane data set
workdir = "C:\\Users\\felix\\OneDrive\\Studium_Master\\electives\\Intro_to_ML\\end_of_term"
hurricanes = pd.read_csv(os.path.join(workdir, "atlantic.csv"))
hurricanes.head()
hurricanes.tail()
hurricanes.info()
# transform lat/long into numeric values
def coord_converter(coord_str):
if any(x in coord_str for x in ["N", "E"]):
return float(coord_str[:-1])
if any(x in coord_str for x in ["S", "W"]):
return float("-" + coord_str[:-1])
hurricanes["Longitude"] = hurricanes["Longitude"].apply(coord_converter)
hurricanes["Latitude"] = hurricanes["Latitude"].apply(coord_converter)
# transform date and time into datetime
hurricanes['Date'] = [str(date) for date in hurricanes['Date']]
hurricanes['Time'] = [str(date) for date in hurricanes['Time']]
hurricanes['Time'] = hurricanes['Time'].str.pad(width=4, fillchar='0')
hurricanes.insert(2,
"Datetime",
[datetime.strptime(dt[0]+dt[1], "%Y%m%d%H%M") for dt in zip(hurricanes['Date'], hurricanes['Time'])])
hurricanes.drop(columns=['Date', 'Time'], inplace=True)
# re-transform into int objects
hurricanes.insert(3, "Year", [x.year for x in hurricanes.Datetime])
hurricanes.insert(4, "Month", [x.month for x in hurricanes.Datetime])
hurricanes.insert(5, "Day", [x.day for x in hurricanes.Datetime])
hurricanes.insert(6, "Hour", [x.hour for x in hurricanes.Datetime])
# remove white spaces in front of names
hurricanes["Name"] = hurricanes["Name"].str.lstrip()
# encode NaNs
hurricanes.replace(["", -99, -999], np.nan, inplace=True)
hurricanes.replace(r'^\s*$', np.nan, regex=True, inplace=True)
# remove duplicates
hurricanes.drop_duplicates(inplace=True)
On average, there are about 29 records per hurricane. The longest track record of a single hurricane dates back to 1899 and contains 133 entries.
Nine different status codes are used to characterise the condition of a hurricane. The meaning of the abbreviations can be taken from the data documentation (the most frequent code, TS, for example, denotes a tropical cyclone of tropical storm intensity (34-63 knots)). Essentially, the status is a categorical variable that combines information already contained in the latitude and wind intensity variables.
There are also nine different values for the event variable, which encodes events such as landfall (L). Note that most events contain information that can only be determined post hoc (e.g. an intensity peak or a minimum in central pressure). Since only special events are recorded, values for the event variable exist in slightly fewer than 1000 cases.
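The link between the status codes and the wind intensity can be illustrated directly on the data. A quick sketch (reusing the column names introduced above) summarises the maximum wind per status code and counts the recorded special events:
# sketch: wind intensity per status code and frequency of the recorded special events
hurricanes.groupby("Status")["Maximum Wind"].describe()
hurricanes["Event"].value_counts()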
The main part of the data set describes hurricanes in the 20th century. About half of all entries refer to the 100-year period 1850-1950, while the other half of the entries were recorded in the following 65 years.
All locations where hurricanes have been recorded lie in the northern hemisphere, with a large proportion in the tropical to subtropical region. With regard to longitude, there are obviously some erroneous entries: the minimum value corresponds to a coordinate that does not exist, and the maximum value of 63 degrees is highly improbable given its distance from the Atlantic Ocean.
While entries for the maximum wind speed exist for almost all data points, the minimum pressure in the hurricane system was recorded for only slightly more than a third of the entries. Both variables are otherwise inconspicuous; the reported minimum and maximum values appear plausible.
All variables describing the wind radii have values for only about 5900 entries. Including these variables as features in the following analyses would therefore mean excluding a large part of the data.
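The completeness figures quoted above can be verified with a short summary (a sketch based on the hurricanes frame as prepared so far):
# sketch: records per hurricane track and share of missing values per variable
track_lengths = hurricanes["ID"].value_counts()
print(f"mean records per hurricane: {track_lengths.mean():.1f}, longest track: {track_lengths.max()}")
hurricanes.isna().mean().round(3).sort_values()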
# print basic summary stats for all vars
hurricanes.describe(include=object)
hurricanes.describe(datetime_is_numeric=True)
The findings so far suggest that the following data processing steps are sensible. All of these steps effectively reduce the initial data set to the data points for which values for the variables of interest are available.
# keep only more recent records due to larger inaccuracies in entries before the mid-20th century
hurricanes = hurricanes[hurricanes["Datetime"] > datetime.strptime("19500101", "%Y%m%d")]
# only keep entries for which at least 5 data points are available
long_track_hurricanes = hurricanes['ID'].value_counts()[hurricanes['ID'].value_counts() >= 5]
hurricanes = hurricanes[hurricanes['ID'].isin(long_track_hurricanes.index.tolist())]
# only retain entries for which pressure is available
hurricanes = hurricanes[~hurricanes['Minimum Pressure'].isna()]
# remove wind radii vars with lot of NaN values
hurricanes = hurricanes.loc[:,:"Minimum Pressure"]
# remove events as they are not useful
hurricanes.drop(columns="Event", inplace=True)
# Plotting time intervals between records
hurricanes_grouped = hurricanes.groupby("ID")
res = []
for name, group in hurricanes_grouped:
delta_t = (group["Datetime"]-group["Datetime"].shift()).iloc[1:]
res.append(delta_t)
res = [item for sublist in res for item in sublist]
res = [x/pd.Timedelta(minutes=60) for x in res]
plt.hist(res, bins=30, range=(0,30))
plt.ylabel('Frequency')
plt.xlabel('Time interval (h)');
# remove all entries not corresponding to the synoptic times 00,06,12,18
syn_times_idx = [dt.strftime("%H%M") in ["0000", "0600", "1200", "1800"] for dt in hurricanes["Datetime"]]
hurricanes = hurricanes[syn_times_idx]
# analyse temporal patterns in interval times
hurricanes_grouped = hurricanes.groupby("ID")
hurricanes["Time interval"] = hurricanes_grouped["Datetime"].transform(lambda x: x - x.shift())
_hurricanes = hurricanes.dropna(subset=["Time interval"])
# plotting time intervals between records
hurricanes_grouped = _hurricanes.groupby([x for x in 5 * np.floor(_hurricanes["Year"]/5)])
hurricanes_vis = pd.DataFrame({
"n_data_points" : hurricanes_grouped["ID"].count(),
"p_long_intervals" : hurricanes_grouped["Time interval"].agg(lambda x: sum(x != pd.Timedelta(hours=6))/len(x))
})
fig, ax1 = plt.subplots()
# plotting number of data points on left axis
ax1.bar(hurricanes_vis.index, "n_data_points", data = hurricanes_vis, width = 1.5)
ax1.set_ylabel("number of data points")
ax1.set_xticks(ticks = [*[x-2.5 for x in hurricanes_vis.index], hurricanes_vis.index[-1]+2.5],
labels = [int(x) for x in [*hurricanes_vis.index, hurricanes_vis.index[-1]+5]],
rotation = 45)
ax1.set_ylim(0)
# plotting percentage of intervals longer than 6 hours on right axis
ax2 = ax1.twinx()
ax2.plot(hurricanes_vis.index, hurricanes_vis.p_long_intervals, color="darkred")
ax2.set_ylabel("\n percentage of intervals > 6 hours")
ax2.set_ylim(0, 1);
# keep only records from 1980 onwards & remove tracks that still contain gaps larger than 6 hours
hurricanes = hurricanes[hurricanes["Datetime"] > datetime.strptime("19800101", "%Y%m%d")]
rm_ids = hurricanes[[x > pd.Timedelta(hours=6) for x in hurricanes["Time interval"]]]["ID"]
hurricanes = hurricanes[~hurricanes["ID"].isin(rm_ids)]
print(f"{len(rm_ids)} hurricane tracks removed due to time gaps larger than 6 hours")
# interactive plotting of trajectories
warnings.filterwarnings("ignore")
# create user interface inputs
years = widgets.IntRangeSlider(
value=[1900, 2100],
min=min([x.year for x in hurricanes["Datetime"]]),
max=max([x.year for x in hurricanes["Datetime"]]),
step=1,
description='Years (inclusive):',
style = {'description_width': 'initial'},
layout=Layout(width='50%')
)
months_dt = [datetime.strptime(str(i), "%m") for i in range(1,13)]
options = [(i.strftime('%b'), i) for i in months_dt]
months = widgets.SelectionRangeSlider(
options=options,
index=(0,11),
description='Months (inclusive):',
style = {'description_width': 'initial'},
layout=Layout(width='40%')
)
hurricane_ids = list(hurricanes.ID.unique())
id_select = widgets.SelectMultiple(
options=hurricane_ids,
value=hurricane_ids,
description='Id select:',
rows=10,
style = {'description_width': 'initial'}
)
popup_vars = widgets.SelectMultiple(
options=list(hurricanes.select_dtypes(exclude=['datetime', 'timedelta']).columns),
value=["Maximum Wind", "Minimum Pressure"],
description='Popup variables:',
rows=10,
style = {'description_width': 'initial'}
)
numeric_vars = list(hurricanes.select_dtypes(include=['number']).columns)
numeric_vars.insert(0,None)
categorial_vars = list(hurricanes.select_dtypes(include=['object']).columns)
size_var = widgets.Dropdown(
options=numeric_vars,
description='Marker size:',
style = {'description_width': 'initial'}
)
color_var_numerical = widgets.Dropdown(
options=numeric_vars,
description='Marker color 1st map:',
value="Maximum Wind",
style = {'description_width': 'initial'}
)
color_var_categorial = widgets.Dropdown(
options=categorial_vars,
description='Marker color 2nd map:',
value="Name",
style = {'description_width': 'initial'}
)
basemap = widgets.RadioButtons(
options=['osm_base', 'osm_topographic'],
value='osm_base'
)
out_info = widgets.Output()
# create function to define two interlinked maps with their initial settings
# two maps only differ in their vis settings w.r.t. marker color
# two maps need to be defined due to different layer stack composition (numeric vs. categorical vis var)
def hurrican_map(color_var):
fig = go.FigureWidget(data = px.scatter_mapbox(
data_frame = hurricanes,
lat="Latitude",
lon="Longitude",
color=color_var,
hover_data={"Latitude": False,
"Longitude": False,
"Maximum Wind": True,
"Minimum Pressure": True},
width=975,
height=700,
center=go.layout.mapbox.Center(lat=47.5, lon=-60),
zoom=1.75))
fig = fig.update_layout(mapbox_style="open-street-map")
return fig
map_I = hurrican_map("Maximum Wind")
map_II = hurrican_map("Name")
# define update (backend) procedures
class map_update:
# initialisation procedure - set df & map to apply updates
def __init__(self, map):
self.df = hurricanes
self.map = map
def df_update(self):
# get selection of years
start_year = str(years.value[0]) + "0101"
end_year = str(years.value[1]+1) + "0101"
# get selection of months
start_month = months.value[0].month
end_month = months.value[1].month+1
# apply temporal subsetting
self.df = self.df[self.df["Datetime"].between(datetime.strptime(start_year, "%Y%d%m"),
datetime.strptime(end_year, "%Y%d%m"))]
self.df = self.df[[x.month in range(start_month, end_month) for x in self.df["Datetime"]]]
# apply id subsetting
self.df = self.df[[x in id_select.value for x in self.df["ID"]]]
def map_vis_update(self, color_var):
# clear previous outputs
out_info.clear_output()
# get columns to display in popups
hover_vars = {}
for col in list(self.df.select_dtypes(exclude=['datetime', 'timedelta']).columns):
if col in popup_vars.value:
hover_vars[col] = True
else:
hover_vars[col] = False
# get size var
marker_size_var = size_var.value  # None simply disables size encoding
# configure fig based on selection
if len(self.df) != 0:
self.map_new = go.FigureWidget(data = px.scatter_mapbox(
data_frame = self.df,
lat="Latitude",
lon="Longitude",
size = marker_size_var,
color = color_var.value,
hover_data=hover_vars))
self.map_new = self.map_new.update_layout(
{'coloraxis': {'colorbar': {'title': {'text': color_var.value}}}}
)
else:
with out_info:
print("Selection yielded no data points. No map update was performed; a reload of the widget may be necessary.")
# return early so that the basemap update below does not reference a map that was never created
return
# update basemap
if basemap.value == "osm_base":
self.map_new = self.map_new.update_layout(
mapbox_style="open-street-map",
mapbox_layers=[
{
"below": 'traces',
"sourcetype": "raster",
"sourceattribution": "© OpenStreetMap contributors",
"source": [
"https://tile.openstreetmap.org/{z}/{x}/{y}.png"
]
}
])
else:
self.map_new = self.map_new.update_layout(
mapbox_style="white-bg",
mapbox_layers=[
{
"below": 'traces',
"sourcetype": "raster",
"sourceattribution": "© OpenStreetMap contributors",
"source": [
"https://tile.opentopomap.org/{z}/{x}/{y}.png"
]
}
])
# update old map using new properties
self.map.update(data = self.map_new.data,
layout = {'coloraxis': self.map_new.layout.coloraxis,
'mapbox': {'layers': self.map_new.layout.mapbox.layers,
'style': self.map_new.layout.mapbox.style}})
def response_map_I(change):
map_I_updater = map_update(map_I)
map_I_updater.df_update()
map_I_updater.map_vis_update(color_var_numerical)
def response_map_II(change):
map_II_updater = map_update(map_II)
map_II_updater.df_update()
map_II_updater.map_vis_update(color_var_categorial)
# observe changes
for response in [response_map_I, response_map_II]:
years.observe(response, names="value")
months.observe(response, names="value")
id_select.observe(response, names="value")
popup_vars.observe(response, names="value")
size_var.observe(response, names="value")
basemap.observe(response, names="value")
color_var_numerical.observe(response_map_I, names="value")
color_var_categorial.observe(response_map_II, names="value")
# configure layout & display figure
accordion = widgets.Accordion([widgets.VBox([years, months]),
widgets.VBox([id_select]),
widgets.HBox([widgets.VBox([popup_vars], layout=Layout(padding='0px 25px 0px 0px')),
widgets.VBox([size_var, color_var_numerical, color_var_categorial])]),
basemap])
accordion.set_title(0, 'Temporal Subsetting')
accordion.set_title(1, 'ID Subsetting')
accordion.set_title(2, 'Visualisation Settings')
accordion.set_title(3, 'Basemap Settings')
widgets.VBox([accordion, out_info, map_I, map_II])
# reset warnings to default
warnings.filterwarnings("default")
Briefly summarised: in the following, the prediction problem is treated as a regression of three variables - the distance travelled and the sine and cosine components of the bearing angle - which are finally recombined to specify the next position in latitude/longitude.
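To illustrate the encoding, a bearing can be mapped to its sine and cosine components and recovered via arctan2, and a predicted distance plus bearing can be turned back into a coordinate with a direct geodesic calculation - a minimal sketch with hypothetical values:
# minimal sketch with hypothetical values: encode a bearing as sin/cos, recombine it,
# then step from a position by a predicted distance along the predicted bearing
import numpy as np
from geographiclib.geodesic import Geodesic
bearing_deg = 350.0                                  # example bearing close to the 360/0 degree wrap-around
sin_b, cos_b = np.sin(np.radians(bearing_deg)), np.cos(np.radians(bearing_deg))
bearing_back = np.rad2deg(np.arctan2(sin_b, cos_b))  # -10.0, i.e. 350 degrees modulo 360
lat, lon, dist_km = 25.0, -60.0, 120.0               # hypothetical position & predicted 6h distance
step = Geodesic.WGS84.Direct(lat, lon, bearing_back, dist_km * 1000)
next_lat_lon = (round(step["lat2"], 2), round(step["lon2"], 2))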
The following features are created by segmenting the hurricane tracks into smaller sections:
# feature engineering part I
hurricanes_grouped = hurricanes.reset_index(level=0).groupby("ID")
features = []
for name, group in hurricanes_grouped:
group["lat_lon"] = list(zip(group["Latitude"], group["Longitude"]))
geom_features = list(map(lambda coord_1, coord_2:
Geodesic.WGS84.Inverse(coord_1[0], coord_1[1], coord_2[0], coord_2[1]),
group["lat_lon"].shift().iloc[1:], group["lat_lon"].iloc[1:]))
# distance calculations
distances = [x["s12"]/1000 for x in geom_features]
dist_vars = {}
# next distance as the target variable to be predicted later on
dist_vars["Next Distance"] = [*[x["s12"]/1000 for x in geom_features], np.nan]
# previous distances as explanatory vars
for time_step in range(1,5):
leading_nans = time_step*[np.nan]
shifted_dists = [x["s12"]/1000 for x in geom_features][:(len(geom_features)-(time_step-1))]
dist_vars[f"Prev. Distance {(time_step-1)*6}-{time_step*6}h"] = [*leading_nans, *shifted_dists]
### --- equivalent to, but faster than --- ###
# for time_step in range(1,5):
# shifted_dists = pd.DataFrame(dist_vars["Next Distance"]).transform(lambda x: x.shift(time_step))
# dist_vars[f"Prev. Distance {(time_step-1)*6}-{time_step*6}h"] = shifted_dists.loc[:,0].tolist()
### --- equivalent to, but faster than --- ###
# distance variability as global feature to characterise whole path so far
std = pd.DataFrame(distances).rolling(len(distances), min_periods=2).std()
mean = pd.DataFrame(distances).rolling(len(distances), min_periods=2).mean()
dist_vars["Distance Varcoeff"] = [np.nan, *(std/mean).iloc[:,0].tolist()]
# bearing calculations
bearings = [x["azi1"] for x in geom_features]
for idx, dist in enumerate(distances):
if dist == 0:
bearings[idx] = np.nan
bearings = pd.Series(bearings).interpolate("nearest").tolist()
bearing_vars = {}
# next bearing as the target variable to be predicted later on
bearing_vars["Next Bearing"] = [*bearings, np.nan]
bearing_vars["Next Bearing cos"] = np.cos(np.radians([*bearings, np.nan]))
bearing_vars["Next Bearing sin"] = np.sin(np.radians([*bearings, np.nan]))
# previous bearings as explanatory vars
for time_step in range(1,5):
leading_nans = time_step*[np.nan]
shifted_bearings = bearings[:(len(geom_features)-(time_step-1))]
bearing_vars[f"Prev. Bearing {(time_step-1)*6}-{time_step*6}h"] = [*leading_nans, *shifted_bearings]
bearing_vars[f"Prev. Bearing {(time_step-1)*6}-{time_step*6}h cos"] = np.cos(np.radians(bearing_vars[f"Prev. Bearing {(time_step-1)*6}-{time_step*6}h"]))
bearing_vars[f"Prev. Bearing {(time_step-1)*6}-{time_step*6}h sin"] = np.sin(np.radians(bearing_vars[f"Prev. Bearing {(time_step-1)*6}-{time_step*6}h"]))
# bearing variability as global feature to characterise whole path so far
angle_diffs = pd.DataFrame(bearing_vars["Prev. Bearing 0-6h"]).rolling(2).apply(lambda x: ((x.iloc[1]- x.iloc[0]) + 180) % 360 - 180)
bearing_vars["Bearing Abs Std"] = angle_diffs.abs().rolling(len(angle_diffs), min_periods=2).std().iloc[:,0].tolist()
# append features for each timestep to overall df
for idx, _ in enumerate(group["ID"]):
single_features = {}
# note that operations on groups via loop require index tracking
# otherwise the later merging will not match both dataframes appropriately
single_features["index"] = group["index"].iloc[idx]
single_features["Timestep"] = idx
for key, item in dist_vars.items():
single_features[key] = item[idx]
for key, item in bearing_vars.items():
single_features[key] = item[idx]
features.append(single_features)
hurricanes_features = pd.merge(hurricanes.reset_index(level=0), pd.DataFrame(features), on="index")
# feature engineering part II
def continent_dist(idx, lat, lon, bearing):
# side note on imports:
# on Linux, when you start a child process, it is forked,
# i.e. the child process inherits the memory state of the parent process
# on Windows, however, processes are spawned, i.e. a new interpreter starts and the code reruns
from geographiclib.geodesic import Geodesic
from shapely.geometry import Point
import geopandas as gpd
# definition of continent geometry
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# create continent multipolygon
continents_geoms = world.dissolve().iloc[0].geometry
# procedure to determine distances to continents
dist = 0
single_point = {}
single_point["index"] = idx
while dist <= 500:
point_geom = Geodesic.WGS84.Direct(lat, lon, bearing, 1000*dist)
if Point(point_geom["lon2"], point_geom["lat2"]).within(continents_geoms):
single_point["Continent Distance"] = dist
break
elif dist == 500:
single_point["Continent Distance"] = 500
break
else:
dist += 25
return single_point
# multiprocessing.Pool got stuck indefinitely, hence the multiprocess package is used, which works with Jupyter notebooks
with multiprocess.Pool(processes=os.cpu_count()) as pool:
params = zip(hurricanes_features.dropna(subset=["Next Bearing"])["index"],
hurricanes_features.dropna(subset=["Next Bearing"])["Latitude"],
hurricanes_features.dropna(subset=["Next Bearing"])["Longitude"],
hurricanes_features.dropna(subset=["Next Bearing"])["Next Bearing"])
dist_to_continent = pool.starmap(continent_dist, params)
hurricanes_features = pd.merge(hurricanes_features, pd.DataFrame(dist_to_continent), on="index")
hurricanes_features.dropna(inplace=True)
# display feature df so far
hurricanes_features
hurricanes_features.hist("Next Distance", bins=25)
plt.title('')
plt.xlabel('distance travelled in 6h (in km)')
plt.ylabel('frequency');
# check potentially suspiciously large distances
hurricanes_features.sort_values("Next Distance", ascending=False).head(5)
# select features & omit useless ones
drop_cols = ['index',
'ID',
'Name',
'Datetime',
'Status',
'Time interval',
'Next Bearing',
'Prev. Bearing 0-6h',
'Prev. Bearing 6-12h',
'Prev. Bearing 12-18h',
'Prev. Bearing 18-24h']
hurricane_features_selected = hurricanes_features.drop(columns=drop_cols)
# plot correlation heatmap
sns.heatmap(hurricane_features_selected.corr(), cmap="RdBu_r", vmin=-1, vmax=1, annot=True, fmt=".2f", annot_kws={"size": 7});
The following patterns can be seen in the correlation matrix:
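To pin these patterns down numerically, the strongest pairwise correlations can also be ranked explicitly (a quick sketch based on the hurricane_features_selected frame defined above):
# sketch: rank the strongest pairwise correlations among the selected features
corr = hurricane_features_selected.corr()
corr_pairs = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
corr_pairs.reindex(corr_pairs.abs().sort_values(ascending=False).index).head(10)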
plt.scatter(hurricanes_features["Distance Varcoeff"],
abs(hurricanes_features["Next Distance"] - hurricanes_features["Prev. Distance 0-6h"]),
alpha = .2)
plt.xlabel('variation coefficient of distance')
plt.ylabel('difference between distances in two subsequent intervals');
hurricanes_grouped = hurricanes_features.groupby("ID")
_hurricanes = hurricanes_features
# calculate distance for next 6 hours as variable to be predicted
_hurricanes["Distance Shifted"] = hurricanes_grouped["Next Distance"].transform(lambda x: x.shift(-1))
_hurricanes = _hurricanes.dropna(subset=["Distance Shifted"])
# transform continuous explanatory variables into categorical variables for boxplot visualisation
_hurricanes.loc[:,"Continent Distance cat"] = pd.cut(_hurricanes.loc[:,"Continent Distance"],
bins=[-1, 5, 150, 550],
labels = ['continental', 'coastal', 'maritime'])
_hurricanes.loc[:,"Latitude cat"] = pd.cut(_hurricanes.loc[:,"Latitude"],
bins=[0, 23.5, 35, 90],
labels = ["tropical", "subtropical", "midlatitude"])
_hurricanes.loc[:,"Distance Latitude cat"] = _hurricanes.loc[:,"Continent Distance cat"].astype(str) + " " + _hurricanes["Latitude cat"].astype(str)
categories = {'continental tropical': "#003366",
'coastal tropical': "#003399",
'maritime tropical': "#336699",
'continental subtropical': "#006633",
'coastal subtropical': "#336633",
'maritime subtropical': "#339933",
'continental midlatitude': "#663300",
'coastal midlatitude': "#993300",
'maritime midlatitude': "#996600"}
boxplot = sns.boxplot(x = _hurricanes.loc[:,"Distance Latitude cat"],
y = _hurricanes.loc[:,"Distance Shifted"],
palette = categories,
order = list(categories.keys()))
group_sizes = _hurricanes.groupby(['Distance Latitude cat']).size().reindex(list(categories.keys()))
medians = (_hurricanes.groupby(['Distance Latitude cat'])['Distance Shifted']
.median()
.round(1)
.reindex(list(categories.keys())))
vertical_offset = _hurricanes['Distance Shifted'].median()* 0.075
for xtick in boxplot.get_xticks():
boxplot.text(xtick,
medians[xtick] + vertical_offset,
medians[xtick],
horizontalalignment='center',
size='small',
color='w',
weight='semibold')
boxplot.text(xtick,
925,
f"n\n{group_sizes[xtick]}",
horizontalalignment='center',
size='small')
boxplot.set_xticklabels(boxplot.get_xticklabels(),rotation=20)
boxplot.set_xlabel(None)
boxplot.set_ylabel("Distance (km) travelled within 6 hours")
boxplot.set_ylim(0,1000);
# defining final feature sets
feature_sets_dist = {}
feature_sets_dist["rudimentary"] = ['Prev. Distance 0-6h']
feature_sets_dist["basic"] = ['Prev. Distance 0-6h',
'Prev. Distance 6-12h',
'Prev. Distance 12-18h',
'Prev. Distance 18-24h']
feature_sets_dist["standard"] = ['Latitude',
'Prev. Distance 0-6h',
'Prev. Distance 6-12h',
'Prev. Distance 12-18h',
'Prev. Distance 18-24h']
feature_sets_dist["enlarged"] = ['Year',
'Month',
'Latitude',
'Longitude',
'Maximum Wind',
'Timestep',
'Prev. Distance 0-6h',
'Prev. Distance 6-12h',
'Prev. Distance 12-18h',
'Prev. Distance 18-24h',
'Distance Varcoeff']
feature_sets_bearing = {}
feature_sets_bearing["sin"] = ['Year',
'Month',
'Latitude',
'Longitude',
'Maximum Wind',
'Timestep',
'Prev. Bearing 0-6h sin',
'Prev. Bearing 6-12h sin',
'Prev. Bearing 12-18h sin',
'Prev. Bearing 18-24h sin',
'Continent Distance',
'Bearing Abs Std']
feature_sets_bearing["cos"] = ['Year',
'Month',
'Latitude',
'Longitude',
'Maximum Wind',
'Timestep',
'Prev. Bearing 0-6h cos',
'Prev. Bearing 6-12h cos',
'Prev. Bearing 12-18h cos',
'Prev. Bearing 18-24h cos',
'Continent Distance',
'Bearing Abs Std']
class ml_models:
# part I - setup procedures
def __init__(self, nprocs=os.cpu_count()):
self.data = hurricanes_features
self.nprocs = nprocs
self.models_res = {}
def predicted_var(self, var, unit):
self.y = self.data[var]
self.unit = unit
def explanatory_vars(self, vars):
self.X = self.data[vars]
def create_test_train_data(self, seed=42):
self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(self.X,
self.y,
test_size=0.3,
random_state=seed)
# part II - hyperparameter tuning
def tune_rf(self):
print("--- Hyperparameter tuning for rf regressor ---")
# create a set of hyperparameter combinations
random_grid = {'n_estimators': [int(x) for x in np.arange(10, 100, 10)],
'max_features': ['auto', 'sqrt'],
'max_depth': [int(x) for x in np.arange(5, 100, 5)],
'min_samples_split': [2, 5, 10, 20]}
n_settings = np.array([len(random_grid[key]) for key in random_grid.keys()]).prod()
print(f"Hyperparameter grid with total of {n_settings} settings created")
# search best parameter combination by testing randomly selected combinations from the grid
# using cross validation with 5 folds in order to account for overfitting & average model performance
rf_random = RandomizedSearchCV(estimator = RandomForestRegressor(random_state = 42),
param_distributions = random_grid,
n_iter = 200,
verbose = 1,
n_jobs = self.nprocs,
random_state = 42)
_ = rf_random.fit(self.X_train, self.y_train)
self.rf_best_params = rf_random.best_params_
print("Best parameter setting:")
print(f"{self.rf_best_params}\n")
def tune_svm(self):
print("--- Hyperparameter tuning for svm regressor ---")
# create a set of hyperparameter combinations
random_grid = {'kernel': ["linear", "poly", "rbf", "sigmoid"],
'C': [round(x,2) for x in np.arange(0.1, 2, 0.1)],
'epsilon': [round(x,2) for x in np.arange(0.05, 1, 0.05)]}
n_settings = np.array([len(random_grid[key]) for key in random_grid.keys()]).prod()
print(f"Hyperparameter grid with total of {n_settings} settings created")
# create standardised input data
X = StandardScaler().fit_transform(self.X)
X_train, X_test, y_train, y_test = train_test_split(X, self.y, test_size=0.3, random_state=42)
# search best parameter combination by testing randomly selected combinations from the grid
# using cross validation with 5 folds in order to account for overfitting & average model performance
svm_random = RandomizedSearchCV(estimator = SVR(),
param_distributions = random_grid,
n_iter = 200,
verbose = 1,
n_jobs = self.nprocs,
random_state = 42)
_ = svm_random.fit(X_train, y_train)
self.svm_best_params = svm_random.best_params_
print("Best parameter setting:")
print(f"{self.svm_best_params}\n")
def tune_mlp(self):
print("--- Hyperparameter tuning for mlp regressor ---")
# create a set of hyperparameter combinations
# create a variety of one & two hidden layer architectures
random_grid = {'hidden_layer_sizes': [*[(x,) for x in np.arange(1,10,1)],
*[(x,) for x in np.arange(10,205,10)],
*[(x,y) for x in np.arange(1,10,1) for y in np.arange(1,6)],
*[(x,y) for x in np.arange(10,205,10) for y in np.arange(5,20,5)]],
'activation': ['tanh', 'relu']}
n_settings = np.array([len(random_grid[key]) for key in random_grid.keys()]).prod()
print(f"Hyperparameter grid with total of {n_settings} settings created")
# create standardised input data
X = StandardScaler().fit_transform(self.X)
X_train, X_test, y_train, y_test = train_test_split(X, self.y, test_size = 0.3, random_state = 42)
# search best parameter combination by testing randomly selected combinations from the grid
# using cross validation with 5 folds in order to account for overfitting & average model performance
default_mlp_params = {"max_iter":2000, "early_stopping":True, "random_state":42}
mlp_random = RandomizedSearchCV(estimator = MLPRegressor(**default_mlp_params),
param_distributions = random_grid,
n_iter = 100,
verbose = 1,
n_jobs = self.nprocs,
random_state = 42)
_ = mlp_random.fit(X_train, y_train)
self.mlp_best_params = mlp_random.best_params_
mlp_res = pd.DataFrame({"means": mlp_random.cv_results_['mean_test_score'],
"stds": mlp_random.cv_results_['std_test_score'],
"params": mlp_random.cv_results_['params']}).nlargest(5, "means")
print("Highest accuracy scores:")
for mean, std, params in zip(mlp_res["means"], mlp_res["stds"], mlp_res["params"]):
print("%0.3f (+/-%0.03f) for %r" % (mean, std, params))
print("")
# part III - model definition & fitting
# linear regression model as baseline
def linear_reg(self, verbose=True):
self.mod_reg = LinearRegression().fit(self.X_train, self.y_train)
self.models_res["lin_reg"] = self.mod_reg.predict(self.X_test)
if verbose:
print("--- Results Lin. Reg. ---")
self._accuracy_info(self.models_res["lin_reg"])
# support vector machine model
# include standardisation first, as SVM is not scale-invariant
def svm(self, verbose=True, **params):
self.mod_svm = make_pipeline(StandardScaler(), SVR(**params)).fit(self.X_train, self.y_train)
self.models_res["svm"] = self.mod_svm.predict(self.X_test)
if verbose:
print("--- Results SVM ---")
self._accuracy_info(self.models_res["svm"])
# random forest model
def rf(self, verbose=True, seed=42, **params):
self.mod_rf = RandomForestRegressor(random_state = seed, **params).fit(self.X_train, self.y_train)
self.models_res["rf"] = self.mod_rf.predict(self.X_test)
if verbose:
print("--- Results RF ---")
self._accuracy_info(self.models_res["rf"])
# multilayer perceptron model
def mlp(self, verbose=True, seed=42, **params):
default_mlp_params = {"max_iter":2000, "verbose":False, "early_stopping":True}
self.mod_mlp = make_pipeline(StandardScaler(), MLPRegressor(random_state = seed,
**default_mlp_params,
**params))
self.mod_mlp = self.mod_mlp.fit(self.X_train, self.y_train)
self.models_res["mlp"] = self.mod_mlp.predict(self.X_test)
if verbose:
print("--- Results MLP ---")
self._accuracy_info(self.models_res["mlp"])
# ensemble model taking the median of the predictions of all previously executed models
def ensemble(self, verbose=True):
self.models_res["ensemble"] = np.array(pd.DataFrame(self.models_res).median(axis=1))
if verbose:
print("--- Results Ensemble ---")
self._accuracy_info(self.models_res["ensemble"])
# part IV - systematic model evaluation
# reasons: randomness is involved in some algorithms as well as in the training/test split
# therefore, multiple runs are necessary to arrive at reliable performance estimates for the single ML algorithms
def systematic_ml_eval(self, n_repetitions=30):
def single_mod_run(model_set, seed):
# create & train single models
single_mod_set = model_set
single_mod_set._imports_for_spawned_processes()
single_mod_set.X, single_mod_set.y = self.X, self.y
single_mod_set.unit = self.unit
single_mod_set.create_test_train_data(seed)
single_mod_set.linear_reg(verbose=False)
single_mod_set.svm(verbose=False, **self.svm_best_params)
single_mod_set.rf(verbose=False, seed=seed, **self.rf_best_params)
single_mod_set.mlp(verbose=False, seed=seed, **self.mlp_best_params)
single_mod_set.ensemble(verbose=False)
# evaluate model results & return them
mod_set_res = {"rmse": {}, "y_pred_test": {}}
for mod in ["lin_reg", "svm", "rf", "mlp", "ensemble"]:
# rmse for single models
rmse = metrics.mean_squared_error(single_mod_set.y_test, single_mod_set.models_res[mod], squared=False)
mod_set_res["rmse"][f"rmse_{mod}"] = rmse.round(4)
# predicted y data for single models
mod_set_res["y_pred_test"] = single_mod_set.models_res
mod_set_res["y_pred_test"]["idx"] = np.array(single_mod_set.y_test.index)
return mod_set_res
# perform model runs in parallel using multiple processes
seeds = np.arange(0, n_repetitions)
with multiprocess.Pool(processes=self.nprocs) as pool:
model_nset_res = pool.starmap(single_mod_run, zip(n_repetitions*[ml_models()], seeds))
# format results of n model sets
self.model_nset_acc = [item["rmse"] for item in model_nset_res]
self.model_nset_y_pred = {}
y_pred_nsets = [item["y_pred_test"] for item in model_nset_res]
for key in y_pred_nsets[0].keys():
self.model_nset_y_pred[key] = np.concatenate([data[key] for data in y_pred_nsets])
# part V - helper functions
# accuracy assessment for continuous variables
def _accuracy_info(self, y_self):
print(f"Accuracy (coeff. of determination): {metrics.r2_score(self.y_test, y_self).round(2)}")
print(f"RMSE: {metrics.mean_squared_error(self.y_test, y_self, squared=False).round(2)} {self.unit}\n")
# spawned processes imports
def _imports_for_spawned_processes(self):
global np
global pd
global train_test_split
global RandomForestRegressor
global LinearRegression
global metrics
global tree
global SVR
global make_pipeline
global StandardScaler
global MLPRegressor
global RandomizedSearchCV
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn import tree
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import RandomizedSearchCV
# model runs - distance
for key, item in feature_sets_dist.items():
# general info
print(f"--- Distance models: version '{key}' ---")
# model setup
m_set = ml_models()
m_set.predicted_var("Next Distance", unit="km")
m_set.explanatory_vars(item)
m_set.create_test_train_data()
# hyperparameter tuning
m_set.tune_svm()
m_set.tune_rf()
m_set.tune_mlp()
# train & test 30 models
m_set.systematic_ml_eval()
# transfer model set to meaningful name
exec(f"models_distance_{key} = m_set")
# evaluation
mod_performance = pd.DataFrame(m_set.model_nset_acc).mean().sort_values()
print("Mean RMSE:")
print(mod_performance)
print("")
The model runs for the prediction of the distances show the following results:
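For a compact comparison, the mean RMSE values of the four distance feature sets can be collected into a single table (a sketch, assuming the model-set objects created by the loop above, models_distance_rudimentary to models_distance_enlarged, are available in the global namespace):
# sketch: collect the mean RMSE of the four distance feature sets into one comparison table
pd.DataFrame({key: pd.DataFrame(globals()[f"models_distance_{key}"].model_nset_acc).mean()
for key in feature_sets_dist})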
# model runs - bearing
for key, item in feature_sets_bearing.items():
# general info
print(f"--- Bearing models: {key} part ---")
# model setup
m_set = ml_models()
if key == "sin":
m_set.predicted_var("Next Bearing sin", unit="")
if key == "cos":
m_set.predicted_var("Next Bearing cos", unit="")
m_set.explanatory_vars(item)
m_set.create_test_train_data()
# hyperparameter tuning
m_set.tune_svm()
m_set.tune_rf()
m_set.tune_mlp()
# train & test 30 models
m_set.systematic_ml_eval()
# transfer model set to meaningful name
exec(f"models_bearing_{key} = m_set")
# evaluation
mod_performance = pd.DataFrame(m_set.model_nset_acc).mean().sort_values()
print("Mean RMSE:")
print(mod_performance)
print("")
pd.DataFrame(models_bearing_cos.model_nset_acc).nlargest(10, "rmse_mlp")
# reduce hurricane feature df to arrive at a df containing the columns important for evaluation
hurricanes_eval = hurricanes_features.drop(columns="index").reset_index()
hurricanes_eval = hurricanes_eval[["index", "ID", "Latitude", "Longitude", "Next Distance", "Next Bearing"]]
# transform lat/lon vars
lat_lon_vars = []
hurricanes_grouped = hurricanes_eval.groupby("ID")
for name, group in hurricanes_grouped:
# get non-shifted and shifted lat_lon coordinates as tuples
group["lat_lon"] = list(zip(group["Latitude"], group["Longitude"]))
group["next_lat_lon"] = group["lat_lon"].shift(-1)
# append lat_lon for each group to overall list
for idx, _ in enumerate(group["ID"]):
group_vars = {}
group_vars["index"] = group["index"].iloc[idx]
group_vars["Lat Lon"] = group["lat_lon"].iloc[idx]
group_vars["Next Lat Lon"] = group["next_lat_lon"].iloc[idx]
lat_lon_vars.append(group_vars)
hurricanes_eval = pd.merge(hurricanes_eval, pd.DataFrame(lat_lon_vars), on="index")
hurricanes_eval.drop(columns = ["Latitude", "Longitude"], inplace=True)
hurricanes_eval = hurricanes_eval.reindex(columns = ['index',
'ID',
'Lat Lon',
'Next Lat Lon',
'Next Distance',
'Next Bearing'])
# get predicted dist & bearing values from model sets
predictions = pd.DataFrame({
"idx": models_distance_standard.model_nset_y_pred["idx"],
"Dist Pred": models_distance_standard.model_nset_y_pred["ensemble"],
"sin_pred": models_bearing_sin.model_nset_y_pred["ensemble"],
"cos_pred": models_bearing_cos.model_nset_y_pred["ensemble"]
})
# re-transform sin & cos values into directions specified in degrees
predictions["Bearing Pred"] = np.rad2deg(np.arctan2(predictions["sin_pred"], predictions["cos_pred"]))
predictions.drop(columns = ["sin_pred", "cos_pred"], inplace=True)
# merge predictions df with df containing actually measured lat_lon changes
hurricanes_eval = pd.merge(hurricanes_eval, predictions, left_on = "index", right_on = "idx")
hurricanes_eval.drop(columns=["idx"], inplace=True)
hurricanes_eval.dropna(inplace=True)
hurricanes_eval
# calculate estimate for next lat lon coordinate pair based on predicted distance & bearing
geom_features = list(map(lambda lat_lon, bearing, dist: Geodesic.WGS84.Direct(lat_lon[0], lat_lon[1], bearing, dist),
hurricanes_eval["Lat Lon"],
hurricanes_eval["Bearing Pred"],
hurricanes_eval["Dist Pred"] * 1000))
hurricanes_eval["Pred Lat Lon"] = list(map(lambda geom: (round(geom["lat2"], 2), round(geom["lon2"], 2)),
geom_features))
geom_features = list(map(lambda coord_1, coord_2:
Geodesic.WGS84.Inverse(coord_1[0], coord_1[1], coord_2[0], coord_2[1]),
hurricanes_eval["Next Lat Lon"],
hurricanes_eval["Pred Lat Lon"]))
hurricanes_eval["Distance Diff Coordinates"] = list(map(lambda geom: geom["s12"]/1000, geom_features))
rmse_coord = np.sqrt((hurricanes_eval["Distance Diff Coordinates"]**2).mean())
print(f"Average track error (RMSE) for the 6h forecast: {round(rmse_coord,2)}km")
Three main learnings can be drawn from the analyses above:
Two ideas would be interesting to analyse as a complement to what has been done:
Alemany, S.; Beltran, J.; Perez, A.; Ganzfried, S. (2019): Predicting Hurricane Trajectories Using a Recurrent Neural Network.
Cangialosi, J. P. (2021): The 2020 Atlantic Hurricane Season - Forecast Verification Report.
Chen, Rui; Zhang, Weimin; Wang, Xiang (2020): Machine Learning in Tropical Cyclone Forecast Modeling: A Review. In Atmosphere 11 (7), p. 676.
NOAA (2021a): Climate Prediction Center - Atlantic Hurricane Outlook. Available online at https://www.cpc.ncep.noaa.gov/products/outlooks/hurricane.shtml, updated on 11/2/2022, checked on 11/2/2022.
NOAA (2021b): Hurricane FAQ – NOAA's Atlantic Oceanographic and Meteorological Laboratory. Available online at https://www.aoml.noaa.gov/hrd-faq/#what-is-a-hurricane, updated on 11/2/2022, checked on 11/2/2022.
StatsExchange (2016): Discussion on Encoding of Angle Data. Available online at https://stats.stackexchange.com/questions/218407/encoding-angle-data-for-neural-network.
World Meteorological Organization (2017): Global Guide to Tropical Cyclone Forecasting. Geneva.