Next day rain prediction¶
Credit: AITS Cainvas Community
Predict next day rain in Australia using weather data.
Predicting weather requires keen observation skills and knowledge of weather patterns. With trained deep learning models, we can identify patterns in the data to make predictions for the coming days.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, models, optimizers, losses, callbacks
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
import matplotlib.pyplot as plt
import random
The dataset¶
On Kaggle by Joe Young and Adam Young
Observations were drawn from numerous weather stations. The daily observations are available from http://www.bom.gov.au/climate/data. An example of the latest weather observations in Canberra: http://www.bom.gov.au/climate/dwo/IDCJDW2801.latest.shtml. Definitions adapted from http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml. Data source: http://www.bom.gov.au/climate/dwo/ and http://www.bom.gov.au/climate/data. Copyright Commonwealth of Australia 2010, Bureau of Meteorology.
The dataset is a CSV file with about 10 years of daily weather observations from many locations across Australia. The features describe weather-related information for a given day, and RainTomorrow is the target attribute.
df = pd.read_csv('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/weatherAUS.csv')
df
df.isna().sum()
Too many NaN values! One option would be to fill them, but here we drop them: there are too many, and filling them could taint the dataset.
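For reference, a minimal sketch of the filling alternative, assuming median imputation for numeric columns and mode imputation for categorical ones (df_filled is a hypothetical name, and this path is not used further in the notebook):
# Hedged alternative (not used here): fill the gaps instead of dropping.
# Numeric columns take their median, object (categorical) columns their mode.
df_filled = df.copy()
for col in df_filled.columns:
    if df_filled[col].dtype == 'object':
        df_filled[col] = df_filled[col].fillna(df_filled[col].mode()[0])
    else:
        df_filled[col] = df_filled[col].fillna(df_filled[col].median())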
df = df.dropna()
df
Input attributes¶
df.dtypes
One-hot encoding the 4 columns - Location, WindGustDir, WindDir9am, WindDir3pm - as their values have no ordinal relationship (see the toy illustration below).
The 4 original columns are then removed, as they won't be needed anymore.
The RainToday column values can be derived from the Rainfall column.
The Date column is not necessary here either.
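A quick toy illustration of what get_dummies with drop_first = True produces (the values below are made up, not drawn from the dataset):
# Toy example of one hot encoding with drop_first (illustrative only)
toy = pd.DataFrame({'WindDir9am': ['N', 'S', 'E', 'N']})
pd.get_dummies(toy['WindDir9am'], prefix = 'WindDir9am', drop_first = True)
# categories sort as E, N, S; E becomes the all-zero baseline,
# leaving the indicator columns WindDir9am_N and WindDir9am_S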
dummy_columns = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm']

for column in dummy_columns:
    # one hot encode the column, dropping the first level to avoid redundancy
    dummy_loc = pd.get_dummies(df[column], prefix = column, drop_first = True)
    for c in dummy_loc.columns:
        df[c] = dummy_loc[c]
    del dummy_loc

df = df.drop(columns = dummy_columns)
df = df.drop(columns = ['RainToday', 'Date'])
df
Changing the datatype of the RainTomorrow column.
df['RainTomorrow'] = (df['RainTomorrow'] == 'Yes').astype('int64')
Balancing the dataset¶
df['RainTomorrow'].value_counts()
df1 = df[df['RainTomorrow'] == 1]
df0 = df[df['RainTomorrow'] == 0]
print("Number of samples in:")
print("Class label 1 - ", len(df1))
print("Class label 0 - ", len(df0))
# Upsampling
df1 = df1.sample(20000, replace = True) # replace = True enables sampling with replacement
print('\nAfter resampling - ')
print("Number of samples in:")
print("Class label 1 - ", len(df1))
print("Class label 0 - ", len(df0))
df = pd.concat([df1, df0]) # DataFrame.append was removed in pandas 2.0
Upsampling class 1 to match class 0 exactly would introduce ~30k duplicated rows. Restricting it to 20k samples keeps the classes roughly balanced while adding only ~8k duplicate rows.
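As an aside, a common alternative to upsampling is to leave the data imbalanced and weight the loss per class instead; a minimal sketch, assuming the usual balanced-weight formula (balanced_class_weights is a hypothetical helper, not part of this notebook's pipeline):
# Sketch of the class weight alternative (not applied in this notebook).
# n0 and n1 stand for the original, pre-resampling class counts.
def balanced_class_weights(n0, n1):
    total = n0 + n1
    return {0: total / (2 * n0), 1: total / (2 * n1)}
# usage: model.fit(..., class_weight = balanced_class_weights(n0, n1))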
# Splitting into train, val and test set -- 80-10-10 split
# First, an 80-20 split
train_df, val_test_df = train_test_split(df, test_size = 0.2, random_state = 113)
# Then split the 20% into half
val_df, test_df = train_test_split(val_test_df, test_size = 0.5, random_state = 113)
len(train_df), len(val_df), len(test_df)
# input and output column lists
ic = df.columns.tolist()
ic.remove('RainTomorrow')
oc = ['RainTomorrow']
ytrain = train_df[oc]
Xtrain = train_df.drop(columns = oc)
yval = val_df[oc]
Xval = val_df.drop(columns = oc)
ytest = test_df[oc]
Xtest = test_df.drop(columns = oc)
Standardization¶
df.describe()
The attributes have very different standard deviations, which can cause some of them to be weighted more heavily than others. Standardizing the values avoids this.
ss = StandardScaler()
Xtrain = ss.fit_transform(Xtrain)
Xval = ss.transform(Xval)
Xtest = ss.transform(Xtest)
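As an optional sanity check, the scaled training features should now have roughly zero mean and unit standard deviation (the validation and test sets will only be approximately so, since they reuse the training statistics):
# each scaled training feature should have mean ~0 and std ~1
print(Xtrain.mean(axis = 0)[:5].round(2))
print(Xtrain.std(axis = 0)[:5].round(2))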
The model¶
model = models.Sequential([
layers.Dense(16, activation = 'relu', input_shape = Xtrain[0].shape),
layers.Dense(8, activation = 'relu'),
layers.Dense(1, activation = 'sigmoid')
])
cb = callbacks.EarlyStopping(patience = 5, restore_best_weights = True)
model.summary()
model.compile(optimizer = optimizers.Adam(0.01), loss = losses.BinaryCrossentropy(), metrics = ['accuracy'])
history = model.fit(Xtrain, ytrain, validation_data = (Xval, yval), epochs = 256, callbacks = [cb])
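Optionally, model.evaluate can score the held-out test set directly, returning the loss and the accuracy metric declared at compile time:
# loss and accuracy on the unseen test set
model.evaluate(Xtest, ytest)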
cm = confusion_matrix(ytest, (model.predict(Xtest) > 0.5).astype('int'))
cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis] # normalize each row (true label)
fig = plt.figure(figsize = (10, 10))
ax = fig.add_subplot(111)
for i in range(cm.shape[1]):
    for j in range(cm.shape[0]):
        if cm[i, j] > 0.8:
            clr = "white"
        else:
            clr = "black"
        ax.text(j, i, format(cm[i, j], '.2f'), horizontalalignment = "center", color = clr)
_ = ax.imshow(cm, cmap=plt.cm.Blues)
ax.set_xticks(range(2))
ax.set_yticks(range(2))
ax.set_xticklabels(['No', 'Yes'], rotation = 90)
ax.set_yticklabels(['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
A higher count of unique samples for the rain class (rather than upsampled duplicates) would help achieve higher accuracy.
The metrics¶
def plot(history, variable, variable2):
    plt.plot(range(len(history[variable])), history[variable])
    plt.plot(range(len(history[variable2])), history[variable2])
    plt.legend([variable, variable2])
    plt.title(variable)
    plt.show()
plot(history.history, "accuracy", 'val_accuracy')
plot(history.history, "loss", "val_loss")
Prediction¶
labels = ['No', 'Yes']
# pick a random sample from the test set
x = random.randint(0, len(Xtest) - 1)
output = model.predict(Xtest[x].reshape(1, -1))[0][0]
pred = (output>0.5).astype('int')
print("Predicted: ", labels[pred], "(", output, "-->", pred, ")")
print("True: ", labels[np.array(ytest)[x][0]])
deepC¶
model.save('rain_prediction.h5')
!deepCC rain_prediction.h5
x = random.randint(0, len(Xtest) - 1)
np.savetxt('sample.data', Xtest[x]) # xth sample into text file
# run exe with input
!rain_prediction_deepC/rain_prediction.exe sample.data
# show predicted output
nn_out = np.loadtxt('deepSea_result_1.out')
pred = (nn_out>0.5).astype('int')
print("Predicted: ", labels[pred], "(", nn_out, "-->", pred, ")")
print("True: ", labels[np.array(ytest)[x][0]])