Next day rain prediction¶
Credit: AITS Cainvas Community
Predict next day rain in Australia using weather data.
Predicting weather requires keen observation skills and knowledge of weather patterns. With trained deep learning models, we can identify patterns in the data to make predictions for the coming days.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, models, optimizers, losses, callbacks
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
import matplotlib.pyplot as plt
import random
The dataset¶
On Kaggle by Joe Young and Adam Young
Observations were drawn from numerous weather stations. The daily observations are available from http://www.bom.gov.au/climate/data. An example of the latest weather observations in Canberra: http://www.bom.gov.au/climate/dwo/IDCJDW2801.latest.shtml. Definitions adapted from http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml. Data source: http://www.bom.gov.au/climate/dwo/ and http://www.bom.gov.au/climate/data. Copyright Commonwealth of Australia 2010, Bureau of Meteorology.
The dataset is a CSV file with about 10 years of daily weather observations from many locations across Australia. The features describe weather-related information for a given day, and RainTomorrow is the target attribute.
df = pd.read_csv('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/weatherAUS.csv')
df
df.isna().sum()
Too many NaN values! One option would be to fill them, but here we drop them: there are too many, and filling them could taint the dataset.
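For reference, a minimal sketch of the filling alternative, assuming median imputation for numeric columns and mode imputation for categorical ones (df_filled is a hypothetical name, and this path is not used further in the notebook):
# Hedged alternative (not used here): fill the gaps instead of dropping.
# Numeric columns take their median, object (categorical) columns their mode.
df_filled = df.copy()
for col in df_filled.columns:
    if df_filled[col].dtype == 'object':
        df_filled[col] = df_filled[col].fillna(df_filled[col].mode()[0])
    else:
        df_filled[col] = df_filled[col].fillna(df_filled[col].median())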
df = df.dropna()
df
Input attributes¶
df.dtypes
One-hot encoding the 4 columns - Location, WindGustDir, WindDir9am, WindDir3pm - as their values have no ordinal relationship (see the toy illustration below).
The 4 original columns are then removed, as they won't be needed anymore.
The RainToday column values can be derived from the Rainfall column.
The Date column is not necessary here either.
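A quick toy illustration of what get_dummies with drop_first = True produces (the values below are made up, not drawn from the dataset):
# Toy example of one hot encoding with drop_first (illustrative only)
toy = pd.DataFrame({'WindDir9am': ['N', 'S', 'E', 'N']})
pd.get_dummies(toy['WindDir9am'], prefix = 'WindDir9am', drop_first = True)
# categories sort as E, N, S; E becomes the all-zero baseline,
# leaving the indicator columns WindDir9am_N and WindDir9am_S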
dummy_columns = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm']

for column in dummy_columns:
    # one hot encode the column, dropping the first level to avoid redundancy
    dummy_loc = pd.get_dummies(df[column], prefix = column, drop_first = True)
    for c in dummy_loc.columns:
        df[c] = dummy_loc[c]
    del dummy_loc

df = df.drop(columns = dummy_columns)
df = df.drop(columns = ['RainToday', 'Date'])
df
Changing the datatype of the RainTomorrow column.
df['RainTomorrow'] = (df['RainTomorrow'] == 'Yes').astype('int64')
Balancing the dataset¶
df['RainTomorrow'].value_counts()
df1 = df[df['RainTomorrow'] == 1]
df0 = df[df['RainTomorrow'] == 0]
print("Number of samples in:")
print("Class label 1 - ", len(df1))
print("Class label 0 - ", len(df0))
# Upsampling
df1 = df1.sample(20000, replace = True) # replace = True enables sampling with replacement
print('\nAfter resampling - ')
print("Number of samples in:")
print("Class label 1 - ", len(df1))
print("Class label 0 - ", len(df0))
df = pd.concat([df1, df0]) # DataFrame.append was removed in pandas 2.0
Upsampling class 1 to match class 0 exactly would introduce ~30k duplicated rows. Restricting it to 20k samples keeps the classes roughly balanced while adding only ~8k duplicate rows.
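As an aside, a common alternative to upsampling is to leave the data imbalanced and weight the loss per class instead; a minimal sketch, assuming the usual balanced-weight formula (balanced_class_weights is a hypothetical helper, not part of this notebook's pipeline):
# Sketch of the class weight alternative (not applied in this notebook).
# n0 and n1 stand for the original, pre-resampling class counts.
def balanced_class_weights(n0, n1):
    total = n0 + n1
    return {0: total / (2 * n0), 1: total / (2 * n1)}
# usage: model.fit(..., class_weight = balanced_class_weights(n0, n1))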
# Splitting into train, val and test set -- 80-10-10 split
# First, an 80-20 split
train_df, val_test_df = train_test_split(df, test_size = 0.2, random_state = 113)
# Then split the 20% into half
val_df, test_df = train_test_split(val_test_df, test_size = 0.5, random_state = 113)
len(train_df), len(val_df), len(test_df)
# input and output column lists
ic = df.columns.tolist()
ic.remove('RainTomorrow')
oc = ['RainTomorrow']
ytrain = train_df[oc]
Xtrain = train_df.drop(columns = oc)
yval = val_df[oc]
Xval = val_df.drop(columns = oc)
ytest = test_df[oc]
Xtest = test_df.drop(columns = oc)
Standardization¶
df.describe()
The attributes have very different standard deviations, which can cause some of them to be weighted more heavily than others. Standardizing the values avoids this.
ss = StandardScaler()
Xtrain = ss.fit_transform(Xtrain)
Xval = ss.transform(Xval)
Xtest = ss.transform(Xtest)
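As an optional sanity check, the scaled training features should now have roughly zero mean and unit standard deviation (the validation and test sets will only be approximately so, since they reuse the training statistics):
# each scaled training feature should have mean ~0 and std ~1
print(Xtrain.mean(axis = 0)[:5].round(2))
print(Xtrain.std(axis = 0)[:5].round(2))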
The model¶
model = models.Sequential([
layers.Dense(16, activation = 'relu', input_shape = Xtrain[0].shape),
layers.Dense(8, activation = 'relu'),
layers.Dense(1, activation = 'sigmoid')
])
cb = callbacks.EarlyStopping(patience = 5, restore_best_weights = True)
model.summary()
model.compile(optimizer = optimizers.Adam(0.01), loss = losses.BinaryCrossentropy(), metrics = ['accuracy'])
history = model.fit(Xtrain, ytrain, validation_data = (Xval, yval), epochs = 256, callbacks = [cb])
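Optionally, model.evaluate can score the held-out test set directly, returning the loss and the accuracy metric declared at compile time:
# loss and accuracy on the unseen test set
model.evaluate(Xtest, ytest)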
cm = confusion_matrix(ytest, (model.predict(Xtest) > 0.5).astype('int'))
cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis] # normalize each row (true label)
fig = plt.figure(figsize = (10, 10))
ax = fig.add_subplot(111)
for i in range(cm.shape[1]):
    for j in range(cm.shape[0]):
        if cm[i, j] > 0.8:
            clr = "white"
        else:
            clr = "black"
        ax.text(j, i, format(cm[i, j], '.2f'), horizontalalignment = "center", color = clr)
_ = ax.imshow(cm, cmap=plt.cm.Blues)
ax.set_xticks(range(2))
ax.set_yticks(range(2))
ax.set_xticklabels(['No', 'Yes'], rotation = 90)
ax.set_yticklabels(['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
A higher count of unique samples for the rain class (rather than upsampled duplicates) would help achieve higher accuracy.
The metrics¶
def plot(history, variable, variable2):
    plt.plot(range(len(history[variable])), history[variable])
    plt.plot(range(len(history[variable2])), history[variable2])
    plt.legend([variable, variable2])
    plt.title(variable)
    plt.show()
plot(history.history, "accuracy", 'val_accuracy')
plot(history.history, "loss", "val_loss")
Prediction¶
labels = ['No', 'Yes']
# pick a random sample from the test set
x = random.randint(0, len(Xtest) - 1)
output = model.predict(Xtest[x].reshape(1, -1))[0][0]
pred = (output>0.5).astype('int')
print("Predicted: ", labels[pred], "(", output, "-->", pred, ")")
print("True: ", labels[np.array(ytest)[x][0]])
deepC¶
model.save('rain_prediction.h5')
!deepCC rain_prediction.h5
x = random.randint(0, len(Xtest) - 1)
np.savetxt('sample.data', Xtest[x]) # xth sample into text file
# run exe with input
!rain_prediction_deepC/rain_prediction.exe sample.data
# show predicted output
nn_out = np.loadtxt('deepSea_result_1.out')
pred = (nn_out>0.5).astype('int')
print("Predicted: ", labels[pred], "(", nn_out, "-->", pred, ")")
print("True: ", labels[np.array(ytest)[x][0]])