Next day rain prediction

Credit: AITS Cainvas Community

Photo by GRAMM on Dribbble

Predict next day rain in Australia using weather data.

Predicting the weather requires keen observation and knowledge of weather patterns. Trained deep learning models can identify patterns in the data and use them to make predictions for the coming days.

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from keras import layers, models, optimizers, losses, callbacks
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
import matplotlib.pyplot as plt
import random

The dataset

On Kaggle by Joe Young and Adam Young

Observations were drawn from numerous weather stations. The daily observations are available from http://www.bom.gov.au/climate/data

An example of the latest weather observations in Canberra: http://www.bom.gov.au/climate/dwo/IDCJDW2801.latest.shtml

Definitions adapted from http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml

Data source: http://www.bom.gov.au/climate/dwo/ and http://www.bom.gov.au/climate/data

Copyright Commonwealth of Australia 2010, Bureau of Meteorology.

The dataset is a CSV file with about 10 years of daily weather observations from many locations across Australia. The features hold weather-related information for the given day, and RainTomorrow is the target attribute.

In [2]:
df = pd.read_csv('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/weatherAUS.csv')
df
Out[2]:
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
0 2008-12-01 Albury 13.4 22.9 0.6 NaN NaN W 44.0 W ... 71.0 22.0 1007.7 1007.1 8.0 NaN 16.9 21.8 No No
1 2008-12-02 Albury 7.4 25.1 0.0 NaN NaN WNW 44.0 NNW ... 44.0 25.0 1010.6 1007.8 NaN NaN 17.2 24.3 No No
2 2008-12-03 Albury 12.9 25.7 0.0 NaN NaN WSW 46.0 W ... 38.0 30.0 1007.6 1008.7 NaN 2.0 21.0 23.2 No No
3 2008-12-04 Albury 9.2 28.0 0.0 NaN NaN NE 24.0 SE ... 45.0 16.0 1017.6 1012.8 NaN NaN 18.1 26.5 No No
4 2008-12-05 Albury 17.5 32.3 1.0 NaN NaN W 41.0 ENE ... 82.0 33.0 1010.8 1006.0 7.0 8.0 17.8 29.7 No No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
145455 2017-06-21 Uluru 2.8 23.4 0.0 NaN NaN E 31.0 SE ... 51.0 24.0 1024.6 1020.3 NaN NaN 10.1 22.4 No No
145456 2017-06-22 Uluru 3.6 25.3 0.0 NaN NaN NNW 22.0 SE ... 56.0 21.0 1023.5 1019.1 NaN NaN 10.9 24.5 No No
145457 2017-06-23 Uluru 5.4 26.9 0.0 NaN NaN N 37.0 SE ... 53.0 24.0 1021.0 1016.8 NaN NaN 12.5 26.1 No No
145458 2017-06-24 Uluru 7.8 27.0 0.0 NaN NaN SE 28.0 SSE ... 51.0 24.0 1019.4 1016.5 3.0 2.0 15.1 26.0 No No
145459 2017-06-25 Uluru 14.9 NaN 0.0 NaN NaN NaN NaN ESE ... 62.0 36.0 1020.2 1017.9 8.0 8.0 15.0 20.9 No NaN

145460 rows × 23 columns

In [3]:
df.isna().sum()
Out[3]:
Date                 0
Location             0
MinTemp           1485
MaxTemp           1261
Rainfall          3261
Evaporation      62790
Sunshine         69835
WindGustDir      10326
WindGustSpeed    10263
WindDir9am       10566
WindDir3pm        4228
WindSpeed9am      1767
WindSpeed3pm      3062
Humidity9am       2654
Humidity3pm       4507
Pressure9am      15065
Pressure3pm      15028
Cloud9am         55888
Cloud3pm         59358
Temp9am           1767
Temp3pm           3609
RainToday         3261
RainTomorrow      3267
dtype: int64

There are too many NaN values! One option is to fill them in, but here we drop the affected rows instead: with this many missing values, filling them may distort the dataset.

In [4]:
df = df.dropna()

df
Out[4]:
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
6049 2009-01-01 Cobar 17.9 35.2 0.0 12.0 12.3 SSW 48.0 ENE ... 20.0 13.0 1006.3 1004.4 2.0 5.0 26.6 33.4 No No
6050 2009-01-02 Cobar 18.4 28.9 0.0 14.8 13.0 S 37.0 SSE ... 30.0 8.0 1012.9 1012.1 1.0 1.0 20.3 27.0 No No
6052 2009-01-04 Cobar 19.4 37.6 0.0 10.8 10.6 NNE 46.0 NNE ... 42.0 22.0 1012.3 1009.2 1.0 6.0 28.7 34.9 No No
6053 2009-01-05 Cobar 21.9 38.4 0.0 11.4 12.2 WNW 31.0 WNW ... 37.0 22.0 1012.7 1009.1 1.0 5.0 29.1 35.6 No No
6054 2009-01-06 Cobar 24.2 41.0 0.0 11.2 8.4 WNW 35.0 NW ... 19.0 15.0 1010.7 1007.4 1.0 6.0 33.6 37.6 No No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
142298 2017-06-20 Darwin 19.3 33.4 0.0 6.0 11.0 ENE 35.0 SE ... 63.0 32.0 1013.9 1010.5 0.0 1.0 24.5 32.3 No No
142299 2017-06-21 Darwin 21.2 32.6 0.0 7.6 8.6 E 37.0 SE ... 56.0 28.0 1014.6 1011.2 7.0 0.0 24.8 32.0 No No
142300 2017-06-22 Darwin 20.7 32.8 0.0 5.6 11.0 E 33.0 E ... 46.0 23.0 1015.3 1011.8 0.0 0.0 24.8 32.1 No No
142301 2017-06-23 Darwin 19.5 31.8 0.0 6.2 10.6 ESE 26.0 SE ... 62.0 58.0 1014.9 1010.7 1.0 1.0 24.8 29.2 No No
142302 2017-06-24 Darwin 20.2 31.7 0.0 5.6 10.7 ENE 30.0 ENE ... 73.0 32.0 1013.9 1009.7 6.0 5.0 25.4 31.0 No No

56420 rows × 23 columns
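
The trade-off between dropping and filling can be sketched on a toy frame. The values below are made up for illustration, and median imputation is shown only as one possible alternative; it is an assumption here, not something this notebook uses.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), not rows from weatherAUS.csv
toy = pd.DataFrame({
    'MinTemp':  [13.4, np.nan, 12.9, 9.2],
    'Sunshine': [np.nan, np.nan, 10.6, 12.2],
})

# Option used in this notebook: keep only fully observed rows
dropped = toy.dropna()

# Alternative (assumption): fill each column with its own median
filled = toy.fillna(toy.median())

print("rows before:", len(toy))
print("rows after dropna:", len(dropped))
print(filled)
```

On the real data, dropping keeps 56420 of 145460 rows, roughly 39% of the dataset, so the cost of this choice is a much smaller training set.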

Input attributes

In [5]:
df.dtypes
Out[5]:
Date              object
Location          object
MinTemp          float64
MaxTemp          float64
Rainfall         float64
Evaporation      float64
Sunshine         float64
WindGustDir       object
WindGustSpeed    float64
WindDir9am        object
WindDir3pm        object
WindSpeed9am     float64
WindSpeed3pm     float64
Humidity9am      float64
Humidity3pm      float64
Pressure9am      float64
Pressure3pm      float64
Cloud9am         float64
Cloud3pm         float64
Temp9am          float64
Temp3pm          float64
RainToday         object
RainTomorrow      object
dtype: object

One-hot encode the four categorical columns (Location, WindGustDir, WindDir9am, WindDir3pm), since their values have no natural ordering.

The four original columns are then removed, as the encoded columns replace them.

RainToday is dropped because its values can be derived from the Rainfall column.

The Date value is not needed here either.

In [6]:
dummy_columns = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm']

# Work on an explicit copy so the column assignments below
# do not raise SettingWithCopyWarning on the result of dropna()
df = df.copy()

for column in dummy_columns:
    # drop_first = True drops one category per column (it is implied by the others)
    dummy_loc = pd.get_dummies(df[column], prefix = column, drop_first = True)
    for c in dummy_loc.columns:
        df[c] = dummy_loc[c]

del dummy_loc

df = df.drop(columns = dummy_columns)

df = df.drop(columns = ['RainToday', 'Date'])

df
Out[6]:
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm ... WindDir3pm_NNW WindDir3pm_NW WindDir3pm_S WindDir3pm_SE WindDir3pm_SSE WindDir3pm_SSW WindDir3pm_SW WindDir3pm_W WindDir3pm_WNW WindDir3pm_WSW
6049 17.9 35.2 0.0 12.0 12.3 48.0 6.0 20.0 20.0 13.0 ... 0 0 0 0 0 0 1 0 0 0
6050 18.4 28.9 0.0 14.8 13.0 37.0 19.0 19.0 30.0 8.0 ... 0 0 0 0 1 0 0 0 0 0
6052 19.4 37.6 0.0 10.8 10.6 46.0 30.0 15.0 42.0 22.0 ... 1 0 0 0 0 0 0 0 0 0
6053 21.9 38.4 0.0 11.4 12.2 31.0 6.0 6.0 37.0 22.0 ... 0 0 0 0 0 0 0 0 0 1
6054 24.2 41.0 0.0 11.2 8.4 35.0 17.0 13.0 19.0 15.0 ... 0 0 0 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
142298 19.3 33.4 0.0 6.0 11.0 35.0 9.0 20.0 63.0 32.0 ... 0 0 0 0 0 0 0 0 0 0
142299 21.2 32.6 0.0 7.6 8.6 37.0 13.0 11.0 56.0 28.0 ... 0 0 0 1 0 0 0 0 0 0
142300 20.7 32.8 0.0 5.6 11.0 33.0 17.0 11.0 46.0 23.0 ... 0 0 0 0 0 0 0 1 0 0
142301 19.5 31.8 0.0 6.2 10.6 26.0 9.0 17.0 62.0 58.0 ... 1 0 0 0 0 0 0 0 0 0
142302 20.2 31.7 0.0 5.6 10.7 30.0 15.0 7.0 73.0 32.0 ... 1 0 0 0 0 0 0 0 0 0

56420 rows × 87 columns
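
As a small illustration of what the encoding step does, here is a minimal sketch on a toy column. The three wind directions are hypothetical sample values; with `drop_first = True`, a column with k categories becomes k - 1 indicator columns.

```python
import pandas as pd

# Toy column (hypothetical values)
toy = pd.DataFrame({'WindGustDir': ['N', 'S', 'E', 'N']})

# Categories are ordered alphabetically (E, N, S); drop_first removes 'E',
# so a row with all zeros stands for 'E'
dummies = pd.get_dummies(toy['WindGustDir'], prefix='WindGustDir', drop_first=True)

print(dummies.columns.tolist())
print(dummies.astype(int))
```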

Changing the datatype of the RainTomorrow column.

In [7]:
df['RainTomorrow'] = (df['RainTomorrow'] == 'Yes').astype('int64')

Balancing the dataset

In [8]:
df['RainTomorrow'].value_counts()
Out[8]:
0    43993
1    12427
Name: RainTomorrow, dtype: int64
In [9]:
df1 = df[df['RainTomorrow'] == 1]
df0 = df[df['RainTomorrow'] == 0]

print("Number of samples in:")
print("Class label 1 - ", len(df1))
print("Class label 0 - ", len(df0))

# Upsampling 

df1 = df1.sample(20000, replace = True)    # replace = True samples with replacement, so rows can repeat

print('\nAfter resampling - ')

print("Number of samples in:")
print("Class label 1 - ", len(df1))
print("Class label 0 - ", len(df0))

df = pd.concat([df1, df0])    # DataFrame.append was removed in pandas 2.0
Number of samples in:
Class label 1 -  12427
Class label 0 -  43993

After resampling - 
Number of samples in:
Class label 1 -  20000
Class label 0 -  43993

Upsampling class 1 until both classes are equal in count would add about 31.5k repeated rows (43,993 - 12,427). We therefore cap class 1 at 20k samples, which adds at least 20,000 - 12,427 = 7,573 repeated rows.
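
The arithmetic behind that choice, plus sampling with replacement on a toy frame, can be sketched as follows. The class counts come from the output above; the toy frame and its `random_state` are illustrative assumptions.

```python
import pandas as pd

# Class counts reported above
n_minority, n_majority = 12427, 43993
target = 20000

extra_if_balanced = n_majority - n_minority   # repeats needed for a 50/50 balance
extra_at_target = target - n_minority         # minimum repeats with the 20k cap

print(extra_if_balanced, extra_at_target)

# Sampling with replacement on a toy frame (hypothetical values):
# more rows are requested than exist, so some rows must repeat
toy = pd.DataFrame({'x': [1, 2, 3]})
up = toy.sample(5, replace=True, random_state=0)
print(len(up), sorted(up['x'].unique()))
```

The count is a lower bound because sampling with replacement does not guarantee that every original row is picked. Note also that the upsampling here happens before the train/validation/test split, so copies of one row can land in more than one split; resampling only the training portion is a common way to avoid that.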

In [10]:
# Splitting into train, val and test set -- 80-10-10 split

# First, an 80-20 split
train_df, val_test_df = train_test_split(df, test_size = 0.2, random_state = 113)

# Then split the 20% into half
val_df, test_df = train_test_split(val_test_df, test_size = 0.5, random_state = 113)

len(train_df), len(val_df), len(test_df)
Out[10]:
(51194, 6399, 6400)
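
The three sizes above follow from simple rounding. A minimal sketch, assuming the test share of each split is rounded up and the remainder goes to the other part (which is how scikit-learn treats a float `test_size`):

```python
import math

n_total = 63993                            # rows after upsampling

n_holdout = math.ceil(0.2 * n_total)       # 20% held out for val + test
n_train = n_total - n_holdout

n_test = math.ceil(0.5 * n_holdout)        # half of the held-out rows
n_val = n_holdout - n_test

print(n_train, n_val, n_test)
```
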
In [11]:
ic = df.columns.tolist()
ic.remove('RainTomorrow')

oc = ['RainTomorrow']

ytrain = train_df[oc]
Xtrain = train_df.drop(columns = oc)

yval = val_df[oc]
Xval = val_df.drop(columns = oc)

ytest = test_df[oc]
Xtest = test_df.drop(columns = oc)

Standardization

In [12]:
df.describe()
Out[12]:
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm ... WindDir3pm_NNW WindDir3pm_NW WindDir3pm_S WindDir3pm_SE WindDir3pm_SSE WindDir3pm_SSW WindDir3pm_SW WindDir3pm_W WindDir3pm_WNW WindDir3pm_WSW
count 63993.000000 63993.000000 63993.00000 63993.000000 63993.000000 63993.000000 63993.000000 63993.000000 63993.000000 63993.000000 ... 63993.000000 63993.000000 63993.000000 63993.000000 63993.000000 63993.000000 63993.000000 63993.000000 63993.000000 63993.000000
mean 13.586378 23.998598 2.49404 5.393101 7.357289 41.537012 15.786399 19.933274 67.018752 51.605894 ... 0.050740 0.050568 0.072039 0.071773 0.057819 0.061382 0.069554 0.071352 0.059303 0.067820
std 6.427010 6.985131 7.81636 3.643329 3.857860 13.754940 8.360753 8.592442 18.447099 20.727529 ... 0.219468 0.219116 0.258555 0.258114 0.233402 0.240031 0.254397 0.257413 0.236193 0.251439
min -6.700000 4.100000 0.00000 0.000000 0.000000 9.000000 2.000000 2.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 8.600000 18.500000 0.00000 2.800000 4.300000 31.000000 9.000000 13.000000 56.000000 37.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 13.300000 23.600000 0.00000 4.800000 8.200000 39.000000 15.000000 19.000000 68.000000 52.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 18.600000 29.500000 1.00000 7.400000 10.500000 50.000000 20.000000 26.000000 80.000000 65.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 31.400000 48.100000 206.20000 81.200000 14.500000 124.000000 67.000000 76.000000 100.000000 100.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 87 columns

The attributes have very different scales and standard deviations, which can cause some attributes to be weighted more heavily than others. Standardizing the values avoids this.

In [13]:
ss = StandardScaler()

Xtrain = ss.fit_transform(Xtrain)
Xval = ss.transform(Xval)
Xtest = ss.transform(Xtest)
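
What the scaler does can be written out by hand. This is a minimal NumPy sketch with made-up numbers: the mean and standard deviation are computed on the training rows only and then reused for the other splits, mirroring `fit_transform` on the training set and `transform` elsewhere.

```python
import numpy as np

# Toy features (hypothetical values): a temperature-like and a pressure-like column
train = np.array([[10.0, 1000.0],
                  [20.0, 1010.0],
                  [30.0, 1020.0]])
val = np.array([[20.0, 1030.0]])

# "fit": statistics from the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)      # population std (ddof=0), as StandardScaler uses

# "transform": the same statistics applied to every split
train_z = (train - mean) / std
val_z = (val - mean) / std

print(train_z)
print(val_z)
```

After scaling, every training column has mean 0 and standard deviation 1, so no attribute dominates because of its units alone.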

The model

In [14]:
model = models.Sequential([
    layers.Dense(16, activation = 'relu', input_shape = Xtrain[0].shape),
    layers.Dense(8, activation = 'relu'),
    layers.Dense(1, activation = 'sigmoid')
])

cb = callbacks.EarlyStopping(patience = 5, restore_best_weights = True)
In [15]:
model.compile(optimizer = optimizers.Adam(0.01), loss = losses.BinaryCrossentropy(), metrics = ['accuracy'])

history = model.fit(Xtrain, ytrain, validation_data = (Xval, yval), epochs = 256, callbacks = [cb])
Epoch 1/256
1600/1600 [==============================] - 3s 2ms/step - loss: 0.3912 - accuracy: 0.8192 - val_loss: 0.3656 - val_accuracy: 0.8315
Epoch 2/256
1600/1600 [==============================] - 2s 2ms/step - loss: 0.3670 - accuracy: 0.8311 - val_loss: 0.3571 - val_accuracy: 0.8354
Epoch 3/256
1600/1600 [==============================] - 2s 2ms/step - loss: 0.3586 - accuracy: 0.8371 - val_loss: 0.3534 - val_accuracy: 0.8415
Epoch 4/256
1600/1600 [==============================] - 2s 2ms/step - loss: 0.3526 - accuracy: 0.8394 - val_loss: 0.3543 - val_accuracy: 0.8431
Epoch 5/256
1600/1600 [==============================] - 2s 2ms/step - loss: 0.3474 - accuracy: 0.8408 - val_loss: 0.3447 - val_accuracy: 0.8439
Epoch 6/256
1600/1600 [==============================] - 2s 2ms/step - loss: 0.3438 - accuracy: 0.8437 - val_loss: 0.3479 - val_accuracy: 0.8472
Epoch 7/256
1600/1600 [==============================] - 2s 2ms/step - loss: 0.3413 - accuracy: 0.8428 - val_loss: 0.3528 - val_accuracy: 0.8431
Epoch 8/256
1600/1600 [==============================] - 2s 2ms/step - loss: 0.3395 - accuracy: 0.8450 - val_loss: 0.3441 - val_accuracy: 0.8486
Epoch 9/256
1600/1600 [==============================] - 2s 2ms/step - loss: 0.3370 - accuracy: 0.8465 - val_loss: 0.3411 - val_accuracy: 0.8433
Epoch 10/256
1600/1600 [==============================] - 2s 2ms/step - loss: 0.3345 - accuracy: 0.8462 - val_loss: 0.3393 - val_accuracy: 0.8422
Epoch 11/256
1600/1600 [==============================] - 2s 2ms/step - loss: 0.3329 - accuracy: 0.8457 - val_loss: 0.3433 - val_accuracy: 0.8445
Epoch 12/256
1600/1600 [==============================] - 2s 2ms/step - loss: 0.3320 - accuracy: 0.8470 - val_loss: 0.3447 - val_accuracy: 0.8433
Epoch 13/256
1600/1600 [==============================] - 2s 2ms/step - loss: 0.3297 - accuracy: 0.8475 - val_loss: 0.3409 - val_accuracy: 0.8439
Epoch 14/256
1600/1600 [==============================] - 2s 2ms/step - loss: 0.3295 - accuracy: 0.8471 - val_loss: 0.3461 - val_accuracy: 0.8448
Epoch 15/256
1600/1600 [==============================] - 2s 2ms/step - loss: 0.3283 - accuracy: 0.8484 - val_loss: 0.3478 - val_accuracy: 0.8450
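
Training stops after epoch 15 even though 256 epochs were requested, because of the EarlyStopping callback with `patience = 5`. The sketch below is a simplified re-implementation of that rule, not the Keras code itself; it replays the `val_loss` values from the log above and assumes the default behaviour of monitoring `val_loss` with no minimum improvement threshold.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed."""
    best = float('inf')
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # improvement: remember it and reset the counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# val_loss per epoch, copied from the training log above
val_losses = [0.3656, 0.3571, 0.3534, 0.3543, 0.3447,
              0.3479, 0.3528, 0.3441, 0.3411, 0.3393,
              0.3433, 0.3447, 0.3409, 0.3461, 0.3478]

stop_epoch, best_epoch = simulate_early_stopping(val_losses, patience=5)
print(stop_epoch, best_epoch)
```

The lowest validation loss (0.3393) occurs at epoch 10, and `restore_best_weights = True` means the model keeps the weights from that epoch rather than those from the last one.
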
In [16]:
# Row-normalised confusion matrix: each row (true class) sums to 1
cm = confusion_matrix(ytest, (model.predict(Xtest) > 0.5).astype('int'))
cm = cm / cm.sum(axis=1)[:, np.newaxis]

fig = plt.figure(figsize = (10, 10))
ax = fig.add_subplot(111)

for i in range(cm.shape[0]):        # rows: true labels
    for j in range(cm.shape[1]):    # columns: predicted labels
        # white text on dark cells, black text on light cells
        clr = "white" if cm[i, j] > 0.8 else "black"
        ax.text(j, i, format(cm[i, j], '.2f'), horizontalalignment="center", color=clr)

_ = ax.imshow(cm, cmap=plt.cm.Blues)
ax.set_xticks(range(2))
ax.set_yticks(range(2))
ax.set_xticklabels(['No', 'Yes'], rotation = 90)
ax.set_yticklabels(['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

A higher count of unique samples for the rain class would likely improve accuracy on that class.
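
The normalisation used for the plot divides every row of the confusion matrix by its row total. A minimal sketch with hypothetical counts (not this model's results):

```python
import numpy as np

# Hypothetical counts: rows are true labels (No, Yes), columns are predictions
cm_counts = np.array([[80, 20],
                      [30, 70]])

# Divide each row by the number of true samples in that class
cm_norm = cm_counts / cm_counts.sum(axis=1, keepdims=True)

print(cm_norm)
```

Each row then sums to 1, and the diagonal holds the fraction of each true class that was predicted correctly, which makes the two classes comparable even when their sample counts differ.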

The metrics

In [17]:
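def plot(history, variable, variable2):
    # plot a training metric and its validation counterpart per epoch
    plt.plot(range(len(history[variable])), history[variable])
    plt.plot(range(len(history[variable2])), history[variable2])
    plt.legend([variable, variable2])
    plt.xlabel('epoch')
    plt.title(variable)
    plt.show()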
In [18]:
plot(history.history, "accuracy", 'val_accuracy')
In [19]:
plot(history.history, "loss", "val_loss")

Prediction

In [20]:
labels = ['No', 'Yes']
In [21]:
# pick a random sample from the test set
x = random.randint(0, len(Xtest) - 1)

output = model.predict(Xtest[x].reshape(1, -1))[0][0] 
pred = (output>0.5).astype('int')
print("Predicted: ", labels[pred], "(", output, "-->", pred, ")")    

print("True: ", labels[np.array(ytest)[x][0]])
Predicted:  Yes ( 0.84764785 --> 1 )
True:  Yes

deepC

In [22]:
model.save('rain_prediction.h5')

!deepCC rain_prediction.h5
[INFO]
Reading [keras model] 'rain_prediction.h5'
[SUCCESS]
Saved 'rain_prediction.onnx'
[INFO]
Reading [onnx model] 'rain_prediction.onnx'
[INFO]
Model info:
  ir_vesion : 4
  doc       : 
[WARNING]
[ONNX]: terminal (input/output) dense_input's shape is less than 1. Changing it to 1.
[WARNING]
[ONNX]: terminal (input/output) dense_2's shape is less than 1. Changing it to 1.
WARN (GRAPH): found operator node with the same name (dense_2) as io node.
[INFO]
Running DNNC graph sanity check ...
[SUCCESS]
Passed sanity check.
[INFO]
Writing C++ file 'rain_prediction_deepC/rain_prediction.cpp'
[INFO]
deepSea model files are ready in 'rain_prediction_deepC/' 
[RUNNING COMMAND]
g++ -std=c++11 -O3 -fno-rtti -fno-exceptions -I. -I/opt/tljh/user/lib/python3.7/site-packages/deepC-0.13-py3.7-linux-x86_64.egg/deepC/include -isystem /opt/tljh/user/lib/python3.7/site-packages/deepC-0.13-py3.7-linux-x86_64.egg/deepC/packages/eigen-eigen-323c052e1731 rain_prediction_deepC/rain_prediction.cpp -o rain_prediction_deepC/rain_prediction.exe
[RUNNING COMMAND]
size "rain_prediction_deepC/rain_prediction.exe"
   text	   data	    bss	    dec	    hex	filename
 117618	   9184	    760	 127562	  1f24a	rain_prediction_deepC/rain_prediction.exe
[SUCCESS]
Saved model as executable "rain_prediction_deepC/rain_prediction.exe"
In [23]:
x = random.randint(0, len(Xtest) - 1)

np.savetxt('sample.data', Xtest[x])    # xth sample into text file

# run exe with input
!rain_prediction_deepC/rain_prediction.exe sample.data

# show predicted output
nn_out = np.loadtxt('deepSea_result_1.out')

pred = (nn_out>0.5).astype('int')
print("Predicted: ", labels[pred], "(", nn_out, "-->", pred, ")")    

print("True: ", labels[np.array(ytest)[x][0]])
reading file sample.data.
writing file deepSea_result_1.out.
Predicted:  No ( 2.74525e-05 --> 0 )
True:  No