
Abalone age prediction app

Credit: AITS Cainvas Community

Photo by Nico Medina on Dribbble

Abalone is a common name for sea snails. Determining their age is a detailed process: the shell is cut through the cone, stained, and the rings are counted under a microscope.

Here, we use physical measurements such as length, height, and weight to predict their age instead.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from keras import models, optimizers, losses, layers, callbacks
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import random

The dataset

Data comes from an original (non-machine-learning) study: Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994) "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288)

UCI Machine Learning Repository

The dataset is a CSV file containing features of 4177 samples.

In [2]:
df = pd.read_csv('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/abalone.csv')
df
Out[2]:
Sex Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings
0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.1500 15
1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.0700 7
2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.2100 9
3 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.1550 10
4 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.0550 7
... ... ... ... ... ... ... ... ... ...
4172 F 0.565 0.450 0.165 0.8870 0.3700 0.2390 0.2490 11
4173 M 0.590 0.440 0.135 0.9660 0.4390 0.2145 0.2605 10
4174 M 0.600 0.475 0.205 1.1760 0.5255 0.2875 0.3080 9
4175 F 0.625 0.485 0.150 1.0945 0.5310 0.2610 0.2960 10
4176 M 0.710 0.555 0.195 1.9485 0.9455 0.3765 0.4950 12

4177 rows × 9 columns
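
Before preprocessing, a quick sanity check on value ranges and missing entries is cheap insurance; a minimal sketch:

# summary statistics for the numeric columns
print(df.describe())

# this dataset ships complete, but checking costs nothing
print(df.isna().sum())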

Preprocessing

Encoding the input columns

In [3]:
# One hot encoding the sex attribute.
df_dummies = pd.get_dummies(df['Sex'], drop_first = True, prefix = "Sex_")

# Inserting dummy columns
for column in df_dummies.columns:
    df[column] = df_dummies[column]
    
# Dropping the original column
df = df.drop(columns = ['Sex'])

df
Out[3]:
Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Rings Sex__I Sex__M
0 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.1500 15 0 1
1 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.0700 7 0 1
2 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.2100 9 0 0
3 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.1550 10 0 1
4 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.0550 7 1 0
... ... ... ... ... ... ... ... ... ... ...
4172 0.565 0.450 0.165 0.8870 0.3700 0.2390 0.2490 11 0 0
4173 0.590 0.440 0.135 0.9660 0.4390 0.2145 0.2605 10 0 1
4174 0.600 0.475 0.205 1.1760 0.5255 0.2875 0.3080 9 0 1
4175 0.625 0.485 0.150 1.0945 0.5310 0.2610 0.2960 10 0 0
4176 0.710 0.555 0.195 1.9485 0.9455 0.3765 0.4950 12 0 1

4177 rows × 10 columns
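
As an aside, pandas can do the insert-and-drop above in a single call; a sketch of an equivalent one-liner (applied to the raw df, before the loop above):

# drop_first=True omits the redundant third column (the dummy variable trap)
df = pd.get_dummies(df, columns = ['Sex'], drop_first = True, prefix = 'Sex_')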

Encoding the output columns

In [4]:
def rings_label(x):
    if x<=10:
        return 'young'
    if x<=20:
        return 'middle age'
    if x<=30:
        return 'old'
    
df['Rings'] = df['Rings'].apply(rings_label)
In [5]:
df['Rings'].value_counts()
Out[5]:
young         2730
middle age    1411
old             36
Name: Rings, dtype: int64
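
The classes are heavily imbalanced: 'old' accounts for only 36 of the 4177 samples. One common mitigation, not used in this notebook, is to weight the loss by inverse class frequency; a minimal sketch with scikit-learn:

from sklearn.utils.class_weight import compute_class_weight

# weights inversely proportional to class frequency, in the same order
# as the output columns defined later ('young', 'middle age', 'old')
classes = np.array(['young', 'middle age', 'old'])
weights = compute_class_weight('balanced', classes = classes, y = df['Rings'])
class_weight = dict(enumerate(weights))    # pass via model.fit(..., class_weight = class_weight)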
In [6]:
# One hot encoding the rings attribute (all three columns kept, to serve as softmax targets).
df_dummies = pd.get_dummies(df['Rings'])

# Inserting dummy columns
for column in df_dummies.columns:
    df[column] = df_dummies[column]
    
# Dropping the original column
df = df.drop(columns = ['Rings'])

df
Out[6]:
Length Diameter Height Whole weight Shucked weight Viscera weight Shell weight Sex__I Sex__M middle age old young
0 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.1500 0 1 1 0 0
1 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.0700 0 1 0 0 1
2 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.2100 0 0 0 0 1
3 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.1550 0 1 0 0 1
4 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.0550 1 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ...
4172 0.565 0.450 0.165 0.8870 0.3700 0.2390 0.2490 0 0 1 0 0
4173 0.590 0.440 0.135 0.9660 0.4390 0.2145 0.2605 0 1 0 0 1
4174 0.600 0.475 0.205 1.1760 0.5255 0.2875 0.3080 0 1 0 0 1
4175 0.625 0.485 0.150 1.0945 0.5310 0.2610 0.2960 0 0 0 0 1
4176 0.710 0.555 0.195 1.9485 0.9455 0.3765 0.4950 0 1 1 0 0

4177 rows × 12 columns

Defining the input and output columns

In [7]:
# defining the input and output columns to separate the dataset in the later cells.

input_columns = df.columns.tolist()
input_columns.remove('young')
input_columns.remove('middle age')
input_columns.remove('old')

output_columns = ['young', 'middle age', 'old']

print("Number of input columns: ", len(input_columns))
#print("Input columns: ", ', '.join(input_columns))

print("Number of output columns: ", len(output_columns))
#print("Output columns: ", ', '.join(output_columns))
Number of input columns:  9
Number of output columns:  3

Train validation test split

In [8]:
# Splitting into train, val and test set -- 80-10-10 split

# First, an 80-20 split
train_df, val_test_df = train_test_split(df, test_size = 0.2)

# Then split the 20% into half
val_df, test_df = train_test_split(val_test_df, test_size = 0.5)

print("Number of samples in...")
print("Training set: ", len(train_df))
print("Validation set: ", len(val_df))
print("Testing set: ", len(test_df))
Number of samples in...
Training set:  3341
Validation set:  418
Testing set:  418
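
Since 'old' is so rare, a plain random split can leave the validation or test set with almost no 'old' samples. Stratifying on the class label is a possible alternative; a sketch under that assumption:

# hypothetical alternative: each split keeps roughly the same class proportions
labels = df[output_columns].values.argmax(axis = 1)
train_df, val_test_df = train_test_split(df, test_size = 0.2, stratify = labels)

vt_labels = val_test_df[output_columns].values.argmax(axis = 1)
val_df, test_df = train_test_split(val_test_df, test_size = 0.5, stratify = vt_labels)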
In [9]:
# Splitting into X (input) and y (output)

Xtrain, ytrain = np.array(train_df[input_columns]), np.array(train_df[output_columns])

Xval, yval = np.array(val_df[input_columns]), np.array(val_df[output_columns])

Xtest, ytest = np.array(test_df[input_columns]), np.array(test_df[output_columns])

Standardization

In [10]:
ss = StandardScaler()

Xtrain = ss.fit_transform(Xtrain)
Xval = ss.transform(Xval)
Xtest = ss.transform(Xtest)
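
The scaler is fit on the training set only, so no statistics from the validation or test sets leak into training. For deployment, the fitted scaler has to be saved alongside the model; a minimal sketch using joblib (an assumption here; any serializer works):

import joblib

joblib.dump(ss, 'abalone_scaler.joblib')    # persist the mean/std fitted on the training set
ss = joblib.load('abalone_scaler.joblib')   # reload at inference time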

The model

In [11]:
model = models.Sequential([
    layers.Dense(32, activation = 'relu', input_shape = Xtrain[0].shape),
    layers.Dense(8, activation = 'relu'),
    layers.Dense(3, activation = 'softmax')
])

cb = callbacks.EarlyStopping(patience = 5, restore_best_weights = True)
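
EarlyStopping monitors val_loss by default; with restore_best_weights = True, the weights from the best epoch are restored when training stops. To inspect the network before training:

model.summary()    # layer output shapes and trainable parameter counts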
In [12]:
model.compile(optimizer = optimizers.Adam(0.001), loss = losses.CategoricalCrossentropy(), metrics = ['accuracy'])

history = model.fit(Xtrain, ytrain, validation_data = (Xval, yval), epochs = 256, callbacks = cb)
Epoch 1/256
105/105 [==============================] - 0s 3ms/step - loss: 0.8207 - accuracy: 0.5965 - val_loss: 0.6156 - val_accuracy: 0.7464
Epoch 2/256
105/105 [==============================] - 0s 2ms/step - loss: 0.5924 - accuracy: 0.7204 - val_loss: 0.5516 - val_accuracy: 0.7536
Epoch 3/256
105/105 [==============================] - 0s 2ms/step - loss: 0.5505 - accuracy: 0.7381 - val_loss: 0.5263 - val_accuracy: 0.7536
Epoch 4/256
105/105 [==============================] - 0s 2ms/step - loss: 0.5265 - accuracy: 0.7492 - val_loss: 0.5079 - val_accuracy: 0.7632
Epoch 5/256
105/105 [==============================] - 0s 2ms/step - loss: 0.5081 - accuracy: 0.7611 - val_loss: 0.5002 - val_accuracy: 0.7799
Epoch 6/256
105/105 [==============================] - 0s 2ms/step - loss: 0.4973 - accuracy: 0.7641 - val_loss: 0.4963 - val_accuracy: 0.7799
Epoch 7/256
105/105 [==============================] - 0s 2ms/step - loss: 0.4878 - accuracy: 0.7719 - val_loss: 0.4919 - val_accuracy: 0.7775
Epoch 8/256
105/105 [==============================] - 0s 2ms/step - loss: 0.4824 - accuracy: 0.7731 - val_loss: 0.4878 - val_accuracy: 0.7847
Epoch 9/256
105/105 [==============================] - 0s 2ms/step - loss: 0.4777 - accuracy: 0.7767 - val_loss: 0.4924 - val_accuracy: 0.7871
Epoch 10/256
105/105 [==============================] - 0s 2ms/step - loss: 0.4745 - accuracy: 0.7770 - val_loss: 0.4914 - val_accuracy: 0.7919
Epoch 11/256
105/105 [==============================] - 0s 2ms/step - loss: 0.4722 - accuracy: 0.7719 - val_loss: 0.4863 - val_accuracy: 0.7871
Epoch 12/256
105/105 [==============================] - 0s 2ms/step - loss: 0.4676 - accuracy: 0.7785 - val_loss: 0.4876 - val_accuracy: 0.7823
Epoch 13/256
105/105 [==============================] - 0s 2ms/step - loss: 0.4661 - accuracy: 0.7836 - val_loss: 0.4975 - val_accuracy: 0.7967
Epoch 14/256
105/105 [==============================] - 0s 2ms/step - loss: 0.4664 - accuracy: 0.7767 - val_loss: 0.4922 - val_accuracy: 0.7847
Epoch 15/256
105/105 [==============================] - 0s 2ms/step - loss: 0.4610 - accuracy: 0.7800 - val_loss: 0.4941 - val_accuracy: 0.7990
Epoch 16/256
105/105 [==============================] - 0s 2ms/step - loss: 0.4601 - accuracy: 0.7830 - val_loss: 0.4907 - val_accuracy: 0.7967
In [13]:
model.evaluate(Xtest, ytest)
14/14 [==============================] - 0s 1ms/step - loss: 0.4695 - accuracy: 0.7967
Out[13]:
[0.4695490002632141, 0.7966507077217102]
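
Overall accuracy hides how the tiny 'old' class fares. A per-class breakdown is more informative; a sketch using scikit-learn (zero_division = 0 guards against a class that is never predicted):

from sklearn.metrics import classification_report

y_true = np.argmax(ytest, axis = 1)
y_pred = np.argmax(model.predict(Xtest), axis = 1)
print(classification_report(y_true, y_pred, labels = [0, 1, 2],
                            target_names = output_columns, zero_division = 0))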
In [14]:
cm = confusion_matrix(np.argmax(ytest, axis = 1), np.argmax(model.predict(Xtest), axis = 1))
cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]    # normalize each row (true class) to fractions

fig = plt.figure(figsize = (10, 10))
ax = fig.add_subplot(111)

for i in range(cm.shape[1]):
    for j in range(cm.shape[0]):
        if cm[i,j] > 0.8:
            clr = "white"
        else:
            clr = "black"
        ax.text(j, i, format(cm[i, j], '.2f'), horizontalalignment="center", color=clr)

_ = ax.imshow(cm, cmap=plt.cm.Blues)
ax.set_xticks(range(3))
ax.set_yticks(range(3))
ax.set_xticklabels(output_columns, rotation = 90)
ax.set_yticklabels(output_columns)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

Other attributes, such as weather patterns and location (and hence food availability), could help classify the samples more accurately.

Plotting the metrics

In [15]:
def plot(history, variable, variable2):
    # plot a training metric and its validation counterpart against epochs
    plt.plot(range(len(history[variable])), history[variable])
    plt.plot(range(len(history[variable2])), history[variable2])
    plt.legend([variable, variable2])
    plt.title(variable)
    plt.xlabel('epoch')
In [16]:
plot(history.history, "loss", "val_loss")
In [17]:
plot(history.history, "accuracy", "val_accuracy")

Prediction

In [18]:
# pick a random sample from the test set
x = random.randint(0, len(Xtest) - 1)

output = model.predict(Xtest[x].reshape(1, -1))[0]
print("Predicted: ", output_columns[np.argmax(output)])   
print("Probability: ", output[np.argmax(output)])

print("True: ", output_columns[np.argmax(ytest[x])])
Predicted:  young
Probability:  0.9382079
True:  young

deepC

In [19]:
model.save('abalone.h5')

!deepCC abalone.h5
[INFO]
Reading [keras model] 'abalone.h5'
[SUCCESS]
Saved 'abalone_deepC/abalone.onnx'
[INFO]
Reading [onnx model] 'abalone_deepC/abalone.onnx'
[INFO]
Model info:
  ir_vesion : 4
  doc       : 
[WARNING]
[ONNX]: terminal (input/output) dense_input's shape is less than 1. Changing it to 1.
[WARNING]
[ONNX]: terminal (input/output) dense_2's shape is less than 1. Changing it to 1.
WARN (GRAPH): found operator node with the same name (dense_2) as io node.
[INFO]
Running DNNC graph sanity check ...
[SUCCESS]
Passed sanity check.
[INFO]
Writing C++ file 'abalone_deepC/abalone.cpp'
[INFO]
deepSea model files are ready in 'abalone_deepC/' 
[RUNNING COMMAND]
g++ -std=c++11 -O3 -fno-rtti -fno-exceptions -I. -I/opt/tljh/user/lib/python3.7/site-packages/deepC-0.13-py3.7-linux-x86_64.egg/deepC/include -isystem /opt/tljh/user/lib/python3.7/site-packages/deepC-0.13-py3.7-linux-x86_64.egg/deepC/packages/eigen-eigen-323c052e1731 "abalone_deepC/abalone.cpp" -D_AITS_MAIN -o "abalone_deepC/abalone.exe"
[RUNNING COMMAND]
size "abalone_deepC/abalone.exe"
   text	   data	    bss	    dec	    hex	filename
 123489	   2584	    760	 126833	  1ef71	abalone_deepC/abalone.exe
[SUCCESS]
Saved model as executable "abalone_deepC/abalone.exe"
In [20]:
x = random.randint(0, len(Xtest) - 1)
print(x)
np.savetxt('sample.data', Xtest[x])    # xth sample into text file

# run exe with input
!abalone_deepC/abalone.exe sample.data

# show predicted output
nn_out = np.loadtxt('deepSea_result_1.out')
print(model.predict(Xtest[x].reshape(1, -1))[0])
print(nn_out)
#print(x, Xtest[x])
print("Predicted: ", output_columns[np.argmax(nn_out)])   
print("Probability: ", nn_out[np.argmax(nn_out)])
#print(x, Xtest[x])
print("True: ", output_columns[np.argmax(ytest[x])])
10
writing file deepSea_result_1.out.
[0.8444116  0.15433694 0.00125141]
[0. 1. 0.]
Predicted:  middle age
Probability:  1.0
True:  young
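
Note that in this run the compiled model's output ([0. 1. 0.]) does not match the Keras probabilities for the same sample. A hedged sketch to quantify agreement across the whole test set (assuming, as the log above suggests, that the executable rewrites deepSea_result_1.out on every run):

import os

agree = 0
for i in range(len(Xtest)):
    np.savetxt('sample.data', Xtest[i])
    os.system('abalone_deepC/abalone.exe sample.data > /dev/null')
    nn_out = np.loadtxt('deepSea_result_1.out')
    keras_pred = np.argmax(model.predict(Xtest[i].reshape(1, -1))[0])
    agree += int(np.argmax(nn_out) == keras_pred)

print("Keras / deepC agreement: ", agree / len(Xtest))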