What type of star is it?¶
Credit: AITS Cainvas Community
Photo by Alex Kunchevsky for OUTLΛNE on Dribbble
Identify the type of a star from characteristics such as its luminosity, temperature, and colour.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from tensorflow.keras import models, optimizers, losses, layers, callbacks
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import random
Dataset¶
On Kaggle by Deepraj Baidya | Github
The dataset took 3 weeks to collect and covers 240 stars, most of them gathered from the web. Missing values were calculated manually using equations from astrophysics.
The dataset is a CSV file with characteristics of a star such as luminosity, temperature, colour, and radius that help classify it into one of 6 classes: Brown Dwarf, Red Dwarf, White Dwarf, Main Sequence, Supergiant, Hypergiant.
df = pd.read_csv('https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/star.csv')
df
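As an optional first look (not in the original flow), the class balance can be checked before any preprocessing:
print(df['Star type'].value_counts())  # sample count per class across the 6 star types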
Preprocessing¶
df['Star color'].value_counts()
There are many shades of colours mentioned, some similar (like Yellowish white and White-Yellow), and several different spellings of blue-white.
We can identify 5 basic colours in the given list: blue, white, yellow, orange, red. Let's rewrite the column as 5 binary columns with multi-label values.
colours = ['Blu', 'Whit', 'Yellow', 'Orang', 'Red'] # use colour root words, since spellings differ across shades
df[colours] = 0
for c in colours:
    df.loc[df['Star color'].str.contains(c, case = False), c] = 1
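As a quick sanity check (an optional addition, not part of the original flow), we can confirm that every star matched at least one colour root:
assert df[colours].sum(axis = 1).ge(1).all(), "Some star colour matched no root word"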
df['Spectral Class'].value_counts()
# One hot encoding the input column
df_dummies = pd.get_dummies(df['Spectral Class'], drop_first = True, prefix = 'Spectral')
for column in df_dummies:
    df[column] = df_dummies[column]
# One hot encoding the output column
y = pd.get_dummies(df['Star type'])
# Dropping the encoded columns
df = df.drop(columns = ['Spectral Class', 'Star type', 'Star color'])
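To verify the final feature set after dropping the encoded columns, an optional inspection helps:
print(df.columns.tolist())  # remaining numeric and indicator columns fed to the model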
Looking into the variable value ranges
df.describe()
The standard deviation values differ widely across the attributes, so the features should be standardized (mean = 0, sd = 1).
The dataset is split before standardization so that the scaler is fit on the training set alone, keeping test-set statistics from leaking into training.
Train val test split¶
# Splitting into train, val and test set -- 80-10-10 split
# First, an 80-20 split
Xtrain, X_val_test, ytrain, y_val_test = train_test_split(df, y, test_size = 0.2)
# Then, split that 20% in half for validation and test
Xval, Xtest, yval, ytest = train_test_split(X_val_test, y_val_test, test_size = 0.5)
print("Number of samples in...")
print("Training set: ", len(Xtrain))
print("Validation set: ", len(Xval))
print("Testing set: ", len(Xtest))
Standardization¶
ss = StandardScaler()
Xtrain = ss.fit_transform(Xtrain)
Xval = ss.transform(Xval)
Xtest = ss.transform(Xtest)
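An optional check that the scaler behaved as expected: the training features should now have mean ≈ 0 and sd ≈ 1 per column, while the val/test sets will be close but not exact, since the scaler was fit on the training set alone.
print(Xtrain.mean(axis = 0).round(2))  # ~0 per feature on the training set
print(Xtrain.std(axis = 0).round(2))   # ~1 per feature on the training set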
The model¶
model = models.Sequential([
layers.Dense(16, activation = 'relu', input_shape = Xtrain[0].shape),
layers.Dense(8, activation = 'relu'),
layers.Dense(6, activation = 'softmax')
])
cb = callbacks.EarlyStopping(patience = 5, restore_best_weights = True)  # monitors val_loss by default
model.summary()
model.compile(optimizer = optimizers.Adam(0.001), loss = losses.CategoricalCrossentropy(), metrics = ['accuracy'])
history = model.fit(Xtrain, ytrain, validation_data = (Xval, yval), epochs = 256, callbacks = cb)
model.evaluate(Xtest, ytest)
cm = confusion_matrix(np.argmax(ytest.values, axis = 1), (np.argmax(model.predict(Xtest), axis = 1)))
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]  # normalize each row so cells read as per-class recall
fig = plt.figure(figsize = (10, 10))
ax = fig.add_subplot(111)
for i in range(cm.shape[1]):
    for j in range(cm.shape[0]):
        if cm[i, j] > 0.8:
            clr = "white"
        else:
            clr = "black"
        ax.text(j, i, format(cm[i, j], '.2f'), horizontalalignment="center", color=clr)
_ = ax.imshow(cm, cmap=plt.cm.Blues)
ax.set_xticks(range(6))
ax.set_yticks(range(6))
ax.set_xticklabels(range(6))
ax.set_yticklabels(range(6))
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
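Beyond the confusion matrix, a per-class summary of precision, recall, and F1 can be printed with scikit-learn's classification_report (the class names below simply repeat the dataset's label order):
from sklearn.metrics import classification_report
class_names = ['Brown Dwarf', 'Red Dwarf', 'White Dwarf', 'Main Sequence', 'Supergiant', 'Hypergiant']
print(classification_report(np.argmax(ytest.values, axis = 1),
                            np.argmax(model.predict(Xtest), axis = 1),
                            target_names = class_names))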
Plotting the metrics¶
def plot(history, variable, variable2):
    plt.plot(range(len(history[variable])), history[variable])
    plt.plot(range(len(history[variable2])), history[variable2])
    plt.legend([variable, variable2])
    plt.title(variable)
    plt.show()  # render each metric pair on its own figure
plot(history.history, "loss", "val_loss")
plot(history.history, "accuracy", "val_accuracy")
Prediction¶
classes = ['Brown Dwarf', 'Red Dwarf', 'White Dwarf', 'Main Sequence', 'Supergiant', 'Hypergiant']
# pick a random sample from the test set
x = random.randint(0, len(Xtest) - 1)
output = model.predict(Xtest[x].reshape(1, -1))[0]
print("Predicted: ", classes[np.argmax(output)])
print("Probability: ", output[np.argmax(output)])
print("True: ", classes[np.argmax(ytest.values[x])])
deepC¶
model.save('star.h5')
!deepCC star.h5
x = random.randint(0, len(Xtest) - 1)
print(x)
np.savetxt('sample.data', Xtest[x]) # xth sample into text file
# run exe with input
!star_deepC/star.exe sample.data
# show predicted output
nn_out = np.loadtxt('deepSea_result_1.out')
print("Model output: ", model.predict(Xtest[x].reshape(1, -1))[0])
print("deepC output:",nn_out)
print("Predicted: ", classes[np.argmax(nn_out)])
print("Probability: ", nn_out[np.argmax(nn_out)])
print("True: ", classes[np.argmax(ytest.values[x])])