
Prediction of Remaining Useful Life (RUL) of JET engine

Credit: AITS Cainvas Community

Photo by Mario Jacome on Dribbble

  • The main aim of this post is to document my implementation of a model that can be used to perform predictive maintenance on commercial turbofan engines. The approach used here is data-driven, meaning that data collected from operational jet engines is used to build the predictive maintenance model. Specifically, the goal of the project is to build a predictive model that estimates the Remaining Useful Life (RUL) of a jet engine from run-to-failure data of a fleet of similar jet engines.

Setup: Importing necessary libraries

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
import tensorflow as tf
import sklearn.metrics

Downloading and Unzipping Dataset

In [2]:
!wget -N "https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/archive_XmkjwwI.zip"
!unzip -o "archive_XmkjwwI.zip" 
!rm "archive_XmkjwwI.zip"
--2021-07-14 14:45:39--  https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/archive_XmkjwwI.zip
Resolving cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)... 52.219.64.92
Connecting to cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)|52.219.64.92|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12944209 (12M) [application/x-zip-compressed]
Saving to: ‘archive_XmkjwwI.zip’

archive_XmkjwwI.zip 100%[===================>]  12.34M  --.-KB/s    in 0.1s    

2021-07-14 14:45:39 (118 MB/s) - ‘archive_XmkjwwI.zip’ saved [12944209/12944209]

Archive:  archive_XmkjwwI.zip
  inflating: CMaps/Damage Propagation Modeling.pdf  
  inflating: CMaps/RUL_FD001.txt     
  inflating: CMaps/RUL_FD002.txt     
  inflating: CMaps/RUL_FD003.txt     
  inflating: CMaps/RUL_FD004.txt     
  inflating: CMaps/readme.txt        
  inflating: CMaps/test_FD001.txt    
  inflating: CMaps/test_FD002.txt    
  inflating: CMaps/test_FD003.txt    
  inflating: CMaps/test_FD004.txt    
  inflating: CMaps/train_FD001.txt   
  inflating: CMaps/train_FD002.txt   
  inflating: CMaps/train_FD003.txt   
  inflating: CMaps/train_FD004.txt   
  inflating: CMaps/x.txt             

Understanding and visualizing the data

Our goal is to predict the remaining useful life of a jet engine—how much time it has left before failing.

In [3]:
train_df = pd.read_csv("CMaps/train_FD001.txt",sep=" ",header=None)
test_df = pd.read_csv("CMaps/test_FD001.txt",sep=" ",header=None)

train_df.describe()
Out[3]:
0 1 2 3 4 5 6 7 8 9 ... 18 19 20 21 22 23 24 25 26 27
count 20631.000000 20631.000000 20631.000000 20631.000000 20631.0 20631.00 20631.000000 20631.000000 20631.000000 2.063100e+04 ... 20631.000000 20631.000000 2.063100e+04 20631.000000 20631.0 20631.0 20631.000000 20631.000000 0.0 0.0
mean 51.506568 108.807862 -0.000009 0.000002 100.0 518.67 642.680934 1590.523119 1408.933782 1.462000e+01 ... 8143.752722 8.442146 3.000000e-02 393.210654 2388.0 100.0 38.816271 23.289705 NaN NaN
std 29.227633 68.880990 0.002187 0.000293 0.0 0.00 0.500053 6.131150 9.000605 1.776400e-15 ... 19.076176 0.037505 1.387812e-17 1.548763 0.0 0.0 0.180746 0.108251 NaN NaN
min 1.000000 1.000000 -0.008700 -0.000600 100.0 518.67 641.210000 1571.040000 1382.250000 1.462000e+01 ... 8099.940000 8.324900 3.000000e-02 388.000000 2388.0 100.0 38.140000 22.894200 NaN NaN
25% 26.000000 52.000000 -0.001500 -0.000200 100.0 518.67 642.325000 1586.260000 1402.360000 1.462000e+01 ... 8133.245000 8.414900 3.000000e-02 392.000000 2388.0 100.0 38.700000 23.221800 NaN NaN
50% 52.000000 104.000000 0.000000 0.000000 100.0 518.67 642.640000 1590.100000 1408.040000 1.462000e+01 ... 8140.540000 8.438900 3.000000e-02 393.000000 2388.0 100.0 38.830000 23.297900 NaN NaN
75% 77.000000 156.000000 0.001500 0.000300 100.0 518.67 643.000000 1594.380000 1414.555000 1.462000e+01 ... 8148.310000 8.465600 3.000000e-02 394.000000 2388.0 100.0 38.950000 23.366800 NaN NaN
max 100.000000 362.000000 0.008700 0.000600 100.0 518.67 644.530000 1616.910000 1441.490000 1.462000e+01 ... 8293.720000 8.584800 3.000000e-02 400.000000 2388.0 100.0 39.430000 23.618400 NaN NaN

8 rows × 28 columns

Let's drop the last 2 columns because they're all NaN values.
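Before dropping them, it's worth confirming that both really are entirely NaN (a quick sketch; 26 and 27 are the auto-generated integer column labels, and the empty columns most likely come from trailing whitespace in the raw files):

# Should print True for both columns in both dataframes
print(train_df[[26, 27]].isna().all())
print(test_df[[26, 27]].isna().all())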

In [4]:
train_df.drop(columns=[26,27],inplace=True)
test_df.drop(columns=[26,27],inplace=True)

Now let's add the column names from the documentation:

In [5]:
columns = ['unit_number','time_in_cycles','setting_1','setting_2','TRA','T2','T24','T30','T50','P2','P15','P30','Nf',
           'Nc','epr','Ps30','phi','NRf','NRc','BPR','farB','htBleed','Nf_dmd','PCNfR_dmd','W31','W32' ]

train_df.columns = columns
test_df.columns = columns

train_df.head()
Out[5]:
unit_number time_in_cycles setting_1 setting_2 TRA T2 T24 T30 T50 P2 ... phi NRf NRc BPR farB htBleed Nf_dmd PCNfR_dmd W31 W32
0 1 1 -0.0007 -0.0004 100.0 518.67 641.82 1589.70 1400.60 14.62 ... 521.66 2388.02 8138.62 8.4195 0.03 392 2388 100.0 39.06 23.4190
1 1 2 0.0019 -0.0003 100.0 518.67 642.15 1591.82 1403.14 14.62 ... 522.28 2388.07 8131.49 8.4318 0.03 392 2388 100.0 39.00 23.4236
2 1 3 -0.0043 0.0003 100.0 518.67 642.35 1587.99 1404.20 14.62 ... 522.42 2388.03 8133.23 8.4178 0.03 390 2388 100.0 38.95 23.3442
3 1 4 0.0007 0.0000 100.0 518.67 642.35 1582.79 1401.87 14.62 ... 522.86 2388.08 8133.83 8.3682 0.03 392 2388 100.0 38.88 23.3739
4 1 5 -0.0019 -0.0002 100.0 518.67 642.37 1582.85 1406.22 14.62 ... 522.19 2388.04 8133.80 8.4294 0.03 393 2388 100.0 38.90 23.4044

5 rows × 26 columns

In [6]:
train_df.describe().transpose()
Out[6]:
count mean std min 25% 50% 75% max
unit_number 20631.0 51.506568 2.922763e+01 1.0000 26.0000 52.0000 77.0000 100.0000
time_in_cycles 20631.0 108.807862 6.888099e+01 1.0000 52.0000 104.0000 156.0000 362.0000
setting_1 20631.0 -0.000009 2.187313e-03 -0.0087 -0.0015 0.0000 0.0015 0.0087
setting_2 20631.0 0.000002 2.930621e-04 -0.0006 -0.0002 0.0000 0.0003 0.0006
TRA 20631.0 100.000000 0.000000e+00 100.0000 100.0000 100.0000 100.0000 100.0000
T2 20631.0 518.670000 0.000000e+00 518.6700 518.6700 518.6700 518.6700 518.6700
T24 20631.0 642.680934 5.000533e-01 641.2100 642.3250 642.6400 643.0000 644.5300
T30 20631.0 1590.523119 6.131150e+00 1571.0400 1586.2600 1590.1000 1594.3800 1616.9100
T50 20631.0 1408.933782 9.000605e+00 1382.2500 1402.3600 1408.0400 1414.5550 1441.4900
P2 20631.0 14.620000 1.776400e-15 14.6200 14.6200 14.6200 14.6200 14.6200
P15 20631.0 21.609803 1.388985e-03 21.6000 21.6100 21.6100 21.6100 21.6100
P30 20631.0 553.367711 8.850923e-01 549.8500 552.8100 553.4400 554.0100 556.0600
Nf 20631.0 2388.096652 7.098548e-02 2387.9000 2388.0500 2388.0900 2388.1400 2388.5600
Nc 20631.0 9065.242941 2.208288e+01 9021.7300 9053.1000 9060.6600 9069.4200 9244.5900
epr 20631.0 1.300000 0.000000e+00 1.3000 1.3000 1.3000 1.3000 1.3000
Ps30 20631.0 47.541168 2.670874e-01 46.8500 47.3500 47.5100 47.7000 48.5300
phi 20631.0 521.413470 7.375534e-01 518.6900 520.9600 521.4800 521.9500 523.3800
NRf 20631.0 2388.096152 7.191892e-02 2387.8800 2388.0400 2388.0900 2388.1400 2388.5600
NRc 20631.0 8143.752722 1.907618e+01 8099.9400 8133.2450 8140.5400 8148.3100 8293.7200
BPR 20631.0 8.442146 3.750504e-02 8.3249 8.4149 8.4389 8.4656 8.5848
farB 20631.0 0.030000 1.387812e-17 0.0300 0.0300 0.0300 0.0300 0.0300
htBleed 20631.0 393.210654 1.548763e+00 388.0000 392.0000 393.0000 394.0000 400.0000
Nf_dmd 20631.0 2388.000000 0.000000e+00 2388.0000 2388.0000 2388.0000 2388.0000 2388.0000
PCNfR_dmd 20631.0 100.000000 0.000000e+00 100.0000 100.0000 100.0000 100.0000 100.0000
W31 20631.0 38.816271 1.807464e-01 38.1400 38.7000 38.8300 38.9500 39.4300
W32 20631.0 23.289705 1.082509e-01 22.8942 23.2218 23.2979 23.3668 23.6184

Next, drop the columns whose values are effectively constant (std ≈ 0), since they carry no information about the state of the unit.

In [7]:
train_df.drop(columns=['Nf_dmd','PCNfR_dmd','TRA','P2','T2','farB','epr'], inplace=True)
test_df.drop(columns=['Nf_dmd','PCNfR_dmd','TRA','P2','T2','farB','epr'], inplace=True)
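As an aside, the same list of columns could be derived programmatically rather than hard-coded. A minimal sketch (run on the dataframe before the drop above; the 1e-6 tolerance is my own choice to absorb floating-point noise in the reported standard deviations):

# Columns whose standard deviation is numerically zero carry no information
constant_cols = [c for c in train_df.columns if train_df[c].std() < 1e-6]
print(constant_cols)  # should recover the seven columns dropped above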

Let's make time series plots for two of the units, so we can get a sense of the data.

In [8]:
ax1 = train_df[train_df.unit_number == 1].plot(subplots=True, sharex=True, figsize=(20,20))
In [9]:
ax1 = train_df[train_df.unit_number == 99].plot(subplots=True, sharex=True, figsize=(20,20))
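To compare degradation trends across units directly, a single sensor can also be overlaid for several engines (a sketch; the chosen units and the T50 sensor are arbitrary):

# Overlay one temperature sensor for a handful of units to see how it evolves over each unit's life
fig, ax = plt.subplots(figsize=(12, 4))
for unit in [1, 25, 50, 75, 99]:
    unit_data = train_df[train_df.unit_number == unit]
    ax.plot(unit_data.time_in_cycles, unit_data.T50, label=f"unit {unit}")
ax.set_xlabel("time_in_cycles")
ax.set_ylabel("T50")
ax.legend();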

We can see that different units run for a different total number of cycles before failing. Let's look at this more closely:

In [10]:
train_df.groupby('unit_number')['time_in_cycles'].max().describe()
Out[10]:
count    100.000000
mean     206.310000
std       46.342749
min      128.000000
25%      177.000000
50%      199.000000
75%      229.250000
max      362.000000
Name: time_in_cycles, dtype: float64
In [11]:
fig, ax = plt.subplots()
train_df.groupby('unit_number')['time_in_cycles'].max().hist(ax=ax, bins=30);
ax.set_xlabel("Number of cycles before failure");
ax.set_ylabel("Number of units");

From this we can infer that the remaining useful life (RUL) of each unit in the training data is time_in_cycles.max() minus the current time in cycles, for any given cycle. Let's add that as our dependent variable.

(For the testing data, the RUL is in a separate file.)
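For reference, those per-unit test labels can be loaded from the unzipped archive like this (a sketch; the dropna call guards against a possible trailing-space column, as with the training files):

# One ground-truth RUL value per test unit, valid at that unit's last observed cycle
test_rul = pd.read_csv("CMaps/RUL_FD001.txt", sep=" ", header=None)
test_rul = test_rul.dropna(axis=1, how="all")
test_rul.columns = ["RUL"]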

In [12]:
def add_rul(data):
    """Add an RUL column: the unit's final cycle count minus the current cycle."""
    df = data.copy()
    # Last (maximum) cycle reached by each unit before failure
    max_cycles = df.groupby('unit_number')['time_in_cycles'].max().reset_index()
    max_cycles.columns = ['unit_number', 'max_cycles']
    # Attach the per-unit maximum to every row and count down from it
    df = df.merge(max_cycles, on=['unit_number'], how='left')
    df['RUL'] = df['max_cycles'] - df['time_in_cycles']
    df.drop(columns=['max_cycles'], inplace=True)
    return df

train_df = add_rul(train_df)

We can check to make sure the RUL looks like we would expect:

In [13]:
ax1 = train_df[train_df.unit_number == 1][['time_in_cycles', "RUL"]].plot(subplots=True, sharex=True, figsize=(20,3))

Next, let's see how our independent variables correlate with our dependent variable (RUL) and with each other:

In [14]:
fig, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(train_df.corr(), ax=ax, annot=True,cmap='RdYlGn',linewidths=0.2);
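To read the RUL row of the heatmap more easily, the same correlations can also be pulled out and sorted (a quick sketch):

# Correlation of every remaining feature with the target, from most negative to most positive
rul_corr = train_df.corr()['RUL'].drop('RUL').sort_values()
print(rul_corr)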