Cainvas

Resume Screening using Deep Learning

Credit: AITS Cainvas Community


In this notebook, we build a model that predicts the job-domain category of a given resume. The dataset consists of two columns, Resume and Category, where Resume is the input text and Category is the target label.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
In [2]:
resume = pd.read_csv("https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/UpdatedResumeDataSet.csv")
In [3]:
resume
Out[3]:
Category Resume
0 Data Science Skills * Programming Languages: Python (pandas...
1 Data Science Education Details \r\nMay 2013 to May 2017 B.E...
2 Data Science Areas of Interest Deep Learning, Control Syste...
3 Data Science Skills • R • Python • SAP HANA • Table...
4 Data Science Education Details \r\n MCA YMCAUST, Faridab...
... ... ...
957 Testing Computer Skills: • Proficient in MS office (...
958 Testing ❖ Willingness to accept the challenges. ❖ ...
959 Testing PERSONAL SKILLS • Quick learner, • Eagerne...
960 Testing COMPUTER SKILLS & SOFTWARE KNOWLEDGE MS-Power ...
961 Testing Skill Set OS Windows XP/7/8/8.1/10 Database MY...

962 rows × 2 columns

In [4]:
#view an example of a resume from our data
resume['Resume'][0]
Out[4]:
'Skills * Programming Languages: Python (pandas, numpy, scipy, scikit-learn, matplotlib), Sql, Java, JavaScript/JQuery. * Machine learning: Regression, SVM, Naïve Bayes, KNN, Random Forest, Decision Trees, Boosting techniques, Cluster Analysis, Word Embedding, Sentiment Analysis, Natural Language processing, Dimensionality reduction, Topic Modelling (LDA, NMF), PCA & Neural Nets. * Database Visualizations: Mysql, SqlServer, Cassandra, Hbase, ElasticSearch D3.js, DC.js, Plotly, kibana, matplotlib, ggplot, Tableau. * Others: Regular Expression, HTML, CSS, Angular 6, Logstash, Kafka, Python Flask, Git, Docker, computer vision - Open CV and understanding of Deep learning.Education Details \r\n\r\nData Science Assurance Associate \r\n\r\nData Science Assurance Associate - Ernst & Young LLP\r\nSkill Details \r\nJAVASCRIPT- Exprience - 24 months\r\njQuery- Exprience - 24 months\r\nPython- Exprience - 24 monthsCompany Details \r\ncompany - Ernst & Young LLP\r\ndescription - Fraud Investigations and Dispute Services   Assurance\r\nTECHNOLOGY ASSISTED REVIEW\r\nTAR (Technology Assisted Review) assists in accelerating the review process and run analytics and generate reports.\r\n* Core member of a team helped in developing automated review platform tool from scratch for assisting E discovery domain, this tool implements predictive coding and topic modelling by automating reviews, resulting in reduced labor costs and time spent during the lawyers review.\r\n* Understand the end to end flow of the solution, doing research and development for classification models, predictive analysis and mining of the information present in text data. Worked on analyzing the outputs and precision monitoring for the entire tool.\r\n* TAR assists in predictive coding, topic modelling from the evidence by following EY standards. Developed the classifier models in order to identify "red flags" and fraud-related issues.\r\n\r\nTools & Technologies: Python, scikit-learn, tfidf, word2vec, doc2vec, cosine similarity, Naïve Bayes, LDA, NMF for topic modelling, Vader and text blob for sentiment analysis. Matplot lib, Tableau dashboard for reporting.\r\n\r\nMULTIPLE DATA SCIENCE AND ANALYTIC PROJECTS (USA CLIENTS)\r\nTEXT ANALYTICS - MOTOR VEHICLE CUSTOMER REVIEW DATA * Received customer feedback survey data for past one year. Performed sentiment (Positive, Negative & Neutral) and time series analysis on customer comments across all 4 categories.\r\n* Created heat map of terms by survey category based on frequency of words * Extracted Positive and Negative words across all the Survey categories and plotted Word cloud.\r\n* Created customized tableau dashboards for effective reporting and visualizations.\r\nCHATBOT * Developed a user friendly chatbot for one of our Products which handle simple questions about hours of operation, reservation options and so on.\r\n* This chat bot serves entire product related questions. Giving overview of tool via QA platform and also give recommendation responses so that user question to build chain of relevant answer.\r\n* This too has intelligence to build the pipeline of questions as per user requirement and asks the relevant /recommended questions.\r\n\r\nTools & Technologies: Python, Natural language processing, NLTK, spacy, topic modelling, Sentiment analysis, Word Embedding, scikit-learn, JavaScript/JQuery, SqlServer\r\n\r\nINFORMATION GOVERNANCE\r\nOrganizations to make informed decisions about all of the information they store. 
The integrated Information Governance portfolio synthesizes intelligence across unstructured data sources and facilitates action to ensure organizations are best positioned to counter information risk.\r\n* Scan data from multiple sources of formats and parse different file formats, extract Meta data information, push results for indexing elastic search and created customized, interactive dashboards using kibana.\r\n* Preforming ROT Analysis on the data which give information of data which helps identify content that is either Redundant, Outdated, or Trivial.\r\n* Preforming full-text search analysis on elastic search with predefined methods which can tag as (PII) personally identifiable information (social security numbers, addresses, names, etc.) which frequently targeted during cyber-attacks.\r\nTools & Technologies: Python, Flask, Elastic Search, Kibana\r\n\r\nFRAUD ANALYTIC PLATFORM\r\nFraud Analytics and investigative platform to review all red flag cases.\r\nâ\x80¢ FAP is a Fraud Analytics and investigative platform with inbuilt case manager and suite of Analytics for various ERP systems.\r\n* It can be used by clients to interrogate their Accounting systems for identifying the anomalies which can be indicators of fraud by running advanced analytics\r\nTools & Technologies: HTML, JavaScript, SqlServer, JQuery, CSS, Bootstrap, Node.js, D3.js, DC.js'
In [5]:
resume['Category'].value_counts()
Out[5]:
Java Developer               84
Testing                      70
DevOps Engineer              55
Python Developer             48
Web Designing                45
HR                           44
Hadoop                       42
ETL Developer                40
Operations Manager           40
Mechanical Engineer          40
Data Science                 40
Blockchain                   40
Sales                        40
Arts                         36
Database                     33
PMO                          30
Electrical Engineering       30
Health and fitness           30
DotNet Developer             28
Business Analyst             28
Automation Testing           26
Network Security Engineer    25
Civil Engineer               24
SAP Developer                24
Advocate                     20
Name: Category, dtype: int64
In [6]:
sns.countplot(y="Category", data=resume)
Out[6]:
<AxesSubplot:xlabel='count', ylabel='Category'>
In [7]:
# Pre-processing of data to remove special characters, hashtags, URLs etc.
import re
def cleanResume(resumeText):
    resumeText = re.sub(r'http\S+\s*', ' ', resumeText)  # remove URLs
    # note: this pattern also strips 'cc' inside words (e.g. 'accept' -> 'a ept'),
    # which is visible in the cleaned output further below
    resumeText = re.sub(r'RT|cc', ' ', resumeText)  # remove RT and cc
    resumeText = re.sub(r'#\S+', '', resumeText)  # remove hashtags
    resumeText = re.sub(r'@\S+', '  ', resumeText)  # remove mentions
    resumeText = re.sub('[%s]' % re.escape("""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""), ' ', resumeText)  # remove punctuation
    resumeText = re.sub(r'[^\x00-\x7f]', ' ', resumeText)  # remove non-ASCII characters
    resumeText = re.sub(r'\s+', ' ', resumeText)  # collapse extra whitespace
    return resumeText

resume['cleaned_resume'] = resume['Resume'].apply(cleanResume)
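If stripping 'cc' inside words is undesirable, a word-boundary variant of that one substitution looks like the following sketch (kept separate so the outputs shown in this notebook stay reproducible):

import re

def removeRtCcSafe(text):
    # \b anchors remove 'RT' and 'cc' only as standalone tokens,
    # leaving words such as 'accept' or 'accelerating' intact
    text = re.sub(r'\bRT\b|\bcc\b', ' ', text)
    return re.sub(r'\s+', ' ', text)

print(removeRtCcSafe("Willingness to accept the challenges"))  # -> 'Willingness to accept the challenges'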
In [8]:
#data-set after pre-processing
resume
Out[8]:
Category Resume cleaned_resume
0 Data Science Skills * Programming Languages: Python (pandas... Skills Programming Languages Python pandas num...
1 Data Science Education Details \r\nMay 2013 to May 2017 B.E... Education Details May 2013 to May 2017 B E UIT...
2 Data Science Areas of Interest Deep Learning, Control Syste... Areas of Interest Deep Learning Control System...
3 Data Science Skills • R • Python • SAP HANA • Table... Skills R Python SAP HANA Tableau SAP HANA SQL ...
4 Data Science Education Details \r\n MCA YMCAUST, Faridab... Education Details MCA YMCAUST Faridabad Haryan...
... ... ... ...
957 Testing Computer Skills: • Proficient in MS office (... Computer Skills Proficient in MS office Word B...
958 Testing ❖ Willingness to accept the challenges. ❖ ... Willingness to a ept the challenges Positive ...
959 Testing PERSONAL SKILLS • Quick learner, • Eagerne... PERSONAL SKILLS Quick learner Eagerness to lea...
960 Testing COMPUTER SKILLS & SOFTWARE KNOWLEDGE MS-Power ... COMPUTER SKILLS SOFTWARE KNOWLEDGE MS Power Po...
961 Testing Skill Set OS Windows XP/7/8/8.1/10 Database MY... Skill Set OS Windows XP 7 8 8 1 10 Database MY...

962 rows × 3 columns

In [9]:
# Printing an original resume
print('--- Original resume ---')
print(resume['Resume'][0])
--- Original resume ---
Skills * Programming Languages: Python (pandas, numpy, scipy, scikit-learn, matplotlib), Sql, Java, JavaScript/JQuery. * Machine learning: Regression, SVM, Naïve Bayes, KNN, Random Forest, Decision Trees, Boosting techniques, Cluster Analysis, Word Embedding, Sentiment Analysis, Natural Language processing, Dimensionality reduction, Topic Modelling (LDA, NMF), PCA & Neural Nets. * Database Visualizations: Mysql, SqlServer, Cassandra, Hbase, ElasticSearch D3.js, DC.js, Plotly, kibana, matplotlib, ggplot, Tableau. * Others: Regular Expression, HTML, CSS, Angular 6, Logstash, Kafka, Python Flask, Git, Docker, computer vision - Open CV and understanding of Deep learning.Education Details 

Data Science Assurance Associate 

Data Science Assurance Associate - Ernst & Young LLP
Skill Details 
JAVASCRIPT- Exprience - 24 months
jQuery- Exprience - 24 months
Python- Exprience - 24 monthsCompany Details 
company - Ernst & Young LLP
description - Fraud Investigations and Dispute Services   Assurance
TECHNOLOGY ASSISTED REVIEW
TAR (Technology Assisted Review) assists in accelerating the review process and run analytics and generate reports.
* Core member of a team helped in developing automated review platform tool from scratch for assisting E discovery domain, this tool implements predictive coding and topic modelling by automating reviews, resulting in reduced labor costs and time spent during the lawyers review.
* Understand the end to end flow of the solution, doing research and development for classification models, predictive analysis and mining of the information present in text data. Worked on analyzing the outputs and precision monitoring for the entire tool.
* TAR assists in predictive coding, topic modelling from the evidence by following EY standards. Developed the classifier models in order to identify "red flags" and fraud-related issues.

Tools & Technologies: Python, scikit-learn, tfidf, word2vec, doc2vec, cosine similarity, Naïve Bayes, LDA, NMF for topic modelling, Vader and text blob for sentiment analysis. Matplot lib, Tableau dashboard for reporting.

MULTIPLE DATA SCIENCE AND ANALYTIC PROJECTS (USA CLIENTS)
TEXT ANALYTICS - MOTOR VEHICLE CUSTOMER REVIEW DATA * Received customer feedback survey data for past one year. Performed sentiment (Positive, Negative & Neutral) and time series analysis on customer comments across all 4 categories.
* Created heat map of terms by survey category based on frequency of words * Extracted Positive and Negative words across all the Survey categories and plotted Word cloud.
* Created customized tableau dashboards for effective reporting and visualizations.
CHATBOT * Developed a user friendly chatbot for one of our Products which handle simple questions about hours of operation, reservation options and so on.
* This chat bot serves entire product related questions. Giving overview of tool via QA platform and also give recommendation responses so that user question to build chain of relevant answer.
* This too has intelligence to build the pipeline of questions as per user requirement and asks the relevant /recommended questions.

Tools & Technologies: Python, Natural language processing, NLTK, spacy, topic modelling, Sentiment analysis, Word Embedding, scikit-learn, JavaScript/JQuery, SqlServer

INFORMATION GOVERNANCE
Organizations to make informed decisions about all of the information they store. The integrated Information Governance portfolio synthesizes intelligence across unstructured data sources and facilitates action to ensure organizations are best positioned to counter information risk.
* Scan data from multiple sources of formats and parse different file formats, extract Meta data information, push results for indexing elastic search and created customized, interactive dashboards using kibana.
* Preforming ROT Analysis on the data which give information of data which helps identify content that is either Redundant, Outdated, or Trivial.
* Preforming full-text search analysis on elastic search with predefined methods which can tag as (PII) personally identifiable information (social security numbers, addresses, names, etc.) which frequently targeted during cyber-attacks.
Tools & Technologies: Python, Flask, Elastic Search, Kibana

FRAUD ANALYTIC PLATFORM
Fraud Analytics and investigative platform to review all red flag cases.
• FAP is a Fraud Analytics and investigative platform with inbuilt case manager and suite of Analytics for various ERP systems.
* It can be used by clients to interrogate their Accounting systems for identifying the anomalies which can be indicators of fraud by running advanced analytics
Tools & Technologies: HTML, JavaScript, SqlServer, JQuery, CSS, Bootstrap, Node.js, D3.js, DC.js
In [10]:
# Printing the same resume after text cleaning
print('--- Cleaned resume ---')
print(resume['cleaned_resume'][0])
--- Cleaned resume ---
Skills Programming Languages Python pandas numpy scipy scikit learn matplotlib Sql Java JavaScript JQuery Machine learning Regression SVM Na ve Bayes KNN Random Forest Decision Trees Boosting techniques Cluster Analysis Word Embedding Sentiment Analysis Natural Language processing Dimensionality reduction Topic Modelling LDA NMF PCA Neural Nets Database Visualizations Mysql SqlServer Cassandra Hbase ElasticSearch D3 js DC js Plotly kibana matplotlib ggplot Tableau Others Regular Expression HTML CSS Angular 6 Logstash Kafka Python Flask Git Docker computer vision Open CV and understanding of Deep learning Education Details Data Science Assurance Associate Data Science Assurance Associate Ernst Young LLP Skill Details JAVASCRIPT Exprience 24 months jQuery Exprience 24 months Python Exprience 24 monthsCompany Details company Ernst Young LLP description Fraud Investigations and Dispute Services Assurance TECHNOLOGY ASSISTED REVIEW TAR Technology Assisted Review assists in a elerating the review process and run analytics and generate reports Core member of a team helped in developing automated review platform tool from scratch for assisting E discovery domain this tool implements predictive coding and topic modelling by automating reviews resulting in reduced labor costs and time spent during the lawyers review Understand the end to end flow of the solution doing research and development for classification models predictive analysis and mining of the information present in text data Worked on analyzing the outputs and precision monitoring for the entire tool TAR assists in predictive coding topic modelling from the evidence by following EY standards Developed the classifier models in order to identify red flags and fraud related issues Tools Technologies Python scikit learn tfidf word2vec doc2vec cosine similarity Na ve Bayes LDA NMF for topic modelling Vader and text blob for sentiment analysis Matplot lib Tableau dashboard for reporting MULTIPLE DATA SCIENCE AND ANALYTIC PROJECTS USA CLIENTS TEXT ANALYTICS MOTOR VEHICLE CUSTOMER REVIEW DATA Received customer feedback survey data for past one year Performed sentiment Positive Negative Neutral and time series analysis on customer comments across all 4 categories Created heat map of terms by survey category based on frequency of words Extracted Positive and Negative words across all the Survey categories and plotted Word cloud Created customized tableau dashboards for effective reporting and visualizations CHATBOT Developed a user friendly chatbot for one of our Products which handle simple questions about hours of operation reservation options and so on This chat bot serves entire product related questions Giving overview of tool via QA platform and also give recommendation responses so that user question to build chain of relevant answer This too has intelligence to build the pipeline of questions as per user requirement and asks the relevant recommended questions Tools Technologies Python Natural language processing NLTK spacy topic modelling Sentiment analysis Word Embedding scikit learn JavaScript JQuery SqlServer INFORMATION GOVERNANCE Organizations to make informed decisions about all of the information they store The integrated Information Governance portfolio synthesizes intelligence across unstructured data sources and facilitates action to ensure organizations are best positioned to counter information risk Scan data from multiple sources of formats and parse different file formats extract Meta data information push results for 
indexing elastic search and created customized interactive dashboards using kibana Preforming ROT Analysis on the data which give information of data which helps identify content that is either Redundant Outdated or Trivial Preforming full text search analysis on elastic search with predefined methods which can tag as PII personally identifiable information social security numbers addresses names etc which frequently targeted during cyber attacks Tools Technologies Python Flask Elastic Search Kibana FRAUD ANALYTIC PLATFORM Fraud Analytics and investigative platform to review all red flag cases FAP is a Fraud Analytics and investigative platform with inbuilt case manager and suite of Analytics for various ERP systems It can be used by clients to interrogate their A ounting systems for identifying the anomalies which can be indicators of fraud by running advanced analytics Tools Technologies HTML JavaScript SqlServer JQuery CSS Bootstrap Node js D3 js DC js
In [11]:
#Obtaining the most common words

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
import string
from wordcloud import WordCloud

oneSetOfStopWords = set(stopwords.words('english') + ['``', "''"])
totalWords = []
Sentences = resume['cleaned_resume'].values
cleanedSentences = ""
for i in range(len(resume)):
    cleanedText = cleanResume(Sentences[i])  # already cleaned above, so this second pass is a harmless no-op
    cleanedSentences += cleanedText
    requiredWords = nltk.word_tokenize(cleanedText)
    for word in requiredWords:
        if word not in oneSetOfStopWords and word not in string.punctuation:
            totalWords.append(word)

wordfreqdist = nltk.FreqDist(totalWords)
mostcommon = wordfreqdist.most_common(50)
print(mostcommon)
[nltk_data] Downloading package stopwords to /home/jupyter-
[nltk_data]     gunjan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/jupyter-
[nltk_data]     gunjan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[('Exprience', 3829), ('months', 3233), ('company', 3130), ('Details', 2967), ('description', 2634), ('1', 2134), ('Project', 1808), ('project', 1579), ('6', 1499), ('data', 1438), ('team', 1424), ('Maharashtra', 1385), ('year', 1244), ('Less', 1137), ('January', 1086), ('using', 1041), ('Skill', 1018), ('Pune', 1016), ('Management', 1010), ('SQL', 990), ('Ltd', 934), ('management', 927), ('C', 896), ('Engineering', 855), ('Education', 833), ('Developer', 806), ('Java', 773), ('2', 754), ('development', 752), ('monthsCompany', 746), ('Pvt', 730), ('application', 727), ('System', 715), ('reports', 697), ('business', 696), ('India', 693), ('requirements', 693), ('I', 690), ('various', 688), ('A', 688), ('Data', 674), ('The', 672), ('University', 656), ('process', 648), ('Testing', 646), ('test', 638), ('Responsibilities', 637), ('system', 636), ('testing', 634), ('Software', 632)]
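The list is dominated by boilerplate tokens from the resumes' structure ('Exprience', 'months', 'Details', 'description'). To surface more informative terms, the stop-word set can be extended with those tokens, as a small sketch:

# dataset-specific boilerplate tokens taken from the frequency list above
boilerplate = {'exprience', 'months', 'monthscompany', 'company', 'details',
               'description', 'less', 'year', 'january'}
filteredWords = [w for w in totalWords if w.lower() not in boilerplate]
print(nltk.FreqDist(filteredWords).most_common(10))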
In [12]:
# Visualising the most common words with a word cloud
wordcloud = WordCloud(background_color='black',
                      width=1600,
                      height=800).generate(cleanedSentences)

fig = plt.figure(figsize=(30,20))
plt.imshow(wordcloud)
plt.axis('off')
plt.tight_layout(pad=0)
fig.savefig("tag.png")
plt.show()
In [13]:
from sklearn.utils import shuffle

# Get features and labels from the data and shuffle them together
features = resume['cleaned_resume'].values
original_labels = resume['Category'].values
labels = original_labels.copy()  # a real copy, so the loop below does not mutate original_labels

for i in range(len(resume)):
    labels[i] = labels[i].lower()  # convert to lowercase
    labels[i] = labels[i].replace(" ", "")  # remove spaces so multi-word labels become single tokens

features, labels = shuffle(features, labels)

# Print example feature and label
print(features[0])
print(labels[0])
Education Details June 2013 to June 2016 Diploma Computer science Pune Maharashtra Aissms June 2016 BE pursuing Computer science Pune Maharashtra Anantrao pawar college of Engineering Research centre Python Developer Skill Details Company Details company Cybage Software Pvt Ltd description I want to work in organisation as a python developer to utilize my knowledge To gain more knowledge with our organisation 
pythondeveloper
In [14]:
# Split into train and test
train_split = 0.8
train_size = int(train_split * len(resume))

train_features = features[:train_size]
train_labels = labels[:train_size]

test_features = features[train_size:]
test_labels = labels[train_size:]

# Print size of each split
print(len(train_labels))
print(len(test_labels))
769
193
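Since the categories are imbalanced (from 84 Java Developer resumes down to 20 Advocate), a plain 80/20 slice after shuffling can leave rare classes with very few test examples. A stratified split with scikit-learn is a drop-in alternative, sketched here (not what this notebook uses):

from sklearn.model_selection import train_test_split

# stratify=labels keeps each category's train/test proportion close to 80/20
train_f, test_f, train_l, test_l = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42)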
In [15]:
#tokenize features and labels

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# Tokenize feature data
vocab_size = 6000
oov_tok = '<>'

feature_tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
feature_tokenizer.fit_on_texts(features)

feature_index = feature_tokenizer.word_index
print(dict(list(feature_index.items())[:20]))

# Convert train and test features into integer sequences
train_feature_sequences = feature_tokenizer.texts_to_sequences(train_features)

test_feature_sequences = feature_tokenizer.texts_to_sequences(test_features)
{'<>': 1, 'and': 2, 'the': 3, 'of': 4, 'to': 5, 'in': 6, 'for': 7, 'exprience': 8, 'with': 9, 'company': 10, 'a': 11, 'project': 12, 'months': 13, 'description': 14, 'details': 15, 'on': 16, 'as': 17, 'data': 18, '1': 19, 'management': 20}
In [16]:
# Tokenize label data 
label_tokenizer = Tokenizer(lower=True)
label_tokenizer.fit_on_texts(labels)

label_index = label_tokenizer.word_index
print(dict(list(label_index.items())))

# Convert train and test labels into integer sequences
train_label_sequences = label_tokenizer.texts_to_sequences(train_labels)

test_label_sequences = label_tokenizer.texts_to_sequences(test_labels)
{'javadeveloper': 1, 'testing': 2, 'devopsengineer': 3, 'pythondeveloper': 4, 'webdesigning': 5, 'hr': 6, 'hadoop': 7, 'etldeveloper': 8, 'datascience': 9, 'mechanicalengineer': 10, 'sales': 11, 'blockchain': 12, 'operationsmanager': 13, 'arts': 14, 'database': 15, 'healthandfitness': 16, 'electricalengineering': 17, 'pmo': 18, 'dotnetdeveloper': 19, 'businessanalyst': 20, 'automationtesting': 21, 'networksecurityengineer': 22, 'civilengineer': 23, 'sapdeveloper': 24, 'advocate': 25}
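Note that the label tokenizer assigns indices starting at 1; its index_word attribute is the inverse mapping and is useful for turning a predicted index back into a category name:

# index_word inverts word_index: {1: 'javadeveloper', 2: 'testing', ...}
print(label_tokenizer.index_word[1])  # -> 'javadeveloper'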
In [17]:
# Pad sequences for feature data
max_length = 300
trunc_type = 'post'
pad_type = 'post'

train_feature_padded = pad_sequences(train_feature_sequences, maxlen=max_length, padding=pad_type, truncating=trunc_type)
test_feature_padded = pad_sequences(test_feature_sequences, maxlen=max_length, padding=pad_type, truncating=trunc_type)

# Print example padded sequences from train and test datasets
print(train_feature_padded[0])
print(test_feature_padded[0])
[  55   15  225  382    5  225  156  246  111  133   48   25 2903  225
  156  112 4439  111  133   48   25 4440 4441   79    4   46  411  890
   78   45   41   15   10   15   10 4442   53   54   35   14   71 4443
    5   49    6 2301   17   11   78   45    5 3181  185   97    5 4444
  673   97    9  940 2301    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0]
[  55   15 1876  820    5 1876  820 5832  320  108   33  459   33  459
   41   15   10   15   10 5833 5834   14   33  459  120   60  630    7
 1650 3280  184  208  554    6  312    7 5835  411 5836   57   21  211
  802  762    7  411   57 2863   21   74   16  312 5837    5  279 1834
  146   74   16 2207  152   18   74   16 5838 3594   30  115 4022   74
   16 5839  435  528    5  343  435 2207 3594 1226    6 1125   10 5840
   14   12   14  184    2  524  751 3595  751  146  751  487    2  751
  241   56  184  208  554    6  312  169  295  109  247  554 2088  184
    2  524  751 3595  751  146  751  487  751  241  251 1405    9  234
  751 5841 5842  251 1405    9 5843 1382  730   28    4 3031 2728  277
   67  235 5844   10 5845 5846   14  176  775  101  786 5847   56  120
   21    4 1052   98 1983   21  233  184  208  554    6  312    4   21
 1714  184    2  524 5848 1361  599   16 5849 3596   43    7 2522    4
  579 1120 3596   58    5  115  901  458    9 1643  169  295  109  247
  324    5   43    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0]
In [18]:
# Define the sequential neural network

embedding_dim = 64

model = tf.keras.Sequential([
  # Embedding layer: vocabulary of size 6000 mapped to 64-dimensional vectors (set above).
  # Note: input_length=1 does not match the padded length of 300, which is what triggers
  # the TensorFlow shape warnings during training below; the model still runs because
  # input_length only affects static shape inference.
  tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=1),
  tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)),

  # Dense hidden layer; ReLU is a standard choice of activation here
  tf.keras.layers.Dense(embedding_dim, activation='relu'),

  # Output layer: 26 units (25 categories plus the unused index 0, since the label
  # tokenizer starts indexing at 1), with softmax for a probability distribution
  tf.keras.layers.Dense(26, activation='softmax')
])

model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 1, 64)             384000    
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               66048     
_________________________________________________________________
dense (Dense)                (None, 64)                8256      
_________________________________________________________________
dense_1 (Dense)              (None, 26)                1690      
=================================================================
Total params: 459,994
Trainable params: 459,994
Non-trainable params: 0
_________________________________________________________________
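As a sanity check, the parameter counts in the summary can be reproduced by hand:

emb = 6000 * 64                    # Embedding: vocab_size * embedding_dim = 384,000
lstm = 2 * 4 * (64 + 64 + 1) * 64  # Bi-LSTM: 2 directions * 4 gates * (inputs + recurrent units + bias) * units = 66,048
dense = (128 + 1) * 64             # Dense: 128 inputs + bias, for each of 64 units = 8,256
out = (64 + 1) * 26                # Output: 64 inputs + bias, for each of 26 units = 1,690
print(emb + lstm + dense + out)    # 459994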
In [19]:
# Compile the model and convert train/test data into NumPy arrays
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Features
train_feature_padded = np.array(train_feature_padded)
test_feature_padded = np.array(test_feature_padded)

# Labels
train_label_sequences = np.array(train_label_sequences)
test_label_sequences = np.array(test_label_sequences)
In [20]:
# Train the neural network
num_epochs = 12

history = model.fit(train_feature_padded, train_label_sequences, epochs=num_epochs, validation_data=(test_feature_padded, test_label_sequences), verbose=2)
Epoch 1/12
WARNING:tensorflow:Model was constructed with shape (None, 1) for input Tensor("embedding_input:0", shape=(None, 1), dtype=float32), but it was called on an input with incompatible shape (None, 300).
WARNING:tensorflow:Model was constructed with shape (None, 1) for input Tensor("embedding_input:0", shape=(None, 1), dtype=float32), but it was called on an input with incompatible shape (None, 300).
WARNING:tensorflow:Model was constructed with shape (None, 1) for input Tensor("embedding_input:0", shape=(None, 1), dtype=float32), but it was called on an input with incompatible shape (None, 300).
25/25 - 1s - loss: 3.2139 - accuracy: 0.1339 - val_loss: 3.0833 - val_accuracy: 0.1088
Epoch 2/12
25/25 - 1s - loss: 2.8970 - accuracy: 0.1743 - val_loss: 2.7089 - val_accuracy: 0.1813
Epoch 3/12
25/25 - 1s - loss: 2.4943 - accuracy: 0.2484 - val_loss: 2.2634 - val_accuracy: 0.3057
Epoch 4/12
25/25 - 1s - loss: 2.0164 - accuracy: 0.4174 - val_loss: 1.8173 - val_accuracy: 0.5233
Epoch 5/12
25/25 - 1s - loss: 1.5046 - accuracy: 0.5956 - val_loss: 1.2717 - val_accuracy: 0.6788
Epoch 6/12
25/25 - 1s - loss: 1.1092 - accuracy: 0.6892 - val_loss: 0.9201 - val_accuracy: 0.7513
Epoch 7/12
25/25 - 1s - loss: 0.7728 - accuracy: 0.8349 - val_loss: 0.6580 - val_accuracy: 0.9067
Epoch 8/12
25/25 - 1s - loss: 0.6225 - accuracy: 0.8791 - val_loss: 0.5337 - val_accuracy: 0.8964
Epoch 9/12
25/25 - 1s - loss: 0.3789 - accuracy: 0.9363 - val_loss: 0.3302 - val_accuracy: 0.9430
Epoch 10/12
25/25 - 1s - loss: 0.3345 - accuracy: 0.9233 - val_loss: 0.2675 - val_accuracy: 0.9430
Epoch 11/12
25/25 - 1s - loss: 0.2066 - accuracy: 0.9558 - val_loss: 0.1576 - val_accuracy: 0.9845
Epoch 12/12
25/25 - 1s - loss: 0.1524 - accuracy: 0.9701 - val_loss: 0.1096 - val_accuracy: 0.9896
In [21]:
# Print an example feature from the test set together with its true label

print(test_features[5])
print(test_labels[5])
Education Details January 2016 B E Information Technology Pune Maharashtra Sawitribai Phule Pune University Java Developer Java Developer Vertical Software Skill Details Company Details company Vertical Software description Expertise in design and development of web applications using J2EE Servlets JSP JavaScript HTML CSS JQUERY AJAX JSON Experienced in developing applications using MVC architecture Good understanding of Software Development Life Cycle Phases such as Requirement gathering analysis design development and unit testing Languages open Source Java J2EE Spring Hibernate Frame Work Scripting Languages Server Java JSP Servlets DB Connectivity s Side Program JDBC JavaScript jQuery Ajax JSON Application Server TomCat Database MongoDB MySql IDEs Eclipse 1 Project Title Expense Ledger Role Java Developer Tools and Technologies Java Jsp Servlet MySql JavaScript Json Jquery Ajax 2 Project Title Trimurti Developer Realestate Role Java Developer Tools and Technologies Java Jsp Servlet MySql JavaScript Json Jquery Ajax 3 Project Title Vimay Enterprise Role Java Developer Tools and Technologies Java Jsp Spring Hibernate Maven Jquery Ajax company Higher Secondary School description Pune 58 8 
javadeveloper
In [22]:
# One more example from the test set

print(test_features[8])
print(test_labels[8])
Education Details BE IT pjlce Java Developer Java Developer Skill Details c Exprience Less than 1 year months c Exprience Less than 1 year months JAVA Exprience Less than 1 year months DS Exprience Less than 1 year months Jdbc Exprience 24 months Hibernate Exprience Less than 1 year months Java J2Ee Exprience Less than 1 year months Javascript Exprience 6 months JQuery Exprience 6 months Ajax Exprience 6 monthsCompany Details company Almighty tech pvt ltd nagpur description 1 As a Java Developer ORGANISATION Almighty tech pvt ltd Nagpur DESIGNATION Java Developer DURATION From 1st jan 2018 Notice Period 15 days JOB RESPONSIBILITIES Resolve Bugs Develop project as per user requirement KNOWLEDGE ABOUT Programming language C C DS Java Swing JDBC J2EE java script jquery Ajax Ms office Excel 
javadeveloper
In [23]:
# Evaluate loss and accuracy on the test set
score = model.evaluate(test_feature_padded, test_label_sequences, verbose=1)

print("Test Score:", score[0])
print("Test Accuracy:", score[1])
7/7 [==============================] - 0s 7ms/step - loss: 0.1096 - accuracy: 0.9896
Test Score: 0.10958971083164215
Test Accuracy: 0.9896373152732849
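The evaluation above scores already-padded sequences; classifying a brand-new resume requires running the same cleaning, tokenization, and padding first. A minimal inference sketch, assuming the objects defined earlier (cleanResume, feature_tokenizer, label_tokenizer, model, max_length) are in scope:

def predict_category(resume_text):
    # reuse the exact training-time preprocessing
    cleaned = cleanResume(resume_text)
    seq = feature_tokenizer.texts_to_sequences([cleaned])
    padded = pad_sequences(seq, maxlen=max_length, padding='post', truncating='post')
    probs = model.predict(padded)[0]
    # index 0 is reserved for padding, so a trained model effectively
    # predicts in 1..25; map the argmax back to a category name
    return label_tokenizer.index_word[int(np.argmax(probs))]

print(predict_category("Skills: Python, pandas, scikit-learn, NLP, deep learning"))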
In [24]:
#Visualising the model accuracy and loss

plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])

plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','test'], loc='upper left')
plt.show()

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])

plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','test'], loc='upper left')
plt.show()
In [25]:
model.save("resume_screening.h5")
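The HDF5 file stores only the network weights and architecture; the fitted tokenizers are needed to preprocess new text identically, so they should be persisted alongside it. A sketch using pickle (an assumption, not part of the original notebook):

import pickle

# save the fitted tokenizers next to the model
with open("tokenizers.pkl", "wb") as f:
    pickle.dump({"feature": feature_tokenizer, "label": label_tokenizer}, f)

# later: reload both and predict as before
reloaded_model = tf.keras.models.load_model("resume_screening.h5")
with open("tokenizers.pkl", "rb") as f:
    tokenizers = pickle.load(f)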