Use Case HR – Reduce employee attrition and make talents stay longer (Part 2: Prediction)


In the first part of our analysis we’ve put together some basic insights about our data set and we saw that our features showed quite good correlation rates.

In this part we use the data we’ve prepared to predict which employees will leave the company.

If you’re not a tech-lover you can just read the following summary of what we have done and go back to work 😉

Summary for non-techies (and all those who just want to spend 1 minute):

  • We use a typical HR data set as input file. The file contains columns that can easily be found in most HR IT systems (seniority, last evaluation, salary level, last promotion, work accidents, satisfaction level, etc.). You can see the input file here.
  • We create a model that predicts very accurately which employees will leave the company and which will stay. The accuracy of the model is 98.8% (i.e. in only 1.2% of cases the model does not correctly detect whether an employee will leave).
  • We use the prediction model to calculate, for each employee who is still in the company, the estimated probability of leaving.
  • Finally, we generate a list of all key employees (employees with a recent evaluation rate > 0.7) ordered by leaving probability. You can see the final output list here.



Detailed steps

We have used the notebook function in the “Files” module of the Verteego Data Science Suite to produce some straightforward data science scripting (with Python and scikit-learn) in order to predict which of the current key employees (having evaluations higher than 0.7) are most likely to leave.

You can run or download the notebook file here.

We will see in the 3rd and last part of this Use Case how to use Verteego Data Science Suite to inform the employees’ managers about the situation in their teams through a monthly mailing.

Below you’ll find a walk-through of the script. If you want to run the analysis on your own you can also download all files from the repository.


Import main libraries

In this analysis we use pandas, NumPy and scikit-learn.

import pandas as pd
import numpy as np

# disable warnings to make notebook smoother
import warnings
warnings.filterwarnings('ignore')


Load data

In the first part of this use case, we have done some minor cleaning of the raw data and changed the column names to a more standardized format.
In this step we load all 15k lines of the data into a data frame and display the column names as well as the first five rows of the data set.

# load data frame
leave_df = pd.read_csv('../data/raw_data.csv')
col_names = leave_df.columns.tolist()

# show some basic output
print("Column names:")
print(col_names)

print("\nSample data:")
leave_df.head()





Feature preparation

The important part of this step, and maybe of the whole prediction script, is the feature standardization. After transforming the department labels into integers and putting all values into a feature matrix, feature standardization (using the StandardScaler class) rescales every feature to zero mean and unit variance. The resulting floats are centered on 0.0 but are not confined to the range -1.0 to 1.0. This simple procedure contributes considerably to the accuracy of the predictions, especially for scale-sensitive algorithms such as SVC and KNN.

# Isolate target data
y = leave_df['left']

# We don't need these columns
to_drop = ['name', 'salary', 'left']
leave_feat_space = leave_df.drop(to_drop,axis=1)

# Pull out features for future use
features = leave_feat_space.columns

# convert label features to integers
from sklearn import preprocessing
le_sales = preprocessing.LabelEncoder()
le_sales.fit(leave_feat_space["department"])
leave_feat_space["department"] = le_sales.transform(leave_feat_space["department"])

# transform the whole feature space into a matrix
X = leave_feat_space.values.astype(float)

# standardize all features
scaler = preprocessing.StandardScaler()
X = scaler.fit_transform(X)

print("Feature space holds %d observations and %d features" % X.shape)
print("Unique target labels: %s" % np.unique(y))
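
To make the effect of standardization concrete, here is a minimal sketch in plain Python. It is not part of the notebook: the column of monthly working hours is invented, and the transformation is written out by hand (subtract the column mean, divide by the population standard deviation), which is the calculation StandardScaler applies to each feature.

```python
import statistics

# Hypothetical feature column (values invented for illustration)
monthly_hours = [96, 150, 200, 250, 310]

mean = statistics.mean(monthly_hours)
std = statistics.pstdev(monthly_hours)  # population standard deviation (ddof=0)

# z-score: distance from the mean, measured in standard deviations
scaled = [(value - mean) / std for value in monthly_hours]

for raw, z in zip(monthly_hours, scaled):
    print("%5d -> %+.3f" % (raw, z))

# The scaled column has mean 0 and standard deviation 1,
# but single values can lie outside the interval [-1, 1].
print("mean: %.3f, std: %.3f" % (statistics.mean(scaled), statistics.pstdev(scaled)))
```

After this step a feature measured in hours and a feature measured on a 0 to 1 scale have comparable magnitudes, which matters for distance-based models such as SVC and KNN.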




Prediction function

The prediction function uses a 3-fold cross-validation and is written to take several prediction algorithms as input (we will use it to compare 3 different algorithms in the next step). We call this function both for comparing the algorithms and predicting the leave probabilities in the last step of the script (through the “method” parameter).

# prediction function
from sklearn.model_selection import cross_val_predict

def run_cv(X, y, clf_class, method, **kwargs):
    # Initialize a classifier with keyword arguments
    clf = clf_class(**kwargs)

    # Out-of-fold predictions from a 3-fold cross-validation
    predicted = cross_val_predict(clf, X, y, cv=3, method=method)

    return predicted
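
The following sketch illustrates what the “method” parameter changes. It runs on a synthetic data set generated with make_classification, which stands in for the HR data here; the variable names X_toy and y_toy and the random seeds are assumptions made for this example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the HR feature matrix and the 'left' target
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

clf = RandomForestClassifier(n_estimators=10, random_state=0)

# Every row is predicted by a model trained on the two folds that do not contain it
labels = cross_val_predict(clf, X_toy, y_toy, cv=3, method='predict')
probabilities = cross_val_predict(clf, X_toy, y_toy, cv=3, method='predict_proba')

print(labels.shape)          # one class label per observation
print(probabilities.shape)   # one column per class
print(probabilities[:3])
```

With 'predict' we get one label per employee, which is what we need for accuracy scores and confusion matrices. With 'predict_proba' we get one probability per class, which is what we need for the leaving probabilities.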


Compare prediction algorithms

In order to choose the best classification algorithm we compared a Support Vector Classifier (SVC), a Random Forest Classifier (RF) and a K-Nearest-Neighbors classifier (KNN). As the first output shows, all models produce quite satisfying results (between 95% and 99% accuracy) without any tuning. We finally retained the Random Forest Classifier for the prediction as it produces the best results (98.8% accuracy).

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn import metrics

def accuracy(y, predicted):
    # Share of observations whose predicted label matches the actual label
    return metrics.accuracy_score(y, predicted)

print("Support vector machines:")
print("%.3f" % accuracy(y, run_cv(X, y, SVC, method='predict')))
print("Random forest:")
print("%.3f" % accuracy(y, run_cv(X, y, RF, method='predict')))
print("K-nearest-neighbors:")
print("%.3f" % accuracy(y, run_cv(X, y, KNN, method='predict')))




Calculate confusion matrices

Confusion matrices are a very useful tool to get an overview of the accuracy of a prediction. The matrix provides a count for each combination of predicted and actual class. In our example the confusion matrix has 4 fields: Left-Left, Left-Not left, Not left-Left and Not left-Not left. Thanks to the high quality of the prediction, the Left-Left and Not left-Not left fields contain by far the largest counts.

from sklearn.metrics import confusion_matrix

y = np.array(y)
class_names = np.unique(y)

# calculate confusion matrices
confusion_matrices = [
    ( "Support Vector Machines", confusion_matrix(y, run_cv(X, y, SVC, method='predict')) ),
    ( "Random Forest", confusion_matrix(y, run_cv(X, y, RF, method='predict')) ),
    ( "K-Nearest-Neighbors", confusion_matrix(y, run_cv(X, y, KNN, method='predict')) ),
]

# show confusion matrix values
print(confusion_matrices)
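
To show how the four fields are filled, here is a small hand-rolled sketch in plain Python. The ten labels are invented for illustration; rows stand for the actual class and columns for the predicted class, the same layout scikit-learn's confusion_matrix returns.

```python
# Hypothetical labels: 1 = employee left, 0 = employee stayed
actual    = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# matrix[a][p] counts observations with actual class a and predicted class p
matrix = [[0, 0], [0, 0]]
for a, p in zip(actual, predicted):
    matrix[a][p] += 1

correct = matrix[0][0] + matrix[1][1]   # diagonal = correct predictions
accuracy = correct / float(len(actual))

print(matrix)
print("accuracy: %.2f" % accuracy)
```

The two diagonal fields hold the correct predictions, so the accuracy is simply their sum divided by the number of observations.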




Draw confusion matrices

We use the Seaborn visualization library to draw the confusion matrices for our 3 prediction algorithms.

For example, we can see that the Random Forest Classifier correctly predicts 11,375 times that an employee will stay.

import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

# draw confusion matrices
for name, cm in confusion_matrices:
    df_cm = pd.DataFrame(cm, index=["Real 0", "Real 1"], columns=["Pred 0", "Pred 1"])

    # create the figure first so the heatmap is drawn on its own axes
    fig, ax = plt.subplots(figsize=(6, 5))
    sn.heatmap(df_cm, annot=True, ax=ax, square=True, fmt="d", linewidths=.5)
    ax.set_title(name)




Calculate prediction probabilities for all employees

We call the prediction function with the “predict_proba” method to calculate the probabilities for staying and leaving (left=0, left=1) for all 15K employees in our data set. As each predicted probability value is assigned to several hundred or even thousands of employees, we can compare our predicted probabilities with the share of employees in each group who actually left.

For example, for the group of employees for which we predicted a 60% probability of leaving, the actual leaving percentage is 78%. The predicted probabilities for the two main groups (pred_prob=0% and pred_prob=100%) are very close to the real probabilities, which confirms once more that our model is very accurate.

# Use 10 estimators so predictions are all multiples of 0.1
pred_prob = run_cv(X, y, RF, n_estimators=10, method='predict_proba')

pred_leave = pred_prob[:,1]
is_leave = y == 1

# Number of times a predicted probability is assigned to an observation
counts = pd.Series(pred_leave).value_counts()

# calculate true probabilities
true_prob = {}
for prob in counts.index:
    true_prob[prob] = np.mean(is_leave[pred_leave == prob])
true_prob = pd.Series(true_prob)

# combine counts and true probabilities into one table
counts = pd.concat([counts, true_prob], axis=1).reset_index()
counts.columns = ['pred_prob', 'count', 'true_prob']
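
The comparison between predicted and actual probabilities can be sketched in plain Python as follows. The twelve predicted probabilities and outcomes are invented for illustration and are not taken from the HR data set.

```python
from collections import defaultdict

# Hypothetical predicted leaving probabilities and actual outcomes (1 = left)
pred_leave_toy = [0.0, 0.0, 0.0, 0.0, 0.6, 0.6, 0.6, 0.6, 0.6, 1.0, 1.0, 1.0]
left_toy       = [0,   0,   0,   0,   1,   1,   1,   1,   0,   1,   1,   1]

# Collect the actual outcomes of all employees sharing a predicted probability
outcomes = defaultdict(list)
for prob, left in zip(pred_leave_toy, left_toy):
    outcomes[prob].append(left)

# Observed leaving rate per predicted probability
observed = {prob: sum(vals) / float(len(vals)) for prob, vals in outcomes.items()}

for prob in sorted(observed):
    print("pred_prob=%.1f  count=%d  true_prob=%.2f" % (prob, len(outcomes[prob]), observed[prob]))
```

In this toy example the group with a predicted probability of 0.6 has an observed leaving rate of 0.8, the same kind of gap between prediction and reality that we see for the 60% group in our data.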




Generate list of key employees with leaving and staying probabilities

In the last step we select all key employees (employees that have a last evaluation higher than 0.7) that are still in the company. This gives us a table of about 6K employees. To be able to alert managers about the employees that are most likely to leave, we order the employee list by leaving probability and save the whole list as a CSV file.

You can view and download the final output file here.

#create a dataframe containing prob values
pred_prob_df = pd.DataFrame(pred_prob)
pred_prob_df.columns = ['prob_not_leaving', 'prob_leaving']

#merge dataframes to get the name of employees
all_employees_pred_prob_df = pd.concat([leave_df, pred_prob_df], axis=1)

# keep employees still in the company and having a good evaluation
good_employees_still_working_df = all_employees_pred_prob_df[(all_employees_pred_prob_df["left"] == 0) &
                                                            (all_employees_pred_prob_df["last_evaluation"] >= 0.7)]

# sort by leaving probability (assigning the result avoids an in-place sort on a filtered slice)
good_employees_still_working_df = good_employees_still_working_df.sort_values(by='prob_leaving', ascending=False)

#write to csv
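
The select, sort and export steps can be sketched on a toy table as follows. The employee names and values are invented, and the output file name key_employees_toy.csv as well as the temporary folder are assumptions made for this example; they are not the path used in the notebook.

```python
import os
import tempfile

import pandas as pd

# Toy table with invented values; column names follow the walk-through
toy_df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee", "Eve"],
    "last_evaluation": [0.90, 0.55, 0.80, 0.75, 0.95],
    "left": [0, 0, 1, 0, 0],
    "prob_leaving": [0.2, 0.9, 1.0, 0.6, 0.1],
})

# Keep employees who are still in the company and have a good evaluation
key_df = toy_df[(toy_df["left"] == 0) & (toy_df["last_evaluation"] >= 0.7)]

# Highest leaving probability first
key_df = key_df.sort_values(by="prob_leaving", ascending=False)

# Hypothetical output location (a temporary folder, not the notebook's path)
out_path = os.path.join(tempfile.mkdtemp(), "key_employees_toy.csv")
key_df.to_csv(out_path, index=False)

print(key_df[["name", "prob_leaving"]])
```

Writing with index=False keeps the original row numbers out of the file, so the CSV contains only the employee columns and the two probability columns.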