**Context**

In the first part of our analysis we gathered some basic insights about our data set and saw that our features correlated quite well with one another.

In this part we use the data we’ve prepared to predict which employees will leave the company.

If you’re not a tech-lover you can just read the following summary of what we have done and go back to work 😉

**Summary for non-techies (and all those who just want to spend 1 minute):**

- We use a typical HR data set as input file. The file contains columns that can easily be found in most HR IT systems (seniority, last evaluation, salary level, last promotion, work accidents, satisfaction level, etc.). You can see the input file here.
- We create a model that predicts extremely accurately which employees will leave the company and which will stay. The accuracy of the model is 98.8% (i.e. in only 1.2% of cases does the model fail to predict correctly whether an employee will leave).
- We use the prediction model to calculate, for each employee still in the company, the probability of leaving.
- Finally, we generate a list of all key employees (employees with a recent evaluation rate > 0.7) ordered by leaving probability. You can see the final output list here.

**Detailed steps**

We have used the notebook function in the “Files” module of the Verteego Data Science Suite to produce some straightforward data science scripting (with Python and scikit-learn) in order to predict which of the current key employees (those with evaluations higher than 0.7) are most likely to leave.

You can run or download the notebook file here.

We will see in the 3rd and last part of this Use Case how to use Verteego Data Science Suite to inform the employees’ managers about the situation in their teams through a monthly mailing.

Below you’ll find a walk-through of the script. If you want to run the analysis on your own you can also download all files from the repository.

**Import main libraries**

In this analysis we use pandas, NumPy and scikit-learn.
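The notebook’s exact import cell is not reproduced here; a minimal equivalent would be:

```python
# Core third-party libraries used throughout this analysis.
import numpy as np
import pandas as pd
import sklearn  # scikit-learn; individual estimators are imported where needed
```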

**Load data**

In the first part of this use case, we have done some minor cleaning of the raw data and changed the column names to a more standardized format.

In this step we load all 15k lines of the data into a data frame and display the column names as well as the first five rows of the data set.

Output:
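The loading step could look roughly like the sketch below. The file content and column names are assumptions based on the fields mentioned in the summary; a two-row inline stand-in replaces the real 15k-line CSV so the snippet is self-contained:

```python
import io

import pandas as pd

# Stand-in for the cleaned CSV produced in part one of the use case.
# Column names are assumptions, not the notebook's actual schema.
csv_data = io.StringIO(
    "satisfaction_level,last_evaluation,time_spend_company,work_accident,"
    "promotion_last_5years,department,salary,left\n"
    "0.38,0.53,3,0,0,sales,low,1\n"
    "0.80,0.86,6,0,0,technical,medium,0\n"
)
df = pd.read_csv(csv_data)

print(df.columns.tolist())  # display the column names
print(df.head())            # display the first rows
```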

**Feature preparation**

The important part of this step – and maybe of the whole prediction script – is the feature standardization. After transforming the department labels into integers and putting all values into a feature matrix, feature standardization (using the StandardScaler class) rescales each feature to zero mean and unit variance. This simple procedure contributes considerably to the accuracy of the predictions.

Output:
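As a hedged sketch (with a tiny invented frame standing in for the real data), the encoding and standardization might look like this:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny stand-in frame; the real data set has 15k rows and more columns.
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "last_evaluation": [0.53, 0.86, 0.88],
    "department": ["sales", "technical", "sales"],
})

# Map the department labels to integer codes.
df["department"] = df["department"].astype("category").cat.codes

# Standardize every feature to zero mean and unit variance.
X = StandardScaler().fit_transform(df.to_numpy(dtype=float))

print(X.mean(axis=0))  # each column mean is now (numerically) zero
```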

**Prediction function**

The prediction function uses 3-fold cross-validation and is written to take several prediction algorithms as input (we will use it to compare 3 different algorithms in the next step). We call this function both for comparing the algorithms and for predicting the leave probabilities in the last step of the script (the algorithm is passed in through the “method” parameter).
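A sketch of such a function, assuming a `cross_val_score`-based implementation (the notebook’s exact signature is not shown here, and synthetic data stands in for the standardized HR features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def run_prediction(method, X, y):
    """Return the mean 3-fold cross-validation accuracy of `method`."""
    return cross_val_score(method, X, y, cv=3).mean()


# Synthetic stand-in for the standardized HR feature matrix and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
score = run_prediction(RandomForestClassifier(random_state=0), X, y)
print(round(score, 3))
```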

**Compare prediction algorithms**

In order to choose the best classification algorithm we compared a Support Vector Classifier (SVC), a Random Forest Classifier (RF) and a K-Nearest-Neighbors classifier (KNN). As the first output quickly shows, all models produce quite satisfying results (between 95% and 99% accuracy) without any tuning of the algorithms. We finally retained the Random Forest Classifier for the prediction, as it produces the best results (98.8% accuracy).

Output:
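A minimal version of the comparison, again on synthetic stand-in data (the 95–99% figures reported above come from the real HR data set, not from this sketch):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the standardized HR features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

results = {}
for name, clf in [("SVC", SVC()),
                  ("RF", RandomForestClassifier(random_state=0)),
                  ("KNN", KNeighborsClassifier())]:
    # Mean 3-fold cross-validation accuracy for each untuned model.
    results[name] = cross_val_score(clf, X, y, cv=3).mean()
    print(f"{name}: {results[name]:.3f}")
```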

**Calculate confusion matrices**

Confusion matrices are a very useful tool for getting an overview of a prediction’s accuracy. The matrix provides a count for each combination of predicted and actual class. In our example the confusion matrix has 4 fields: Left-Left, Left-Not left, Not left-Left and Not left-Not left. Thanks to the high quality of the prediction, the Left-Left and Not left-Not left fields contain by far the largest counts.

Output:
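On a toy set of labels (0 = “not left”, 1 = “left”), the calculation reduces to a single scikit-learn call:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = "not left", 1 = "left".
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1]

# Rows = actual class, columns = predicted class.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```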

**Draw confusion matrices**

We use the Seaborn visualization library to draw the confusion matrices for our 3 prediction algorithms.

For example, we can see that the Random Forest Classifier model correctly predicts 11,375 times that an employee will stay.

Output:
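Drawing one such matrix with Seaborn might look like the following sketch (toy labels again; the headless matplotlib backend is only there so the snippet runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display required
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = "not left", 1 = "left".
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1]
cm = confusion_matrix(y_true, y_pred)

# annot=True writes the count into each cell; fmt="d" keeps integers.
ax = sns.heatmap(cm, annot=True, fmt="d",
                 xticklabels=["Not left", "Left"],
                 yticklabels=["Not left", "Left"])
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
```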

**Calculate prediction probabilities for all employees**

We use the predict_proba function to calculate the probabilities of staying and leaving (left=0, left=1) for all 15k employees in our data set. As each predicted probability level covers several hundred or even thousands of employees, we can compare the predicted probabilities with the actual outcomes in each group.

For example, for the group of employees for which we predicted a 60% probability of leaving, the actual leaving rate is 78%. The predicted probabilities for the two main classes (pred_prob=0% and pred_prob=100%), however, are very close to the realized rates, which confirms once again that our model is highly accurate.

Output:
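A self-contained sketch of the probability step, with synthetic data standing in for the standardized HR features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the standardized HR features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Column 0: probability of staying (left=0); column 1: leaving (left=1).
proba = clf.predict_proba(X)
print(proba[:3])
```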

**Generate list of key employees with leaving and staying probabilities**

In the last step we select all key employees (employees with a last evaluation higher than 0.7) who are still in the company. This gives us a table of about 6K employees. In order to be able to alert managers about the employees most likely to leave, we order the list by leaving probability and save it as a CSV file.
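The filter-sort-save step above can be sketched as follows; the frame, column names and output file name are invented stand-ins mirroring the text, not the notebook’s actual identifiers:

```python
import pandas as pd

# Toy stand-in frame (column names are assumptions).
df = pd.DataFrame({
    "employee_id": [1, 2, 3, 4],
    "last_evaluation": [0.9, 0.6, 0.8, 0.95],
    "left": [0, 0, 0, 1],
    "leave_probability": [0.10, 0.50, 0.75, 0.99],
})

# Key employees (evaluation > 0.7) who are still in the company.
key = df[(df["last_evaluation"] > 0.7) & (df["left"] == 0)]

# Most endangered employees first, then export for the managers.
key = key.sort_values("leave_probability", ascending=False)
key.to_csv("key_employees.csv", index=False)
print(key)
```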

You can view and download the final output file here.