Employee_Attrition_Prediction

Table of Contents

Overview
DataSource
Results
Summary

Overview

To check whether an employee's length of employment depends on demographics, this project explores which employee characteristics most influence the decision to leave or stay with the company. Based on that analysis, machine learning models are built to predict attrition, and their accuracy is evaluated.

Analyze the characteristics of employees who left or stayed with the company.
Build machine learning models and check their predictive performance.

DataSource

Human_Resources.csv

This is a clean dataset containing information on 1470 employees, with no unknown or missing values.
35 Columns:
- 7 object columns: Attrition, BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus
- 28 numeric columns: Age, DailyRate, DistanceFromHome, Education, EmployeeCount, EmployeeNumber, EnvironmentSatisfaction, HourlyRate, JobInvolvement, JobLevel, JobSatisfaction, MonthlyIncome, MonthlyRate, NumCompaniesWorked, Over18, OverTime, PercentSalaryHike, PerformanceRating, RelationshipSatisfaction, StandardHours, StockOptionLevel, TotalWorkingYears, TrainingTimesLastYear, WorkLifeBalance, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager

Results

Data Preprocessing

Use lambda functions to convert the boolean (Yes/No) columns to numerical data.

```
employee_df['Attrition'] = employee_df['Attrition'].apply(lambda x: 1 if x == 'Yes' else 0)
employee_df['OverTime'] = employee_df['OverTime'].apply(lambda x: 1 if x == 'Yes' else 0)
employee_df['Over18'] = employee_df['Over18'].apply(lambda x: 1 if x == 'Y' else 0)
```

Use histograms to check the distribution of each column. Remove 3 uninformative columns (each has a single unique value) before further analysis:
- Over18: Y
- StandardHours: 80
- EmployeeCount: 1
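The single-value check above can also be done programmatically rather than by reading histograms. The following is a minimal sketch on a toy frame; `toy_df` and its values are illustrative assumptions, not rows from Human_Resources.csv.

```python
import pandas as pd

# Toy stand-in for employee_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Over18": ["Y", "Y", "Y", "Y"],
    "StandardHours": [80, 80, 80, 80],
    "EmployeeCount": [1, 1, 1, 1],
})

# A column with a single unique value cannot help separate the two classes.
constant_cols = [col for col in toy_df.columns if toy_df[col].nunique() == 1]

# Drop the constant columns and keep everything else.
reduced_df = toy_df.drop(columns=constant_cols)

print(constant_cols)
print(list(reduced_df.columns))
```

On the real data, the same list comprehension would be expected to flag the three columns listed above.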

Data Visualization

Separate the dataset by attrition
```
left_df = employee_df[employee_df['Attrition'] == 1]
stayed_df = employee_df[employee_df['Attrition'] == 0]
```
- 🙋‍♂️ Employees who stayed: 1233 (83.88%)
- 🙅‍♂️ Employees who left: 237 (16.12%)
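The counts and percentages above can be reproduced with `value_counts`. This sketch rebuilds only the `Attrition` column from the counts stated in this README, so it is a reconstruction for illustration rather than a read of the actual CSV.

```python
import pandas as pd

# Rebuild the Attrition column from the stated counts: 1233 stayed (0), 237 left (1).
attrition = pd.Series([0] * 1233 + [1] * 237, name="Attrition")

# Absolute counts per class.
counts = attrition.value_counts()

# Class shares in percent, rounded to two decimals.
shares = (attrition.value_counts(normalize=True) * 100).round(2)

print(counts.to_dict())
print(shares.to_dict())
```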
Relationships between employee attributes, as seen in the correlation map:
- Job level is strongly correlated with total working years
- Monthly income is strongly correlated with job level
- Monthly income is strongly correlated with total working years
- Age is strongly correlated with monthly income
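Strongly correlated pairs like these can also be listed numerically from the correlation matrix. The sketch below uses synthetic data built to be correlated, and the 0.7 cutoff for "strong" is an assumption, not a threshold taken from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: job level grows with experience, income grows with level.
rng = np.random.default_rng(0)
years = rng.integers(0, 40, size=200)
toy_df = pd.DataFrame({
    "TotalWorkingYears": years,
    "JobLevel": years // 8 + 1,
})
toy_df["MonthlyIncome"] = toy_df["JobLevel"] * 3000 + rng.normal(0, 500, size=200)

# Pearson correlation between every pair of numeric columns.
corr = toy_df.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (column, column) pairs and keep those above the assumed 0.7 cutoff.
strong_pairs = upper.stack()
strong_pairs = strong_pairs[strong_pairs.abs() > 0.7]

print(strong_pairs)
```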
Differences between employees who stayed and those who left, as seen in the countplots and KDE plots:
- Countplots
  - Single employees tend to leave more than married or divorced employees
  - Sales Representatives tend to leave more than employees in any other job role
  - Less involved employees tend to leave the company
  - Less experienced employees (low job level) tend to leave the company
- KDE plots
  - Employees who live more than 10 miles from the company tend to leave
  - Employees younger than 30 show higher attrition than those over 30
  - Staying on the same team encourages employees to stay
  - Employees with fewer than 10 total working years tend to leave the company
- Box plots
  - Box plots show monthly income by Gender and JobRole
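A numeric counterpart to the countplots is the attrition rate per category. This is a minimal sketch on made-up rows, so the rates it prints are illustrative and are not results from the dataset.

```python
import pandas as pd

# Made-up rows for illustration; Attrition is 1 for employees who left, 0 for those who stayed.
toy_df = pd.DataFrame({
    "MaritalStatus": ["Single", "Single", "Single", "Married",
                      "Married", "Married", "Divorced", "Divorced"],
    "Attrition": [1, 1, 0, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 column within each group is that group's attrition rate.
rate_by_status = (
    toy_df.groupby("MaritalStatus")["Attrition"]
    .mean()
    .sort_values(ascending=False)
)

print(rate_by_status)
```

The same `groupby` applies to other categorical columns such as `JobRole` or `JobInvolvement`.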

Machine Learning

Data Cleaning
- Categorical columns
- Numerical columns
- OneHotEncoding and merge dataframe
Feature scaling
- Features: X = minmaxscaler.fit_transform(X_all)
- Target: y = employee_df['Attrition']
Split the data into training and testing datasets (test size: 0.25)
Building Models
- Linear: Logistic Regression
  - Accuracy: 84.78%
  - Confusion matrix heatmap
  - Classification report
- Decision Tree Classifier
  - Accuracy: 80.43%
  - Confusion matrix heatmap
  - Classification report
- Random Forest Classifier
  - Accuracy: 82.88%
  - Confusion matrix heatmap
  - Classification report
- XGBoost Classifier
  - Accuracy: 84.78%
  - Confusion matrix heatmap
  - Classification report
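Each model above is evaluated the same way: fit, predict, then compute accuracy, a confusion matrix, and a classification report. The sketch below shows that loop for logistic regression on synthetic data from `make_classification` (an assumption standing in for the real features), so its scores will not match the figures listed above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, roughly 84% class 0 and 16% class 1 (stand-in only).
X, y = make_classification(
    n_samples=1470, n_features=10, weights=[0.84, 0.16], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the model on the training split and predict on the held-out split.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy: share of correct predictions.
accuracy = accuracy_score(y_test, y_pred)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

# Per-class precision, recall, and F1-score.
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(cm)
print(report)
```

Swapping `LogisticRegression` for `DecisionTreeClassifier`, `RandomForestClassifier`, or an XGBoost classifier leaves the evaluation lines unchanged.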

Summary

Who tends to leave the company? Employees who are single (marital status), work as sales representatives (job role), are less involved, have less working experience, live far from the company, or are younger than 30. Typically, employees at an earlier career stage tend to leave the company more than those who are more experienced and need to support a family.
The logistic regression and XGBoost classifier models achieved the highest accuracy score (84.78%).
This is an imbalanced dataset, so the F1-score reflects model performance better than accuracy. The random forest classifier got the lowest F1-score (0.16) for predicting employees who left. Logistic regression and the decision tree both scored 0.42, surpassing the other two models. Overall, logistic regression predicts employee attrition best.
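A short worked example shows why accuracy alone is misleading here. It scores a hypothetical baseline that predicts "stayed" for every employee, using the class counts stated in this README (1233 stayed, 237 left). The baseline is an illustration, not one of the trained models.

```python
def f1_score_from_counts(tp, fp, fn):
    """Compute F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical baseline: always predict "stayed" (class 0).
# It never predicts "left", so it has no true or false positives.
tp, fp, fn, tn = 0, 0, 237, 1233

baseline_accuracy = (tp + tn) / (tp + fp + fn + tn)
baseline_f1 = f1_score_from_counts(tp, fp, fn)

print(f"accuracy={baseline_accuracy:.4f}, f1={baseline_f1:.2f}")
# prints: accuracy=0.8388, f1=0.00
```

A model that never identifies a single leaver still reaches about 84% accuracy, which is why the F1-score on the "left" class is the more informative number.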
Since the dataset is imbalanced, resampling (such as oversampling or undersampling) is suggested before retraining the models.
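One way the suggested oversampling could look is random oversampling of the minority class with `sklearn.utils.resample`. This is a sketch on a toy training frame. The frame, its values, and `random_state=42` are assumptions, and other resampling methods would be equally valid choices.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training split: 8 employees who stayed (0) and 2 who left (1).
train_df = pd.DataFrame({
    "Age": list(range(20, 30)),
    "Attrition": [0] * 8 + [1] * 2,
})

majority = train_df[train_df["Attrition"] == 0]
minority = train_df[train_df["Attrition"] == 1]

# Sample the minority class with replacement until it matches the majority size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Combine into a balanced training set.
balanced_df = pd.concat([majority, minority_upsampled])

print(balanced_df["Attrition"].value_counts().to_dict())
```

Resampling is normally applied to the training split only, so that the test set keeps the original class balance.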
