To determine whether length of employment depends on employee demographics, this project explores the main factors that influence an employee's decision to leave or stay with the company. Based on that analysis, it builds machine learning models to predict attrition and measures their accuracy.
- Analyze the characteristics of employees who left or stayed with the company.
- Use machine learning to build models and check their prediction performance.
- This is a clean dataset with information on 1470 employees and no unknown or missing values.
- 35 columns:
  - 7 object columns: Attrition, BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus
  - 28 numeric columns: Age, DailyRate, DistanceFromHome, Education, EmployeeCount, EmployeeNumber, EnvironmentSatisfaction, HourlyRate, JobInvolvement, JobLevel, JobSatisfaction, MonthlyIncome, MonthlyRate, NumCompaniesWorked, Over18, OverTime, PercentSalaryHike, PerformanceRating, RelationshipSatisfaction, StandardHours, StockOptionLevel, TotalWorkingYears, TrainingTimesLastYear, WorkLifeBalance, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager
- Use lambda functions to convert boolean columns to numerical data.

      employee_df['Attrition'] = employee_df['Attrition'].apply(lambda x: 1 if x == 'Yes' else 0)
      employee_df['OverTime'] = employee_df['OverTime'].apply(lambda x: 1 if x == 'Yes' else 0)
      employee_df['Over18'] = employee_df['Over18'].apply(lambda x: 1 if x == 'Y' else 0)
- Use histograms to check the distribution of each column. Remove 3 uninformative columns (each holds a single unique value) before further analysis:
  - Over18: always Y
  - StandardHours: always 80
  - EmployeeCount: always 1
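The single-value check above can be sketched as follows. This is a minimal illustration on a hypothetical four-row frame (the values are made up, and `constant_cols` is a name introduced here), not the project's actual notebook code.

```python
import pandas as pd

# Hypothetical miniature of the dataset; the real frame has 1470 rows.
employee_df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Over18": [1, 1, 1, 1],
    "StandardHours": [80, 80, 80, 80],
    "EmployeeCount": [1, 1, 1, 1],
    "MonthlyIncome": [5993, 5130, 2090, 2909],
})

# A column with a single unique value cannot help separate the classes.
constant_cols = [col for col in employee_df.columns
                 if employee_df[col].nunique() == 1]

# Drop the constant columns before further analysis.
employee_df = employee_df.drop(columns=constant_cols)
print(constant_cols)
```

Detecting constant columns programmatically, rather than reading them off the histograms, keeps the step reproducible if the data changes.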
- Separate the dataset by attrition

      left_df = employee_df[employee_df['Attrition'] == 1]
      stayed_df = employee_df[employee_df['Attrition'] == 0]

  - 🙋‍♂️ Employees who stayed: 1233 people (83.88%)
  - 🙅‍♂️ Employees who left: 237 people (16.12%)
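One way to get both the counts and the percentages in a single step is `value_counts`. The sketch below uses a hypothetical eight-row frame, so its numbers differ from the real 1233 / 237 split.

```python
import pandas as pd

# Hypothetical toy data: 1 = left, 0 = stayed.
employee_df = pd.DataFrame({"Attrition": [0, 0, 0, 1, 0, 1, 0, 0]})

# Same split as in the project.
left_df = employee_df[employee_df["Attrition"] == 1]
stayed_df = employee_df[employee_df["Attrition"] == 0]

# Absolute counts and percentage shares per class.
counts = employee_df["Attrition"].value_counts()
shares = employee_df["Attrition"].value_counts(normalize=True).mul(100).round(2)

print(f"Stayed: {counts.loc[0]} ({shares.loc[0]}%)")
print(f"Left:   {counts.loc[1]} ({shares.loc[1]}%)")
```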
- Relationships between employee demographics, as seen in the correlation map
  - Job level is strongly correlated with total working years
  - Monthly income is strongly correlated with job level
  - Monthly income is strongly correlated with total working years
  - Age is strongly correlated with monthly income
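Strong pairs like the ones listed above can also be pulled out of the correlation matrix directly instead of being read off a heatmap. The following is a rough sketch on synthetic data: the frame `toy_df`, the way its columns are generated, and the 0.7 cutoff for "strong" are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic stand-in: income and job level are built to grow with experience.
rng = np.random.default_rng(0)
n = 200
total_years = rng.integers(0, 40, size=n)
toy_df = pd.DataFrame({
    "TotalWorkingYears": total_years,
    "JobLevel": 1 + total_years // 10,
    "MonthlyIncome": 2000 + 400 * total_years + rng.normal(0, 500, size=n),
    "DistanceFromHome": rng.integers(1, 30, size=n),
})

# Pearson correlation between every pair of numeric columns.
corr = toy_df.corr()

# Visit each unordered pair once and keep those above the assumed cutoff.
threshold = 0.7
strong = {
    (a, b): corr.loc[a, b]
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > threshold
}

for (a, b), r in sorted(strong.items(), key=lambda item: -abs(item[1])):
    print(f"{a} vs {b}: r = {r:.2f}")
```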
- Differences between the demographics of employees who stayed and those who left, as seen in count plots and KDE plots
- Count plots
  - Single employees are more likely to leave than married or divorced employees
  - Sales Representatives are more likely to leave than employees in any other job role
  - Less involved employees tend to leave the company
  - Less experienced employees (low job level) tend to leave the company
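Count plots show raw counts, so large groups can look worse simply because they are large. A complementary check is the attrition rate per category. The sketch below assumes `Attrition` is already encoded as 0/1 and uses a hypothetical eight-row frame.

```python
import pandas as pd

# Hypothetical toy data; Attrition is 1 for employees who left.
toy_df = pd.DataFrame({
    "MaritalStatus": ["Single", "Single", "Single", "Married",
                      "Married", "Married", "Divorced", "Divorced"],
    "Attrition": [1, 1, 0, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of leavers in each group.
attrition_rate = (
    toy_df.groupby("MaritalStatus")["Attrition"]
    .mean()
    .sort_values(ascending=False)
)
print(attrition_rate)
```

The same pattern would apply to `JobRole`, `JobInvolvement`, or `JobLevel` by swapping the grouping column.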
- KDE plots
  - Employees who live more than 10 miles from the company tend to leave
  - Employees younger than 30 show higher attrition than those over 30
  - Staying in the same team encourages employees to stay
  - Employees with fewer than 10 total working years tend to leave the company
- Box plots
  - The following charts show monthly income by Gender and JobRole
- Data Cleaning
- Feature scaling
  - Features:

        X = minmaxscaler.fit_transform(X_all)

  - Target:

        y = employee_df['Attrition']
- Split the data into training and testing datasets
  - Test size: 0.25
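The scaling and splitting steps can be sketched end to end as follows. The toy frame, the one-hot encoding via `pd.get_dummies`, `random_state=42`, and `stratify=y` are assumptions added for illustration; the source only states the min-max scaling and the 0.25 test size.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with one categorical and two numeric features.
rng = np.random.default_rng(42)
n = 200
toy_df = pd.DataFrame({
    "Age": rng.integers(18, 60, size=n),
    "MonthlyIncome": rng.integers(1000, 20000, size=n),
    "MaritalStatus": rng.choice(["Single", "Married", "Divorced"], size=n),
    "Attrition": rng.choice([0, 1], size=n, p=[0.84, 0.16]),
})

# Assumed encoding step: one-hot encode the categorical column.
X_all = pd.get_dummies(
    toy_df.drop(columns="Attrition"), columns=["MaritalStatus"], dtype=float
)
y = toy_df["Attrition"]

# Scale every feature to the [0, 1] range.
minmaxscaler = MinMaxScaler()
X = minmaxscaler.fit_transform(X_all)

# Hold out 25% for testing; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

This follows the order given above (scale, then split). Fitting the scaler on the training portion only and reusing it on the test portion is a common alternative that keeps test-set statistics out of preprocessing.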
- Building Model
- Linear: Logistic Regression
- Decision Tree Classifier
- Random Forest Classifier
- XGBoost Classifier
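The models listed above can be trained and compared along these lines. This is a sketch, not the project's code: the data comes from `make_classification` as a synthetic stand-in with a similar class imbalance, the hyperparameters are scikit-learn defaults apart from `max_iter` and `random_state`, and XGBoost is left out so the example needs only scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 1470 rows, roughly 16% in the positive ("left") class.
X, y = make_classification(
    n_samples=1470, n_features=20, n_informative=5,
    weights=[0.84, 0.16], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Fit each model and record accuracy plus F1 for the "left" class.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_left": f1_score(y_test, y_pred, pos_label=1, zero_division=0),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.4f}, "
          f"f1_left={results[name]['f1_left']:.2f}")
```

If the `xgboost` package is installed, `xgboost.XGBClassifier` exposes the same `fit` / `predict` interface and could be added to the `models` dictionary.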
- Who tends to leave the company? Employees who are single (marital status), work as sales representatives (job role), are less involved, have less work experience, live far from the company, or are younger than 30. Typically, employees at an earlier career stage are more likely to leave than those who are more experienced and need to support a family.
- The logistic regression and XGBoost classifier models achieved the highest accuracy score (84.78%).
- This is an imbalanced dataset, so the F1-score reflects model performance better than accuracy. The random forest classifier had the lowest F1-score (0.16) for predicting employees who left. Logistic regression and the decision tree both scored 0.42, surpassing the other two models. Overall, logistic regression predicts employee attrition best, since it ties for both the highest accuracy and the highest F1-score.
- Because the dataset is imbalanced, resampling (oversampling or undersampling) is suggested before retraining the models.
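To see why accuracy alone is misleading here, consider a hypothetical baseline that predicts "stayed" for every employee. The sketch below applies it to the full-dataset class counts reported earlier (1233 stayed, 237 left); the helper `f1_from_counts` is a name introduced for this example.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1-score for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Class counts from the dataset summary.
stayed, left = 1233, 237

# Baseline that always predicts "stayed": no leaver is ever flagged.
tp = 0        # leavers predicted as leaving
fp = 0        # stayers predicted as leaving
fn = left     # leavers predicted as staying
tn = stayed   # stayers predicted as staying

accuracy = (tp + tn) / (tp + tn + fp + fn)
f1_left = f1_from_counts(tp, fp, fn)

print(f"accuracy={accuracy:.4f}, f1_left={f1_left:.2f}")
```

The baseline reaches about 83.88% accuracy while its F1-score for leavers is 0, which is why the F1-score is the more informative metric for comparing the models above.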