
Employee_Attrition_Prediction

Table of Contents

  1. Overview
  2. DataSource
  3. Results
  4. Summary

Overview

To examine whether the duration of employment depends on employee demographics, this project explores which employee characteristics most influence the decision to leave or stay with the company. Based on that analysis, machine learning models are built to predict attrition, and their accuracy is evaluated.

  • Analyze the characteristics of employees who left or stayed with the company.
  • Build machine learning models and evaluate their predictive performance.

DataSource

Human_Resources.csv

  • This is a clean dataset containing information on 1470 employees, with no unknown or missing values.
  • 35 Columns:
    • 7 object columns: Attrition, BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus
    • 28 numeric columns: Age, DailyRate, DistanceFromHome, Education, EmployeeCount, EmployeeNumber, EnvironmentSatisfaction, HourlyRate, JobInvolvement, JobLevel, JobSatisfaction, MonthlyIncome, MonthlyRate, NumCompaniesWorked, Over18, OverTime, PercentSalaryHike, PerformanceRating, RelationshipSatisfaction, StandardHours, StockOptionLevel, TotalWorkingYears, TrainingTimesLastYear, WorkLifeBalance, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager

Results

Data Preprocessing

  • Use a lambda function to convert boolean (Yes/No) columns to numerical data.

    employee_df['Attrition'] = employee_df['Attrition'].apply(lambda x:1 if x == 'Yes' else 0)
    employee_df['OverTime'] = employee_df['OverTime'].apply(lambda x:1 if x == 'Yes' else 0)
    employee_df['Over18'] = employee_df['Over18'].apply(lambda x:1 if x == 'Y' else 0)
    
  • Plot a histogram of each column to check its distribution. Remove 3 uninformative columns (each holds a single unique value) before further analysis:

    • Over18 (always Y)
    • StandardHours (always 80)
    • EmployeeCount (always 1)

    (Figures: hist1, hist2)
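The constant columns can also be found programmatically rather than by eye. A minimal sketch, using a toy frame in place of the real Human_Resources.csv (column names from the dataset, values invented):

```python
import pandas as pd

# Toy frame standing in for Human_Resources.csv (values invented for illustration).
employee_df = pd.DataFrame({
    "Age": [41, 49, 37],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
    "EmployeeCount": [1, 1, 1],
})

# Columns with a single unique value carry no information for the model.
constant_cols = [c for c in employee_df.columns if employee_df[c].nunique() == 1]
employee_df = employee_df.drop(columns=constant_cols)
print(constant_cols)  # → ['Over18', 'StandardHours', 'EmployeeCount']
```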

Data Visualization

  • Separate the dataset by attrition

    left_df = employee_df[employee_df['Attrition'] == 1]
    stayed_df = employee_df[employee_df['Attrition'] == 0]
    
    • 🙋‍♂️ Employees who stayed: 1233 (83.88%)
    • 🙅‍♂️ Employees who left: 237 (16.12%)
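The stayed/left counts and percentages quoted above fall out of `value_counts` directly. A small sketch with toy data (1 = left, 0 = stayed, as in the converted Attrition column):

```python
import pandas as pd

# Toy attrition column; the real dataset has 1470 rows.
employee_df = pd.DataFrame({"Attrition": [0, 0, 0, 1, 0, 1]})

left_df = employee_df[employee_df["Attrition"] == 1]
stayed_df = employee_df[employee_df["Attrition"] == 0]

# Percentages in one call.
shares = employee_df["Attrition"].value_counts(normalize=True) * 100
print(len(stayed_df), len(left_df))  # → 4 2
print(round(shares[1], 2))           # → 33.33
```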
  • Relationships between employee demographics, from the view of a correlation map:

    • Job level is strongly correlated with total working years
    • Monthly income is strongly correlated with job level
    • Monthly income is strongly correlated with total working years
    • Age is strongly correlated with monthly income

    (Figure: correlation heatmap)
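The correlations behind the heatmap come from the pairwise Pearson matrix; `seaborn.heatmap(corr)` then renders it. A sketch with invented values (column names from the dataset):

```python
import pandas as pd

# Toy numeric columns; real analysis uses all numeric columns of the dataset.
df = pd.DataFrame({
    "Age": [25, 32, 41, 50, 58],
    "MonthlyIncome": [2500, 4100, 6900, 9800, 13000],
    "JobLevel": [1, 2, 3, 4, 5],
})

# Pairwise Pearson correlation matrix.
corr = df.corr()
print(corr.loc["Age", "MonthlyIncome"].round(2))
```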

  • Differences between the demographics of employees who stayed and who left, from the view of countplots and KDE plots

    • Countplots

      • Single employees tend to leave compared to married and divorced ones
      • Sales Representatives tend to leave more than any other job role
      • Less involved employees tend to leave the company
      • Less experienced employees (low job level) tend to leave the company

      (Figures: countplot_age, countplot_married)

    • KDE plots

      • Employees who live farther than 10 miles from the company tend to leave
      • Employees younger than 30 show higher attrition than those over 30
      • Staying in the same team encourages employees to stay
      • Employees with fewer than 10 total working years tend to leave the company

      (Figures: kde_age, kde_distance from home, kde_years with current manager, kde_total working years)

    • BoxPlot

      • The following charts show monthly income by Gender and JobRole

      (Figures: boxplot_gender, boxplot_jobrole)
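The countplots and boxplots above are standard seaborn/matplotlib calls. A minimal sketch with toy data (column names from the dataset, values invented), using the non-interactive Agg backend so it runs headless; `seaborn.countplot` and `seaborn.boxplot` would give equivalent charts:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Attrition": [1, 0, 0, 1, 0, 0],
    "MaritalStatus": ["Single", "Married", "Divorced", "Single", "Married", "Single"],
    "MonthlyIncome": [2600, 5200, 4800, 3100, 6000, 2900],
    "Gender": ["M", "F", "M", "F", "M", "F"],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Countplot equivalent: number of leavers per marital status.
df.groupby("MaritalStatus")["Attrition"].sum().plot(
    kind="bar", ax=axes[0], title="Left by MaritalStatus")

# Boxplot: monthly income by gender.
df.boxplot(column="MonthlyIncome", by="Gender", ax=axes[1])

fig.savefig("attrition_plots.png")
```

(A KDE curve like those above is `df["Age"].plot(kind="kde")` in pandas; it additionally requires scipy.)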

Machine Learning

  • Data Cleaning

    • Categorical columns cat_df

    • Numerical columns num_df

    • OneHotEncoding and merge dataframe X_all

  • Feature scaling

    • Features: X = minmaxscaler.fit_transform(X_all)
    • Target: y = employee_df['Attrition']
  • Split the data into training and testing sets (test size: 0.25)
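The cleaning, encoding, scaling, and splitting steps above can be sketched end to end. A toy frame stands in for the full dataset, and `pd.get_dummies` stands in for the OneHotEncoder step (the result is the same merged one-hot frame):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy frame: one categorical column, two numeric columns, and the target.
employee_df = pd.DataFrame({
    "Department": ["Sales", "R&D", "Sales", "HR", "R&D", "HR", "Sales", "R&D"],
    "Age": [25, 41, 29, 35, 50, 31, 27, 44],
    "MonthlyIncome": [2600, 7000, 3100, 4500, 11000, 3900, 2800, 8200],
    "Attrition": [1, 0, 1, 0, 0, 0, 1, 0],
})

# One-hot encode categoricals and merge with the numeric columns.
cat_df = pd.get_dummies(employee_df[["Department"]])
num_df = employee_df[["Age", "MonthlyIncome"]]
X_all = pd.concat([cat_df, num_df], axis=1)

# Scale all features to [0, 1]; the target stays as-is.
minmaxscaler = MinMaxScaler()
X = minmaxscaler.fit_transform(X_all)
y = employee_df["Attrition"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)  # → (6, 5) (2, 5)
```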

  • Building Model

    • Linear: Logistic Regression

      • Accuracy: 84.78%

      • Confusion matrix heatmap

        (Figure: heatmap_logistic regression)

      • Classification report

        (Figure: report_logistic regression)

    • Decision Tree Classifier

      • Accuracy: 80.43%

      • Confusion matrix heatmap

        (Figure: heatmap_decision tree classifier)

      • Classification report

        (Figure: report_decision tree)

    • Random Forest Classifier

      • Accuracy: 82.88%

      • Confusion matrix heatmap

        (Figure: heatmap_random forest classifier)

      • Classification report

        (Figure: report_random forest)

    • XGBoost Classifier

      • Accuracy: 84.78%

      • Confusion matrix heatmap

        (Figure: heatmap_xgboost classifier)

      • Classification report

        (Figure: report_xgboost)
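All four models share the same fit/predict interface, so training and scoring them is one loop. A sketch on synthetic imbalanced data (the scikit-learn estimators are shown; `XGBClassifier` from the xgboost package follows the identical pattern):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (~84%/16%, mimicking the attrition split).
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.84, 0.16], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = accuracy_score(y_test, y_pred)
    # sklearn.metrics.confusion_matrix(y_test, y_pred) feeds the heatmaps;
    # sklearn.metrics.classification_report(y_test, y_pred) gives precision/recall/F1.
print(scores)
```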

Summary

  • Who tends to leave the company? Employees who are single (marital status), Sales Representatives (job role), less involved, less experienced, living far from the company, or younger than 30. Typically, employees at an earlier career stage tend to leave, compared to those who are more experienced and need to support a family.
  • The logistic regression and XGBoost classifier models achieved the highest accuracy score (84.78%).
  • This is an imbalanced dataset, so the F1-score better reflects model performance. The random forest classifier got the lowest F1-score (0.16) on predicting employees who left. Logistic regression and the decision tree both scored 0.42, surpassing the other two models. Overall, logistic regression predicts employee attrition best.
  • Since the dataset is imbalanced, resampling (oversampling or undersampling) is suggested before retraining the models.
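Oversampling the minority class can be done with `sklearn.utils.resample` (`RandomOverSampler` from the imbalanced-learn package is a common alternative). A sketch on toy data:

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 6 stayed (0) vs 2 left (1).
df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50, 28, 33],
    "Attrition": [0, 0, 0, 0, 0, 0, 1, 1],
})

majority = df[df["Attrition"] == 0]
minority = df[df["Attrition"] == 1]

# Sample the minority class with replacement up to the majority size.
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=0)
balanced_df = pd.concat([majority, minority_upsampled])

print(balanced_df["Attrition"].value_counts().to_dict())  # → {0: 6, 1: 6}
```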
