Exploratory Data Analysis (EDA) on US Census Data. Investigating income distribution, demographics, and correlations using Python, Pandas, and statistical tests.
This project is an exploratory analysis of US census data. Its main objective is to identify trends and patterns in the distribution of income and to relate them to demographic attributes. The dataset covers age, employment, education level, marital status, place of birth, hours worked, gender, race, state, and income.
- Census_Data.csv: Raw census data, with every attribute stored as a numeric code.
- Attribute_Values.csv: Mapping of numeric attribute values to meaningful categories.
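
Since the raw data is numeric-coded, the codes have to be translated through the mapping file before the results are readable. The following is a minimal sketch of that step, not the notebook's actual code. The column names (`attribute`, `code`, `label`) and the sample values are assumptions made for illustration; the real files may be laid out differently.

```python
import pandas as pd

# Hypothetical stand-ins for Census_Data.csv and Attribute_Values.csv.
# In practice these would come from pd.read_csv(...); the column names
# and codes below are assumed, not taken from the real files.
census = pd.DataFrame({
    "gender": [1, 2, 2, 1],
    "income": [42000, 38000, 51000, 29000],
})
attribute_values = pd.DataFrame({
    "attribute": ["gender", "gender"],
    "code": [1, 2],
    "label": ["Male", "Female"],
})


def decode_column(data, mapping, column):
    """Return `column` with its numeric codes replaced by category labels."""
    lookup = (
        mapping.loc[mapping["attribute"] == column]
        .set_index("code")["label"]
    )
    # Codes that are missing from the lookup become NaN, so gaps stay visible.
    return data[column].map(lookup)


census["gender_label"] = decode_column(census, attribute_values, "gender")
print(census)
```

Keeping the decoded values in a separate column leaves the original codes available for checks against the raw file.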
The project includes the following analyses:
- Income Distribution Analysis
- Histogram of income values
- Log-transformed income distribution
- Zipf plot and cumulative frequency analysis
- Demographic Correlations with Work Category
- Chi-square tests to analyze relationships between gender, race, place of birth, and employment category.
- Visualization of the contingency tables using heatmaps.
- Impact of demographic attributes on income
- Comparisons of average income across gender, race, and place of birth.
- T-tests to identify significant differences.
- Correlations of income with continuous variables
- Scatter plots for education, age and hours worked in relation to income.
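
The income distribution steps listed above (log transform, Zipf plot, cumulative frequency) can be prepared roughly as follows. This is a sketch under stated assumptions: the incomes are synthetic log-normal draws rather than the census values, and the rank-versus-value construction is one common way to build a Zipf plot, not necessarily the one used in the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic incomes stand in for the real income column (assumption).
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10.5, sigma=0.8, size=1000), name="income")

# Log transform: compresses the long right tail so the bulk of the
# distribution becomes visible in a histogram.
log_income = np.log10(income)

# Zipf plot data: sort incomes in descending order and pair each value
# with its rank (rank 1 = highest income). Plotted on log-log axes.
sorted_income = income.sort_values(ascending=False).reset_index(drop=True)
rank = np.arange(1, len(sorted_income) + 1)

# Cumulative frequency: share of observations at or below each income.
ascending_income = np.sort(income.to_numpy())
cumulative_share = np.arange(1, len(ascending_income) + 1) / len(ascending_income)

print(log_income.describe())
```

With Matplotlib, `plt.hist(log_income, bins=50)` would draw the log-transformed histogram and `plt.loglog(rank, sorted_income)` the Zipf plot.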
Additional statistical tests based on new hypotheses formed from exploratory analysis.
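
To make the chi-square tests and t-tests named in the analyses concrete, the sketch below computes both test statistics by hand with Pandas and NumPy. The data frame, its column names, and its values are hypothetical. The t statistic shown is Welch's variant, which is an assumption: the project does not say which t-test it uses.

```python
import numpy as np
import pandas as pd

# Hypothetical, already-decoded sample; not taken from the census files.
df = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "work_category": ["private", "private", "public", "public",
                      "private", "private", "private", "public"],
    "income": [30000, 35000, 40000, 45000, 38000, 42000, 50000, 54000],
})

# Chi-square statistic for independence of gender and work category.
observed = pd.crosstab(df["gender"], df["work_category"])
counts = observed.to_numpy()
row_totals = counts.sum(axis=1)
col_totals = counts.sum(axis=0)
# Expected count per cell if the two variables were independent.
expected = np.outer(row_totals, col_totals) / counts.sum()
chi2 = ((counts - expected) ** 2 / expected).sum()
dof = (counts.shape[0] - 1) * (counts.shape[1] - 1)

# Welch's t statistic for the difference in mean income between groups.
group_f = df.loc[df["gender"] == "F", "income"]
group_m = df.loc[df["gender"] == "M", "income"]
standard_error = np.sqrt(
    group_f.var(ddof=1) / len(group_f) + group_m.var(ddof=1) / len(group_m)
)
t_stat = (group_f.mean() - group_m.mean()) / standard_error

print(f"chi2 = {chi2:.3f} (dof = {dof}), t = {t_stat:.3f}")
```

The toy sample is far too small for either test to be meaningful; it only shows the mechanics. If SciPy is available, `scipy.stats.chi2_contingency(observed)` and `scipy.stats.ttest_ind(group_f, group_m, equal_var=False)` return the corresponding p-values as well. The `observed` table is also what a contingency heatmap would display, for example via `seaborn.heatmap(observed, annot=True)`.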
- Python
- Pandas for data processing
- Seaborn & Matplotlib for data visualization
To execute the analysis, follow these steps:
- Clone the repository:
- git clone https://github.com/PolyzosFotios/us-census-data-analysis.git
- cd us-census-data-analysis
- Install the required dependencies:
- pip install -r requirements.txt
- Start Jupyter Notebook:
- jupyter notebook
- Open and run notebook.ipynb.
This project is available under the MIT License.