[go: up one dir, main page]
More Web Proxy on the site http://driver.im/

DEV Community

Cover image for Data Wrangling with Python 🐍
Fady GA 😎
Fady GA 😎

Posted on • Edited on

Data Wrangling with Python 🐍

A long time ago in a galaxy far, far away ...
A group of people thought they can use whatever data they have right away to generate a killer analysis that is so advanced that it's light year ahead of any analysis that was done so far. And they failed instantly! πŸ˜‚

Unless your data is coming from a system that you built, that is inside a bomb shelter πŸ’£ and no one has access to except for you, maybe then you can use whatever data this system outputs out of the box. Other than that, you will almost always have to do some sort of pre-processing on your data so it can be usable in your use case as more often than not raw data will likely contain (specially if it's collected manually) empty values, wrong entries types ("four" instead of 4 in a "quantity" column πŸ˜•), duplicated rows, strange column names and my personal favorite, typos in a text column πŸ˜….

Transforming your data from its raw (sometimes called "dirty") state to its "clean" state where it can be used in whatever you want it to be used for, is called "data wrangling" (yes, the cowboys thing 😁)
This includes (and not limited to):

  • Removing/replacing empty values.
  • Removing duplicates.
  • Renaming columns.
  • Creating new columns.
  • Dropping irrelevant columns.
  • Changing columns data types.
  • Enriching the data from external sources.
  • ...

The good news is if you know a little bit of python and have the basic understanding of the data wrangling operations, you can do most of the wrangling stuff relatively easy using python, more specifically the pandas 🐼 package. 🀩

So let's begin our "Data Wrangling with Python" crash course πŸ˜‰

Β Β Β Β Β The following section assumes that you have python installed on your system. Either a normal python installation or a conda installation are fine.

(optional) Create a virtual environment:

     With python, you can create several environments isolated from your base installation to develop using a specific python/package version, test things out, try new packages or new packages features or for organizational purposes. In most cases, it's good thing 😊. The following will create a folder named wranglingWithPandas, cd into it, create a virtual environment called venv_pandas and activate it.

In your terminal, type

mkdir wranglingWithPandas
cd wranglingWithPandas

pip -m venv venv_pandas
.\venv_pandas\Scripts\activate (Windows)
source ./venv_pandas/bin/activate (Linux / MacOS)

Installing pandas:

pip install pandas

(optional) Installing Ipython or Jupyter Lab

Β Β Β Β Β It's advisable when working with data to work in an interactive environments. One way to do that is by installing ipython or jupyterlab packages, whichever you are familiar with πŸ€·β€β™‚οΈ.
(only one is required)
pip install ipython
pip install jupyterlab

To run them, type ipython or jupyter lab (note the space) in your terminal.

Importing data:

     Pandas can import data from a wide variety of data sources. Usually it uses the function read_x where "x" is csv, excel, json, ..., etc. In this tutorial, we will use a csv (comma separated values) file representing a transactions (purchases) log in a grocery store. Each row represents one item in a single transaction. One transaction can have multiple items. Sometimes, manual entries can happen 😬. You can find the sample dataset and all the code in this tutorial in this Github repo. The goal of this tutorial is to wrangle this file to find out what is the total revenue per item category and the top 3 purchased items.

To read the csv file in a "DataFrame" (pandas' way of saying, table πŸ˜‰), we use the read_csv function:

import pandas as pd     # A pandas convention πŸ˜…
import re               # Regex library. We will use that later.
import pathlib          # To manage paths in an OOP why.
import datetime         # To manage dates.

file_path = pathlib.Path("store_data_20230116.csv") # Assuming the file is in the same directory.
store_data = pd.read_csv(file_path)
Enter fullscreen mode Exit fullscreen mode

Data Exploration:

Β Β Β Β Β The purpose of data exploration is to understand the data at hand and to identify the issues that the data contains. One way to quickly explore it, is to use the head and info methods as follows.

store_data.head()
Enter fullscreen mode Exit fullscreen mode

Image description

store_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10038 entries, 0 to 10037
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Tran ID     10038 non-null  int64  
 1   Cat         9956 non-null   object 
 2   Item        9966 non-null   object 
 3   Qty         10023 non-null  float64
 4   Unit Price  10011 non-null  float64
dtypes: float64(2), int64(1), object(2)
memory usage: 392.2+ KB
Enter fullscreen mode Exit fullscreen mode

And to explore the distinct values, you can use the unique method.

store_data.Item.unique()

array(['black-eyed peas', 'chickpeas', 'apples', 'grapes', 'lettuce',
       'lentils', 'potatos', 'bananas', 'carrots', 'broccoli', 'LETTUCE',
       nan, 'oranges', 'navy beans', 'Oranges', 'Apples', 'Carrots',
       'Potatos', 'LENTILS', 'Broccoli', 'Chickpeas', 'CARROTS',
       'NAVY BEANS', 'Navy beans', 'BANANAS', 'BROCCOLI', 'Bananas',
       'Lentils', 'Lettuce', 'ORANGES', 'Black-eyed peas', 'Grapes',
       'GRAPES', 'CHICKPEAS', 'APPLES', 'BLACK-EYED PEAS'], dtype=object)

store_data.Cat.unique()

array(['legumes', 'fruits', 'vegetables', nan, 'VEGETABLES', 'Fruits',
       'Legumes', 'Vegetables', 'FRUITS', 'LEGUMES'], dtype=object)
Enter fullscreen mode Exit fullscreen mode

As you can see, the Cat and Item columns contain the same items (per column) but the text case differs. This might be due to the manual input. We will deal with that later.

Another problem that we have is the dreaded null values 😣!
isnull and isna methods can help with identifying the null values in your data.

store_data.isnull().sum()

Tran ID        0
Cat           82
Item          72
Qty           15
Unit Price    27
dtype: int64

Enter fullscreen mode Exit fullscreen mode

Apparently all the columns except for Tran ID contain null values!

Duplicated rows are another issue that you may find in your data! The duplicated method can help to identify those.

store_data.duplicated().sum()   # Duplicated records count.

175
Enter fullscreen mode Exit fullscreen mode
store_data[store_data.duplicated(keep=False)].head(10)
Enter fullscreen mode Exit fullscreen mode

Image description

And now for the fun part, Data Wrangling πŸ˜‰:

Β Β Β Β Β As I said, Data Wrangling consists of a lot of operations. You may perform all or a subset of them depending on what kind of data issues you discovered in your dataset.
In this tutorial, we will do some basic operations to deal with what we have found.
We will start by renaming the columns. We can do this operation either in bulk or individually. When working with dataframes, I personally prefer lower case column names without any spaces. This makes referencing the columns juuuust a litle bit 🀏 faster and avoids lots of potential issues with column names that contains spaces. This is what we will do here.

# Doing a bulk rename
store_data.columns = [col.replace(" ", "_").lower() for col in store_data.columns]

# Doing individual rename
store_data.rename(
    columns={
        "cat": "category",
        "qty": "quantity"
    },
    inplace=True    # Without this, the rename method with return a copy of the modified dataframe.
)

store_data.columns

Index(['tran_id', 'category', 'item', 'quantity', 'unit_price'], dtype='object')
Enter fullscreen mode Exit fullscreen mode

Now we will get rid of the 175 duplicated rows that we saw earlier as it's unlikely to have repeated items in the same transaction, just increase the item's quantity πŸ€·β€β™‚οΈ (note that the duplicated method used here decides that the rows are duplicates when all the columns are the same for those rows. you can change this behavior with the subset parameter)

# One way to drop the duplicates is to not select them.
count_before = len(store_data)
store_data = store_data[~store_data.duplicated()]
count_after = len(store_data)

print(f"Rows count before removing duplicates {count_before}, and after {count_after}. The diff is {count_before - count_after}")

Rows count before removing duplicates 10038, and after 9863. The diff is 175
Enter fullscreen mode Exit fullscreen mode

Next, the tran_id won't be needed for this analysis. Let's drop it!

store_data.drop(
    columns="tran_id",
    inplace=True        
    )

store_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9863 entries, 0 to 10037
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   category    9781 non-null   object 
 1   item        9791 non-null   object 
 2   quantity    9848 non-null   float64
 3   unit_price  9836 non-null   float64
dtypes: float64(2), object(2)
memory usage: 385.3+ KB
Enter fullscreen mode Exit fullscreen mode

Remember when we found out that the category and item columns' text are similar, but the case is different? let's correct that.

store_data.category = store_data.category.str.lower()
store_data.item = store_data.item.str.lower()
Enter fullscreen mode Exit fullscreen mode

With str, you can access a lot of string functions that are offered by pandas and lower() is one of them that is vectorized (you can think of it as working in parallel) across the whole pandas' series (column).

Dealing with null values (a.k.a. missing values) is a whole topic on its own. One wrong decision about your missing data and your analysis goes in a whole different direction (maybe the wrong one! 😬). We will try to deal with null values in this dataset within the given context.

If you recall, we found out that all the columns contain null values (except for the tran_id but we have dropped that).
the first column we will deal with is quantity.
I've decided to drop all rows with missing quantities. After all, I can't infer (accurately guessing πŸ˜‰) this figure from anywhere and a wrongfully inferring the quantity may affect the quality of our analysis. Besides, they are only 15 rows from roughly 10K rows 😁. In order to do that, dropna method is your freind.

store_data.dropna(
    subset="quantity",
    inplace=True        
    )
Enter fullscreen mode Exit fullscreen mode

Another case where we can drop the rows with null values, is when the category and item are both nulls! Those rows are basically meaningless. (if you can assume different and unique unit_price for each item, you can infer those columns. But this assumption is dangerous)

store_data.dropna(
    subset=["quantity", "item"],
    inplace=True
)
Enter fullscreen mode Exit fullscreen mode

If we check now for the columns with null values, we will find that now, only category and unit_price contain them.

store_data.isna().sum()

category      48
item           0
quantity       0
unit_price    15
dtype: int64
Enter fullscreen mode Exit fullscreen mode

If we assumed that each item has only one category, the missing values in the category column can be easily obtained.

categories = store_data[["category", "item"]].groupby(by=["category", "item"], as_index=False).count()
categories_dict = {row["item"]: row["category"] for _, row in categories.iterrows()}
categories_dict

{'apples': 'fruits',
 'bananas': 'fruits',
 'grapes': 'fruits',
 'oranges': 'fruits',
 'black-eyed peas': 'legumes',
 'chickpeas': 'legumes',
 'lentils': 'legumes',
 'navy beans': 'legumes',
 'broccoli': 'vegetables',
 'carrots': 'vegetables',
 'lettuce': 'vegetables',
 'potatos': 'vegetables'}
Enter fullscreen mode Exit fullscreen mode

Basically, we have created a "lookup dictionary" for the items' categories. Now we can use this dict with the versatile apply method as follows to get the missing categories.

store_data.category = store_data.item.apply(lambda x: categories_dict[x])
Enter fullscreen mode Exit fullscreen mode

The same technique can be used to get the missing unit_price values.

unit_prices = store_data[["item", "unit_price"]].groupby(by=["item", "unit_price"], as_index=False).count()
unit_prices_dict = {row["item"]: row["unit_price"] for _, row in unit_prices.iterrows()}
store_data.unit_price = store_data.item.apply(lambda x: unit_prices_dict[x])

store_data.isna().sum()

category      0
item          0
quantity      0
unit_price    0
Enter fullscreen mode Exit fullscreen mode

Yaaaay πŸ˜€! No missing values!

We can also change the data type of the columns. We can do that for the quantity column as we can change its type to Int because the quantity in our scenario can't be a fraction. In addition to that, we can enrich our data. Like we can add a total_amount column calculated as quantity x unit_price and a done_on column contains the date when this dataset was created that can be obtained from the csv file name.

store_data.quantity = store_data.quantity.astype(int)

store_data["total_amount"] = store_data.quantity * store_data.unit_price

file_date = re.search(r"\d+", file_path.stem).group()       # Regex to match all the digits in the file name without the extension (stem)
file_date = datetime.datetime.strptime(file_date, "%Y%m%d") # Parse a date object from the previouse match.

store_data["done_on"] = file_date
Enter fullscreen mode Exit fullscreen mode

And now, the analysis 🀩

First, the total revenue for each item category.

(
    store_data[["category", "total_amount"]]
    .groupby(by="category")
    .sum()
)
Enter fullscreen mode Exit fullscreen mode

Image description

Next, the top 3 purchased items

(
    store_data[["item", "quantity"]]
    .groupby(by="item")
    .sum()
    .sort_values(by="quantity", ascending=False)
    .head(3)
)
Enter fullscreen mode Exit fullscreen mode

Image description

Of course, any analysis is often presented using graphs and plots! but those tables will suffice for now as you get the idea πŸ˜‰.

Conclusion

     Data Wrangling is very important when dealing with any dataset. And it is very time-consuming too! There is a statistic that is circulating in the data field that data scientists spend 80% of their time just preparing the data 😯! Python and Pandas are doing a very good job in this domain but not without limitations (check out Apache Spark and see why it's used for big data processing instead of pandas)!
The scenario that I built this tutorial on, isn't the best real-life scenario you can face. But hey, it did the job πŸ˜‰.
Finally, you might have encountered here bits of code that looks more like incantations than a piece of code πŸ˜…. That's normal specially if you are just starting out with python. But I can't explain everything here! I highly encourage you to have a look at the accompanying notebook. It contains all the code in this tutorial with extra explanation. And also, to lookup anything that you didn't understand either in online forums or the documentations but don't let it pass! That's how you'll learn πŸ™‚.

Top comments (0)