Description
Purpose:
As a data scientist, I want a simple check that gives me an overview of the structure of my data.
Inputs:
A single Dataset
Output:
Tables or graphs containing:
- General information about the Dataset
- Metadata about special Dataset columns that are defined (label column name, index column, date column)
- For each column:
  - Identified type
  - Statistics of the data (see reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)
    - For numerical: count, mean, min, 10%, 25%, median, 75%, 90%, max
    - For categorical: count, unique, top, freq
  - Histogram
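
To make the per-column statistics concrete, here is a minimal sketch of how they could be collected with pandas `describe`. The function name `column_overview`, the returned dictionary layout, and the rule that every non-numeric column counts as categorical are assumptions made for illustration, not part of this spec. Note that `describe` reports the median as `50%` and also returns `std`, so the sketch renames the former and drops the latter to match the list above.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

# Percentiles requested above; describe() labels 0.50 as "50%".
PERCENTILES = [0.10, 0.25, 0.50, 0.75, 0.90]
NUMERIC_STATS = ["count", "mean", "min", "10%", "25%", "median", "75%", "90%", "max"]
CATEGORICAL_STATS = ["count", "unique", "top", "freq"]


def column_overview(df: pd.DataFrame) -> dict:
    """Hypothetical helper: identified type and summary statistics per column."""
    overview = {}
    for name in df.columns:
        col = df[name]
        # Assumption: booleans are treated as categorical, not numerical.
        if is_numeric_dtype(col) and not is_bool_dtype(col):
            stats = col.describe(percentiles=PERCENTILES).rename({"50%": "median"})
            overview[name] = {
                "type": "numerical",
                "stats": stats[NUMERIC_STATS].to_dict(),  # drops "std"
            }
        else:
            # Casting to object makes describe() return count/unique/top/freq.
            stats = col.astype("object").describe()
            overview[name] = {
                "type": "categorical",
                "stats": stats[CATEGORICAL_STATS].to_dict(),
            }
    return overview


if __name__ == "__main__":
    demo = pd.DataFrame({"age": [1, 2, 3, 4, 5], "city": ["a", "b", "a", "c", "a"]})
    for column, info in column_overview(demo).items():
        print(column, info["type"], info["stats"])
```

Histograms and the Dataset-level metadata (label, index, and date columns) are left out of this sketch because their source depends on how the Dataset object exposes them.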
Requirements:
Dataset Info check should run quickly
It should remain fast and not clutter the display even with a large number of features (e.g. 200)
Output should be easily digestible
Check Category:
Overview