Introduction | Requirements | Motivation | Hello World | Usage | Interoperability | Installation | Project Structure | Design Principles | Scripts | Contributing | Credits | More Info | License
meza is a Python library for reading and processing tabular data. It has a functional programming style API, excels at reading/writing large files, and can process 10+ file types.
With meza, you can
- Read csv/xls/xlsx/mdb/dbf files, and more!
- Type cast records (date, float, text...)
- Process Uñicôdë text
- Lazily stream files by default (see the snippet below)
- and much more...
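For example, because readers stream lazily, opening even a very large file is cheap; rows are only pulled from disk as you iterate. A minimal sketch (assuming the `data.csv` created in the Hello World section below):

>>> from meza import io

>>> records = io.read_csv('data.csv')  # nothing has been read from disk yet
>>> row = next(records)  # rows are only read on demand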
Requirements
meza has been tested and is known to work on Python 3.7, 3.8, and 3.9, as well as PyPy3.7.
meza also has the following optional dependencies:

| Function | Dependency | Installation | File type / extension |
| --- | --- | --- | --- |
| meza.io.read_mdb | mdbtools | `sudo port install mdbtools` | Microsoft Access / mdb |
| meza.io.read_html | lxml [1] | `pip install lxml` | HTML / html |
| meza.convert.records2array | NumPy [2] | `pip install numpy` | n/a |
| meza.convert.records2df | pandas | `pip install pandas` | n/a |

[1] If lxml isn't present, read_html will default to the built-in Python html reader.
[2] records2array can be used without numpy by passing `native=True` in the function call. This will convert records into a list of native `array.array` objects.
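As a sketch of the fallback described in note [2] (the argument order is an assumption based on the `detect_types` workflow shown in Hello World below; check your version's signature):

>>> from meza import io, process as pr, convert as cv

>>> records, result = pr.detect_types(io.read_csv('data.csv'))
>>> # per note [2], `native=True` avoids numpy and returns a list of
>>> # native `array.array` objects instead
>>> arrays = cv.records2array(records, result['types'], native=True)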
Motivation
pandas is great, but installing it isn't exactly a walk in the park, and it doesn't play nice with PyPy. I designed meza to be a lightweight, easy-to-install, less featureful alternative to pandas. I also optimized meza for low memory usage, PyPy compatibility, and functional programming best practices.
meza provides a number of benefits / differences from similar libraries such as pandas. Namely:
- a functional programming (instead of object oriented) API
- iterators by default (reading/writing)
- PyPy compatibility
- geojson support (reading/writing)
- seamless integration with sqlalchemy (and other libs that work with iterators of dicts); see the sketch below
For more detailed information, please check out the FAQ.
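For instance, the sqlalchemy integration might look like the following sketch (the in-memory engine, the `people` table definition, and `file1.csv` from the Text processing section below are all assumptions; the only meza call is `io.read_csv`):

>>> from sqlalchemy import Column, MetaData, String, Table, create_engine
>>> from meza import io

>>> engine = create_engine('sqlite:///:memory:')
>>> metadata = MetaData()
>>> people = Table(
...     'people', metadata,
...     Column('col_1', String),
...     Column('col_2', String),
...     Column('col_3', String))
>>> metadata.create_all(engine)

# since meza readers yield plain dicts, the records can be passed
# straight to sqlalchemy's executemany-style insert
>>> with engine.begin() as conn:
...     result = conn.execute(people.insert(), list(io.read_csv('file1.csv')))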
Hello World
A simple data processing example is shown below:
First create a simple csv file (in bash)
printf 'col1,col2,col3\nhello,5/4/82,1\none,1/1/15,2\nhappy,7/1/92,3\n' > data.csv
Now we can read the file, manipulate the data a bit, and write the manipulated data back to a new file.
>>> from meza import io, process as pr, convert as cv
>>> from io import open
>>> # Load the csv file
>>> records = io.read_csv('data.csv')
>>> # `records` is an iterator over the rows
>>> row = next(records)
>>> row
{'col1': 'hello', 'col2': '5/4/82', 'col3': '1'}
>>> # Let's replace the first row so as not to lose any data
>>> records = pr.prepend(records, row)
# Guess column types. Note: `detect_types` returns a new `records`
# generator since it consumes rows during type detection
>>> records, result = pr.detect_types(records)
>>> {t['id']: t['type'] for t in result['types']}
{'col1': 'text', 'col2': 'date', 'col3': 'int'}
# Now type cast the records. Note: most `meza.process` functions return
# generators, so lets wrap the result in a list to view the data
>>> casted = list(pr.type_cast(records, result['types']))
>>> casted[0]
{'col1': 'hello', 'col2': datetime.date(1982, 5, 4), 'col3': 1}
# Cut out the first column of data and merge the rows to get the max value
# of the remaining columns. Note: since `merge` (by definition) will always
# contain just one row, it is returned as is (not wrapped in a generator)
>>> cut_recs = pr.cut(casted, ['col1'], exclude=True)
>>> merged = pr.merge(cut_recs, pred=bool, op=max)
>>> merged
{'col2': datetime.date(2015, 1, 1), 'col3': 3}
# Now write merged data back to a new csv file.
>>> io.write('out.csv', cv.records2csv(merged))
# View the result
>>> with open('out.csv', encoding='utf-8') as f:
... f.read()
'col2,col3\n2015-01-01,3\n'
Usage
meza is intended to be used directly as a Python library.
meza can read both filepaths and file-like objects. Additionally, all readers return equivalent records iterators, i.e., a generator of dictionaries with keys corresponding to the column names.
>>> from io import open, StringIO
>>> from meza import io
"""Read a filepath"""
>>> records = io.read_json('path/to/file.json')
"""Read a file like object and de-duplicate the header"""
>>> f = StringIO('col,col\nhello,world\n')
>>> records = io.read_csv(f, dedupe=True)
"""View the first row"""
>>> next(records)
{'col': 'hello', 'col_2': 'world'}
"""Read the 1st sheet of an xls file object opened in text mode."""
# Also, sanitize the header names by converting them to lowercase and
# replacing whitespace and invalid characters with `_`.
>>> with open('path/to/file.xls', encoding='utf-8') as f:
... for row in io.read_xls(f, sanitize=True):
... # do something with the `row`
... pass
"""Read the 2nd sheet of an xlsx file object opened in binary mode"""
# Note: sheets are zero indexed
>>> with open('path/to/file.xlsx', 'rb') as f:
... records = io.read_xls(f, encoding='utf-8', sheet=1)
... first_row = next(records)
... # do something with the `first_row`
"""Read any recognized file"""
>>> records = io.read('path/to/file.geojson')
>>> f.seek(0)  # rewind the `StringIO` object from above
>>> records = io.read(f, ext='csv', dedupe=True)
Please see readers for a complete list of available readers and recognized file types.
Numerical analysis (à la pandas) [3]
In the following example, pandas equivalent methods are preceded by `-->`.
>>> import itertools as it
>>> import random
>>> from io import StringIO
>>> from meza import io, process as pr, convert as cv, stats
# Create some data in the same structure as what the various `read...`
# functions output
>>> header = ['A', 'B', 'C', 'D']
>>> data = [(random.random() for _ in range(4)) for x in range(7)]
>>> df = [dict(zip(header, d)) for d in data]
>>> df[0]
{'A': 0.53908..., 'B': 0.28919..., 'C': 0.03003..., 'D': 0.65363...}
"""Sort records by the value of column `B` --> df.sort_values(by='B')"""
>>> next(pr.sort(df, 'B'))
{'A': 0.53520..., 'B': 0.06763..., 'C': 0.02351..., 'D': 0.80529...}
"""Select column `A` --> df['A']"""
>>> next(pr.cut(df, ['A']))
{'A': 0.53908170489952006}
"""Select the first three rows of data --> df[0:3]"""
>>> len(list(it.islice(df, 3)))
3
"""Select all data whose value for column `A` is less than 0.5
--> df[df.A < 0.5]
"""
>>> next(pr.tfilter(df, 'A', lambda x: x < 0.5))
{'A': 0.21000..., 'B': 0.25727..., 'C': 0.39719..., 'D': 0.64157...}
# Note: since `aggregate` and `merge` (by definition) return just one row,
# they return them as is (not wrapped in a generator).
"""Calculate the mean of column `A` across all data --> df.mean()['A']"""
>>> pr.aggregate(df, 'A', stats.mean)['A']
0.5410437473067938
"""Calculate the sum of each column across all data --> df.sum()"""
>>> pr.merge(df, pred=bool, op=sum)
{'A': 3.78730..., 'C': 2.82875..., 'B': 3.14195..., 'D': 5.26330...}
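Finally, if pandas is installed, `records2df` (from the Requirements table above) converts the same records into a DataFrame. A minimal sketch (if your version also requires the detected types, pass them as in note [2]):

>>> from meza import convert as cv

>>> dataframe = cv.records2df(df)  # a pandas DataFrame built from the records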
Text processing (à la csvkit) [4]
In the following example, csvkit equivalent commands are preceded by `-->`.
First create a few simple csv files (in bash)
printf 'col_1,col_2,col_3\n1,dill,male\n2,bob,male\n3,jane,female\n' > file1.csv
printf 'col_1,col_2,col_3\n4,tom,male\n5,dick,male\n6,jill,female\n' > file2.csv
Now we can read the files, manipulate the data, convert the manipulated data to json, and write the json back to a new file. Also, note that since all readers return equivalent records iterators, you can use them interchangeably (in place of read_csv) to open any supported file, e.g., read_xls, read_sqlite, etc.
>>> import itertools as it
>>> from meza import io, process as pr, convert as cv
"""Combine the files into one iterator
--> csvstack file1.csv file2.csv
"""
>>> records = io.join('file1.csv', 'file2.csv')
>>> next(records)
{'col_1': '1', 'col_2': 'dill', 'col_3': 'male'}
>>> next(it.islice(records, 4, None))
{'col_1': '6', 'col_2': 'jill', 'col_3': 'female'}
# Now let's create a persistent records list
>>> records = list(io.read_csv('file1.csv'))
"""Sort records by the value of column `col_2`
--> csvsort -c col_2 file1.csv
"""
>>> next(pr.sort(records, 'col_2'))
{'col_1': '2', 'col_2': 'bob', 'col_3': 'male'}
"""Select column `col_2` --> csvcut -c col_2 file1.csv"""
>>> next(pr.cut(records, ['col_2']))
{'col_2': 'dill'}
"""Select all data whose value for column `col_2` contains `jan`
--> csvgrep -c col_2 -m jan file1.csv
"""
>>> next(pr.grep(records, [{'pattern': 'jan'}], ['col_2']))
{'col_1': '3', 'col_2': 'jane', 'col_3': 'female'}
"""Convert a csv file to json --> csvjson -i 4 file1.csv"""
>>> io.write('file.json', cv.records2json(records))
# View the result
>>> with open('file.json', encoding='utf-8') as f:
... f.read()
'[{"col_1": "1", "col_2": "dill", "col_3": "male"}, {"col_1": "2",
"col_2": "bob", "col_3": "male"}, {"col_1": "3", "col_2": "jane",
"col_3": "female"}]'
Geo processing (à la mapbox) [5]
In the following example, mapbox equivalent commands are preceded by `-->`.
First create a geojson file (in bash)
echo '{"type": "FeatureCollection","features": [' > file.geojson
echo '{"type": "Feature", "id": 11, "geometry": {"type": "Point", "coordinates": [10, 20]}},' >> file.geojson
echo '{"type": "Feature", "id": 12, "geometry": {"type": "Point", "coordinates": [5, 15]}}]}' >> file.geojson
Now we can open the file, split the data by id, and finally convert the split data to a new geojson file-like object.
>>> from meza import io, process as pr, convert as cv
# Load the geojson file and peek at the results
>>> records, peek = pr.peek(io.read_geojson('file.geojson'))
>>> peek[0]
{'lat': 20, 'type': 'Point', 'lon': 10, 'id': 11}
"""Split the records by feature ``id`` and select the first feature
--> geojsplit -k id file.geojson
"""
>>> splits = pr.split(records, 'id')
>>> feature_records, name = next(splits)
>>> name
11
"""Convert the feature records into a GeoJSON file-like object"""
>>> geojson = cv.records2geojson(feature_records)
>>> geojson.readline()
'{"type": "FeatureCollection", "bbox": [10, 20, 10, 20], "features": '
'[{"type": "Feature", "id": 11, "geometry": {"type": "Point", '
'"coordinates": [10, 20]}, "properties": {"id": 11}}], "crs": {"type": '
'"name", "properties": {"name": "urn:ogc:def:crs:OGC:1.3:CRS84"}}}'
# Note: you can also write back to a file as shown previously
# io.write('file.geojson', geojson)