This repository showcases examples of structured data published to schema-backed topics, instantly accessible as Apache Iceberg tables.
Prerequisites:
- docker, using compose.yaml which runs tansu, MinIO and an Apache Iceberg REST Catalog
- just, a handy way to save and run project-specific commands
- uv, an extremely fast Python package and project manager used to run the pyiceberg examples
justfile contains recipes to run MinIO, create the buckets, and run the Apache Iceberg REST catalog with Tansu.
Once you have the prerequisites installed, clone this repository and start everything up with:
git clone git@github.com:tansu-io/example-pyiceberg.git
cd example-pyiceberg
just up
Should result in:
âś” Network example-pyiceberg_default
âś” Volume "example-pyiceberg_minio"
âś” Container example-pyiceberg-minio-1
mc: Configuration written to `/tmp/.mc/config.json`. Please update your access credentials.
mc: Successfully created `/tmp/.mc/share`.
mc: Initialized share uploads `/tmp/.mc/share/uploads.json` file.
mc: Initialized share downloads `/tmp/.mc/share/downloads.json` file.
The cluster 'local' is ready
Added `local` successfully.
Bucket created successfully `local/tansu`.
Bucket created successfully `local/lake`.
âś” Container example-pyiceberg-minio-1
âś” Container example-pyiceberg-iceberg-catalog-1
âś” Container example-pyiceberg-tansu-1 Healthy
Done! You can now run the examples.
Employee is a protocol buffer backed topic, with the following schema employee.proto:
syntax = 'proto3';
message Key {
int32 id = 1;
}
message Value {
string name = 1;
string email = 2;
}
Sample employee data is in employees.json:
[
{
"key": { "id": 12321 },
"value": { "name": "Bob", "email": "bob@example.com" }
},
{
"key": { "id": 32123 },
"value": { "name": "Alice", "email": "alice@example.com" }
}
]
Create the employee topic:
just employee-topic-create
Publish the sample data onto the employee topic:
just employee-produce
View the data in pyiceberg
:
just employee-table-scan
Giving the following output:
s3://lake/tansu/employee
table {
1: id: optional int
2: name: optional string
3: email: optional string
}
pyarrow.Table
id: int32
name: large_string
email: large_string
----
id: [[12321,32123]]
name: [["Bob","Alice"]]
email: [["bob@example.com","alice@example.com"]]
Grade is a JSON schema backed topic, with the following schema grade.json:
{
"type": "record",
"name": "Grade",
"fields": [
{ "name": "key", "type": "string", "pattern": "^\\d{3}-\\d{2}-\\d{4}$" },
{
"name": "value",
"type": {
"type": "record",
"fields": [
{ "name": "first", "type": "string" },
{ "name": "last", "type": "string" },
{ "name": "test1", "type": "double" },
{ "name": "test2", "type": "double" },
{ "name": "test3", "type": "double" },
{ "name": "test4", "type": "double" },
{ "name": "final", "type": "double" },
{ "name": "grade", "type": "string" }
]
}
}
]
}
Sample grade data is in: grades.json:
[
{
"key": "123-45-6789",
"value": {
"lastName": "Alfalfa",
"firstName": "Aloysius",
"test1": 40.0,
"test2": 90.0,
"test3": 100.0,
"test4": 83.0,
"final": 49.0,
"grade": "D-"
}
},
...
]
Create the grade topic:
just grade-topic-create
Publish the sample data onto the grade topic:
just grade-produce
View the data in pyiceberg
:
just grade-table-scan
Giving the following output:
s3://lake/tansu/grade
table {
1: key: optional string
2: value: optional struct<3: final: optional double, 4: first: optional string, 5: grade: optional string, 6: last: optional string, 7: test1: optional double, 8: test2: optional double, 9: test3: optional double, 10: test4: optional double>
}
pyarrow.Table
key: large_string
value: struct<final: double, first: large_string, grade: large_string, last: large_string, test1: double, test2: double, test3: double, test4: double>
child 0, final: double
child 1, first: large_string
child 2, grade: large_string
child 3, last: large_string
child 4, test1: double
child 5, test2: double
child 6, test3: double
child 7, test4: double
----
key: [["123-45-6789","123-12-1234","567-89-0123","087-65-4321","456-78-9012",...,"087-75-4321","456-71-9012","234-56-2890","345-67-3901","632-79-9439"]]
value: [
-- is_valid: all not null
-- child 0 type: double
[49,48,44,47,45,...,45,77,90,4,40]
-- child 1 type: large_string
["Aloysius","University","Gramma","Electric","Fred",...,"Jim","Ima","Benny","Boy","Harvey"]
-- child 2 type: large_string
["D-","D+","C","B-","A-",...,"C+","B-","B-","B","C"]
-- child 3 type: large_string
["Alfalfa","Alfred","Gerty","Android","Bumpkin",...,"Dandy","Elephant","Franklin","George","Heffalump"]
-- child 4 type: double
[40,41,41,42,43,...,47,45,50,40,30]
-- child 5 type: double
[90,97,80,23,78,...,1,1,1,1,1]
-- child 6 type: double
[100,96,60,36,88,...,23,78,90,11,20]
-- child 7 type: double
[83,97,40,45,77,...,36,88,80,-1,30]]
Observation is an Avro backed topic, with the following schema observation.avsc:
{
"type": "record",
"name": "observation",
"fields": [
{ "name": "key", "type": "string", "logicalType": "uuid" },
{
"name": "value",
"type": "record",
"fields": [
{ "name": "amount", "type": "double" },
{ "name": "unit", "type": "enum", "symbols": ["CELSIUS", "MILLIBAR"] }
]
}
]
}
Sample observation data, is in: observations.json:
[
{
"key": "1E44D9C2-5E7A-443B-BF10-2B1E5FD72F15",
"value": { "amount": 23.2, "unit": "CELSIUS" }
},
...
]
Create the observation topic:
just observation-topic-create
Publish the sample data onto the observation topic:
just observation-produce
View the data in pyiceberg
:
just observation-table-scan
Giving the following output:
s3://lake/tansu/observation
table {
1: key: optional string
2: value: optional struct<3: amount: optional double, 4: unit: optional string>
}
pyarrow.Table
key: large_string
value: struct<amount: double, unit: large_string>
child 0, amount: double
child 1, unit: large_string
----
key: [["1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15","1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15","1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15","1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15","1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15","1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15","1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15","1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15","1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15","1e44d9c2-5e7a-443b-bf10-2b1e5fd72f15"]]
value: [
-- is_valid: all not null
-- child 0 type: double
[23.2,1027,22.8,1023,22.5,1018,23.1,1020,23.4,1025]
-- child 1 type: large_string
["CELSIUS","MILLIBAR","CELSIUS","MILLIBAR","CELSIUS","MILLIBAR","CELSIUS","MILLIBAR","CELSIUS","MILLIBAR"]]
Person is a JSON schema backed topic, with the following schema person.json:
{
"title": "Person",
"type": "object",
"properties": {
"key": {
"type": "string",
"pattern": "^\\d{3}-\\d{2}-\\d{4}$"
},
"value": {
"type": "object",
"properties": {
"firstName": {
"type": "string",
"description": "The person's first name."
},
"lastName": {
"type": "string",
"description": "The person's last name."
},
"age": {
"description": "Age in years which must be equal to or greater than zero.",
"type": "integer",
"minimum": 0
}
}
}
}
}
Sample person data, is in persons.json:
[
{
"key": "123-45-6789",
"value": { "lastName": "Alfalfa", "firstName": "Aloysius", "age": 21 }
},
...
]
Create the person topic:
just person-topic-create
Publish the sample data onto the person topic:
just person-produce
View the data in pyiceberg
:
just person-table-scan
Giving the following output:
s3://lake/tansu/person
table {
1: key: optional string
2: value: optional struct<3: age: optional long, 4: firstName: optional string, 5: lastName: optional string>
}
pyarrow.Table
key: large_string
value: struct<age: int64, firstName: large_string, lastName: large_string>
child 0, age: int64
child 1, firstName: large_string
child 2, lastName: large_string
----
key: [["123-45-6789","123-12-1234","567-89-0123","087-65-4321","456-78-9012",...,"087-75-4321","456-71-9012","234-56-2890","345-67-3901","632-79-9439"]]
value: [
-- is_valid: all not null
-- child 0 type: int64
[21,52,35,23,72,...,56,45,54,91,17]
-- child 1 type: large_string
["Aloysius","University","Gamma","Electric","Fred",...,"Jim","Ima","Benny","Boy","Harvey"]
-- child 2 type: large_string
["Alfalfa","Alfred","Gerty","Android","Bumpkin",...,"Dandy","Elephant","Franklin","George","Heffalump"]]
Search is a protocol buffer backedd topic, with the following schema search.proto:
syntax = 'proto3';
enum Corpus {
CORPUS_UNSPECIFIED = 0;
CORPUS_UNIVERSAL = 1;
CORPUS_WEB = 2;
CORPUS_IMAGES = 3;
CORPUS_LOCAL = 4;
CORPUS_NEWS = 5;
CORPUS_PRODUCTS = 6;
CORPUS_VIDEO = 7;
}
message Value {
string query = 1;
776F
int32 page_number = 2;
int32 results_per_page = 3;
Corpus corpus = 4;
}
Sample search data, is in searches.json:
[
{
"value": {
"query": "abc/def",
"page_number": 6,
"results_per_page": 13,
"corpus": "CORPUS_WEB"
}
}
]
Create the search topic:
just search-topic-create
Publish the sample data onto the search topic:
just search-produce
View the data in pyiceberg
:
just search-table-scan
Giving the following output:
s3://lake/tansu/search
table {
1: query: optional string
2: page_number: optional int
3: results_per_page: optional int
4: corpus: optional int
}
pyarrow.Table
query: large_string
page_number: int32
results_per_page: int32
corpus: int32
----
query: [["abc/def"]]
page_number: [[6]]
results_per_page: [[13]]
corpus: [[2]]
Taxi is a protocol buffer backed topic, with the following schema taxi.proto:
syntax = 'proto3';
enum Flag {
N = 0;
Y = 1;
}
message Value {
int64 vendor_id = 1;
int64 trip_id = 2;
float trip_distance = 3;
double fare_amount = 4;
Flag store_and_fwd = 5;
}
Sample trip data, is in trips.json:
[
{
"value": {
"vendor_id": 1,
"trip_id": 1000371,
"trip_distance": 1.8,
"fare_amount": 15.32,
"store_and_fwd": "N"
}
},
...
]
Create the taxi topic:
just taxi-topic-create
Publish the sample data onto the taxi topic:
just taxi-produce
View the data in pyiceberg
:
just taxi-table-scan
Giving the following output:
s3://lake/tansu/taxi
table {
1: vendor_id: optional long
2: trip_id: optional long
3: trip_distance: optional float
4: fare_amount: optional double
5: store_and_fwd: optional int
}
pyarrow.Table
vendor_id: int64
trip_id: int64
trip_distance: float
fare_amount: double
store_and_fwd: int32
----
vendor_id: [[1,2,2,1]]
trip_id: [[1000371,1000372,1000373,1000374]]
trip_distance: [[1.8,2.5,0.9,8.4]]
fare_amount: [[15.32,22.15,9.01,42.13]]
store_and_fwd: [[0,0,0,1]]