Star Schema Benchmark for BigQuery and Apache Druid
This repository contains the scripts to generate the data for the Star Schema Benchmark and run the benchmark queries on BigQuery and Apache Druid.
Overview of benchmark preparation steps:
graph TD
AA[Setup GCP resources] --> A[Generate CSV files]
A --> B[Upload CSV files to GCS]
B --> C[Load CSV files into BigQuery tables]
C --> D[Flatten BigQuery tables]
D --> E[Export flattened table to GCS as Parquet]
E --> F[Ingest Parquet files to Druid]
Use direnv and a .envrc file to store sensitive data:
export GCP_PROJECT=
export GCP_PROJECT_ZONE=
export GCP_PROJECT_SUBNET=
export GCP_QUERY_PROJECT=
export GCP_BUCKET=
export GCP_VM=
export GCP_DATASET=
export GCP_DRUID_SA=
export GCP_DRUID_HMAC_KEY=
export GCP_DRUID_HMAC_SECRET=
export DRUID_ROUTER_URL=
export DRUID_BROKER_URL=
export DRUID_USER=
export DRUID_PASSWORD=
export DRUID_INPUT=
export DRUID_DATASOURCE=
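After editing the .envrc file, run direnv allow so the variables are loaded into your shell:
direnv allow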
Create GCS bucket:
gcloud storage buckets create $GCP_BUCKET \
--location=europe-west1 --default-storage-class=STANDARD --uniform-bucket-level-access
Add permissions on the bucket for the Druid ingestion service account:
gcloud storage buckets add-iam-policy-binding $GCP_BUCKET \
--member=$GCP_DRUID_SA --role=roles/storage.objectViewer
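Optionally verify the binding by listing the bucket's IAM policy:
gcloud storage buckets get-iam-policy $GCP_BUCKET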
Create BigQuery dataset:
bq mk --data_location=EU $GCP_DATASET
To speed up data loading to GCS, create a VM in the same region as the GCS bucket:
gcloud compute instances create $GCP_VM \
--zone=$GCP_PROJECT_ZONE \
--machine-type=n4-standard-2 \
--boot-disk-size=500GB \
--no-address \
--subnet=$GCP_PROJECT_SUBNET
Or start the instance if it already exists:
gcloud compute instances start $GCP_VM --zone=$GCP_PROJECT_ZONE
Check if you can connect to the VM:
gcloud compute ssh $GCP_VM --zone=$GCP_PROJECT_ZONE --internal-ip
Log in to the VM and install the dependencies:
sudo apt-get update
sudo apt install build-essential git
Check out the repository with the data generator:
git clone https://github.com/electrum/ssb-dbgen.git
cd ssb-dbgen
Apply the patch to generate dates in YYYY-MM-DD format:
git apply <<'EOF'
diff --git a/dss.h b/dss.h
index 8f78d89..dcbb750 100644
--- a/dss.h
+++ b/dss.h
@@ -517,7 +517,7 @@ int dbg_print(int dt, FILE *tgt, void *data, int len, int eol);
 #ifdef SSBM
 #define PR_DATE(tgt, yr, mn, dy) \
- sprintf(tgt, "19%02d%02d%02d", yr, mn, dy)
+ sprintf(tgt, "19%02d-%02d-%02d", yr, mn, dy)
 #else
 #ifdef MDY_DATE
 #define PR_DATE(tgt, yr, mn, dy) \
EOF
Compile the generator:
make
For initial tests, generate ~1GB of data:
./dbgen -f -s 1 -T a
Check how many records were generated with wc -l *.tbl:
30000 customer.tbl # 30k
6001171 lineorder.tbl # 6mln
200000 part.tbl # 200k
2000 supplier.tbl # 2k
6235727 total
This gives ~6 million records, enough to test the queries but not enough for realistic benchmarking.
For more realistic benchmarks, generate ~30GB of data. NOTE: generate each table separately, as the generator can crash.
./dbgen -f -s 30 -T l
./dbgen -f -s 30 -T c
./dbgen -f -s 30 -T p
./dbgen -f -s 30 -T s
Check the results with wc -l *.tbl:
900000 customer.tbl # 900k
179998372 lineorder.tbl # 180mln
1000000 part.tbl # 1mln
60000 supplier.tbl # 60k
181958372 total
Scenario for ~300GB of data:
./dbgen -f -s 300 -T l
./dbgen -f -s 300 -T c
./dbgen -f -s 300 -T p
./dbgen -f -s 300 -T s
Check the results with ls -lh *.tbl (wc -l is too slow):
831M customer.tbl
186G lineorder.tbl
149M part.tbl
50M supplier.tbl
Authenticate using your GCP account:
gcloud auth login
Upload the files to the GCS bucket. With ~200MiB/s of network throughput it takes ~90 seconds to upload 17.5GiB and ~15 minutes to upload 187GiB:
gcloud storage cp *.tbl $GCP_BUCKET
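Optionally verify the upload before loading:
gcloud storage ls -l $GCP_BUCKET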
Create the BigQuery dataset (skip if it was already created during setup):
bq mk --data_location=EU $GCP_DATASET
The customer table:
bq load \
--source_format=CSV \
--ignore_unknown_values=true \
--field_delimiter="|" \
--schema=schemas/customer.json \
$GCP_DATASET.customer \
$GCP_BUCKET/customer.tbl
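The load commands rely on the schema files in the schemas/ directory of this repository. For orientation only, a file like schemas/customer.json roughly follows the standard SSB customer columns; a minimal sketch (the file shipped in the repository is authoritative) might look like:
[
  {"name": "C_CUSTKEY", "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "C_NAME", "type": "STRING"},
  {"name": "C_ADDRESS", "type": "STRING"},
  {"name": "C_CITY", "type": "STRING"},
  {"name": "C_NATION", "type": "STRING"},
  {"name": "C_REGION", "type": "STRING"},
  {"name": "C_PHONE", "type": "STRING"},
  {"name": "C_MKTSEGMENT", "type": "STRING"}
]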
The supplier table:
bq load \
--source_format=CSV \
--ignore_unknown_values=true \
--field_delimiter="|" \
--schema=schemas/supplier.json \
$GCP_DATASET.supplier \
$GCP_BUCKET/supplier.tbl
The part table:
bq load \
--source_format=CSV \
--ignore_unknown_values=true \
--field_delimiter="|" \
--schema=schemas/part.json \
$GCP_DATASET.part \
$GCP_BUCKET/part.tbl
The lineorder table:
bq load \
--source_format=CSV \
--ignore_unknown_values=true \
--field_delimiter="|" \
--schema=schemas/lineorder.json \
--time_partitioning_type=YEAR \
--time_partitioning_field=LO_ORDERDATE \
$GCP_DATASET.lineorder \
$GCP_BUCKET/lineorder.tbl
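Yearly time partitioning requires LO_ORDERDATE to be declared as a DATE (or TIMESTAMP/DATETIME) column, which is why the generator was patched to emit YYYY-MM-DD values; the corresponding entry in schemas/lineorder.json presumably looks something like:
{"name": "LO_ORDERDATE", "type": "DATE", "mode": "REQUIRED"}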
Check if the tables are loaded correctly:
bq --dataset_id=$GCP_DATASET query --use_legacy_sql=false \
"SELECT COUNT(*) FROM customer"
bq --dataset_id=$GCP_DATASET query --use_legacy_sql=false \
"SELECT COUNT(*) FROM supplier"
bq --dataset_id=$GCP_DATASET query --use_legacy_sql=false \
"SELECT COUNT(*) FROM part"
bq --dataset_id=$GCP_DATASET query --use_legacy_sql=false \
"SELECT COUNT(*) FROM lineorder"
Flatten the star schema into a single lineorder_flat table:
bq --dataset_id=$GCP_DATASET query --use_legacy_sql=false "$(cat schemas/lineorder_flat.sql)"
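The actual statement lives in schemas/lineorder_flat.sql; conceptually it denormalizes lineorder by joining in the dimension columns and keeps the yearly partitioning, roughly along these lines (the column selection here is illustrative and assumes standard SSB names):
CREATE OR REPLACE TABLE lineorder_flat
PARTITION BY DATE_TRUNC(LO_ORDERDATE, YEAR) AS
SELECT
  l.*,
  c.C_CITY, c.C_NATION, c.C_REGION,
  s.S_CITY, s.S_NATION, s.S_REGION,
  p.P_MFGR, p.P_CATEGORY, p.P_BRAND1
FROM lineorder AS l
JOIN customer AS c ON l.LO_CUSTKEY = c.C_CUSTKEY
JOIN supplier AS s ON l.LO_SUPPKEY = s.S_SUPPKEY
JOIN part AS p ON l.LO_PARTKEY = p.P_PARTKEY;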
Check if the table is flattened correctly:
bq --dataset_id=$GCP_DATASET query --use_legacy_sql=false \
"SELECT COUNT(*) FROM lineorder_flat"
Extract the flattened table to GCS as Parquet, one year partition at a time:
YEARS=(1992 1993 1994 1995 1996 1997 1998)
for YEAR in ${YEARS[@]}; do
bq extract \
--destination_format=PARQUET \
--compression=SNAPPY \
$GCP_DATASET.lineorder_flat\$$YEAR \
$GCP_BUCKET/lineorder_flat/$YEAR/\*.parquet
done
Check the export; for 180 million records it should be ~19GiB, for 1.8 billion records ~220GiB:
gcloud storage du -s -r $GCP_BUCKET/lineorder_flat
Prevent Druid from dropping the data right after ingestion due to the default cluster retention policy by setting a loadForever rule for the datasource:
curl --location --request POST "$DRUID_ROUTER_URL/druid/coordinator/v1/rules/$DRUID_DATASOURCE" \
--header 'Content-Type: application/json' \
--user "$DRUID_USER:$DRUID_PASSWORD" \
--data-raw '[{"type": "loadForever"}]'
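Read the rules back to confirm they were set:
curl "$DRUID_ROUTER_URL/druid/coordinator/v1/rules/$DRUID_DATASOURCE" \
  --user "$DRUID_USER:$DRUID_PASSWORD"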
Start the Druid native batch ingestion; it can take a few hours to ingest the data:
YEARS=(1992 1993 1994 1995 1996 1997 1998)
for YEAR in ${YEARS[@]}; do
export YEAR
envsubst < ingestion/0_baseline.json | \
curl "$DRUID_ROUTER_URL/druid/indexer/v1/task" \
--header 'Content-Type: application/json' \
--user "$DRUID_USER:$DRUID_PASSWORD" \
--data-binary "@-"
done
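Progress can be monitored in the Druid web console or through the task API, for example:
curl "$DRUID_ROUTER_URL/druid/indexer/v1/tasks?state=running" \
  --user "$DRUID_USER:$DRUID_PASSWORD"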
After ingestion, the Druid router console shows the new datasource.
Run all benchmarks:
sbt jmh:run
Run Druid benchmarks:
sbt 'jmh:run ssb.DruidSerialBenchmark'
sbt 'jmh:run ssb.DruidConcurrentBenchmark'
Run the supplementary Druid benchmarks:
sbt 'jmh:run ssb.DruidCountDistinctBenchmark'
Run BigQuery benchmarks:
sbt 'jmh:run ssb.BigQuerySerialBenchmark'
sbt 'jmh:run ssb.BigQueryConcurrentBenchmark'
Run the supplementary BigQuery benchmarks:
sbt 'jmh:run ssb.BigQueryJoinsBenchmark'
sbt 'jmh:run ssb.BigQueryCountDistinctBenchmark'
Run the BigQuery benchmarks using short queries:
export QUERY_PREVIEW_ENABLED=true
sbt ...
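For example, with the variable exported, rerun one of the BigQuery benchmarks listed above:
sbt 'jmh:run ssb.BigQuerySerialBenchmark'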
Stop the VM:
gcloud compute instances stop $GCP_VM --zone=$GCP_PROJECT_ZONE
Delete the VM when it is no longer needed:
gcloud compute instances delete $GCP_VM --zone=$GCP_PROJECT_ZONE