The Blaze accelerator for Apache Spark leverages native vectorized execution to accelerate query processing. It combines the power of the Apache Arrow-DataFusion library and the scale of the Spark distributed computing framework.
Blaze takes a fully optimized physical plan from Spark, maps it into DataFusion's execution plan, and performs native plan computation in Spark executors.
Blaze is composed of the following high-level components:
- Spark Extension: hooks the whole accelerator into Spark's execution lifecycle.
- Spark Shims: specialized code for different versions of Spark.
- Native Engine: implements the native engine in Rust, including:
  - ExecutionPlan protobuf specification
  - JNI gateway
  - Customized operators, expressions, functions
Based on the inherent well-defined extensibility of DataFusion, Blaze can be easily extended to support:
- Various object stores.
- Operators.
- Simple and Aggregate functions.
- File formats.
We encourage you to extend DataFusion's capabilities directly and add support in Blaze with simple modifications to plan-serde and extension translation.
To build Blaze, please follow the steps below:
- Install Rust
The native execution library is written in Rust, so you need to install Rust (nightly) before compiling. We recommend using rustup.
- Install JDK+Maven
Blaze has been well tested on JDK 8 and Maven 3.5, and should work fine with higher versions.
- Check out the source code.
git clone git@github.com:blaze-init/blaze.git
cd blaze
- Build the project.
Specify the shims package for the Spark version you would like to run on. Currently, the following shims are supported:
- spark303 - for Spark 3.0.x
- spark313 - for Spark 3.1.x
- spark324 - for Spark 3.2.x
- spark333 - for Spark 3.3.x
- spark351 - for Spark 3.5.x
You can build Blaze either in dev mode (the pre profile) for debugging or in release mode to unlock its full potential.
SHIM=spark333 # or spark303/spark313/spark324/spark351
MODE=release # or pre
mvn package -P"${SHIM}" -P"${MODE}"
After the build is finished, a fat jar package that contains all the dependencies will be generated in the target directory.
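As a minimal sketch of the build steps above, the following shell helpers check that the required tools are on the PATH and reject a shim or mode that is not listed in this document before composing the Maven command. The function names (`is_member`, `build_command`, `missing_tools`) and the default of `release` when no mode is given are assumptions made for illustration; they are not part of the Blaze repository.

```shell
#!/bin/sh
# Illustrative pre-build helpers (hypothetical, not shipped with Blaze).
# Shim and mode names are copied from the supported list in this document.
SUPPORTED_SHIMS="spark303 spark313 spark324 spark333 spark351"
SUPPORTED_MODES="release pre"

# Succeeds if $1 is one of the whitespace-separated words in $2.
is_member() {
  for candidate in $2; do
    [ "$candidate" = "$1" ] && return 0
  done
  return 1
}

# Prints the names of the given tools that are missing from PATH.
# Prints nothing when every tool is available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Prints the Maven command for a shim/mode pair.
# Fails with a message on stderr for an unknown shim or mode.
# Defaulting the mode to "release" is an assumption of this sketch.
build_command() {
  shim=$1
  mode=${2:-release}
  if ! is_member "$shim" "$SUPPORTED_SHIMS"; then
    echo "unsupported shim: $shim (expected one of: $SUPPORTED_SHIMS)" >&2
    return 1
  fi
  if ! is_member "$mode" "$SUPPORTED_MODES"; then
    echo "unsupported mode: $mode (expected one of: $SUPPORTED_MODES)" >&2
    return 1
  fi
  echo "mvn package -P$shim -P$mode"
}

# Example usage:
#   missing_tools cargo java mvn      # lists anything still to install
#   build_command spark333 release    # prints the mvn command to run
```

Printing the command instead of running it keeps the check free of side effects, so a typo in the shim name is caught before a long native build starts.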
You can use the following command to build a CentOS 7-compatible release:
SHIM=spark333 MODE=release ./release-docker.sh
This section describes how to submit and configure a Spark Job with Blaze support.
- Move the Blaze jar package to the Spark client classpath (normally spark-xx.xx.xx/jars/).
- Add the following configurations to the Spark configuration in spark-xx.xx.xx/conf/spark-defaults.conf:
spark.blaze.enable true
spark.sql.extensions org.apache.spark.sql.blaze.BlazeSparkSessionExtension
spark.shuffle.manager org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager
spark.memory.offHeap.enabled false
# suggested executor memory configuration
spark.executor.memory 4g
spark.executor.memoryOverhead 4096
- Submit a query with spark-sql, or other tools like spark-thriftserver:
spark-sql -f tpcds/q01.sql
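The three steps above can also be scripted. The sketch below is one possible approach, not something provided by Blaze: `install_jar` copies a built jar into a Spark installation's jars/ directory (copying rather than moving, so the build output stays in place), and `conf_to_flags` turns "key value" lines, in the same layout as the settings listed above, into `--conf key=value` flags for a single run instead of editing the shared configuration file. Both function names, and the file names in the usage comments, are hypothetical. The conversion assumes that configuration values contain no whitespace.

```shell
#!/bin/sh
# Illustrative helpers for running a query with Blaze (hypothetical names).

# Copies a jar ($1) into the jars/ directory of a Spark installation ($2).
install_jar() {
  jar=$1
  spark_home=$2
  if [ ! -f "$jar" ]; then
    echo "jar not found: $jar" >&2
    return 1
  fi
  if [ ! -d "$spark_home/jars" ]; then
    echo "no jars/ directory under: $spark_home" >&2
    return 1
  fi
  cp "$jar" "$spark_home/jars/"
}

# Reads "key value" lines on stdin and prints one "--conf key=value" per line.
# Blank lines and lines whose first word starts with '#' are skipped.
conf_to_flags() {
  while read -r key value || [ -n "$key" ]; do
    case "$key" in
      ''|'#'*) continue ;;
    esac
    printf '%s %s=%s\n' --conf "$key" "$value"
  done
}

# Example usage (paths are placeholders):
#   install_jar path/to/blaze.jar "$SPARK_HOME"
#   spark-sql $(conf_to_flags < blaze.conf) -f tpcds/q01.sql
```

Passing the settings as `--conf` flags is convenient when comparing a Blaze run against vanilla Spark on the same installation, since the shared defaults file stays untouched.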
See the most recent Benchmark Results for a performance comparison with vanilla Spark 3.3.3. The benchmark results show that Blaze saves about 50% of the time on the TPC-DS/TPC-H 1TB datasets. Stay tuned and join us for more upcoming thrilling numbers.
TPC-DS Query time: (How can I run TPC-DS benchmark?)
We also encourage you to benchmark Blaze and share the results with us. 🤗
We're using Discussions to connect with other members of our community. We hope that you:
- Ask questions you're wondering about.
- Share ideas.
- Engage with other community members.
- Welcome others who are open-minded. Remember that this is a community we build together 💪.
Blaze is licensed under the Apache 2.0 License. A copy of the license can be found here.