This project provides extensions to the Apache Spark project in Scala and Python:
Diff: A diff
transformation for Dataset
s that computes the differences between
two datasets, i.e. which rows to add, delete or change to get from one dataset to the other.
SortedGroups: A groupByKey
transformation that groups rows by a key while providing
a sorted iterator for each group. Similar to Dataset.groupByKey.flatMapGroups
, but with order guarantees
for the iterator.
Histogram: A histogram
transformation that computes the histogram DataFrame for a value column.
Partitioned Writing: The writePartitionedBy
action writes your Dataset
partitioned and
efficiently laid out with a single operation.
Fluent method call: T.call(transformation: T => R): R
: Turns a transformation T => R
,
that is not part of T
into a fluent method call on T
. This allows writing fluent code like:
import uk.co.gresearch._
i.doThis()
.doThat()
.call(transformation)
.doMore()
Fluent conditional method call: T.when(condition: Boolean).call(transformation: T => T): T
:
Perform a transformation fluently only if the given condition is true.
This allows writing fluent code like:
import uk.co.gresearch._
i.doThis()
.doThat()
.when(condition).call(transformation)
.doMore()
Backticks: backticks(string: String, strings: String*): String)
: Encloses the given column name with backticks (`
) when needed.
This is a handy way to ensure column names with special characters like dots (.
) work with col()
or select()
.
The spark-extension
package is available for all Spark 3.0, 3.1 and 3.2 versions. The package version
has the following semantics: spark-extension_{SCALA_COMPAT_VERSION}-{VERSION}-{SPARK_COMPAT_VERSION}
:
SCALA_COMPAT_VERSION
: Scala binary compatibility (minor) version. Available are2.12
and2.13
.SPARK_COMPAT_VERSION
: Apache Spark binary compatibility (minor) version. Available are3.0
,3.1
and3.2
.VERSION
: The package version, e.g.2.0.0
.
Add this line to your build.sbt
file:
libraryDependencies += "uk.co.gresearch.spark" %% "spark-extension" % "2.1.0-3.2"
Add this dependency to your pom.xml
file:
<dependency>
<groupId>uk.co.gresearch.spark</groupId>
<artifactId>spark-extension_2.12</artifactId>
<version>2.1.0-3.2</version>
</dependency>
Launch a Spark Shell with the Spark Extension dependency (version ≥1.1.0) as follows:
spark-shell --packages uk.co.gresearch.spark:spark-extension_2.12:2.1.0-3.2
Note: Pick the right Scala version (here 2.12) and Spark version (here 3.2) depending on your Spark Shell version.
Launch the Python Spark REPL with the Spark Extension dependency (version ≥1.1.0) as follows:
pyspark --packages uk.co.gresearch.spark:spark-extension_2.12:2.1.0-3.2
Note: Pick the right Scala version and Spark version depending on your PySpark version.
Run your Python scripts that use PySpark via spark-submit
:
spark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.1.0-3.2 [script.py]
Note: Pick the right Scala version (here 2.12) and Spark version (here 3.2) depending on your Spark version.
There are plenty of Data Science notebooks around. To use this library, add a jar dependency to your notebook using these Maven coordinates:
uk.co.gresearch.spark:spark-extension_2.12:2.1.0-3.2
Or download the jar and place it on a filesystem where it is accessible by the notebook, and reference that jar file directly.
Check the documentation of your favorite notebook to learn how to add jars to your Spark environment.
You can build this project against different versions of Spark and Scala.
If you want to build for a Spark or Scala version different to what is defined in the pom.xml
file, then run
sh set-version.sh [SPARK-VERSION] [SCALA-VERSION]
For example, switch to Spark 3.2.0 and Scala 2.13.5 by running sh set-version.sh 3.2.0 2.13.5
.
Then execute mvn package
to create a jar from the sources. It can be found in target/
.
Run the Scala tests via mvn test
.
In order to run the Python tests, setup a Python environment as follows (replace [SCALA-COMPAT-VERSION]
and [SPARK-COMPAT-VERSION]
with the respective values):
virtualenv -p python3 venv
source venv/bin/activate
pip install -r python/requirements-[SPARK-COMPAT-VERSION]_[SCALA-COMPAT-VERSION].txt
pip install pytest
Run the Python tests via env PYTHONPATH=python:python/test python -m pytest python/test
.