Open source platform for the machine learning lifecycle
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
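To make "implicit data parallelism" concrete, here is a word count written as map/reduce steps in plain Python. This is the same shape as Spark's RDD pipeline (flatMap, map, reduceByKey), except Spark would partition the data and run each step in parallel across the cluster; this sketch runs serially on one machine and is illustrative only.

```python
from functools import reduce
from collections import Counter

# Input "dataset" -- in Spark this would be an RDD partitioned across nodes.
lines = ["spark is fast", "spark is distributed"]

# flatMap: split each line into words.
words = [w for line in lines for w in line.split()]

# map: emit (word, 1) pairs.
pairs = [(w, 1) for w in words]

# reduceByKey: sum counts per word. Spark shuffles pairs by key and
# reduces each partition in parallel; here we fold serially.
counts = reduce(lambda acc, kv: acc + Counter({kv[0]: kv[1]}),
                pairs, Counter())

print(dict(counts))  # {'spark': 2, 'is': 2, 'fast': 1, 'distributed': 1}
```

Because each step is a pure function over independent records, the framework can distribute the work and recompute lost partitions on failure, which is where the fault tolerance comes from.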
An end-to-end GoodReads data pipeline for building a data lake, data warehouse, and analytics platform.
PySpark + Scikit-learn = Sparkit-learn
(Deprecated) Scikit-learn integration package for Apache Spark
PySpark methods to enhance developer productivity 📣 👯 🎉
A command-line tool for launching Apache Spark clusters.
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
A boilerplate for writing PySpark Jobs
Train and run PyTorch models on Apache Spark.
Code for "Efficient Data Processing in Spark" Course
An easy-to-use library for bringing TensorFlow to Apache Spark.
A pure Python implementation of Apache Spark's RDD and DStream interfaces.
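A pure-Python take on the RDD interface can be surprisingly small. The toy class below (an illustration, not that project's actual code) shows the core idea: transformations like `map` and `filter` are lazy, chained over generators, and nothing is computed until an action such as `collect()` runs.

```python
class RDD:
    """Toy RDD: lazy transformations over any iterable (illustrative only)."""

    def __init__(self, data):
        self._data = data  # generator or other iterable; not yet evaluated

    def map(self, f):
        # Returns a new RDD wrapping a generator -- no work happens yet.
        return RDD(f(x) for x in self._data)

    def filter(self, pred):
        return RDD(x for x in self._data if pred(x))

    def collect(self):
        # Action: forces evaluation of the whole chained pipeline.
        return list(self._data)

result = (RDD(range(10))
          .map(lambda x: x * x)
          .filter(lambda x: x % 2 == 0)
          .collect())
print(result)  # [0, 4, 16, 36, 64]
```

The lazy chaining mirrors Spark's distinction between transformations (which build a plan) and actions (which execute it), all without a JVM or a cluster.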
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.
Dataproc templates and pipelines for solving in-cloud data tasks
Apache Spark 3 - Structured Streaming Course Material
Apache (Py)Spark type annotations (stub files).
Real-Time Financial Market Data Processing and Prediction application
Astronomy Broker based on Apache Spark
Solution Accelerators for Serverless Spark on GCP, the industry's first auto-scaling and serverless Spark as a service
Created by Matei Zaharia
Released May 26, 2014