Open source platform for the machine learning lifecycle
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
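To make "implicit data parallelism" concrete, here is a word count written as map/reduce steps in plain Python. This is the same shape as Spark's RDD pipeline (flatMap, map, reduceByKey), except Spark would partition the data and run each step in parallel across the cluster; this sketch runs serially on one machine and is illustrative only.

```python
from functools import reduce
from collections import Counter

# Input "dataset" -- in Spark this would be an RDD partitioned across nodes.
lines = ["spark is fast", "spark is distributed"]

# flatMap: split each line into words.
words = [w for line in lines for w in line.split()]

# map: emit (word, 1) pairs.
pairs = [(w, 1) for w in words]

# reduceByKey: sum counts per word. Spark shuffles pairs by key and
# reduces each partition in parallel; here we fold serially.
counts = reduce(lambda acc, kv: acc + Counter({kv[0]: kv[1]}),
                pairs, Counter())

print(dict(counts))  # {'spark': 2, 'is': 2, 'fast': 1, 'distributed': 1}
```

Because each step is a pure function over independent records, the framework can distribute the work and recompute lost partitions on failure, which is where the fault tolerance comes from.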
An end-to-end GoodReads data pipeline for building a data lake, data warehouse, and analytics platform.
PySpark + Scikit-learn = Sparkit-learn
(Deprecated) Scikit-learn integration package for Apache Spark
PySpark methods to enhance developer productivity 📣 👯 🎉
A command-line tool for launching Apache Spark clusters.
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
A boilerplate for writing PySpark Jobs
Train and run PyTorch models on Apache Spark.
Code for "Efficient Data Processing in Spark" Course
An easy-to-use library for bringing TensorFlow to Apache Spark.
A pure Python implementation of Apache Spark's RDD and DStream interfaces.
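A pure-Python take on the RDD interface can be surprisingly small. The toy class below (an illustration, not that project's actual code) shows the core idea: transformations like `map` and `filter` are lazy, chained over generators, and nothing is computed until an action such as `collect()` runs.

```python
class RDD:
    """Toy RDD: lazy transformations over any iterable (illustrative only)."""

    def __init__(self, data):
        self._data = data  # generator or other iterable; not yet evaluated

    def map(self, f):
        # Returns a new RDD wrapping a generator -- no work happens yet.
        return RDD(f(x) for x in self._data)

    def filter(self, pred):
        return RDD(x for x in self._data if pred(x))

    def collect(self):
        # Action: forces evaluation of the whole chained pipeline.
        return list(self._data)

result = (RDD(range(10))
          .map(lambda x: x * x)
          .filter(lambda x: x % 2 == 0)
          .collect())
print(result)  # [0, 4, 16, 36, 64]
```

The lazy chaining mirrors Spark's distinction between transformations (which build a plan) and actions (which execute it), all without a JVM or a cluster.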
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.
Dataproc templates and pipelines for solving in-cloud data tasks
Apache Spark 3 - Structured Streaming Course Material
Apache (Py)Spark type annotations (stub files).
Real-Time Financial Market Data Processing and Prediction application
Astronomy Broker based on Apache Spark
Solution Accelerators for Serverless Spark on GCP, the industry's first auto-scaling and serverless Spark as a service
Created by Matei Zaharia
Released May 26, 2014