
Fastest open-source tool for replicating Databases to Data Lake in Open Table Formats like Apache Iceberg. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Supporting Postgres, MongoDB and MySQL


datazip-inc/olake

OLake

The fastest open-source tool for replicating databases to Apache Iceberg. OLake provides an easy-to-use web interface and a CLI for efficient, scalable, real-time data ingestion. Visit olake.io/docs for the full documentation and benchmarks.


🚀 Getting Started with OLake UI (Recommended)

OLake UI is a web-based interface for managing OLake jobs, sources, destinations, and configurations. You can run the entire OLake stack (UI, Backend, and all dependencies) using Docker Compose. This is the recommended way to get started.

Quick Start (two-step process):

  1. Start OLake UI via Docker Compose:
     curl -sSL https://raw.githubusercontent.com/datazip-inc/olake-ui/master/docker-compose.yml | docker compose -f - up -d
  2. Access the UI:

A detailed getting-started guide for OLake UI is available in the documentation at olake.io/docs.


Creating Your First Job

With the UI running, you can create a data pipeline in a few steps:

  1. Create a Job: Navigate to the Jobs tab and click Create Job.
  2. Configure Source: Set up your source connection (e.g., PostgreSQL, MySQL, MongoDB).
  3. Configure Destination: Set up your destination (e.g., Apache Iceberg with a Glue, REST, Hive, or JDBC catalog).
  4. Select Streams: Choose which tables to sync and configure their sync mode (CDC or Full Refresh).
  5. Configure & Run: Give your job a name, set a schedule, and click Create Job to finish.

For a detailed walkthrough, refer to the Jobs documentation.
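Step 4 asks you to pick a sync mode per stream. As a rough illustration of how the two modes differ, the sketch below models a Full Refresh as rebuilding the target from a complete snapshot and CDC as applying an ordered list of change events to an existing target. The function names, the event shape (`op`, `row`), and the in-memory dictionaries are assumptions made for this example; they are not OLake's internal data model.

```python
from typing import Any

Row = dict[str, Any]


def full_refresh(source_rows: list[Row], key: str) -> dict[Any, Row]:
    """Rebuild the target from scratch using a complete snapshot of the source."""
    return {row[key]: dict(row) for row in source_rows}


def apply_cdc(
    target: dict[Any, Row], events: list[dict[str, Any]], key: str
) -> dict[Any, Row]:
    """Apply ordered change events (insert/update/delete) to an existing target."""
    result = {k: dict(v) for k, v in target.items()}
    for event in events:
        op, row = event["op"], event["row"]
        if op in ("insert", "update"):
            result[row[key]] = dict(row)
        elif op == "delete":
            result.pop(row[key], None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return result


# Hypothetical data: an initial snapshot followed by three changes.
snapshot = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
target = full_refresh(snapshot, key="id")

changes = [
    {"op": "update", "row": {"id": 1, "name": "a2"}},
    {"op": "delete", "row": {"id": 2}},
    {"op": "insert", "row": {"id": 3, "name": "c"}},
]
target = apply_cdc(target, changes, key="id")
print(sorted(target))  # [1, 3]
```

The practical trade-off is that a Full Refresh re-reads every row on each run, while CDC only processes what changed since the last run.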

Performance Benchmarks*

OLake is engineered for high-throughput data replication.

  1. Postgres Connector to Apache Iceberg: (See Detailed Benchmark)

    • Full load: Syncs at 46,262 RPS for 4 billion rows. (101x Airbyte, 11.6x Estuary, 3.1x Debezium)
    • CDC: Syncs at 36,982 RPS for 50 million changes. (63x Airbyte, 12x Estuary, 2.7x Debezium)
  2. MongoDB Connector to Apache Iceberg: (See Detailed Benchmark)

    • Syncs 35,694 records/sec, replicating a 664 GB dataset (230 million rows) in 46 minutes. (20× Airbyte, 15× Debezium, 6× Fivetran)

*These are preliminary results. Fully reproducible benchmark scores will be published soon.
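To put the throughput figures in wall-clock terms, the following back-of-the-envelope sketch divides the quoted row counts by the quoted rates. It assumes a constant rate over the whole run, which real syncs will not have, and the helper name `seconds_to_sync` is made up for this example.

```python
def seconds_to_sync(rows: int, rows_per_second: float) -> float:
    """Time needed to move `rows` at a constant rate of `rows_per_second`."""
    return rows / rows_per_second


# Postgres full load: 4 billion rows at 46,262 RPS.
pg_full_load_hours = seconds_to_sync(4_000_000_000, 46_262) / 3600

# Postgres CDC: 50 million changes at 36,982 RPS.
pg_cdc_minutes = seconds_to_sync(50_000_000, 36_982) / 60

# MongoDB: rate implied by 230 million rows in 46 minutes.
mongo_rate_from_totals = 230_000_000 / (46 * 60)

print(f"{pg_full_load_hours:.1f} h")            # 24.0 h
print(f"{pg_cdc_minutes:.1f} min")              # 22.5 min
print(f"{mongo_rate_from_totals:,.0f} rows/s")  # 83,333 rows/s
```

The last figure is derived from the row count and duration alone and differs from the quoted 35,694 records/sec. The benchmark summary does not say how each number was measured, so treat this as a rough cross-check rather than a correction.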

Getting Started with OLake

Install OLake

Below are different ways you can run OLake:

  1. OLake UI (Recommended)
  2. Standalone Docker container
  3. Airflow on EC2
  4. Airflow on Kubernetes

Source / Connectors

  1. Getting started Postgres -> Writers | Postgres Docs
  2. Getting started MongoDB -> Writers | MongoDB Docs
  3. Getting started MySQL -> Writers | MySQL Docs

Writers / Destination

  1. Apache Iceberg Docs

    1. Catalogs
      1. AWS Glue Catalog
      2. REST Catalog
      3. JDBC Catalog
      4. Hive Catalog
    2. Azure ADLS Gen2
    3. Google Cloud Storage (GCS)
    4. MinIO (local)
    5. Iceberg Table Management
      1. S3 Tables Supported
  2. Parquet Writer

    1. AWS S3 Docs
    2. Google Cloud Storage (GCS)
    3. Local FileSystem Docs

Source Connectors

| Functionality | MongoDB | Postgres | MySQL |
| --- | --- | --- | --- |
| Full Refresh Sync | ✅ | ✅ | ✅ |
| Incremental Sync | WIP | WIP | WIP |
| CDC Sync | ✅ | ✅ | ✅ |
| Full Load Parallel Processing | ✅ | ✅ | ✅ |
| CDC Parallel Processing | ✅ | ❌ | ❌ |
| Resumable Full Load | ✅ | ✅ | ✅ |
| CDC Heartbeat (Planned) | - | - | - |
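The "Full Load Parallel Processing" and "Resumable Full Load" rows describe capabilities, not mechanisms. One common way to get both, shown here as a minimal sketch rather than as OLake's actual implementation, is to split the table's key range into chunks and record each finished chunk in a state set, so a restarted run skips what is already copied. `plan_chunks`, `run_full_load`, and the simulated failure are all hypothetical.

```python
from typing import Callable

Chunk = tuple[int, int]


def plan_chunks(min_id: int, max_id: int, chunk_size: int) -> list[Chunk]:
    """Split an inclusive key range into inclusive (start, end) chunks."""
    return [
        (start, min(start + chunk_size - 1, max_id))
        for start in range(min_id, max_id + 1, chunk_size)
    ]


def run_full_load(
    chunks: list[Chunk], done: set[Chunk], copy_chunk: Callable[[Chunk], None]
) -> None:
    """Copy every chunk not yet recorded in `done`, checkpointing after each one."""
    for chunk in chunks:
        if chunk in done:
            continue  # already copied in a previous run
        copy_chunk(chunk)
        done.add(chunk)  # checkpoint only after a successful copy


chunks = plan_chunks(1, 10, 4)  # [(1, 4), (5, 8), (9, 10)]
done: set[Chunk] = set()
copied: list[Chunk] = []
attempts = {"count": 0}


def copy_chunk(chunk: Chunk) -> None:
    """Stand-in for the real copy; fails on the second call to mimic an outage."""
    attempts["count"] += 1
    if attempts["count"] == 2:
        raise ConnectionError("simulated interruption")
    copied.append(chunk)


try:
    run_full_load(chunks, done, copy_chunk)  # interrupted while copying (5, 8)
except ConnectionError:
    pass

run_full_load(chunks, done, copy_chunk)  # resumes; (1, 4) is skipped
print(copied)  # [(1, 4), (5, 8), (9, 10)]
```

Because the chunks do not depend on each other, the same plan could be handed to several workers at once, which is the usual basis for parallel full loads.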

Destination Writers

| Functionality | Local Filesystem | AWS S3 | Apache Iceberg |
| --- | --- | --- | --- |
| Flattening & Normalization | ✅ | ✅ | ✅ |
| Partitioning | ✅ | ✅ | ✅ |
| Schema Data Type Changes | ✅ | ✅ | WIP |
| Schema Evolution | ✅ | ✅ | ✅ |
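To make "Flattening & Normalization" concrete, here is a minimal sketch of one generic flattening approach: nested objects become top-level columns whose names join the path with an underscore. The separator, the choice to leave lists untouched, and the `flatten` function itself are assumptions for illustration; the writers' actual naming rules and array handling are not described here.

```python
from typing import Any


def flatten(record: dict[str, Any], parent: str = "", sep: str = "_") -> dict[str, Any]:
    """Recursively turn nested dicts into a single level of columns."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value  # scalars and lists are kept as they are
    return flat


# Hypothetical MongoDB-style document.
doc = {
    "_id": 7,
    "user": {"name": "Ada", "address": {"city": "London"}},
    "tags": ["a", "b"],
}
print(flatten(doc))
# {'_id': 7, 'user_name': 'Ada', 'user_address_city': 'London', 'tags': ['a', 'b']}
```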

Supported Catalogs For Iceberg Writer

| Catalog | Status |
| --- | --- |
| Glue Catalog | Supported |
| Hive Metastore | Supported |
| JDBC Catalog | Supported |
| REST Catalog | Supported (with AWS S3 table) |
| Azure Purview | Not Planned, submit a request |
| BigLake Metastore | Not Planned, submit a request |

⚙️ Core Framework & CLI

For advanced users and automation, OLake's core logic is exposed via a powerful CLI. The core framework handles state management, configuration validation, logging, and type detection. It interacts with drivers using four main commands:

  • spec: Returns a render-able JSON Schema for a connector's configuration.
  • check: Validates connection configurations for sources and destinations.
  • discover: Returns all available streams (e.g., tables) and their schemas from a source.
  • sync: Executes the data replication job, extracting from the source and writing to the destination.

Find out more about how OLake works here.
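The four commands above can be pictured as a small contract between the framework and a driver. The sketch below is a toy, in-memory model of that contract, written only to show the order and purpose of the calls. `InMemoryDriver`, `run`, the config field `database`, and all return shapes are hypothetical and do not reflect OLake's real CLI flags or Go interfaces.

```python
from typing import Any


class InMemoryDriver:
    """Hypothetical toy driver backed by Python lists; not OLake's interface."""

    def __init__(self, tables: dict[str, list[dict[str, Any]]]) -> None:
        self.tables = tables

    def spec(self) -> dict[str, Any]:
        """Describe the expected configuration as a JSON Schema-like dict."""
        return {
            "type": "object",
            "properties": {"database": {"type": "string"}},
            "required": ["database"],
        }

    def check(self, config: dict[str, Any]) -> bool:
        """Validate the configuration (here: a non-empty database name)."""
        database = config.get("database")
        return isinstance(database, str) and bool(database)

    def discover(self) -> dict[str, dict[str, str]]:
        """List streams with column types inferred from each table's first row."""
        return {
            name: {col: type(val).__name__ for col, val in rows[0].items()}
            for name, rows in self.tables.items()
            if rows
        }

    def sync(self, streams: list[str], destination: dict[str, list]) -> int:
        """Copy the selected streams into the destination; return rows written."""
        written = 0
        for stream in streams:
            rows = self.tables[stream]
            destination.setdefault(stream, []).extend(rows)
            written += len(rows)
        return written


def run(driver: InMemoryDriver, command: str, **kwargs: Any) -> Any:
    """Dispatch one of the four commands to the driver."""
    handlers = {
        "spec": driver.spec,
        "check": driver.check,
        "discover": driver.discover,
        "sync": driver.sync,
    }
    if command not in handlers:
        raise ValueError(f"unknown command: {command!r}")
    return handlers[command](**kwargs)


driver = InMemoryDriver({"users": [{"id": 1, "name": "Ada"}]})
destination: dict[str, list] = {}

print(run(driver, "check", config={"database": "app"}))  # True
print(run(driver, "discover"))  # {'users': {'id': 'int', 'name': 'str'}}
print(run(driver, "sync", streams=["users"], destination=destination))  # 1
```

A typical session would follow that same order: read the spec, check a config against it, discover streams, then sync the ones you selected.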

Playground

  1. OLake + Apache Iceberg + REST Catalog + Presto
  2. OLake + Apache Iceberg + AWS Glue + Trino
  3. OLake + Apache Iceberg + AWS Glue + Athena
  4. OLake + Apache Iceberg + AWS Glue + Snowflake

🗺️ Roadmap

Check out our GitHub Project Roadmap and the Upcoming OLake Roadmap to track what's next. If you have ideas or feedback, please share them in our GitHub Discussions or by opening an issue.

❀️ Contributing

We ❤️ contributions, big or small! Check out our Bounty Program. A huge thanks to all our amazing contributors!
