This project demonstrates an end-to-end data pipeline built with Apache Kafka, Java, and OpenSearch. The application reads a continuous stream of change events from Wikimedia, publishes them to Kafka in real time, and then indexes them into OpenSearch for storage and further analysis.
- The application reads change events (new posts or updates to existing ones) from Wikimedia in real time.
- The producer reads these change events from Wikimedia and publishes them to a Kafka topic named `wikimedia.newChanges`, as sketched below.
- The topic is configured with 3 partitions, which reduces latency by approximately 30%.
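A minimal sketch of such a producer, using only the JDK's `HttpClient` to read the Wikimedia SSE stream (the class name and the choice of HTTP client are illustrative, not necessarily what this repository uses; a dedicated SSE client library would also work; a local broker on `localhost:9092` is assumed):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Properties;
import java.util.stream.Stream;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class WikimediaProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://stream.wikimedia.org/v2/stream/recentchange")).build();

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            HttpResponse<Stream<String>> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofLines());
            // SSE payload lines start with "data: "; forward the JSON body to Kafka
            response.body()
                    .filter(line -> line.startsWith("data: "))
                    .map(line -> line.substring("data: ".length()))
                    .forEach(json -> producer.send(
                            new ProducerRecord<>("wikimedia.newChanges", json)));
        }
    }
}
```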
- Topic name: `wikimedia.newChanges`
- Partitions: 3
- This configuration allows consumers to process records in parallel, which reduces latency.
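For reference, one way to create a topic with this layout programmatically is Kafka's `AdminClient` (a sketch only; the same result can be achieved with the `kafka-topics.sh` CLI; a single local broker on `localhost:9092` is assumed):

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions; replication factor 1 is enough for a single-broker local setup
            admin.createTopics(List.of(new NewTopic("wikimedia.newChanges", 3, (short) 1)))
                 .all().get();
        }
    }
}
```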
- The application runs 2 consumers in a single consumer group reading from the 3 partitions for increased throughput; two consumers proved enough to keep the consumer lag negligible. A minimal consumer sketch follows.
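A sketch of such a consumer (the class name and `group.id` are illustrative, not necessarily the ones used in this repository; a local broker on `localhost:9092` is assumed). All instances started with the same `group.id` split the topic's 3 partitions between them:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class WikimediaConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Instances sharing this group.id divide the 3 partitions among themselves
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "wikimedia-opensearch");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("wikimedia.newChanges"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // In the real pipeline, each poll's records would be handed to the OpenSearch sink
                    System.out.printf("partition=%d offset=%d%n", record.partition(), record.offset());
                }
            }
        }
    }
}
```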
- OpenSearch is used as the data store for every record read from the Kafka topic.
- Data is sent to OpenSearch in batches rather than one document at a time, which increases throughput by approximately 15%; see the bulk-indexing sketch below.
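A minimal sketch of the batching idea, assuming the OpenSearch REST high-level client (`opensearch-rest-high-level-client`); the index name `wikimedia`, the class name, and the local endpoint are illustrative (for Bonsai, use the cluster URL from your dashboard instead):

```java
import java.io.IOException;
import java.util.List;

import org.apache.http.HttpHost;
import org.opensearch.action.bulk.BulkRequest;
import org.opensearch.action.index.IndexRequest;
import org.opensearch.client.RequestOptions;
import org.opensearch.client.RestClient;
import org.opensearch.client.RestHighLevelClient;
import org.opensearch.common.xcontent.XContentType;

public class OpenSearchBulkSink {
    private final RestHighLevelClient client;

    public OpenSearchBulkSink(RestHighLevelClient client) {
        this.client = client;
    }

    /** Indexes one poll's worth of records with a single bulk call instead of one request each. */
    public void indexBatch(List<String> jsonRecords) throws IOException {
        BulkRequest bulk = new BulkRequest();
        for (String json : jsonRecords) {
            bulk.add(new IndexRequest("wikimedia").source(json, XContentType.JSON));
        }
        if (bulk.numberOfActions() > 0) {
            client.bulk(bulk, RequestOptions.DEFAULT);
        }
    }

    public static void main(String[] args) throws IOException {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            new OpenSearchBulkSink(client).indexBatch(
                    List.of("{\"title\":\"example change\"}"));
        }
    }
}
```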
- Java
- Kafka
- Zookeeper
- OpenSearch
- Event Handling
- Download Kafka and Zookeeper: https://www.conduktor.io/kafka/how-to-install-apache-kafka-on-windows/
- Create an account on Bonsai to access a hosted OpenSearch cluster
- Run the following command in a terminal to start Zookeeper:

```sh
zookeeper-server-start.sh ~/kafka_2.13-3.8.0/config/zookeeper.properties
```
- Run the following command in a terminal to start Kafka:

```sh
kafka-server-start.sh ~/kafka_2.13-3.8.0/config/server.properties
```
- Clone the repository and open it in a Java IDE (preferably IntelliJ)
- Run the Consumer classes followed by the Producer class