A stream processing pipeline that analyzes a folder of stored PDFs in real time using Apache Storm.
This repo contains the Storm topology code and instructions for running it locally in Local Mode, which simulates the behavior of the topology as if it were deployed to a cluster.
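The repository name and the topology class used later (PdfWordCountTopology) suggest a running word count over text extracted from PDFs. As a rough, framework-free illustration of that idea, the sketch below chains three generator stages that loosely mirror a source stage, a splitting stage, and a counting stage. It does not use Storm and does not read PDFs. The function names (`emit_documents`, `split_words`, `running_counts`) and the sample strings are hypothetical and are not taken from this repo's code.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, Tuple


def emit_documents(texts: Iterable[str]) -> Iterator[str]:
    """Source stage: emit one already-extracted document text at a time."""
    for text in texts:
        yield text


def split_words(documents: Iterable[str]) -> Iterator[str]:
    """Splitting stage: lowercase each document and emit its words one by one."""
    for document in documents:
        for word in re.findall(r"[a-z0-9']+", document.lower()):
            yield word


def running_counts(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Counting stage: keep a running tally and emit (word, count) after each word."""
    counts: Counter = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


if __name__ == "__main__":
    # Hypothetical sample input standing in for text extracted from PDFs.
    sample = ["Storm processes streams", "streams of words"]
    for word, count in running_counts(split_words(emit_documents(sample))):
        print(word, count)
```

Each stage consumes the previous one lazily, so counts are updated as words arrive rather than after all input has been read, which is the property a streaming word count relies on.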
- Vagrant - virtual environment manager.
- Oracle VM VirtualBox - general-purpose virtualizer.
- SSH client, such as PuTTY.
- Clone the repo:
git clone https://github.com/aperkaz/storm-pdf-word-counter.git
- Move into the project folder:
cd storm-pdf-word-counter
- Spin up the VM:
vagrant up
- Using an SSH client, connect to:
127.0.0.1:2222
4.1. Log in with vagrant:vagrant
- Run the visualization web server
5.1. Inside the VM: cd /vagrant/visualization
5.2. python app.py
- Package the topology
6.1. Inside the VM (open a new SSH session): cd /vagrant/topology
6.2. mvn clean
6.3. mvn package - may take a while the first time.
- Execute the packaged topology
7.1. Inside the VM: cd /vagrant/topology
7.2. storm jar target/storm-pdf-word-counter-0.0.1-SNAPSHOT-jar-with-dependencies.jar storm.PdfWordCountTopology
- View the live generated results at:
http://127.0.0.1:5000
- Shut down the VM:
vagrant halt
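If the results page does not load while the VM is running, it can help to check from the host whether the visualization server started in step 5 is answering at http://127.0.0.1:5000. The following is a minimal sketch using only the Python standard library. The helper name `is_server_up` and the timeout value are assumptions made for illustration and are not part of this repo.

```python
import urllib.request


def is_server_up(url: str = "http://127.0.0.1:5000", timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to `url` gets a response with a status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except OSError:
        # URLError (including HTTPError), timeouts, and refused connections
        # are all OSError subclasses, so they are treated as "not up".
        return False


if __name__ == "__main__":
    print("visualization server reachable:", is_server_up())
```

A result of False before step 5.2 has been run, or after vagrant halt, is expected.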