storm-pdf-word-counter

A stream processing pipeline that performs real-time analysis of a folder of stored PDFs using Apache Storm.

This repo contains the Storm topology code and instructions for running it locally in Local Mode, which simulates the behavior of the topology as if it were deployed to a cluster.

Requirements

Vagrant - virtual environment manager.
Oracle VM VirtualBox - general-purpose virtualizer.
SSH client, such as PuTTY.

Topology
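
As a rough illustration of what a word-count pipeline of this kind does, the Python sketch below mimics a split stage feeding a running-count stage over plain text. It is not the repo's Storm code, which lives in the topology directory. The function names (split_words, count_stream) and the sample sentences are hypothetical, and PDF text extraction is left out entirely.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, Tuple


def split_words(lines: Iterable[str]) -> Iterator[str]:
    """Split stage: emit each word of each line, lowercased."""
    for line in lines:
        for word in re.findall(r"[a-z']+", line.lower()):
            yield word


def count_stream(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Count stage: keep running totals and emit (word, total) after every word."""
    totals = Counter()
    for word in words:
        totals[word] += 1
        yield word, totals[word]


if __name__ == "__main__":
    # Stand-in for text already extracted from PDFs (hypothetical sample).
    sample = ["The storm passed.", "The calm after the storm."]
    for word, total in count_stream(split_words(sample)):
        print(word, total)
```

Each stage consumes and emits items one at a time, which is loosely how tuples flow between components in a streaming topology. A real deployment would distribute the stages across workers.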

Getting Started

1. git clone https://github.com/aperkaz/storm-pdf-word-counter.git
2. cd storm-pdf-word-counter
3. Spin up the VM: vagrant up
4. Using an SSH client, connect to 127.0.0.1:2222
4.1. Log in as vagrant:vagrant (username:password)
5. Run the visualization web server
5.1. Inside the VM: cd /vagrant/visualization
5.2. python app.py
6. Package the topology
6.1. Inside the VM (open a new SSH session): cd /vagrant/topology
6.2. mvn clean
6.3. mvn package - may take a while the first time.
7. Execute the packaged topology
7.1. Inside the VM: cd /vagrant/topology
7.2. storm jar target/storm-pdf-word-counter-0.0.1-SNAPSHOT-jar-with-dependencies.jar storm.PdfWordCountTopology
8. View the live results at http://127.0.0.1:5000.
9. Shut down the VM: vagrant halt
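
The VM, packaging, and execution steps above could also be driven from the host instead of an interactive SSH session. The following is a minimal sketch, assuming that vagrant ssh -c works on your host (it needs an ssh executable on the PATH) and that the script is run from the repo root. The helper names (in_vm, build_and_run_commands, run_all) are hypothetical and not part of this repo. It does not start the visualization server, which is long-running and still needs its own session.

```python
import shlex
import subprocess
from typing import List

# Jar path and main class as given in the steps above.
TOPOLOGY_JAR = "target/storm-pdf-word-counter-0.0.1-SNAPSHOT-jar-with-dependencies.jar"
MAIN_CLASS = "storm.PdfWordCountTopology"


def in_vm(directory: str, command: str) -> List[str]:
    """Build a `vagrant ssh -c` invocation that runs `command` from `directory` inside the VM."""
    return ["vagrant", "ssh", "-c", f"cd {shlex.quote(directory)} && {command}"]


def build_and_run_commands() -> List[List[str]]:
    """Host-side commands, in order: start the VM, package the topology, run it."""
    return [
        ["vagrant", "up"],
        in_vm("/vagrant/topology", "mvn clean package"),
        in_vm("/vagrant/topology", f"storm jar {TOPOLOGY_JAR} {MAIN_CLASS}"),
    ]


def run_all(runner=subprocess.run) -> None:
    """Run each command in turn, stopping at the first failure."""
    for cmd in build_and_run_commands():
        runner(cmd, check=True)
```

Calling run_all() would execute the three commands in sequence. Passing a different runner makes it possible to inspect the commands without touching Vagrant.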
