
⚙️Using Apache Storm for counting words from stored PDF documents

aperkaz/storm-pdf-word-counter

storm-pdf-word-counter

Stream processing pipeline that performs real-time analysis of a folder of stored PDFs, using Apache Storm.

This repo contains the Storm topology code and instructions for running it locally in Local Mode, which simulates the behavior of the topology as if it were deployed to a cluster.

Requirements

Topology

[Image: topology diagram]
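
The topology diagram is not reproduced here, so the following is only a minimal sketch of the word-counting idea in plain Java, with no Storm dependency. It assumes the PDF text has already been extracted into strings. The class and method names (`WordCountSketch`, `countWords`) are hypothetical and are not taken from the repo's actual spouts or bolts.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: the real topology's stages are not shown in this README.
public class WordCountSketch {

    // Counts words across chunks of already-extracted text.
    // Words are lowercased; any run of characters that are neither
    // letters nor digits is treated as a separator.
    static Map<String, Integer> countWords(Iterable<String> chunks) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String chunk : chunks) {
            String[] tokens = chunk.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
            for (String word : tokens) {
                // split() can yield an empty leading token; skip it.
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for text pulled from two PDF pages.
        List<String> pages = Arrays.asList("Storm counts words.", "Words, words, WORDS!");
        System.out.println(countWords(pages)); // prints {storm=1, counts=1, words=4}
    }
}
```

In a Storm topology this kind of logic would typically be split across components, for example one that emits text and others that split and count it, but how this repo divides the work is not stated here.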

Getting Started

  1. git clone https://github.com/aperkaz/storm-pdf-word-counter.git
  2. cd storm-pdf-word-counter
  3. Spin up the VM: vagrant up
  4. Using an SSH client, connect to 127.0.0.1 on port 2222
    4.1. Log in with username vagrant and password vagrant
  5. Run the visualization web server
    5.1. Inside the VM: cd /vagrant/visualization
    5.2. python app.py
  6. Package the topology
    6.1. Inside the VM (open new SSH session): cd /vagrant/topology
    6.2. mvn clean
    6.3. mvn package (may take a while the first time)
  7. Execute the packaged topology
    7.1. Inside the VM: cd /vagrant/topology
    7.2. storm jar target/storm-pdf-word-counter-0.0.1-SNAPSHOT-jar-with-dependencies.jar storm.PdfWordCountTopology
  8. View the live results at http://127.0.0.1:5000
  9. Shut down the VM: vagrant halt
