ApiBot a scheduled crawler

An open-source tool that you can host on your own machine and use for your scraping needs.

Start the app

The app can be started on Heroku or with Docker. For Heroku, set up the buildpacks:

heroku buildpacks
heroku buildpacks:set heroku/ruby # https://github.com/heroku/heroku-buildpack-ruby
heroku buildpacks:add --index 1 heroku/nodejs # https://github.com/heroku/heroku-buildpack-nodejs
heroku buildpacks:add heroku/google-chrome # https://github.com/heroku/heroku-buildpack-google-chrome

# Fixes Webdrivers::BrowserNotFound (Failed to find Chrome binary.), see https://stackoverflow.com/a/57671870/287166
heroku config:set WD_CHROME_PATH=/app/.apt/usr/bin/google-chrome

Basic usage

When you start the app, you can register the first User. This user is a superadmin who can manage other users on the platform. A single user is enough, but if you have other projects that need scraping, you can create another Company and add other users to it.

Scraping is performed in two steps: first, a Bot downloads an HTML page; second, an Inspect parses the downloaded data.

A Bot is defined by its starting URL. If the site does not require JavaScript, then you can choose selenium_chrome or headless_selenium_chrome. If the target data is not on the starting URL, then you can perform various Steps to get to it: click on page elements, fill in inputs and submit forms.

When we Run a Bot, its main purpose is to reach the desired Page that holds the data we need (using PageService) and to perform the inspection (using InspectService).
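The Steps described above (click, fill, submit) can be pictured as a small list of actions replayed against a browser session. The following is only a sketch and not ApiBot's actual PageService: the `run_steps` helper, the step Hash format and the `RecordingSession` stand-in are all assumptions made for illustration. The session is expected to answer Capybara-style calls (`visit`, `click_on`, `fill_in`, `click_button`, `html`); a recording fake is used here so the example runs without a browser.

```ruby
# Hypothetical helper: replay a list of steps and return the final page HTML.
# `session` is any object that responds to Capybara-style methods.
def run_steps(session, start_url, steps)
  session.visit(start_url)
  steps.each do |step|
    case step[:action]
    when :click  then session.click_on(step[:target])
    when :fill   then session.fill_in(step[:target], with: step[:value])
    when :submit then session.click_button(step[:target])
    else raise ArgumentError, "unknown step action: #{step[:action].inspect}"
    end
  end
  session.html
end

# Stand-in for a real browser session: it only records the calls it receives.
class RecordingSession
  attr_reader :calls

  def initialize
    @calls = []
  end

  def visit(url)
    @calls << [:visit, url]
  end

  def click_on(target)
    @calls << [:click_on, target]
  end

  def fill_in(target, with:)
    @calls << [:fill_in, target, with]
  end

  def click_button(target)
    @calls << [:click_button, target]
  end

  def html
    '<h1>My Header</h1>'
  end
end

# Hypothetical steps: open a search page, type a query and submit the form.
steps = [
  { action: :click,  target: 'Search' },
  { action: :fill,   target: 'query', value: 'ruby' },
  { action: :submit, target: 'Go' }
]

session = RecordingSession.new
page_html = run_steps(session, 'https://example.com', steps)
puts page_html
```

With a real Capybara session in place of `RecordingSession`, the returned HTML would be the page that is later handed to the inspection step.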

The saved HTML is processed by Inspect to return data in a clean format (without HTML tags and extra whitespace). For that we use Nokogiri https://github.com/sparklemotion/nokogiri/wiki/Cheat-sheet For example, for a page containing ...<h1>My Header</h1>... we can inspect header: h1, and for a table we can inspect name: td:nth-child(1) and count: td:nth-child(2). The generated data might be an object with string values

{
  data: {
    header: 'My Header'
  }
}

or an array of objects with string values

{
  data: [
    {
      name: 'Item1',
      count: '1'
    },
    {
      name: 'Item2',
      count: '2'
    }
  ]
}

You can run some examples with a single click, either on your local ApiBot or on Heroku: http://trk-bot.herokuapp.com/examples

Keywords

Web Scraping, Data Scraping, Data Mining, Data Extraction, Data Science, Microsoft Excel, Extract, Transform and Load (ETL), Data Migration, Web Crawling

Configuration

You can start the application on Heroku

To create your keys, run rails credentials:edit and add entries like these:

secret_key_base: 123
exception_recipients: my@gmail.com
mailer_sender: My Company <my@gmail.com>

smtp_username: my@gmail.com
smtp_password: mypassword

# captcha https://www.google.com/recaptcha/admin#site/_____?setup
google_recaptcha_site_key: 123
google_recaptcha_secret_key: 123

Dependencies

ApiBot uses:

Ruby on Rails v6.0.2
Ruby
Sidekiq
PostgreSQL

Development

rails db:setup

Run tests

rake

Similar tools

Kimurai https://github.com/vifreefly/kimuraframework

TODO

callback URL that will send data back to some server
push data to Google Drive, send by email
convert to a Ruby script that can be run independently, so complex situations can be handled manually
add test for recurring bots
