shsiena/antlion: Express.js middleware that turns your website into an infinite sinkhole for unethical webscraping bots.

Become indigestible, grow spikes.

Node.js Express.js TypeScript

Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Installation
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgments

About The Project

For too long, AI companies have been flagrantly disrespecting website owners by ignoring their robots.txt and scraping everything on their site without permission. With Antlion, you can fight back.

Antlion is Express.js middleware that lets you turn dedicated routes on your site into infinitely recursive tar pits designed to trap webscrapers that ignore your robots.txt file.

Features

  • Bots that ignore your site's robots.txt and enter Antlion's pit are locked in an infinitely deep site full of nonsensical garbled text which loads at the speed of a '90s dial-up connection.

  • Once bots wait upwards of 20 seconds for a page to finally load, they are presented with several links, each of which leads deeper into Antlion's pit.

  • Antlion also automatically handles serving your robots.txt, injecting disallow entries for all trapped routes so ethical bots and search engine indexers skip them automatically — no extra config needed.

  • Any malicious webscrapers gathering data to compile datasets for training LLMs will inadvertently digest millions of lines of useless text, ruining the output of models trained with this data, ideally making bot owners think twice before ignoring the rules in your sacred robots.txt.

  • Adding Antlion to your site is easy: install the npm package, give it some unused routes, point it to your existing robots.txt, paste a bunch of random text into a file, and add a single hidden link somewhere on your site that leads into the pit. Antlion will take care of the rest.
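
The robots.txt handling described in the Features list can be pictured as a small text transformation. The sketch below is only an illustration under assumptions, not Antlion's actual implementation: the function name `injectDisallow` and the choice to append a separate `User-agent: *` group are hypothetical.

```typescript
// Hypothetical helper for illustration; not part of Antlion's public API.
// Appends a wildcard group that disallows every trapped route.
function injectDisallow(robotsTxt: string, trappedRoutes: string[]): string {
  // Make sure each route starts with a slash, and drop duplicates.
  const routes = Array.from(
    new Set(trappedRoutes.map((r) => (r.startsWith("/") ? r : "/" + r)))
  );
  if (routes.length === 0) return robotsTxt;

  const group = ["User-agent: *", ...routes.map((r) => `Disallow: ${r}`)].join("\n");

  // Leave the original rules untouched and separate the new group with a blank line.
  const base = robotsTxt.trimEnd();
  return (base.length > 0 ? base + "\n\n" : "") + group + "\n";
}

const existing = "User-agent: *\nDisallow: /admin/\n";
console.log(injectDisallow(existing, ["/example/", "/trap/"]));
```

Appending a wildcard group keeps the existing file intact, but a crawler that matches a more specific `User-agent` group in the same file would follow that group instead, so a fuller version would probably need to add the entries to every group.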




Installation

This is a Node.js module available through the npm registry.

Before installing, download and install Node.js. Node.js 18 or higher is required.

If this is a brand new project, make sure to create a package.json first with the npm init command.

Install it with the npm install command:

npm install antlion


Usage

  1. Create a file bait.txt (suggested name), and fill it with as much text as you can. This can be Wikipedia articles, blog posts, textbooks, or even Shakespeare. Do not worry about formatting or special characters.

  2. Choose a couple of routes that you are not using and do not plan to use, such as /blog/, /docs/installation/ or /aboutus/detailed/. These can be anything, but the more realistic they are, the better.

  3. Remove any existing handlers for /robots.txt.

  4. Import Antlion and add it to your server middleware:

import express from 'express'
import antlion from 'antlion'

const app = express()

antlion(app, {
    robotsPath: 'robots.txt',                 // path to existing robots.txt from project root
    trainingDataPath: 'bait.txt',             // path to training data file from project root
    trappedRoutes: ['/example/', '/trap/']    // array of routes to trap
})

// -- rest of your code --
  5. Add a link into Antlion's pit somewhere on your site, ideally hidden so regular users will not notice it.
    • The link should point to one of the trapped routes, optionally followed by some random text, which may make the trap harder for bots to evade.
    • Ex: /trap/abcdef, or just /trap
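
As a rough illustration of the last step, the sketch below builds a trap link with a random suffix and renders it as an empty anchor. The helper names `trapPath` and `hiddenLink` are hypothetical and are not exported by Antlion; the `aria-hidden` and `tabindex` attributes are one assumed way to keep the link away from keyboard and screen-reader users.

```typescript
// Hypothetical helpers for illustration; Antlion does not export these.
const ALPHABET = "abcdefghijklmnopqrstuvwxyz";

// Build a path under a trapped route with a random lowercase suffix,
// e.g. "/trap/" -> "/trap/qzkfia". `rand` is injectable so tests can be deterministic.
function trapPath(route: string, length = 6, rand: () => number = Math.random): string {
  const base = route.endsWith("/") ? route : route + "/";
  let suffix = "";
  for (let i = 0; i < length; i++) {
    suffix += ALPHABET.charAt(Math.floor(rand() * ALPHABET.length));
  }
  return base + suffix;
}

// Render an empty anchor that assistive technology and keyboard navigation skip.
// Any additional visual hiding (CSS) is left to the site.
function hiddenLink(href: string): string {
  return `<a href="${href}" aria-hidden="true" tabindex="-1"></a>`;
}

console.log(hiddenLink(trapPath("/trap/")));
```

The route passed to `trapPath` should be one of the entries in `trappedRoutes`, otherwise the link will not lead into the pit.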

⚠️ NOTE: If you’re running a high-traffic site, consider hosting Antlion on a separate server and linking to it to avoid performance hits.
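
One plausible reason for this warning is that a tar pit holds each trapped connection open while the page trickles out. The sketch below shows that idea in isolation. It is not Antlion's code: `dripSchedule`, `drip`, the chunk size, and the 20-second total (borrowed from the Features list) are illustrative assumptions.

```typescript
// Hypothetical sketch of a slow "drip" response; not Antlion's implementation.
interface Drip {
  delayMs: number; // wait this long before sending the chunk
  chunk: string;
}

// Split `text` into fixed-size chunks and spread them evenly over `totalMs`.
function dripSchedule(text: string, chunkSize: number, totalMs: number): Drip[] {
  if (chunkSize <= 0) throw new RangeError("chunkSize must be positive");
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push(text.slice(i, i + chunkSize));
  }
  const delayMs = chunks.length > 0 ? totalMs / chunks.length : 0;
  return chunks.map((chunk) => ({ delayMs, chunk }));
}

// Send each chunk through `write` after its delay. With Express, `write`
// could wrap res.write(), followed by res.end() once the promise resolves.
async function drip(schedule: Drip[], write: (chunk: string) => void): Promise<void> {
  for (const { delayMs, chunk } of schedule) {
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    write(chunk);
  }
}

// Example: 40 characters in 8-character chunks over 20 seconds.
const plan = dripSchedule("x".repeat(40), 8, 20000);
console.log(plan.length, plan.map((d) => d.delayMs));
```

Every response in flight keeps a socket and a timer alive for its whole duration, which is the kind of cost that may make a separate host attractive on a busy site.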


Roadmap

  • Dynamic HTML to evade detection
  • Bot IP address tracking/logging
  • Text generation model caching for faster startup

See the open issues for a full list of proposed features (and known issues).


Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

Setup

Clone the repository:

git clone https://github.com/shsiena/antlion.git

Install dependencies:

cd antlion
npm install

Run test server:

npm run dev

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request


License

Distributed under the MIT License. See LICENSE for more information.

Contact

Simon Siena - ssiena@uwaterloo.ca

Project Link: https://github.com/shsiena/antlion

Acknowledgments

Inspired by:
