Kepler Web Scraper API

A powerful, lightweight web scraping API built with Hono and deployed on Cloudflare Workers. Extract text content and attributes from any website using CSS selectors.

Features

  • Blazing Fast - Powered by Cloudflare Workers edge network
  • CSS Selector Support - Use any CSS selector to target elements
  • Text & Attributes - Extract both text content and HTML attributes
  • CORS Enabled - Ready for browser-based applications
  • Pretty JSON - Optional formatted JSON responses
  • Error Handling - Comprehensive error responses

Quick Start

Prerequisites

  • Bun installed
  • A Cloudflare account (needed for deployment)
  • The Wrangler CLI, used for login and type generation

Installation

  1. Clone the repository:

    git clone <your-repo-url>
    cd Kepler
  2. Install dependencies:

    bun install
  3. Start development server:

    bun run dev

    The API will be available at http://localhost:8787
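
To confirm the server is running, you can try the same example.com request used throughout this README:

curl "http://localhost:8787/?url=https://example.com&selector=h1&pretty=true"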

API Documentation

Base URL

  • Development: http://localhost:8787
  • Production: https://kepler.sarvagyakrcs.workers.dev

Endpoints

GET /?url=<URL>&selector=<CSS_SELECTOR>&[options]

Scrapes content from the specified URL using the provided CSS selector.

Parameters

Parameter | Type    | Required | Description
----------|---------|----------|------------
url       | string  | Yes      | The URL of the webpage to scrape. Protocol is optional (defaults to http://)
selector  | string  | Yes      | CSS selector to target elements. Supports multiple selectors separated by commas
attr      | string  | No       | Extract a specific HTML attribute instead of text content
spaced    | boolean | No       | Add spaces between HTML tags when extracting text
pretty    | boolean | No       | Format JSON response with indentation

Response Format

Success Response

{
  "result": "extracted content" | ["array", "of", "results"] | {"selector": ["results"]}
}

Error Response

{
  "error": "Error message describing what went wrong"
}

Usage Examples

Basic Text Extraction

Extract the main heading from a webpage:

curl "http://localhost:8787/?url=https://example.com&selector=h1&pretty=true"

Response:

{
  "result": ["Example Domain"]
}

Multiple Elements

Extract all paragraph text:

curl "http://localhost:8787/?url=https://example.com&selector=p&pretty=true"

Response:

{
  "result": [
    "This domain is for use in illustrative examples in documents.",
    "More information..."
  ]
}

Multiple Selectors

Extract both headings and paragraphs:

curl "http://localhost:8787/?url=https://example.com&selector=h1,p&pretty=true"

Response:

{
  "result": {
    "h1": ["Example Domain"],
    "p": ["This domain is for use...", "More information..."]
  }
}

Attribute Extraction

Extract the href attributes of links on a page (example.com has a single link, so a single string is returned):

curl "http://localhost:8787/?url=https://example.com&selector=a&attr=href&pretty=true"

Response:

{
  "result": "https://www.iana.org/domains/example"
}

Spaced Text Extraction

Extract text with a space inserted between nested elements (without spaced=true, text from adjacent tags is concatenated with no separator):

curl "http://localhost:8787/?url=https://example.com&selector=div&spaced=true&pretty=true"

Real-World Examples

Extract Article Titles from Hacker News

curl "http://localhost:8787/?url=https://news.ycombinator.com&selector=.titleline>a&pretty=true"

Get GitHub Repository Description

curl "http://localhost:8787/?url=https://github.com/microsoft/vscode&selector=.BorderGrid-cell p&pretty=true"

Extract Product Prices

curl "http://localhost:8787/?url=https://example-shop.com&selector=.price&pretty=true"

Get Meta Description

curl "http://localhost:8787/?url=https://example.com&selector=meta[name='description']&attr=content&pretty=true"

JavaScript/Browser Usage

Fetch API

const scrapeData = async (url, selector) => {
  const apiUrl = `http://localhost:8787/?url=${encodeURIComponent(url)}&selector=${encodeURIComponent(selector)}&pretty=true`;
  
  try {
    const response = await fetch(apiUrl);
    const data = await response.json();
    return data.result;
  } catch (error) {
    console.error('Scraping failed:', error);
    return null;
  }
};

// Usage
const titles = await scrapeData('https://news.ycombinator.com', '.titleline>a');
console.log(titles);

jQuery

$.get('http://localhost:8787/', {
  url: 'https://example.com',
  selector: 'h1',
  pretty: true
}).done(function(data) {
  console.log(data.result);
});
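
TypeScript

For TypeScript callers, the three possible shapes of result documented in the Response Format section (a single string, an array, or an object keyed by selector) can be made explicit with a small typed wrapper. A minimal sketch, assuming the local development base URL; adjust it for production:

// Result shapes documented in the Response Format section above.
type ScrapeResult = string | string[] | Record<string, string[]>;

interface ScrapeResponse {
  result?: ScrapeResult;
  error?: string;
}

const BASE_URL = 'http://localhost:8787';

async function scrape(url: string, selector: string, attr?: string): Promise<ScrapeResult> {
  // URLSearchParams takes care of percent-encoding the values.
  const params = new URLSearchParams({ url, selector });
  if (attr) params.set('attr', attr);

  const response = await fetch(`${BASE_URL}/?${params}`);
  const data: ScrapeResponse = await response.json();

  if (!response.ok || data.error) {
    throw new Error(data.error ?? `HTTP ${response.status}`);
  }
  return data.result!;
}

// Usage
const hrefs = await scrape('https://example.com', 'a', 'href');
console.log(hrefs);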

Development

Project Structure

Kepler/
├── src/
│   ├── index.ts          # Main Hono application
│   ├── scraper.ts        # Web scraping logic
│   ├── content-types.ts  # MIME type constants
│   └── json-response.ts  # Response utilities
├── package.json
├── tsconfig.json
├── wrangler.jsonc        # Cloudflare Workers config
└── README.md

Available Scripts

Script             | Description
-------------------|------------
bun run dev        | Start development server with hot reload
bun run deploy     | Deploy to Cloudflare Workers
bun run cf-typegen | Generate TypeScript types for Workers
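
These scripts are most likely thin wrappers around the Wrangler CLI; a sketch of the package.json entries typically found in a Hono Workers template (the repo's actual file may differ):

{
  "scripts": {
    "dev": "wrangler dev",
    "deploy": "wrangler deploy",
    "cf-typegen": "wrangler types"
  }
}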

Environment Setup

  1. Generate Cloudflare types:

    bun run cf-typegen
  2. Configure Wrangler:

    wrangler login

Deployment

Deploy to Cloudflare Workers

  1. Login to Cloudflare:

    wrangler login
  2. Update worker name in wrangler.jsonc:

    {
      "name": "your-scraper-api",
      "main": "src/index.ts",
      "compatibility_date": "2025-06-19"
    }
  3. Deploy:

    bun run deploy

Custom Domain (Optional)

  1. Add a custom domain in the Cloudflare Workers dashboard
  2. Update your DNS settings
  3. Your API will be available at https://api.yourdomain.com
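
Custom domains can also be declared in wrangler.jsonc instead of the dashboard; a hedged sketch (the exact keys depend on your Wrangler version):

{
  "routes": [
    { "pattern": "api.yourdomain.com", "custom_domain": true }
  ]
}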

Configuration

CORS Settings

CORS is enabled by default for all origins. To restrict access, modify the CORS configuration in src/index.ts:

import { cors } from 'hono/cors'

app.use('/*', cors({
  origin: ['https://yourdomain.com'],
  allowMethods: ['GET'],
}))

Rate Limiting

Consider implementing rate limiting for production use:

// Add to src/index.ts
app.use('/*', async (c, next) => {
  // Implement your rate limiting logic
  await next()
})
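
As one possible starting point (not part of this repo), the sketch below keeps a fixed-window counter per client IP in the Worker's memory. In-memory state on Workers is per-isolate and short-lived, so a production setup would more likely use Durable Objects, KV, or Cloudflare's built-in rate limiting rules:

// Hypothetical middleware sketch — in-memory state is per-isolate and resets often on Workers.
import type { MiddlewareHandler } from 'hono'

const WINDOW_MS = 60_000    // length of each counting window
const MAX_REQUESTS = 60     // allowed requests per IP per window

const counters = new Map<string, { count: number; windowStart: number }>()

export const rateLimit: MiddlewareHandler = async (c, next) => {
  const ip = c.req.header('cf-connecting-ip') ?? 'unknown'
  const now = Date.now()
  const entry = counters.get(ip)

  if (!entry || now - entry.windowStart > WINDOW_MS) {
    // Start a new window for this IP.
    counters.set(ip, { count: 1, windowStart: now })
  } else if (++entry.count > MAX_REQUESTS) {
    return c.json({ error: 'Rate limit exceeded' }, 429)
  }

  await next()
}

// In src/index.ts:
// app.use('/*', rateLimit)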

Technical Details

Scraping Engine

  • Uses Cloudflare's built-in HTMLRewriter for efficient HTML parsing
  • Streams HTML content for memory efficiency
  • Supports complex CSS selectors including pseudo-selectors
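
The actual implementation lives in src/scraper.ts; as a rough sketch of the general approach (not the repo's exact code), text for a selector can be collected with an HTMLRewriter handler along these lines:

// Simplified sketch of HTMLRewriter-based extraction (not the repo's actual scraper.ts).
async function extractText(url: string, selector: string): Promise<string[]> {
  const upstream = await fetch(url)
  const results: string[] = []

  const rewriter = new HTMLRewriter().on(selector, {
    element() {
      // Start a new entry for each matched element.
      results.push('')
    },
    text(chunk) {
      // Text chunks inside the matched element are appended to the current entry.
      results[results.length - 1] += chunk.text
    },
  })

  // The transformed body must be consumed so the handlers actually run.
  await rewriter.transform(upstream).arrayBuffer()
  return results.map((text) => text.trim())
}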

Performance

  • Cold Start: ~50ms on Cloudflare Workers
  • Warm Requests: ~10-20ms
  • Memory Usage: ~5-10MB per request
  • Timeout: 30 seconds (Cloudflare Workers limit)

Limitations

  • Cannot execute JavaScript (static HTML only)
  • Subject to Cloudflare Workers CPU time limits
  • Cannot access localhost or private networks
  • Maximum response size: 128MB

Error Handling

The API returns appropriate HTTP status codes and error messages:

Status Code | Description
------------|------------
200         | Success
404         | Invalid endpoint or missing required parameters
500         | Scraping error (network, parsing, or server issues)

Common Error Scenarios

  1. Invalid URL:

    {"error": "Status 404 requesting https://invalid-url.com"}
  2. Invalid Selector:

    {"error": "Invalid CSS selector"}
  3. Network Timeout:

    {"error": "Request timeout"}

Best Practices

URL Encoding

Always encode URLs and selectors when making requests:

const encodedUrl = encodeURIComponent('https://example.com/path?param=value');
const encodedSelector = encodeURIComponent('div.class > p:first-child');
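
Alternatively, URLSearchParams handles the encoding for you when building the full request URL; a small sketch:

// URLSearchParams percent-encodes each value, so manual encodeURIComponent calls aren't needed.
const params = new URLSearchParams({
  url: 'https://example.com/path?param=value',
  selector: 'div.class > p:first-child',
  pretty: 'true',
});
const apiUrl = `http://localhost:8787/?${params.toString()}`;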

Selector Optimization

  • Use specific selectors to reduce parsing time
  • Avoid overly broad selectors like * or div
  • Test selectors in browser DevTools first

Error Handling

Always handle potential errors in your client code:

// Wrapping the request in a helper keeps the error handling in one place
// (apiUrl is built as shown in the URL Encoding section above).
async function safeScrape(apiUrl) {
  try {
    const response = await fetch(apiUrl);
    if (!response.ok) {
      throw new Error(`HTTP ${response.status}`);
    }
    const data = await response.json();
    if (data.error) {
      throw new Error(data.error);
    }
    return data.result;
  } catch (error) {
    console.error('Scraping failed:', error.message);
    return null;
  }
}

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Commit your changes: git commit -m 'Add amazing feature'
  4. Push to the branch: git push origin feature/amazing-feature
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Support


Happy Scraping!
