Kepler Web Scraper API

A powerful, lightweight web scraping API built with Hono and deployed on Cloudflare Workers. Extract text content and attributes from any website using CSS selectors.

Features

  • Blazing Fast - Powered by Cloudflare Workers edge network
  • CSS Selector Support - Use any CSS selector to target elements
  • Text & Attributes - Extract both text content and HTML attributes
  • CORS Enabled - Ready for browser-based applications
  • Pretty JSON - Optional formatted JSON responses
  • Error Handling - Comprehensive error responses

Quick Start

Prerequisites

  • Bun installed
  • A Cloudflare account (needed for deployment)
  • The Wrangler CLI, used for login and type generation

Installation

  1. Clone the repository:

    git clone <your-repo-url>
    cd Kepler
  2. Install dependencies:

    bun install
  3. Start development server:

    bun run dev

    The API will be available at http://localhost:8787
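
To confirm the server is running, you can try the same example.com request used throughout this README:

curl "http://localhost:8787/?url=https://example.com&selector=h1&pretty=true"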

API Documentation

Base URL

  • Development: http://localhost:8787
  • Production: https://kepler.sarvagyakrcs.workers.dev

Endpoints

GET /?url=<URL>&selector=<CSS_SELECTOR>&[options]

Scrapes content from the specified URL using the provided CSS selector.

Parameters

Parameter | Type    | Required | Description
----------|---------|----------|------------
url       | string  | Yes      | The URL of the webpage to scrape. Protocol is optional (defaults to http://)
selector  | string  | Yes      | CSS selector to target elements. Supports multiple selectors separated by commas
attr      | string  | No       | Extract a specific HTML attribute instead of text content
spaced    | boolean | No       | Add spaces between HTML tags when extracting text
pretty    | boolean | No       | Format JSON response with indentation

Response Format

Success Response

{
  "result": "extracted content" | ["array", "of", "results"] | {"selector": ["results"]}
}

Error Response

{
  "error": "Error message describing what went wrong"
}

Usage Examples

Basic Text Extraction

Extract the main heading from a webpage:

curl "http://localhost:8787/?url=https://example.com&selector=h1&pretty=true"

Response:

{
  "result": ["Example Domain"]
}

Multiple Elements

Extract all paragraph text:

curl "http://localhost:8787/?url=https://example.com&selector=p&pretty=true"

Response:

{
  "result": [
    "This domain is for use in illustrative examples in documents.",
    "More information..."
  ]
}

Multiple Selectors

Extract both headings and paragraphs:

curl "http://localhost:8787/?url=https://example.com&selector=h1,p&pretty=true"

Response:

{
  "result": {
    "h1": ["Example Domain"],
    "p": ["This domain is for use...", "More information..."]
  }
}

Attribute Extraction

Extract the href attributes of links on a page (example.com has a single link, so a single string is returned):

curl "http://localhost:8787/?url=https://example.com&selector=a&attr=href&pretty=true"

Response:

{
  "result": "https://www.iana.org/domains/example"
}

Spaced Text Extraction

Extract text with a space inserted between nested elements (without spaced=true, text from adjacent tags is concatenated with no separator):

curl "http://localhost:8787/?url=https://example.com&selector=div&spaced=true&pretty=true"

Real-World Examples

Extract Article Titles from Hacker News

curl "http://localhost:8787/?url=https://news.ycombinator.com&selector=.titleline>a&pretty=true"

Get GitHub Repository Description

curl "http://localhost:8787/?url=https://github.com/microsoft/vscode&selector=.BorderGrid-cell p&pretty=true"

Extract Product Prices

curl "http://localhost:8787/?url=https://example-shop.com&selector=.price&pretty=true"

Get Meta Description

curl "http://localhost:8787/?url=https://example.com&selector=meta[name='description']&attr=content&pretty=true"

JavaScript/Browser Usage

Fetch API

const scrapeData = async (url, selector) => {
  const apiUrl = `http://localhost:8787/?url=${encodeURIComponent(url)}&selector=${encodeURIComponent(selector)}&pretty=true`;
  
  try {
    const response = await fetch(apiUrl);
    const data = await response.json();
    return data.result;
  } catch (error) {
    console.error('Scraping failed:', error);
    return null;
  }
};

// Usage
const titles = await scrapeData('https://news.ycombinator.com', '.titleline>a');
console.log(titles);

jQuery

$.get('http://localhost:8787/', {
  url: 'https://example.com',
  selector: 'h1',
  pretty: true
}).done(function(data) {
  console.log(data.result);
});
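
TypeScript

For TypeScript callers, the three possible shapes of result documented in the Response Format section (a single string, an array, or an object keyed by selector) can be made explicit with a small typed wrapper. A minimal sketch, assuming the local development base URL; adjust it for production:

// Result shapes documented in the Response Format section above.
type ScrapeResult = string | string[] | Record<string, string[]>;

interface ScrapeResponse {
  result?: ScrapeResult;
  error?: string;
}

const BASE_URL = 'http://localhost:8787';

async function scrape(url: string, selector: string, attr?: string): Promise<ScrapeResult> {
  // URLSearchParams takes care of percent-encoding the values.
  const params = new URLSearchParams({ url, selector });
  if (attr) params.set('attr', attr);

  const response = await fetch(`${BASE_URL}/?${params}`);
  const data: ScrapeResponse = await response.json();

  if (!response.ok || data.error) {
    throw new Error(data.error ?? `HTTP ${response.status}`);
  }
  return data.result!;
}

// Usage
const hrefs = await scrape('https://example.com', 'a', 'href');
console.log(hrefs);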

Development

Project Structure

Kepler/
├── src/
│   ├── index.ts          # Main Hono application
│   ├── scraper.ts        # Web scraping logic
│   ├── content-types.ts  # MIME type constants
│   └── json-response.ts  # Response utilities
├── package.json
├── tsconfig.json
├── wrangler.jsonc        # Cloudflare Workers config
└── README.md

Available Scripts

Script             | Description
-------------------|------------
bun run dev        | Start development server with hot reload
bun run deploy     | Deploy to Cloudflare Workers
bun run cf-typegen | Generate TypeScript types for Workers
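
These scripts are most likely thin wrappers around the Wrangler CLI; a sketch of the package.json entries typically found in a Hono Workers template (the repo's actual file may differ):

{
  "scripts": {
    "dev": "wrangler dev",
    "deploy": "wrangler deploy",
    "cf-typegen": "wrangler types"
  }
}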

Environment Setup

  1. Generate Cloudflare types:

    bun run cf-typegen
  2. Configure Wrangler:

    wrangler login

Deployment

Deploy to Cloudflare Workers

  1. Login to Cloudflare:

    wrangler login
  2. Update worker name in wrangler.jsonc:

    {
      "name": "your-scraper-api",
      "main": "src/index.ts",
      "compatibility_date": "2025-06-19"
    }
  3. Deploy:

    bun run deploy

Custom Domain (Optional)

  1. Add a custom domain in the Cloudflare Workers dashboard
  2. Update your DNS settings
  3. Your API will be available at https://api.yourdomain.com
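
Custom domains can also be declared in wrangler.jsonc instead of the dashboard; a hedged sketch (the exact keys depend on your Wrangler version):

{
  "routes": [
    { "pattern": "api.yourdomain.com", "custom_domain": true }
  ]
}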

Configuration

CORS Settings

CORS is enabled by default for all origins. To restrict access, modify the CORS configuration in src/index.ts:

import { cors } from 'hono/cors'

app.use('/*', cors({
  origin: ['https://yourdomain.com'],
  allowMethods: ['GET'],
}))

Rate Limiting

Consider implementing rate limiting for production use:

// Add to src/index.ts
app.use('/*', async (c, next) => {
  // Implement your rate limiting logic
  await next()
})
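
As one possible starting point (not part of this repo), the sketch below keeps a fixed-window counter per client IP in the Worker's memory. In-memory state on Workers is per-isolate and short-lived, so a production setup would more likely use Durable Objects, KV, or Cloudflare's built-in rate limiting rules:

// Hypothetical middleware sketch — in-memory state is per-isolate and resets often on Workers.
import type { MiddlewareHandler } from 'hono'

const WINDOW_MS = 60_000    // length of each counting window
const MAX_REQUESTS = 60     // allowed requests per IP per window

const counters = new Map<string, { count: number; windowStart: number }>()

export const rateLimit: MiddlewareHandler = async (c, next) => {
  const ip = c.req.header('cf-connecting-ip') ?? 'unknown'
  const now = Date.now()
  const entry = counters.get(ip)

  if (!entry || now - entry.windowStart > WINDOW_MS) {
    // Start a new window for this IP.
    counters.set(ip, { count: 1, windowStart: now })
  } else if (++entry.count > MAX_REQUESTS) {
    return c.json({ error: 'Rate limit exceeded' }, 429)
  }

  await next()
}

// In src/index.ts:
// app.use('/*', rateLimit)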

Technical Details

Scraping Engine

  • Uses Cloudflare's built-in HTMLRewriter for efficient HTML parsing
  • Streams HTML content for memory efficiency
  • Supports complex CSS selectors including pseudo-selectors
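
The actual implementation lives in src/scraper.ts; as a rough sketch of the general approach (not the repo's exact code), text for a selector can be collected with an HTMLRewriter handler along these lines:

// Simplified sketch of HTMLRewriter-based extraction (not the repo's actual scraper.ts).
async function extractText(url: string, selector: string): Promise<string[]> {
  const upstream = await fetch(url)
  const results: string[] = []

  const rewriter = new HTMLRewriter().on(selector, {
    element() {
      // Start a new entry for each matched element.
      results.push('')
    },
    text(chunk) {
      // Text chunks inside the matched element are appended to the current entry.
      results[results.length - 1] += chunk.text
    },
  })

  // The transformed body must be consumed so the handlers actually run.
  await rewriter.transform(upstream).arrayBuffer()
  return results.map((text) => text.trim())
}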

Performance

  • Cold Start: ~50ms on Cloudflare Workers
  • Warm Requests: ~10-20ms
  • Memory Usage: ~5-10MB per request
  • Timeout: 30 seconds (Cloudflare Workers limit)

Limitations

  • Cannot execute JavaScript (static HTML only)
  • Subject to Cloudflare Workers CPU time limits
  • Cannot access localhost or private networks
  • Maximum response size: 128MB

Error Handling

The API returns appropriate HTTP status codes and error messages:

Status Code | Description
------------|------------
200         | Success
404         | Invalid endpoint or missing required parameters
500         | Scraping error (network, parsing, or server issues)

Common Error Scenarios

  1. Invalid URL:

    {"error": "Status 404 requesting https://invalid-url.com"}
  2. Invalid Selector:

    {"error": "Invalid CSS selector"}
  3. Network Timeout:

    {"error": "Request timeout"}

Best Practices

URL Encoding

Always encode URLs and selectors when making requests:

const encodedUrl = encodeURIComponent('https://example.com/path?param=value');
const encodedSelector = encodeURIComponent('div.class > p:first-child');
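
Alternatively, URLSearchParams handles the encoding for you when building the full request URL; a small sketch:

// URLSearchParams percent-encodes each value, so manual encodeURIComponent calls aren't needed.
const params = new URLSearchParams({
  url: 'https://example.com/path?param=value',
  selector: 'div.class > p:first-child',
  pretty: 'true',
});
const apiUrl = `http://localhost:8787/?${params.toString()}`;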

Selector Optimization

  • Use specific selectors to reduce parsing time
  • Avoid overly broad selectors like * or div
  • Test selectors in browser DevTools first

Error Handling

Always handle potential errors in your client code:

// Wrapping the request in a helper keeps the error handling in one place
// (apiUrl is built as shown in the URL Encoding section above).
async function safeScrape(apiUrl) {
  try {
    const response = await fetch(apiUrl);
    if (!response.ok) {
      throw new Error(`HTTP ${response.status}`);
    }
    const data = await response.json();
    if (data.error) {
      throw new Error(data.error);
    }
    return data.result;
  } catch (error) {
    console.error('Scraping failed:', error.message);
    return null;
  }
}

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Commit your changes: git commit -m 'Add amazing feature'
  4. Push to the branch: git push origin feature/amazing-feature
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Support


Happy Scraping!
