A powerful, lightweight web scraping API built with Hono and deployed on Cloudflare Workers. Extract text content and attributes from any website using CSS selectors.
- Blazing Fast - Powered by Cloudflare Workers edge network
- CSS Selector Support - Use any CSS selector to target elements
- Text & Attributes - Extract both text content and HTML attributes
- CORS Enabled - Ready for browser-based applications
- Pretty JSON - Optional formatted JSON responses
- Error Handling - Comprehensive error responses
- Bun (recommended) or Node.js
- Wrangler CLI
- Clone the repository:

  ```bash
  git clone <your-repo-url>
  cd Kepler
  ```

- Install dependencies:

  ```bash
  bun install
  ```

- Start the development server:

  ```bash
  bun run dev
  ```

The API will be available at `http://localhost:8787`.
- Development: `http://localhost:8787`
- Production: `https://kepler.sarvagyakrcs.workers.dev`
Scrapes content from the specified URL using the provided CSS selector.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | The URL of the webpage to scrape. Protocol is optional (defaults to `http://`). |
| `selector` | string | Yes | CSS selector to target elements. Supports multiple selectors separated by commas. |
| `attr` | string | No | Extract a specific HTML attribute instead of text content. |
| `spaced` | boolean | No | Add spaces between HTML tags when extracting text. |
| `pretty` | boolean | No | Format the JSON response with indentation. |
Success responses have the shape:

```
{
  "result": "extracted content" | ["array", "of", "results"] | { "selector": ["results"] }
}
```

Depending on the selector and matches, `result` is a string, an array of strings, or an object keyed by selector (see the examples below).

Error responses have the shape:

```json
{
  "error": "Error message describing what went wrong"
}
```
Extract the main heading from a webpage:

```bash
curl "http://localhost:8787/?url=https://example.com&selector=h1&pretty=true"
```

Response:

```json
{
  "result": ["Example Domain"]
}
```

Extract all paragraph text:

```bash
curl "http://localhost:8787/?url=https://example.com&selector=p&pretty=true"
```

Response:

```json
{
  "result": [
    "This domain is for use in illustrative examples in documents.",
    "More information..."
  ]
}
```

Extract both headings and paragraphs:

```bash
curl "http://localhost:8787/?url=https://example.com&selector=h1,p&pretty=true"
```

Response:

```json
{
  "result": {
    "h1": ["Example Domain"],
    "p": ["This domain is for use...", "More information..."]
  }
}
```

Extract all links from a page:

```bash
curl "http://localhost:8787/?url=https://example.com&selector=a&attr=href&pretty=true"
```

Response:

```json
{
  "result": "https://www.iana.org/domains/example"
}
```

Extract text with spaces between nested elements:

```bash
curl "http://localhost:8787/?url=https://example.com&selector=div&spaced=true&pretty=true"
```
curl "http://localhost:8787/?url=https://news.ycombinator.com&selector=.titleline>a&pretty=true"
curl "http://localhost:8787/?url=https://github.com/microsoft/vscode&selector=.BorderGrid-cell p&pretty=true"
curl "http://localhost:8787/?url=https://example-shop.com&selector=.price&pretty=true"
curl "http://localhost:8787/?url=https://example.com&selector=meta[name='description']&attr=content&pretty=true"
```javascript
const scrapeData = async (url, selector) => {
  // Encode user input so special characters survive the query string
  const apiUrl = `http://localhost:8787/?url=${encodeURIComponent(url)}&selector=${encodeURIComponent(selector)}&pretty=true`;
  try {
    const response = await fetch(apiUrl);
    const data = await response.json();
    return data.result;
  } catch (error) {
    console.error('Scraping failed:', error);
    return null;
  }
};

// Usage
const titles = await scrapeData('https://news.ycombinator.com', '.titleline>a');
console.log(titles);
```
```javascript
$.get('http://localhost:8787/', {
  url: 'https://example.com',
  selector: 'h1',
  pretty: true
}).done(function (data) {
  console.log(data.result);
});
```
```
Kepler/
├── src/
│   ├── index.ts          # Main Hono application
│   ├── scraper.ts        # Web scraping logic
│   ├── content-types.ts  # MIME type constants
│   └── json-response.ts  # Response utilities
├── package.json
├── tsconfig.json
├── wrangler.jsonc        # Cloudflare Workers config
└── README.md
```
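For orientation, here is a minimal, hypothetical sketch of the shape of `src/index.ts`; the real file delegates to `scraper.ts` and the `json-response` helpers, but the request flow is similar:

```typescript
// Hypothetical sketch of src/index.ts, illustrative only.
import { Hono } from 'hono'
import { cors } from 'hono/cors'

const app = new Hono()

// CORS is enabled for all origins by default (see the Security notes below)
app.use('/*', cors())

app.get('/', async (c) => {
  const url = c.req.query('url')
  const selector = c.req.query('selector')

  // Missing required parameters -> 404, matching the error table below
  if (!url || !selector) {
    return c.json({ error: 'url and selector are required' }, 404)
  }

  // The real app calls into src/scraper.ts here, which uses HTMLRewriter
  // to collect text or attribute values for the given selector.
  return c.json({ result: [] })
})

export default app
```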
| Script | Description |
|---|---|
| `bun run dev` | Start development server with hot reload |
| `bun run deploy` | Deploy to Cloudflare Workers |
| `bun run cf-typegen` | Generate TypeScript types for Workers |
- Generate Cloudflare types:

  ```bash
  bun run cf-typegen
  ```

- Log in to Cloudflare:

  ```bash
  wrangler login
  ```

- Update the worker name in `wrangler.jsonc`:

  ```jsonc
  {
    "name": "your-scraper-api",
    "main": "src/index.ts",
    "compatibility_date": "2025-06-19"
  }
  ```

- Deploy:

  ```bash
  bun run deploy
  ```
- Add a custom domain in the Cloudflare Workers dashboard
- Update your DNS settings
- Your API will be available at `https://api.yourdomain.com`
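Alternatively (an option Wrangler supports, not part of the original setup), a custom domain can be declared directly in `wrangler.jsonc`:

```jsonc
{
  "name": "your-scraper-api",
  "main": "src/index.ts",
  "compatibility_date": "2025-06-19",
  // Requires the domain's DNS to be managed by Cloudflare
  "routes": [
    { "pattern": "api.yourdomain.com", "custom_domain": true }
  ]
}
```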
CORS is enabled by default for all origins. To restrict access, modify the CORS configuration in `src/index.ts`:

```typescript
import { cors } from 'hono/cors'

app.use('/*', cors({
  origin: ['https://yourdomain.com'],
  allowMethods: ['GET'],
}))
```
Consider implementing rate limiting for production use:

```typescript
// Add to src/index.ts
app.use('/*', async (c, next) => {
  // Implement your rate limiting logic
  await next()
})
```
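One minimal option is a fixed-window counter keyed by client IP. The sketch below is an assumption, not part of Kepler: it keeps counts in per-isolate memory (so they reset whenever the Worker is recycled), and the window and limit values are arbitrary. Production deployments would more likely use KV, Durable Objects, or Cloudflare's built-in rate limiting rules.

```typescript
// Hypothetical fixed-window limiter; assumes the Hono `app` from src/index.ts.
// Counts live in per-isolate memory only and reset when the Worker recycles.
const WINDOW_MS = 60_000   // 1-minute window (assumed value)
const MAX_REQUESTS = 60    // max requests per window (assumed value)
const hits = new Map<string, { count: number; windowStart: number }>()

app.use('/*', async (c, next) => {
  // Cloudflare sets CF-Connecting-IP on incoming requests
  const ip = c.req.header('CF-Connecting-IP') ?? 'unknown'
  const now = Date.now()
  const entry = hits.get(ip)

  if (!entry || now - entry.windowStart > WINDOW_MS) {
    hits.set(ip, { count: 1, windowStart: now })
  } else if (++entry.count > MAX_REQUESTS) {
    return c.json({ error: 'Rate limit exceeded' }, 429)
  }
  await next()
})
```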
- Uses Cloudflare's built-in `HTMLRewriter` for efficient HTML parsing (see the sketch after this list)
- Streams HTML content for memory efficiency
- Supports complex CSS selectors, including pseudo-selectors
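As a rough sketch of that pattern (simplified and hypothetical; the actual `src/scraper.ts` also handles attribute extraction, comma-separated selectors, and the `spaced` option):

```typescript
// Simplified sketch of the HTMLRewriter pattern used for scraping.
async function extractText(url: string, selector: string): Promise<string[]> {
  const response = await fetch(url)
  const results: string[] = []
  let buffer = ''

  await new HTMLRewriter()
    .on(selector, {
      text(chunk) {
        // Text arrives in chunks; lastInTextNode marks a text-node boundary
        buffer += chunk.text
        if (chunk.lastInTextNode) {
          if (buffer.trim()) results.push(buffer.trim())
          buffer = ''
        }
      },
    })
    .transform(response)
    .arrayBuffer() // consume the stream so the handlers actually run

  return results
}
```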
- Cold Start: ~50ms on Cloudflare Workers
- Warm Requests: ~10-20ms
- Memory Usage: ~5-10MB per request
- Timeout: 30 seconds (Cloudflare Workers limit)
- Cannot execute JavaScript (static HTML only)
- Subject to Cloudflare Workers CPU time limits
- Cannot access localhost or private networks
- Maximum response size: 128MB
The API returns appropriate HTTP status codes and error messages:
| Status Code | Description |
|---|---|
| 200 | Success |
| 404 | Invalid endpoint or missing required parameters |
| 500 | Scraping error (network, parsing, or server issues) |
- Invalid URL: `{"error": "Status 404 requesting https://invalid-url.com"}`
- Invalid selector: `{"error": "Invalid CSS selector"}`
- Network timeout: `{"error": "Request timeout"}`
Always encode URLs and selectors when making requests:

```javascript
const encodedUrl = encodeURIComponent('https://example.com/path?param=value');
const encodedSelector = encodeURIComponent('div.class > p:first-child');
```
- Use specific selectors to reduce parsing time
- Avoid overly broad selectors like `*` or `div`
- Test selectors in browser DevTools first, as shown below
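For example, you can preview what a selector will match by running a quick check in the DevTools console on the target page (using `.titleline>a` from the Hacker News example above):

```js
// Run in the browser console on the target page to preview matches
[...document.querySelectorAll('.titleline>a')].map((el) => el.textContent);
```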
Always handle potential errors in your client code:

```javascript
// Wrapping the checks in a helper (the name safeScrape is illustrative)
const safeScrape = async (apiUrl) => {
  try {
    const response = await fetch(apiUrl);
    if (!response.ok) {
      throw new Error(`HTTP ${response.status}`);
    }
    const data = await response.json();
    if (data.error) {
      throw new Error(data.error);
    }
    return data.result;
  } catch (error) {
    console.error('Scraping failed:', error.message);
    return null;
  }
};
```
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Commit your changes: `git commit -m 'Add amazing feature'`
- Push to the branch: `git push origin feature/amazing-feature`
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Hono - Ultrafast web framework
- Cloudflare Workers - Edge computing platform
- HTMLRewriter - Server-side HTML parsing
Happy Scraping!