Releases: unclecode/crawl4ai
v0.6.3
Release 0.6.3 (unreleased)
Features
- extraction: add `RegexExtractionStrategy` for pattern-based extraction, including built-in patterns for emails, URLs, phones, and dates, support for custom regexes, an LLM-assisted pattern generator, optimized HTML preprocessing via `fit_html`, and enhanced network response body capture (9b5ccac)
- docker-api: introduce job-based polling endpoints (`POST /crawl/job` & `GET /crawl/job/{task_id}` for crawls, `POST /llm/job` & `GET /llm/job/{task_id}` for LLM tasks) backed by Redis task management with configurable TTL; move schemas to `schemas.py` and add a `demo_docker_polling.py` example (94e9959)
- browser: improve profile management and cleanup; add process cleanup for existing Chromium instances on Windows/Unix, fix profile creation by passing the full browser config, ship detailed browser/CLI docs and an initial profile-creation test, and bump the version to 0.6.3 (9499164)
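The idea behind pattern-based extraction can be illustrated with plain `re`; this is a minimal sketch of what a regex strategy with named built-in patterns does, not the library's exact API (the pattern set, function name, and output shape here are illustrative):

```python
import re

# Illustrative built-in patterns; the library ships more robust equivalents.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "url": re.compile(r"https?://[^\s\"'<>]+"),
}

def extract(html: str, kinds=("email", "url")):
    """Return {kind: [matches]} for each requested pattern."""
    return {k: PATTERNS[k].findall(html) for k in kinds}

sample = '<a href="https://example.com/contact">mail me: team@example.com</a>'
print(extract(sample))
```

Because matching is pure regex over preprocessed HTML, this path avoids LLM calls entirely, which is what makes the strategy cheap to run at scale.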
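The job endpoints follow a submit-then-poll pattern: `POST` returns a task id, and the client polls the `GET` endpoint until the task finishes. A transport-agnostic sketch (the endpoint paths come from the notes above, but the callables, backoff values, and `task_id`/`status`/`result` field names are assumptions):

```python
import time

def poll_job(submit, get_status, payload, interval=1.0, timeout=30.0):
    """Submit a job, then poll its status until it completes or times out.

    `submit` and `get_status` stand in for HTTP calls to POST /crawl/job and
    GET /crawl/job/{task_id}.
    """
    task_id = submit(payload)["task_id"]
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = get_status(task_id)
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {task_id} did not finish within {timeout}s")

# Fake server for demonstration: reports "completed" on the second poll.
_state = {"polls": 0}
def fake_submit(payload):
    return {"task_id": "abc123"}
def fake_status(task_id):
    _state["polls"] += 1
    if _state["polls"] < 2:
        return {"status": "processing"}
    return {"status": "completed", "result": {"url": "https://example.com"}}

print(poll_job(fake_submit, fake_status, {"urls": ["https://example.com"]}, interval=0.01))
```

Polling with a client-side timeout pairs naturally with the server-side Redis TTL mentioned above: both sides bound how long an abandoned job lingers.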
Fixes
- crawler: remove automatic page closure in `take_screenshot` and `take_screenshot_naive`, preventing premature teardown; callers must now explicitly close pages (BREAKING CHANGE) (a3e9ef9)
Documentation
- format bash scripts in `docs/apps/linkdin/README.md` so examples copy & paste cleanly (87d4b0f)
- update the same README with full `litellm` argument details for correct script usage (bd5a9ac)
Refactoring
- logger: centralize color codes behind an `Enum` in `async_logger`, `browser_profiler`, `content_filter_strategy`, and related modules for cleaner, type-safe formatting (cd2b490)
Experimental
- start migrating the logging stack to `rich` (work in progress) (b2f3cb0)
Crawl4AI 0.6.0
🚀 0.6.0 — 22 Apr 2025
Highlights
- World‑aware crawlers:

```python
from crawl4ai import CrawlerRunConfig, GeolocationConfig

crun_cfg = CrawlerRunConfig(
    url="https://browserleaks.com/geo",  # test page that shows your location
    locale="en-US",                      # Accept-Language & UI locale
    timezone_id="America/Los_Angeles",   # JS Date()/Intl timezone
    geolocation=GeolocationConfig(       # override GPS coords
        latitude=34.0522,
        longitude=-118.2437,
        accuracy=10.0,
    ),
)
```
- Table‑to‑DataFrame extraction: run `df = pd.DataFrame(result.media["tables"][0]["rows"], columns=result.media["tables"][0]["headers"])` and get CSV or pandas output without extra parsing.
- Crawler pool with pre‑warming: pages launch hot, with lower P90 latency and lower memory use.
- Network and console capture: full traffic log plus MHTML snapshot for audits and debugging.
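Given the `{"headers": [...], "rows": [[...], ...]}` table shape shown in the DataFrame snippet above, CSV export needs nothing beyond the stdlib. A sketch with illustrative stand-in data:

```python
import csv
import io

def table_to_csv(table: dict) -> str:
    """Serialize one extracted table ({"headers": [...], "rows": [[...], ...]}) to CSV."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(table["headers"])
    writer.writerows(table["rows"])
    return buf.getvalue()

# Illustrative stand-in for result.media["tables"][0]
table = {"headers": ["coin", "price"], "rows": [["BTC", "64000"], ["ETH", "3100"]]}
print(table_to_csv(table))
```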
Added
- Geolocation, locale, and timezone flags for every crawl.
- Browser pooling with page pre‑warming.
- Table extractor that exports to CSV or pandas.
- Crawler pool manager in SDK and Docker API.
- Network & console log capture, plus MHTML snapshot.
- MCP socket and SSE endpoints with playground UI.
- Stress‑test framework (`tests/memory`) for 1k+ URL runs.
- Docs v2: TOC, GitHub badge, copy‑code buttons, Docker API demo.
- “Ask AI” helper button, work in progress, shipping soon.
- New examples: geo location, network/console capture, Docker API, markdown source selection, crypto analysis.
Changed
- Browser strategy consolidation; legacy docker modules removed.
- `ProxyConfig` moved to `async_configs`.
- Server migrated to pool‑based crawler management.
- FastAPI validators replace custom query validation.
- Docker build now uses a Chromium base image.
- Repo cleanup: ≈36k insertions, ≈5k deletions across 121 files.
Fixed
- Session leaks, duplicate visits, URL normalisation.
- Target‑element regressions in scraping strategies.
- Logged URL readability, encoded URL decoding, middle truncation.
- Closed issues: #701 #733 #756 #774 #804 #822 #839 #841 #842 #843 #867 #902 #911.
Removed
- Obsolete modules in `crawl4ai/browser/*`.
Deprecated
- Old markdown generator names now alias `DefaultMarkdownGenerator` and emit a deprecation warning.
Upgrade notes
- Update any imports from `crawl4ai/browser/*` to the new pooled browser modules.
- If you override `AsyncPlaywrightCrawlerStrategy.get_page`, adopt the new signature.
- Rebuild Docker images to pick up the Chromium layer.
- Switch to `DefaultMarkdownGenerator` to silence deprecation warnings.
121 files changed, ≈36 223 insertions, ≈4 975 deletions
Crawl4AI v0.5.0.post1
Release Theme: Power, Flexibility, and Scalability
Crawl4AI v0.5.0 is a major release focused on significantly enhancing the library's power, flexibility, and scalability.
Key Features
- Deep Crawling System - Explore websites beyond initial URLs with BFS, DFS, and BestFirst strategies, with page limiting and scoring capabilities
- Memory-Adaptive Dispatcher - Scale to thousands of URLs with intelligent memory monitoring and concurrency control
- Multiple Crawling Strategies - Choose between browser-based (Playwright) or lightweight HTTP-only crawling
- Docker Deployment - Easy deployment with FastAPI server, JWT authentication, and streaming/non-streaming endpoints
- Command-Line Interface - New `crwl` CLI provides convenient access to all features with intuitive commands
- Browser Profiler - Create and manage persistent browser profiles to save authentication states for protected content
- Crawl4AI Coding Assistant - Interactive chat interface for asking questions about Crawl4AI and generating Python code examples
- LXML Scraping Mode - Fast HTML parsing using the `lxml` library for a 10-20x speedup on complex pages
- Proxy Rotation - Built-in support for dynamic proxy switching with authentication and session persistence
- PDF Processing - Extract and process data from PDF files (both local and remote)
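The BFS strategy with a page limit can be sketched independently of the library; the link graph, callable, and `max_pages` knob below are illustrative, and the real strategies additionally score and prioritize URLs (the BestFirst variant):

```python
from collections import deque

def bfs_crawl(start: str, get_links, max_pages: int = 10):
    """Breadth-first traversal from `start`, visiting at most `max_pages` URLs.

    `get_links(url)` stands in for fetching a page and extracting its links.
    """
    seen, order = {start}, []
    queue = deque([start])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Tiny in-memory site standing in for real pages
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c"], "/c": []}
print(bfs_crawl("/", lambda u: site.get(u, []), max_pages=3))
```

Swapping the `deque` for a stack gives DFS, and a priority queue keyed on a score gives best-first traversal, which is the essential difference between the three strategies.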
Additional Improvements
- LLM Content Filter for intelligent markdown generation
- URL redirection tracking
- LLM-powered schema generation utility for extraction templates
- robots.txt compliance support
- Enhanced browser context management
- Improved serialization and config handling
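robots.txt compliance of the kind listed above can be checked with the stdlib's `urllib.robotparser`; the rules below are illustrative, and a real crawler would fetch the live `robots.txt` instead of parsing hard-coded lines:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Illustrative rules; normally fetched from https://<host>/robots.txt
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("crawl4ai", "https://example.com/public/page"))   # allowed
print(rp.can_fetch("crawl4ai", "https://example.com/private/page"))  # disallowed
```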
Breaking Changes
This release contains several breaking changes. Please review the full release notes for migration guidance.
For complete details, visit: https://docs.crawl4ai.com/blog/releases/0.5.0/