Fix: Local HTML Files crawling bug by saipavanmeruga7797 · Pull Request #1073 · unclecode/crawl4ai · GitHub

Fix: Local HTML Files crawling bug #1073


Open

wants to merge 1 commit into main
Conversation

@saipavanmeruga7797 commented May 4, 2025

Summary

This PR fixes #1072, a bug that prevents users from scraping local HTML files.

List of files changed and why

async_crawler_strategy.py -- Inside the crawl() function (starting at line 421), the elif branch on line 448 handles file:// URLs and references the captured_console variable, but that variable is only assigned when capture_console_messages is enabled in the config. Since capture_console_messages is False by default, captured_console is never assigned, so the file:// branch hits an unbound local variable (unless the user explicitly sets capture_console_messages to True). To solve this, I initialized captured_console = [] on line 443, mirroring the other functions where captured_console is initialized at the start of the function. The failure pattern is sketched below.
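
For illustration, here is a minimal, self-contained sketch of the failure pattern and the fix. The function and variable names below are hypothetical stand-ins, not the actual async_crawler_strategy.py code:

def crawl_before_fix(capture_console_messages: bool) -> dict:
    if capture_console_messages:
        captured_console = []  # only bound when the flag is True
    # the file:// branch references the variable unconditionally:
    return {"console": captured_console}  # unbound local when the flag is False


def crawl_after_fix(capture_console_messages: bool) -> dict:
    captured_console = []  # initialized unconditionally, as the PR does on line 443
    if capture_console_messages:
        captured_console.append("console capture enabled")
    return {"console": captured_console}


print(crawl_after_fix(False))   # {'console': []} -- works with the default flag
# crawl_before_fix(False)       # would raise UnboundLocalError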

How Has This Been Tested?

After applying the fix, I tested the change by scraping 50-60 different local HTML files via file:// URLs; a representative run is sketched below.
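
For reference, a minimal sketch of that kind of test run, assuming the fix is applied. The file path is a placeholder; point it at a real local HTML file:

import asyncio
from pathlib import Path

from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import CrawlerRunConfig


async def main():
    # Placeholder path; adjust to any local HTML file on disk.
    file_url = f"file://{Path('/tmp/sample.html').resolve()}"
    # With the fix applied, capture_console_messages can stay at its default (False).
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=file_url, config=config)
        print(result.markdown if result.success else result.error_message)


asyncio.run(main())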

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@sandy1890

Ran into the same bug here—debugging points to the same root cause. Since this PR already covers it, I’ll skip a duplicate. +1 on the fix!

The following code exhibits the reported issue:

import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

browser_config = BrowserConfig(
    browser_type="chromium",
    headless=True,
    verbose=True,
    extra_args=[],
)


async def crawl_local_file():
    local_file_path = "/code/english-list.html"  # adjust to an existing local HTML file
    file_url = f"file://{local_file_path}"

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        verbose=True,
        capture_console_messages=True,  # !!! must be true for local file
        extraction_strategy=JsonCssExtractionStrategy(
            schema={
                "name": "test",
                "baseSelector": ".detail-feed-video-item",
                "fields": [
                    {
                        "name": "title",
                        "selector": ".left-img .cover",
                        "type": "attribute",
                        "attribute": "alt",
                    },
                    {
                        "name": "vedio",
                        "selector": ".left-img",
                        "type": "attribute",
                        "attribute": "href",
                    },
                    {
                        "name": "cover",
                        "selector": ".cover",
                        "type": "attribute",
                        "attribute": "href",
                    },
                ],
            },
        ),
    )
    async with AsyncWebCrawler(
        config=browser_config,
    ) as crawler:
        result = await crawler.arun(
            url=file_url,
            config=crawler_config,
        )
        if result.success:
            print("Markdown Content from Local File:")
            print(result.markdown)
        else:
            print(f"Failed to crawl local file: {result.error_message}")


asyncio.run(crawl_local_file())
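
Note: setting capture_console_messages=True in the snippet above is only a workaround; it forces the branch that binds captured_console. With the one-line initialization added in this PR, the crawl succeeds with the flag left at its default of False.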

@scris commented May 11, 2025

@unclecode Can we merge this? I think this functionality is necessary.
