Fix: Local HTML Files crawling bug by saipavanmeruga7797 · Pull Request #1073 · unclecode/crawl4ai · GitHub

Fix: Local HTML Files crawling bug #1073


Open

wants to merge 1 commit into main
Conversation

@saipavanmeruga7797 commented May 4, 2025

Summary

This PR fixes #1072, a bug that prevents users from scraping local HTML files.

List of files changed and why

async_crawler_strategy.py -- Inside the crawl() function (starting at line 421), the elif branch on line 448 handles file:// URLs and references the captured_console variable, but that variable is only assigned when capture_console_messages is enabled in the config. Since capture_console_messages is False by default, captured_console is never assigned, so the file:// branch hits an unbound local variable (unless the user explicitly sets capture_console_messages to True). To solve this, I initialized captured_console = [] on line 443, mirroring the other functions where captured_console is initialized at the start of the function. The failure pattern is sketched below.
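
For illustration, here is a minimal, self-contained sketch of the failure pattern and the fix. The function and variable names below are hypothetical stand-ins, not the actual async_crawler_strategy.py code:

def crawl_before_fix(capture_console_messages: bool) -> dict:
    if capture_console_messages:
        captured_console = []  # only bound when the flag is True
    # the file:// branch references the variable unconditionally:
    return {"console": captured_console}  # unbound local when the flag is False


def crawl_after_fix(capture_console_messages: bool) -> dict:
    captured_console = []  # initialized unconditionally, as the PR does on line 443
    if capture_console_messages:
        captured_console.append("console capture enabled")
    return {"console": captured_console}


print(crawl_after_fix(False))   # {'console': []} -- works with the default flag
# crawl_before_fix(False)       # would raise UnboundLocalError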

How Has This Been Tested?

After applying the fix, I tested the change by scraping 50-60 different local HTML files via file:// URLs; a representative run is sketched below.
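
For reference, a minimal sketch of that kind of test run, assuming the fix is applied. The file path is a placeholder; point it at a real local HTML file:

import asyncio
from pathlib import Path

from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import CrawlerRunConfig


async def main():
    # Placeholder path; adjust to any local HTML file on disk.
    file_url = f"file://{Path('/tmp/sample.html').resolve()}"
    # With the fix applied, capture_console_messages can stay at its default (False).
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=file_url, config=config)
        print(result.markdown if result.success else result.error_message)


asyncio.run(main())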

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@sandy1890

Ran into the same bug here—debugging points to the same root cause. Since this PR already covers it, I’ll skip a duplicate. +1 on the fix!

The following code exhibits the reported issue:

import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

browser_config = BrowserConfig(
    browser_type="chromium",
    headless=True,
    verbose=True,
    extra_args=[],
)


async def crawl_local_file():
    local_file_path = "/code/english-list.html"  # adjust to an existing local HTML file
    file_url = f"file://{local_file_path}"

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        verbose=True,
        capture_console_messages=True,  # !!! must be true for local file
        extraction_strategy=JsonCssExtractionStrategy(
            schema={
                "name": "test",
                "baseSelector": ".detail-feed-video-item",
                "fields": [
                    {
                        "name": "title",
                        "selector": ".left-img .cover",
                        "type": "attribute",
                        "attribute": "alt",
                    },
                    {
                        "name": "vedio",
                        "selector": ".left-img",
                        "type": "attribute",
                        "attribute": "href",
                    },
                    {
                        "name": "cover",
                        "selector": ".cover",
                        "type": "attribute",
                        "attribute": "href",
                    },
                ],
            },
        ),
    )
    async with AsyncWebCrawler(
        config=browser_config,
    ) as crawler:
        result = await crawler.arun(
            url=file_url,
            config=crawler_config,
        )
        if result.success:
            print("Markdown Content from Local File:")
            print(result.markdown)
        else:
            print(f"Failed to crawl local file: {result.error_message}")


asyncio.run(crawl_local_file())
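
Note: setting capture_console_messages=True in the snippet above is only a workaround; it forces the branch that binds captured_console. With the one-line initialization added in this PR, the crawl succeeds with the flag left at its default of False.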

@scris commented May 11, 2025

@unclecode Can we merge this? I think this functionality is necessary.
