[Bug]: target_element does influence link extraction #902
Comments
Root Cause Analysis: Link Count Discrepancy with Target Elements
Issue
When using …
Root Cause
The issue was caused by shared references to DOM nodes. When elements were selected for …
Solution
We implemented similar solutions for both implementations: …
We chose reparsing over …
This approach successfully decoupled content targeting from link extraction in both implementations, ensuring consistent link counts regardless of target element settings.
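To make the shared-reference problem concrete, here is a minimal sketch using only the Python standard library. It is not crawl4ai's code: the sample markup, `strip_links`, and `links_after_targeting` are hypothetical stand-ins, and `xml.etree.ElementTree` is used in place of whatever parser the scraping strategies actually rely on. It only illustrates how mutating a selected subtree in place also changes the tree that link extraction later walks, and how working on a copy avoids that.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical page: one link in the navigation, two inside #main.
HTML = (
    "<html><body>"
    "<nav><a href='/home'>Home</a></nav>"
    "<div id='main'><p><a href='/a'>A</a> and <a href='/b'>B</a></p></div>"
    "</body></html>"
)

def count_links(root):
    """Count every <a> element in the tree rooted at `root`."""
    return len(list(root.iter("a")))

def strip_links(element):
    """Stand-in for content cleaning: remove every <a> below `element`, in place."""
    for parent in list(element.iter()):
        for child in list(parent):
            if child.tag == "a":
                parent.remove(child)

def links_after_targeting(use_copy):
    """Select #main, clean it, then count links in the whole document."""
    root = ET.fromstring(HTML)
    target = root.find(".//div[@id='main']")
    if use_copy:
        # Decouple the target from the document before mutating it.
        target = copy.deepcopy(target)
    strip_links(target)
    return count_links(root)

shared_count = links_after_targeting(use_copy=False)
copied_count = links_after_targeting(use_copy=True)
print(shared_count, copied_count)  # prints: 1 3
```

With the shared reference, the two links inside `#main` vanish from the document-wide count; with the copy, all three survive. Reparsing the HTML for the target selection would decouple the two passes in the same way.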
…imply by changing order of execution :) #902
I just tried the deepcopy fix above (d2648ea). I cannot speak to its efficiency compared to reparsing or other methods; that would require some benchmarking work.
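As a starting point for that benchmarking, a rough sketch with `timeit` could look like the following. It is an assumption-heavy illustration: it uses the standard library's `xml.etree.ElementTree` and a synthetic document, not the parsers or pages crawl4ai actually works with, so the numbers it prints say nothing about the real implementations.

```python
import copy
import timeit
import xml.etree.ElementTree as ET

# Synthetic document with 500 links inside #main (hypothetical test data).
DOC = (
    "<html><body><div id='main'>"
    + "<p><a href='/x'>x</a></p>" * 500
    + "</div></body></html>"
)
root = ET.fromstring(DOC)

def via_deepcopy():
    """Get an independent tree by copying the already parsed one."""
    return copy.deepcopy(root)

def via_reparse():
    """Get an independent tree by parsing the source string again."""
    return ET.fromstring(DOC)

deepcopy_s = timeit.timeit(via_deepcopy, number=20)
reparse_s = timeit.timeit(via_reparse, number=20)
print(f"deepcopy: {deepcopy_s:.4f}s  reparse: {reparse_s:.4f}s")
```

A meaningful comparison would repeat this with the project's own parser and a few real pages of different sizes.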
…ue in content generated. unclecode#902
crawl4ai version
0.5.0.post8
Expected Behavior
As stated in the docs:
So I expect to get the same number of links whether or not target_elements is set.
Current Behavior
Without target_elements, I am getting 727 links returned.
With the config target_elements=["#main"], I am getting 410 links returned.
Interestingly, some of the missing links are located inside the main div.
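One way to pin down which links disappear is to diff the two result sets. The helper below is a hedged sketch: `missing_links` is a hypothetical name, the sample data is made up, and it assumes each link is a dict with an `"href"` key, which should be checked against the actual structure of the crawl result.

```python
def missing_links(full, targeted):
    """Return hrefs present in the full crawl but absent from the targeted crawl."""
    seen = {link["href"] for link in targeted}
    return sorted({link["href"] for link in full} - seen)

# Made-up sample data standing in for the two crawl results.
full_links = [{"href": "/a"}, {"href": "/b"}, {"href": "/home"}]
targeted_links = [{"href": "/home"}]

print(missing_links(full_links, targeted_links))  # prints: ['/a', '/b']
```

Checking whether the reported hrefs occur inside the `#main` element of the page source would confirm the observation above.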
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
1. Execute the code snippet.
2. Comment out the line target_elements=["#main"].
3. Execute the code snippet again.
Code snippets
OS
Windows
Python version
3.12.0
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
python .\minimal_repo.py
[INIT].... → Crawl4AI 0.5.0.post8
[FETCH]... ↓ https://www.schorndorf.de/de/stadt-buerger/rathaus... | Status: True | Time: 1.68s
[SCRAPE].. ◆ https://www.schorndorf.de/de/stadt-buerger/rathaus... | Time: 0.332s
[COMPLETE] ● https://www.schorndorf.de/de/stadt-buerger/rathaus... | Status: True | Total: 2.02s
727
python .\minimal_repo.py
[INIT].... → Crawl4AI 0.5.0.post8
[FETCH]... ↓ https://www.schorndorf.de/de/stadt-buerger/rathaus... | Status: True | Time: 1.58s
[SCRAPE].. ◆ https://www.schorndorf.de/de/stadt-buerger/rathaus... | Time: 0.183s
[COMPLETE] ● https://www.schorndorf.de/de/stadt-buerger/rathaus... | Status: True | Total: 1.77s
410