8000 Added ConcurrentOrdereredMap to avoid synchronization on GetQueues by klockla · Pull Request #117 · crawler-commons/url-frontier · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Added ConcurrentOrdereredMap to avoid synchronization on GetQueues #117

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 11, 2025

Conversation

klockla
Copy link
Collaborator
@klockla klockla commented Mar 7, 2025

The purpose of this PR is to avoid the synchronization on the internal queue map.
(Avoid synchronized blocks around GetQueues calls)

The synchronized(getQueues()) call causes contention during long queue traversals which are very likely to occur during calls to countURLs & ListURLs.

This change is based on the ConcurrentOrdereredMap class which uses Guava striped Locks to provide fine
fine-grained locking and internal ConcurrentHashMap and ConcurrentSkipListMap for maintaining insertion order.

AbstractFrontierService will use it instead of a synchronized LinkedHashMap to avoid contention during long lasting operations such as ListURLs or CountURLs.

Signed-off-by: Laurent Klock <Laurent.Klock@arhs-cube.com>
@klockla klockla requested a review from jnioche March 7, 2025 15:09
@klockla klockla self-assigned this Mar 7, 2025
Copy link
Collaborator
@jnioche jnioche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments about license headers.
Can you provide some figures showing what is gained when iterating on large queues compared to what we currently have?

Signed-off-by: Laurent Klock <Laurent.Klock@arhs-cube.com>
@klockla
Copy link
Collaborator Author
klockla commented Mar 10, 2025

Regarding the performance comparison, I setup locally a stress test which does the following:

  • Launch N threads wich will do the following X times each
  • Call CountURLs
  • Call ListURLs (start 0, size 100)
  • Call getURLStatus on a random element from the previously returned ListURLs
  • Call getURLs on a specific queue (maxURLsPerQueue: 1, delayRequestable: 30)

Each thread switches between two crawls containing 350257 and 697743 URLs

With 10 threads doing 10 times the loop, I get the following numbers:

  • With synchronized LinkedHashMap : 311 secs
  • With the new ConcurrentOrderedMap: 53 secs.

@jnioche jnioche added this to the 2.5 milestone Mar 10, 2025
@jnioche jnioche added the enhancement New feature or request label Mar 10, 2025
@jnioche
Copy link
Collaborator
jnioche commented Mar 10, 2025

Looks good. Maybe change the title of this PR so that it explains succinctly what it achieves?

Copy link
Member
@rzo1 rzo1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Shouldn't the copyright year be 2025 for the newly added classes ? :)

@klockla klockla changed the title Added ConcurrentOrdereredMap Added ConcurrentOrdereredMap to avoid synchronization on GetQueues Mar 11, 2025
Signed-off-by: Laurent Klock <Laurent.Klock@arhs-cube.com>
@klockla
Copy link
Collaborator Author
klockla commented Mar 11, 2025

Nit: Shouldn't the copyright year be 2025 for the newly added classes ? :)

Done

@jnioche jnioche merged commit 191cacc into crawler-commons:master Mar 11, 2025
2 checks passed
@jnioche
Copy link
Collaborator
jnioche commented Mar 11, 2025

thanks @klockla for your work on this and @rzo1 for the review

@klockla klockla deleted the concurrentorderedmap branch March 12, 2025 08:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0