Added ConcurrentOrderedMap to avoid synchronization on GetQueues #117
AbstractFrontierService will use it instead of a synchronized LinkedHashMap to avoid contention during long-running operations such as ListURLs or CountURLs.
Signed-off-by: Laurent Klock <Laurent.Klock@arhs-cube.com>
See comments about license headers.
Can you provide some figures showing what is gained when iterating on large queues compared to what we currently have?
service/src/main/java/crawlercommons/urlfrontier/service/ConcurrentInsertionOrderMap.java
service/src/main/java/crawlercommons/urlfrontier/service/ConcurrentOrderedMap.java
service/src/test/java/crawlercommons/urlfrontier/service/ConcurrentOrderedMapTest.java
Regarding the performance comparison, I set up a local stress test which does the following:
Each thread switches between two crawls containing 350257 and 697743 URLs. With 10 threads each running the loop 10 times, I get the following numbers:
Looks good. Maybe change the title of this PR so that it explains succinctly what it achieves?
Nit: Shouldn't the copyright year be 2025 for the newly added classes? :)
Done
The purpose of this PR is to avoid synchronizing on the internal queue map
(that is, to avoid synchronized blocks around GetQueues calls).
The synchronized(getQueues()) call causes contention during long queue traversals, which are very likely to occur during calls to CountURLs and ListURLs.
This change is based on the ConcurrentOrderedMap class, which uses Guava striped locks to provide
fine-grained locking, and an internal ConcurrentHashMap and ConcurrentSkipListMap to maintain insertion order.
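For readers who want a feel for the approach, here is a minimal sketch of an insertion-ordered concurrent map built on the same two JDK structures. It is not the ConcurrentOrderedMap code from this PR: the class name `InsertionOrderedMapSketch`, its methods, and the sequence-number scheme are all assumptions made for illustration, and a plain array of `ReentrantLock` stands in for Guava's striped locks so that the example needs no dependency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch only; not the class added by this PR. */
final class InsertionOrderedMapSketch<K, V> {

    /** A value together with the sequence number assigned at first insertion. */
    private static final class Entry<V> {
        final long seq;
        volatile V value;

        Entry(long seq, V value) {
            this.seq = seq;
            this.value = value;
        }
    }

    // Key lookup; readers never take a lock.
    private final ConcurrentHashMap<K, Entry<V>> byKey = new ConcurrentHashMap<>();
    // Sequence number -> key; sorted by sequence, hence by insertion order.
    private final ConcurrentSkipListMap<Long, K> bySeq = new ConcurrentSkipListMap<>();
    private final AtomicLong nextSeq = new AtomicLong();
    // Stand-in for Guava striped locks: one lock per hash bucket.
    private final ReentrantLock[] stripes;

    InsertionOrderedMapSketch(int stripeCount) {
        stripes = new ReentrantLock[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = new ReentrantLock();
        }
    }

    private ReentrantLock stripeFor(Object key) {
        return stripes[Math.floorMod(key.hashCode(), stripes.length)];
    }

    /** Inserts or updates; an update keeps the original insertion position. */
    V put(K key, V value) {
        ReentrantLock lock = stripeFor(key);
        lock.lock(); // only writers to the same stripe wait on each other
        try {
            Entry<V> existing = byKey.get(key);
            if (existing != null) {
                V old = existing.value;
                existing.value = value;
                return old;
            }
            long seq = nextSeq.getAndIncrement();
            byKey.put(key, new Entry<>(seq, value));
            bySeq.put(seq, key);
            return null;
        } finally {
            lock.unlock();
        }
    }

    V get(K key) {
        Entry<V> e = byKey.get(key);
        return e == null ? null : e.value;
    }

    V remove(K key) {
        ReentrantLock lock = stripeFor(key);
        lock.lock();
        try {
            Entry<V> e = byKey.remove(key);
            if (e == null) {
                return null;
            }
            bySeq.remove(e.seq);
            return e.value;
        } finally {
            lock.unlock();
        }
    }

    /** Weakly consistent traversal: takes no lock, so it never blocks writers. */
    List<K> keysInInsertionOrder() {
        List<K> keys = new ArrayList<>();
        for (K key : bySeq.values()) {
            keys.add(key);
        }
        return keys;
    }
}
```

The point of this design is that a long traversal iterates the skip list without holding any lock, so concurrent `put` and `remove` calls proceed instead of waiting on a single map-wide monitor. The trade-off is that the traversal is only weakly consistent: it may or may not reflect changes made while it is running.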