8000 Fix iterators in case we have removed an entry by klockla · Pull Request #120 · crawler-commons/url-frontier · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Fix iterators in case we have removed an entry #120

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 28, 2025

Conversation

klockla
Copy link
Collaborator
@klockla klockla commented Apr 22, 2025

Fix iterators in case we have removed an entry

(see issue #119 )

This can typically happen while iterating and we swap elements between the head and the tail of the queue The iterator on the insertion order map is weakly consistent, but accesses to the valuemap must be checked.

@klockla klockla self-assigned this Apr 22, 2025
@jnioche
Copy link
Collaborator
jnioche commented Apr 22, 2025

@klockla can you rebase post #121 ?
This should fix the failing check

8000

@klockla klockla force-pushed the fixiterators branch 2 times, most recently from 27c5a5a to 3eaaa1f Compare April 23, 2025 08:37
@klockla
Copy link
Collaborator Author
klockla commented Apr 23, 2025

I rebased my branch.
#121 fixed the build issue related to cache.
Thank you @jnioche

@jnioche jnioche added this to the 2.5 milestone Apr 23, 2025
@jnioche jnioche added the bug Something isn't working label Apr 23, 2025
Copy link
Collaborator
@jnioche jnioche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member
@rzo1 rzo1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would drop the sys.out in the tests, but otherwise lgtm.

This can typically happen while iterating and we swap elements between the head and the tail of the queue
The iterator on the insertion order map is weakly consistent, but accesses to the valuemap must be checked.

Signed-off-by: Laurent Klock <Laurent.Klock@arhs-cube.com>
(do not return null but throw NoSuchElementException)

Track first entry in insertion order map to prevent potential race condition in pollFirstEntry

Signed-off-by: Laurent Klock <Laurent.Klock@arhs-cube.com>
@klockla klockla marked this pull request as draft April 24, 2025 16:14
@klockla klockla marked this pull request as ready for review April 25, 2025 15:03
@klockla
Copy link
Collaborator Author
klockla commented Apr 28, 2025

Updated iterators
(Previous implementation was breaking next() specifications, it should never return null but throw NoSuchElementException)
Added tacking of first entry to prevent possible race condition in pollFirstEntry

@klockla klockla requested a review from jnioche April 28, 2025 07:58
@klockla

This comment was marked as off-topic.

@klockla klockla merged commit 46dcbc6 into crawler-commons:master Apr 28, 2025
2 checks passed
@jnioche
Copy link
Collaborator
jnioche commented Apr 28, 2025

Dis, est-ce qu’éventuellement dans le cadre de StormCrawler, ca vous intéresserait d’avoir un bolt pour faire du PII removal avec Microsoft Presidio ( https://microsoft.github.io/presidio/ )

Possibly- assuming it is open source with a license compatible with ASF projects. I assume the best way to do it would be to assume that it is running as a server e.g. in a Docker container and use the API

@rzo1
Copy link
Member
rzo1 commented Apr 28, 2025

It is MIT, so would be okay. Would also be cool to abstract the impl in such a bolt. We could than also be able to use https://github.com/philterd/phileas (from Jeff from OpenNLP) 🫠

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0