#621 add batching for cloud updates, fix cloud requests #1544
base: main
Conversation
/**
 * Flush all waiting updates for this slice to the slice leader. The request will fail if the
 * leader goes down before handling it.
 */
We could add a retry policy to avoid this, but it would complicate the code even more. Additionally, missed Solr updates should not cause long-term issues in the crawl, since pages get re-fetched after some time. Or am I missing something here?
They only get re-fetched if the user has enabled re-fetch on error or hasn't disabled re-fetching entirely :)
Hmm, then I will add a simple retry policy to be on the safe side.
I have added logic that re-queues update batches that failed (e.g. the shard leader we sent the update to went down before handling the request). The batch will be dealt with the next time a flush is triggered.
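To illustrate the re-queue behaviour described above, here is a minimal, self-contained sketch. All names are hypothetical (the actual implementation lives in `SolrConnection` and sends real SolrJ requests); this toy version only models the queueing: batches whose send fails are put back and retried on the next flush.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical per-slice buffer that re-queues batches whose flush failed. */
class SliceUpdateBuffer {
    private final Deque<List<String>> pending = new ArrayDeque<>();

    void enqueue(List<String> batch) {
        pending.addLast(batch);
    }

    /**
     * Try to send every pending batch; a batch whose send fails (e.g. the
     * slice leader went down before handling the request) is put back and
     * retried on the next flush. Returns the number of batches sent.
     */
    int flush(Predicate<List<String>> send) {
        int sent = 0;
        int toTry = pending.size(); // only attempt batches queued before this flush
        List<List<String>> failed = new ArrayList<>();
        for (int i = 0; i < toTry; i++) {
            List<String> batch = pending.pollFirst();
            if (send.test(batch)) {
                sent++;
            } else {
                failed.add(batch); // re-queue for the next flush
            }
        }
        pending.addAll(failed);
        return sent;
    }

    int pendingBatches() {
        return pending.size();
    }
}
```

The `Predicate` stands in for the actual HTTP request to the slice leader; the point is that a failed batch is never dropped, only deferred to the next flush cycle.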
external/solr/src/main/java/org/apache/stormcrawler/solr/SolrConnection.java
I am wondering if it would be possible to add some (simple) Testcontainers-based unit test? I guess it would need ZooKeeper for cloud mode? Maybe there is a way to do it, because it would be really good to have some coverage :)
external/solr/src/main/java/org/apache/stormcrawler/solr/SolrConnection.java
In principle, yes, but I had difficulties running everything in containers: the ZooKeeper container advertises something like "zookeeper:2181" to the Solr containers inside the Docker network, and from the host (where StormCrawler runs) you cannot resolve "zookeeper".
The guys from Apache Camel had a similar issue: https://github.com/apache/camel-quarkus/tree/e5e768acd680b0d78122fb7eee30b0a70947f3f9/integration-tests/solr/src/test - looks like their setup worked (somehow) - maybe we can just port it for our needs? They basically rely on Docker Compose (in two variants) to spin it up. Maybe worth a look?
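One possible way around the hostname problem, in the spirit of the Camel Quarkus Docker Compose approach, is to publish both ports to the host and make Solr register itself in ZooKeeper under a name the host can resolve. A rough, untested sketch; image tags and service names are assumptions, though `ZK_HOST` and `SOLR_HOST` are real environment variables of the official Solr Docker image:

```yaml
# Untested sketch: host-resolvable SolrCloud for integration tests.
services:
  zookeeper:
    image: zookeeper:3.9
    ports:
      - "2181:2181"       # lets the test on the host reach ZK as localhost:2181
  solr:
    image: solr:9
    depends_on:
      - zookeeper
    environment:
      ZK_HOST: zookeeper:2181   # in-network address Solr uses to reach ZK
      SOLR_HOST: localhost      # name Solr advertises in ZK, resolvable from the host
    ports:
      - "8983:8983"
```

With `SOLR_HOST` set, the live-node addresses a `CloudSolrClient` reads from ZooKeeper point at `localhost:8983` instead of an internal container hostname. This only works cleanly for a single Solr node per host port, but that may be enough for a simple test.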
Thanks @rzo1, this helps.
We can add tests later for it, so no direct blocker from my side |
@rzo1, I will create an issue for this and take care of it after this PR is merged. Is that ok?
if (slice == null) {
    LOG.error("Could not find an active slice for update {}", update);
This should not happen unless a shard is being split, which, in turn, should not happen in the context of StormCrawler while the crawl is running.
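For context, a toy illustration (not the SolrJ `DocRouter` API; all names here are made up) of why an update could end up without an active slice: routing picks a slice from the document id's hash bucket, and while a shard is being split the matching bucket may briefly map to no active slice.

```java
import java.util.Map;
import java.util.Optional;

/** Toy router mapping a document id to a slice name by hash bucket. */
class ToyRouter {
    private final Map<Integer, String> activeSliceByBucket; // bucket -> active slice name

    ToyRouter(Map<Integer, String> activeSliceByBucket) {
        this.activeSliceByBucket = activeSliceByBucket;
    }

    /** Returns the active slice for the id, or empty if none (e.g. mid-split). */
    Optional<String> sliceFor(String id, int numBuckets) {
        int bucket = Math.floorMod(id.hashCode(), numBuckets);
        return Optional.ofNullable(activeSliceByBucket.get(bucket));
    }
}
```

In a running crawl the bucket-to-slice map is stable, so the `slice == null` branch in the diff above is a defensive guard rather than an expected path.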
#621 Batching for Solr Cloud updates and other Cloud improvements