8000 #621 add batching for cloud updates, fix cloud requests by mvolikas · Pull Request #1544 · apache/stormcrawler · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

#621 add batching for cloud updates, fix cloud requests #1544

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

mvolikas
Copy link
Contributor
@mvolikas mvolikas commented May 31, 2025

#621 Batching for Solr Cloud updates and other Cloud improvements

  • I have tested this in Solr Cloud with multiple shards for the status collection, and 2 replicas.
  • I tried to make the minimum set of changes to have "async" support for the Cloud client while keeping most of the advantages offered by solrj out of the box. Of course, this approach is not as robust as the solrj CloudSolrClient, but I think it is enough in the context of crawling.

@mvolikas mvolikas self-assigned this May 31, 2025
/**
* Flush all waiting updates for this slice to the slice leader The request will fail, if the
* leader goes down before handling it
*/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a retry policy to avoid this, but it will complicate the code even more. Additionally, missed Solr updates should not cause long-term issues in the crawl, since pages get refetched after some time, or am I missing something here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They only get re-fetched, if the user has enabled re-fetch on error or didn't disable re-fetching at all :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, then I will add a simple retry policy to be on the safe side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added logic that re-queues update batches that failed (e.g. the shard leader we sent the update to went down before handling the request). The batch will be dealt with the next time a flush is triggered.

< 8000 div data-show-on-forbidden-error hidden>

Uh oh!

There was an error while loading. Please reload this page.

@mvolikas mvolikas requested review from jnioche, tballison and rzo1 May 31, 2025 14:09
@rzo1
Copy link
Contributor
rzo1 commented May 31, 2025

I am wondering, if it would be possible, to add some (simple) Testcontainer based unit test? I guess, it would need zookeeper for cloud mode? Maybe there is a way to do it, because it would be really good to have some coverage :)

@mvolikas
Copy link
Contributor Author
mvolikas commented Jun 1, 2025

I am wondering, if it would be possible, to add some (simple) Testcontainer based unit test? I guess, it would need zookeeper for cloud mode? Maybe there is a way to do it, because it would be really good to have some coverage :)

In principle, yes, but I had difficulties running everything in containers because the Zookeeper container advertises something like "zookeeper:2181" to the Solr containers inside the Docker network, and from the host (where StormCrawler runs), you cannot resolve "zookeeper".
That being said, I agree that we should have some tests for Solr Cloud, and I will give it another try.

@rzo1
Copy link
Contributor
rzo1 commented Jun 2, 2025

I am wondering, if it would be possible, to add some (simple) Testcontainer based unit test? I guess, it would need zookeeper for cloud mode? Maybe there is a way to do it, because it would be really good to have some coverage :)

In principle, yes, but I had difficulties running everything in containers because the Zookeeper container advertises something like "zookeeper:2181" to the Solr containers inside the Docker network, and from the host (where StormCrawler runs), you cannot resolve "zookeeper". That being said, I agree that we should have some tests for Solr Cloud, and I will give it another try.

The guys from Apache Camel had a similar issue: https://github.com/apache/camel-quarkus/tree/e5e768acd680b0d78122fb7eee30b0a70947f3f9/integration-tests/solr/src/test - looks like their setup worked (somehow) - maybe we can just port it for our needs?

They bascially rely on docker compose (in two variants) to spin it up. Maybe worth a look?

@mvolikas
Copy link
Contributor Author

The guys from Apache Camel had a similar issue: https://github.com/apache/camel-quarkus/tree/e5e768acd680b0d78122fb7eee30b0a70947f3f9/integration-tests/solr/src/test - looks like their setup worked (somehow) - maybe we can just port it for our needs?

They bascially rely on docker compose (in two variants) to spin it up. Maybe worth a look?

Thanks @rzo1, this helps.
The issue that remains is that the zookeeper gives back to StormCrawler Solr endpoints like "solr1:8983" that StormCrawler cannot resolve because it's not part of the Docker network.
I should be able to find a way around this, though.

@rzo1
Copy link
Contributor
rzo1 commented Jun 10, 2025

We can add tests later for it, so no direct blocker from my side

@mvolikas
Copy link
Contributor Author
mvolikas commented Jun 15, 2025

We can add tests later for it, so no direct blocker from my side

@rzo1, I will create an issue for this and take care after this PR is merged. Is that ok?

@mvolikas mvolikas requested a review from rzo1 June 15, 2025 12:33
8000
}

if (slice == null) {
LOG.error("Could not find an active slice for update {}", update);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not happen unless a shard is being split, which, in turn, should not happen in the context of StormCrawler while the crawl is running.

@rzo1 rzo1 added this to the 3.4.0 milestone Jun 17, 2025
@rzo1 rzo1 added the SOLR label Jun 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0