#621 add batching for cloud updates, fix cloud requests #1544

mvolikas · 2025-05-31T14:07:28Z

#621 Batching for Solr Cloud updates and other Cloud improvements

I have tested this in Solr Cloud with multiple shards for the status collection and 2 replicas.
I tried to make the minimal set of changes needed to give the Cloud client "async" support while keeping most of the advantages SolrJ offers out of the box. This approach is not as robust as SolrJ's CloudSolrClient, but I think it is sufficient in the context of crawling.

mvolikas · 2025-05-31T14:08:28Z

external/solr/src/main/java/org/apache/stormcrawler/solr/SolrConnection.java

+    /**
+     * Flush all waiting updates for this slice to the slice leader. The request will fail if the
+     * leader goes down before handling it.
+     */


We could add a retry policy to avoid this, but it would complicate the code even more. Additionally, missed Solr updates should not cause long-term issues in the crawl, since pages get refetched after some time, or am I missing something here?

They only get re-fetched if the user has enabled re-fetch on error or has not disabled re-fetching altogether :)

Hmm, then I will add a simple retry policy to be on the safe side.

I have added logic that re-queues update batches that failed (e.g. the shard leader we sent the update to went down before handling the request). The batch will be dealt with the next time a flush is triggered.
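The re-queue behaviour described above can be sketched roughly as follows. This is a minimal, synchronous illustration and not the code from this PR: the class `SliceUpdateBuffer`, its methods, and the `sender` callback are hypothetical names, and the callback simply stands in for whatever sends a batch to the slice leader and reports success or failure.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical per-slice buffer; a sketch, not the actual SolrConnection code. */
class SliceUpdateBuffer<T> {
    private final int batchSize;
    private final List<T> waiting = new ArrayList<>();
    private final Deque<List<T>> failedBatches = new ArrayDeque<>();

    SliceUpdateBuffer(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Buffers an update; returns true once enough updates wait to fill a batch. */
    synchronized boolean add(T update) {
        waiting.add(update);
        return waiting.size() >= batchSize;
    }

    /**
     * Sends previously failed batches first, then the currently waiting updates.
     * The sender returns false when the request failed (assumption: for example,
     * the slice leader went down). Failed batches are kept for the next flush.
     * Returns the number of batches that were sent successfully.
     */
    synchronized int flush(Predicate<List<T>> sender) {
        List<List<T>> toSend = new ArrayList<>(failedBatches);
        failedBatches.clear();
        if (!waiting.isEmpty()) {
            toSend.add(new ArrayList<>(waiting));
            waiting.clear();
        }
        int sent = 0;
        for (List<T> batch : toSend) {
            if (sender.test(batch)) {
                sent++;
            } else {
                failedBatches.addLast(batch); // retried on the next flush
            }
        }
        return sent;
    }

    /** Number of batches waiting to be retried. */
    synchronized int pendingFailedBatches() {
        return failedBatches.size();
    }
}
```

A caller would invoke `flush` whenever `add` returns true or a timer fires. In this sketch a failed batch is not retried immediately, so a leader that is still down is not hit again until the next flush.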

external/solr/src/main/java/org/apache/stormcrawler/solr/SolrConnection.java

rzo1 · 2025-05-31T18:02:40Z

I am wondering whether it would be possible to add a (simple) Testcontainers-based unit test. I guess it would need ZooKeeper for cloud mode? Maybe there is a way to do it, because it would be really good to have some coverage :)

external/solr/src/main/java/org/apache/stormcrawler/solr/SolrConnection.java

mvolikas · 2025-06-01T13:56:32Z

> I am wondering whether it would be possible to add a (simple) Testcontainers-based unit test. I guess it would need ZooKeeper for cloud mode? Maybe there is a way to do it, because it would be really good to have some coverage :)

In principle, yes, but I had difficulties running everything in containers because the ZooKeeper container advertises something like "zookeeper:2181" to the Solr containers inside the Docker network, and from the host (where StormCrawler runs), you cannot resolve "zookeeper".
That being said, I agree that we should have some tests for Solr Cloud, and I will give it another try.

rzo1 · 2025-06-02T19:14:35Z

> > I am wondering whether it would be possible to add a (simple) Testcontainers-based unit test. I guess it would need ZooKeeper for cloud mode? Maybe there is a way to do it, because it would be really good to have some coverage :)

> In principle, yes, but I had difficulties running everything in containers because the ZooKeeper container advertises something like "zookeeper:2181" to the Solr containers inside the Docker network, and from the host (where StormCrawler runs), you cannot resolve "zookeeper". That being said, I agree that we should have some tests for Solr Cloud, and I will give it another try.

The guys from Apache Camel had a similar issue: https://github.com/apache/camel-quarkus/tree/e5e768acd680b0d78122fb7eee30b0a70947f3f9/integration-tests/solr/src/test - looks like their setup worked (somehow) - maybe we can just port it for our needs?

They basically rely on Docker Compose (in two variants) to spin it up. Maybe worth a look?

mvolikas · 2025-06-10T13:35:01Z

> The guys from Apache Camel had a similar issue: https://github.com/apache/camel-quarkus/tree/e5e768acd680b0d78122fb7eee30b0a70947f3f9/integration-tests/solr/src/test - looks like their setup worked (somehow) - maybe we can just port it for our needs?
>
> They basically rely on Docker Compose (in two variants) to spin it up. Maybe worth a look?

Thanks @rzo1, this helps.
The remaining issue is that ZooKeeper hands StormCrawler Solr endpoints such as "solr1:8983", which StormCrawler cannot resolve because it is not part of the Docker network.
I should be able to find a way around this, though.

rzo1 · 2025-06-10T13:37:50Z

We can add tests for it later, so no direct blocker from my side.

mvolikas · 2025-06-15T12:32:42Z

> We can add tests for it later, so no direct blocker from my side.

@rzo1, I will create an issue for this and take care of it after this PR is merged. Is that OK?

mvolikas · 2025-06-15T12:36:51Z

external/solr/src/main/java/org/apache/stormcrawler/solr/SolrConnection.java

+            }
+
+            if (slice == null) {
+                LOG.error("Could not find an active slice for update {}", update);


This should not happen unless a shard is being split, which, in turn, should not happen in the context of StormCrawler while the crawl is running.
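To make the `slice == null` branch concrete, here is a simplified sketch of routing an update to a slice by hash range. It is not SolrJ's router and not the code in this PR: `SliceInfo`, `SliceLookup`, and the integer hash ranges are hypothetical stand-ins, assuming only that each slice owns a contiguous hash range and can be active or inactive.

```java
import java.util.List;

/** Hypothetical, simplified stand-in for a shard ("slice") and its hash range. */
class SliceInfo {
    final String name;
    final int minHash;
    final int maxHash;
    final boolean active;

    SliceInfo(String name, int minHash, int maxHash, boolean active) {
        this.name = name;
        this.minHash = minHash;
        this.maxHash = maxHash;
        this.active = active;
    }

    /** True if this slice is active and its range contains the given hash. */
    boolean covers(int hash) {
        return active && hash >= minHash && hash <= maxHash;
    }
}

class SliceLookup {
    /** Returns the active slice whose range contains the hash, or null if none does. */
    static SliceInfo findActiveSlice(List<SliceInfo> slices, int hash) {
        for (SliceInfo slice : slices) {
            if (slice.covers(hash)) {
                return slice;
            }
        }
        return null; // the caller logs an error for this update
    }
}
```

With ranges that cover the whole hash space and all slices active, the lookup always succeeds. It returns null only when the slice owning the hash is not active, which is the case the error log above guards against.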

apache#621 add batching for cloud updates, fix cloud requests

fa48ef0

mvolikas self-assigned this May 31, 2025

mvolikas commented May 31, 2025

external/solr/src/main/java/org/apache/stormcrawler/solr/SolrConnection.java

mvolikas requested review from jnioche, tballison and rzo1 May 31, 2025 14:09

rzo1 reviewed May 31, 2025

external/solr/src/main/java/org/apache/stormcrawler/solr/SolrConnection.java

mvolikas added 2 commits June 14, 2025 22:02

apache#621 add external properties for batching, update readme

9abebf0

apache#621 fix: re-queue failed batches

a0230bf

mvolikas requested a review from rzo1 June 15, 2025 12:33

mvolikas commented Jun 15, 2025

mvolikas linked an issue Jun 15, 2025 that may be closed by this pull request

[Improvement] Test coverage for Solr in cloud mode #1560

Open

mvolikas removed a link to an issue Jun 15, 2025

[Improvement] Test coverage for Solr in cloud mode #1560

Open

rzo1 approved these changes Jun 17, 2025

rzo1 added this to the 3.4.0 milestone Jun 17, 2025

rzo1 added the SOLR label Jun 17, 2025
