Incorrect clustering of records #3973
Comments
The process is described at https://data-blog.gbif.org/post/clustering-occurrences/ and you can join the discussion at https://discourse.gbif.org/t/identifying-potentially-related-records-how-does-the-gbif-data-clustring-feature-work-gbif-data-blog/3083
|
In this case, it is because the catalogue number of the SMF record (746) overlaps with the "other catalogue numbers" of the NMR record and the lack of country, date and coordinates in the SMF record makes it non-conflicting. @timrobertson100 is there a way to specify that two records shouldn't be clustered? |
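For readers following along, the behaviour described above can be sketched roughly as follows. This is a minimal illustration and not GBIF's actual clustering code: the record dictionaries, field names and function names are assumptions made for the example, and only the catalogue number 746 comes from this thread. `tighter_rule` is one possible reading of the "require a date or location match" idea raised later in the discussion.

```python
# Illustrative sketch only; not the GBIF pipeline implementation.
# All field and function names here are hypothetical.

CONTEXT_FIELDS = ("country", "eventDate", "coordinates")


def identifiers(record):
    """Collect the catalogue number and any other catalogue numbers."""
    ids = set(record.get("otherCatalogNumbers", []))
    if record.get("catalogNumber"):
        ids.add(record["catalogNumber"])
    return ids


def identifiers_overlap(a, b):
    """True when the two records share at least one identifier."""
    return bool(identifiers(a) & identifiers(b))


def non_conflicting(a, b):
    """True when no context field differs; a missing value never conflicts."""
    return all(
        a.get(f) is None or b.get(f) is None or a.get(f) == b.get(f)
        for f in CONTEXT_FIELDS
    )


def shares_date_or_location(a, b):
    """True when the date or the coordinates are present in both and equal."""
    return any(
        a.get(f) is not None and a.get(f) == b.get(f)
        for f in ("eventDate", "coordinates")
    )


def loose_rule(a, b):
    # Identifier overlap plus "nothing contradicts": sparse records pass easily.
    return identifiers_overlap(a, b) and non_conflicting(a, b)


def tighter_rule(a, b):
    # Identifier overlap plus positive evidence from a date or a location.
    return identifiers_overlap(a, b) and shares_date_or_location(a, b)


# Only "746" is taken from the thread; every other value is made up.
sparse = {"catalogNumber": "746"}
detailed = {
    "catalogNumber": "X-1",
    "otherCatalogNumbers": ["746", "Y-2"],
    "country": "NL",
    "eventDate": "1900-01-01",
    "coordinates": (52.0, 4.5),
}

print(loose_rule(sparse, detailed))    # True
print(tighter_rule(sparse, detailed))  # False
```

The sparse record has nothing that could contradict the detailed one, so the loose rule links them on the shared identifier alone, while the tighter variant does not.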
Thanks @ManonGros It's tricky with sparsely populated records like this. There's no current way to identify a specific record pair to ignore, but we could tighten the rules a bit - e.g. making it such that you need a match for either a date or location and not What do you think? |
Thanks @timrobertson100, yes, it might be a good idea to remove the non-conflicting date and location rule. Now that I see this example, I think this rule doesn't give enough evidence to infer the cluster. |
This has now been run in production and the records no longer cluster. This shows on the detail pages for the occurrences now, but it will take a little time before they disappear from the "is in cluster" search option. I've initiated a recrawl/reprocess for the two datasets involved, and that needs to complete for the search option to disappear. Thanks for offering to update the blogpost @ManonGros - can you please do that and then close this issue? |
Thanks Tim! I updated the blogpost (I originally put the link to this issue but I just changed it to point to the release note) |
Hi. I wonder whether you would prefer us to reuse this issue for other incorrectly clustered records we find, or to open separate issues. I am posting them here for now (please mention me if you open a new issue, so I can track it).
If those are expected behaviours I would like to understand the reasons. Thanks a lot |
@timrobertson100 could you take a look? For the other issue mentioned above, perhaps it is time to run an update? |
Mmmm, this is an unusual bug. The API call for that page is this: On there you can see the current record has a |
This is all fixed in code now.
That record no longer shows a cluster (it showed the cluster for
This is fixed in code and is being released and deployed to the production data pipelines now, after which we'll reprocess the dataset to clear the mistake in the search index. It was the same bug as the one above, but applied to how we build the search index. @abubelinha - thank you for raising this. Due to the way we hash records it would only appear occasionally, so it went unnoticed before. I'll close this knowing it's addressed in code and that the data will shortly be updated. |
Thank you @ManonGros and @timrobertson100 !
I understand the bug above was the "other catalogue numbers" overlap + "lack of data in certain other fields". Correct? So I couldn't actually see the original cluster ... but I searched for the taxon+location+date combination and I am pretty sure this was it: Occurrence 29606717 cluster (which still contains 8 occurrences). I can confirm it is a good cluster (I know these are all copies of the same herbarium specimen, shared in exchange with several institutions).
I am particularly interested in this situation: Thanks a lot for this useful feature anyway! |
This thread is getting a little difficult to read, but I will do my best to answer. The bug spotted didn't need a change in the clustering algorithm (how we detect related records) or e.g. the use of The SANT:SANT:44553-A record was never in this cluster which I can confirm by looking at the backend database that holds the links. So, let's understand why that is... The blog post has this table that summarises the conditions that can trigger a detection: Looking at the Sant record and the cluster we can see they have the same accepted species, date, but coordinates that are 875m apart and differently formatted collector names which could be improved as logged here. In this case, if the collector name were made identical Please remember we currently run clustering frequently, but not automatically, so any publication would require us to rerun clustering before it appeared.
I think we would need to define new rules for that. Currently, unless they are related to type specimens they all rely on the same accepted species, so reidentifications are not captured. I think we'd need quite a tight ruleset including identifiers overlap, collector overlap and similar date/location, and perhaps some higher order taxon to avoid too many false positives. Does this help with understanding what you have observed please, @abubelinha? Thanks again for the feedback |
Thanks @timrobertson100 for your clear explanation. As that observation had been in a cluster, and removed from it ... I was blindly assuming it had to be that cluster because it was the good one. My bad. As for the coordinates difference, I am afraid all occurrences in the cluster have been rounded to 0.01 Lat/Lon degrees precision. The cluster-excluded observation hasn't been rounded, which explains the discrepancy. I agree that normalizing some characters and removing spaces could improve clusters a lot. Regarding my suggestion of using But as you say this thread is getting difficult to read, I opened a new issue about this. |
Thanks - I had suspected something like that had happened and your record actually had the more accurate georeference. I chose the limits of 200m and 2km to accommodate 3 and 4 decimal place rounding globally, but here we are at 2 decimal places. I think we've arrived at a good point in understanding how things work and fixed the bug identified (i.e. that it wasn't working as expected). Let's continue the discussion on improving the normalisation of collector names and improvements relating to otherCatalogNumbers on those threads. Thanks for your interest in this! |
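To make the rounding point concrete, here is a small back-of-the-envelope sketch. It assumes a spherical Earth with a mean radius of 6,371 km and looks only at how far a point can move when both coordinates are rounded to a given number of decimal places. The function names are made up for the example, and the sketch does not model how the 200m and 2km limits are actually applied in the clustering code.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; a spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def worst_case_rounding_offset_m(decimals, lat=0.0, lon=0.0):
    """Largest shift caused by rounding both coordinates to `decimals` places.

    Rounding moves each coordinate by at most half a step, so the worst
    case is a diagonal move of half a step in latitude and in longitude.
    """
    half_step = 0.5 * 10 ** -decimals
    return haversine_m(lat, lon, lat + half_step, lon + half_step)


for decimals in (2, 3, 4):
    offset = worst_case_rounding_offset_m(decimals)
    print(f"{decimals} decimal places: up to about {offset:.0f} m")
```

Under these assumptions, rounding to 2 decimal places can shift a point by roughly 0.8 km at the equator, against less than 0.1 km for 3 decimal places, so a rounded and an unrounded copy of the same specimen can end up much further apart than they would with finer rounding.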
I found this pair of occurrences and I can't figure out why they are not in a cluster. I think they should be, for a number of reasons (same taxon, typification relationship, same coordinates, location, collector and date). https://www.gbif.org/occurrence/1936346601 |
They are in the same dataset. We only look for links between records across datasets to help e.g. transfer knowledge between institutions. There are also many datasets in which everything would cluster together (e.g. gut analysis), which raises a technical concern about cardinalities and whether we could actually calculate the links in a timely manner. |
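As a rough illustration of the cross-dataset restriction, the sketch below only yields candidate pairs whose records come from different datasets. The record structure, the identifiers and the function name are assumptions for the example and do not reflect how the production pipeline generates candidates.

```python
from itertools import combinations


def cross_dataset_pairs(records):
    """Yield id pairs for records that belong to different datasets."""
    for a, b in combinations(records, 2):
        if a["datasetKey"] != b["datasetKey"]:
            yield a["id"], b["id"]


# Made-up records: two from one dataset, one from another.
records = [
    {"id": "occ-1", "datasetKey": "dataset-A"},
    {"id": "occ-2", "datasetKey": "dataset-A"},
    {"id": "occ-3", "datasetKey": "dataset-B"},
]

pairs = list(cross_dataset_pairs(records))
print(pairs)  # [('occ-1', 'occ-3'), ('occ-2', 'occ-3')]
```

The pair from the same dataset is never compared, which also keeps the number of candidate pairs down for datasets in which nearly every record resembles every other.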
Ah OK. You mean there are datasets which contain lots of repetitive occurrences, don't you? I should have figured this out. Thanks |
Cluster, an experimental feature
We encountered an interesting new clustering feature in one of the Dutch databases. Two occurrences are clustered, but probably because one occurrence from Germany lacks data on date & location. See https://www.gbif.org/occurrence/3128544409/cluster
Why are these specimens clustered, and where can I find more information about this feature?