allow filtering by `mets:div/@TYPE` · Issue #1328 · OCR-D/core · GitHub

allow filtering by mets:div/@TYPE #1328


Open
bertsky opened this issue May 6, 2025 · 2 comments · May be fixed by #1329

Comments

@bertsky
Collaborator
bertsky commented May 6, 2025

If the METS already contains a logical structMap (or we run a processor like ocrd-anybaseocr-layout-analysis as a first step), there will be information on the type of every page (content vs title vs empty vs cover etc). This in turn might be very useful for filtering the page range when processing, complementary to the --page-id selection mechanism we already have (i.e. a list or regex of physical page @ID, @ORDER, @ORDERLABEL, @LABEL, or @CONTENTIDS).

We might tie the list of types to consider to DFG Strukturdatenset – but in many cases, libraries extend that anyway (or use different spellings, like titlecase).

For example, a user may want to rule out processing cover_front,cover_back,binding pages. (Not only to avoid wasting resources, but because some processors may not handle these well, or even crash.)

Our current syntax does not allow for negative selection, so we'd need some new operator. I suggest ~, since ! might be interpreted by the shell via history expansion (even when quoted to prevent pathname expansion).

For .. range expressions, negation only makes sense when applied to the entire expression. For comma-separated lists, every atom could be negated individually. For regexes, negation would again apply to the entire expression.

So how about `--page-id ~PHYS_0001..PHYS_0004`, `--page-id ~cover_front,~cover_back,~binding`, or `--page-id "~//(cover_(front|back)|binding)"`?

(We might even want to support selecting logical div @ID in addition to @TYPE, e.g. --page-id ~LOG_0001..LOG_0005 ...)
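
A minimal sketch of how the proposed syntax could be parsed and applied to page types. It assumes that atoms are split on commas, that `~` negates a single atom, that `//` introduces a regex matched against the whole value, and that a selector with only negated atoms starts from all pages. The names `Atom`, `parse_selector`, `matches` and `select_pages` are hypothetical and not part of OCR-D/core; range matching is left out, and regexes containing commas are not handled.

```python
import re
from typing import Dict, List, NamedTuple


class Atom(NamedTuple):
    negated: bool
    kind: str  # "regex", "range" or "literal"
    value: str


def parse_selector(expr: str) -> List[Atom]:
    """Split a selector on commas and classify each (optionally negated) atom."""
    atoms = []
    for token in expr.split(","):
        negated = token.startswith("~")
        if negated:
            token = token[1:]
        if token.startswith("//"):
            atoms.append(Atom(negated, "regex", token[2:]))
        elif ".." in token:
            # the negation applies to the range as a whole
            atoms.append(Atom(negated, "range", token))
        else:
            atoms.append(Atom(negated, "literal", token))
    return atoms


def matches(atom: Atom, value: str) -> bool:
    """Test one atom against one attribute value (assumption: full match)."""
    if atom.kind == "regex":
        return re.fullmatch(atom.value, value) is not None
    if atom.kind == "literal":
        return atom.value == value
    raise NotImplementedError("range matching is omitted from this sketch")


def select_pages(page_types: Dict[str, str], expr: str) -> List[str]:
    """Return the page IDs whose type passes the selector.

    Assumption: without positive atoms, all pages are candidates;
    negated atoms then remove pages from the candidates.
    """
    atoms = parse_selector(expr)
    positive = [a for a in atoms if not a.negated]
    negative = [a for a in atoms if a.negated]
    selected = []
    for page_id, page_type in page_types.items():
        included = not positive or any(matches(a, page_type) for a in positive)
        excluded = any(matches(a, page_type) for a in negative)
        if included and not excluded:
            selected.append(page_id)
    return selected


# hypothetical mapping of physical page ID to logical type
page_types = {
    "PHYS_0001": "cover_front",
    "PHYS_0002": "title_page",
    "PHYS_0003": "chapter",
    "PHYS_0004": "cover_back",
}
print(select_pages(page_types, "~cover_front,~cover_back,~binding"))
print(select_pages(page_types, "~//(cover_(front|back)|binding)"))
```

Both calls exclude the two cover pages; the list form and the regex form are meant to be equivalent here.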

@bertsky
Collaborator Author
bertsky commented May 7, 2025

See #1329 for a possible implementation.

One thing that might be problematic here: we now have (more) potential clashes across attributes. We used to have a disclaimer:

    def _initialize_caches(self) -> None:
        self._file_cache = {}
        # NOTE we can only guarantee uniqueness for @ID and @ORDER
        self._page_cache = {k : {} for k in METS_PAGE_DIV_ATTRIBUTE}
        self._fptr_cache = {}

Indeed, already the attributes @ORDERLABEL and @LABEL could be non-unique, but the data structures for our METS caching do not represent this (i.e. some keys will displace others, effectively preventing multiple matches).

But with this change, the types from logical structMap divs add to the potential confusion: while @ID and @DMDID should still be unique, @TYPE certainly is not. While #1329 does allow multiple matches for @TYPE alone (by simply concatenating all corresponding physical page divs), we might now have additional clashes between @TYPE and (say) @LABEL, which this implementation does not prevent.
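
The displacement effect can be illustrated with plain dictionaries. This is only a sketch with made-up page data, not the actual caching code in OCR-D/core: a cache keyed by attribute value keeps a single entry per key, whereas a mapping from value to a list keeps every match.

```python
from collections import defaultdict

# hypothetical (physical page ID, @LABEL) pairs; two pages share a label
pages = [
    ("PHYS_0001", "Cover"),
    ("PHYS_0002", "Plate"),
    ("PHYS_0003", "Plate"),
]

# one value per key: the later page displaces the earlier one
flat_cache = {}
for page_id, label in pages:
    flat_cache[label] = page_id

# one list per key: all pages with the same label are retained
multi_cache = defaultdict(list)
for page_id, label in pages:
    multi_cache[label].append(page_id)

print(flat_cache["Plate"])   # only the last page with this label survives
print(multi_cache["Plate"])  # both pages are found
```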

@bertsky
Collaborator Author
bertsky commented May 21, 2025

Regarding possible conflation of too many attributes during matching (@TYPE with @ID or @LABEL, logical with physical):

@kba suggests adding prefixes to the expression atoms, e.g. `logical:`.

This could perhaps be extended to even (optionally) disambiguate attributes, e.g. `logical:type:~//(cover_(front|back)|binding|colour_checker),logical:id:~LOG_0001,physical:order:10..100`. This would then be safer and faster.
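
A rough sketch of how such prefixed atoms might be taken apart, assuming both prefixes are optional, the structMap prefix comes first, and `~` follows the prefixes. The function name, the `PrefixedAtom` tuple and the accepted attribute names are assumptions for illustration; nothing here reflects an agreed syntax or existing OCR-D/core code.

```python
from typing import List, NamedTuple, Optional

STRUCTS = {"logical", "physical"}
# assumed lower-case spellings of the attributes mentioned in this thread
ATTRIBUTES = {"id", "type", "label", "order", "orderlabel", "dmdid", "contentids"}


class PrefixedAtom(NamedTuple):
    struct: Optional[str]     # "logical", "physical" or None if not given
    attribute: Optional[str]  # e.g. "type", or None if not given
    negated: bool
    pattern: str              # literal, range or //regex, left unparsed


def parse_prefixed_atom(atom: str) -> PrefixedAtom:
    struct = attribute = None
    head, sep, rest = atom.partition(":")
    if sep and head in STRUCTS:
        struct, atom = head, rest
        head, sep, rest = atom.partition(":")
    if sep and head in ATTRIBUTES:
        attribute, atom = head, rest
    negated = atom.startswith("~")
    if negated:
        atom = atom[1:]
    return PrefixedAtom(struct, attribute, negated, atom)


def parse_prefixed_expression(expr: str) -> List[PrefixedAtom]:
    # simplification: regexes containing commas are not supported
    return [parse_prefixed_atom(atom) for atom in expr.split(",")]


example = ("logical:type:~//(cover_(front|back)|binding|colour_checker),"
           "logical:id:~LOG_0001,physical:order:10..100")
for parsed in parse_prefixed_expression(example):
    print(parsed)
```

Only recognised prefixes are stripped, so an unprefixed atom such as `~PHYS_0001` or a regex containing a colon passes through unchanged.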

As an alternative, one could at least separate physical and logical, by giving the latter its own parameter, say --struct, distinct from --page-id.

> Indeed, already the attributes @ORDERLABEL and @LABEL could be non-unique, but the data structures for our METS caching do not represent this (i.e. some keys will displace others, effectively preventing multiple matches).

So this is actually just a problem with OCRD_METS_CACHING. As these attributes are only convenience variants, it's tolerable.

But we should extend the logical matching to @LABEL IMO – too many libraries / collections in practice use labels when they deviate from DFG in @TYPE. And this is going to be clash-free (for the same reason that @TYPE was: because the implementation in #1329 uses an extra dict to map these values to physical IDs).
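
For illustration, here is one way such an extra dict from logical @TYPE and @LABEL values to physical page IDs could be built from the structLink section, using only the standard library. This is a sketch under assumptions, not the code from #1329: the function name `logical_index` and the tiny METS snippet are made up, and only direct `mets:smLink` connections are followed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK = "{http://www.w3.org/1999/xlink}"

# made-up minimal METS with a logical structMap and structLink
METS_SNIPPET = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="LOG_0000" TYPE="monograph">
      <mets:div ID="LOG_0001" TYPE="cover_front"/>
      <mets:div ID="LOG_0002" TYPE="chapter" LABEL="Intro"/>
    </mets:div>
  </mets:structMap>
  <mets:structLink>
    <mets:smLink xlink:from="LOG_0001" xlink:to="PHYS_0001"/>
    <mets:smLink xlink:from="LOG_0002" xlink:to="PHYS_0002"/>
    <mets:smLink xlink:from="LOG_0002" xlink:to="PHYS_0003"/>
  </mets:structLink>
</mets:mets>
"""


def logical_index(root):
    """Map logical @TYPE and @LABEL values to lists of physical page IDs."""
    # logical div ID -> linked physical IDs
    links = defaultdict(list)
    for link in root.iterfind("mets:structLink/mets:smLink", NS):
        links[link.get(XLINK + "from")].append(link.get(XLINK + "to"))
    # separate dicts per attribute, so @TYPE and @LABEL values cannot clash
    index = {"TYPE": defaultdict(list), "LABEL": defaultdict(list)}
    for div in root.iterfind("mets:structMap[@TYPE='LOGICAL']//mets:div", NS):
        for attribute in index:
            value = div.get(attribute)
            if value is not None:
                index[attribute][value].extend(links.get(div.get("ID"), []))
    return index


index = logical_index(ET.fromstring(METS_SNIPPET))
print(index["TYPE"]["chapter"])
print(index["LABEL"]["Intro"])
```

Because every value maps to a list, several pages with the same @TYPE or @LABEL are all retained rather than displacing one another.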
