allow filtering by mets:div/@TYPE
#1328
Comments
See #1329 for a possible implementation. One thing that might be problematic here: we now have (more) potential clashes across attributes. We used to have a disclaimer (core/src/ocrd_models/ocrd_mets.py, lines 151 to 155 in 57c1973).
Indeed, already the attributes But with this change, we add to the potential confusion the types from logical structMap divs: while
Regarding possible conflation of too many attributes during matching ( @kba suggests adding prefixes to the expression atoms, e.g. This could perhaps be extended to even (optionally) disambiguate attributes, e.g. As an alternative, one could at least separate physical and logical, by giving the latter its own parameter, say
So this is actually just a problem with But we should extend the logical matching to
In case the METS already contains a logical structMap (or we run a processor like `ocrd-anybaseocr-layout-analysis` as a first step), there will be information on the type of every page (content vs title vs empty vs cover etc). This in turn might be very useful to allow filtering the page range when processing, complementary to the `--page-id` selection mechanism we already have (i.e. list or regex of physical page `@ID`, `@ORDER`, `@ORDERLABEL`, `@LABEL`, or `@CONTENTIDS`).

We might tie the list of types to consider to DFG Strukturdatenset, but in many cases, libraries extend that anyway (or use different spellings, like titlecase).
For example, a user may want to rule out processing `cover_front`, `cover_back`, `binding` pages. (Not only to avoid wasting resources, but because some processors may not handle these well, or even crash.)

Our current syntax does not allow for negative selection, so we'd need some new operator. I suggest `~`, since `!` might be interpreted by the shell via history expansion (even when quoted to prevent pathname expansion).

For `..` range expressions, negation only makes sense when applied to the entire expression. For comma-separated lists, every atom could be negated individually. And for regexes, the entire expression.

So how about `--page-id ~PHYS_0001..PHYS_0004`, `--page-id ~cover_front,~cover_back,~binding`, `--page-id "~//(cover_(front|back)|binding)"`?

(We might even want to support selecting logical div `@ID` in addition to `@TYPE`, e.g. `--page-id ~LOG_0001..LOG_0005` ...)
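
The proposed semantics could be sketched roughly as follows. This is not the implementation in #1329, just an illustration under stated assumptions: pages are represented as plain dicts of attribute values (a hypothetical stand-in for the METS lookup), a token matches a page if it equals any of its attribute values, regexes must match a value in full, and an expression consisting only of negated atoms starts from all pages. The names `select_pages`, `_expand` and `_index` are invented for this example.

```python
import re


def _index(pages, token):
    """Position of the first page with an attribute value equal to token."""
    for i, page in enumerate(pages):
        if token in page.values():
            return i
    raise ValueError(f"no page matches '{token}'")


def _expand(expr, pages):
    """Indexes of all pages selected by a single non-negated expression."""
    if expr.startswith("//"):
        # regex: must match some attribute value in full (assumption)
        pattern = re.compile(expr[2:])
        return {i for i, page in enumerate(pages)
                if any(pattern.fullmatch(value) for value in page.values())}
    if ".." in expr:
        # range: everything between the two end points, inclusive
        start, end = expr.split("..", 1)
        return set(range(_index(pages, start), _index(pages, end) + 1))
    # literal: every page carrying that value in some attribute
    return {i for i, page in enumerate(pages) if expr in page.values()}


def select_pages(page_id, pages):
    """Resolve a --page-id style expression with optional ~ negation."""
    if page_id.startswith(("//", "~//")):
        atoms = [page_id]  # a regex may contain commas, so never split it
    else:
        atoms = page_id.split(",")
    positive, negative = set(), set()
    any_positive = False
    for atom in atoms:
        if atom.startswith("~"):
            negative |= _expand(atom[1:], pages)
        else:
            any_positive = True
            positive |= _expand(atom, pages)
    if not any_positive:
        # only negated atoms: start from all pages (assumption)
        positive = set(range(len(pages)))
    return [pages[i]["ID"] for i in sorted(positive - negative)]


# Made-up page list, for illustration only
pages = [
    {"ID": "PHYS_0001", "TYPE": "cover_front"},
    {"ID": "PHYS_0002", "TYPE": "title_page"},
    {"ID": "PHYS_0003", "TYPE": "chapter"},
    {"ID": "PHYS_0004", "TYPE": "chapter"},
    {"ID": "PHYS_0005", "TYPE": "cover_back"},
]

print(select_pages("~PHYS_0001..PHYS_0004", pages))
# ['PHYS_0005']
print(select_pages("~cover_front,~cover_back,~binding", pages))
# ['PHYS_0002', 'PHYS_0003', 'PHYS_0004']
print(select_pages("~//(cover_(front|back)|binding)", pages))
# ['PHYS_0002', 'PHYS_0003', 'PHYS_0004']
```

Because literals are matched against every attribute at once, this sketch has exactly the cross-attribute ambiguity raised in the comments: a `@TYPE` value that happens to equal some `@LABEL` would select both.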