

Rule-based matching

Find phrases and tokens, and match entities

Compared to using regular expressions on raw text, spaCy’s rule-based matcher engines and components not only let you find the words and phrases you’re looking for – they also give you access to the tokens within the document and their relationships. This means you can easily access and analyze the surrounding tokens, merge spans into single tokens or add entries to the named entities in doc.ents.

For complex tasks, it’s usually better to train a statistical entity recognition model. However, statistical models require training data, so for many situations, rule-based approaches are more practical. This is especially true at the start of a project: you can use a rule-based approach as part of a data collection process, to help you “bootstrap” a statistical model.

Training a model is useful if you have some examples and you want your system to be able to generalize based on those examples. It works especially well if there are clues in the local context. For instance, if you’re trying to detect person or company names, your application may benefit from a statistical named entity recognition model.

Rule-based systems are a good choice if there’s a more or less finite number of examples that you want to find in the data, or if there’s a very clear, structured pattern you can express with token rules or regular expressions. For instance, country names, IP addresses or URLs are things you might be able to handle well with a purely rule-based approach.

You can also combine both approaches and improve a statistical model with rules to handle very specific cases and boost accuracy. For details, see the section on rule-based entity recognition.

The PhraseMatcher is useful if you already have a large terminology list or gazetteer consisting of single or multi-token phrases that you want to find exact instances of in your data. As of spaCy v2.1.0, you can also match on the LOWER attribute for fast and case-insensitive matching.

The Matcher isn’t as blazing fast as the PhraseMatcher, since it compares across individual token attributes. However, it allows you to write very abstract representations of the tokens you’re looking for, using lexical attributes, linguistic features predicted by the model, operators, set membership and rich comparison. For example, you can find a noun, followed by a verb with the lemma “love” or “like”, followed by an optional determiner and another token that’s at least 10 characters long.

Token-based matching

spaCy features a rule-matching engine, the Matcher, that operates over tokens, similar to regular expressions. The rules can refer to token annotations (e.g. the token text or tag_, and flags like IS_PUNCT). The rule matcher also lets you pass in a custom callback to act on matches – for example, to merge entities and apply custom labels. You can also associate patterns with entity IDs, to allow some basic entity linking or disambiguation. To match large terminology lists, you can use the PhraseMatcher, which accepts Doc objects as match patterns.

Adding patterns

Let’s say we want to enable spaCy to find a combination of three tokens:

  1. A token whose lowercase form matches “hello”, e.g. “Hello” or “HELLO”.
  2. A token whose is_punct flag is set to True, i.e. any punctuation.
  3. A token whose lowercase form matches “world”, e.g. “World” or “WORLD”.

First, we initialize the Matcher with a vocab. The matcher must always share the same vocab with the documents it will operate on. We can now call matcher.add() with an ID and a list of patterns.

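A minimal sketch of this setup, assuming the en_core_web_sm pipeline is installed:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Pattern: lowercase "hello", any punctuation, lowercase "world"
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
```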

The matcher returns a list of (match_id, start, end) tuples; in this case, [(15578876784678163569, 0, 3)], which maps to the span doc[0:3] of our original document. The match_id is the hash value of the string ID “HelloWorld”. To get the string value, you can look up the ID in the StringStore.

Optionally, we could also choose to add more than one pattern, for example to also match sequences without punctuation between “hello” and “world”:
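Continuing the sketch above, both patterns could be registered under the same match ID:

```python
patterns = [
    [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],
    [{"LOWER": "hello"}, {"LOWER": "world"}],
]
matcher.add("HelloWorld", patterns)
```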

By default, the matcher will only return the matches and not do anything else, like merge entities or assign labels. This is all up to you and can be defined individually for each pattern, by passing in a callback function as the on_match argument on add(). This is useful, because it lets you write entirely custom and pattern-specific logic. For example, you might want to merge some patterns into one token, while adding entity labels for other pattern types. You shouldn’t have to create different matchers for each of those processes.

Available token attributes

The available token pattern keys correspond to a number of Token attributes. The supported attributes for rule-based matching are:

Attribute | Description
ORTH | The exact verbatim text of a token. str
TEXT | The exact verbatim text of a token. str
NORM | The normalized form of the token text. str
LOWER | The lowercase form of the token text. str
LENGTH | The length of the token text. int
IS_ALPHA, IS_ASCII, IS_DIGIT | Token text consists of alphabetic characters, ASCII characters, digits. bool
IS_LOWER, IS_UPPER, IS_TITLE | Token text is in lowercase, uppercase, titlecase. bool
IS_PUNCT, IS_SPACE, IS_STOP | Token is punctuation, whitespace, stop word. bool
IS_SENT_START | Token is start of sentence. bool
LIKE_NUM, LIKE_URL, LIKE_EMAIL | Token text resembles a number, URL, email. bool
SPACY | Token has a trailing space. bool
POS, TAG, MORPH, DEP, LEMMA, SHAPE | The token’s simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. Note that the values of these attributes are case-sensitive. For a list of available part-of-speech tags and dependency labels, see the Annotation Specifications. str
ENT_TYPE | The token’s entity label. str
_ | Properties in custom extension attributes. Dict[str, Any]
OP | Operator or quantifier to determine how often to match a token pattern. str

Attribute names in patterns don’t have to be uppercase: spaCy normalizes them internally, so {"LOWER": "text"} and {"lower": "text"} will both produce the same result. Using the uppercase version is mostly a convention to make it clear that the attributes are “special” and don’t exactly map to the token attributes like Token.lower and Token.lower_.

spaCy can’t provide access to all of the token attributes because the Matcher loops over the Cython data, not the Python objects. Inside the matcher, we’re dealing with a TokenC struct, not an instance of Token. This means that all of the attributes that refer to computed properties can’t be accessed.

The uppercase attribute names like LOWER or IS_PUNCT refer to symbols from the spacy.attrs enum table. They’re passed into a function that essentially is a big case/switch statement, to figure out which struct field to return. The same attribute identifiers are used in Doc.to_array, and a few other places in the code where you need to describe fields like this.


Extended pattern syntax and attributes

Instead of mapping to a single value, token patterns can also map to a dictionary of properties. For example, you can specify that the value of a lemma should be part of a list of values, or set a minimum character length. The following rich comparison attributes are available:

Attribute | Description
IN | Attribute value is member of a list. Any
NOT_IN | Attribute value is not member of a list. Any
IS_SUBSET | Attribute value (for MORPH or custom list attributes) is a subset of a list. Any
IS_SUPERSET | Attribute value (for MORPH or custom list attributes) is a superset of a list. Any
INTERSECTS | Attribute value (for MORPH or custom list attributes) has a non-empty intersection with a list. Any
==, >=, <=, >, < | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. Union[int, float]
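For illustration, two hedged example patterns using these rich comparison attributes (the values are arbitrary):

```python
# A token whose lemma is "like" or "love", followed by a noun
pattern1 = [{"LEMMA": {"IN": ["like", "love"]}},
            {"POS": "NOUN"}]

# A token of 10 or more characters
pattern2 = [{"LENGTH": {">=": 10}}]
```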

Regular expressions

In some cases, only matching tokens and token attributes isn’t enough – for example, you might want to match different spellings of a word, without having to add a new pattern for each spelling.

The REGEX operator allows defining rules for any attribute string value, including custom attributes. It always needs to be applied to an attribute like TEXT, LOWER or TAG:
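The patterns below are a sketch of how this can look; the regular expressions themselves are only illustrative, and the "country" extension in the last pattern is assumed to be registered elsewhere:

```python
# Match different spellings of token texts
pattern = [{"TEXT": {"REGEX": "deff?in[ia]tely"}}]

# Match tokens with fine-grained POS tags starting with "V"
pattern = [{"TAG": {"REGEX": "^V"}}]

# Match a custom extension attribute with a regular expression
# (assumes a Token extension "country" has been registered)
pattern = [{"_": {"country": {"REGEX": "^[Uu](nited) ?[Ss](tates)$"}}}]
```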

Matching regular expressions on the full text

If your expressions apply to multiple tokens, a simple solution is to match on the doc.text with re.finditer and use the Doc.char_span method to create a Span from the character indices of the match. If the matched characters don’t map to one or more valid tokens, Doc.char_span returns None.

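A minimal sketch of this approach, assuming en_core_web_sm is installed:

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The United States of America (USA) are commonly known as the "
          "United States (U.S. or US) or America.")

expression = r"[Uu](nited|\.?) ?[Ss](tates|\.?)"
for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    # span is None if the match doesn't map to valid token boundaries
    if span is not None:
        print("Found match:", span.text)
```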

In some cases, you might want to expand the match to the closest token boundaries, so you can create a Span for "USA", even though only the substring "US" is matched. You can calculate this using the character offsets of the tokens in the document, available as Token.idx. This lets you create a list of valid token start and end boundaries and leaves you with a rather basic algorithmic problem: Given a number, find the next lowest (start token) or the next highest (end token) number that’s part of a given list of numbers. This will be the closest valid token boundary.

There are many ways to do this and the most straightforward one is to create a dict keyed by characters in the Doc, mapped to the token they’re part of. It’s easy to write and less error-prone, and gives you a constant lookup time: you only ever need to create the dict once per Doc.

You can then look up the character at a given position, and get the index of the corresponding token that the character is part of. Your span would then be doc[token_start:token_end]. If a character isn’t in the dict, it means it’s the (white)space that tokens are split on. That hopefully shouldn’t happen, though, because it’d mean your regex is producing matches with leading or trailing whitespace.
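One possible sketch of that lookup, building on the re.finditer example above (chars_to_tokens is an illustrative name):

```python
# Map every character offset in the Doc to the index of its token
chars_to_tokens = {}
for token in doc:
    for i in range(token.idx, token.idx + len(token.text)):
        chars_to_tokens[i] = token.i

for match in re.finditer(expression, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
        print("Found match:", span.text)
    else:
        # Expand the match to the closest token boundaries
        start_token = chars_to_tokens.get(start)
        end_token = chars_to_tokens.get(end - 1)
        if start_token is not None and end_token is not None:
            span = doc[start_token:end_token + 1]
            print("Found closest match:", span.text)
```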

Fuzzy matching v3.5

Fuzzy matching allows you to match tokens with alternate spellings, typos, etc. without specifying every possible variant.

The FUZZY attribute allows fuzzy matches for any attribute string value, including custom attributes. Just like REGEX, it always needs to be applied to an attribute like TEXT or LOWER. By default FUZZY allows a Levenshtein edit distance of at least 2 and up to 30% of the pattern string length. Using the more specific attributes FUZZY1..FUZZY9 you can specify the maximum allowed edit distance directly.
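For example, patterns along these lines would also match misspelled variants of the given words (the words are illustrative):

```python
# Two tokens: a fuzzy match for "favorite" followed by a fuzzy match for "theater",
# e.g. "favourite theatre" or "favorits theatr"
pattern = [{"LOWER": {"FUZZY": "favorite"}},
           {"LOWER": {"FUZZY": "theater"}}]

# Allow a maximum edit distance of 1 instead of the default
pattern = [{"LOWER": {"FUZZY1": "favorite"}}]
```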

Regex and fuzzy matching with lists v3.5

Starting in spaCy v3.5, both REGEX and FUZZY can be combined with the attributes IN and NOT_IN:
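A sketch of what such combined patterns can look like (the values are illustrative):

```python
# Fuzzy match against a list of alternatives
pattern = [{"TEXT": {"FUZZY": {"IN": ["awesome", "cool", "wonderful"]}}}]

# Exclude tokens matching any of the listed regular expressions
pattern = [{"TEXT": {"REGEX": {"NOT_IN": ["^awe(some)?$", "^wonder(ful)?$"]}}}]
```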


Operators and quantifiers

The matcher also lets you use quantifiers, specified as the 'OP' key. Quantifiers let you define sequences of tokens to be matched, e.g. one or more punctuation marks, or specify optional tokens. Note that there are no nested or scoped quantifiers – instead, you can build those behaviors with on_match callbacks.

OP | Description
! | Negate the pattern, by requiring it to match exactly 0 times.
? | Make the pattern optional, by allowing it to match 0 or 1 times.
+ | Require the pattern to match 1 or more times.
* | Allow the pattern to match zero or more times.
{n} | Require the pattern to match exactly n times.
{n,m} | Require the pattern to match at least n but not more than m times.
{n,} | Require the pattern to match at least n times.
{,m} | Require the pattern to match at most m times.
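For instance, a variant of the earlier example that allows one or more punctuation tokens between “hello” and “world” might look like this:

```python
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "+"}, {"LOWER": "world"}]
```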

Using wildcard token patterns

While the token attributes offer many options to write highly specific patterns, you can also use an empty dictionary, {}, as a wildcard representing any token. This is useful if you know the context of what you’re trying to match, but very little about the specific token and its characters. For example, let’s say you’re trying to extract people’s user names from your data. All you know is that they are listed as “User name: {username}”. The name itself may contain any character, but no whitespace, so you’ll know it will be handled as one token.
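A sketch of such a pattern, with the wildcard token standing in for the user name:

```python
# Matches "User name:" followed by any single token
pattern = [{"ORTH": "User"}, {"ORTH": "name"}, {"ORTH": ":"}, {}]
```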

Validating and debugging patterns

The Matcher can validate patterns against a JSON schema with the option validate=True. This is useful for debugging patterns during development, in particular for catching unsupported attributes.

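A minimal sketch of pattern validation; the attribute CASEINSENSITIVE is deliberately made up to trigger the error:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab, validate=True)
# CASEINSENSITIVE is not a valid attribute, so adding the pattern fails
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"CASEINSENSITIVE": "world"}]
matcher.add("HelloWorld", [pattern])  # Raises a MatchPatternError
```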

Adding on_match rules

To move on to a more realistic example, let’s say you’re working with a large corpus of blog articles, and you want to match all mentions of “Google I/O” (which spaCy tokenizes as ['Google', 'I', '/', 'O']). To be safe, you only match on the uppercase versions, avoiding matches with phrases such as “Google i/o”.

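A sketch of this, using a blank English pipeline so the new entity doesn’t conflict with model predictions (with a trained pipeline you may need to reconcile overlapping doc.ents yourself):

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

def add_event_ent(matcher, doc, i, matches):
    # Create a Span for the current match and add it to doc.ents
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label="EVENT")
    doc.ents += (entity,)
    print(entity.text)

pattern = [{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}]
matcher.add("GoogleIO", [pattern], on_match=add_event_ent)
doc = nlp("This is a text about Google I/O")
matches = matcher(doc)
```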

Very similar logic is implemented in the built-in EntityRuler, by the way. It also takes care of handling overlapping matches, which you would otherwise have to take care of yourself.

We can now call the matcher on our documents. The patterns will be matched in the order they occur in the text. The matcher will then iterate over the matches, look up the callback for the match ID that was matched, and invoke it.

When the callback is invoked, it is passed four arguments: the matcher itself, the document, the position of the current match, and the total list of matches. This allows you to write callbacks that consider the entire set of matched phrases, so that you can resolve overlaps and other conflicts in whatever way you prefer.

Argument | Description
matcher | The matcher instance. Matcher
doc | The document the matcher was used on. Doc
i | Index of the current match (matches[i]). int
matches | A list of (match_id, start, end) tuples, describing the matches. A match tuple describes a span doc[start:end]. List[Tuple[int, int, int]]

Creating spans from matches

Creating Span objects from the returned matches is a very common use case. spaCy makes this easy by giving you access to the start and end token of each match, which you can use to construct a new span with an optional label. As of spaCy v3.0, you can also set as_spans=True when calling the matcher on a Doc, which will return a list of Span objects using the match_id as the span label.

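A sketch showing both ways of creating spans from the matches (assuming en_core_web_sm):

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("PERSON", [[{"LOWER": "barack"}, {"LOWER": "obama"}]])
doc = nlp("Barack Obama was the 44th president of the United States")

# Option 1: build the spans manually from the returned tuples
for match_id, start, end in matcher(doc):
    span = Span(doc, start, end, label=match_id)
    print(span.text, span.label_)

# Option 2: let the matcher return Span objects directly
for span in matcher(doc, as_spans=True):
    print(span.text, span.label_)
```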

Using custom pipeline components

Let’s say your data also contains some annoying pre-processing artifacts, like leftover HTML line breaks (e.g. <br> or <BR/>). To make your text easier to analyze, you want to merge those into one token and flag them, to make sure you can ignore them later. Ideally, this should all be done automatically as you process the text. You can achieve this by adding a custom pipeline component that’s called on each Doc object, merges the leftover HTML spans and sets an attribute bad_html on the token.

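One possible sketch of such a component; BadHTMLMerger and the bad_html extension are illustrative names, and the patterns assume the tokenizer splits the HTML tags into separate tokens:

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Token

# Use a component factory so the component is created with the shared vocab
@Language.factory("html_merger")
def create_bad_html_merger(nlp, name):
    return BadHTMLMerger(nlp.vocab)

class BadHTMLMerger:
    def __init__(self, vocab):
        patterns = [
            [{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}],
            [{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}],
        ]
        # Register a token extension to flag bad HTML
        Token.set_extension("bad_html", default=False, force=True)
        self.matcher = Matcher(vocab)
        self.matcher.add("BAD_HTML", patterns)

    def __call__(self, doc):
        # Merge the matched spans and flag the resulting tokens
        matches = self.matcher(doc)
        spans = [doc[start:end] for match_id, start, end in matches]
        with doc.retokenize() as retokenizer:
            for span in spans:
                retokenizer.merge(span)
                for token in span:
                    token._.bad_html = True
        return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("html_merger", last=True)  # Add component to the pipeline
doc = nlp("Hello<br>world! <br/> This is a test.")
for token in doc:
    print(token.text, token._.bad_html)
```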

Instead of hard-coding the patterns into the component, you could also make it take a path to a JSON file containing the patterns. This lets you reuse the component with different patterns, depending on your application. When adding the component to the pipeline with nlp.add_pipe, you can pass in the argument via the config:
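A hedged sketch of that variant, assuming the BadHTMLMerger class above is extended to load its patterns from the given file:

```python
@Language.factory("html_merger", default_config={"path": None})
def create_bad_html_merger(nlp, name, path):
    # path is read from the component config at pipeline construction time
    return BadHTMLMerger(nlp.vocab, path=path)

nlp.add_pipe("html_merger", config={"path": "/path/to/patterns.json"})
```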

Example: Using linguistic annotations

Let’s say you’re analyzing user comments and you want to find out what people are saying about Facebook. You want to start off by finding adjectives following “Facebook is” or “Facebook was”. This is obviously a very rudimentary solution, but it’ll be fast, and a great way to get an idea for what’s in your data. Your pattern could look like this:
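One way to write it as a token pattern:

```python
pattern = [{"LOWER": "facebook"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"},
           {"POS": "ADJ"}]
```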

This translates to a token whose lowercase form matches “facebook” (like Facebook, facebook or FACEBOOK), followed by a token with the lemma “be” (for example, is, was, or ‘s), followed by an optional adverb, followed by an adjective. Using the linguistic annotations here is especially useful, because you can tell spaCy to match “Facebook’s annoying”, but not “Facebook’s annoying ads”. The optional adverb makes sure you won’t miss adjectives with intensifiers, like “pretty awful” or “very nice”.

To get a quick overview of the results, you could collect all sentences containing a match and render them with the displaCy visualizer. In the callback function, you’ll have access to the start and end of each match, as well as the parent Doc. This lets you determine the sentence containing the match, doc[start:end].sent, and calculate the start and end of the matched span within the sentence. Using displaCy in “manual” mode lets you pass in a list of dictionaries containing the text and entities to render.

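A sketch putting this together (assuming en_core_web_sm; displacy.serve starts a local web server, so use displacy.render instead inside a notebook):

```python
import spacy
from spacy import displacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matched_sents = []  # Collect data of matched sentences to be visualized

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]  # Matched span
    sent = span.sent  # Sentence containing the matched span
    # Create a "mock" entity in displaCy's format, with offsets relative
    # to the sentence rather than the document
    match_ents = [{
        "start": span.start_char - sent.start_char,
        "end": span.end_char - sent.start_char,
        "label": "MATCH",
    }]
    matched_sents.append({"text": sent.text, "ents": match_ents})

pattern = [{"LOWER": "facebook"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"},
           {"POS": "ADJ"}]
matcher.add("FacebookIs", [pattern], on_match=collect_sents)
doc = nlp("I'd say that Facebook is evil. Facebook is pretty cool, right?")
matches = matcher(doc)

# Render the collected sentences, using manual mode to pass in dictionaries
displacy.serve(matched_sents, style="ent", manual=True)
```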

Example: Phone numbers

Phone numbers can have many different formats and matching them is often tricky. During tokenization, spaCy will leave sequences of numbers intact and only split on whitespace and punctuation. This means that your match pattern will have to look out for number sequences of a certain length, surrounded by specific punctuation – depending on the national conventions.

The IS_DIGIT flag is not very helpful here, because it doesn’t tell us anything about the length. However, you can use the SHAPE flag, with each d representing a digit (up to 4 digits / characters):
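For example, a pattern along these lines:

```python
pattern = [{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "dddd"},
           {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"}]
```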

This will match phone numbers of the format (123) 4567 8901 or (123) 4567-8901. To also match formats like (123) 456 789, you can add a second pattern using 'ddd' in place of 'dddd'. By hard-coding some values, you can match only certain, country-specific numbers. For example, here’s a pattern to match the most common formats of international German numbers:
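A hedged sketch of such a pattern; the exact shapes depend on how the numbers are written in your data:

```python
pattern = [{"ORTH": "+"}, {"ORTH": "49"}, {"ORTH": "(", "OP": "?"}, {"SHAPE": "dddd"},
           {"ORTH": ")", "OP": "?"}, {"SHAPE": "dddd", "LENGTH": 6}]
```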

Depending on the formats your application needs to match, creating an extensive set of rules like this is often better than training a model. It’ll produce more predictable results, is much easier to modify and extend, and doesn’t require any training data – only a set of test cases.

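Putting the 'ddd' variant mentioned above into a runnable sketch:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "ddd"},
           {"ORTH": "-", "OP": "?"}, {"SHAPE": "ddd"}]
matcher.add("PHONE_NUMBER", [pattern])

doc = nlp("Call me at (123) 456 789 or (123) 456 789!")
print([t.text for t in doc])
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```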

Example: Hashtags and emoji on social media

Social media posts, especially tweets, can be difficult to work with. They’re very short and often contain various emoji and hashtags. By only looking at the plain text, you’ll lose a lot of valuable semantic information.

Let’s say you’ve extracted a large sample of social media posts on a specific topic, for example posts mentioning a brand name or product. As the first step of your data exploration, you want to filter out posts containing certain emoji and use them to assign a general sentiment score, based on whether the expressed emotion is positive or negative, e.g. 😀 or 😞. You also want to find, merge and label hashtags like #MondayMotivation, to be able to ignore or analyze them later.

By default, spaCy’s tokenizer will split emoji into separate tokens. This means that you can create a pattern for one or more emoji tokens. Valid hashtags usually consist of a #, plus a sequence of ASCII characters with no whitespace, making them easy to match as well.

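A sketch of this approach; the emoji lists are shortened, and the running score is stored in a custom Doc extension named "sentiment" purely for illustration:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Custom extension to hold the running sentiment score (illustrative)
Doc.set_extension("sentiment", default=0.0)

pos_emoji = ["😀", "😃", "😂", "🤣", "😊", "😍"]  # Positive emoji
neg_emoji = ["😞", "😠", "😩", "😢", "😭", "😒"]  # Negative emoji

# One pattern per emoji token
pos_patterns = [[{"ORTH": emoji}] for emoji in pos_emoji]
neg_patterns = [[{"ORTH": emoji}] for emoji in neg_emoji]

def label_sentiment(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    if doc.vocab.strings[match_id] == "HAPPY":
        doc._.sentiment += 0.1  # Add 0.1 for positive sentiment
    elif doc.vocab.strings[match_id] == "SAD":
        doc._.sentiment -= 0.1  # Subtract 0.1 for negative sentiment

matcher.add("HAPPY", pos_patterns, on_match=label_sentiment)
matcher.add("SAD", neg_patterns, on_match=label_sentiment)
# Pattern for a valid hashtag: '#' plus any ASCII token
matcher.add("HASHTAG", [[{"ORTH": "#"}, {"IS_ASCII": True}]])

doc = nlp("Hello world 😀 #MondayMotivation")
for match_id, start, end in matcher(doc):
    string_id = doc.vocab.strings[match_id]  # Look up the string ID
    print(string_id, doc[start:end].text)
print("Sentiment:", doc._.sentiment)
```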

Because the on_match callback receives the ID of each match, you can use the same function to handle the sentiment assignment for both the positive and negative pattern. To keep it simple, we’ll either add or subtract 0.1 points – this way, the score will also reflect combinations of emoji, even positive and negative ones.

With a library like emoji, we can also retrieve a short description for each emoji – for example, 😍’s official title is “Smiling Face With Heart-Eyes”. Assigning it to a custom attribute on the emoji span will make it available as span._.emoji_desc.

To label the hashtags, we can use a custom attribute set on the respective token:

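A sketch using an is_hashtag token extension (an illustrative name), merging each hashtag into a single token first:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("HASHTAG", [[{"ORTH": "#"}, {"IS_ASCII": True}]])

# Register the custom token attribute
Token.set_extension("is_hashtag", default=False)

doc = nlp("Hello world 😀 #MondayMotivation")
hashtags = [doc[start:end] for match_id, start, end in matcher(doc)]
with doc.retokenize() as retokenizer:
    for span in hashtags:
        # Merge the hashtag into one token and set the custom attribute
        retokenizer.merge(span, attrs={"_": {"is_hashtag": True}})

for token in doc:
    print(token.text, token._.is_hashtag)
```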

Efficient phrase matching

If you need to match large terminology lists, you can also use the PhraseMatcher and create Doc objects instead of token patterns, which is much more efficient overall. The Doc patterns can contain single or multiple tokens.

Adding phrase patterns

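A minimal sketch, assuming en_core_web_sm; nlp.make_doc only runs the tokenizer, which keeps pattern creation fast:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)

doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```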

Since spaCy is used for processing both the patterns and the text to be matched, you won’t have to worry about specific tokenization – for example, you can simply pass in nlp("Washington, D.C.") and won’t have to write a complex token pattern covering the exact tokenization of the term.

Matching on other token attributes

By default, the PhraseMatcher will match on the verbatim token text, e.g. Token.text. By setting the attr argument on initialization, you can change which token attribute the matcher should use when comparing the phrase pattern to the matched Doc. For example, using the attribute LOWER lets you match on Token.lower and create case-insensitive match patterns:

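A sketch of case-insensitive phrase matching:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
patterns = [nlp.make_doc(name) for name in ["Angela Merkel", "Barack Obama"]]
matcher.add("Names", patterns)

doc = nlp("angela merkel and us president barack Obama")
for match_id, start, end in matcher(doc):
    print("Matched based on lowercase token text:", doc[start:end].text)
```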

Another possible use case is matching number tokens like IP addresses based on their shape. This means that you won’t have to worry about how those strings will be tokenized and you’ll be able to find tokens and combinations of tokens based on a few examples. Here, we’re matching on the shapes ddd.d.d.d and ddd.ddd.d.d:

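A sketch of shape-based matching for IP addresses:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab, attr="SHAPE")
matcher.add("IP", [nlp("127.0.0.1"), nlp("127.127.0.0")])

doc = nlp("Often the router will have an IP address such as 192.168.1.1 or 192.168.2.1.")
for match_id, start, end in matcher(doc):
    print("Matched based on token shape:", doc[start:end].text)
```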

In theory, the same also works for attributes like POS. For example, a pattern nlp("I like cats") matched based on its part-of-speech tag would return a match for “I love dogs”. You could also match on boolean flags like IS_PUNCT to match phrases with the same sequence of punctuation and non-punctuation tokens as the pattern. But this can easily get confusing and doesn’t have much of an advantage over writing one or two token patterns.

Dependency Matcher v3.0 (needs model)

The DependencyMatcher lets you match patterns within the dependency parse using Semgrex operators. It requires a model containing a parser such as the DependencyParser. Instead of defining a list of adjacent tokens as in Matcher patterns, the DependencyMatcher patterns match tokens in the dependency parse and specify the relations between them.

A pattern added to the dependency matcher consists of a list of dictionaries, with each dictionary describing a token to match and its relation to an existing token in the pattern. Except for the first dictionary, which defines an anchor token using only RIGHT_ID and RIGHT_ATTRS, each dictionary should have the following keys:

Name | Description
LEFT_ID | The name of the left-hand node in the relation, which has been defined in an earlier node. str
REL_OP | An operator that describes how the two nodes are related. str
RIGHT_ID | A unique name for the right-hand node in the relation. str
RIGHT_ATTRS | The token attributes to match for the right-hand node, in the same format as patterns provided to the regular token-based Matcher. Dict[str, Any]

Each additional token added to the pattern is linked to an existing token LEFT_ID by the relation REL_OP. The new token is given the name RIGHT_ID and described by the attributes RIGHT_ATTRS.

Dependency matcher operators

The following operators are supported by the DependencyMatcher, most of which come directly from Semgrex:

Symbol | Description
A < B | A is the immediate dependent of B.
A > B | A is the immediate head of B.
A << B | A is the dependent in a chain to B following dep → head paths.
A >> B | A is the head in a chain to B following head → dep paths.
A . B | A immediately precedes B, i.e. A.i == B.i - 1, and both are within the same dependency tree.
A .* B | A precedes B, i.e. A.i < B.i, and both are within the same dependency tree (Semgrex counterpart: ..).
A ; B | A immediately follows B, i.e. A.i == B.i + 1, and both are within the same dependency tree (Semgrex counterpart: -).
A ;* B | A follows B, i.e. A.i > B.i, and both are within the same dependency tree (Semgrex counterpart: --).
A $+ B | B is a right immediate sibling of A, i.e. A and B have the same parent and A.i == B.i - 1.
A $- B | B is a left immediate sibling of A, i.e. A and B have the same parent and A.i == B.i + 1.
A $++ B | B is a right sibling of A, i.e. A and B have the same parent and A.i < B.i.
A $-- B | B is a left sibling of A, i.e. A and B have the same parent and A.i > B.i.
A >+ B (v3.5.1) | B is a right immediate child of A, i.e. A is a parent of B and A.i == B.i - 1 (not in Semgrex).
A >- B (v3.5.1) | B is a left immediate child of A, i.e. A is a parent of B and A.i == B.i + 1 (not in Semgrex).
A >++ B | B is a right child of A, i.e. A is a parent of B and A.i < B.i.
A >-- B | B is a left child of A, i.e. A is a parent of B and A.i > B.i.
A <+ B (v3.5.1) | B is a right immediate parent of A, i.e. A is a child of B and A.i == B.i - 1 (not in Semgrex).
A <- B (v3.5.1) | B is a left immediate parent of A, i.e. A is a child of B and A.i == B.i + 1 (not in Semgrex).
A <++ B | B is a right parent of A, i.e. A is a child of B and A.i < B.i.
A <-- B | B is a left parent of A, i.e. A is a child of B and A.i > B.i.

Designing dependency matcher patterns

Let’s say we want to find sentences describing who founded what kind of company:

  • Smith founded a healthcare company in 2005.
  • Williams initially founded an insurance company in 1987.
  • Lee, an experienced CEO, has founded two AI startups.

The dependency parse for “Smith founded a healthcare company” shows types of relations and tokens we want to match:

The relations we’re interested in are:

  • the founder is the subject (nsubj) of the token with the text founded
  • the company is the object (dobj) of founded
  • the kind of company may be an adjective (amod, not shown above) or a compound (compound)

The first step is to pick an anchor token for the pattern. Since it’s the root of the dependency parse, founded is a good choice here. It is often easier to construct patterns when all dependency relation operators point from the head to the children. In this example, we’ll only use >, which connects a head to an immediate dependent as head > child.

The simplest dependency matcher pattern will identify and name a single token in the tree:

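A minimal sketch of such a pattern (assuming en_core_web_sm, which includes a parser):

```python
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)
pattern = [
    {
        "RIGHT_ID": "anchor_founded",        # Unique name for this token
        "RIGHT_ATTRS": {"ORTH": "founded"},  # Token pattern for "founded"
    }
]
matcher.add("FOUNDED", [pattern])
doc = nlp("Smith founded two companies.")
# Matches are (match_id, [token_ids]) tuples
print(matcher(doc))
```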

Now that we have a named anchor token (anchor_founded), we can add the founder as the immediate dependent (>) of founded with the dependency label nsubj:

Step 1
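One way to express this; the dictionary below would be appended to the pattern list from the previous step:

```python
# Appended to the pattern: the subject of "founded"
{
    "LEFT_ID": "anchor_founded",
    "REL_OP": ">",
    "RIGHT_ID": "founded_subject",
    "RIGHT_ATTRS": {"DEP": "nsubj"},
}
```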

The direct object (dobj) is added in the same way:

Step 2
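Again as a dictionary appended to the growing pattern list:

```python
# Appended to the pattern: the direct object of "founded"
{
    "LEFT_ID": "anchor_founded",
    "REL_OP": ">",
    "RIGHT_ID": "founded_object",
    "RIGHT_ATTRS": {"DEP": "dobj"},
}
```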

When the subject and object tokens are added, they are required to have names under the key RIGHT_ID, which are allowed to be any unique string, e.g. founded_subject. These names can then be used as LEFT_ID to link new tokens into the pattern. For the final part of our pattern, we’ll specify that the token founded_object should have a modifier with the dependency relation amod or compound:

Step 3
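The final dictionary links the modifier to founded_object rather than to the anchor:

```python
# Appended to the pattern: a modifier of the object
{
    "LEFT_ID": "founded_object",
    "REL_OP": ">",
    "RIGHT_ID": "founded_object_modifier",
    "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}},
}
```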

You can picture the process of creating a dependency matcher pattern as defining an anchor token on the left and building up the pattern by linking tokens one-by-one on the right using relation operators. To create a valid pattern, each new token needs to be linked to an existing token on its left. As for founded in this example, a token may be linked to more than one token on its right:

Dependency matcher pattern

The full pattern comes together as shown in the example below:

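A sketch of the complete pattern and how it can be applied:

```python
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

pattern = [
    {
        "RIGHT_ID": "anchor_founded",
        "RIGHT_ATTRS": {"ORTH": "founded"},
    },
    {
        "LEFT_ID": "anchor_founded",
        "REL_OP": ">",
        "RIGHT_ID": "founded_subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
    {
        "LEFT_ID": "anchor_founded",
        "REL_OP": ">",
        "RIGHT_ID": "founded_object",
        "RIGHT_ATTRS": {"DEP": "dobj"},
    },
    {
        "LEFT_ID": "founded_object",
        "REL_OP": ">",
        "RIGHT_ID": "founded_object_modifier",
        "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}},
    },
]

matcher.add("FOUNDED", [pattern])
doc = nlp("Lee, an experienced CEO, has founded two AI startups.")
# Each token_id corresponds to one pattern dictionary
for match_id, token_ids in matcher(doc):
    for i in range(len(token_ids)):
        print(pattern[i]["RIGHT_ID"] + ":", doc[token_ids[i]].text)
```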

Rule-based entity recognition

The EntityRuler is a component that lets you add named entities based on pattern dictionaries, which makes it easy to combine rule-based and statistical named entity recognition for even more powerful pipelines.

Entity Patterns

Entity patterns are dictionaries with two keys: "label", specifying the label to assign to the entity if the pattern is matched, and "pattern", the match pattern. The entity ruler accepts two types of patterns:

  1. Phrase patterns for exact string matches (string).

  2. Token patterns with one dictionary describing one token (list).
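For example (both patterns are illustrative):

```python
# Phrase pattern
{"label": "ORG", "pattern": "Apple"}

# Token pattern
{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}
```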

Using the entity ruler

The EntityRuler is a pipeline component that’s typically added via nlp.add_pipe. When the nlp object is called on a text, it will find matches in the doc and add them as entities to doc.ents, using the specified pattern label as the entity label. If any matches overlap, the pattern matching the most tokens takes priority. If they happen to be equally long, the match occurring first in the Doc is chosen.

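A minimal sketch using a blank English pipeline:

```python
from spacy.lang.en import English

nlp = English()
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ORG", "pattern": "Apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
ruler.add_patterns(patterns)

doc = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])
```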

The entity ruler is designed to integrate with spaCy’s existing pipeline components and enhance the named entity recognizer. If it’s added before the "ner" component, the entity recognizer will respect the existing entity spans and adjust its predictions around it. This can significantly improve accuracy in some cases. If it’s added after the "ner" component, the entity ruler will only add spans to the doc.ents if they don’t overlap with existing entities predicted by the model. To overwrite overlapping entities, you can set overwrite_ents=True on initialization.

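A sketch of adding the entity ruler before the statistical entity recognizer in a trained pipeline:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "ORG", "pattern": "MyCorp Inc."}])

doc = nlp("MyCorp Inc. is a company in the U.S.")
print([(ent.text, ent.label_) for ent in doc.ents])
```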

Validating and debugging EntityRuler patterns

The entity ruler can validate patterns against a JSON schema with the config setting "validate". See details under Validating and debugging patterns.
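For example:

```python
ruler = nlp.add_pipe("entity_ruler", config={"validate": True})
```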

Adding IDs to patterns

The EntityRuler can also accept an id attribute for each pattern. Using the id attribute allows multiple patterns to be associated with the same entity.

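A sketch using the same id for two spellings of the same city:

```python
from spacy.lang.en import English

nlp = English()
ruler = nlp.add_pipe("entity_ruler")
patterns = [
    {"label": "ORG", "pattern": "Apple", "id": "apple"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "san-francisco"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "fran"}], "id": "san-francisco"},
]
ruler.add_patterns(patterns)

doc1 = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc1.ents])

doc2 = nlp("Apple is opening its first big office in San Fran.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc2.ents])
```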

If the id attribute is included in the EntityRuler patterns, the ent_id_ property of the matched entity is set to the id given in the patterns. So in the example above it’s easy to identify that “San Francisco” and “San Fran” are both the same entity.

Using pattern files

The to_disk and from_disk methods let you save and load patterns to and from JSONL (newline-delimited JSON) files, containing one pattern object per line.

patterns.jsonl
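Such a file might contain lines like the following (illustrative patterns):

```
{"label": "ORG", "pattern": "Apple"}
{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}
```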

When you save out an nlp object that has an EntityRuler added to its pipeline, its patterns are automatically exported to the pipeline directory:
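A sketch (the output path is a placeholder):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
nlp.to_disk("/path/to/pipeline")
```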

The saved pipeline now includes the "entity_ruler" in its config.cfg and the pipeline directory contains a file patterns.jsonl with the patterns. When you load the pipeline back in, all pipeline components will be restored and deserialized – including the entity ruler. This lets you ship powerful pipeline packages with binary weights and rules included!

Using a large number of phrase patterns

When using a large number of phrase patterns (roughly > 10000) it’s useful to understand how the add_patterns function of the entity ruler works. For each phrase pattern, the EntityRuler calls the nlp object to construct a Doc object. This matters if, for example, you add the EntityRuler to the end of an existing pipeline with a POS tagger and want to extract matches based on the patterns’ POS signature. In that case, you would pass a config value of "phrase_matcher_attr": "POS" for the entity ruler.

Running the full language pipeline across every pattern in a large list scales linearly and can therefore take a long time on large amounts of phrase patterns. As of spaCy v2.2.4 the add_patterns function has been refactored to use nlp.pipe on all phrase patterns resulting in about a 10x-20x speed up with 5,000-100,000 phrase patterns respectively. Even with this speedup (but especially if you’re using an older version) the add_patterns function can still take a long time. An easy workaround to make this function run faster is disabling the other language pipes while adding the phrase patterns.
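A hedged sketch of that workaround, temporarily disabling the other components while the patterns are processed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "TEST", "pattern": str(i)} for i in range(10000)]

# Disable all components while adding the patterns; if you match on POS
# via phrase_matcher_attr, keep the tagger enabled instead
with nlp.select_pipes(disable=nlp.pipe_names):
    ruler.add_patterns(patterns)
```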

Rule-based span matching v3.3.1

The SpanRuler is a generalized version of the entity ruler that lets you add spans to doc.spans or doc.ents based on pattern dictionaries, which makes it easy to combine rule-based and statistical pipeline components.

Span patterns

The pattern format is the same as for the entity ruler:

  1. Phrase patterns for exact string matches (string).

  2. Token patterns with one dictionary describing one token (list).

Using the span ruler

The SpanRuler is a pipeline component that’s typically added via nlp.add_pipe. When the nlp object is called on a text, it will find matches in the doc and add them as spans to doc.spans["ruler"], using the specified pattern label as the entity label. Unlike in doc.ents, overlapping matches are allowed in doc.spans, so no filtering is required, but optional filtering and sorting can be applied to the spans before they’re saved.

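A minimal sketch using a blank English pipeline; "ruler" is the default spans key for the component:

```python
from spacy.lang.en import English

nlp = English()
ruler = nlp.add_pipe("span_ruler")
patterns = [{"label": "ORG", "pattern": "Apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
ruler.add_patterns(patterns)

doc = nlp("Apple is opening its first big office in San Francisco.")
print([(span.text, span.label_) for span in doc.spans["ruler"]])
```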

The span ruler is designed to integrate with spaCy’s existing pipeline components and enhance the SpanCategorizer and EntityRecognizer. The overwrite setting determines whether the existing annotation in doc.spans or doc.ents is preserved. Because overlapping entities are not allowed for doc.ents, the entities are always filtered, using util.filter_spans by default. See the SpanRuler API docs for more information about how to customize the sorting and filtering of matched spans.

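A sketch that configures the span ruler to write to doc.ents instead of doc.spans, prioritizing the ruler's spans over existing entities (the config keys follow the SpanRuler API):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
config = {
    "spans_key": None,      # Don't write to doc.spans
    "annotate_ents": True,  # Write to doc.ents instead
    "ents_filter": {"@misc": "spacy.prioritize_new_ents_filter.v1"},
}
ruler = nlp.add_pipe("span_ruler", config=config)
ruler.add_patterns([{"label": "ORG", "pattern": "MyCorp Inc."}])

doc = nlp("MyCorp Inc. is a company in the U.S.")
print([(ent.text, ent.label_) for ent in doc.ents])
```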

Using pattern files

You can save patterns in a JSONL file (newline-delimited JSON) to load with SpanRuler.initialize or SpanRuler.add_patterns.

patterns.jsonl
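For example, the patterns can be loaded and added like this (patterns.jsonl is a placeholder path):

```python
import srsly
from spacy.lang.en import English

nlp = English()
ruler = nlp.add_pipe("span_ruler")
patterns = srsly.read_jsonl("patterns.jsonl")
ruler.add_patterns(patterns)
```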

Combining models and rules

You can combine statistical and rule-based components in a variety of ways. Rule-based components can be used to improve the accuracy of statistical models, by presetting tags, entities or sentence boundaries for specific tokens. The statistical models will usually respect these preset annotations, which sometimes improves the accuracy of other decisions. You can also use rule-based components after a statistical model to correct common errors. Finally, rule-based components can reference the attributes set by statistical models, in order to implement more abstract logic.

Example: Expanding named entities

When using a trained named entity recognition model to extract information from your texts, you may find that the predicted span only includes parts of the entity you’re looking for. Sometimes, this happens if the statistical model predicts entities incorrectly. Other times, it happens if the way the entity type was defined in the original training corpus doesn’t match what you need for your application.

For example, the corpus spaCy’s English pipelines were trained on defines a PERSON entity as just the person name, without titles like “Mr.” or “Dr.”. This makes sense, because it makes it easier to resolve the entity type back to a knowledge base. But what if your application needs the full names, including the titles?

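You can inspect what the model predicts for a sentence like this (assuming en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])
```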

While you could try and teach the model a new definition of the PERSON entity by updating it with more examples of spans that include the title, this might not be the most efficient approach. The existing model was trained on over 2 million words, so in order to completely change the definition of an entity type, you might need a lot of training examples. However, if you already have the predicted PERSON entities, you can use a rule-based approach that checks whether they come with a title and, if so, expands the entity span by one token. After all, what all titles in this example have in common is that, if they occur, they occur in the token right before the person entity.
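A sketch of such a component; the list of titles is illustrative and would need to be adapted to your data:

```python
from spacy.language import Language
from spacy.tokens import Span

@Language.component("expand_person_entities")
def expand_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        # Only check for a title if it's a person and not the first token
        if ent.label_ == "PERSON" and ent.start != 0:
            prev_token = doc[ent.start - 1]
            if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
                new_ent = Span(doc, ent.start - 1, ent.end, label=ent.label)
                new_ents.append(new_ent)
            else:
                new_ents.append(ent)
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc
```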

The above function takes a Doc object, modifies its doc.ents and returns it. Using the @Language.component decorator, we can register it as a pipeline component so it can run automatically when processing a text. We can use nlp.add_pipe to add it to the current pipeline.

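Building on the component registered above:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Add the component after the named entity recognizer
nlp.add_pipe("expand_person_entities", after="ner")

doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])
```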

An alternative approach would be to use an extension attribute like ._.person_title and add it to Span objects (which includes entity spans in doc.ents). The advantage here is that the entity text stays intact and can still be used to look up the name in a knowledge base. The following function takes a Span object and, if the span is a PERSON entity, checks the token right before it and returns the title if one is found. The Span.doc attribute gives us easy access to the span’s parent document.
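A sketch of that getter function (again with an illustrative list of titles):

```python
def get_person_title(span):
    if span.label_ == "PERSON" and span.start != 0:
        prev_token = span.doc[span.start - 1]
        if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
            return prev_token.text
```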

We can now use the Span.set_extension method to add the custom extension attribute "person_title", using get_person_title as the getter function.

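Building on get_person_title defined above:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
# Register the Span extension attribute "person_title" with the getter
Span.set_extension("person_title", getter=get_person_title)

doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_, ent._.person_title) for ent in doc.ents])
```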

Example: Using entities, part-of-speech tags and the dependency parse

Let’s say you want to parse professional biographies and extract the person names and company names, and whether it’s a company they’re currently working at, or a previous company. One approach could be to try and train a named entity recognizer to predict CURRENT_ORG and PREVIOUS_ORG – but this distinction is very subtle and something the entity recognizer may struggle to learn. Nothing about “Acme Corp Inc.” is inherently “current” or “previous”.

However, the syntax of the sentence holds some very important clues: we can check for trigger words like “work”, whether they’re past tense or present tense, whether company names are attached to it and whether the person is the subject. All of this information is available in the part-of-speech tags and the dependency parse.

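A sketch that visualizes the parse for an example sentence (use displacy.render instead of displacy.serve inside a notebook):

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alex Smith worked at Acme Corp Inc.")
displacy.serve(doc, style="dep", options={"fine_grained": True})
```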

Visualization of dependency parse

In this example, “worked” is the root of the sentence and is a past tense verb. Its subject is “Alex Smith”, the person who worked. “at Acme Corp Inc.” is a prepositional phrase attached to the verb “worked”. To extract this relationship, we can start by looking at the predicted PERSON entities, find their heads and check whether they’re attached to a trigger word like “work”. Next, we can check for prepositional phrases attached to the head and whether they contain an ORG entity. Finally, to determine whether the company affiliation is current, we can check the head’s part-of-speech tag.

To apply this logic automatically when we process a text, we can add it to the nlp object as a custom pipeline component. The above logic also expects that entities are merged into single tokens. spaCy ships with a handy built-in merge_entities that takes care of that. Instead of just printing the result, you could also write it to custom attributes on the entity Span – for example ._.orgs or ._.prev_orgs and ._.current_orgs.

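A sketch of such a component; extract_person_orgs is an illustrative name, and the trigger logic is deliberately kept simple:

```python
import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

@Language.component("extract_person_orgs")
def extract_person_orgs(doc):
    person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
    for ent in person_entities:
        # Because the entity is a span, we need its root token, whose head
        # is the syntactic governor of the person, e.g. the verb
        head = ent.root.head
        if head.lemma_ == "work":
            # Check if the children contain a preposition
            preps = [token for token in head.children if token.dep_ == "prep"]
            for prep in preps:
                # Check if tokens that are part of ORG entities are among the
                # preposition's children, e.g. at -> Acme Corp Inc.
                orgs = [token for token in prep.children if token.ent_type_ == "ORG"]
                # If the verb is in past tense, it's a previous company
                print({"person": ent, "orgs": orgs, "past": head.tag_ == "VBD"})
    return doc

# Merge entities into single tokens so ent.root covers the whole name
nlp.add_pipe("merge_entities")
nlp.add_pipe("extract_person_orgs")

doc = nlp("Alex Smith worked at Acme Corp Inc.")
```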

If you change the sentence structure above, for example to “was working”, you’ll notice that our current logic fails and doesn’t correctly detect the company as a past organization. That’s because the root is a participle and the tense information is in the attached auxiliary “was”:

Visualization of dependency parse

To solve this, we can adjust the rules to also check for the above construction:
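A sketch of the adjusted check; the component is otherwise the same as above, registered under a different name here so both versions can coexist:

```python
@Language.component("extract_person_orgs_v2")
def extract_person_orgs_v2(doc):
    person_entities = [ent for ent in doc.ents if ent.label_ == "PERSON"]
    for ent in person_entities:
        head = ent.root.head
        if head.lemma_ == "work":
            preps = [token for token in head.children if token.dep_ == "prep"]
            for prep in preps:
                orgs = [token for token in prep.children if token.ent_type_ == "ORG"]
                # Also look at the auxiliary, e.g. "was" in "was working"
                aux = [token for token in head.children if token.dep_ == "aux"]
                past_aux = any(t.tag_ == "VBD" for t in aux)
                past = head.tag_ == "VBD" or (head.tag_ == "VBG" and past_aux)
                print({"person": ent, "orgs": orgs, "past": past})
    return doc
```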

In your final rule-based system, you may end up with several different code paths to cover the types of constructions that occur in your data.