
Linguistic Features

Processing raw text intelligently is difficult: most words are rare, and it’s common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it’s possible to solve some problems starting from only the raw characters, it’s usually better to use linguistic knowledge to add useful information. That’s exactly what spaCy is designed to do: you put in raw text and get back a Doc object that comes with a variety of annotations.

Part-of-speech tagging Needs model

After tokenization, spaCy can parse and tag a given Doc. This is where the trained pipeline and its statistical models come in, which enable spaCy to make predictions of which tag or label most likely applies in this context. A trained component includes binary data that is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following “the” in English is most likely a noun.

Linguistic annotations are available as Token attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore _ to its name:

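For example, a minimal sketch that prints these attributes, assuming the small English pipeline en_core_web_sm is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    # The underscore variants return readable strings instead of hash values
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
```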

Text    | Lemma   | POS   | Tag | Dep      | Shape | alpha | stop
Apple   | apple   | PROPN | NNP | nsubj    | Xxxxx | True  | False
is      | be      | AUX   | VBZ | aux      | xx    | True  | True
looking | look    | VERB  | VBG | ROOT     | xxxx  | True  | False
at      | at      | ADP   | IN  | prep     | xx    | True  | True
buying  | buy     | VERB  | VBG | pcomp    | xxxx  | True  | False
U.K.    | u.k.    | PROPN | NNP | compound | X.X.  | False | False
startup | startup | NOUN  | NN  | dobj     | xxxx  | True  | False
for     | for     | ADP   | IN  | prep     | xxx   | True  | True
$       | $       | SYM   | $   | quantmod | $     | False | False
1       | 1       | NUM   | CD  | compound | d     | False | False
billion | billion | NUM   | CD  | pobj     | xxxx  | True  | False

Using spaCy’s built-in displaCy visualizer, here’s what our example sentence and its dependencies look like:

Morphology

Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function but do not change its part-of-speech. We say that a lemma (root form) is inflected (modified/combined) with one or more morphological features to create a surface form. Here are some examples:

Context                                  | Surface | Lemma | POS  | Morphological Features
I was reading the paper                  | reading | read  | VERB | VerbForm=Ger
I don’t watch the news, I read the paper | read    | read  | VERB | VerbForm=Fin, Mood=Ind, Tense=Pres
I read the paper yesterday               | read    | read  | VERB | VerbForm=Fin, Mood=Ind, Tense=Past

Morphological features are stored in the MorphAnalysis under Token.morph, which allows you to access individual morphological features.

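A short sketch of inspecting the analysis, assuming en_core_web_sm; the exact features depend on the pipeline:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I was reading the paper.")
token = doc[0]  # "I"
print(token.morph)                  # e.g. Case=Nom|Number=Sing|Person=1|PronType=Prs
print(token.morph.get("PronType"))  # e.g. ['Prs']
```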

Statistical morphology v3.0 Needs model

spaCy’s statistical Morphologizer component assigns the morphological features and coarse-grained part-of-speech tags as Token.morph and Token.pos.


Rule-based morphology

For languages with relatively simple morphological systems like English, spaCy can assign morphological features through a rule-based approach, which uses the token text and fine-grained part-of-speech tags to produce coarse-grained part-of-speech tags and morphological features.

  1. The part-of-speech tagger assigns each token a fine-grained part-of-speech tag. In the API, these tags are known as Token.tag. They express the part-of-speech (e.g. verb) and some amount of morphological information, e.g. that the verb is past tense (e.g. VBD for a past-tense verb in the Penn Treebank).
  2. For words whose coarse-grained POS is not set by a prior process, a mapping table maps the fine-grained tags to coarse-grained POS tags and morphological features.


Lemmatization v3.0

spaCy provides two pipeline components for lemmatization:

  1. The Lemmatizer component provides lookup and rule-based lemmatization methods in a configurable component. An individual language can extend the Lemmatizer as part of its language data.
  2. The EditTreeLemmatizer v3.3 component provides a trainable lemmatizer.

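A minimal sketch, assuming en_core_web_sm, whose lemmatizer is expected to run in rule mode:

```python
import spacy

# English pipelines include a rule-based lemmatizer component
nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode)  # expected: 'rule'

doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])
# expected: ['I', 'be', 'read', 'the', 'paper', '.']
```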

The data for spaCy’s lemmatizers is distributed in the package spacy-lookups-data. The provided trained pipelines already include all the required tables, but if you are creating new pipelines, you’ll probably want to install spacy-lookups-data to provide the data when the lemmatizer is initialized.

Lookup lemmatizer

For pipelines without a tagger or morphologizer, a lookup lemmatizer can be added to the pipeline as long as a lookup table is provided, typically through spacy-lookups-data. The lookup lemmatizer looks up the token surface form in the lookup table without reference to the token’s part-of-speech or context.

Rule-based lemmatizer Needs model

When training pipelines that include a component that assigns part-of-speech tags (a morphologizer or a tagger with a POS mapping), a rule-based lemmatizer can be added using rule tables from spacy-lookups-data:

The rule-based deterministic lemmatizer maps the surface form to a lemma in light of the previously assigned coarse-grained part-of-speech and morphological information, without consulting the context of the token. The rule-based lemmatizer also accepts list-based exception files. For English, these are acquired from WordNet.

Trainable lemmatizer Needs model

The EditTreeLemmatizer can learn form-to-lemma transformations from a training corpus that includes lemma annotations. This removes the need to write language-specific rules and can (in many cases) provide higher accuracies than lookup and rule-based lemmatizers.

Dependency Parsing Needs model

spaCy features a fast and accurate syntactic dependency parser, and has a rich API for navigating the tree. The parser also powers the sentence boundary detection, and lets you iterate over base noun phrases, or “chunks”. You can check whether a Doc object has been parsed by calling doc.has_annotation("DEP"), which checks whether the attribute Token.dep has been set and returns a boolean value. If the result is False, the default sentence iterator will raise an exception.

Noun chunks

Noun chunks are “base noun phrases” – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, “the lavish green grass” or “the world’s largest tech fund”. To get the noun chunks in a document, simply iterate over Doc.noun_chunks.

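A minimal sketch, assuming en_core_web_sm:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)
```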

Text                | root.text     | root.dep_ | root.head.text
Autonomous cars     | cars          | nsubj     | shift
insurance liability | liability     | dobj      | shift
manufacturers       | manufacturers | pobj      | toward

spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head. As with other attributes, the value of .dep is a hash value. You can get the string value with .dep_.

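A minimal sketch that walks the parse, assuming en_core_web_sm:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])
```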

Text          | Dep      | Head text | Head POS | Children
Autonomous    | amod     | cars      | NOUN     |
cars          | nsubj    | shift     | VERB     | Autonomous
shift         | ROOT     | shift     | VERB     | cars, liability, toward
insurance     | compound | liability | NOUN     |
liability     | dobj     | shift     | VERB     | insurance
toward        | prep     | shift     | VERB     | manufacturers
manufacturers | pobj     | toward    | ADP      |

Because the syntactic relations form a tree, every word has exactly one head. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence. This is usually the best way to match an arc of interest – from below:

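For example, a sketch that finds verbs with a subject by iterating from below, assuming en_core_web_sm:

```python
import spacy
from spacy.symbols import nsubj, VERB

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

# Finding a verb with a subject from below – good
verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)
print(verbs)
```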

If you try to match from above, you’ll have to iterate twice. Once for the head, and then again through the children:
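
A sketch of the same search from above, for comparison (same assumptions as the previous snippet):

```python
import spacy
from spacy.symbols import nsubj, VERB

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

# Finding a verb with a subject from above – less efficient
verbs = []
for possible_verb in doc:
    if possible_verb.pos == VERB:
        for possible_subject in possible_verb.children:
            if possible_subject.dep == nsubj:
                verbs.append(possible_verb)
                break
print(verbs)
```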

To iterate through the children, use the token.children attribute, which provides a sequence of Token objects.

A few more convenience attributes are provided for iterating around the local tree from the token. The Token.lefts and Token.rights attributes provide sequences of syntactic children that occur before and after the token. Both sequences are in sentence order. There are also two integer-typed attributes, Token.n_lefts and Token.n_rights, which give the number of left and right children.

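A small sketch covering all four attributes, assuming en_core_web_sm:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("bright red apples on the tree")
print([token.text for token in doc[2].lefts])   # expected: ['bright', 'red']
print([token.text for token in doc[2].rights])  # expected: ['on']
print(doc[2].n_lefts)   # expected: 2
print(doc[2].n_rights)  # expected: 1
```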


You can get a whole phrase by its syntactic head using the Token.subtree attribute. This returns an ordered sequence of tokens. You can walk up the tree with the Token.ancestors attribute, and check dominance with Token.is_ancestor.

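A sketch that walks a subject’s subtree and ancestors, assuming en_core_web_sm:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")

root = [token for token in doc if token.head == token][0]
subject = list(root.lefts)[0]
for descendant in subject.subtree:
    assert subject is descendant or subject.is_ancestor(descendant)
    print(descendant.text, descendant.dep_, descendant.n_lefts,
          descendant.n_rights,
          [ancestor.text for ancestor in descendant.ancestors])
```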

Text     | Dep      | n_lefts | n_rights | ancestors
Credit   | nmod     | 0       | 2        | holders, submit
and      | cc       | 0       | 0        | holders, submit
mortgage | compound | 0       | 0        | account, Credit, holders, submit
account  | conj     | 1       | 0        | Credit, holders, submit
holders  | nsubj    | 1       | 0        | submit

Finally, the .left_edge and .right_edge attributes can be especially useful, because they give you the first and last token of the subtree. This is the easiest way to create a Span object for a syntactic phrase. Note that .right_edge gives a token within the subtree – so if you use it as the end-point of a range, don’t forget to +1!

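A sketch that builds a Span from the edges and merges it, assuming en_core_web_sm:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Credit and mortgage account holders must submit their requests")
span = doc[doc[4].left_edge.i : doc[4].right_edge.i + 1]  # note the +1
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
```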

Text                                | POS  | Dep   | Head text
Credit and mortgage account holders | NOUN | nsubj | submit
must                                | VERB | aux   | submit
submit                              | VERB | ROOT  | submit
their                               | ADJ  | poss  | requests
requests                            | NOUN | dobj  | submit

The dependency parse can be a useful tool for information extraction, especially when combined with other predictions like named entities. The following example extracts money and currency values, i.e. entities labeled as MONEY, and then uses the dependency parse to find the noun phrase they are referring to – for example “Net income” → “$9.4 million”.

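One possible sketch, assuming en_core_web_sm plus the built-in merge_entities and merge_noun_chunks components; the input sentences are only illustrative:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Merge noun phrases and entities so each shows up as a single token
nlp.add_pipe("merge_entities")
nlp.add_pipe("merge_noun_chunks")

TEXTS = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]
for doc in nlp.pipe(TEXTS):
    for token in doc:
        if token.ent_type_ == "MONEY":
            # Attribute or direct object: look for a subject on the head verb
            if token.dep_ in ("attr", "dobj"):
                subj = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                if subj:
                    print(subj[0], "-->", token)
            # Prepositional object: report the head of the preposition
            elif token.dep_ == "pobj" and token.head.dep_ == "prep":
                print(token.head.head, "-->", token)
```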

Visualizing dependencies

The best way to understand spaCy’s dependency parser is interactively. To make this easier, spaCy comes with a visualization module. You can pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw markup. If you want to know how to write rules that hook into some type of syntactic construction, just plug the sentence into the visualizer and see how spaCy annotates it.

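A minimal sketch, assuming en_core_web_sm:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
# Serves a simple web page with the dependency visualization
displacy.serve(doc, style="dep")
```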

Disabling the parser

In the trained pipelines provided by spaCy, the parser is loaded and enabled by default as part of the standard processing pipeline. If you don’t need any of the syntactic information, you should disable the parser. Disabling the parser will make spaCy load and run much faster. If you want to load the parser, but need to disable it for specific documents, you can also control its use on the nlp object. For more details, see the usage guide on disabling pipeline components.

Named Entity Recognition

spaCy features an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens. The default trained pipelines can identify a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.

Named Entity Recognition 101

A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the ents property of a Doc:

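A minimal sketch, assuming en_core_web_sm:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```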

Text       | Start | End | Label | Description
Apple      | 0     | 5   | ORG   | Companies, agencies, institutions.
U.K.       | 27    | 31  | GPE   | Geopolitical entity, i.e. countries, cities, states.
$1 billion | 44    | 54  | MONEY | Monetary values, including unit.

Using spaCy’s built-in displaCy visualizer, here’s what our example sentence and its named entities look like:

Apple [ORG] is looking at buying U.K. [GPE] startup for $1 billion [MONEY]

Accessing entity annotations and labels

The standard way to access entity annotations is the doc.ents property, which produces a sequence of Span objects. The entity type is accessible either as a hash value or as a string, using the attributes ent.label and ent.label_. The Span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.

You can also access token entity annotations using the token.ent_iob and token.ent_type attributes. token.ent_iob indicates whether an entity starts, continues or ends on the tag. If no entity type is set on a token, it will return an empty string.

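A sketch of both access patterns, assuming en_core_web_sm:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

# Document-level entity annotations
print([(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents])

# Token-level entity annotations
print(doc[0].text, doc[0].ent_iob_, doc[0].ent_type_)  # expected: San B GPE
print(doc[1].text, doc[1].ent_iob_, doc[1].ent_type_)  # expected: Francisco I GPE
```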

Text      | ent_iob | ent_iob_ | ent_type_ | Description
San       | 3       | B        | "GPE"     | beginning of an entity
Francisco | 1       | I        | "GPE"     | inside an entity
considers | 2       | O        | ""        | outside an entity
banning   | 2       | O        | ""        | outside an entity
sidewalk  | 2       | O        | ""        | outside an entity
delivery  | 2       | O        | ""        | outside an entity
robots    | 2       | O        | ""        | outside an entity

Setting entity annotations

To ensure that the sequence of token annotations remains consistent, you have to set entity annotations at the document level. However, you can’t write directly to the token.ent_iob or token.ent_type attributes, so the easiest way to set entities is to use the doc.set_ents function and create the new entity as a Span.

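A sketch, assuming en_core_web_sm and that the lowercase “fb” is not recognized out of the box:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")
print("Before", [(e.text, e.label_) for e in doc.ents])

# Create a span for the new entity: token 0 ("fb") as an ORG
fb_ent = Span(doc, 0, 1, label="ORG")
# Keep all other entity annotations unmodified
doc.set_ents([fb_ent], default="unmodified")

print("After", [(e.text, e.label_) for e in doc.ents])
```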

Keep in mind that Span is initialized with the start and end token indices, not the character offsets. To create a span from character offsets, use Doc.char_span:

Setting entity annotations from array

You can also assign entity annotations using the doc.from_array method. To do this, you should include both the ENT_TYPE and the ENT_IOB attributes in the array you’re importing from.

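A sketch, assuming en_core_web_sm; in the IOB encoding used here, the value 3 marks the beginning of an entity:

```python
import numpy
import spacy
from spacy.attrs import ENT_IOB, ENT_TYPE

nlp = spacy.load("en_core_web_sm")
doc = nlp.make_doc("London is a big city in the United Kingdom.")
print("Before", doc.ents)  # expected: ()

header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64")
attr_array[0, 0] = 3  # B(egin)
attr_array[0, 1] = doc.vocab.strings["GPE"]
doc.from_array(header, attr_array)
print("After", doc.ents)  # expected: (London,)
```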

Setting entity annotations in Cython

Finally, you can always write to the underlying struct if you compile a Cython function. This is easy to do, and allows you to write efficient native code.

Obviously, if you write directly to the array of TokenC* structs, you’ll have responsibility for ensuring that the data is left in a consistent state.

Built-in entity types

Visualizing named entities

The displaCy ENT visualizer lets you explore an entity recognition model’s behavior interactively. If you’re training a model, it’s very useful to run the visualization yourself. To help you do that, spaCy comes with a visualization module. You can pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw markup.

For more details and examples, see the usage guide on visualizing spaCy.

Named Entity example

When Sebastian Thrun [PERSON] started working on self-driving cars at Google [ORG] in 2007 [DATE], few people outside of the company took him seriously.

Entity Linking

To ground the named entities into the “real world”, spaCy provides functionality to perform entity linking, which resolves a textual entity to a unique identifier from a knowledge base (KB). You can create your own KnowledgeBase and train a new EntityLinker using that custom knowledge base.

As an example on how to define a KnowledgeBase and train an entity linker model, see this tutorial using spaCy projects.

Accessing entity identifiers Needs model

The annotated KB identifier is accessible as either a hash value or as a string, using the attributes ent.kb_id and ent.kb_id_ of a Span object, or the ent_kb_id and ent_kb_id_ attributes of a Token object.

Tokenization

Tokenization is the task of splitting a text into meaningful segments, called tokens. The input to the tokenizer is a unicode text, and the output is a Doc object. To construct a Doc object, you need a Vocab instance, a sequence of word strings, and optionally a sequence of spaces booleans, which allow you to maintain alignment of the tokens into the original string.

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas “U.K.” should remain one token. Each Doc consists of individual tokens, and we can iterate over them:

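A minimal sketch, assuming en_core_web_sm:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)
```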

Index | 0     | 1  | 2       | 3  | 4      | 5    | 6       | 7   | 8 | 9 | 10
Text  | Apple | is | looking | at | buying | U.K. | startup | for | $ | 1 | billion

First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

  1. Does the substring match a tokenizer exception rule? For example, “don’t” does not contain whitespace, but should be split into two tokens, “do” and “n’t”, while “U.K.” should always remain one token.

  2. Can a prefix, suffix or infix be split off? For example punctuation like commas, periods, hyphens or quotes.

If there’s a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.

Example of the tokenization process

While punctuation rules are usually pretty general, tokenizer exceptions strongly depend on the specifics of the individual language. This is why each available language has its own subclass, like English or German, that loads in lists of hard-coded data and exception rules.

spaCy introduces a novel tokenization algorithm that gives a better balance between performance, ease of definition and ease of alignment into the original string.

After consuming a prefix or suffix, we consult the special cases again. We want the special cases to handle things like “don’t” in English, and we want the same rule to work for “(don’t)!“. We do this by splitting off the open bracket, then the exclamation, then the closed bracket, and finally matching the special case. Here’s an implementation of the algorithm in Python optimized for readability rather than performance:
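
A simplified sketch of that loop, written for readability; it is not spaCy’s actual implementation (which is written in Cython and handles more edge cases), and the toy rules at the end are only illustrative:

```python
import re

def tokenize_sketch(text, special_cases, prefix_re, suffix_re, infix_re,
                    token_match=None, url_match=None):
    tokens = []
    for substring in text.split():                     # space-separated substrings
        suffixes = []
        while substring:
            if substring in special_cases:             # special case wins
                tokens.extend(special_cases[substring])
                substring = ""
            elif token_match and token_match(substring):
                tokens.append(substring)               # never split these
                substring = ""
            elif prefix_re.search(substring):          # split off one prefix
                split = prefix_re.search(substring).end()
                tokens.append(substring[:split])
                substring = substring[split:]
            elif suffix_re.search(substring):          # split off one suffix
                split = suffix_re.search(substring).start()
                suffixes.insert(0, substring[split:])
                substring = substring[:split]
            elif url_match and url_match(substring):
                tokens.append(substring)
                substring = ""
            elif list(infix_re.finditer(substring)):   # split on infixes
                offset = 0
                for match in infix_re.finditer(substring):
                    if match.start() > offset:
                        tokens.append(substring[offset:match.start()])
                    tokens.append(match.group())
                    offset = match.end()
                if substring[offset:]:
                    tokens.append(substring[offset:])
                substring = ""
            else:                                      # nothing left to split off
                tokens.append(substring)
                substring = ""
        tokens.extend(suffixes)                        # re-attach split-off suffixes
    return tokens

# Toy rules, for illustration only
special_cases = {"don't": ["do", "n't"]}
prefix_re = re.compile(r"""^[\("']""")
suffix_re = re.compile(r"""[\)"'!.,]$""")
infix_re = re.compile(r"""[-~]""")
print(tokenize_sketch("(don't)!", special_cases, prefix_re, suffix_re, infix_re))
# expected: ['(', 'do', "n't", ')', '!']
```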

The algorithm can be summarized as follows:

  1. Iterate over space-separated substrings.
  2. Check whether we have an explicitly defined special case for this substring. If we do, use it.
  3. Look for a token match. If there is a match, stop processing and keep this token.
  4. Check whether we have an explicitly defined special case for this substring. If we do, use it.
  5. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #3, so that the token match and special cases always get priority.
  6. If we didn’t consume a prefix, try to consume a suffix and then go back to #3.
  7. If we can’t consume a prefix or a suffix, look for a URL match.
  8. If there’s no URL match, then look for a special case.
  9. Look for “infixes” – stuff like hyphens etc. and split the substring into tokens on all infixes.
  10. Once we can’t consume any more of the string, handle it as a single token.
  11. Make a final pass over the text to check for special cases that include spaces or that were missed due to the incremental processing of affixes.

Global and language-specific tokenizer data is supplied via the language data in spacy/lang. The tokenizer exceptions define special cases like “don’t” in English, which needs to be split into two tokens: {ORTH: "do"} and {ORTH: "n't", NORM: "not"}. The prefixes, suffixes and infixes mostly define punctuation rules – for example, when to split off periods (at the end of a sentence), and when to leave tokens containing periods intact (abbreviations like “U.S.”).

Tokenization rules that are specific to one language, but can be generalized across that language, should ideally live in the language data in spacy/lang – we always appreciate pull requests! Anything that’s specific to a domain or text type – like financial trading abbreviations or Bavarian youth slang – should be added as a special case rule to your tokenizer instance. If you’re dealing with a lot of customizations, it might make sense to create an entirely custom subclass.


Adding special case tokenization rules

Most domains have at least some idiosyncrasies that require custom tokenization rules. This could be very specific expressions, or abbreviations only used in this particular field. Here’s how to add a special case rule to an existing Tokenizer instance:

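A sketch, assuming en_core_web_sm; “gimme” and the gim/me split are only illustrative:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
doc = nlp("gimme that")
print([w.text for w in doc])

# Add special case rule
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

# Check new tokenization
print([w.text for w in nlp("gimme that")])  # expected: ['gim', 'me', 'that']
```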

The special case doesn’t have to match an entire whitespace-delimited substring. The tokenizer will incrementally split off punctuation, and keep looking up the remaining substring. The special case rules also have precedence over the punctuation splitting.

Debugging the tokenizer

A working implementation of the pseudo-code above is available for debugging as nlp.tokenizer.explain(text). It returns a list of tuples showing which tokenizer rule or pattern was matched for each token. The tokens produced are identical to nlp.tokenizer() except for whitespace tokens:

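A sketch using a blank English tokenizer; each tuple pairs the rule or pattern name with the produced token text:

```python
from spacy.lang.en import English

nlp = English()
text = '''"Let's go!"'''
doc = nlp(text)
tok_exp = nlp.tokenizer.explain(text)
assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
for t in tok_exp:
    print(t[1], "\t", t[0])
```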

Customizing spaCy’s Tokenizer class

Let’s imagine you wanted to create a tokenizer for a new language or specific domain. There are six things you may need to define:

  1. A dictionary of special cases. This handles things like contractions, units of measurement, emoticons, certain abbreviations, etc.
  2. A function prefix_search, to handle preceding punctuation, such as open quotes, open brackets, etc.
  3. A function suffix_search, to handle succeeding punctuation, such as commas, periods, close quotes, etc.
  4. A function infix_finditer, to handle non-whitespace separators, such as hyphens etc.
  5. An optional boolean function token_match matching strings that should never be split, overriding the infix rules. Useful for things like numbers.
  6. An optional boolean function url_match, which is similar to token_match except that prefixes and suffixes are removed before applying the match.

You shouldn’t usually need to create a Tokenizer subclass. Standard usage is to use re.compile() to build a regular expression object, and pass its .search() and .finditer() methods:

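A sketch with toy rules; the regular expressions are only illustrative:

```python
import re
import spacy
from spacy.tokenizer import Tokenizer

special_cases = {":)": [{"ORTH": ":)"}]}
prefix_re = re.compile(r"""^[\[\("']""")
suffix_re = re.compile(r"""[\]\)"']$""")
infix_re = re.compile(r"""[-~]""")
simple_url_re = re.compile(r"""^https?://""")

def custom_tokenizer(nlp):
    return Tokenizer(
        nlp.vocab,
        rules=special_cases,
        prefix_search=prefix_re.search,
        suffix_search=suffix_re.search,
        infix_finditer=infix_re.finditer,
        url_match=simple_url_re.match,
    )

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("hello-world. :)")
print([t.text for t in doc])  # expected: ['hello', '-', 'world.', ':)']
```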

If you need to subclass the tokenizer instead, the relevant methods to specialize are find_prefix, find_suffix and find_infix.

Modifying existing rule sets

In many situations, you don’t necessarily need entirely custom rules. Sometimes you just want to add another character to the prefixes, suffixes or infixes. The default prefix, suffix and infix rules are available via the nlp object’s Defaults, and Tokenizer attributes such as Tokenizer.suffix_search are writable, so you can overwrite them with compiled regular expression objects built from modified default rules. spaCy ships with utility functions to help you compile the regular expressions – for example, compile_suffix_regex:
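
A sketch that adds a pattern for trailing hyphens, assuming en_core_web_sm:

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.load("en_core_web_sm")
# Add one or more trailing hyphens as an additional suffix pattern
suffixes = list(nlp.Defaults.suffixes) + [r"""-+$"""]
suffix_regex = compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
```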

Similarly, you can remove a character from the default suffixes:
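
A sketch; which pattern you drop depends on your data:

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.load("en_core_web_sm")
suffixes = list(nlp.Defaults.suffixes)
suffixes.pop(0)  # or suffixes.remove(<some pattern>) for a specific character
suffix_regex = compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search
```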

The Tokenizer.suffix_search attribute should be a function which takes a unicode string and returns a regex match object or None. Usually we use the .search attribute of a compiled regex object, but you can use some other function that behaves the same way.

The prefix, infix and suffix rule sets include not only individual characters but also detailed regular expressions that take the surrounding context into account. For example, there is a regular expression that treats a hyphen between letters as an infix. If you do not want the tokenizer to split on hyphens between letters, you can modify the existing infix definition from lang/punctuation.py:

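A sketch based on the default infix patterns, with the letter-hyphen-letter rule left out; the exact pattern list may differ between spaCy versions:

```python
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # Left out: the default rule that splits on hyphens between letters
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp("mother-in-law")
print([t.text for t in doc])  # expected: ['mother-in-law']
```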

For an overview of the default regular expressions, see lang/punctuation.py and language-specific definitions such as lang/de/punctuation.py for German.

Hooking a custom tokenizer into the pipeline

The tokenizer is the first component of the processing pipeline and the only one that can’t be replaced by writing to nlp.pipeline. This is because it has a different signature from all the other components: it takes a text and returns a Doc, whereas all other components expect to already receive a tokenized Doc.

The processing pipeline

To overwrite the existing tokenizer, you need to replace nlp.tokenizer with a custom function that takes a text and returns a Doc.

Argument | Type | Description
text     | str  | The raw text to tokenize.

Example 1: Basic whitespace tokenizer

Here’s an example of the most basic whitespace tokenizer. It takes the shared vocab, so it can construct Doc objects. When it’s called on a text, it returns a Doc object consisting of the text split on single space characters. We can then overwrite the nlp.tokenizer attribute with an instance of our custom tokenizer.

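A sketch of such a tokenizer, assuming en_core_web_sm for the rest of the pipeline:

```python
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        # Avoid zero-length tokens
        for i, word in enumerate(words):
            if word == "":
                words[i] = " "
                spaces[i] = False
        # Remove the final trailing space
        if words[-1] == " ":
            words = words[0:-1]
            spaces = spaces[0:-1]
        else:
            spaces[-1] = False
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought. It wasn't a dream.")
print([token.text for token in doc])
```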

Example 2: Third-party tokenizers (BERT word pieces)

You can use the same approach to plug in any other third-party tokenizers. Your custom callable just needs to return a Doc object with the tokens produced by your tokenizer. In this example, the wrapper uses the BERT word piece tokenizer, provided by the tokenizers library. The tokens available in the Doc object returned by spaCy now match the exact word pieces produced by the tokenizer.

Custom BERT word piece tokenizer
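
A sketch of such a wrapper, assuming the third-party tokenizers package is installed; the class name BertTokenizer and the vocab file path are placeholders:

```python
import spacy
from spacy.tokens import Doc
from tokenizers import BertWordPieceTokenizer  # third-party library

class BertTokenizer:
    def __init__(self, vocab, vocab_file, lowercase=True):
        self.vocab = vocab
        self._tokenizer = BertWordPieceTokenizer(vocab_file, lowercase=lowercase)

    def __call__(self, text):
        encoding = self._tokenizer.encode(text)
        words = []
        spaces = []
        for i, (piece, (start, end)) in enumerate(zip(encoding.tokens, encoding.offsets)):
            words.append(piece)
            # Guess trailing whitespace from the gap to the next word piece
            if i < len(encoding.tokens) - 1:
                next_start, _ = encoding.offsets[i + 1]
                spaces.append(next_start > end)
            else:
                spaces.append(True)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("en")
nlp.tokenizer = BertTokenizer(nlp.vocab, "bert-base-uncased-vocab.txt")  # placeholder path
doc = nlp("Justin Drew Bieber is a Canadian singer, songwriter, and actor.")
print([token.text for token in doc])
```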

Training with custom tokenization v3.0

spaCy’s training config describes the settings, hyperparameters, pipeline and tokenizer used for constructing and training the pipeline. The [nlp.tokenizer] block refers to a registered function that takes the nlp object and returns a tokenizer. Here, we’re registering a function called whitespace_tokenizer in the @tokenizers registry. To make sure spaCy knows how to construct your tokenizer during training, you can pass in your Python file by setting --code functions.py when you run spacy train.

functions.py
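
A sketch of what such a file might contain, using a simplified whitespace tokenizer:

```python
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        return Doc(self.vocab, words=words)

@spacy.registry.tokenizers("whitespace_tokenizer")
def create_whitespace_tokenizer():
    def create_tokenizer(nlp):
        return WhitespaceTokenizer(nlp.vocab)
    return create_tokenizer
```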

Registered functions can also take arguments that are then passed in from the config. This allows you to quickly change and keep track of different settings. Here, the registered function called bert_word_piece_tokenizer takes two arguments: the path to a vocabulary file and whether to lowercase the text. The Python type hints str and bool ensure that the received values have the correct type.

functions.py
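
A sketch of the parametrized version; BertTokenizer refers to a wrapper class like the one sketched earlier:

```python
import spacy

@spacy.registry.tokenizers("bert_word_piece_tokenizer")
def create_bert_tokenizer(vocab_file: str, lowercase: bool):
    def create_tokenizer(nlp):
        # BertTokenizer: a wrapper class as sketched in the word piece example
        return BertTokenizer(nlp.vocab, vocab_file, lowercase)
    return create_tokenizer
```

The [nlp.tokenizer] block in the config would then reference @tokenizers = "bert_word_piece_tokenizer" and supply values for vocab_file and lowercase.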

To avoid hard-coding local paths into your config file, you can also set the vocab path on the CLI by using the --nlp.tokenizer.vocab_file override when you run spacy train. For more details on using registered functions, see the docs in training with custom code.

Using pre-tokenized text

spaCy generally assumes by default that your data is raw text. However, sometimes your data is partially annotated, e.g. with pre-existing tokenization, part-of-speech tags, etc. The most common situation is that you have pre-defined tokenization. If you have a list of strings, you can create a Doc object directly. Optionally, you can also specify a list of boolean values, indicating whether each word is followed by a space.

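A minimal sketch:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = ["Hello", ",", "world", "!"]
spaces = [False, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)
print([(t.text, t.text_with_ws, t.whitespace_) for t in doc])
```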

If provided, the spaces list must be the same length as the words list. The spaces list affects the doc.text, span.text, token.idx, span.start_char and span.end_char attributes. If you don’t provide a spaces sequence, spaCy will assume that all words are followed by a space. Once you have a Doc object, you can write to its attributes to set the part-of-speech tags, syntactic dependencies, named entities and other attributes.

Aligning tokenization

spaCy’s tokenization is non-destructive and uses language-specific rules optimized for compatibility with treebank annotations. Other tools and resources can sometimes tokenize things differently – for example, "I'm" → ["I", "'", "m"] instead of ["I", "'m"].

In situations like that, you often want to align the tokenization so that you can merge annotations from different sources together, or take vectors predicted by a pretrained BERT model and apply them to spaCy tokens. spaCy’s Alignment object allows the one-to-one mappings of token indices in both directions as well as taking into account indices where multiple tokens align to one single token.

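A sketch, assuming the Alignment helper from spacy.training; the attribute names follow the prose below:

```python
from spacy.training import Alignment

other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
print("a -> b, lengths:", align.x2y.lengths)
print("a -> b, mapping:", align.x2y.data)
print("b -> a, lengths:", align.y2x.lengths)
print("b -> a, mapping:", align.y2x.data)
```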

Here are some insights from the alignment information generated in the example above:

  • The one-to-one mappings for the first four tokens are identical, which means they map to each other. This makes sense because they’re also identical in the input: "i", "listened", "to" and "obama".
  • The value of x2y.data[6] is 5, which means that other_tokens[6] ("podcasts") aligns to spacy_tokens[5] (also "podcasts").
  • x2y.data[4] and x2y.data[5] are both 4, which means that both tokens 4 and 5 of other_tokens ("'" and "s") align to token 4 of spacy_tokens ("'s").

Merging and splitting

The Doc.retokenize context manager lets you merge and split tokens. Modifications to the tokenization are stored and performed all at once when the context manager exits. To merge several tokens into one single token, pass a Span to retokenizer.merge. An optional dictionary of attrs lets you set attributes that will be assigned to the merged token – for example, the lemma, part-of-speech tag or entity type. By default, the merged token will receive the same attributes as the merged span’s root.

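A minimal sketch, assuming en_core_web_sm:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live in New York")
print("Before:", [token.text for token in doc])

with doc.retokenize() as retokenizer:
    # Merge "New York" into one token and overwrite its lemma
    retokenizer.merge(doc[3:5], attrs={"LEMMA": "new york"})
print("After:", [token.text for token in doc])
```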

If an attribute in the attrs is a context-dependent token attribute, it will be applied to the underlying Token. For example LEMMA, POS or DEP only apply to a word in context, so they’re token attributes. If an attribute is a context-independent lexical attribute, it will be applied to the underlying Lexeme, the entry in the vocabulary. For example, LOWER or IS_STOP apply to all words of the same spelling, regardless of the context.

Splitting tokens

The retokenizer.split method allows splitting one token into two or more tokens. This can be useful for cases where tokenization rules alone aren’t sufficient. For example, you might want to split “its” into the tokens “it” and “is” – but not the possessive pronoun “its”. You can write rule-based logic that can find only the correct “its” to split, but by that time, the Doc will already be tokenized.

This process of splitting a token requires more settings, because you need to specify the text of the individual tokens, optional per-token attributes and how the tokens should be attached to the existing syntax tree. This can be done by supplying a list of heads – either the token to attach the newly split token to, or a (token, subtoken) tuple if the newly split token should be attached to another subtoken. In this case, “New” should be attached to “York” (the second split subtoken) and “York” should be attached to “in”.

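A sketch, assuming en_core_web_sm; the POS and DEP values are only illustrative:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live in NewYork")
print("Before:", [token.text for token in doc])

with doc.retokenize() as retokenizer:
    heads = [(doc[3], 1), doc[2]]  # "New" -> "York", "York" -> "in"
    attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["compound", "pobj"]}
    retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)
print("After:", [token.text for token in doc])
```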

Specifying the heads as a list of token or (token, subtoken) tuples allows attaching split subtokens to other subtokens, without having to keep track of the token indices after splitting.

Token  | Head        | Description
"New"  | (doc[3], 1) | Attach this token to the second subtoken (index 1) that doc[3] will be split into, i.e. “York”.
"York" | doc[2]      | Attach this token to doc[2] in the original Doc, i.e. “in”.

If you don’t care about the heads (for example, if you’re only running the tokenizer and not the parser), you can attach each subtoken to itself:
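
A sketch, assuming en_core_web_sm:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live in NewYorkCity")
with doc.retokenize() as retokenizer:
    # Each subtoken points at itself via (original token, subtoken index)
    heads = [(doc[3], 0), (doc[3], 1), (doc[3], 2)]
    retokenizer.split(doc[3], ["New", "York", "City"], heads=heads)
print([token.text for token in doc])
```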

Overwriting custom extension attributes

If you’ve registered custom extension attributes, you can overwrite them during tokenization by providing a dictionary of attribute names mapped to new values as the "_" key in the attrs. For merging, you need to provide one dictionary of attributes for the resulting merged token. For splitting, you need to provide a list of dictionaries with custom attributes, one per split subtoken.

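A sketch with a hypothetical is_musician extension, assuming en_core_web_sm:

```python
import spacy
from spacy.tokens import Token

# Register a custom token attribute, token._.is_musician
Token.set_extension("is_musician", default=False)

nlp = spacy.load("en_core_web_sm")
doc = nlp("I like David Bowie")
print("Before:", [(token.text, token._.is_musician) for token in doc])

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[2:4], attrs={"_": {"is_musician": True}})
print("After:", [(token.text, token._.is_musician) for token in doc])
```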

Sentence Segmentation

A Doc object’s sentences are available via the Doc.sents property. To view a Doc’s sentences, you can iterate over the Doc.sents, a generator that yields Span objects. You can check whether a Doc has sentence boundaries by calling Doc.has_annotation with the attribute name "SENT_START".

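A minimal sketch, assuming en_core_web_sm:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence. This is another sentence.")
assert doc.has_annotation("SENT_START")
for sent in doc.sents:
    print(sent.text)
```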

spaCy provides four alternatives for sentence segmentation:

  1. Dependency parser: the statistical DependencyParser provides the most accurate sentence boundaries based on full dependency parses.
  2. Statistical sentence segmenter: the statistical SentenceRecognizer is a simpler and faster alternative to the parser that only sets sentence boundaries.
  3. Rule-based pipeline component: the rule-based Sentencizer sets sentence boundaries using a customizable list of sentence-final punctuation.
  4. Custom function: your own custom function added to the processing pipeline can set sentence boundaries by writing to Token.is_sent_start.

Default: Using the dependency parse Needs model

Unlike other libraries, spaCy uses the dependency parse to determine sentence boundaries. This is usually the most accurate approach, but it requires a trained pipeline that provides accurate predictions. If your texts are closer to general-purpose news or web text, this should work well out-of-the-box with spaCy’s provided trained pipelines. For social media or conversational text that doesn’t follow the same rules, your application may benefit from a custom trained or rule-based component.


spaCy’s dependency parser respects already set boundaries, so you can preprocess your Doc using custom components before it’s parsed. Depending on your text, this may also improve parse accuracy, since the parser is constrained to predict parses consistent with the sentence boundaries.

Statistical sentence segmenter v3.0 Needs model

The SentenceRecognizer is a simple statistical component that only provides sentence boundaries. Along with being faster and smaller than the parser, its primary advantage is that it’s easier to train because it only requires annotated sentence boundaries rather than full dependency parses. spaCy’s trained pipelines include both a parser and a trained sentence segmenter, which is disabled by default. If you only need sentence boundaries and no parser, you can use the exclude or disable argument on spacy.load to load the pipeline without the parser and then enable the sentence recognizer explicitly with nlp.enable_pipe.

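A sketch, assuming en_core_web_sm ships its sentence recognizer under the name senter, disabled by default:

```python
import spacy

nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.enable_pipe("senter")
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
```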

Rule-based pipeline component

The Sentencizer component is a pipeline component that splits sentences on punctuation like ., ! or ?. You can plug it into your pipeline if you only need sentence boundaries without dependency parses.

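A minimal sketch on a blank English pipeline:

```python
import spacy

nlp = spacy.blank("en")  # just the language, no trained components
nlp.add_pipe("sentencizer")
doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
```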

Custom rule-based strategy

If you want to implement your own strategy that differs from the default rule-based approach of splitting on sentences, you can also create a custom pipeline component that takes a Doc object and sets the Token.is_sent_start attribute on each individual token. If set to False, the token is explicitly marked as not the start of a sentence. If set to None (default), it’s treated as a missing value and can still be overwritten by the parser.

Here’s an example of a component that implements a pre-processing rule for splitting on "..." tokens. The component is added before the parser, which is then used to further segment the text. That’s possible, because is_sent_start is only set to True for some of the tokens – all others still specify None for unset sentence boundaries. This approach can be useful if you want to implement additional rules specific to your data, while still being able to take advantage of dependency-based sentence segmentation.

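A sketch of such a component, assuming en_core_web_sm:

```python
import spacy
from spacy.language import Language

text = "this is a sentence...hello...and another sentence."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("Before:", [sent.text for sent in doc.sents])

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # Mark the token after every "..." as a sentence start
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp.add_pipe("set_custom_boundaries", before="parser")
doc = nlp(text)
print("After:", [sent.text for sent in doc.sents])
```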

Mappings & Exceptions v3.0

The AttributeRuler manages rule-based mappings and exceptions for all token-level attributes. As the number of pipeline components has grown from spaCy v2 to v3, handling rules and exceptions in each component individually has become impractical, so the AttributeRuler provides a single component with a unified pattern format for all token attribute mappings and exceptions.

The AttributeRuler uses Matcher patterns to identify tokens and then assigns them the provided attributes. If needed, the Matcher patterns can include context around the target token. For example, the attribute ruler can:

  • provide exceptions for any token attributes
  • map fine-grained tags to coarse-grained tags for languages without statistical morphologizers (replacing the v2.x tag_map in the language data)
  • map token surface form + fine-grained tags to morphological features (replacing the v2.x morph_rules in the language data)
  • specify the tags for space tokens (replacing hard-coded behavior in the tagger)

The following example shows how the tag and POS NNP/PROPN can be specified for the phrase "The Who", overriding the tags provided by the statistical tagger and the POS tag map.

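A sketch, assuming en_core_web_sm; the tags printed before adding the rules depend on the model:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "I saw The Who perform. Who did you see?"
doc1 = nlp(text)
print(doc1[2].tag_, doc1[2].pos_)  # e.g. DT DET
print(doc1[3].tag_, doc1[3].pos_)  # e.g. WP PRON

# Add an attribute ruler exception for "The Who" as NNP/PROPN NNP/PROPN
ruler = nlp.get_pipe("attribute_ruler")
patterns = [[{"LOWER": "the"}, {"TEXT": "Who"}]]
attrs = {"TAG": "NNP", "POS": "PROPN"}
ruler.add(patterns=patterns, attrs=attrs, index=0)  # "The" in "The Who"
ruler.add(patterns=patterns, attrs=attrs, index=1)  # "Who" in "The Who"

doc2 = nlp(text)
print(doc2[2].tag_, doc2[2].pos_)  # expected: NNP PROPN
print(doc2[3].tag_, doc2[3].pos_)  # expected: NNP PROPN
print(doc2[6].tag_, doc2[6].pos_)  # the second "Who" stays unmodified
```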

Word vectors and semantic similarity

Similarity is determined by comparing word vectors or “word embeddings”, multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec and usually look like this:

banana.vector

Pipeline packages that come with built-in word vectors make them available as the Token.vector attribute. Doc.vector and Span.vector will default to an average of their token vectors. You can also check if a token has a vector assigned, and get the L2 norm, which can be used to normalize vectors.

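A sketch, assuming a pipeline with vectors such as en_core_web_md:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # medium pipeline with word vectors
tokens = nlp("dog cat banana afskfsd")

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
```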

The words “dog”, “cat” and “banana” are all pretty common in English, so they’re part of the pipeline’s vocabulary, and come with a vector. The word “afskfsd” on the other hand is a lot less common and out-of-vocabulary – so its vector representation consists of 300 dimensions of 0, which means it’s practically nonexistent. If your application will benefit from a large vocabulary with more vectors, you should consider using one of the larger pipeline packages or loading in a full vector package, for example, en_core_web_lg, which includes 685k unique vectors.

spaCy is able to compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest a user content that’s similar to what they’re currently looking at, or label a support ticket as a duplicate if it’s very similar to an already existing one.

Each Doc, Span, Token and Lexeme comes with a .similarity method that lets you compare it with another object, and determine the similarity. Of course similarity is always subjective – whether two words, spans or documents are similar really depends on how you’re looking at it. spaCy’s similarity implementation usually assumes a pretty general-purpose definition of similarity.

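A sketch, assuming en_core_web_md (a pipeline with vectors):

```python
import spacy

nlp = spacy.load("en_core_web_md")  # make sure to use a pipeline with vectors
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))
# Similarity of tokens and spans
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, french_fries.similarity(burgers))
```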

What to expect from similarity results

Computing similarity scores can be helpful in many situations, but it’s also important to maintain realistic expectations about what information it can provide. Words can be related to each other in many ways, so a single “similarity” score will always be a mix of different signals, and vectors trained on different data can produce very different results that may not be useful for your purpose. Here are some important considerations to keep in mind:

  • There’s no objective definition of similarity. Whether “I like burgers” and “I like pasta” is similar depends on your application. Both talk about food preferences, which makes them very similar – but if you’re analyzing mentions of food, those sentences are pretty dissimilar, because they talk about very different foods.
  • The similarity of Doc and Span objects defaults to the average of the token vectors. This means that the vector for “fast food” is the average of the vectors for “fast” and “food”, which isn’t necessarily representative of the phrase “fast food”.
  • Vector averaging means that the vector of multiple tokens is insensitive to the order of the words. Two documents expressing the same meaning with dissimilar wording will return a lower similarity score than two documents that happen to contain the same words while expressing different meanings.

Adding word vectors

Custom word vectors can be trained using a number of open-source libraries, such as Gensim, FastText, or Tomas Mikolov’s original Word2vec implementation. Most word vector libraries output an easy-to-read text-based format, where each line consists of the word followed by its vector. For everyday use, we want to convert the vectors into a binary format that loads faster and takes up less space on disk. The easiest way to do this is the init vectors command-line utility. This will output a blank spaCy pipeline in the directory /tmp/la_vectors_wiki_lg, giving you access to some nice Latin vectors. You can then pass the directory path to spacy.load or use it in the [initialize] of your config when you train a model.

To help you strike a good balance between coverage and memory usage, spaCy’s Vectors class lets you map multiple keys to the same row of the table. If you’re using the spacy init vectors command to create a vocabulary, pruning the vectors will be taken care of automatically if you set the --prune flag. You can also do it manually in the following steps:

  1. Start with a word vectors package that covers a huge vocabulary. For instance, the en_core_web_lg package provides 300-dimensional GloVe vectors for 685k terms of English.
  2. If your vocabulary has values set for the Lexeme.prob attribute, the lexemes will be sorted by descending probability to determine which vectors to prune. Otherwise, lexemes will be sorted by their order in the Vocab.
  3. Call Vocab.prune_vectors with the number of vectors you want to keep.

Vocab.prune_vectors reduces the current vector table to a given number of unique entries, and returns a dictionary containing the removed words, mapped to (string, score) tuples, where string is the entry the removed word was mapped to and score the similarity score between the two words.

Removed words

In the example above, the vector for “Shore” was removed and remapped to the vector of “coast”, which is deemed about 73% similar. “Leaving” was remapped to the vector of “leaving”, which is identical. If you’re using the init vectors command, you can set the --prune option to easily reduce the size of the vectors as you add them to a spaCy pipeline:

This will create a blank spaCy pipeline with vectors for the first 10,000 words in the vectors. All other words in the vectors are mapped to the closest vector among those retained.

Adding vectors individually

The vector attribute is a read-only numpy or cupy array (depending on whether you’ve configured spaCy to use GPU memory), with dtype float32. The array is read-only so that spaCy can avoid unnecessary copy operations where possible. You can modify the vectors via the Vocab or Vectors table. Using the Vocab.set_vector method is often the easiest approach if you have vectors in an arbitrary format, as you can read in the vectors with your own logic, and just set them with a simple loop. This method is likely to be slower than approaches that work with the whole vectors table at once, but it’s a great approach for once-off conversions before you save out your nlp object to disk.

Adding vectors
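
A sketch with random vectors as placeholders for real ones:

```python
import numpy
import spacy

nlp = spacy.blank("en")
vector_data = {
    "dog": numpy.random.uniform(-1, 1, (300,)),
    "cat": numpy.random.uniform(-1, 1, (300,)),
    "orange": numpy.random.uniform(-1, 1, (300,)),
}
for word, vector in vector_data.items():
    nlp.vocab.set_vector(word, vector)
```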

Language Data

Every language is different – and usually full of exceptions and special cases, especially amongst the most common words. Some of these exceptions are shared across languages, while others are entirely specific – usually so specific that they need to be hard-coded. The lang module contains all language-specific data, organized in simple Python files. This makes the data easy to update and extend.

The shared language data in the directory root includes rules that can be generalized across languages – for example, rules for basic punctuation, emoji, emoticons and single-letter abbreviations. The individual language data in a submodule contains rules that are only relevant to a particular language. It also takes care of putting together all components and creating the Language subclass – for example, English or German. The values are defined in the Language.Defaults.

Name                 | Source file                       | Description
Stop words           | stop_words.py                     | List of most common words of a language that are often useful to filter out, for example “and” or “I”. Matching tokens will return True for is_stop.
Tokenizer exceptions | tokenizer_exceptions.py           | Special-case rules for the tokenizer, for example, contractions like “can’t” and abbreviations with punctuation, like “U.K.”.
Punctuation rules    | punctuation.py                    | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes.
Character classes    | char_classes.py                   | Character classes to be used in regular expressions, for example, Latin characters, quotes, hyphens or icons.
Lexical attributes   | lex_attrs.py                      | Custom functions for setting lexical attributes on tokens, e.g. like_num, which includes language-specific words like “ten” or “hundred”.
Syntax iterators     | syntax_iterators.py               | Functions that compute views of a Doc object based on its syntax. At the moment, only used for noun chunks.
Lemmatizer           | lemmatizer.py, spacy-lookups-data | Custom lemmatizer implementation and lemmatization tables.

Creating a custom language subclass

If you want to customize multiple components of the language data or add support for a custom language or domain-specific “dialect”, you can also implement your own language subclass. The subclass should define two attributes: the lang (unique language code) and the Defaults defining the language data. For an overview of the available attributes that can be overwritten, see the Language.Defaults documentation.

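A sketch of a custom English subclass with its own stop words (the names are only illustrative):

```python
from spacy.lang.en import English

class CustomEnglishDefaults(English.Defaults):
    stop_words = set(["custom", "stop"])

class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults

nlp1 = English()
nlp2 = CustomEnglish()

print(nlp1.lang, [token.is_stop for token in nlp1("custom stop")])
print(nlp2.lang, [token.is_stop for token in nlp2("custom stop")])
```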

The @spacy.registry.languages decorator lets you register a custom language class and assign it a string name. This means that you can call spacy.blank with your custom language name, and even train pipelines with it and refer to it in your training config.

Registering a custom language
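
A sketch of registering the same hypothetical subclass:

```python
import spacy
from spacy.lang.en import English

class CustomEnglishDefaults(English.Defaults):
    stop_words = set(["custom", "stop"])

@spacy.registry.languages("custom_en")
class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults

# The custom language can now be created by its string name
nlp = spacy.blank("custom_en")
```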