How to just split the sentences? · Issue #33 · tsproisl/SoMaJo · GitHub

How to just split the sentences? #33

Open
sambaPython24 opened this issue May 12, 2025 · 3 comments

Comments

@sambaPython24

Is there any way to just split the text into sentences, like NLTK's
from nltk.tokenize import sent_tokenize function?

@tsproisl
Owner

The sentence splitter operates on tokenized input, so splitting sentences without first tokenizing the text is not possible.

However, there are two ways to extract untokenized sentences from SoMaJo's output. You could either detokenize the output or you could use the character offset information to access the character span in the input.

For the first option, detokenizing SoMaJo's output, see the suggested solution in #17 (comment).
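The detokenization route can be sketched roughly like this. This is a minimal sketch based on the approach suggested in #17; `Tok` is a hypothetical stand-in for SoMaJo's Token class, assumed to expose the attributes `text`, `original_spelling` and `space_after` (real Token objects from the tokenizer output should work the same way):

```python
from collections import namedtuple

# Hypothetical stand-in for somajo.token.Token, with only the
# attributes this sketch relies on.
Tok = namedtuple("Tok", ["text", "original_spelling", "space_after"])


def detokenize(tokens):
    # Prefer original_spelling where the tokenizer recorded one
    # (e.g. for spelling variants it normalized), fall back to the
    # token text, and only insert a space where the input had one.
    parts = []
    for token in tokens:
        parts.append(token.original_spelling or token.text)
        if token.space_after:
            parts.append(" ")
    return "".join(parts).rstrip()


print(detokenize([
    Tok("Lust", None, True),
    Tok("auf", None, True),
    Tok("Film", None, False),
    Tok("?", None, False),
    Tok(";-)", None, False),
]))  # Lust auf Film?;-)
```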

For the second option, accessing the corresponding character span in the input, something like this might suit your needs:

import io

from somajo import SoMaJo


def extract_raw_sentence(tokens, raw_text):
    # Slice the raw input using the character offsets of the
    # first and last token of the sentence.
    start = tokens[0].character_offset[0]
    end = tokens[-1].character_offset[1]
    return raw_text[start:end]


pseudofile = io.StringIO(
    "der beste Betreuer?\n"
    "-- ProfSmith! : )\n"
    "\n"
    "Was machst du morgen Abend?! Lust auf Film?;-)"
)

# character_offsets=True tells SoMaJo to compute, for every token,
# its offset in the original input.
tokenizer = SoMaJo("de_CMC", character_offsets=True)
raw_text = pseudofile.read()
pseudofile.seek(0)

sentences = tokenizer.tokenize_text_file(pseudofile, paragraph_separator="empty_lines")
for sentence in sentences:
    print(extract_raw_sentence(sentence, raw_text))

This produces the following output:

der beste Betreuer?
-- ProfSmith! : )
Was machst du morgen Abend?!
Lust auf Film?;-)

Note that the second option will be slower due to the overhead that the alignment algorithm for the character offsets incurs.

@sambaPython24
Author
sambaPython24 commented May 13, 2025

Thank you very much for both answers! How can I customize abbreviations like No. or Nr. so that they do not trigger a sentence split? In which file can they be found?

Or, in other words: how do you decide where to split a sentence?

@tsproisl
Owner

Sorry for the delayed response. Abbreviations are defined in src/somajo/data:

  • abbreviations_(de|en).txt: Abbreviations that are not matched by (?:[[:alpha:]]\.){2,}, i.e. that are not sequences of single letters each followed by a dot.
  • eos_abbreviations.txt: Abbreviations that frequently occur at the end of a sentence. If such an abbreviation is followed by a potential sentence start, e.g. by a capital letter, it will be interpreted as the end of a sentence.
  • single_token_abbreviations_(de|en).txt: Multi-dot abbreviations that represent single tokens and should not be split.
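As an illustration of that pattern, the following is a rough standard-library sketch. Note that `[[:alpha:]]` is a POSIX character class, which Python's built-in `re` module does not support; `[^\W\d_]` (any Unicode letter) is used here as an approximation:

```python
import re

# Approximation of (?:[[:alpha:]]\.){2,}: two or more repetitions
# of a single letter followed by a dot.
multi_dot = re.compile(r"(?:[^\W\d_]\.){2,}")

# "z.B." is a sequence of single letters each followed by a dot,
# so the generic pattern already covers it.
print(bool(multi_dot.fullmatch("z.B.")))  # True

# "Nr." is not matched by the pattern, so it has to be listed
# explicitly in abbreviations_de.txt.
print(bool(multi_dot.fullmatch("Nr.")))   # False
```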
