How to just split the sentences? · Issue #33 · tsproisl/SoMaJo · GitHub

How to just split the sentences? #33

Open
sambaPython24 opened this issue May 12, 2025 · 3 comments

Comments

@sambaPython24

Is there any way to just split the text into sentences, like NLTK's
from nltk.tokenize import sent_tokenize function?

@tsproisl
Owner

The sentence splitter operates on tokenized input, so splitting sentences without first tokenizing the text is not possible.

However, there are two ways to extract untokenized sentences from SoMaJo's output. You could either detokenize the output or you could use the character offset information to access the character span in the input.

For the first option, detokenizing SoMaJo's output, see the suggested solution in #17 (comment).
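The detokenization route can be sketched roughly like this. This is a minimal sketch based on the approach suggested in #17; `Tok` is a hypothetical stand-in for SoMaJo's Token class, assumed to expose the attributes `text`, `original_spelling` and `space_after` (real Token objects from the tokenizer output should work the same way):

```python
from collections import namedtuple

# Hypothetical stand-in for somajo.token.Token, with only the
# attributes this sketch relies on.
Tok = namedtuple("Tok", ["text", "original_spelling", "space_after"])


def detokenize(tokens):
    # Prefer original_spelling where the tokenizer recorded one
    # (e.g. for spelling variants it normalized), fall back to the
    # token text, and only insert a space where the input had one.
    parts = []
    for token in tokens:
        parts.append(token.original_spelling or token.text)
        if token.space_after:
            parts.append(" ")
    return "".join(parts).rstrip()


print(detokenize([
    Tok("Lust", None, True),
    Tok("auf", None, True),
    Tok("Film", None, False),
    Tok("?", None, False),
    Tok(";-)", None, False),
]))  # Lust auf Film?;-)
```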

For the second option, accessing the corresponding character span in the input, something like this might suit your needs:

import io

from somajo import SoMaJo


def extract_raw_sentence(tokens, raw_text):
    # Slice the raw input using the character offsets of the
    # first and last token of the sentence.
    start = tokens[0].character_offset[0]
    end = tokens[-1].character_offset[1]
    return raw_text[start:end]


pseudofile = io.StringIO(
    "der beste Betreuer?\n"
    "-- ProfSmith! : )\n"
    "\n"
    "Was machst du morgen Abend?! Lust auf Film?;-)"
)

# character_offsets=True tells SoMaJo to compute, for every token,
# its offset in the original input.
tokenizer = SoMaJo("de_CMC", character_offsets=True)
raw_text = pseudofile.read()
pseudofile.seek(0)

sentences = tokenizer.tokenize_text_file(pseudofile, paragraph_separator="empty_lines")
for sentence in sentences:
    print(extract_raw_sentence(sentence, raw_text))

This produces the following output:

der beste Betreuer?
-- ProfSmith! : )
Was machst du morgen Abend?!
Lust auf Film?;-)

Note that the second option will be slower due to the overhead that the alignment algorithm for the character offsets incurs.

@sambaPython24
Author
sambaPython24 commented May 13, 2025

Thank you very much for both answers! How can I customize abbreviations like No. or Nr. so that they do not trigger a sentence split? In which file can they be found?

Or, in other words: how do you decide where to split a sentence?

@tsproisl
Owner

Sorry for the delayed response. Abbreviations are defined in src/somajo/data:

  • abbreviations_(de|en).txt: Abbreviations that are not matched by (?:[[:alpha:]]\.){2,}, i.e. that are not sequences of single letters each followed by a dot.
  • eos_abbreviations.txt: Abbreviations that frequently occur at the end of a sentence. If such an abbreviation is followed by a potential sentence start, e.g. by a capital letter, it will be interpreted as the end of a sentence.
  • single_token_abbreviations_(de|en).txt: Multi-dot abbreviations that represent single tokens and should not be split.
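As an illustration of that pattern, the following is a rough standard-library sketch. Note that `[[:alpha:]]` is a POSIX character class, which Python's built-in `re` module does not support; `[^\W\d_]` (any Unicode letter) is used here as an approximation:

```python
import re

# Approximation of (?:[[:alpha:]]\.){2,}: two or more repetitions
# of a single letter followed by a dot.
multi_dot = re.compile(r"(?:[^\W\d_]\.){2,}")

# "z.B." is a sequence of single letters each followed by a dot,
# so the generic pattern already covers it.
print(bool(multi_dot.fullmatch("z.B.")))  # True

# "Nr." is not matched by the pattern, so it has to be listed
# explicitly in abbreviations_de.txt.
print(bool(multi_dot.fullmatch("Nr.")))   # False
```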
