How to just split the sentences? #33
The sentence splitter operates on tokenized input, so splitting sentences without first tokenizing the text is not possible. However, there are two ways to extract untokenized sentences from SoMaJo's output: you can either detokenize the output, or you can use the character offset information to access the corresponding character span in the input. For the first option, detokenizing SoMaJo's output, see the suggested solution in #17 (comment); a sketch of that approach is also included after the note below. For the second option, accessing the corresponding character span in the input, something like this might suit your needs:

```python
import io

from somajo import SoMaJo


def extract_raw_sentence(tokens, raw_text):
    # Slice the original input using the character offsets of the
    # first and last token of the sentence.
    start = tokens[0].character_offset[0]
    end = tokens[-1].character_offset[1]
    return raw_text[start:end]


pseudofile = io.StringIO(
    "der beste Betreuer?\n"
    "-- ProfSmith! : )\n"
    "\n"
    "Was machst du morgen Abend?! Lust auf Film?;-)"
)

tokenizer = SoMaJo("de_CMC", character_offsets=True)

# Keep a copy of the raw text and rewind the file-like object so the
# tokenizer can read it from the beginning.
raw_text = pseudofile.read()
pseudofile.seek(0)

sentences = tokenizer.tokenize_text_file(pseudofile, paragraph_separator="empty_lines")
for sentence in sentences:
    print(extract_raw_sentence(sentence, raw_text))
```

This prints the raw (untokenized) text of each sentence.
Note that the second option will be slower due to the overhead incurred by the alignment algorithm for the character offsets.
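For completeness, here is a minimal sketch of the first option, detokenizing the output, along the lines of the suggestion in #17. It relies on the `original_spelling` and `space_after` attributes of SoMaJo's token objects; the helper name `detokenize` is only for illustration:

```python
from somajo import SoMaJo


def detokenize(tokens):
    # Reassemble the surface form of a sentence from its tokens.
    parts = []
    for token in tokens:
        # Fall back to token.text when no original spelling is recorded.
        parts.append(token.original_spelling or token.text)
        if token.space_after:
            parts.append(" ")
    return "".join(parts).strip()


tokenizer = SoMaJo("de_CMC")
sentences = tokenizer.tokenize_text(["Was machst du morgen Abend?! Lust auf Film?;-)"])
for sentence in sentences:
    print(detokenize(sentence))
```

Compared to the character-offset approach, this avoids the alignment overhead.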
Thank you very much for both answers! How can I customize the abbreviations? Or, in other words: how do you decide where to split the sentences?
Sorry for the delayed response. Abbreviations are defined in
Is there any way to just split the text into sentences, like the `sent_tokenize` function from `nltk.tokenize`?
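For reference, the NLTK function being asked about is applied to raw, untokenized text roughly like this (assuming NLTK and its Punkt sentence models are installed):

```python
from nltk.tokenize import sent_tokenize

text = "Was machst du morgen Abend?! Lust auf Film?"
# Returns a list of sentence strings without tokenizing them.
print(sent_tokenize(text, language="german"))
```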