fix to find Korean Stopwords by galaxytemple · Pull Request #138 · goose3/goose3 · GitHub

fix to find Korean Stopwords #138

Merged 4 commits on Sep 14, 2022

Conversation

galaxytemple
Contributor

Reference Issues/PRs

None

What does this implement/fix? Explain your changes.

In English, all words, including stopwords, are separated by spaces, so stopwords can be found with a simple HashSet lookup.
Korean stopwords, however, are attached to a word without a space, so they require different logic.

For example,
English: Nice to meet you ('to' and 'you' are stopwords)
Korean: 만나서 반가워요 ('요' is a stopword)
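
To make the difference concrete, here is a minimal sketch of the whitespace-and-set approach applied to both sentences. The stopword sets are tiny samples taken from the example above, not goose3's stopword files, and `count_stopwords_by_token` is a hypothetical helper name used only for illustration.

```python
# Tiny sample stopword sets taken from the example above (illustration only,
# not goose3's actual stopword lists).
ENGLISH_STOPWORDS = {"to", "you"}
KOREAN_STOPWORDS = {"요"}


def count_stopwords_by_token(text, stopwords):
    """Count whitespace-separated tokens that exactly match a stopword."""
    return sum(1 for token in text.lower().split() if token in stopwords)


english_hits = count_stopwords_by_token("Nice to meet you", ENGLISH_STOPWORDS)
korean_hits = count_stopwords_by_token("만나서 반가워요", KOREAN_STOPWORDS)

print(english_hits)  # 2: 'to' and 'you' are standalone tokens
print(korean_hits)   # 0: '요' is attached to '반가워요', so no token matches
```

The set lookup works only when each stopword is its own token, which is why the Korean sentence yields no hits.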

The Aho-Corasick algorithm is widely used for string matching because it can match multiple string patterns in a single pass over the text, so it is suitable here.
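
For readers unfamiliar with the algorithm, the following is a minimal pure-Python sketch of Aho-Corasick matching. It is not the code from this PR (which, per the discussion, relies on the pyahocorasick package); the function names `build_automaton` and `find_all` are hypothetical, and the patterns are just the stopwords from the example above.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links, and output lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # patterns that end at each state

    # Phase 1: insert every pattern into the trie.
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # Phase 2: breadth-first traversal to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every match, in one pass over text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for index, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            matches.append((index - len(pattern) + 1, pattern))
    return matches


automaton = build_automaton(["to", "you", "요"])
korean_matches = find_all("만나서 반가워요", automaton)
english_matches = find_all("Nice to meet you", automaton)
print(korean_matches)   # [(7, '요')]
print(english_matches)  # [(5, 'to'), (13, 'you')]
```

Because this matches substrings rather than tokens, it finds '요' even though it is attached to the preceding word. The same property means it would also hit 'to' inside a longer English word, so a token-based set lookup remains the natural fit for space-delimited languages.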

Any other comments?

@lababidi
Contributor
lababidi commented May 9, 2022

@galaxytemple you'll need to add ahocorasick to the dependencies and write a test, please

@galaxytemple
Contributor Author

@lababidi Thank you for your feedback. I added pyahocorasick and tests/test_stopwords.py

@barrust
Collaborator
barrust commented Sep 2, 2022

This PR looks good to me (all tests passed)! If there are no concerns I can merge and push a new version

@barrust barrust merged commit 79ff10d into goose3:master Sep 14, 2022