This repository contains my solution to the hyperexponential task, implemented by Valeriu Vicol. I decided to build a small search engine over text blocks taken from random Medium articles. The core algorithm behind the current implementation is an inverted index: you type a word, and the engine returns every sentence where that word was found, grouped by article name.
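As a rough illustration (not the repository's actual code), the inverted-index idea can be sketched like this, using the PoC's naive split on `'.'`:

```python
from collections import defaultdict

def build_index(articles):
    """Map each lowercase word to the sentences (per article) it appears in.

    `articles` is assumed to be a dict of article name -> full text;
    sentences are naively split on '.' (the PoC's approach).
    """
    index = defaultdict(lambda: defaultdict(list))
    for name, text in articles.items():
        for sentence in filter(None, (s.strip() for s in text.split("."))):
            for word in sentence.lower().split():
                index[word][name].append(sentence)
    return index

def search(index, word):
    """Return {article: [sentences]} for a single query word."""
    return dict(index.get(word.lower(), {}))

articles = {"Clouds 101": "A cloud floats. Rain falls."}
index = build_index(articles)
print(search(index, "cloud"))  # {'Clouds 101': ['A cloud floats']}
```

The article names and texts above are made up for illustration; the real corpus lives in the `corpus` folder.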
📦corpus # articles on which the inverted index is built
📦media # images used for readme
📦search_engine # implementation
┣ 📂core
┃ ┣ 📜config.py
┃ ┗ 📜exceptions.py
┣ 📂services
┃ ┣ 📜corpus_service.py
┃ ┗ 📜preprocesing_service.py
┗ 📜engine.py
📦tests
┣ 📂resources
┣ 📜config.py
┣ 📜test_corpus_service.py
┣ 📜test_engine.py
┗ 📜test_preprocessing_service.py
- create a virtual environment: `python3 -m venv venv/`
- activate the virtual environment: `source venv/bin/activate`
- install the dependencies: `pip3 install -r requirements.txt`
- update the absolute path to the corpus in `search_engine/core/config.py`
- activate the virtual environment: `source venv/bin/activate`
- run the algorithm: `python3 main.py`
- run the tests: `pytest`
- run the tests with coverage: `pytest --cov`
To keep the code from becoming too algorithm-heavy, this implementation is more a PoC than a ready-to-use library. Some steps that would improve the accuracy of the search engine are listed below:
- **Use a tokenizer.** Instead of splitting sentences on `'.'`, we should use a tokenizer, since it takes other punctuation into account. The `nltk` tokenizer is a good choice for this task.
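As a minimal sketch of what such splitting looks like, here is a regex-based stand-in; a real tokenizer such as `nltk.sent_tokenize` would additionally handle abbreviations, quotes, and ellipses:

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on '.', '!' or '?' followed by whitespace.

    A toy stand-in for a real tokenizer like nltk's sent_tokenize; it will
    still mis-split abbreviations such as "e.g." or "Dr.".
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(split_sentences("It works! Does it? Yes."))
# ['It works!', 'Does it?', 'Yes.']
```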
- **Use stemming or a lemmatizer.** In the current implementation the algorithm makes no connection between *cloud* and *clouds* and indexes them as two different words. To avoid this we should stem or lemmatize both the indexed words and the query words, which will increase the accuracy of the engine. `nltk` also provides this functionality.
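To illustrate the idea, here is a deliberately crude suffix stripper; in practice one would use nltk's `PorterStemmer` or `WordNetLemmatizer` rather than hand-picked rules like these:

```python
def toy_stem(word):
    """Very rough suffix stripper, illustrative only.

    Strips a few common English suffixes so that e.g. "clouds" and
    "cloud" normalize to the same index key. A real engine should use
    nltk's PorterStemmer or WordNetLemmatizer instead.
    """
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        # Only strip when a reasonable stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_stem("clouds"), toy_stem("cloud"))  # cloud cloud
```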
- **Implement fuzzy search.** Fuzzy matching would also be a good improvement, since misspellings would still return results as long as the mistake is small. For fuzzy search we can use something like the Levenshtein distance.
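A sketch of the Levenshtein distance (the classic dynamic-programming edit distance) that such a fuzzy search could build on:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(curr[j - 1] + 1,      # insertion
                            prev[j] + 1,          # deletion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

A query word could then match any indexed word within some small distance threshold (e.g. 1 or 2).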
- **Use word2vec.** We can keep a vector representation of each indexed word; when a query arrives we turn the input into a vector as well and rank results by the distance between vectors. This approach can give even better results, since matches are based on vector similarity rather than exact string matching.
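The ranking step can be sketched with cosine similarity over word vectors. The embeddings below are invented for illustration; a real engine would load vectors from a trained model (e.g. gensim's `Word2Vec`):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embeddings, made up for this example only.
vectors = {
    "cloud": [0.9, 0.1, 0.0],
    "sky":   [0.8, 0.2, 0.1],
    "cat":   [0.0, 0.1, 0.9],
}

query = vectors["cloud"]
ranked = sorted(vectors, key=lambda w: cosine_similarity(query, vectors[w]),
                reverse=True)
print(ranked)  # 'cloud' first, then 'sky', then 'cat'
```

With real embeddings, semantically related words (e.g. *cloud* and *sky*) end up close together even though they share no characters, which is exactly what plain string matching cannot do.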