This repository contains my solution to the hyperexponential task, implemented by Valeriu Vicol. I decided to build a small search engine over text blocks taken from random Medium articles. The core algorithm behind the current implementation is an inverted index: you type a word, and the engine returns every sentence where that word was found, grouped by article name.
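As a rough illustration (not the repository's actual code), the inverted-index idea can be sketched like this, using the PoC's naive split on `'.'`:

```python
from collections import defaultdict

def build_index(articles):
    """Map each lowercase word to the sentences (per article) it appears in.

    `articles` is assumed to be a dict of article name -> full text;
    sentences are naively split on '.' (the PoC's approach).
    """
    index = defaultdict(lambda: defaultdict(list))
    for name, text in articles.items():
        for sentence in filter(None, (s.strip() for s in text.split("."))):
            for word in sentence.lower().split():
                index[word][name].append(sentence)
    return index

def search(index, word):
    """Return {article: [sentences]} for a single query word."""
    return dict(index.get(word.lower(), {}))

articles = {"Clouds 101": "A cloud floats. Rain falls."}
index = build_index(articles)
print(search(index, "cloud"))  # {'Clouds 101': ['A cloud floats']}
```

The article names and texts above are made up for illustration; the real corpus lives in the `corpus` folder.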
📦corpus # articles on which the inverted index is built
📦media # images used for readme
📦search_engine # implementation
┣ 📂core
┃ ┣ 📜config.py
┃ ┗ 📜exceptions.py
┣ 📂services
┃ ┣ 📜corpus_service.py
┃ ┗ 📜preprocesing_service.py
┗ 📜engine.py
📦tests
┣ 📂resources
┣ 📜config.py
┣ 📜test_corpus_service.py
┣ 📜test_engine.py
┗ 📜test_preprocessing_service.py
- create a virtual environment: `python3 -m venv venv/`
- activate the virtual environment: `source venv/bin/activate`
- install the dependencies: `pip3 install -r requirements.txt`
- update the absolute path to the corpus in `search_engine/core/config.py`
- activate the virtual environment: `source venv/bin/activate`
- run the algorithm: `python3 main.py`
- run the tests: `pytest`
- run the tests with coverage: `pytest --cov`
To keep the code from becoming too algorithm-heavy, this implementation is more a PoC than a ready-to-use library. Some steps that would improve the accuracy of the search engine are listed below:
- **Use a tokenizer.** Instead of splitting sentences on `'.'`, we should use a tokenizer, since it takes other punctuation into account. The `nltk` tokenizer is a good choice for this task.
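As a minimal sketch of what such splitting looks like, here is a regex-based stand-in; a real tokenizer such as `nltk.sent_tokenize` would additionally handle abbreviations, quotes, and ellipses:

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on '.', '!' or '?' followed by whitespace.

    A toy stand-in for a real tokenizer like nltk's sent_tokenize; it will
    still mis-split abbreviations such as "e.g." or "Dr.".
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(split_sentences("It works! Does it? Yes."))
# ['It works!', 'Does it?', 'Yes.']
```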
- **Use stemming or a lemmatizer.** In the current implementation the algorithm makes no connection between *cloud* and *clouds* and indexes them as two different words. To avoid this we should stem or lemmatize both the indexed words and the query words, which will increase the accuracy of the engine. `nltk` also provides this functionality.
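To illustrate the idea, here is a deliberately crude suffix stripper; in practice one would use nltk's `PorterStemmer` or `WordNetLemmatizer` rather than hand-picked rules like these:

```python
def toy_stem(word):
    """Very rough suffix stripper, illustrative only.

    Strips a few common English suffixes so that e.g. "clouds" and
    "cloud" normalize to the same index key. A real engine should use
    nltk's PorterStemmer or WordNetLemmatizer instead.
    """
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        # Only strip when a reasonable stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_stem("clouds"), toy_stem("cloud"))  # cloud cloud
```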
- **Implement fuzzy search.** Fuzzy matching would also be a good improvement, since misspellings would still return results as long as the mistake is small. For fuzzy search we can use something like the Levenshtein distance.
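A sketch of the Levenshtein distance (the classic dynamic-programming edit distance) that such a fuzzy search could build on:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(curr[j - 1] + 1,      # insertion
                            prev[j] + 1,          # deletion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

A query word could then match any indexed word within some small distance threshold (e.g. 1 or 2).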
- **Use word2vec.** We can keep a vector representation of each indexed word; when a query arrives we turn the input into a vector as well and rank results by the distance between vectors. This approach can give even better results, since matches are based on vector similarity rather than exact string matching.
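The ranking step can be sketched with cosine similarity over word vectors. The embeddings below are invented for illustration; a real engine would load vectors from a trained model (e.g. gensim's `Word2Vec`):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embeddings, made up for this example only.
vectors = {
    "cloud": [0.9, 0.1, 0.0],
    "sky":   [0.8, 0.2, 0.1],
    "cat":   [0.0, 0.1, 0.9],
}

query = vectors["cloud"]
ranked = sorted(vectors, key=lambda w: cosine_similarity(query, vectors[w]),
                reverse=True)
print(ranked)  # 'cloud' first, then 'sky', then 'cat'
```

With real embeddings, semantically related words (e.g. *cloud* and *sky*) end up close together even though they share no characters, which is exactly what plain string matching cannot do.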