Search engine

This repository contains my solution to the hyperexponential task, implemented by Valeriu Vicol. For the task I decided to build a small search engine over text blocks taken from random Medium articles. The core algorithm behind the current implementation is an inverted index: you type a word, and the engine returns the sentences in which that word appears, grouped by article name. A minimal sketch of the inverted-index idea is shown after the project layout below.

📦corpus            # articles over which the inverted index is built
📦media             # images used in the README
📦search_engine     # implementation
 ┣ 📂core
 ┃ ┣ 📜config.py
 ┃ ┗ 📜exceptions.py
 ┣ 📂services
 ┃ ┣ 📜corpus_service.py
 ┃ ┗ 📜preprocesing_service.py
 ┗ 📜engine.py
📦tests
 ┣ 📂resources          
 ┣ 📜config.py
 ┣ 📜test_corpus_service.py
 ┣ 📜test_engine.py
 ┗ 📜test_preprocessing_service.py
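
To make the idea concrete, here is a minimal inverted-index sketch. This is an illustration only, not the code from search_engine/engine.py; the function and variable names are hypothetical.

```python
from collections import defaultdict

def build_index(articles: dict[str, str]) -> dict[str, list[tuple[str, str]]]:
    """Map each lower-cased word to the (article, sentence) pairs it occurs in."""
    index: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for article_name, text in articles.items():
        for sentence in text.split('.'):              # naive split, see "Further work"
            for word in sentence.lower().split():
                index[word].append((article_name, sentence.strip()))
    return index

def search(index: dict[str, list[tuple[str, str]]], word: str) -> dict[str, list[str]]:
    """Return the sentences containing `word`, grouped by article name."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for article_name, sentence in index.get(word.lower(), []):
        grouped[article_name].append(sentence)
    return dict(grouped)

articles = {"article-1": "Clouds drift over the city. The city never sleeps."}
index = build_index(articles)
print(search(index, "city"))
# {'article-1': ['Clouds drift over the city', 'The city never sleeps']}
```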

Set up the project

  1. Create a virtual environment
python3 -m venv venv/
  2. Activate the virtual environment
source venv/bin/activate
  3. Install the dependencies
pip3 install -r requirements.txt

Running the algorithm

  1. Update the absolute path to the corpus in search_engine/core/config.py (a rough sketch of this setting is shown after the steps below)
  2. Activate the virtual environment
source venv/bin/activate
  3. Run the algorithm
python3 main.py
  4. Run the tests
pytest

with coverage:

pytest --cov
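
For reference, the corpus path setting mentioned in step 1 might look roughly like the sketch below. The constant name CORPUS_PATH is a guess, not taken from the repository, so check search_engine/core/config.py for the actual setting.

```python
# Hypothetical sketch of search_engine/core/config.py -- the actual file may differ.
from pathlib import Path

# Absolute path to the folder containing the corpus articles; adjust for your machine.
CORPUS_PATH = Path("/absolute/path/to/hx/corpus")
```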

Further work

To keep the code from becoming too algorithm-heavy, this implementation is more a proof of concept than a ready-to-use library. Some steps that would improve the accuracy of the search engine are listed below; a minimal sketch for each idea follows the list.

  1. Use a tokenizer

    Instead of splitting sentences on '.', we should use a tokenizer, since it takes other symbols into account. The nltk tokenizer is a good choice for this task.

  2. Use stemming or a lemmatizer

    In the current example, the algorithm does not recognize that cloud and clouds are the same word and indexes them as two different terms. To avoid this, we should apply stemming or lemmatization both to the indexed words and to the query words, which would increase the accuracy of the engine. nltk also provides this functionality.

  3. Implement fuzzy search

    Fuzzy search would also be a good improvement, since small misspellings would still be matched by the algorithm. For fuzzy search we could use something like the Levenshtein distance.

  4. word2vec

    We could keep a vector representation of each indexed word; when a query arrives, we turn the input into a vector as well and rank results by the distance between vectors. With this approach we can get even better results, because matches are based on vector similarity rather than exact string matching.
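
A minimal sketch of idea 1, using the nltk tokenizers. It assumes nltk is installed and the punkt models have been downloaded; nltk is not currently a project dependency.

```python
import nltk
nltk.download("punkt", quiet=True)    # one-off download of the tokenizer models

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Mr. Smith moved everything to the cloud! Costs dropped, didn't they?"
sentences = sent_tokenize(text)       # handles 'Mr.', '!' and '?' better than splitting on '.'
print(sentences)
print([word_tokenize(s) for s in sentences])
```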
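
A sketch of idea 2 with nltk's stemmer and lemmatizer, again assuming nltk and its wordnet data are available.

```python
import nltk
nltk.download("wordnet", quiet=True)          # data needed by the lemmatizer

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["cloud", "clouds", "indexing"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))
# 'cloud' and 'clouds' now normalize to the same token,
# so they would share a single entry in the inverted index.
```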
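
A sketch of idea 3: a plain dynamic-programming Levenshtein distance used to match slightly misspelled queries against the indexed terms. The helper names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_lookup(indexed_terms: list[str], query: str, max_distance: int = 1) -> list[str]:
    """Return indexed terms within `max_distance` edits of the query."""
    return [term for term in indexed_terms if levenshtein(term, query) <= max_distance]

print(fuzzy_lookup(["cloud", "client", "cluster"], "clowd"))   # ['cloud']
```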
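
A sketch of idea 4 using gensim's Word2Vec. gensim is an assumption, not a current dependency, and the toy corpus is far too small to produce meaningful vectors, but it shows the shape of the approach.

```python
from gensim.models import Word2Vec   # pip install gensim

# Toy corpus of already-tokenized sentences; in practice these come from the indexed articles.
sentences = [
    ["clouds", "drift", "over", "the", "city"],
    ["the", "cloud", "platform", "scales", "well"],
    ["servers", "run", "in", "the", "cloud"],
]
model = Word2Vec(sentences=sentences, vector_size=32, window=3, min_count=1, seed=1)

# Rank indexed words by vector similarity to the query instead of exact string matching.
print(model.wv.most_similar("cloud", topn=3))
```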
