This Python project focuses on efficient text processing for both English and Persian documents. The primary operations include tokenization, stop word removal, parsing, and text cleaning. The goal is to improve readability and to support analysis and information extraction across diverse textual content.
The project employs robust tokenization techniques to break down the input text into meaningful units, such as words or phrases. Tokenization facilitates subsequent analysis and processing steps.
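For illustration, a minimal tokenization pass with NLTK might look like the sketch below; the exact tokenizer used inside the project's scripts may differ.

```python
from nltk.tokenize import word_tokenize  # relies on the 'punkt' resource downloaded during setup

english_text = "Natural language processing makes text analysis easier."
persian_text = "پردازش زبان طبیعی تحلیل متن را ساده می‌کند."

# punkt has no dedicated Persian model, but its default rules still split
# whitespace-delimited Persian text into usable word tokens.
print(word_tokenize(english_text))
print(word_tokenize(persian_text))
```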
Stop words, common words that contribute little to the overall meaning, are identified and removed from the text. This reduces noise so that the content-bearing words in the document stand out in later analysis.
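A minimal sketch of this step, assuming NLTK's English stop word list (which additionally requires nltk.download('stopwords')); NLTK ships no Persian list, so a few common Persian function words stand in here:

```python
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

english_stops = set(stopwords.words("english"))
# Hand-picked stand-in list; NLTK does not provide Persian stop words.
persian_stops = {"و", "در", "به", "از", "که", "را", "با", "این"}

tokens = ["this", "is", "a", "sample", "document"]
print([t for t in tokens if t.lower() not in english_stops])  # ['sample', 'document']
```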
The parsing phase involves analyzing the grammatical structure, sentence composition, and word dependencies within the text. This operation provides valuable insights into the linguistic aspects of the document.
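As a rough stand-in for this phase, NLTK's part-of-speech tagger labels each token with its grammatical role (it requires nltk.download('averaged_perceptron_tagger'); the project's own parsing logic may go further than this):

```python
import nltk
from nltk.tokenize import word_tokenize

sentence = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(sentence)
# Each token is paired with a Penn Treebank tag, e.g. ('fox', 'NN').
print(nltk.pos_tag(tokens))
```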
The text cleaning process ensures the removal of unnecessary characters, punctuation, and artifacts introduced during previous operations. This final step aims to produce a refined and standardized version of the document.
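A sketch of such a cleaning pass using only the standard library (the actual rules in Clean.py may differ): punctuation removal, whitespace normalization, and mapping Arabic 'ي'/'ك' to their Persian forms 'ی'/'ک'.

```python
import re

def clean_text(text: str) -> str:
    # Normalize Arabic variants of ye and kaf to their Persian forms.
    text = text.replace("ي", "ی").replace("ك", "ک")
    # Replace punctuation and stray symbols with spaces, keeping letters, digits, and whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the runs of whitespace introduced above.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Hello,   world!!  متن «تمیز» مي‌شود."))
```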
The project handles both English and Persian documents, making it versatile for multilingual text processing tasks (a sketch of simple per-document language detection follows the list below). Typical applications include:
- Information retrieval
- Document summarization
- Sentiment analysis
- Language-agnostic text analysis
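One way to route a document to the right stop word list and cleaning rules is to detect its script. The sketch below is an assumption for illustration; the project itself may instead decide the language from the file name or user input.

```python
def detect_language(text: str) -> str:
    """Guess 'persian' if more letters fall in the Arabic script block (U+0600..U+06FF) than in ASCII."""
    persian_chars = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    latin_chars = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return "persian" if persian_chars > latin_chars else "english"

print(detect_language("Hello world"))  # english
print(detect_language("سلام دنیا"))    # persian
```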
This project is designed to be adaptable, allowing users to configure and customize parameters based on specific requirements. The modular structure facilitates easy integration into various natural language processing pipelines.
To clone and run this project, follow these steps:
git clone https://github.com/rezansrv/Natural-language-processing-NLP-.git
cd Natural-language-processing-NLP-
Ensure you have Python installed. Then, install the required packages using:
pip install nltk beautifulsoup4
Run the following Python script to download the necessary NLTK resources:
python -c "import nltk; nltk.download('punkt')"
Place your input documents (e.g., Eng.txt, Per.docx) in the 'inputs' directory.
Execute the provided Python script to clean, parse, and tokenize the documents:
python Clean.py
Follow the on-screen instructions to enter the input file name.
The cleaned text will be saved in the 'outputs' directory. Check the 'cleaned_text.txt' file for the processed result.
The project follows a modular approach, with each script (Clean.py, Parse.py, etc.) handling one stage and running in sequence, so the parameters of each stage can be adjusted independently of the others.
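As an illustration of that sequence (any script names beyond Clean.py and Parse.py would be assumptions), the stages could be chained like this:

```python
import subprocess

# Run the pipeline scripts in order, stopping if any stage fails.
for script in ("Clean.py", "Parse.py"):
    subprocess.run(["python", script], check=True)
```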