DocuLingo

🌍 Overview

DocuLingo combines "Document" processing with "Linguistic" intelligence: it is an end-to-end document parsing tool built on multimodal large language models. It supports RAG (Retrieval-Augmented Generation) workflows by parsing and structuring content from PDF and Office documents, including extraction of formulas, tables, and images. DocuLingo is primarily optimized for Qwen2.5-VL, and other Vision Language Models (VLMs) can be used through the configuration options described below.

✨ Features

PDF to HTML Conversion: Convert PDF documents to HTML format with preserved images
PDF to Markdown Conversion: Transform PDF content into clean, structured markdown
Office Document Support: Process DOCX, PPTX, and other office formats with proper image extraction
Intelligent Layout Analysis: Preserve document structure, including tables, lists, and formatting
Multi-language Support: Process documents written in multiple languages
Customizable Processing Parameters: Adjust DPI, token limits, and other settings based on your needs

🔧 Installation

Prerequisites

Python 3.11 or higher
LibreOffice (for processing Office documents)

Setup

Clone the repository:

git clone https://github.com/Niraya666/DocuLingo.git
cd DocuLingo

Create a conda environment:

conda create -n doculingo python=3.11
conda activate doculingo

Install required packages:

pip install -r requirements.txt

Additional Dependencies

For Mac:

brew install --cask libreoffice
brew install poppler

# Create symbolic link
sudo ln -s /Applications/LibreOffice.app/Contents/MacOS/soffice /usr/local/bin/libreoffice

For Linux:

apt-get update
apt-get install -y libreoffice unoconv

# Install Chinese fonts
apt-get install -y fonts-wqy-zenhei fonts-wqy-microhei fonts-noto-cjk
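
Before running a conversion, it can help to confirm that the prerequisites above are actually visible to Python. The following is a minimal sketch, not part of DocuLingo: the helper names `missing_tools` and `python_is_supported` are hypothetical, and the tool names you pass in (for example `libreoffice`, or `pdftoppm`, which ships with poppler) are assumptions about what your setup needs on PATH.

```python
import shutil
import sys


def missing_tools(names):
    """Return the entries of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def python_is_supported(minimum=(3, 11), version=None):
    """Return True if the given (or the running) Python version meets the minimum."""
    current = tuple(version) if version is not None else sys.version_info[:2]
    return current >= tuple(minimum)
```

For example, `missing_tools(["libreoffice", "pdftoppm"])` returns an empty list when both executables are installed and reachable.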

⚙️ Configuration

Copy the example environment file:

cp .env.example .env

Configure your API keys and model preferences:

OPENAI_API_KEY=sk-xxx
API_BASE=your-api-base
VISION_MODEL=Qwen/Qwen2-VL-72B-Instruct
TEXT_MODEL=Qwen/Qwen2.5-72B-Instruct
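
To illustrate how these settings might be consumed, here is a minimal sketch of a `.env` reader that uses only the standard library. It is an assumption for illustration and not DocuLingo's actual loading code (the project may well use a dedicated library instead); the function names `load_env_file` and `resolve_setting` are hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines from a .env-style file into a dict of strings."""
    settings = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines and full-line comments.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        # Drop a trailing inline comment, then surrounding whitespace and quotes.
        value = value.split(" #", 1)[0].strip().strip("'\"")
        settings[key.strip()] = value
    return settings


def resolve_setting(name, file_settings, default=None):
    """Prefer a real environment variable, then the .env value, then the default."""
    return os.environ.get(name, file_settings.get(name, default))
```

With the example file above, `resolve_setting("VISION_MODEL", load_env_file(".env"))` would return the configured vision model unless an environment variable of the same name overrides it.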

📚 Usage

Converting PDF to HTML with Images

python main.py \
    --pdf_path your-file-path \
    --output_dir path-to-save \
    --doc_type qwen_vl_html

Converting Office Documents (PPTX, DOCX) to HTML with Images

python main.py \
    --pdf_path your-file-path \
    --output_dir path-to-save \
    --dpi 150 \
    --max_tokens 4096 \
    --doc_type qwen_vl_html \
    --convert_office

Note: It's recommended to use absolute paths for file locations.
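
If you drive conversions from a script, the absolute-path recommendation can be enforced when the command is assembled. The sketch below only builds the argument list for the commands shown above; `build_command` is a hypothetical helper, not something the repository provides, and it assumes `main.py` is run from the repository root.

```python
import sys
from pathlib import Path


def build_command(input_path, output_dir, doc_type="qwen_vl_html",
                  dpi=None, max_tokens=None, convert_office=False):
    """Build the main.py argument list, resolving both paths to absolute form."""
    cmd = [
        sys.executable, "main.py",
        "--pdf_path", str(Path(input_path).expanduser().resolve()),
        "--output_dir", str(Path(output_dir).expanduser().resolve()),
        "--doc_type", doc_type,
    ]
    # Optional flags are added only when the caller sets them.
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    if max_tokens is not None:
        cmd += ["--max_tokens", str(max_tokens)]
    if convert_office:
        cmd.append("--convert_office")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with paths that contain spaces.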

Command Line Arguments

--pdf_path        Path to the input PDF or Office file
--output_dir      Directory to save intermediate images
--dpi             DPI for PDF to image conversion (default: 150)
--max_tokens      Maximum tokens for LLM processing (default: 4096)
--doc_type        Document type for processing (default: qwen_vl_html)
--convert_office  Enable Office format conversion using LibreOffice
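
For readers who want to see how such an interface is typically declared, the following sketch mirrors the flags and defaults listed above using `argparse`. It is an illustration and not the project's actual `main.py`; in particular, treating `--pdf_path` and `--output_dir` as required is an assumption.

```python
import argparse


def build_parser():
    """Declare the documented command-line flags with their documented defaults."""
    parser = argparse.ArgumentParser(description="Parse a PDF or Office document.")
    # Assumption: both paths are mandatory.
    parser.add_argument("--pdf_path", required=True,
                        help="Path to the input PDF or Office file")
    parser.add_argument("--output_dir", required=True,
                        help="Directory to save intermediate images")
    parser.add_argument("--dpi", type=int, default=150,
                        help="DPI for PDF to image conversion")
    parser.add_argument("--max_tokens", type=int, default=4096,
                        help="Maximum tokens for LLM processing")
    parser.add_argument("--doc_type", default="qwen_vl_html",
                        help="Document type for processing")
    parser.add_argument("--convert_office", action="store_true",
                        help="Enable Office format conversion using LibreOffice")
    return parser
```

Because `--convert_office` is a `store_true` flag, it takes no value: its presence alone switches Office conversion on.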

Concurrency and Retry Configuration

You can adjust the concurrency and retry parameters in .env:


MAX_RETRIES=3    # Maximum number of retry attempts for failed requests
MAX_WORKERS=2    # Maximum number of concurrent workers for parallel processing
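
To make the effect of these two parameters concrete, here is a minimal sketch of bounded parallel processing with per-item retries. It is not DocuLingo's implementation: the helper names `call_with_retries` and `process_pages` are hypothetical, reading the values from environment variables is an assumption, and `MAX_RETRIES` is interpreted as the number of retries after the first failed attempt.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

# Assumption: the settings are exposed as environment variables.
MAX_RETRIES = int(os.environ.get("MAX_RETRIES", "3"))
MAX_WORKERS = int(os.environ.get("MAX_WORKERS", "2"))


def call_with_retries(func, item, max_retries=MAX_RETRIES, delay=0.0):
    """Call func(item); on failure, retry up to max_retries more times."""
    for attempt in range(max_retries + 1):
        try:
            return func(item)
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            time.sleep(delay)


def process_pages(func, pages, max_workers=MAX_WORKERS, max_retries=MAX_RETRIES):
    """Process pages with at most max_workers threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(
            lambda page: call_with_retries(func, page, max_retries), pages))
```

Raising `MAX_WORKERS` increases the number of simultaneous requests to the model endpoint, so it is worth keeping within the rate limits of your API provider.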

🙏 Acknowledgements

Qwen2.5-VL for the powerful multimodal language model
