8000 GitHub - Niraya666/DocuLingo: DocuLingo is a powerful document parsing tool built with multimodal large language models to enhance RAG (Retrieval Augmented Generation) workflows.
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

DocuLingo is a powerful document parsing tool built with multimodal large language models to enhance RAG (Retrieval Augmented Generation) workflows.

License

Notifications You must be signed in to change notification settings

Niraya666/DocuLingo

Repository files navigation

DocuLingo

English | 中文

🌍 Overview

DocuLingo combines "Document" processing with "Linguistic" intelligence, offering a powerful end-to-end parsing solution based on multimodal large language models. It enhances RAG (Retrieval Augmented Generation) workflows by intelligently parsing and structuring content from various document formats, including accurate extraction of formulas, tables, and images. While primarily optimized for Qwen2.5-VL, DocuLingo supports integration with other Vision Language Models (VLMs) through flexible configuration options, bridging the gap between document understanding and language processing capabilities.

✨ Features

  • PDF to HTML Conversion: Convert PDF documents to HTML format with preserved images
  • PDF to Markdown Conversion: Transform PDF content into clean, structured markdown
  • Office Document Support: Process DOCX, PPTX, and other office formats with proper image extraction
  • Intelligent Layout Analysis: Maintains document structure including tables, lists, and formatting
  • Multi-language Support: Works effectively with documents in multiple languages
  • Customizable Processing Parameters: Adjust DPI, token limits, and other settings based on your needs

🔧 Installation

Prerequisites

  • Python 3.11 or higher
  • LibreOffice (for processing Office documents)

Setup

  1. Clone the repository:
git clone https://github.com/Niraya666/DocuLingo.git
cd DocuLingo
  1. Create a conda environment:
conda create -n doculingo python=3.11
conda activate doculingo
  1. Install required packages:
pip install -r requirements.txt

Additional Dependencies

For Mac:

brew install --cask libreoffice
brew install poppler

# Create symbolic link
sudo ln -s /Applications/LibreOffice.app/Contents/MacOS/soffice /usr/local/bin/libreoffice

For Linux:

apt-get update
apt-get install -y libreoffice
apt-get install -y unoconv

# Install Chinese fonts
apt-get update
apt-get install -y fonts-wqy-zenhei fonts-wqy-microhei
apt-get install -y fonts-noto-cjk

⚙️ Configuration

  1. Copy the example environment file:
cp .env.example .env
  1. Configure your API keys and model preferences:
OPENAI_API_KEY=sk-xxx
API_BASE=your-api-base
VISION_MODEL=Qwen/Qwen2-VL-72B-Instruct
TEXT_MODEL=Qwen/Qwen2.5-72B-Instruct

📚 Usage

Converting PDF to HTML with Images

python main.py \
    --pdf_path your-file-path \
    --output_dir path-to-save \
    --doc_type qwen_vl_html

Converting Office Documents (PPTX, DOCX) to HTML with Images

python main.py \
    --pdf_path your-file-path \
    --output_dir path-to-save \
    --dpi 150 \
    --max_tokens 4096 \
    --doc_type qwen_vl_html \
    --convert_office

Note: It's recommended to use absolute paths for file locations.

Command Line Arguments

--pdf_path        Path to the input PDF or Office file
--output_dir      Directory to save intermediate images
--dpi             DPI for PDF to image conversion (default: 150)
--max_tokens      Maximum tokens for LLM processing (default: 4096)
--doc_type        Document type for processing (default: qwen_vl_html)
--convert_office  Enable Office format conversion using LibreOffice

Concurrency and Retry Configuration

You can adjust the concurrency and retry parameters in .env:


MAX_RETRIES: 3    # Maximum number of retry attempts for failed requests
MAX_WORKERS: 2    # Maximum number of concurrent workers for parallel processing

🙏 Acknowledgements

  • Qwen2.5-VL for the powerful multimodal language model

About

DocuLingo is a powerful document parsing tool built with multimodal large language models to enhance RAG (Retrieval Augmented Generation) workflows.

Topics

Resources

License

Stars

Watchers

Forks

< 156D /ghcc-consent>
0