MCP Implementation

Marker MCP Server - Enhanced PDF Processing

An advanced Model Context Protocol (MCP) server for high-quality PDF to Markdown conversion with comprehensive monitoring, security, testing, and batch processing capabilities.

Status: ✅ Production Ready | All Tests Passing | Comprehensive Feature Set

🚀 Quick Start

Installation

# Install dependencies
pip install .

# Or using poetry
poetry install

# Install development dependencies (for testing)
pip install -r requirements-dev.txt

Basic Usage

# Start the MCP server
python -m marker_mcp_server

# Show help and available options
python -m src.marker_mcp_server.server --help

# Show version information
python -m src.marker_mcp_server.server --version

# Enable debug logging
python -m src.marker_mcp_server.server --debug

System Requirements

Python: 3.8+ (tested with 3.13)
Memory: 4GB recommended minimum
Storage: 1GB free space for temporary files
OS: macOS, Linux, Windows (tested on macOS with Apple Silicon)

🛠️ MCP Tools Available

1. `single_convert` - Single File Conversion

Convert individual PDF files with advanced options and monitoring.

arguments = {
    "file_path": "/path/to/document.pdf",
    "output_format": "markdown",
    "page_range": "1-5",
    "use_llm": True,
    "debug": False
}

Features:

Real-time resource monitoring
Security validation
Error recovery
Progress tracking

2. `batch_convert` - Enhanced Batch Processing

Convert multiple PDFs in a folder with full CLI argument support and concurrent processing.

arguments = {
    "in_folder": "/path/to/pdfs",
    "output_dir": "/path/to/outputs", 
    "max_workers": 3,
    "output_format": "markdown",
    "use_llm": True,
    "skip_existing": True
}

Features:

Concurrent processing (configurable workers)
Partial failure handling
Progress tracking
Resource limit enforcement

3. `chunk_convert` - Memory-Efficient Folder Processing

Process large collections of PDFs using memory-efficient chunking.

arguments = {
    "in_folder": "/path/to/large/collection",
    "out_folder": "/path/to/output",
    "chunk_size": 50,
    "use_llm": True
}

4. `batch_pages_convert` - 🆕 Advanced Chunked Processing

NEW FEATURE: Process large PDFs efficiently by splitting them into page chunks.

arguments = {
    "file_path": "/path/to/large_document.pdf",
    "pages_per_chunk": 5,
    "combine_output": True,
    "use_llm": True,
    "output_format": "markdown"
}

Benefits:

Memory efficient processing
Fault tolerant (chunk failures don't stop entire process)
Progress tracking for each chunk
Automatic output stitching

5. `get_system_health` - 🆕 System Monitoring

Real-time system health assessment and resource monitoring.

# Returns current system status
{
    "status": "healthy",
    "memory_usage_percent": 45.2,
    "cpu_usage_percent": 23.1,
    "active_jobs": 1,
    "queue_size": 0,
    "alerts": []
}

6. `get_metrics_summary` - 🆕 Performance Metrics

Get comprehensive performance metrics and operation statistics.

# Returns detailed metrics
{
    "total_operations": 156,
    "success_rate": 98.7,
    "average_processing_time": 12.3,
    "peak_memory_usage": 1024.5,
    "gpu_utilization": 67.2
}

arguments = {
"host": "0.0.0.0",
"port": 8080
}


## 🔧 Advanced Configuration

### Resource Limits

Configure system resource limits for safe operation:

```json
{
    "resource_limits": {
        "max_file_size_mb": 500,
        "max_memory_usage_mb": 4096,
        "max_processing_time_seconds": 600,
        "max_concurrent_jobs": 3,
        "max_queue_size": 10
    }
}

Monitoring Configuration

Enable comprehensive system monitoring:

{
    "monitoring": {
        "enable_metrics": true,
        "metrics_interval_seconds": 30,
        "log_performance": true,
        "alert_memory_threshold_percent": 85.0,
        "alert_processing_time_threshold_seconds": 300
    }
}

Security Configuration

Configure security validation and access controls:

{
    "security": {
        "validate_file_paths": true,
        "allowed_file_extensions": [".pdf"],
        "allowed_input_dirs": ["/safe/input"],
        "allowed_output_dirs": ["/safe/output"]
    }
}

LLM Integration

Enable high-quality processing with Large Language Models:

# Basic LLM usage
{
    "use_llm": True,
    "llm_service": "groq"  # Automatically normalized to full path
}

# Advanced LLM configuration
{
    "use_llm": True,
    "llm_service": "marker.services.groq.GroqService",
    "config_json": "examples/llm_enhanced_config.json"
}

Page Range Selection

Process specific page ranges efficiently:

{
    "page_range": "0-5",      # Pages 0 through 5
    "page_range": "0,3,5-10", # Pages 0, 3, and 5 through 10
    "page_range": "10-"       # Page 10 to end
}

Output Formats

Choose from multiple output formats:

{
    "output_format": "markdown",  # Default, clean markdown
    "output_format": "json",      # Structured JSON with metadata
    "output_format": "html"       # Styled HTML output
}

Debug Mode

Enable comprehensive debugging:

{
    "debug": True  # Saves debug images, processing data, and detailed logs
}

📁 Configuration Examples

Basic Configuration

{
  "use_llm": false,
  "output_format": "markdown",
  "debug": false,
  "extract_images": true,
  "pdftext_workers": 2
}

LLM-Enhanced Configuration

{
  "use_llm": true,
  "llm_service": "marker.services.groq.GroqService",
  "output_format": "markdown",
  "debug": false,
  "extract_images": true,
  "format_lines": true
}

High-Performance Configuration

{
  "workers": 3,
  "max_tasks_per_worker": 20,
  "disable_multiprocessing": false,
  "pdftext_workers": 4,
  "chunk_size": 100,
  "resource_limits": {
    "max_memory_usage_mb": 8192
  }
}

Production Configuration

{
  "resource_limits": {
    "max_file_size_mb": 1000,
    "max_memory_usage_mb": 8192,
    "max_processing_time_seconds": 1200,
    "max_concurrent_jobs": 5
  },
  "monitoring": {
    "enable_metrics": true,
    "log_performance": true,
    "alert_memory_threshold_percent": 80.0
  },
  "security": {
    "validate_file_paths": true,
    "allowed_input_dirs": ["/app/input"],
    "allowed_output_dirs": ["/app/output"]
  }
}

📊 Monitoring & Performance

Real-Time System Health

Monitor system performance and resource usage in real-time:

# Get current system health
health = await client.call_tool("get_system_health")
# Returns:
{
    "status": "healthy",           # Overall system status
    "memory_usage_percent": 45.2,  # Current memory usage
    "cpu_usage_percent": 23.1,     # Current CPU usage
    "gpu_usage_percent": 67.2,     # GPU utilization (if available)
    "active_jobs": 1,              # Currently running operations
    "queue_size": 0,               # Pending operations
    "alerts": [],                  # Active system alerts
    "uptime_seconds": 3600         # Server uptime
}

Performance Metrics

Track operation performance and success rates:

# Get comprehensive metrics
metrics = await client.call_tool("get_metrics_summary")
# Returns:
{
    "total_operations": 156,
    "success_rate": 98.7,
    "average_processing_time": 12.3,
    "peak_memory_usage": 1024.5,
    "total_pages_processed": 2340,
    "errors_by_type": {
        "memory_limit": 1,
        "timeout": 1,
        "file_not_found": 0
    }
}

Resource Monitoring Features

Memory Usage Tracking: Real-time memory consumption with alerts
CPU Utilization: Multi-core usage monitoring
GPU Monitoring: Apple Silicon MPS device tracking
Processing Time Analysis: Per-operation timing with thresholds
Operation Lifecycle: Start, progress, completion tracking
Error Rate Analysis: Failure categorization and trending

Alert System

Configurable alerts for resource thresholds:

Memory usage > 85% (configurable)
Processing time > 300 seconds (configurable)
Queue size approaching limits
Error rate spikes

Performance Characteristics

Throughput: ~2-5 pages/second (device dependent)
Concurrent Jobs: 3 by default (configurable)
Memory Efficiency: Streaming processing for large files
Apple Silicon: Optimized for M1/M2 processors

🛡️ Security Features

File System Protection

Comprehensive security validation and access controls:

Path Traversal Prevention: Blocks ../ and absolute path attacks
Directory Restriction: Enforces allowed input/output directories
Extension Validation: Restricts to approved file types (.pdf)
Filename Sanitization: Prevents malicious filename patterns

Input Validation

Parameter Sanitization: Type checking and range validation
Configuration Validation: Schema-based security settings
Access Logging: Detailed security event tracking
Size Limits: Configurable file and memory limits

Security Event Logging

All security events are logged with details:

File access attempts outside allowed directories
Invalid file extension attempts
Path traversal attack attempts
Resource limit violations

Example Security Configuration

{
    "security": {
        "validate_file_paths": true,
        "allowed_file_extensions": [".pdf"],
        "allowed_input_dirs": [
            "/safe/input",
            "/approved/documents"
        ],
        "allowed_output_dirs": [
            "/safe/output",
            "/processed/results"
        ]
    }
}

🧪 Testing Suite

Comprehensive Test Coverage

The system includes 65+ comprehensive test cases covering:

Unit Tests: Component-level testing with mocking
Integration Tests: End-to-end workflow validation
Security Tests: Attack scenario prevention
Performance Tests: Load testing and benchmarking
Error Handling Tests: Edge cases and failure scenarios

Test Categories

# Run all tests
python -m pytest tests/ -v

# Run specific test categories  
python -m pytest tests/test_monitoring.py -v      # 23 monitoring tests
python -m pytest tests/test_security.py -v       # 24 security tests
python -m pytest tests/test_error_handling.py -v # 18 error handling tests
python -m pytest tests/test_pdf_conversion.py -v # 20 real functionality tests
python -m pytest tests/test_tools.py -v          # 23 tool integration tests
python -m pytest tests/test_config.py -v         # Configuration tests

# Run with coverage
python -m pytest tests/ --cov=src/marker_mcp_server

Test Results (Latest)

✅ All Tests Passing: 100% success rate across all test suites
✅ Real Functionality: Tests use actual PDF conversion without extensive mocking
✅ Comprehensive Coverage: Monitoring, security, error handling, performance
✅ Integration Validated: End-to-end workflows tested

Testing Philosophy

Real-World Testing: Minimal mocking, actual functionality validation
Comprehensive Coverage: All major features and edge cases
Performance Validation: Resource usage and timing verification
Security Hardening: Attack scenario prevention testing

📚 API Documentation

MCP Client Usage

# Example MCP client usage
import asyncio
from mcp.client import MCPClient

async def main():
    client = MCPClient()
  
    # Convert single PDF
    result = await client.call_tool("single_convert", {
        "file_path": "/path/to/document.pdf",
        "output_format": "markdown",
        "use_llm": True
    })
  
    # Check system health
    health = await client.call_tool("get_system_health")
    print(f"System status: {health['status']}")
  
    # Get performance metrics
    metrics = await client.call_tool("get_metrics_summary")
    print(f"Success rate: {metrics['success_rate']}%")

asyncio.run(main())

REST API (Optional)

When the server is started with FastAPI mode, it automatically exposes:

Swagger UI: Available at /docs for interactive API testing
ReDoc: Available at /redoc for comprehensive documentation
OpenAPI Schema: Available at /openapi.json

Available Endpoints

POST /convert/single - Single file conversion
POST /convert/batch - Batch processing
POST /convert/chunk - Chunked processing
GET /health - System health check
GET /metrics - Performance metrics

🤝 Contributing

We welcome contributions! Here's how to get started:

How to Contribute

Fork the repository and create your branch from main
Write tests for your changes (see tests/ directory)
Document new features or changes in the README
Open a Pull Request with a clear description of your changes

Development Guidelines

Follow PEP8 and use type hints where possible
Write docstrings for all public functions and classes
Add tests for new functionality (aim for high coverage)
Update documentation for user-facing changes

Adding New Features

New Processors

Add processors in marker/processors/ and register them:

from marker.processors import register_processor

@register_processor('my_processor')
class MyProcessor:
    def process(self, document):
        return processed_document

New MCP Tools

Add tools in src/marker_mcp_server/tools.py and register in server.py:

@mcp.tool("my_tool")
async def handle_my_tool(arguments: Dict[str, Any]) -> Dict[str, Any]:
    # Implementation with monitoring
    async with metrics_collector.track_operation("my_operation"):
        return result

Testing Your Changes

# Run all tests
python -m pytest tests/ -v

# Run specific test files
python -m pytest tests/test_new_feature.py -v

# Check test coverage
python -m pytest tests/ --cov=src/marker_mcp_server

# Run integration tests
python -m pytest tests/test_pdf_conversion.py -v

Code Review Process

All submissions require review
Tests must pass on all supported Python versions
Documentation must be updated for user-facing changes
Performance impact should be considered for core changes

🔍 Troubleshooting

Common Issues

High Memory Usage

Symptom: System running out of memory during processing
Solution:

Reduce max_concurrent_jobs in configuration
Lower max_memory_usage_mb limit
Use chunked processing for large files
Check system health with get_system_health tool

Processing Timeouts

Symptom: Operations timing out before completion
Solution:

Increase max_processing_time_seconds limit
Use chunked processing for large documents
Check system resources and reduce concurrent operations
Enable debug mode to identify bottlenecks

File Size Errors

Symptom: "File too large" errors
Solution:

Increase max_file_size_mb limit
Use chunked processing for very large PDFs
Check available disk space
Consider splitting large PDFs before processing

Path Validation Failures

Symptom: "Path not allowed" security errors
Solution:

Configure allowed_input_dirs and allowed_output_dirs
Use absolute paths
Check file permissions
Verify paths don't contain traversal patterns (../)

LLM Integration Issues

Symptom: LLM service connection failures
Solution:

Verify API credentials are configured
Check network connectivity
Validate llm_service configuration
Review service-specific documentation

Debug Mode

Enable comprehensive debugging:

# Start server with debug logging
python -m src.marker_mcp_server.server --debug

# Or configure in JSON
{
    "debug": true,
    "monitoring": {
        "log_performance": true
    }
}

Debug mode provides:

Detailed operation logging
Resource usage tracking
Intermediate file preservation
Extended error information

Health Checks

Monitor system health:

# Check system status
health = await client.call_tool("get_system_health")

# Analyze performance metrics
metrics = await client.call_tool("get_metrics_summary")

Log Analysis

Key log locations:

Application logs: Check console output or log files
Security events: Security validation failures and attempts
Performance metrics: Resource usage and timing information
Error details: Stack traces and diagnostic information

📄 License & Legal

License

This project is licensed under the terms specified in the LICENSE file.

Contributor License Agreement

Contributors must agree to the Marker Contributor Agreement (MCA). See CLA.md for details.

Key points:

Joint ownership of contributions with Endless Labs, Inc.
Permission to sublicense contributions
Patent grants for contributed technology
Compliance with export control laws

Third-Party Dependencies

This project uses various open-source libraries. See pyproject.toml and requirements-dev.txt for complete dependency lists.

Key dependencies:

Marker PDF: Core PDF processing capabilities
FastAPI: REST API server (optional)
pdftext: PDF text extraction
WeasyPrint: HTML to PDF rendering
pytest: Testing framework

Copyright Notice

Copyright (c) 2025 Endless Labs, Inc.
Licensed under the Marker Contributor Agreement

📞 Support

Getting Help

GitHub Issues: For bug reports and feature requests
Documentation: See this README and example configurations
Testing: Run integration tests to validate setup
Debugging: Enable debug mode for detailed diagnostics

Reporting Issues

When reporting issues, please include:

System information: OS, Python version, hardware specs
Configuration: Relevant configuration files (remove sensitive data)
Error messages: Complete error logs and stack traces
Reproduction steps: Minimal example to reproduce the issue
Expected behavior: What should happen vs what actually happens

Feature Requests

Feature requests should include:

Use case: Clear description of the problem being solved
Proposed solution: How the feature should work
Alternatives considered: Other approaches that were evaluated
Impact: Who would benefit and how

🎯 Project Status

✅ Production Ready: All major features implemented and tested
✅ Comprehensive Testing: 65+ test cases with 100% pass rate
✅ Security Hardened: File system protection and input validation
✅ Performance Monitored: Real-time resource tracking and alerts
✅ MCP Compliant: Full Model Context Protocol implementation
✅ Well Documented: Complete API and configuration documentation

The Marker MCP Server provides a robust, secure, and scalable solution for PDF to Markdown conversion with comprehensive monitoring and testing capabilities.

Last updated: June 7, 2025

Description

Marker MCP Server - Enhanced PDF Processing

📋 Table of Contents

🚀 Quick Start

Installation

Basic Usage

System Requirements

🛠️ MCP Tools Available

1. single_convert - Single File Conversion

2. batch_convert - Enhanced Batch Processing

3. chunk_convert - Memory-Efficient Folder Processing

4. batch_pages_convert - 🆕 Advanced Chunked Processing

5. get_system_health - 🆕 System Monitoring

6. get_metrics_summary - 🆕 Performance Metrics

Monitoring Configuration

Security Configuration

LLM Integration

Page Range Selection

Output Formats

Debug Mode

📁 Configuration Examples

Basic Configuration

LLM-Enhanced Configuration

High-Performance Configuration

Production Configuration

📊 Monitoring & Performance

Real-Time System Health

Performance Metrics

Resource Monitoring Features

Alert System

Performance Characteristics

🛡️ Security Features

File System Protection

Input Validation

Security Event Logging

Example Security Configuration

🧪 Testing Suite

Comprehensive Test Coverage

Test Categories

Test Results (Latest)

Testing Philosophy

📚 API Documentation

MCP Client Usage

REST API (Optional)

Available Endpoints

🤝 Contributing

How to Contribute

Development Guidelines

Adding New Features

New Processors

New MCP Tools

Testing Your Changes

Code Review Process

🔍 Troubleshooting

Common Issues

High Memory Usage

Processing Timeouts

File Size Errors

Path Validation Failures

LLM Integration Issues

Debug Mode

Health Checks

Log Analysis

📄 License & Legal

License

Contributor License Agreement

Third-Party Dependencies

Copyright Notice

📞 Support

Getting Help

Reporting Issues

Feature Requests

🎯 Project Status

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

1. `single_convert` - Single File Conversion

2. `batch_convert` - Enhanced Batch Processing

3. `chunk_convert` - Memory-Efficient Folder Processing

4. `batch_pages_convert` - 🆕 Advanced Chunked Processing

5. `get_system_health` - 🆕 System Monitoring

6. `get_metrics_summary` - 🆕 Performance Metrics