Description
Marker MCP Server - Enhanced PDF Processing
An advanced Model Context Protocol (MCP) server for high-quality PDF to Markdown conversion with comprehensive monitoring, security, testing, and batch processing capabilities.
Status: ✅ Production Ready | All Tests Passing | Comprehensive Feature Set
📋 Table of Contents
- 🚀 Quick Start
- 🛠️ MCP Tools Available
- 🔧 Advanced Configuration
- 📊 Monitoring & Performance
- 🛡️ Security Features
- 🧪 Testing Suite
- 📚 API Documentation
- 🤝 Contributing
- 📄 License & Legal
🚀 Quick Start
Installation
# Install dependencies
pip install .
# Or using poetry
poetry install
# Install development dependencies (for testing)
pip install -r requirements-dev.txt
Basic Usage
# Start the MCP server
python -m marker_mcp_server
# Show help and available options
python -m src.marker_mcp_server.server --help
# Show version information
python -m src.marker_mcp_server.server --version
# Enable debug logging
python -m src.marker_mcp_server.server --debug
System Requirements
- Python: 3.8+ (tested with 3.13)
- Memory: 4GB recommended minimum
- Storage: 1GB free space for temporary files
- OS: macOS, Linux, Windows (tested on macOS with Apple Silicon)
🛠️ MCP Tools Available
1. single_convert
- Single File Conversion
Convert individual PDF files with advanced options and monitoring.
arguments = {
"file_path": "/path/to/document.pdf",
"output_format": "markdown",
"page_range": "1-5",
"use_llm": True,
"debug": False
}
Features:
- Real-time resource monitoring
- Security validation
- Error recovery
- Progress tracking
2. batch_convert
- Enhanced Batch Processing
Convert multiple PDFs in a folder with full CLI argument support and concurrent processing.
arguments = {
"in_folder": "/path/to/pdfs",
"output_dir": "/path/to/outputs",
"max_workers": 3,
"output_format": "markdown",
"use_llm": True,
"skip_existing": True
}
Features:
- Concurrent processing (configurable workers)
- Partial failure handling
- Progress tracking
- Resource limit enforcement
3. chunk_convert
- Memory-Efficient Folder Processing
Process large collections of PDFs using memory-efficient chunking.
arguments = {
"in_folder": "/path/to/large/collection",
"out_folder": "/path/to/output",
"chunk_size": 50,
"use_llm": True
}
4. batch_pages_convert
- 🆕 Advanced Chunked Processing
NEW FEATURE: Process large PDFs efficiently by splitting them into page chunks.
arguments = {
"file_path": "/path/to/large_document.pdf",
"pages_per_chunk": 5,
"combine_output": True,
"use_llm": True,
"output_format": "markdown"
}
Benefits:
- Memory efficient processing
- Fault tolerant (chunk failures don't stop entire process)
- Progress tracking for each chunk
- Automatic output stitching
5. get_system_health
- 🆕 System Monitoring
Real-time system health assessment and resource monitoring.
# Returns current system status
{
"status": "healthy",
"memory_usage_percent": 45.2,
"cpu_usage_percent": 23.1,
"active_jobs": 1,
"queue_size": 0,
"alerts": []
}
6. get_metrics_summary
- 🆕 Performance Metrics
Get comprehensive performance metrics and operation statistics.
# Returns detailed metrics
{
"total_operations": 156,
"success_rate": 98.7,
"average_processing_time": 12.3,
"peak_memory_usage": 1024.5,
"gpu_utilization": 67.2
}
arguments = {
"host": "0.0.0.0",
"port": 8080
}
## 🔧 Advanced Configuration
### Resource Limits
Configure system resource limits for safe operation:
```json
{
"resource_limits": {
"max_file_size_mb": 500,
"max_memory_usage_mb": 4096,
"max_processing_time_seconds": 600,
"max_concurrent_jobs": 3,
"max_queue_size": 10
}
}
Monitoring Configuration
Enable comprehensive system monitoring:
{
"monitoring": {
"enable_metrics": true,
"metrics_interval_seconds": 30,
"log_performance": true,
"alert_memory_threshold_percent": 85.0,
"alert_processing_time_threshold_seconds": 300
}
}
Security Configuration
Configure security validation and access controls:
{
"security": {
"validate_file_paths": true,
"allowed_file_extensions": [".pdf"],
"allowed_input_dirs": ["/safe/input"],
"allowed_output_dirs": ["/safe/output"]
}
}
LLM Integration
Enable high-quality processing with Large Language Models:
# Basic LLM usage
{
"use_llm": True,
"llm_service": "groq" # Automatically normalized to full path
}
# Advanced LLM configuration
{
"use_llm": True,
"llm_service": "marker.services.groq.GroqService",
"config_json": "examples/llm_enhanced_config.json"
}
Page Range Selection
Process specific page ranges efficiently:
{
"page_range": "0-5", # Pages 0 through 5
"page_range": "0,3,5-10", # Pages 0, 3, and 5 through 10
"page_range": "10-" # Page 10 to end
}
Output Formats
Choose from multiple output formats:
{
"output_format": "markdown", # Default, clean markdown
"output_format": "json", # Structured JSON with metadata
"output_format": "html" # Styled HTML output
}
Debug Mode
Enable comprehensive debugging:
{
"debug": True # Saves debug images, processing data, and detailed logs
}
📁 Configuration Examples
Basic Configuration
{
"use_llm": false,
"output_format": "markdown",
"debug": false,
"extract_images": true,
"pdftext_workers": 2
}
LLM-Enhanced Configuration
{
"use_llm": true,
"llm_service": "marker.services.groq.GroqService",
"output_format": "markdown",
"debug": false,
"extract_images": true,
"format_lines": true
}
High-Performance Configuration
{
"workers": 3,
"max_tasks_per_worker": 20,
"disable_multiprocessing": false,
"pdftext_workers": 4,
"chunk_size": 100,
"resource_limits": {
"max_memory_usage_mb": 8192
}
}
Production Configuration
{
"resource_limits": {
"max_file_size_mb": 1000,
"max_memory_usage_mb": 8192,
"max_processing_time_seconds": 1200,
"max_concurrent_jobs": 5
},
"monitoring": {
"enable_metrics": true,
"log_performance": true,
"alert_memory_threshold_percent": 80.0
},
"security": {
"validate_file_paths": true,
"allowed_input_dirs": ["/app/input"],
"allowed_output_dirs": ["/app/output"]
}
}
📊 Monitoring & Performance
Real-Time System Health
Monitor system performance and resource usage in real-time:
# Get current system health
health = await client.call_tool("get_system_health")
# Returns:
{
"status": "healthy", # Overall system status
"memory_usage_percent": 45.2, # Current memory usage
"cpu_usage_percent": 23.1, # Current CPU usage
"gpu_usage_percent": 67.2, # GPU utilization (if available)
"active_jobs": 1, # Currently running operations
"queue_size": 0, # Pending operations
"alerts": [], # Active system alerts
"uptime_seconds": 3600 # Server uptime
}
Performance Metrics
Track operation performance and success rates:
# Get comprehensive metrics
metrics = await client.call_tool("get_metrics_summary")
# Returns:
{
"total_operations": 156,
"success_rate": 98.7,
"average_processing_time": 12.3,
"peak_memory_usage": 1024.5,
"total_pages_processed": 2340,
"errors_by_type": {
"memory_limit": 1,
"timeout": 1,
"file_not_found": 0
}
}
Resource Monitoring Features
- Memory Usage Tracking: Real-time memory consumption with alerts
- CPU Utilization: Multi-core usage monitoring
- GPU Monitoring: Apple Silicon MPS device tracking
- Processing Time Analysis: Per-operation timing with thresholds
- Operation Lifecycle: Start, progress, completion tracking
- Error Rate Analysis: Failure categorization and trending
Alert System
Configurable alerts for resource thresholds:
- Memory usage > 85% (configurable)
- Processing time > 300 seconds (configurable)
- Queue size approaching limits
- Error rate spikes
Performance Characteristics
- Throughput: ~2-5 pages/second (device dependent)
- Concurrent Jobs: 3 by default (configurable)
- Memory Efficiency: Streaming processing for large files
- Apple Silicon: Optimized for M1/M2 processors
🛡️ Security Features
File System Protection
Comprehensive security validation and access controls:
- Path Traversal Prevention: Blocks
../
and absolute path attacks - Directory Restriction: Enforces allowed input/output directories
- Extension Validation: Restricts to approved file types (
.pdf
) - Filename Sanitization: Prevents malicious filename patterns
Input Validation
- Parameter Sanitization: Type checking and range validation
- Configuration Validation: Schema-based security settings
- Access Logging: Detailed security event tracking
- Size Limits: Configurable file and memory limits
Security Event Logging
All security events are logged with details:
- File access attempts outside allowed directories
- Invalid file extension attempts
- Path traversal attack attempts
- Resource limit violations
Example Security Configuration
{
"security": {
"validate_file_paths": true,
"allowed_file_extensions": [".pdf"],
"allowed_input_dirs": [
"/safe/input",
"/approved/documents"
],
"allowed_output_dirs": [
"/safe/output",
"/processed/results"
]
}
}
🧪 Testing Suite
Comprehensive Test Coverage
The system includes 65+ comprehensive test cases covering:
- Unit Tests: Component-level testing with mocking
- Integration Tests: End-to-end workflow validation
- Security Tests: Attack scenario prevention
- Performance Tests: Load testing and benchmarking
- Error Handling Tests: Edge cases and failure scenarios
Test Categories
# Run all tests
python -m pytest tests/ -v
# Run specific test categories
python -m pytest tests/test_monitoring.py -v # 23 monitoring tests
python -m pytest tests/test_security.py -v # 24 security tests
python -m pytest tests/test_error_handling.py -v # 18 error handling tests
python -m pytest tests/test_pdf_conversion.py -v # 20 real functionality tests
python -m pytest tests/test_tools.py -v # 23 tool integration tests
python -m pytest tests/test_config.py -v # Configuration tests
# Run with coverage
python -m pytest tests/ --cov=src/marker_mcp_server
Test Results (Latest)
✅ All Tests Passing: 100% success rate across all test suites
✅ Real Functionality: Tests use actual PDF conversion without extensive mocking
✅ Comprehensive Coverage: Monitoring, security, error handling, performance
✅ Integration Validated: End-to-end workflows tested
Testing Philosophy
- Real-World Testing: Minimal mocking, actual functionality validation
- Comprehensive Coverage: All major features and edge cases
- Performance Validation: Resource usage and timing verification
- Security Hardening: Attack scenario prevention testing
📚 API Documentation
MCP Client Usage
# Example MCP client usage
import asyncio
from mcp.client import MCPClient
async def main():
client = MCPClient()
# Convert single PDF
result = await client.call_tool("single_convert", {
"file_path": "/path/to/document.pdf",
"output_format": "markdown",
"use_llm": True
})
# Check system health
health = await client.call_tool("get_system_health")
print(f"System status: {health['status']}")
# Get performance metrics
metrics = await client.call_tool("get_metrics_summary")
print(f"Success rate: {metrics['success_rate']}%")
asyncio.run(main())
REST API (Optional)
When the server is started with FastAPI mode, it automatically exposes:
- Swagger UI: Available at
/docs
for interactive API testing - ReDoc: Available at
/redoc
for comprehensive documentation - OpenAPI Schema: Available at
/openapi.json
Available Endpoints
POST /convert/single
- Single file conversionPOST /convert/batch
- Batch processingPOST /convert/chunk
- Chunked processingGET /health
- System health checkGET /metrics
- Performance metrics
🤝 Contributing
We welcome contributions! Here's how to get started:
How to Contribute
- Fork the repository and create your branch from
main
- Write tests for your changes (see
tests/
directory) - Document new features or changes in the README
- Open a Pull Request with a clear description of your changes
Development Guidelines
- Follow PEP8 and use type hints where possible
- Write docstrings for all public functions and classes
- Add tests for new functionality (aim for high coverage)
- Update documentation for user-facing changes
Adding New Features
New Processors
Add processors in marker/processors/
and register them:
from marker.processors import register_processor
@register_processor('my_processor')
class MyProcessor:
def process(self, document):
return processed_document
New MCP Tools
Add tools in src/marker_mcp_server/tools.py
and register in server.py
:
@mcp.tool("my_tool")
async def handle_my_tool(arguments: Dict[str, Any]) -> Dict[str, Any]:
# Implementation with monitoring
async with metrics_collector.track_operation("my_operation"):
return result
Testing Your Changes
# Run all tests
python -m pytest tests/ -v
# Run specific test files
python -m pytest tests/test_new_feature.py -v
# Check test coverage
python -m pytest tests/ --cov=src/marker_mcp_server
# Run integration tests
python -m pytest tests/test_pdf_conversion.py -v
Code Review Process
- All submissions require review
- Tests must pass on all supported Python versions
- Documentation must be updated for user-facing changes
- Performance impact should be considered for core changes
🔍 Troubleshooting
Common Issues
High Memory Usage
Symptom: System running out of memory during processing
Solution:
- Reduce
max_concurrent_jobs
in configuration - Lower
max_memory_usage_mb
limit - Use chunked processing for large files
- Check system health with
get_system_health
tool
Processing Timeouts
Symptom: Operations timing out before completion
Solution:
- Increase
max_processing_time_seconds
limit - Use chunked processing for large documents
- Check system resources and reduce concurrent operations
- Enable debug mode to identify bottlenecks
File Size Errors
Symptom: "File too large" errors
Solution:
- Increase
max_file_size_mb
limit - Use chunked processing for very large PDFs
- Check available disk space
- Consider splitting large PDFs before processing
Path Validation Failures
Symptom: "Path not allowed" security errors
Solution:
- Configure
allowed_input_dirs
andallowed_output_dirs
- Use absolute paths
- Check file permissions
- Verify paths don't contain traversal patterns (
../
)
LLM Integration Issues
Symptom: LLM service connection failures
Solution:
- Verify API credentials are configured
- Check network connectivity
- Validate
llm_service
configuration - Review service-specific documentation
Debug Mode
Enable comprehensive debugging:
# Start server with debug logging
python -m src.marker_mcp_server.server --debug
# Or configure in JSON
{
"debug": true,
"monitoring": {
"log_performance": true
}
}
Debug mode provides:
- Detailed operation logging
- Resource usage tracking
- Intermediate file preservation
- Extended error information
Health Checks
Monitor system health:
# Check system status
health = await client.call_tool("get_system_health")
# Analyze performance metrics
metrics = await client.call_tool("get_metrics_summary")
Log Analysis
Key log locations:
- Application logs: Check console output or log files
- Security events: Security validation failures and attempts
- Performance metrics: Resource usage and timing information
- Error details: Stack traces and diagnostic information
📄 License & Legal
License
This project is licensed under the terms specified in the LICENSE file.
Contributor License Agreement
Contributors must agree to the Marker Contributor Agreement (MCA). See CLA.md for details.
Key points:
- Joint ownership of contributions with Endless Labs, Inc.
- Permission to sublicense contributions
- Patent grants for contributed technology
- Compliance with export control laws
Third-Party Dependencies
This project uses various open-source libraries. See pyproject.toml
and requirements-dev.txt
for complete dependency lists.
Key dependencies:
- Marker PDF: Core PDF processing capabilities
- FastAPI: REST API server (optional)
- pdftext: PDF text extraction
- WeasyPrint: HTML to PDF rendering
- pytest: Testing framework
Copyright Notice
Copyright (c) 2025 Endless Labs, Inc.
Licensed under the Marker Contributor Agreement
📞 Support
Getting Help
- GitHub Issues: For bug reports and feature requests
- Documentation: See this README and example configurations
- Testing: Run integration tests to validate setup
- Debugging: Enable debug mode for detailed diagnostics
Reporting Issues
When reporting issues, please include:
- System information: OS, Python version, hardware specs
- Configuration: Relevant configuration files (remove sensitive data)
- Error messages: Complete error logs and stack traces
- Reproduction steps: Minimal example to reproduce the issue
- Expected behavior: What should happen vs what actually happens
Feature Requests
Feature requests should include:
- Use case: Clear description of the problem being solved
- Proposed solution: How the feature should work
- Alternatives considered: Other approaches that were evaluated
- Impact: Who would benefit and how
🎯 Project Status
✅ Production Ready: All major features implemented and tested
✅ Comprehensive Testing: 65+ test cases with 100% pass rate
✅ Security Hardened: File system protection and input validation
✅ Performance Monitored: Real-time resource tracking and alerts
✅ MCP Compliant: Full Model Context Protocol implementation
✅ Well Documented: Complete API and configuration documentation
The Marker MCP Server provides a robust, secure, and scalable solution for PDF to Markdown conversion with comprehensive monitoring and testing capabilities.
Last updated: June 7, 2025