PDF Parser

This project contains a PDF parsing tool that extracts text and metadata from PDF files.

Prerequisites

Python 3.10+
Dependencies installed with pip (pip install -r requirements.txt)
Java Runtime Environment (JRE), openjdk@11 or openjdk@17
For local debugging in VS Code, add the following to .vscode/launch.json (create the file if it does not exist):

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "debugpy",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "cwd": "${workspaceFolder}",
            "env": {
                "PYTHONPATH": "${workspaceFolder}"
            },
            "justMyCode": true
        }
    ]
}

Usage

To process a PDF file, call the process_pdf_task function in pdf_parser/process_pdf.py. Here's how to use it:

Set PYTHONPATH to the root directory of the project:

export PYTHONPATH=$(pwd)
# OR
export PYTHONPATH=`pwd`

Note: to get results broken down by section, set run_section_extraction to True. This can take a while and requires the OPENAI_API_KEY environment variable to be set. To change the section names, update the section mapping file at pdf_parser/section_mapping.json.
```
OPENAI_API_KEY=example_openai_api_key
```
Then, you can call the function as follows:
```
from openai import OpenAI

from pdf_parser.process_pdf import process_pdf_task

# OpenAI() reads OPENAI_API_KEY from the environment
client = OpenAI()
results = process_pdf_task(client, 'path/to/your/pdf.pdf', run_section_extraction=True, output_path="output")
```
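
Because section extraction depends on OPENAI_API_KEY, it can help to fail early with a clear message when the variable is missing. The snippet below is a minimal sketch of such a check; `require_env` is a hypothetical helper introduced here for illustration and is not part of this project.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; it is not part of pdf_parser.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it before running section extraction"
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` before creating the client surfaces a missing key immediately rather than partway through processing.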

How it works

The process_pdf_task function copies the input PDF to a temporary directory for processing.
It then runs the Java-based PDF parser using the pdf-parser.jar file.
The parser extracts metadata and text information from the PDF and saves it as a JSON file.
The process_json function in text_extraction.py processes this JSON data to create a full text representation of the PDF content.
The results, including metadata and full text, are saved in the results/pdf_parser directory as a JSON file.
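
The first three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the function name `run_parser_sketch`, the `runner` parameter, and in particular the argument layout passed to the jar are hypothetical. Check process_pdf.py for the real invocation.

```python
import json
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_parser_sketch(pdf_path, jar_path, runner=subprocess.run):
    """Copy the PDF to a temp dir, run the jar, and load the JSON it writes.

    The command-line arguments passed to the jar are an ASSUMPTION made for
    illustration only.
    """
    pdf_path = Path(pdf_path)
    with tempfile.TemporaryDirectory() as tmp:
        # Step 1: work on a copy so the original file is never touched
        work_pdf = Path(tmp) / pdf_path.name
        shutil.copy(pdf_path, work_pdf)
        out_json = work_pdf.with_suffix(".json")
        # Step 2: invoke the Java parser (argument layout is hypothetical)
        cmd = ["java", "-jar", str(jar_path), str(work_pdf), str(out_json)]
        runner(cmd, check=True)
        # Step 3: read the parser output before the temp dir is deleted
        with open(out_json, encoding="utf-8") as fh:
            return json.load(fh)
```

Passing the runner as a parameter keeps the sketch testable without a Java installation; by default it uses `subprocess.run`, which raises `CalledProcessError` when the parser exits with a non-zero status.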

Customization

You can customize the text extraction process by modifying the create_full_text function in text_extraction.py. This function determines how the extracted text is formatted and structured.
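
As an illustration of the kind of change this allows, the sketch below joins sections into a single string with a configurable heading prefix. The input shape (a list of dicts with "title" and "paragraphs" keys) and the name `create_full_text_sketch` are assumptions made for this example; the real create_full_text function and the parser's JSON schema may differ.

```python
def create_full_text_sketch(sections, heading_prefix="## "):
    """Join sections into one string, with a heading line per titled section.

    `sections` is assumed to be a list of dicts with optional "title" and
    "paragraphs" keys. This shape is hypothetical, not the parser's schema.
    """
    parts = []
    for section in sections:
        title = (section.get("title") or "").strip()
        if title:
            parts.append(f"{heading_prefix}{title}")
        # Skip empty paragraphs so the output has no runs of blank lines
        parts.extend(p.strip() for p in section.get("paragraphs", []) if p.strip())
    return "\n\n".join(parts)
```

Changing `heading_prefix` or the paragraph separator is then enough to switch between, for example, Markdown-style headings and plain text.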

Troubleshooting

If you encounter any issues:

Ensure that Java (openjdk@11 or openjdk@17) is installed and accessible from the command line; running java -version should print the installed version.
Install the required packages with pip install -r requirements.txt.
Check that the pdf-parser.jar file is present in the pdf_parser directory.
Verify that the input PDF file exists in the specified location.

For any errors during processing, check the logs printed to the console for more information.
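
The environment checks listed above can also be run programmatically before processing. The following is a minimal sketch; `preflight` is a hypothetical helper, and the default jar location is taken from the pdf_parser directory mentioned above.

```python
import shutil
from pathlib import Path


def preflight(pdf_path, jar_path="pdf_parser/pdf-parser.jar"):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # Java must be reachable from the command line
    if shutil.which("java") is None:
        problems.append("java not found on PATH")
    # The parser jar must be present
    if not Path(jar_path).is_file():
        problems.append(f"missing jar: {jar_path}")
    # The input PDF must exist
    if not Path(pdf_path).is_file():
        problems.append(f"missing input PDF: {pdf_path}")
    return problems
```

Printing the returned list before calling process_pdf_task gives a quick summary of anything that needs fixing.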
