This project contains a PDF parsing tool that extracts text and metadata from PDF files.
- Python 3.10+
- Install dependencies through pip
- Java Runtime Environment (JRE) openjdk@11 or openjdk@17
- For local debugging in VS Code, add the following to .vscode/settings.json (create the file if it does not exist):
    {
      "version": "0.2.0",
      "configurations": [
        {
          "name": "Python: Current File",
          "type": "debugpy",
          "request": "launch",
          "program": "${file}",
          "console": "integratedTerminal",
          "cwd": "${workspaceFolder}",
          "env": {
            "PYTHONPATH": "${workspaceFolder}"
          },
          "justMyCode": true
        }
      ]
    }
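
The prerequisites above can be checked before the first run. The following is a minimal sketch, not part of the project: the function name `check_prerequisites` and its parameters are hypothetical. It only confirms the Python version and that a `java` executable is on the PATH; it does not verify that the installed JRE is version 11 or 17.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10), which=shutil.which, version=None):
    """Return a list of human-readable problems; an empty list means all checks passed.

    `which` and `version` are injectable so the check can be exercised
    without changing the real environment (hypothetical helper).
    """
    current = tuple(version or sys.version_info[:2])
    problems = []
    if current < tuple(min_python):
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {current[0]}.{current[1]}"
        )
    # shutil.which returns None when the executable is not on PATH.
    if which("java") is None:
        problems.append("java executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Returning a list rather than raising on the first failure lets all missing prerequisites be reported in one pass.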
To process a PDF file, use the process_pdf_task function in the process_pdf.py file. Here's how to use it:
- Set PYTHONPATH to the root directory of the project:
export PYTHONPATH=$(pwd) # OR export PYTHONPATH=/`pwd`
- To get results broken down by section, set run_section_extraction to True. This can take a while and requires OPENAI_API_KEY to be set. To change the section names, update the section mapping file pdf_parser/section_mapping.json.

Then, with the key set in your environment, you can call the function as follows:

    OPENAI_API_KEY=example_openai_api_key

    from openai import OpenAI
    from pdf_parser.process_pdf import process_pdf_task

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    results = process_pdf_task(client, 'path/to/your/pdf.pdf', run_section_extraction=True, output_path="output")
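
Because section extraction depends on OPENAI_API_KEY, it can help to decide the value of run_section_extraction from the environment instead of hard-coding it. The sketch below is an illustration only: `section_extraction_enabled` is a hypothetical helper, not a function provided by this project.

```python
import os


def section_extraction_enabled(requested: bool, env=os.environ) -> bool:
    """Return True only if section extraction was requested AND an API key is set.

    `env` defaults to the process environment and can be replaced by a plain
    dict for testing (hypothetical helper, not part of pdf_parser).
    """
    if not requested:
        return False
    # OPENAI_API_KEY is the variable named in the usage notes above.
    return bool(env.get("OPENAI_API_KEY", "").strip())
```

The result could then be passed as `run_section_extraction=section_extraction_enabled(True)`, so that a missing key degrades to plain text extraction instead of failing partway through.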
- The process_pdf_task function copies the input PDF to a temporary directory for processing.
- It then runs the Java-based PDF parser using the pdf-parser.jar file.
- The parser extracts metadata and text information from the PDF and saves it as a JSON file.
- The process_json function in text_extraction.py processes this JSON data to create a full text representation of the PDF content.
- The results, including metadata and full text, are saved in the results/pdf_parser directory as a JSON file.
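
The steps above can be sketched roughly as follows. This is not the project's implementation: `stage_pdf`, `build_parser_command`, and `run_pipeline` are hypothetical names, and the command-line arguments passed to pdf-parser.jar (input path followed by output path) are an assumption, since the jar's real interface is not described here.

```python
import json
import shutil
import subprocess
import tempfile
from pathlib import Path


def stage_pdf(pdf_path, work_dir):
    """Copy the input PDF into the working directory and return the new path."""
    src = Path(pdf_path)
    dst = Path(work_dir) / src.name
    shutil.copy2(src, dst)
    return dst


def build_parser_command(jar_path, pdf_path, json_path):
    # ASSUMPTION: the argument order (input PDF, then output JSON) is
    # illustrative; check the jar's actual command-line interface.
    return ["java", "-jar", str(jar_path), str(pdf_path), str(json_path)]


def run_pipeline(pdf_path, jar_path="pdf_parser/pdf-parser.jar", runner=subprocess.run):
    """Stage the PDF, run the parser, and return the parsed JSON as a dict."""
    with tempfile.TemporaryDirectory() as work_dir:
        staged = stage_pdf(pdf_path, work_dir)
        json_path = staged.with_suffix(".json")
        # check=True makes subprocess.run raise if the parser exits non-zero.
        runner(build_parser_command(jar_path, staged, json_path), check=True)
        with open(json_path, encoding="utf-8") as fh:
            return json.load(fh)
```

Working on a copy inside a temporary directory keeps the original PDF untouched and ensures intermediate files are removed when processing finishes.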
You can customize the text extraction process by modifying the create_full_text function in text_extraction.py. This function determines how the extracted text is formatted and structured.
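
As an illustration of the kind of customization meant here, the sketch below joins per-page text with a configurable separator. The input shape (a dict with a "pages" list of {"text": ...} entries) is assumed purely for the example and may not match the JSON the parser actually produces; the name `create_full_text_sketch` is hypothetical and is not the real create_full_text.

```python
def create_full_text_sketch(parsed: dict, page_separator: str = "\n\n") -> str:
    """Join the text of each page into one string, skipping empty pages.

    ASSUMPTION: `parsed` looks like {"pages": [{"text": "..."}, ...]}.
    Adapt the keys to the parser's real JSON output.
    """
    pages = parsed.get("pages", [])
    chunks = [page.get("text", "").strip() for page in pages]
    return page_separator.join(chunk for chunk in chunks if chunk)
```

Changing `page_separator`, or filtering pages before joining, is the sort of formatting decision that would live in this function.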
If you encounter any issues:
- Ensure that Java openjdk@17 is installed and accessible from the command line.
- Install the required packages with pip install -r requirements.txt.
- Check that the pdf-parser.jar file is present in the pdf_parser directory.
- Verify that the input PDF file exists in the specified location.
For any errors during processing, check the logs printed to the console for more information.
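
The file-related items in the checklist above can be automated with a small script. This is a sketch under stated assumptions: `find_missing_files` is a hypothetical helper, and the location of requirements.txt at the project root is assumed, not confirmed by this README.

```python
from pathlib import Path


def find_missing_files(pdf_path, project_root="."):
    """Return one message per expected file that does not exist."""
    root = Path(project_root)
    expected = {
        "parser jar": root / "pdf_parser" / "pdf-parser.jar",
        # ASSUMPTION: requirements.txt sits at the project root.
        "requirements file": root / "requirements.txt",
        "input PDF": Path(pdf_path),
    }
    return [
        f"{label} not found: {path}"
        for label, path in expected.items()
        if not path.is_file()
    ]


if __name__ == "__main__":
    for message in find_missing_files("path/to/your/pdf.pdf"):
        print(message)
```

An empty list means all three files were found; anything else points at the item to fix first.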