prettyparser is a Python library for parsing and cleaning text in parallel from PDF files, TXT files, and in-memory Python objects (str or list) using rules written as regular expressions. For PDF files, the package reads the content with pdfplumber and then applies a series of cleanup steps to produce higher-quality output, removing the boilerplate code needed to read, process, and write multiple files with multiple pages. A custom pdfplumber-based function that takes a page and returns its processed text can also be supplied. Additional processing steps can be added as custom regular expressions, which are compiled for speed.
$ git clone https://github.com/leandroroser/prettyparser
$ cd prettyparser
$ pip install -e .
or
$ pip install prettyparser
import regex as re
from prettyparser import PrettyParser
files = ["./BOOKS/PDF/PDF1.pdf", "./BOOKS/PDF/PDF2.pdf"]
output = "./BOOKS/TXT"
parser = PrettyParser(files, None, output, mode = 'pdf',
    args = [[r"(\n\s*\d+\s*\n)|(\n\s*\d+\s*$)", r'\n\n'],       # lines holding only a number (page numbers)
            [r"\n\s*-\d-\s*\n", r'\n\n'],                       # page markers such as -3-
            [r"\n\s*(\* *)+\s*\n", r'\n\n'],                    # separator lines such as * * *
            [r"__some_header_text", r'\n\n', re.IGNORECASE]],   # placeholder for a repeated header
    remove_whitelines = True,
    paragraphs_spacing = 1,
    remove_hyphen_eol = True)
parser.run()
import regex as re
from prettyparser import PrettyParser
directory = "./BOOKS/PDF"
output = "./BOOKS/TXT"
parser = PrettyParser(None, directory, output, mode = 'pdf',
    args = [[r"(\n\s*\d+\s*\n)|(\n\s*\d+\s*$)", r'\n\n'],
            [r"\n\s*-\d-\s*\n", r'\n\n'],
            [r"\n\s*(\* *)+\s*\n", r'\n\n'],
            [r"__some_header_text", r'\n\n', re.IGNORECASE]],
    remove_whitelines = True,
    paragraphs_spacing = 1,
    remove_hyphen_eol = True)
parser.run()
Suppose the previous output is not good enough and needs further corrections. Instead of parsing the PDF files again, you can test additional rules more quickly by re-parsing the TXT output of the previous run:
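import regex as re
from prettyparser import PrettyParser
directory = "./BOOKS/TXT"
output = "./BOOKS/TXT_REPARSED"
parser = PrettyParser(None, directory, output, mode = 'txt',
    args = [[r"some other header.*\d+", r''],         # drop a leftover header ending in digits
            [r"^\d+.*", r'', re.MULTILINE],           # drop lines that start with a number
            [r"([A-Z]+)( *\n)([A-Z]+)", r'\1\3']],    # rejoin uppercase runs split across lines
    remove_whitelines = True,
    paragraphs_spacing = 1,
    remove_hyphen_eol = True)
parser.run()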
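Before running new rules over a whole directory, it can help to preview them on a short string. The sketch below is not prettyparser's internal code; it only assumes that rules are applied in order and that the flags element is optional, as described in this README. It uses the standard `re` module, and the helper names `compile_rules` and `apply_rules` are hypothetical.

```python
import re

def compile_rules(rules):
    """Compile (pattern, replacement[, flags]) rules once so they can be reused."""
    compiled = []
    for rule in rules:
        pattern, replacement = rule[0], rule[1]
        flags = rule[2] if len(rule) > 2 else 0  # the flags element may be absent
        compiled.append((re.compile(pattern, flags), replacement))
    return compiled

def apply_rules(text, compiled):
    """Apply every compiled rule to the text, in the order given."""
    for pattern, replacement in compiled:
        text = pattern.sub(replacement, text)
    return text

rules = [
    [r"some other header.*\d+", r""],
    [r"^\d+.*", r"", re.MULTILINE],
    [r"([A-Z]+)( *\n)([A-Z]+)", r"\1\3"],
]

sample = "some other header 12\n3 Chapter\nHEL\nLO world\n"
preview = apply_rules(sample, compile_rules(rules))
print(repr(preview))  # '\n\nHELLO world\n'
```

Note that the third rule joins the two uppercase runs without inserting a space, so it suits words broken across lines rather than separate words.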
import regex as re
from prettyparser import PrettyParser
txt = """
header to remove
This is a text with multiple problems. For exam-
ple the latter word can be joined.
The portions of this line can be
joined
in a single line.
HERE ALSO IS SOME
UPPERCASE TEXT
TO JOIN
Some Other Ugly Stuff To Remove IGNORING Case.
Remove the line below:
* * *
Remove empty lines and finally separate paragraphs with a blank line.
Below is the page number->.
99
"""
parser = PrettyParser(txt, mode = "pyobj",
    args = [[r"\s*header to remove\s*\n", r""],
            [r"(\n\s*\d+\s*\n)", r'\n\n'],
            [r"\n\s*(\* *)+\s*\n", r'\n\n'],
            [r"\n.*some other ugly stuff.*", r'\n\n', re.IGNORECASE]],
    remove_whitelines = True,
    paragraphs_spacing = 1,
    remove_hyphen_eol = True)
output = parser.run()
print(output[0])
This is a text with multiple problems. For example the latter word can be joined.
The portions of this line can be joined in a single line.
HERE ALSO IS SOME UPPERCASE TEXT TO JOIN
Remove the line below:
Remove empty lines and finally separate paragraphs with a blank line.
Below is the page number->.
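To give an intuition for what options such as remove_hyphen_eol and remove_whitelines do in the example above, here is a minimal sketch in plain Python. It is an approximation written for illustration, not the library's implementation; the function names `merge_hyphen_eol` and `drop_blank_lines` are hypothetical.

```python
import re

def merge_hyphen_eol(text: str) -> str:
    """Join a word split by a hyphen at the end of a line ('exam-' + 'ple' -> 'example')."""
    return re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)

def drop_blank_lines(text: str) -> str:
    """Remove lines that are empty or contain only whitespace."""
    return "\n".join(line for line in text.splitlines() if line.strip())

sample = "For exam-\nple the latter word can be joined.\n\n   \nNext line."
cleaned = drop_blank_lines(merge_hyphen_eol(sample))
print(cleaned)
```

A simple pattern like this also merges genuine compounds that happen to break at a line end (for example "well-" followed by "known"), so it is worth checking the result on a sample of your own text.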
prettyparser --directories /home/BOOKS --output /home/BOOKS_PARSED --mode 'pdf'
- files (list or str): paths to parse for pdf/txt operations. When mode is 'pdf' or 'txt', a string is treated as a directory. When mode is 'pyobj', a str or list is treated as text (or a list of texts) already loaded in memory
- output (str): output directory
- args (list): list of rules of the form (regex, replacement, flags). The flags element is optional
- mode (str): 'pdf', 'txt' or 'pyobj' (the latter for Python lists and strings)
- default (bool): if True, perform several default cleanup operations. Default: True
- remove_whitelines (bool): if True, remove blank (whitespace-only) lines
- paragraphs_spacing (int): number of newlines between paragraphs
- page_spacing (str): string to insert between pages
- remove_hyphen_eol (bool): if True, remove end of line hyphens and merge subwords
- custom_pdf_fun (Callable): custom function to parse pdf files. It must accept a pdfplumber page as argument and return a text to be joined with previous pages
- overwrite (bool): overwrite the output file if it exists. Default: False
- n_jobs (int): number of jobs. Default: number of cores - 1
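As an illustration of the custom_pdf_fun contract described in the list above (take a pdfplumber page, return text), the sketch below crops a band from the top and bottom of each page before extracting text, which is one way to skip running headers and footers. The function name `body_text` and the `margin_ratio` parameter are hypothetical; `bbox`, `crop` and `extract_text` are the pdfplumber page members it relies on.

```python
def body_text(page, margin_ratio=0.08):
    """Return the page text with the top and bottom margin bands cropped away.

    `page` is expected to behave like a pdfplumber page: it exposes
    `bbox` as (x0, top, x1, bottom), plus `crop(bbox)` and `extract_text()`.
    """
    x0, top, x1, bottom = page.bbox
    band = (bottom - top) * margin_ratio  # height of each band to discard
    cropped = page.crop((x0, top + band, x1, bottom - band))
    # extract_text() can return None when the cropped area has no text layer
    return cropped.extract_text() or ""
```

Going by the parameter list, such a function would be passed as `custom_pdf_fun=body_text` when building the parser; the exact call is not shown in this README, so treat that as an assumption. If the work is distributed across processes, defining the function at module level (rather than as a lambda) is the safer choice.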
English, Spanish, German, French, Portuguese
© Leandro Roser, 2023. Licensed under an Apache-2 license.