GitHub - leandroroser/prettyparser: Parallel processing and parsing PDF and TXT files, and Python objects with text (str, list) using rules (regular expressions).

prettyparser is a Python library for parallel processing and parsing of PDF and TXT files, as well as in-memory Python objects containing text (str, list), using rules written as regular expressions. For PDF files, the package reads the content with pdfplumber and then applies a series of cleanup steps to produce higher-quality output, removing the boilerplate code otherwise needed to read, process, and write multiple multi-page files. A custom processing function that takes a pdfplumber page and returns the processed text can also be supplied. Additional processing steps can be added as custom regular expressions, which are compiled for speed.
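To make the idea of rule-based cleanup concrete, here is a minimal sketch of how rules of the form [regex, replacement, optional flags] could be compiled once and then applied in order. It uses the standard library `re` module rather than the `regex` package imported in the examples, and the helper names `compile_rules` and `apply_rules` are hypothetical: they are not part of prettyparser and say nothing about how the library is implemented internally.

```python
import re

def compile_rules(rules):
    """Compile [pattern, replacement, optional flags] rules once, up front."""
    compiled = []
    for rule in rules:
        pattern, replacement = rule[0], rule[1]
        flags = rule[2] if len(rule) > 2 else 0  # the flags element may be absent
        compiled.append((re.compile(pattern, flags), replacement))
    return compiled

def apply_rules(text, compiled):
    """Apply each compiled rule in order; later rules see earlier results."""
    for regex, replacement in compiled:
        text = regex.sub(replacement, text)
    return text

rules = [
    [r"\n\s*\d+\s*\n", "\n\n"],            # a line holding only a page number
    [r"\n\s*(\* *)+\s*\n", "\n\n"],        # a line of asterisks used as a separator
    [r" *draft copy", "", re.IGNORECASE],  # a header, matched case-insensitively
]
compiled = compile_rules(rules)
cleaned = apply_rules("Intro\n 12 \nBody DRAFT COPY text\n* * *\nEnd", compiled)
print(repr(cleaned))
```

Because the rules run sequentially, their order matters: a rule that removes page numbers first can change what a later rule matches.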

Installation

$ git clone https://github.com/leandroroser/prettyparser
$ cd prettyparser
$ pip install -e .

or

$ pip install prettyparser

Example: processing a series of PDF files

import regex as re
from prettyparser import PrettyParser

files = ["./BOOKS/PDF/PDF1.pdf", "./BOOKS/PDF/PDF2.pdf"]
output = "./BOOKS/TXT"
parser = PrettyParser(files, None, output, mode = 'pdf',
                      args = [[r"(\n\s*\d+\s*\n)|(\n\s*\d+\s*$)", r'\n\n'],
                            [r"\n\s*-\d-\s*\n", r'\n\n'], 
                            [r"\n\s*(\* *)+\s*\n", r'\n\n'],
                            [r"__some_header_text", r'\n\n', re.IGNORECASE]],
                            remove_whitelines = True,
                            paragraphs_spacing = 1,
                            remove_hyphen_eol = True)
parser.run()

Example: processing a folder with multiple PDF files

import regex as re
from prettyparser import PrettyParser

directory = "./BOOKS/PDF"
output = "./BOOKS/TXT"
parser = PrettyParser(None, directory, output, mode = 'pdf',
                      args = [[r"(\n\s*\d+\s*\n)|(\n\s*\d+\s*$)", r'\n\n'],
                            [r"\n\s*-\d-\s*\n", r'\n\n'], 
                            [r"\n\s*(\* *)+\s*\n", r'\n\n'],
                            [r"__some_header_text", r'\n\n', re.IGNORECASE]],
                            remove_whitelines = True,
                            paragraphs_spacing = 1,
                            remove_hyphen_eol = True)
parser.run()

Example: processing a folder with multiple TXT files

Suppose the previous output isn't good enough and needs further corrections. A quicker way to test them is to reparse the previous TXT output instead of the original PDF files:

import regex as re
from prettyparser import PrettyParser

directory = "./BOOKS/TXT"
output = "./BOOKS/TXT_REPARSED"
parser = PrettyParser(None, directory, output, mode = 'txt',
                      args = [[r"some other header.*\d+", r''],
                            [r"^\d+.*", r'', re.MULTILINE],
                            [r"([A-Z]+)( *\n)([A-Z]+)", r'\1\3']],
                            remove_whitelines = True,
                            paragraphs_spacing = 1,
                            remove_hyphen_eol = True)
parser.run()
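
When iterating on corrections like these, it can help to see how many substitutions each rule makes before running a whole folder. The following is a small sketch using only the standard library `re` module; `preview_rules` is a hypothetical helper, not a prettyparser function, and the sample text is made up for illustration.

```python
import re

def preview_rules(text, rules):
    """Apply rules in sequence; return the cleaned text and a per-rule match count."""
    report = []
    for rule in rules:
        pattern, replacement = rule[0], rule[1]
        flags = rule[2] if len(rule) > 2 else 0
        # subn returns the new string together with the number of substitutions
        text, count = re.subn(pattern, replacement, text, flags=flags)
        report.append((pattern, count))
    return text, report

sample = "some other header page 3\n12 stray number line\nKEEP\nTHIS text"
rules = [
    [r"some other header.*\d+", r""],
    [r"^\d+.*", r"", re.MULTILINE],
]
cleaned, report = preview_rules(sample, rules)
for pattern, count in report:
    print(f"{count:3d} substitution(s) for {pattern!r}")
```

A rule that reports zero substitutions is usually a sign that the pattern or its flags need another look.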

Example: processing a Python str for a quick test of the app

import regex as re
from prettyparser import PrettyParser


txt = """
header to remove

This is a text with multiple problems. For exam-
ple the latter word can be joined. 
The portions of this line can be
joined
in a single line.
HERE ALSO IS SOME
UPPERCASE TEXT
TO JOIN
Some Other Ugly Stuff To Remove IGNORING Case. 

Remove the line below:

* * * 

Remove empty lines and finally separate paragraphs with a blank line.


Below is the page number->.
99
"""
parser = PrettyParser(txt, mode = "pyobj", args = [[r"\s*header to remove\s*\n",r""],
                                                    [r"(\n\s*\d+\s*\n)", r'\n\n'],
                                                    [r"\n\s*(\* *)+\s*\n", r'\n\n'],
                                                    [r"\n.*some other ugly stuff.*", 
                                                    r'\n\n', re.IGNORECASE]],
                                                    remove_whitelines = True,
                                                    paragraphs_spacing = 1,
                                                    remove_hyphen_eol = True)
output = parser.run()
print(output[0])

Output:

This is a text with multiple problems. For example the latter word can be joined.

The portions of this line can be joined in a single line.

HERE ALSO IS SOME UPPERCASE TEXT TO JOIN

Remove the line below: 

Remove empty lines and finally separate paragraphs with a blank line.

Below is the page number->.
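
Two of the transformations visible in the output above, merging words hyphenated at the end of a line and normalizing the spacing between paragraphs, can be approximated with plain regular expressions. This is only an illustrative sketch with the standard library `re` module: the function names and the `blank_lines` parameter are hypothetical, and it does not reproduce prettyparser's own paragraph detection.

```python
import re

def join_hyphenated(text):
    """Merge a word split across lines, e.g. 'exam-' + 'ple' becomes 'example'."""
    # Caveat: a genuine compound broken at its hyphen is merged as well.
    return re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)

def normalize_paragraphs(text, blank_lines=1):
    """Join wrapped lines inside each paragraph and separate paragraphs evenly."""
    # Paragraphs are taken to be runs of text separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = [
        " ".join(line.strip() for line in p.splitlines() if line.strip())
        for p in paragraphs
    ]
    return ("\n" * (blank_lines + 1)).join(joined)

sample = (
    "For exam-\nple this word is joined.\n\n\n\n"
    "Next para-\ngraph, wrapped\nover lines."
)
result = normalize_paragraphs(join_hyphenated(sample))
print(result)
```

The hyphen rule runs first so that the line-joining step does not insert a space in the middle of a split word.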

Running from the command line

 prettyparser --directories /home/BOOKS --output /home/BOOKS_PARSED --mode 'pdf'

Arguments

files (list or str): paths to parse for pdf/txt operations. If a string is passed when mode is 'pdf' or 'txt', it is treated as a directory. If a str or list is passed when mode is 'pyobj', it is treated as text already loaded in memory
output (str): output directory
args (list): list of rules of the form (regex, replacement, flags). The flags element is optional
mode (str): 'pdf', 'txt' or 'pyobj' (the latter for Python lists and strings)
default (bool): if True, perform several default cleanup operations. Default: True
remove_whitelines (bool): if True, remove blank (whitespace-only) lines
paragraphs_spacing (int): number of newlines between paragraphs
page_spacing (str): string to insert between pages
remove_hyphen_eol (bool): if True, remove end-of-line hyphens and merge the split word
custom_pdf_fun (Callable): custom function to parse pdf files. It must accept a pdfplumber page as argument and return a text to be joined with previous pages
overwrite (bool): overwrite the output file if it exists. Default: False
n_jobs (int): number of jobs. Default: number of cores - 1
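
As an illustration of the custom_pdf_fun contract described above (a function that receives a pdfplumber page and returns text), here is a hedged sketch that crops away the top and bottom margins before extracting text. The function name `body_text_only` and the `margin_ratio` parameter are hypothetical; only `page.bbox`, `page.crop` and `extract_text` are pdfplumber calls.

```python
def body_text_only(page, margin_ratio=0.08):
    """Return the text of a pdfplumber page with top and bottom margins cropped.

    margin_ratio is a made-up tuning knob: the fraction of the page height
    trimmed at each end, where running headers and page numbers often sit.
    """
    # pdfplumber bounding boxes are (x0, top, x1, bottom), measured from the top.
    x0, top, x1, bottom = page.bbox
    trim = (bottom - top) * margin_ratio
    body = page.crop((x0, top + trim, x1, bottom - trim))
    # Fall back to an empty string if no text is returned for the cropped area.
    return body.extract_text() or ""
```

Assuming the parameter works as listed above, the function would be passed as `custom_pdf_fun = body_text_only` when constructing PrettyParser in 'pdf' mode. A suitable margin depends on the layout of the documents being parsed.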

Current language support for the default parser

English, Spanish, German, French, Portuguese

