Docsplit

Docsplit is a command-line utility and Ruby library for splitting documents into their component parts: searchable UTF-8 plain text (via OCR if necessary), page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages, and so on).

Docsplit is currently at version 0.3.2.

Docsplit is an open-source component of DocumentCloud.

Installation & Dependencies | Usage | Internals | Change Log

Installation & Dependencies

  1. Grab the gem:
    gem install docsplit
  2. Install GraphicsMagick. Its ‘gm’ command is used to generate images.
    Either compile it from source, or use a package manager:
    [aptitude | port] install graphicsmagick
  3. Install Poppler. On Linux, use aptitude, apt-get or yum:
    aptitude install poppler-utils
    On the Mac, you can install from source or use MacPorts:
    sudo port install poppler
  4. (Optional) Install Tesseract:
    [aptitude | port] install tesseract
    Without Tesseract installed, you'll still be able to extract text from documents, but you won't be able to automatically OCR them.
  5. (Optional) Install pdftk. On Linux, use aptitude, apt-get or yum:
    aptitude install pdftk
    On the Mac, you can download a recent installer for the binary. Without pdftk installed, you can use Docsplit, but won't be able to split apart a multi-page PDF into single-page PDFs.
  6. (Optional) Install OpenOffice. On Linux, use aptitude, apt-get or yum:
    aptitude install openoffice.org openoffice.org-java-common
    On the Mac, download and install the latest release.
    When you're ready to convert non-PDF documents, you'll need to launch OpenOffice in headless server mode:
    soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard
    On the Mac you'll find the “soffice” command here: /Applications/OpenOffice.org.app/Contents/MacOS/soffice.bin
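
Before running Docsplit, it can help to confirm which of the tools above are on your PATH. The following is a minimal sketch and not part of Docsplit: the helper names find_executable and dependency_report are hypothetical, and the binary names assume default installs (gm for GraphicsMagick, pdftotext for Poppler, soffice for OpenOffice). On the Mac, soffice normally lives inside the application bundle rather than on the PATH, so expect it to show as not found there.

```ruby
# Hypothetical helper, not part of Docsplit: look a command up on the PATH.
def find_executable(name, path = ENV['PATH'].to_s)
  path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Binary names are assumptions based on default installs of each tool.
DEPENDENCIES = {
  'gm'        => 'GraphicsMagick (required, page images)',
  'pdftotext' => 'Poppler (required, text and metadata)',
  'tesseract' => 'Tesseract (optional, OCR)',
  'pdftk'     => 'pdftk (optional, single-page PDFs)',
  'soffice'   => 'OpenOffice (optional, non-PDF conversion)'
}

# Returns one human-readable line per dependency.
def dependency_report(path = ENV['PATH'].to_s)
  DEPENDENCIES.map do |binary, label|
    location = find_executable(binary, path)
    "#{label}: #{location || 'not found'}"
  end
end

puts dependency_report
```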

Note: the gem will take a minute to download — the JODConverter jar file tips the scales at 2MB.

Usage

The Docsplit gem includes both the docsplit command-line utility and a Ruby API. The available commands and options are identical in both.
--output or -o can be passed to any command in order to store the generated files in a directory of your choosing.

images --size --format --pages Ruby: extract_images
Generates an image for each page in the document at the specified resolution and format. Pass --pages or -p to choose the specific pages to image. Passing --size or -s will specify the desired image resolution, and --format or -f will select the format of the final images.

docsplit images example.pdf
docsplit images docs/*.pdf --size 700x,50x50 --format gif --pages 3,10-15,42
Docsplit.extract_images('example.doc', :size => '1000x', :format => [:png, :jpg])
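
To make the --pages syntax concrete, here is a small sketch that expands a list such as 3,10-15,42 into individual page numbers. It is not Docsplit's own parser; expand_pages is a hypothetical helper written only to illustrate how comma-separated pages and ranges combine.

```ruby
# Hypothetical helper, not Docsplit's own parser: expand a --pages style
# list such as "3,10-15,42" into individual page numbers.
def expand_pages(spec)
  spec.split(',').flat_map do |part|
    # "10-15" yields two bounds; a single page such as "42" yields one.
    bounds = part.strip.split('-', 2).map { |n| Integer(n, 10) }
    first = bounds.first
    last  = bounds.last
    if first.nil? || first < 1 || last < first
      raise ArgumentError, "invalid page range: #{part.inspect}"
    end
    (first..last).to_a
  end
end

p expand_pages('3,10-15,42')
# => [3, 10, 11, 12, 13, 14, 15, 42]
```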

text --pages --ocr --no-ocr Ruby: extract_text
Extract the complete UTF-8-encoded plain text of a document to a single file. If you'd like to extract the text for each page separately, pass --pages all. You can use the --ocr and --no-ocr flags to force OCR, or disable it, respectively. By default (if Tesseract is installed) Docsplit will OCR the text of each page for which it fails to extract text directly from the document.

docsplit text path/to/doc.pdf --pages all
docs = Dir['storage/originals/*.doc']
Docsplit.extract_text(docs, :ocr => false, :output => 'storage/text')
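
The interaction between --ocr, --no-ocr and the default behaviour can be summarised as a small decision function. This is a hypothetical model of the rule described above, not Docsplit's internals: the name ocr_page? is made up, "fails to extract text" is approximated as "the extracted text is blank", and a forced --ocr without Tesseract is simply treated as no OCR.

```ruby
# Hypothetical model of the OCR rule; Docsplit's internals may differ.
#   ocr: true  -> --ocr    (always OCR)
#   ocr: false -> --no-ocr (never OCR)
#   ocr: nil   -> default  (OCR only pages with no extractable text)
def ocr_page?(extracted_text, ocr: nil, tesseract_installed: true)
  # Simplification: without Tesseract no OCR can happen at all.
  return false unless tesseract_installed
  return ocr unless ocr.nil?
  # Assumption: a page whose direct extraction is blank counts as a failure.
  extracted_text.to_s.strip.empty?
end

# Stand-in sample pages: one with embedded text, one that is a bare scan.
pages = { 1 => "Disorder in the Court\n", 2 => "   \n" }
pages.each do |number, text|
  puts "page #{number}: OCR? #{ocr_page?(text)}"
end
```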

pages --pages Ruby: extract_pages
Burst apart a document into single-page PDFs. Use --pages to specify the individual pages (or ranges of pages) you'd like to generate.

docsplit pages path/to/doc.pdf --pages 1-10
Docsplit.extract_pages('path/to/presentation.ppt')
Docsplit.extract_pages('doc.pdf', :pages => 1..10)

pdf Ruby: extract_pdf
Convert documents into PDFs. Any type of document that OpenOffice can read may be converted. These include the Microsoft Office formats: doc, docx, ppt, xls and so on, as well as html, odf, rtf, swf, svg, and wpd. The first time that you convert a new file type, OpenOffice will lazy-load the code that processes it — subsequent conversions will be much faster.

docsplit pdf documentation/*.html
Docsplit.extract_pdf('expense_report.xls')

author, date, creator, keywords, producer, subject, title, length
Ruby: extract_...
Retrieve a piece of metadata about the document. The docsplit utility prints the value to stdout; the Ruby API returns it.

docsplit title path/to/stooges.pdf
=> Disorder in the Court
Docsplit.extract_length('path/to/stooges.pdf')
=> 36
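
Since each metadata key maps to an extract_... method, all of them can be gathered in one pass. The sketch below is illustrative only: collect_metadata and FAKE_API are hypothetical names, and the stand-in object exists so the example runs without the gem or its dependencies installed.

```ruby
# Metadata keys listed above; each maps to an extract_<key> method.
METADATA_KEYS = %w[author date creator keywords producer subject title length]

# Hypothetical helper: gather every metadata field into one hash.
# `api` is meant to be the Docsplit module, but any object that responds
# to the extract_<key> methods will do.
def collect_metadata(path, api)
  METADATA_KEYS.each_with_object({}) do |key, result|
    result[key.to_sym] = api.public_send("extract_#{key}", path)
  end
end

# Stand-in for Docsplit so the sketch is runnable on its own; it returns
# placeholder strings rather than real metadata.
FAKE_API = Object.new
METADATA_KEYS.each do |key|
  FAKE_API.define_singleton_method("extract_#{key}") do |path|
    "#{key} of #{File.basename(path)}"
  end
end

p collect_metadata('path/to/stooges.pdf', FAKE_API)
```

With the gem installed, the same helper would be called as collect_metadata('path/to/stooges.pdf', Docsplit).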

Internals

Under the hood, Docsplit is a thin wrapper around the excellent GraphicsMagick, Poppler, pdftk, Tesseract, and JODConverter libraries. Poppler is used to extract text and metadata from PDF documents, pdftk is used to split them apart into pages, and GraphicsMagick is used to generate the page images (internally, it renders them with Ghostscript). JODConverter communicates with OpenOffice to perform the PDF conversions. Tesseract provides a transparent OCR fallback for documents that are simple scans with no embedded text.

Because documents need to be in PDF format before any metadata, text, or images are extracted, it's faster to run docsplit pdf up front if you plan to run more than one extraction on a document. Otherwise, Docsplit will write out the PDF version to a temporary file before proceeding with each command.
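
The convert-once approach might look like the following sketch. It is not taken from Docsplit: extract_everything and CallRecorder are hypothetical, and the code assumes that extract_pdf writes a file named after the original document, with a .pdf extension, into the :output directory. The recorder stands in for the Docsplit module so the example runs without the gem.

```ruby
# Hypothetical helper: convert a document to PDF once, then run every
# extraction against that PDF instead of the original.
# Assumption: extract_pdf writes <output_dir>/<basename>.pdf.
def extract_everything(doc, output_dir, api)
  if File.extname(doc).downcase == '.pdf'
    pdf = doc # already a PDF, no conversion needed
  else
    api.extract_pdf(doc, :output => output_dir)
    pdf = File.join(output_dir, File.basename(doc, '.*') + '.pdf')
  end
  api.extract_text(pdf, :output => output_dir)
  api.extract_images(pdf, :output => output_dir, :format => :png)
  pdf
end

# Stand-in for Docsplit: records which extract_* calls were made.
class CallRecorder
  attr_reader :calls

  def initialize
    @calls = []
  end

  def method_missing(name, *args)
    return super unless name.to_s.start_with?('extract_')
    @calls << [name, args.first]
    nil
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?('extract_') || super
  end
end

recorder = CallRecorder.new
extract_everything('reports/expense_report.xls', 'out', recorder)
recorder.calls.each { |name, path| puts "#{name} <- #{path}" }
```

With the gem installed, passing Docsplit as the third argument would run the real conversions, so the spreadsheet is handed to OpenOffice only once.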

Change Log

0.3.2
Start using the MAGICK_TMPDIR environment variable to prevent parallel Docsplit runs from having the potential to clobber each other's temporary image files.

0.3.1
Added a memory limit to GraphicsMagick while generating the TIFFs for Tesseract OCR -- prevents gm from gobbling up all available memory on large files.

0.3.0
OCR support added via Tesseract, and the --ocr and --no-ocr flags. PDFBox is no longer a dependency, and the gem is many megabytes lighter for it.

0.2.0
Moving to Poppler's pdftotext. PDFBox had issues with Unicode in PDFs and incorrectly split individual pages of text.

0.1.3
Fixing a bug with specifying explicit page ranges for image extraction.

0.1.2
Limiting the memory usage of GraphicsMagick to avoid out of memory errors on very large PDFs.

0.1.1
Upgraded for compatibility with GraphicsMagick 1.3.11.

0.1.0
Initial Docsplit release.

