Parsing PDF file takes a lot of time · Issue #130 · docwire/docwire · GitHub
Parsing PDF file takes a lot of time #130
Open
@dimoqz

Description

I'm trying to parse a PDF using the example, but parsing a small 209 KB file takes more than 5 seconds.

    using namespace docwire;
    std::stringstream out_stream;
    std::filesystem::path("D:\\pdf.pdf") | ParseDetectedFormat<OfficeFormatsParserProvider>() | PlainTextExporter() | out_stream;

Compiler MSVC Version 19.39.33523 for x64.
Build type = release.
DocWire version - 2024.04.04.

But parsing the same file directly with PoDoFo takes only 70 ms:

    std::string input = "D:\\pdf.pdf";
    PdfMemDocument doc;
    doc.Load(input);
    auto& pages = doc.GetPages();
    std::stringstream ss;
    for (unsigned i = 0; i < pages.GetCount(); i++)
    {
        auto& page = pages.GetPageAt(i);

        std::vector<PdfTextEntry> entries;
        page.ExtractTextTo(entries);


        for (auto& entry : entries)
            ss << entry.Text << "\n";
    }
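
For anyone who wants to reproduce the comparison, here is a minimal sketch of how both variants could be timed with `std::chrono`. The helper name `time_ms` is hypothetical and not part of DocWire or PoDoFo; it only measures the wall-clock time of a single call, so the first run may include one-time initialization costs.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not a DocWire or PoDoFo API): runs `work` once
// and returns the elapsed wall-clock time in milliseconds.
template <typename Work>
double time_ms(Work&& work)
{
    // steady_clock is monotonic, so the difference cannot go negative.
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping each snippet above in a lambda, for example `time_ms([&] { /* DocWire pipeline or PoDoFo loop */ })`, and running it more than once should show whether the extra time is spent on every call or only on the first one.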

pdf.pdf
