Parsing PDF file takes a lot of time · Issue #130 · docwire/docwire · GitHub

Parsing PDF file takes a lot of time #130

dimoqz opened this issue Jun 14, 2024 · 7 comments

dimoqz commented Jun 14, 2024

I'm trying to parse a PDF using the example, but parsing a small 209 KB file takes more than 5 seconds.

using namespace docwire;
std::stringstream out_stream;
std::filesystem::path("D:\\pdf.pdf") | ParseDetectedFormat<OfficeFormatsParserProvider>() | PlainTextExporter() | out_stream;

Compiler MSVC Version 19.39.33523 for x64.
Build type = release.
DocWire version - 2024.04.04.

But when parsing the same file directly with PoDoFo, it takes only 70 ms:

    std::string input = "D:\\pdf.pdf";
    PdfMemDocument doc;
    doc.Load(input);
    auto& pages = doc.GetPages();
    std::stringstream ss;
    for (unsigned i = 0; i < pages.GetCount(); i++)
    {
        auto& page = pages.GetPageAt(i);

        std::vector<PdfTextEntry> entries;
        page.ExtractTextTo(entries);


        for (auto& entry : entries)
            ss << entry.Text << "\n";
    }

pdf.pdf

@dimoqz dimoqz changed the title from "Parsing PDF file takes file takes a lot of time" to "Parsing PDF file takes a lot of time" Jun 14, 2024
@as-ascii
Contributor

Thank you for your report and example file. We will analyze the issue using callgrind and check what is happening.

@as-ascii
Contributor

Could you please recheck with the latest code? There were a lot of optimizations introduced in this release:
https://github.com/docwire/docwire/releases/tag/2024.06.19
Please find our quick test below.

Debug version:

$ time vcpkg/installed/x64-linux-dynamic/debug/tools/docwire ~/pdf.pdf > out.txt

real    0m0,888s
user    0m0,859s
sys     0m0,027s

Release version is quicker of course:

$ time vcpkg/installed/x64-linux-dynamic/tools/docwire ~/pdf.pdf > out.txt

real    0m0,381s
user    0m0,142s
sys     0m0,068s

It's still not 70 ms, but the CLI runs some initialization code of its own, such as parsing command-line arguments and loading shared libraries.
Please let us know how the latest version of the code behaves on your side.

@dimoqz
Author
dimoqz commented Jun 27, 2024
#include <chrono>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <sstream>
#include "docwire/docwire.h"

// Current wall-clock time in milliseconds since the clock's epoch.
uint64_t timeSinceEpochMillisec() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(system_clock::now().time_since_epoch()).count();
}

int main(int argc, char *argv[])
{
    using namespace docwire;
    auto start = timeSinceEpochMillisec();
    std::stringstream out_stream;
    // Parse the PDF from a binary stream and export it as plain text into out_stream.
    std::ifstream("D:/pdf.pdf", std::ios_base::binary) | ParseDetectedFormat<OfficeFormatsParserProvider>() | PlainTextExporter() | out_stream;
    std::cout << "Total time: " << timeSinceEpochMillisec() - start << "ms" << std::endl;
    return 0;
}

Same time with the new version.
Total time: 5013ms
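
Editorial sketch: the measurement above times a single run, so it cannot tell a one-time startup cost from a cost paid on every parse. One way to separate the two is to repeat the same work several times and compare the first sample with the median. The helpers `time_runs_ms` and `median_ms` below are hypothetical names introduced for illustration; they are not part of DocWire or PoDoFo, and the sketch assumes the work under test can be wrapped in a callable.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a DocWire API): calls `work` `runs` times and
// returns the duration of each call in milliseconds, in call order.
template <typename Work>
std::vector<std::int64_t> time_runs_ms(Work&& work, int runs)
{
    assert(runs > 0);
    std::vector<std::int64_t> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i)
    {
        // steady_clock is monotonic, so it is not affected by system clock adjustments.
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count());
    }
    return samples;
}

// Median of the samples (upper median for an even count). It is less
// sensitive to a single slow first run than the mean.
inline std::int64_t median_ms(std::vector<std::int64_t> samples)
{
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

Wrapping the parsing pipeline from the snippet above in a lambda and passing it to `time_runs_ms` would show whether only the first sample is slow (pointing at initialization) or every sample is (pointing at the parsing itself).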

@as-ascii
Contributor

We will check with your sample. In the meantime, please check the CLI time on your machine: just run the command in a shell, since 5 s versus less than 1 s should be noticeable even without a timer. It could be something platform-specific, although there is little platform-specific code there. We will double-check on Windows. Are you compiling in debug or release mode?

@dimoqz
Author
dimoqz commented Jun 27, 2024

I checked it with the docwire CLI utility from the 2024.06.24 release.
Execution time is about 6-7 seconds.
With my own test program, the conversion takes about 20 seconds in debug mode and about 5 seconds in release mode.
My machine:

Edition	Windows 10 Enterprise x64
Version	22H2
OS build	19045.4529
Experience	Windows Feature Experience Pack 1000.19058.1000.0

@as-ascii
Contributor

Could be platform specific then. We will verify it and let you know.

@as-ascii
Contributor
as-ascii commented Jul 4, 2024

We analyzed the log artifacts from the latest CI build:
https://github.com/docwire/docwire/actions/runs/9653829575
We compared the files named test-docwire-[triplet]-rel-out.log from Ubuntu 24.04, Windows-2022 and MacOS-12.

Ubuntu 24.04:

test 51
        Start  51: BasicTests/DocumentTests.ParseFromPathTest/pdf_basic_tests  # GetParam() = (1, 9, 0x5623a671137e pointing to "pdf")

51: Test command: /home/runner/work/docwire/docwire/vcpkg/buildtrees/docwire/x64-linux-dynamic-rel/tests/api_tests "--gtest_filter=BasicTests/DocumentTests.ParseFromPathTest/pdf_basic_tests" "--gtest_also_run_disabled_tests"
51: Working Directory: /home/runner/work/docwire/docwire/vcpkg/buildtrees/docwire/x64-linux-dynamic-rel/tests
51: Test timeout computed to be: 10000000
51: Running main() from /home/runner/work/docwire/docwire/vcpkg/buildtrees/gtest/src/v1.14.0-bcf93537a8.clean/googletest/src/gtest_main.cc
51: Note: Google Test filter = BasicTests/DocumentTests.ParseFromPathTest/pdf_basic_tests
51: [==========] Running 1 test from 1 test suite.
51: [----------] Global test environment set-up.
51: [----------] 1 test from BasicTests/DocumentTests
51: [ RUN      ] BasicTests/DocumentTests.ParseFromPathTest/pdf_basic_tests
51: [       OK ] BasicTests/DocumentTests.ParseFromPathTest/pdf_basic_tests (8 ms)
51: [----------] 1 test from BasicTests/DocumentTests (8 ms total)
51: 
51: [----------] Global test environment tear-down
51: [==========] 1 test from 1 test suite ran. (8 ms total)
51: [  PASSED  ] 1 test.
test 79
        Start  79: BasicTests/DocumentTests.ParseFromStreamTest/pdf_basic_tests  # GetParam() = (1, 9, 0x5623a671137e pointing to "pdf")

79: Test command: /home/runner/work/docwire/docwire/vcpkg/buildtrees/docwire/x64-linux-dynamic-rel/tests/api_tests "--gtest_filter=BasicTests/DocumentTests.ParseFromStreamTest/pdf_basic_tests" "--gtest_also_run_disabled_tests"
79: Working Directory: /home/runner/work/docwire/docwire/vcpkg/buildtrees/docwire/x64-linux-dynamic-rel/tests
79: Test timeout computed to be: 10000000
79: Running main() from /home/runner/work/docwire/docwire/vcpkg/buildtrees/gtest/src/v1.14.0-bcf93537a8.clean/googletest/src/gtest_main.cc
79: Note: Google Test filter = BasicTests/DocumentTests.ParseFromStreamTest/pdf_basic_tests
79: [==========] Running 1 test from 1 test suite.
79: [----------] Global test environment set-up.
79: [----------] 1 test from BasicTests/DocumentTests
79: [ RUN      ] BasicTests/DocumentTests.ParseFromStreamTest/pdf_basic_tests
79: [       OK ] BasicTests/DocumentTests.ParseFromStreamTest/pdf_basic_tests (8 ms)
79: [----------] 1 test from BasicTests/DocumentTests (8 ms total)
79: 
79: [----------] Global test environment tear-down
79: [==========] 1 test from 1 test suite ran. (8 ms total)
79: [  PASSED  ] 1 test.
test 212
        Start 212: BasicTests/MultithreadedTest.ReadFromFileTests/pdf_multithreaded_tests  # GetParam() = (1, 9, 0x5623a671137e pointing to "pdf")

212: Test command: /home/runner/work/docwire/docwire/vcpkg/buildtrees/docwire/x64-linux-dynamic-rel/tests/api_tests "--gtest_filter=BasicTests/MultithreadedTest.ReadFromFileTests/pdf_multithreaded_tests" "--gtest_also_run_disabled_tests"
212: Working Directory: /home/runner/work/docwire/docwire/vcpkg/buildtrees/docwire/x64-linux-dynamic-rel/tests
212: Test timeout computed to be: 10000000
212: Running main() from /home/runner/work/docwire/docwire/vcpkg/buildtrees/gtest/src/v1.14.0-bcf93537a8.clean/googletest/src/gtest_main.cc
212: Note: Google Test filter = BasicTests/MultithreadedTest.ReadFromFileTests/pdf_multithreaded_tests
212: [==========] Running 1 test from 1 test suite.
212: [----------] Global test environment set-up.
212: [----------] 1 test from BasicTests/MultithreadedTest
212: [ RUN      ] BasicTests/MultithreadedTest.ReadFromFileTests/pdf_multithreaded_tests
212: [       OK ] BasicTests/MultithreadedTest.ReadFromFileTests/pdf_multithreaded_tests (16 ms)
212: [----------] 1 test from BasicTests/MultithreadedTest (16 ms total)
212: 
212: [----------] Global test environment tear-down
212: [==========] 1 test from 1 test suite ran. (16 ms total)
212: [  PASSED  ] 1 test.

Windows-2022:

51: Test command: D:\a\docwire\docwire\vcpkg\buildtrees\docwire\x64-windows-rel\tests\api_tests.exe "--gtest_filter=BasicTests/DocumentTests.ParseFromPathTest/pdf_basic_tests" "--gtest_also_run_disabled_tests"
51: Working Directory: D:/a/docwire/docwire/vcpkg/buildtrees/docwire/x64-windows-rel/tests
51: Test timeout computed to be: 10000000
51: Running main() from D:\a\docwire\docwire\vcpkg\buildtrees\gtest\src\v1.14.0-bcf93537a8.clean\googletest\src\gtest_main.cc
51: Note: Google Test filter = BasicTests/DocumentTests.ParseFromPathTest/pdf_basic_tests
51: [==========] Running 1 test from 1 test suite.
51: [----------] Global test environment set-up.
51: [----------] 1 test from BasicTests/DocumentTests
51: [ RUN      ] BasicTests/DocumentTests.ParseFromPathTest/pdf_basic_tests
51: [       OK ] BasicTests/DocumentTests.ParseFromPathTest/pdf_basic_tests (67 ms)
51: [----------] 1 test from BasicTests/DocumentTests (67 ms total)
51: 
51: [----------] Global test environment tear-down
51: [==========] 1 test from 1 test suite ran. (67 ms total)
51: [  PASSED  ] 1 test.
test 79
        Start  79: BasicTests/DocumentTests.ParseFromStreamTest/pdf_basic_tests  # GetParam() = (1, 9, 00007FF6BF15F85C pointing to "pdf")

79: Test command: D:\a\docwire\docwire\vcpkg\buildtrees\docwire\x64-windows-rel\tests\api_tests.exe "--gtest_filter=BasicTests/DocumentTests.ParseFromStreamTest/pdf_basic_tests" "--gtest_also_run_disabled_tests"
79: Working Directory: D:/a/docwire/docwire/vcpkg/buildtrees/docwire/x64-windows-rel/tests
79: Test timeout computed to be: 10000000
79: Running main() from D:\a\docwire\docwire\vcpkg\buildtrees\gtest\src\v1.14.0-bcf93537a8.clean\googletest\src\gtest_main.cc
79: Note: Google Test filter = BasicTests/DocumentTests.ParseFromStreamTest/pdf_basic_tests
79: [==========] Running 1 test from 1 test suite.
79: [----------] Global test environment set-up.
79: [----------] 1 test from BasicTests/DocumentTests
79: [ RUN      ] BasicTests/DocumentTests.ParseFromStreamTest/pdf_basic_tests
79: [       OK ] BasicTests/DocumentTests.ParseFromStreamTest/pdf_basic_tests (66 ms)
79: [----------] 1 test from BasicTests/DocumentTests (66 ms total)
79: 
79: [----------] Global test environment tear-down
79: [==========] 1 test from 1 test suite ran. (66 ms total)
79: [  PASSED  ] 1 test.
test 212
        Start 212: BasicTests/MultithreadedTest.ReadFromFileTests/pdf_multithreaded_tests  # GetParam() = (1, 9, 00007FF6BF15F85C pointing to "pdf")

212: Test command: D:\a\docwire\docwire\vcpkg\buildtrees\docwire\x64-windows-rel\tests\api_tests.exe "--gtest_filter=BasicTests/MultithreadedTest.ReadFromFileTests/pdf_multithreaded_tests" "--gtest_also_run_disabled_tests"
212: Working Directory: D:/a/docwire/docwire/vcpkg/buildtrees/docwire/x64-windows-rel/tests
212: Test timeout computed to be: 10000000
212: Running main() from D:\a\docwire\docwire\vcpkg\buildtrees\gtest\src\v1.14.0-bcf93537a8.clean\googletest\src\gtest_main.cc
212: Note: Google Test filter = BasicTests/MultithreadedTest.ReadFromFileTests/pdf_multithreaded_tests
212: [==========] Running 1 test from 1 test suite.
212: [----------] Global test environment set-up.
212: [----------] 1 test from BasicTests/MultithreadedTest
212: [ RUN      ] BasicTests/MultithreadedTest.ReadFromFileTests/pdf_multithreaded_tests
212: [       OK ] BasicTests/MultithreadedTest.ReadFromFileTests/pdf_multithreaded_tests (133 ms)
212: [----------] 1 test from BasicTests/MultithreadedTest (133 ms total)
212: 
212: [----------] Global test environment tear-down
212: [==========] 1 test from 1 test suite ran. (133 ms total)
212: [  PASSED  ] 1 test.

MacOS-12:

test 51
        Start  51: BasicTests/DocumentTests.ParseFromPathTest/pdf_basic_tests  # GetParam() = (1, 9, 0x1038fdde5 pointing to "pdf")

51: Test command: /Users/runner/work/docwire/docwire/vcpkg/buildtrees/docwire/x64-osx-dynamic-rel/tests/api_tests "--gtest_filter=BasicTests/DocumentTests.ParseFromPathTest/pdf_basic_tests" "--gtest_also_run_disabled_tests"
51: Working Directory: /Users/runner/work/docwire/docwire/vcpkg/buildtrees/docwire/x64-osx-dynamic-rel/tests
51: Test timeout computed to be: 10000000
51: Running main() from /Users/runner/work/docwire/docwire/vcpkg/buildtrees/gtest/src/v1.14.0-bcf93537a8.clean/googletest/src/gtest_main.cc
51: Note: Google Test filter = BasicTests/DocumentTests.ParseFromPathTest/pdf_basic_tests
51: [==========] Running 1 test from 1 test suite.
51: [----------] Global test environment set-up.
51: [----------] 1 test from BasicTests/DocumentTests
51: [ RUN      ] BasicTests/DocumentTests.ParseFromPathTest/pdf_basic_tests
51: [       OK ] BasicTests/DocumentTests.ParseFromPathTest/pdf_basic_tests (10 ms)
51: [----------] 1 test from BasicTests/DocumentTests (10 ms total)
51: 
51: [----------] Global test environment tear-down
51: [==========] 1 test from 1 test suite ran. (10 ms total)
51: [  PASSED  ] 1 test.
test 79
        Start  79: BasicTests/DocumentTests.ParseFromStreamTest/pdf_basic_tests  # GetParam() = (1, 9, 0x1038fdde5 pointing to "pdf")

79: Test command: /Users/runner/work/docwire/docwire/vcpkg/buildtrees/docwire/x64-osx-dynamic-rel/tests/api_tests "--gtest_filter=BasicTests/DocumentTests.ParseFromStreamTest/pdf_basic_tests" "--gtest_also_run_disabled_tests"
79: Working Directory: /Users/runner/work/docwire/docwire/vcpkg/buildtrees/docwire/x64-osx-dynamic-rel/tests
79: Test timeout computed to be: 10000000
79: Running main() from /Users/runner/work/docwire/docwire/vcpkg/buildtrees/gtest/src/v1.14.0-bcf93537a8.clean/googletest/src/gtest_main.cc
79: Note: Google Test filter = BasicTests/DocumentTests.ParseFromStreamTest/pdf_basic_tests
79: [==========] Running 1 test from 1 test suite.
79: [----------] Global test environment set-up.
79: [----------] 1 test from BasicTests/DocumentTests
79: [ RUN      ] BasicTests/DocumentTests.ParseFromStreamTest/pdf_basic_tests
79: [       OK ] BasicTests/DocumentTests.ParseFromStreamTest/pdf_basic_tests (9 ms)
79: [----------] 1 test from BasicTests/DocumentTests (9 ms total)
79: 
79: [----------] Global test environment tear-down
79: [==========] 1 test from 1 test suite ran. (9 ms total)
79: [  PASSED  ] 1 test.
test 212
        Start 212: BasicTests/MultithreadedTest.ReadFromFileTests/pdf_multithreaded_tests  # GetParam() = (1, 9, 0x1038fdde5 pointing to "pdf")

212: Test command: /Users/runner/work/docwire/docwire/vcpkg/buildtrees/docwire/x64-osx-dynamic-rel/tests/api_tests "--gtest_filter=BasicTests/MultithreadedTest.ReadFromFileTests/pdf_multithreaded_tests" "--gtest_also_run_disabled_tests"
212: Working Directory: /Users/runner/work/docwire/docwire/vcpkg/buildtrees/docwire/x64-osx-dynamic-rel/tests
212: Test timeout computed to be: 10000000
212: Running main() from /Users/runner/work/docwire/docwire/vcpkg/buildtrees/gtest/src/v1.14.0-bcf93537a8.clean/googletest/src/gtest_main.cc
212: Note: Google Test filter = BasicTests/MultithreadedTest.ReadFromFileTests/pdf_multithreaded_tests
212: [==========] Running 1 test from 1 test suite.
212: [----------] Global test environment set-up.
212: [----------] 1 test from BasicTests/MultithreadedTest
212: [ RUN      ] BasicTests/MultithreadedTest.ReadFromFileTests/pdf_multithreaded_tests
212: [       OK ] BasicTests/MultithreadedTest.ReadFromFileTests/pdf_multithreaded_tests (17 ms)
212: [----------] 1 test from BasicTests/MultithreadedTest (17 ms total)
212: 
212: [----------] Global test environment tear-down
212: [==========] 1 test from 1 test suite ran. (17 ms total)
212: [  PASSED  ] 1 test.

It looks like extracting text from PDF with the Windows version of the DocWire SDK is significantly slower than with the Linux or macOS version. Many factors can cause this. We already have ideas for refactoring and optimizing PDF processing, so the most important step will be to check the speed on Windows before merging the refactored code. I will leave the issue open until the refactoring is done.
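
Editorial sketch: the comparison above can be made concrete by pulling the durations out of the "[ OK ]" result lines and dividing the Windows figure by the Linux one. The functions `parse_ok_duration_ms` and `slowdown` are hypothetical names introduced for illustration, not part of DocWire or Google Test, and the regular expression assumes the result-line layout seen in the logs above.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Extracts the duration from a Google Test result line such as
//   "51: [       OK ] Suite.Test/param (67 ms)".
// Returns std::nullopt for any line that is not an "[ OK ]" result line.
inline std::optional<long> parse_ok_duration_ms(const std::string& line)
{
    static const std::regex ok_line(R"(\[\s+OK\s+\]\s+\S+\s+\((\d+) ms\))");
    std::smatch match;
    if (!std::regex_search(line, match, ok_line))
        return std::nullopt;
    return std::stol(match[1].str());
}

// How many times slower `measured_ms` is than `baseline_ms`.
inline double slowdown(long measured_ms, long baseline_ms)
{
    assert(baseline_ms > 0);
    return static_cast<double>(measured_ms) / static_cast<double>(baseline_ms);
}
```

Applied to the figures in the logs above, the Windows runs come out roughly 8 times slower than the Ubuntu runs in all three tests (67 ms vs 8 ms, 66 ms vs 8 ms, and 133 ms vs 16 ms).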
