Parsing PDF file takes a lot of time #130
Thank you for your report and example file. We will analyze the issue using callgrind and check what is happening. |
Could you please recheck with the latest code? A lot of optimizations were introduced in this release.
Debug version:
$ time vcpkg/installed/x64-linux-dynamic/debug/tools/docwire ~/pdf.pdf > out.txt
real 0m0,888s
user 0m0,859s
sys 0m0,027s
Release version is quicker, of course:
$ time vcpkg/installed/x64-linux-dynamic/tools/docwire ~/pdf.pdf > out.txt
real 0m0,381s
user 0m0,142s
sys 0m0,068s
It's still not 70 ms, but there is some initialization code in the CLI, like parsing command-line arguments, loading shared libraries, etc. |
Same time with the new version. |
We will check with your sample; please also check the CLI time on your machine. Just run the command in a shell: a difference of 5 s vs. under 1 s should be noticeable even without a timer. It could be something platform-specific as well, although there is little platform-specific code there. We will double-check on Windows. Are you compiling in debug or release mode? |
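For a repeatable number rather than a visual impression, the CLI run can be timed from a small script. The following is only a sketch, not part of DocWire: `time_command` and `summarize` are hypothetical helper names, and the executable and file paths in the usage comment are placeholders to be replaced with the real ones. It works the same way on Windows, Linux and macOS, which makes the platforms easier to compare.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=5):
    """Run cmd `runs` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard stdout so that console rendering does not distort the timing.
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (min, median, max) of the measured durations."""
    return min(durations), statistics.median(durations), max(durations)


# Usage (placeholder paths): python time_cli.py path/to/docwire path/to/pdf.pdf
if __name__ == "__main__" and len(sys.argv) > 1:
    fastest, median, slowest = summarize(time_command(sys.argv[1:]))
    print(f"min {fastest:.3f}s  median {median:.3f}s  max {slowest:.3f}s")
```

The first run often includes cold-cache effects such as loading shared libraries from disk, so the median over several runs is usually a fairer figure to compare against the 70 ms reference than a single measurement.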
I checked it with the docwire CLI utility from the 2024.06.24 release.
|
Could be platform specific then. We will verify it and let you know. |
We analyzed the log artifacts from the latest CI build:
Ubuntu 24.04:
Windows-2022:
MacOS-12:
It looks like extracting text from a PDF with the Windows version of DocWire SDK is significantly slower than with the Linux or macOS version. Many factors can cause this. Since there are already ideas for refactoring and optimizing PDF processing, the most important step will be to check the speed on Windows before merging the refactored code. I will leave the issue open until the refactoring is done. |
I'm trying to parse a PDF using the example, but parsing a small 209 KB file takes more than 5 seconds.
Compiler MSVC Version 19.39.33523 for x64.
Build type = release.
DocWire version - 2024.04.04.
But when parsing directly with PoDoFo, it takes only 70 ms.
pdf.pdf