8000 ass: don't fail to read files larger than INT_MAX by TheOneric · Pull Request #717 · libass/libass · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

ass: don't fail to read files larger than INT_MAX #717

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

TheOneric
Copy link
Member

Yet another int overuse in old code. Without this UB will occur on conversion to signed int and if the value wraps around to something negative, claim there was an IO error using whatever the last errno number was which is misleading.

Not sure if it’s best practice to check ferror, feof or both.

@TheOneric
Copy link
Member Author
TheOneric commented Nov 30, 2023

Not sure if it’s best practice to check ferror, feof or both.

In the previous revision checking just one should have been fine afaict.
However the patch omitted to also converted bytes_read a successfully sz to size_t and there were two more preëxisting problems.

  • due to ftell returning a long file size was effectively still limited to <4GiB on Win32 LLP64
  • ftell+fseek is not guaranteed to work in ISO C (see message of second commit)

A newly added second commit now replaces the fseek+ftell approach

EDIT: force-pushed again as I accidentally merged most of my changes of commit 1 with commit 2 before
... maybe there’s no point in keeping those separate anyway?

@rcombs
Copy link
Member
rcombs commented Nov 30, 2023

Why not just use ftello or _ftelli64 when available? I believe that should cover all relevant platforms.

@astiob
Copy link
Member
astiob commented Nov 30, 2023

FWIW I think it’s good to support files that are non-seekable; and even a seekable file’s size can change mid-read if someone concurrently writes to it. We could, of course, use ftello etc. as an optimization hint if it’s available.

On a separate note, why do we need to read the entire file into a buffer at all? Can’t we process it chunk by chunk? We probably need whole lines, but we should be able to read a chunk, process any whole lines, and buffer only the last, incomplete line as we proceed to read the next chunk.

@astiob
Copy link
Member
astiob commented Nov 30, 2023

(Another non-ISO-C alternative for getting the file size is the stat family of functions.)

@TheOneric
Copy link
Member Author
TheOneric commented Dec 1, 2023

EDIT: Welp, GitHub only now shows me the new comments

Why not just use ftello or _ftelli64 when available? I believe that should cover all relevant platforms.

I didn’t know Win32 has a ftello equivalent. Some care would need to be taken to correctly deal with the different types, possible _FILE_OFFSET_BITS values and SIZE_MAX but seems doable. However, apart from the easily avoidable fonts_dir, this would afaik be the first instance of us requiring a platform-specific API without an ISO fallback. Also, does POSIX and Win32 actually guarantee fseek(..., SEEK_END) to be working and meaningful for binary streams?

ISO C, here draft n3047 of C23 says:

§ 7.23.9.2 - Paragraph 3

[...] A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.

POSIX docs are silent about this particular case, but as usual defer to ISO C for everything not explicitly marked as an extension and Win32 docs are also silent on the matter.

In practice it seems indeed unlikely anyone is using ass_load_file on a system without proper support for SEEK_END judging from how nobody complained about it so far. Non-POSIX and non-Win32 might perhaps also be rare.

There is a nice bonus to the read-loop though, apart from avoiding possibly purely theoretical problems: it works with streams which are never seekable and we can now do something like

zstd -c -d ./large.ass.zst | test/test out.png /dev/stdin 12

Side note: while I think attempts to seek on streams should fail cleanly with POSIX or ISO C, concerningly Microsoft docs contain the sentence “On devices incapable of seeking, the return value is undefined.”.

@TheOneric
Copy link
Member Author
TheOneric commented Dec 1, 2023

On a separate note, why do we need to read the entire file into a buffer at all? Can’t we process it chunk by chunk? We probably need whole lines, but we should be able to read a chunk, process any whole lines, and buffer only the last, incomplete line as we proceed to read the next chunk.

We could parse ASS files line by line, yeah. But ass_load_file is also used to open and load font files from the fonts directory if specified.

In fact loading fonts file and parsing are the only uses of this funtion. I guess moving the “load whole file” version into fontselect.c and for parsing instead streaming the file content with line buffering to ass_read_data would be a good future optimisation and if #506 is revived would not even have to care about line buffering itself.

@MrSmile
Copy link
Member
MrSmile commented Dec 1, 2023

I think even in the case of incremental read it's better to use some sensible approximation for initial buffer size (fstat?).
Actually, ass_add_font does malloc + memcpy again, so in the future it'd be good to eliminate that too.

@TheOneric
Copy link
Member Author

Now using fstat if available to allocate just the right size if it can be determine beforehand.

Copy link
Member
@astiob astiob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to define _FILE_OFFSET_BITS=64 as well, and whatever equivalent incantations other platforms may need. There’s probably (hopefully) a ready-to-use Autoconf macro for this.

@TheOneric TheOneric force-pushed the ass-read-intmax branch 2 times, most recently from 31cd464 to c7fc9cc Compare December 2, 2023 00:47
@TheOneric
Copy link
Member Author

We probably want to define _FILE_OFFSET_BITS=64 as well, and whatever equivalent incantations

Now done via AC_SYS_LARGEFILE

@TheOneric
Copy link
Member Author

Though.. since we currently load the file wholesale into memory, does LARGEFILE even make a difference? SIZE_MAX will limit us to 32bits anyway.
I guess it doesn't hurt though and will become useful when in the future we switch to parsing ASS files chunkwise.

@astiob
Copy link
Member
astiob commented Dec 2, 2023

Good point 😅 But yeah, it shouldn’t hurt and might help in the future, so it’s probably good to have it added while we remember about it.

Copy link
Member
@MrSmile MrSmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also there's typo (forregular) in the second commit message.

if (!buf)
goto fail;

size_t bytes_read = 0;
while (true) {
size_t read = fread(buf + bytes_read, 1, CHUNK_SIZE, fp);
size_t maxread = bufmax - bytes_read;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to ensure that maxread is always greater than CHUNK_SIZE to avoid suboptimal behavior at the file end.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CHUNK_SIZE no longer exists, but the new version will no longer reallloc anything for a know file size (which was not modified since the fstat call). Does this seem good enough?

if (fstat(fd, &info))
return false;

if (!S_ISREG(info.st_mode) || info.st_size > SIZE_MAX)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For > 4G files at 32-bit system that triggers failure, so ass_load_file fallbacks to incremental read which eventually fail due to out of memory. I think ass_get_file_size should return uint64_t (or int64_t which can be combined with −1 error) and let caller handle overflow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right. If st_size >= SIZE_MAX, we probably just want to abandon the file immediately.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ass_get_file_size now returns an int and differentiates success (0), unknown size (>0) and too large (<0), which gets used in load_file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a function with the name ass_get_file_size should return a known file size even if that size is larger than SIZE_MAX. Currently we don't have a use for such sizes, but that can change in the future (for example, for a detailed error message).

Copy link
Member Author
@TheOneric TheOneric Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we are presumably unlikely to encounter files larger than >2^64 bytes soon, POSIX’s stat struct records the size in off_t (Windows’s fstati64 as a 64bit type) and off_t might be any width, so the return type would need to be intmax_t (when avoiding <sys/types.h> and the Win32 compatibility macros for off_t to be reexported by ass_filesystem.h).
However we only load two types of files:

  • font files, which we’ll need to fully load into memory and thus limit to SIZE_MAX anyway
  • ASS styles or full files for which we want to switch to a streaming approach so get_file_size will be irrelevant (i was intending to do that some later date, but if optimising for this case becomes a problem perhaps I should already do that now)

I’d be surprised if we ever have to deal with more than those types and therefore don't see where get_file_size > SIZE_MAX would ever be useful. Do you have some future usecase in mind?

Copy link
Member
@astiob astiob Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, there’s not much choice on platforms with 64-bit size_t. (Some) common compilers only got 128-bit support relatively recently, if indeed at all. Must be the same reason why intmax_t remained 64-bit when I last checked.

Then we can't log a warning though for too large files

Why not log it inside this function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't care about platforms with 64-bit size_t, such large files would be unfeasible for many years to come. I care about platforms with 32-bit size_t, but > 4G files. In such cases file size is perfectly known, but ass_get_file_size returns a error, which I think isn't right, especially if that function would be reused in another context in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

especially if that function would be reused in another context in the future.

At least with what I have in mind, it ass_get_file_size will only ever be used for files we fully load into memory (font files and atm ASS files), while future streamed ASS files won't use it at all (instead read fixed size chunks like the second commit of this series to have an almost fixed memory usage (if lines are bigger than CHUNK_SIZE realloc will still be needed))

Do you prefer one of the alternative names or have another name suggestion to reflect the limitation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth noting that even if we stream a larger file in the future, we still load all events into memory, so unless the file is full of comment lines or embedded fonts, we'll likely run out of 32-bit memory space anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer (u)int64_t for a file size instead of complications with another failure mode and its subsequent handling. Among alternative names I feel that ass_get_file_memory_size is the most appropriate.

Streams do not support seek, thus we are currently unable to load them.
But even for regular files, there are two problems with fseek+ftell.

For one, ftell returns a long, which on Win32’s LLP64 ABI
will artificially limit the loadable file size to 2GiB.

But even if long and size_t are of equal width, ISO C doesn’t require
fseek(.., SEEK_END) to be meaningfully implemented for binary-mode files
as we are using here (and for text-mode ftell need not be meaningful).

Instead just repeatedly load the next chunk of bytes until the file ends
or an error occurs. This avoids those problems, but has the drawback of
likely overallocating the buffer to some degree.
For all current uses of ass_load_file the returned buffer is short-lived
and this is liekly inproblematic, but the following commit will optimise
this for all files whose size is known ahead of time.
@TheOneric TheOneric added this to the 0.18.0 milestone Dec 5, 2023
@TheOneric
Copy link
Member Author

Renamed the function to ass_get_file_memory_size to better reflect its additional limit to what can be loaded into (virtually) continuos memory and added a warning log for too large files.

If this is now acceptable (or this + using in band SIZE_MAX/0 as error codes instead of split return size and error), great.
But otherwise, perhaps I should just already implement the chunk-based parsing for ASS files which will make it clearer how some of this will (not) be used in the future and so we don't over-optimise something getting replaced soon. I was originally intending to do that sometime later, but given this PR no longer is the initial simple “just change a few types” anyway, I might as well go all the way.

@TheOneric TheOneric mentioned this pull request Dec 8, 2023
8 tasks
arch1t3cht added a commit to arch1t3cht/libass that referenced this pull request Dec 27, 2023
This is in preparation for libass#717 .
Large file support is already enabled by default in meson.
torque pushed a commit to TypesettingTools/libass that referenced this pull request Apr 8, 2024
This is in preparation for libass#717 .
Large file support is already enabled by default in meson.
@TheOneric TheOneric modified the milestones: 0.17.2, 0.18.0 May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0