ass: don't fail to read files larger than INT_MAX #717

TheOneric · 2023-11-29T19:22:04Z

Yet another int overuse in old code. Without this UB will occur on conversion to signed int and if the value wraps around to something negative, claim there was an IO error using whatever the last errno number was which is misleading.

Not sure if it’s best practice to check ferror, feof or both.

TheOneric · 2023-11-30T17:47:16Z

Not sure if it’s best practice to check ferror, feof or both.

In the previous revision checking just one should have been fine afaict.
However the patch omitted to also converted bytes_read a successfully sz to size_t and there were two more preëxisting problems.

due to ftell returning a long file size was effectively still limited to <4GiB on Win32 LLP64
ftell+fseek is not guaranteed to work in ISO C (see message of second commit)

A newly added second commit now replaces the fseek+ftell approach

EDIT: force-pushed again as I accidentally merged most of my changes of commit 1 with commit 2 before
... maybe there’s no point in keeping those separate anyway?

rcombs · 2023-11-30T20:04:13Z

Why not just use ftello or _ftelli64 when available? I believe that should cover all relevant platforms.

astiob · 2023-11-30T23:49:57Z

FWIW I think it’s good to support files that are non-seekable; and even a seekable file’s size can change mid-read if someone concurrently writes to it. We could, of course, use ftello etc. as an optimization hint if it’s available.

On a separate note, why do we need to read the entire file into a buffer at all? Can’t we process it chunk by chunk? We probably need whole lines, but we should be able to read a chunk, process any whole lines, and buffer only the last, incomplete line as we proceed to read the next chunk.

astiob · 2023-11-30T23:53:48Z

(Another non-ISO-C alternative for getting the file size is the stat family of functions.)

TheOneric · 2023-12-01T00:12:46Z

EDIT: Welp, GitHub only now shows me the new comments

Why not just use ftello or _ftelli64 when available? I believe that should cover all relevant platforms.

I didn’t know Win32 has a ftello equivalent. Some care would need to be taken to correctly deal with the different types, possible _FILE_OFFSET_BITS values and SIZE_MAX but seems doable. However, apart from the easily avoidable fonts_dir, this would afaik be the first instance of us requiring a platform-specific API without an ISO fallback. Also, does POSIX and Win32 actually guarantee fseek(..., SEEK_END) to be working and meaningful for binary streams?

ISO C, here draft n3047 of C23 says:

§ 7.23.9.2 - Paragraph 3

[...] A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.

POSIX docs are silent about this particular case, but as usual defer to ISO C for everything not explicitly marked as an extension and Win32 docs are also silent on the matter.

In practice it seems indeed unlikely anyone is using ass_load_file on a system without proper support for SEEK_END judging from how nobody complained about it so far. Non-POSIX and non-Win32 might perhaps also be rare.

There is a nice bonus to the read-loop though, apart from avoiding possibly purely theoretical problems: it works with streams which are never seekable and we can now do something like

zstd -c -d ./large.ass.zst | test/test out.png /dev/stdin 12

Side note: while I think attempts to seek on streams should fail cleanly with POSIX or ISO C, concerningly Microsoft docs contain the sentence “On devices incapable of seeking, the return value is undefined.”.

TheOneric · 2023-12-01T00:16:09Z

On a separate note, why do we need to read the entire file into a buffer at all? Can’t we process it chunk by chunk? We probably need whole lines, but we should be able to read a chunk, process any whole lines, and buffer only the last, incomplete line as we proceed to read the next chunk.

We could parse ASS files line by line, yeah. But ass_load_file is also used to open and load font files from the fonts directory if specified.

In fact loading fonts file and parsing are the only uses of this funtion. I guess moving the “load whole file” version into fontselect.c and for parsing instead streaming the file content with line buffering to ass_read_data would be a good future optimisation and if #506 is revived would not even have to care about line buffering itself.

MrSmile · 2023-12-01T07:54:18Z

I think even in the case of incremental read it's better to use some sensible approximation for initial buffer size (fstat?).
Actually, ass_add_font does malloc + memcpy again, so in the future it'd be good to eliminate that too.

TheOneric · 2023-12-01T19:59:14Z

Now using fstat if available to allocate just the right size if it can be determine beforehand.

libass/ass.c

libass/ass_filesystem.c

libass/ass.c

astiob

We probably want to define _FILE_OFFSET_BITS=64 as well, and whatever equivalent incantations other platforms may need. There’s probably (hopefully) a ready-to-use Autoconf macro for this.

TheOneric · 2023-12-02T00:48:52Z

We probably want to define _FILE_OFFSET_BITS=64 as well, and whatever equivalent incantations

Now done via AC_SYS_LARGEFILE

TheOneric · 2023-12-02T00:54:14Z

Though.. since we currently load the file wholesale into memory, does LARGEFILE even make a difference? SIZE_MAX will limit us to 32bits anyway.
I guess it doesn't hurt though and will become useful when in the future we switch to parsing ASS files chunkwise.

astiob · 2023-12-02T15:24:54Z

Good point 😅 But yeah, it shouldn’t hurt and might help in the future, so it’s probably good to have it added while we remember about it.

MrSmile

Also there's typo (forregular) in the second commit message.

libass/ass.c

MrSmile · 2023-12-03T12:38:15Z

libass/ass.c

    if (!buf)
        goto fail;

    size_t bytes_read = 0;
    while (true) {
-        size_t read = fread(buf + bytes_read, 1, CHUNK_SIZE, fp);
+        size_t maxread = bufmax - bytes_read;


I suggest to ensure that maxread is always greater than CHUNK_SIZE to avoid suboptimal behavior at the file end.

CHUNK_SIZE no longer exists, but the new version will no longer reallloc anything for a know file size (which was not modified since the fstat call). Does this seem good enough?

libass/ass.c

MrSmile · 2023-12-03T12:49:54Z

libass/ass_filesystem.c

+    if (fstat(fd, &info))
+        return false;
+
+    if (!S_ISREG(info.st_mode) || info.st_size > SIZE_MAX)


For > 4G files at 32-bit system that triggers failure, so ass_load_file fallbacks to incremental read which eventually fail due to out of memory. I think ass_get_file_size should return uint64_t (or int64_t which can be combined with −1 error) and let caller handle overflow.

Oh, right. If st_size >= SIZE_MAX, we probably just want to abandon the file immediately.

ass_get_file_size now returns an int and differentiates success (0), unknown size (>0) and too large (<0), which gets used in load_file.

I think a function with the name ass_get_file_size should return a known file size even if that size is larger than SIZE_MAX. Currently we don't have a use for such sizes, but that can change in the future (for example, for a detailed error message).

While we are presumably unlikely to encounter files larger than >2^64 bytes soon, POSIX’s stat struct records the size in off_t (Windows’s fstati64 as a 64bit type) and off_t might be any width, so the return type would need to be intmax_t (when avoiding <sys/types.h> and the Win32 compatibility macros for off_t to be reexported by ass_filesystem.h).
However we only load two types of files:

font files, which we’ll need to fully load into memory and thus limit to SIZE_MAX anyway

ASS styles or full files for which we want to switch to a streaming approach so get_file_size will be irrelevant (i was intending to do that some later date, but if optimising for this case becomes a problem perhaps I should already do that now)

I’d be surprised if we ever have to deal with more than those types and therefore don't see where get_file_size > SIZE_MAX would ever be useful. Do you have some future usecase in mind?

I mean, there’s not much choice on platforms with 64-bit size_t. (Some) common compilers only got 128-bit support relatively recently, if indeed at all. Must be the same reason why intmax_t remained 64-bit when I last checked.

Then we can't log a warning though for too large files

Why not log it inside this function?

I don't care about platforms with 64-bit size_t, such large files would be unfeasible for many years to come. I care about platforms with 32-bit size_t, but > 4G files. In such cases file size is perfectly known, but ass_get_file_size returns a error, which I think isn't right, especially if that function would be reused in another context in the future.

especially if that function would be reused in another context in the future.

At least with what I have in mind, it ass_get_file_size will only ever be used for files we fully load into memory (font files and atm ASS files), while future streamed ASS files won't use it at all (instead read fixed size chunks like the second commit of this series to have an almost fixed memory usage (if lines are bigger than CHUNK_SIZE realloc will still be needed))

Do you prefer one of the alternative names or have another name suggestion to reflect the limitation?

It's worth noting that even if we stream a larger file in the future, we still load all events into memory, so unless the file is full of comment lines or embedded fonts, we'll likely run out of 32-bit memory space anyway.

I prefer (u)int64_t for a file size instead of complications with another failure mode and its subsequent handling. Among alternative names I feel that ass_get_file_memory_size is the most appropriate.

Streams do not support seek, thus we are currently unable to load them. But even for regular files, there are two problems with fseek+ftell. For one, ftell returns a long, which on Win32’s LLP64 ABI will artificially limit the loadable file size to 2GiB. But even if long and size_t are of equal width, ISO C doesn’t require fseek(.., SEEK_END) to be meaningfully implemented for binary-mode files as we are using here (and for text-mode ftell need not be meaningful). Instead just repeatedly load the next chunk of bytes until the file ends or an error occurs. This avoids those problems, but has the drawback of likely overallocating the buffer to some degree. For all current uses of ass_load_file the returned buffer is short-lived and this is liekly inproblematic, but the following commit will optimise this for all files whose size is known ahead of time.

TheOneric · 2023-12-07T22:36:52Z

Renamed the function to ass_get_file_memory_size to better reflect its additional limit to what can be loaded into (virtually) continuos memory and added a warning log for too large files.

If this is now acceptable (or this + using in band SIZE_MAX/0 as error codes instead of split return size and error), great.
But otherwise, perhaps I should just already implement the chunk-based parsing for ASS files which will make it clearer how some of this will (not) be used in the future and so we don't over-optimise something getting replaced soon. I was originally intending to do that sometime later, but given this PR no longer is the initial simple “just change a few types” anyway, I might as well go all the way.

This is in preparation for libass#717 . Large file support is already enabled by default in meson.

TheOneric mentioned this pull request Nov 29, 2023

Add per-event multithreaded rendering #636

Closed

5 tasks

TheOneric force-pushed the ass-read-intmax branch from 9a215a1 to 0a5e372 Compare November 30, 2023 17:42

TheOneric force-pushed the ass-read-intmax branch from 0a5e372 to 710bcff Compare November 30, 2023 17:57

Don't fail to read files larger than INT_MAX

69c4088

TheOneric force-pushed the ass-read-intmax branch from 710bcff to 1aa9a93 Compare December 1, 2023 19:53

astiob reviewed Dec 2, 2023

8000 View reviewed changes

libass/ass.c Outdated Show resolved Hide resolved

astiob reviewed Dec 2, 2023

View reviewed changes

libass/ass_filesystem.c Outdated Show resolved Hide resolved

astiob reviewed Dec 2, 2023

View reviewed changes

libass/ass_filesystem.c Outdated Show resolved Hide resolved

astiob reviewed Dec 2, 2023

View reviewed changes

libass/ass.c Outdated Show resolved Hide resolved

astiob reviewed Dec 2, 2023

View reviewed changes

TheOneric force-pushed the ass-read-intmax branch 2 times, most recently from 31cd464 to c7fc9cc Compare December 2, 2023 00:47

TheOneric force-pushed the ass-read-intmax branch from c7fc9cc to 182f58c Compare December 2, 2023 02:10

astiob approved these changes Dec 2, 2023

View reviewed changes

MrSmile reviewed Dec 3, 2023

View reviewed changes

TheOneric force-pushed the ass-read-intmax branch from 182f58c to 623f7b5 Compare December 5, 2023 01:07

TheOneric added the enhancement label Dec 5, 2023

TheOneric added this to the 0.18.0 milestone Dec 5, 2023

load_file: optimise allocation if files size is known

cc13343

TheOneric force-pushed the ass-read-intmax branch from 623f7b5 to cc13343 Compare December 7, 2023 22:26

TheOneric mentioned this pull request Dec 8, 2023

Port of the build system to meson #330

Merged

8 tasks

arch1t3cht added a commit to arch1t3cht/libass that referenced this pull request Dec 27, 2023

Check for fstat function

47d5ddd

This is in preparation for libass#717 . Large file support is already enabled by default in meson.

torque pushed a commit to TypesettingTools/libass that referenced this pull request Apr 8, 2024

Check for fstat function

b56cfc0

This is in preparation for libass#717 . Large file support is already enabled by default in meson.

TheOneric modified the milestones: 0.17.2, 0.18.0 May 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ass: don't fail to read files larger than INT_MAX #717

ass: don't fail to read files larger than INT_MAX #717

ass: don't fail to read files larger than INT_MAX #717

Are you sure you want to change the base?

ass: don't fail to read files larger than INT_MAX #717

Conversation

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment