-
Notifications
You must be signed in to change notification settings - Fork 226
ass: don't fail to read files larger than INT_MAX #717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
9a215a1
to
0a5e372
Compare
In the previous revision checking just one should have been fine afaict.
A newly added second commit now replaces the EDIT: force-pushed again as I accidentally merged most of my changes of commit 1 with commit 2 before |
0a5e372
to
710bcff
Compare
Why not just use |
FWIW I think it’s good to support files that are non-seekable; and even a seekable file’s size can change mid-read if someone concurrently writes to it. We could, of course, use ftello etc. as an optimization hint if it’s available. On a separate note, why do we need to read the entire file into a buffer at all? Can’t we process it chunk by chunk? We probably need whole lines, but we should be able to read a chunk, process any whole lines, and buffer only the last, incomplete line as we proceed to read the next chunk. |
(Another non-ISO-C alternative for getting the file size is the |
EDIT: Welp, GitHub only now shows me the new comments
I didn’t know Win32 has a ISO C, here draft n3047 of C23 says:
POSIX docs are silent about this particular case, but as usual defer to ISO C for everything not explicitly marked as an extension and Win32 docs are also silent on the matter. In practice it seems indeed unlikely anyone is using There is a nice bonus to the read-loop though, apart from avoiding possibly purely theoretical problems: it works with streams which are never seekable and we can now do something like zstd -c -d ./large.ass.zst | test/test out.png /dev/stdin 12 Side note: while I think attempts to seek on streams should fail cleanly with POSIX or ISO C, concerningly Microsoft docs contain the sentence “On devices incapable of seeking, the return value is undefined.”. |
We could parse ASS files line by line, yeah. But In fact loading fonts file and parsing are the only uses of this funtion. I guess moving the “load whole file” version into |
I think even in the case of incremental read it's better to use some sensible approximation for initial buffer size (fstat?). |
710bcff
to
1aa9a93
Compare
Now using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to define _FILE_OFFSET_BITS=64
as well, and whatever equivalent incantations other platforms may need. There’s probably (hopefully) a ready-to-use Autoconf macro for this.
31cd464
to
c7fc9cc
Compare
Now done via |
Though.. since we currently load the file wholesale into memory, does |
c7fc9cc
to
182f58c
Compare
Good point 😅 But yeah, it shouldn’t hurt and might help in the future, so it’s probably good to have it added while we remember about it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also there's typo (forregular
) in the second commit message.
if (!buf) | ||
goto fail; | ||
|
||
size_t bytes_read = 0; | ||
while (true) { | ||
size_t read = fread(buf + bytes_read, 1, CHUNK_SIZE, fp); | ||
size_t maxread = bufmax - bytes_read; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to ensure that maxread
is always greater than CHUNK_SIZE
to avoid suboptimal behavior at the file end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CHUNK_SIZE
no longer exists, but the new version will no longer reallloc anything for a know file size (which was not modified since the fstat
call). Does this seem good enough?
libass/ass_filesystem.c
Outdated
if (fstat(fd, &info)) | ||
return false; | ||
|
||
if (!S_ISREG(info.st_mode) || info.st_size > SIZE_MAX) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For > 4G files at 32-bit system that triggers failure, so ass_load_file
fallbacks to incremental read which eventually fail due to out of memory. I think ass_get_file_size
should return uint64_t
(or int64_t
which can be combined with −1 error) and let caller handle overflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right. If st_size >= SIZE_MAX
, we probably just want to abandon the file immediately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ass_get_file_size
now returns an int
and differentiates success (0
), unknown size (>0
) and too large (<0
), which gets used in load_file
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a function with the name ass_get_file_size
should return a known file size even if that size is larger than SIZE_MAX
. Currently we don't have a use for such sizes, but that can change in the future (for example, for a detailed error message).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we are presumably unlikely to encounter files larger than >2^64 bytes soon, POSIX’s stat
struct records the size in off_t
(Windows’s fstati64
as a 64bit type) and off_t
might be any width, so the return type would need to be intmax_t
(when avoiding <sys/types.h>
and the Win32 compatibility macros for off_t
to be reexported by ass_filesystem.h
).
However we only load two types of files:
- font files, which we’ll need to fully load into memory and thus limit to
SIZE_MAX
anyway - ASS styles or full files for which we want to switch to a streaming approach so
get_file_size
will be irrelevant (i was intending to do that some later date, but if optimising for this case becomes a problem perhaps I should already do that now)
I’d be surprised if we ever have to deal with more than those types and therefore don't see where get_file_size > SIZE_MAX
would ever be useful. Do you have some future usecase in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, there’s not much choice on platforms with 64-bit size_t
. (Some) common compilers only got 128-bit support relatively recently, if indeed at all. Must be the same reason why intmax_t
remained 64-bit when I last checked.
Then we can't log a warning though for too large files
Why not log it inside this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't care about platforms with 64-bit size_t
, such large files would be unfeasible for many years to come. I care about platforms with 32-bit size_t
, but > 4G files. In such cases file size is perfectly known, but ass_get_file_size
returns a error, which I think isn't right, especially if that function would be reused in another context in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
especially if that function would be reused in another context in the future.
At least with what I have in mind, it ass_get_file_size
will only ever be used for files we fully load into memory (font files and atm ASS files), while future streamed ASS files won't use it at all (instead read fixed size chunks like the second commit of this series to have an almost fixed memory usage (if lines are bigger than CHUNK_SIZE realloc will still be needed))
Do you prefer one of the alternative names or have another name suggestion to reflect the limitation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth noting that even if we stream a larger file in the future, we still load all events into memory, so unless the file is full of comment lines or embedded fonts, we'll likely run out of 32-bit memory space anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer (u)int64_t
for a file size instead of complications with another failure mode and its subsequent handling. Among alternative names I feel that ass_get_file_memory_size
is the most appropriate.
Streams do not support seek, thus we are currently unable to load them. But even for regular files, there are two problems with fseek+ftell. For one, ftell returns a long, which on Win32’s LLP64 ABI will artificially limit the loadable file size to 2GiB. But even if long and size_t are of equal width, ISO C doesn’t require fseek(.., SEEK_END) to be meaningfully implemented for binary-mode files as we are using here (and for text-mode ftell need not be meaningful). Instead just repeatedly load the next chunk of bytes until the file ends or an error occurs. This avoids those problems, but has the drawback of likely overallocating the buffer to some degree. For all current uses of ass_load_file the returned buffer is short-lived and this is liekly inproblematic, but the following commit will optimise this for all files whose size is known ahead of time.
182f58c
to
623f7b5
Compare
623f7b5
to
cc13343
Compare
Renamed the function to If this is now acceptable (or this + using in band |
This is in preparation for libass#717 . Large file support is already enabled by default in meson.
This is in preparation for libass#717 . Large file support is already enabled by default in meson.
Yet another
int
overuse in old code. Without this UB will occur on conversion to signed int and if the value wraps around to something negative, claim there was an IO error using whatever the last errno number was which is misleading.Not sure if it’s best practice to check
ferror
,feof
or both.