Ushering out strlcpy()

Posted Aug 26, 2022 15:48 UTC (Fri) by wtarreau (subscriber, #51152)
In reply to: Ushering out strlcpy() by tialaramex
Parent article: Ushering out strlcpy()

> You'll notice that in the newer protocols HTTP/2 and HTTP/3 the headers are no longer expressed this way. Today this folding is also considered obsolete in HTTP/1.1 and servers are entitled to just return a 400 error for requests with folded headers.

Yep absolutely, but I thought about it because it was a perfectly valid real-world example of something that can be done extremely efficiently when you manipulate bytes and that one cannot afford to process as individual strings using allocations nor doing memmove() etc.

> I assume you don't consider the behaviour if hdr is in fact pointing at something that is not a zero-terminated string (ie a buffer overflow with undefined behaviour) to be desirable, and so we don't need Rust's unsafe which is the only way to duplicate that.

Absolutely. These are final implementation details. Just like it would be fine to require to know the length upfront and use strnchr() if needed.

> https://gist.github.com/rust-play/59883fd0aecbfa0c988f2bf...
> [ I have not tested this code, but I believe it does what your C does ]

Thanks! [ I have not tested mine either :-) ]
So overall once presented as a byte array like this it looks similar and should be of equivalent complexity. It may miss some optimizations that can be done for strchr() for example, that would allow to skip 8 bytes at once (or possibly even more using vector instructions), but overall it looks similar.

> One very obvious difference is that Rust doesn't have pointer arithmetic, so we're writing the Rust in terms of indexing into the slice

I'm fine with this, I'm used to both forms in C as well, it's just a matter of preference. a[b] is exactly the same as b[a] and *(a+b) or *(b+a) in C, so writing -1[ptr] is valid but only useful to confuse the reader :-)

> Rust doesn't think that "any slice of bytes" is a string, but, on the other hand, it also thinks most of C's "string" features are reasonable things to do to a slice of bytes, so this is mostly a matter of terminology. You can for example call make_ascii_lowercase() on a slice.

Makes sense. The only thing you'll miss then is the machine-specific optimizations that went into a number of C libraries for byte search, fill or move (which can be significant for strchr() or memset()).

> If it wasn't actually ASCII text before then what this does is well defined but probably not very useful.

Sure but the point of such protocols precisely is that you don't care as they're byte-oriented and very fast to process when done right.

Thanks!

Ushering out strlcpy()

Posted Aug 26, 2022 16:59 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (5 responses)

I believe Rust is perfectly capable of emitting libc calls in contexts where it is safe to do so, because many of the slice primitives are described with phrases like "using memcpy" in the documentation. That suggests to me that they really do emit a memcpy(3) function call, or at least something which LLVM will transform into a memcpy call during compilation. So that's probably going to perform just as well as C, because under the hood, it is C.

Ushering out strlcpy()

Posted Aug 26, 2022 18:54 UTC (Fri) by wtarreau (subscriber, #51152) [Link] (4 responses)

Here it's different, it's not a matter of relying on libc calls for other calls but rather a question of ability for a compiler to detect a well-known pattern from something that doesn't look similar. Just like we constantly have to help the C compiler catch what we're trying to do by sometimes making ugly constructs, any other language's compiler will have the same problem, and it's not necessarily trivial to match an strchr() pattern in the loop above. And that's actually part of the things I often don't like in more abstract or stricter languages, which is less freedom to tell the compiler what you're trying to do.

Ushering out strlcpy()

Posted Aug 26, 2022 20:47 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (3 responses)

I don't understand. You said:

> The only thing you'll miss then is the machine-specific optimizations that went into a number of C libraries for byte search, fill or move (which can be significant for strchr() or memset()).

strchr and memset are libc functions, so I interpreted "a number of C libraries" as meaning "libc, and possibly other libraries, but mostly libc." If libc has machine-specific optimizations, and Rust can emit libc calls into the LLVM IR (which can then be inlined if it is advantageous to do so), then what exactly is the problem? Most of the interesting libc functions have some sort of safe wrapper in the Rust stdlib, and the language also provides a full FFI for calling arbitrary C code, so I really don't see what disadvantage you're talking about here.

Ushering out strlcpy()

Posted Aug 26, 2022 21:05 UTC (Fri) by wtarreau (subscriber, #51152) [Link] (2 responses)

I'm not speaking about being able or not to emit libc calls. I can understand that it can emit them when the call is obvious. What I'm saying is that just like any compiler it's extremely unlikely to spot situations where this is possible based on a loop like above. Even for a human it's not trivial to spot this. If it had strings mapped to arrays for example, like in C, I would easily imagine that we'd have an equivalent of strchr() applying to that array that would rely on the libc call. But a loop like this with tests in the middle is extremely unlikely to be turned to a series of strchr() on one given value. And by the way in such a parser it should be left to the developer's choice, because short headers would cost more with an strchr() call while long ones would benefit from a fast strchr(). Thus it would be nice to be able to call the equivalent of strchr() on positions in this array. But overall it looks like reimplementing the C string model that many people dislike.

Ushering out strlcpy()

Posted Aug 26, 2022 22:00 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (1 responses)

Well, sure, but you wouldn't write that loop (in Rust) in the first place. You'd call str::find[1] with suitable arguments, and then it's the implementation's problem to emit the most efficient way to do this. Since one possible set of arguments to str::find can be statically proven to be exactly equivalent to strchr (purely in terms of the type system, so this can be done in the monomorphization pass), the implementation should emit strchr if it is optimal to do so. If the implementation does not in fact emit strchr, and you think this is suboptimal, then you should file a bug with them. But don't tell me it's impossible. It's clearly possible.

[1]: https://doc.rust-lang.org/std/primitive.str.html#method.find

Ushering out strlcpy()

Posted Aug 29, 2022 9:37 UTC (Mon) by wtarreau (subscriber, #51152) [Link]

To be honest, I think you lost me, especially since initially I understood that it had to be handled as an array of bytes and not a string, but that sort of comforts me in the idea that it can quickly become very complicated to write simple yet efficient functions for a number of low-level use cases. And my fear with languages that are complicated to use is that either the developer will introduce logic bugs when trying to hack around constraints, or that they will give up and fall back to the default, much less efficient solution (which is often what we see in practice).

Ushering out strlcpy()

Posted Aug 26, 2022 22:37 UTC (Fri) by tialaramex (subscriber, #21167) [Link] (12 responses)

https://docs.rs/memchr/latest/memchr/ is the uh, extremely complicated result of somebody who really cares about the sort of optimizations you're discussing attacking this problem. It is of course named after C's memchr() function which is exactly like strchr() except not specialized for C-style 0-terminated strings.

Specifically this is by BurntSushi, who also wrote ripgrep the very fast grep implementation. So yes, you certainly can (at least amortized over huge input slices) go significantly faster than the naive loop I wrote if you are willing to work a lot harder than I did, or to use somebody else's work.

Ushering out strlcpy()

Posted Aug 29, 2022 9:57 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (11 responses)

Thanks! But indeed as I said above, that's the limit, i.e. it seems it can quickly become quite complicated to achieve C-like efficiency for real-world use cases that are routinely found in low-level programming. I mean, we both spent less than 5 minutes writing each of our versions, and they're probably as easy to read for a developer of each language so at this point that's fine. However the rust one will be significantly less efficient on medium sized headers (user-agent, cookie, referer etc) and suddenly it seems to become quite a bit more difficult to try to recover that last part because there are different ways to hack around this that one must explore. And for me, ease of use is a concern because I want to see developers spend more time thinking about design, reliability and security than about how to find their way through strict types and constraints, otherwise these constraints can quickly reduce the quality of what they're doing. At least that's my feeling.

Ushering out strlcpy()

Posted Aug 29, 2022 11:42 UTC (Mon) by Wol (subscriber, #4433) [Link] (8 responses)

> And for me, ease of use is a concern because I want to see developers spend more time thinking about design, reliability and security

Aren't these constraints? Just different ones?

> than about how to find their way through strict types and constraints, otherwise these constraints can quickly reduce the quality of what they're doing. At least that's my feeling.

I think you're over-thinking it.

The question is, are those artificial constraints, or are they real constraints that need to be addressed somewhere. I'll give you an example - from databases of course :-)

In Pick, the underlying datatype is string. The database is built around handling random-length strings. So when I'm designing my database, because Pick separates the underlying value from the display format, THERE IS NO CONCEPT OF LENGTH in the database itself. So the only time I have to concern myself with "how long is a piece of string" is when I'm trying to output it. Compare that with eg C, which is a plentiful source of buffer over-runs because programmers have no clue how long a piece of string is, or Fortran which iirc is a plentiful source of truncated data because programmers get it wrong ...

Same thing with numbers. Why should the programmer be forced to address "what sort of number is this?". Who cares whether it's integer / float / double / whatever, and it's far too easy to make the wrong choice and end up with precision errors etc. ffs it's a number!

Bear in mind, I don't know Rust, but I get the impression most / all of these constraints are questions which MUST be answered - like "who owns this variable, and when?". Not knowing the answer to that is a serious bug. Whereas on the other hand "is this number a float or a double?" most of the time is both a nuisance and an irrelevance. And mathematically meaningless to boot, but get it wrong and you've just corrupted your data.

Cheers,
Wol

Ushering out strlcpy()

Posted Aug 29, 2022 14:11 UTC (Mon) by mpr22 (subscriber, #60784) [Link] (5 responses)

> ffs it's a number!

*hollow mathematical laughter*

Once you put it in your digital computer, it's a digital computer number.

The mathematics of the many kinds of digital computer number are their own distinctive branches of mathematics, which diverge from the "natural" rules of "everyday" arithmetic in a variety of ways.

When you are operating in contexts such as, say, frivolous Javascript programs for the entertainment of the idle, where enormous performance costs are routinely accepted (if they weren't, you wouldn't be using Javascript in the first place) and correctness is an afterthought, "ffs it's a number!" is an unremarkable stance.

(Javascript has two native numeric types. "number" is IEEE 754 double precision floating point (but you are allowed to use bitwise ops on its purported integer value!), and "bigint" is an "arbitrary-precision" computer integer type for holding integer values outside the precise integer range of IEEE 754 double precision floating point.)

When you are operating in the contexts that C or Rust are intended for, where correctness, speed of execution, runtime memory footprint, etc. all matter, it is not.

Ushering out strlcpy()

Posted Aug 29, 2022 14:34 UTC (Mon) by Wol (subscriber, #4433) [Link] (4 responses)

> When you are operating in the contexts that C or Rust are intended for, where correctness, speed of execution, runtime memory footprint, etc. all matter, it is not.

When you are operating in the contexts that Pick is intended for, where correctness (or the lack of it) can land you in jail, and speed is all important - because lack of it means that the job needed for tomorrow won't be ready until next week ...

STEP BACK AND LOOK AT THE FOREST!

I'm not saying your concerns aren't valid IN SOME CIRCUMSTANCES. But forcing the programmer to decide what sort of number, when the problem at hand doesn't give a shit, is not wise. Equally, not bothering the programmer to worry about things like buffer over-runs, use-after-free, and all that stuff when the problem at hand lends itself to making those mistakes, is equally stupid.

YOU NEED TO WORRY ABOUT THE THINGS THAT MATTER. And my worry is that wtarreau is missing the big picture. What's the difference between a good design woefully implemented, and a crap design that works as designed? I don't know, but both are probably useless for the intended purpose ...

I see it time and time again, people can't see the big picture, and when the resulting brown stuff hits the rotating blades, they're completely stunned by something that was blatantly obvious to anyone who took a step back. Maybe it helps that I'm pretty new in my current work role (and I'm coming at it from a completely different angle to most of colleagues), but the amount of poor design / fire-fighting fixes / technical debt I see is horrendous. Luckily, I'm being left alone to do what I think is right to fix all this - I have to explain what and why I'm doing it, but my manager sees the benefits (and I'm new extra resource), so I leave the rest of them to carry on while I try and fix the technical debt.

> "ffs it's a number!" is an unremarkable stance.

It's also an unremarkable stance for big accountancy systems managing huge projects - I worked on Sellafield THORP.

Cheers,
Wol

Ushering out strlcpy()

Posted Aug 29, 2022 17:07 UTC (Mon) by wtarreau (subscriber, #51152) [Link] (1 responses)

> But forcing the programmer to decide what sort of number, when the problem at hand doesn't give a shit, is not wise.

But you wrote the correct term: "programmer". That person got some training to learn that the machine they were going to program for uses discrete numbers, doesn't have infinite memory and so on. Otherwise it's just a human. Programmers know that no machine can accurately represent Pi. Programmers know that it's extremely dangerous to store numeric identifiers in Javascript because by default they're stored as floats and that if their absolute value is larger than 2^52 they will lose precision and possibly confuse some people, sessions, digital signatures or whatever. Nowadays a lot of people think that after having hacked in a text area on an online training site they are programmers. And the day they write code like: if (a < b) do_something(); else if (a == b) do_domething() else if (a > b) do_something_else(), they don't understand why there are some entire classes of values for a and b which result in no function being called. But this is taugh in computer courses, that remain an absolute necessity when writing programs aimed at running on real computers. Otherwise, as you're saying, some bugs can put you to jail (or even put someone else to jail).

Processing protocols requires to handle them using the types they were designed with. Most of the time the common type is the byte and protocol elements are just sequences of bytes. In this case it makes sense to be able to handle them as naturally as possible. That's why C strings are overly used in that area even though they don't shine for their efficiency in this specific purpose, but they're wonderful for many other cases.

And I really *do* want programmers to care for the machine they're writing code for. The languages can help, the compilers as well, and in that regard C is not your ally with its absurd precedence, annoying shifts and unclear type promotion, but it remains extremely reliable, so much that it still runs almost the entierety of the internet's core services and operating systems, many systems which would probably not exist if the only choices by then were much more lenient in terms of resource usage, or too constraining in terms of development. Some will say it's not user-friendly, others will say it is user-friendly, it's just selective about who its friends are. My personal feeling is that it could be way better but that I'm not willing to deal with caprices from another language that constantly puts itself in my way when I'm trying to do something simple for the machine. When I can't do something in C I do it in asm, and I grumble that the interface between the two is ugly and not very transparent (though it improved lately in gcc by allowing to return flags).

> amount of poor design / fire-fighting fixes / technical debt I see is horrendous

That's true and it seems to speed up with newer languages sadly, where developers seem to prefer to throw away everything and restart again and again, without rethinking the design. When your language of choice requires a bit of thinking, you're becoming very stingy about your code and you prefer to be certain to get the design right. You end up drawing stuff on paper before writing code and that significantly helps keeping code running well for decades with an acceptable level of maintenance.

Ushering out strlcpy()

Posted Aug 29, 2022 17:54 UTC (Mon) by mpr22 (subscriber, #60784) [Link]

> And the day they write code like: if (a < b) do_something(); else if (a == b) do_domething() else if (a > b) do_something_else()

I firmly agree with the general thrust of your post, but I do feel the need to point out that even in Javascript, with its historical dedication to Doing Something That Isn't Showing An Error And Refusing To Run when presented with ill-formed code, that code gets an ingestion-time syntax error at the first 'else'.

Ushering out strlcpy()

Posted Aug 29, 2022 17:08 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (1 responses)

The problem is, the real numbers are uncountable, which means that there is no encoding of the real numbers in (finite-length) bit strings (even allowing for variable-length bit strings, arbitrary precision encodings, etc.). Furthermore, there exist non-computable real numbers (real numbers for which there does not exist any algorithm that can output their digits), and even if you restrict yourself to the computable reals, there are many computable numbers for which basic arithmetic is problematic (for example, zero and "2^-n, if Turing machine M halts in exactly n steps, or else zero if it never halts" are both computable numbers, but deciding whether they are equal is equivalent to the halting problem and hence undecidable).

You can't just say "it's a number" and call it a day. The math doesn't work out. You have to make some sort of compromise, and operate on some computable subset of the real numbers, which in practice is going to be a great deal smaller than the computable numbers. You can shout in ALL CAPS about how programmers should not be required to think about that compromise, but clearly somebody has to, at some point. If you just want to make the accountants happy, then use some sort of decimal fixed point (like SQL's DECIMAL type) or something similar. That's not "just a number" because it still does intermediate rounding, but at least it's doing the rounding that accountants probably want it to do.

Ushering out strlcpy()

Posted Aug 29, 2022 18:26 UTC (Mon) by Wol (subscriber, #4433) [Link]

> You can't just say "it's a number" and call it a day. The math doesn't work out. You have to make some sort of compromise, and operate on some computable subset of the real numbers, which in practice is going to be a great deal smaller than the computable numbers. You can shout in ALL CAPS about how programmers should not be required to think about that compromise, but clearly somebody has to, at some point. If you just want to make the accountants happy, then use some sort of decimal fixed point (like SQL's DECIMAL type) or something similar. That's not "just a number" because it still does intermediate rounding, but at least it's doing the rounding that accountants probably want it to do.

:-)

Pick mostly uses decimal fixed point (and makes the programmer think about it), but it really is "a number is 14sf to 4dp unless you tell me otherwise", which is sufficient for most purposes. It's a case of "if the sort of number matters, then I need to do something about it, but "number" covers most use cases".

Apologies if I'm maligning him, but wtarreau came across as complaining about Rust's memory checking and borrow checking and all the minutiae of system programming that causes real grief it it goes wrong.

People get too bogged down in the minutiae of what they're doing, and it can be very hard to step back and ask yourself "does this REALLY matter?". Quite often the answer is "no". If the computable subset is larger than your problem space (as it usually is), do you really want to be forced to pay strict attention to it? :-) And when the answer is "yes" then you do have to pay attention.

Cheers,
Wol

Ushering out strlcpy()

Posted Aug 29, 2022 23:09 UTC (Mon) by notriddle (subscriber, #130608) [Link]

> Not knowing the answer to that [ownership question] is a serious bug.

No, it just doesn’t compile.

Ushering out strlcpy()

Posted Aug 30, 2022 7:03 UTC (Tue) by milesrout (subscriber, #126894) [Link]

> Same thing with numbers. Why should the programmer be forced to address "what sort of number is this?". Who cares whether it's integer / float / double / whatever, and it's far too easy to make the wrong choice and end up with precision errors etc. ffs it's a number!

There is no such thing as "just a number" in mathematics, even. An integer is a completely different sort of thing from a real number, or a rational number, or a complex number.

What is the result of calling the sine function on "a number"? What is the result of taking the square root of "a number"?

"Bigint" is a _very_ good default for a high-level programming or scripting language, for integers. But most of them keep using floating-point numbers despite their obvious flaws for a good reason: it's very hard to come up with anything else that will survive the kinds of operations people want to do on them.

You could use rational numbers, but there are lots of numbers people want to use that aren't rationals, and the fractional representations of rationals get huge very quickly when you use them. And even if you want to keep them only within a certain user-configured precision, knowing when to perform that reduction is not trivial if you want it to be performant (it shouldn't be after every operation).

Now there are other possible representations. You could represent numbers as functions that compute the value of the number to a given precision. You could represent numbers in many ways. What exactly a "number" is really isn't clear.

Ushering out strlcpy()

Posted Aug 29, 2022 19:26 UTC (Mon) by tialaramex (subscriber, #21167) [Link] (1 responses)

> However the rust one will be significantly less efficient on medium sized headers (user-agent, cookie, referer etc)

This might be more compelling with experimental results but I'd be astonished if it made a discernible difference actually. Without such results I don't see any reason I shouldn't argue the exact opposite, that your solution is needlessly complicated and dangerous.

BurntSushi did this work because of ripgrep which is processing rather more data than a HTTP User-agent header. I can well believe it's worth doing all that for running grep over an entire monorepo or a few weeks log files or whatever but I have my doubts for Mozilla/5.0 (X11; Linux x86_64; rv:103.0) Gecko/20100101 Firefox/103.0 which is, as I count it, 70 bytes.

Ushering out strlcpy()

Posted Aug 31, 2022 3:51 UTC (Wed) by wtarreau (subscriber, #51152) [Link]

It actually does matter when you have to deal with DDoS for example, or when processing logs/traces.

Optimized grep like those relying on the Boyer-Moore algorithm are extremely effective to look for patterns in large blocks (a complete file), because the initialization cost is very quickly amortized (often after just a few lines).

HTTP headers are painful because they're rarely large enough to amortize many optimizations, need to be processed at very high frequency and as such often suffer from even a function call or the construct of a likely()/unlikely().

I wrote a log parser quite a long time ago, that I'm still using, that processes between 2 and 4 GB of logs per second, looking for some fields or producing statistics on output. It shared similar principles, and all the time was spent in fgets() looking for the LF! I reimplemented this in asm for i386 by then, then for x86_64, then added an option to choose whether or not to use memchr() instead when glibc improved and could sometimes be faster and compensate for the extra cost of the function call, and the differences can be significant. I then adopted similar mechanisms for HTTP headers and gained ~10% global performance probably indicating +50-100% on the parsing alone, so that does count quite a lot.

For the HTTP header example above, among the possibilities to be faster are taking into account the abysmally low probability to meet the pattern. On average sized headers it could be faster to use the fast grep twice over the block if you know it's fit in L1 cache, searching for "\n " then "\n\t" in the whole block at once. Since most headers will be at least ~100 bytes long, this could amortize the cost of initializing the lookup function.

But I definitely agree with you that in any case this must be measured, confronted to various pathological patterns (and that's what I'm doing all day long, trashing lots of apparently smart code changes that don't pass the performance test on real data).