Ushering out strlcpy()
Posted Aug 26, 2022 15:48 UTC (Fri) by wtarreau (subscriber, #51152)
In reply to: Ushering out strlcpy() by tialaramex
Parent article: Ushering out strlcpy()
Yep, absolutely, but I thought of it because it is a perfectly valid real-world example of something that can be done extremely efficiently when you manipulate bytes, and that one cannot afford to process as individual strings with allocations, memmove() calls and so on.
> I assume you don't consider the behaviour if hdr is in fact pointing at something that is not a zero-terminated string (ie a buffer overflow with undefined behaviour) to be desirable, and so we don't need Rust's unsafe which is the only way to duplicate that.
Absolutely. These are final implementation details. Just like it would be fine to require knowing the length up front and to use strnchr() if needed.
> https://gist.github.com/rust-play/59883fd0aecbfa0c988f2bf...
> [ I have not tested this code, but I believe it does what your C does ]
Thanks! [ I have not tested mine either :-) ]
So overall, once presented as a byte array like this, it looks similar and should be of equivalent complexity. It may miss some of the optimizations that can be done for strchr(), for example, which allow skipping 8 bytes at a time (or possibly even more using vector instructions), but overall it looks similar.
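For illustration only (untested, like the other snippets in this thread), the classic word-at-a-time trick that such optimized strchr() implementations build on looks roughly like this; has_zero_byte() and find_byte() are made-up names, and real implementations additionally deal with alignment and vector units:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Nonzero if any byte of the 64-bit word w is zero (classic bit trick). */
    static uint64_t has_zero_byte(uint64_t w)
    {
        return (w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL;
    }

    /* strchr()-like search testing 8 bytes per iteration; real versions
     * also handle alignment and use vector instructions. */
    static const char *find_byte(const char *s, size_t len, char c)
    {
        uint64_t pattern = 0x0101010101010101ULL * (unsigned char)c;
        size_t i = 0;

        for (; i + 8 <= len; i += 8) {
            uint64_t w;
            memcpy(&w, s + i, 8);            /* avoids unaligned access UB */
            if (has_zero_byte(w ^ pattern))  /* some byte equals c in this word */
                break;
        }
        for (; i < len; i++)                 /* locate it (or finish the tail) */
            if (s[i] == c)
                return s + i;
        return NULL;
    }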
> One very obvious difference is that Rust doesn't have pointer arithmetic, so we're writing the Rust in terms of indexing into the slice
I'm fine with this; I'm used to both forms in C as well, it's just a matter of preference. a[b] is exactly the same as b[a] and as *(a+b) or *(b+a) in C, so writing (-1)[ptr] instead of ptr[-1] is valid but only useful to confuse the reader :-)
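A tiny, untested illustration of that equivalence (made-up example, nothing more):

    #include <assert.h>

    int main(void)
    {
        char buf[] = "abc";
        char *p = buf + 1;

        /* a[b] is defined as *(a + b), so all of these address the same byte */
        assert(buf[2] == 2[buf]);
        assert(p[-1] == (-1)[p]);   /* parentheses matter: -1[p] parses as -(1[p]) */
        assert(p[-1] == *(p - 1));
        return 0;
    }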
> Rust doesn't think that "any slice of bytes" is a string, but, on the other hand, it also thinks most of C's "string" features are reasonable things to do to a slice of bytes, so this is mostly a matter of terminology. You can for example call make_ascii_lowercase() on a slice.
Makes sense. The only thing you'll miss then is the machine-specific optimizations that went into a number of C libraries for byte search, fill or move (which can be significant for strchr() or memset()).
> If it wasn't actually ASCII text before then what this does is well defined but probably not very useful.
Sure, but the point of such protocols is precisely that you don't care, as they're byte-oriented and very fast to process when done right.
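As a sketch of the idea (this is only the C equivalent of what such byte-wise lowercasing boils down to, not the Rust library's actual code), bytes outside 'A'..'Z', including non-ASCII ones, simply pass through untouched, which is exactly that "don't care" property these protocols rely on:

    #include <stddef.h>

    /* Lowercase ASCII letters in place; every other byte value, including
     * non-ASCII ones, is left untouched. */
    static void ascii_lowercase(unsigned char *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            if (buf[i] >= 'A' && buf[i] <= 'Z')
                buf[i] += 'a' - 'A';
    }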
Thanks!
Posted Aug 26, 2022 20:47 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
> The only thing you'll miss then is the machine-specific optimizations that went into a number of C libraries for byte search, fill or move (which can be significant for strchr() or memset()).
strchr and memset are libc functions, so I interpreted "a number of C libraries" as meaning "libc, and possibly other libraries, but mostly libc." If libc has machine-specific optimizations, and Rust can emit libc calls into the LLVM IR (which can then be inlined if it is advantageous to do so), then what exactly is the problem? Most of the interesting libc functions have some sort of safe wrapper in the Rust stdlib, and the language also provides a full FFI for calling arbitrary C code, so I really don't see what disadvantage you're talking about here.
Posted Aug 26, 2022 22:37 UTC (Fri)
by tialaramex (subscriber, #21167)
[Link] (12 responses)
Specifically this is by BurntSushi, who also wrote ripgrep, the very fast grep implementation. So yes, you certainly can (at least amortized over huge input slices) go significantly faster than the naive loop I wrote, if you are willing to work a lot harder than I did, or to use somebody else's work.
Posted Aug 29, 2022 11:42 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (8 responses)
Aren't these constraints? Just different ones?
> than about how to find their way through strict types and constraints, otherwise these constraints can quickly reduce the quality of what they're doing. At least that's my feeling.
I think you're over-thinking it.
The question is, are those artificial constraints, or are they real constraints that need to be addressed somewhere. I'll give you an example - from databases of course :-)
In Pick, the underlying datatype is string. The database is built around handling random-length strings. So when I'm designing my database, because Pick separates the underlying value from the display format, THERE IS NO CONCEPT OF LENGTH in the database itself. So the only time I have to concern myself with "how long is a piece of string" is when I'm trying to output it. Compare that with, e.g., C, which is a plentiful source of buffer over-runs because programmers have no clue how long a piece of string is, or Fortran, which iirc is a plentiful source of truncated data because programmers get it wrong ...
Same thing with numbers. Why should the programmer be forced to address "what sort of number is this?" Who cares whether it's integer / float / double / whatever? It's far too easy to make the wrong choice and end up with precision errors etc. ffs, it's a number!
Bear in mind, I don't know Rust, but I get the impression most / all of these constraints are questions which MUST be answered - like "who owns this variable, and when?". Not knowing the answer to that is a serious bug. Whereas, on the other hand, "is this number a float or a double?" is most of the time both a nuisance and an irrelevance. And mathematically meaningless to boot, but get it wrong and you've just corrupted your data.
Cheers,
Wol
Posted Aug 29, 2022 14:11 UTC (Mon)
by mpr22 (subscriber, #60784)
[Link] (5 responses)
*hollow mathematical laughter*
Once you put it in your digital computer, it's a digital computer number.
The mathematics of the many kinds of digital computer number are their own distinctive branches of mathematics, which diverge from the "natural" rules of "everyday" arithmetic in a variety of ways.
When you are operating in contexts such as, say, frivolous Javascript programs for the entertainment of the idle, where enormous performance costs are routinely accepted (if they weren't, you wouldn't be using Javascript in the first place) and correctness is an afterthought, "ffs it's a number!" is an unremarkable stance.
(Javascript has two native numeric types. "number" is IEEE 754 double precision floating point (but you are allowed to use bitwise ops on its purported integer value!), and "bigint" is an "arbitrary-precision" computer integer type for holding integer values outside the precise integer range of IEEE 754 double precision floating point.)
When you are operating in the contexts that C or Rust are intended for, where correctness, speed of execution, runtime memory footprint, etc. all matter, it is not.
Posted Aug 29, 2022 14:34 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (4 responses)
When you are operating in the contexts that Pick is intended for, where correctness (or the lack of it) can land you in jail, and speed is all important - because lack of it means that the job needed for tomorrow won't be ready until next week ...
STEP BACK AND LOOK AT THE FOREST!
I'm not saying your concerns aren't valid IN SOME CIRCUMSTANCES. But forcing the programmer to decide what sort of number, when the problem at hand doesn't give a shit, is not wise. Equally, not making the programmer worry about things like buffer over-runs, use-after-free, and all that stuff, when the problem at hand lends itself to making those mistakes, is just as stupid.
YOU NEED TO WORRY ABOUT THE THINGS THAT MATTER. And my worry is that wtarreau is missing the big picture. What's the difference between a good design woefully implemented, and a crap design that works as designed? I don't know, but both are probably useless for the intended purpose ...
I see it time and time again: people can't see the big picture, and when the resulting brown stuff hits the rotating blades, they're completely stunned by something that was blatantly obvious to anyone who took a step back. Maybe it helps that I'm pretty new in my current work role (and I'm coming at it from a completely different angle to most of my colleagues), but the amount of poor design / fire-fighting fixes / technical debt I see is horrendous. Luckily, I'm being left alone to do what I think is right to fix all this - I have to explain what I'm doing and why, but my manager sees the benefits (and I'm a new, extra resource), so I leave the rest of them to carry on while I try to fix the technical debt.
> "ffs it's a number!" is an unremarkable stance.
It's also an unremarkable stance for big accountancy systems managing huge projects - I worked on Sellafield THORP.
Cheers,
Wol
Posted Aug 29, 2022 17:07 UTC (Mon)
by wtarreau (subscriber, #51152)
[Link] (1 responses)
But you wrote the correct term: "programmer". That person got some training to learn that the machine they were going to program for uses discrete numbers, doesn't have infinite memory and so on. Otherwise it's just a human. Programmers know that no machine can accurately represent Pi. Programmers know that it's extremely dangerous to store numeric identifiers in Javascript because by default they're stored as floats, and that if their absolute value is larger than 2^53 they will lose precision and possibly confuse some people, sessions, digital signatures or whatever. Nowadays a lot of people think that after having hacked in a text area on an online training site they are programmers. And the day they write code like: if (a < b) do_something(); else if (a == b) do_something() else if (a > b) do_something_else(), they don't understand why there are entire classes of values for a and b which result in no function being called. But this is taught in computer courses, which remain an absolute necessity when writing programs aimed at running on real computers. Otherwise, as you're saying, some bugs can put you in jail (or even put someone else in jail).
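The obvious such class is IEEE 754 NaN: every ordered comparison against a NaN is false, so none of the three calls ever happens. A minimal, untested C illustration (function names made up):

    #include <math.h>
    #include <stdio.h>

    static void compare(double a, double b)
    {
        if (a < b)
            printf("a < b\n");
        else if (a == b)
            printf("a == b\n");
        else if (a > b)
            printf("a > b\n");
        else
            printf("none of the three comparisons is true\n");  /* NaN ends up here */
    }

    int main(void)
    {
        compare(1.0, 2.0);   /* prints "a < b" */
        compare(NAN, 2.0);   /* all three comparisons are false */
        return 0;
    }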
Processing protocols requires handling them using the types they were designed with. Most of the time the common type is the byte, and protocol elements are just sequences of bytes. In this case it makes sense to be able to handle them as naturally as possible. That's why C strings are heavily used in that area even though they don't shine for their efficiency in this specific purpose, while they're wonderful for many other cases.
And I really *do* want programmers to care about the machine they're writing code for. The languages can help, the compilers as well, and in that regard C is not your ally with its absurd precedence, annoying shifts and unclear type promotion, but it remains extremely reliable, so much so that it still runs almost the entirety of the internet's core services and operating systems, many of which would probably not exist if the only choices back then had been much more lenient in terms of resource usage, or too constraining in terms of development. Some will say it's not user-friendly; others will say it is user-friendly, it's just selective about who its friends are. My personal feeling is that it could be way better, but that I'm not willing to deal with the caprices of another language that constantly puts itself in my way when I'm trying to do something simple for the machine. When I can't do something in C I do it in asm, and I grumble that the interface between the two is ugly and not very transparent (though it improved lately in gcc, which now allows returning flags).
> amount of poor design / fire-fighting fixes / technical debt I see is horrendous
That's true, and sadly it seems to get worse with newer languages, where developers seem to prefer to throw everything away and restart again and again, without rethinking the design. When your language of choice requires a bit of thinking, you become very stingy about your code and you prefer to be certain to get the design right. You end up drawing stuff on paper before writing code, and that significantly helps keep code running well for decades with an acceptable level of maintenance.
Posted Aug 29, 2022 17:54 UTC (Mon)
by mpr22 (subscriber, #60784)
[Link]
I firmly agree with the general thrust of your post, but I do feel the need to point out that even in Javascript, with its historical dedication to Doing Something That Isn't Showing An Error And Refusing To Run when presented with ill-formed code, that code gets an ingestion-time syntax error at the first 'else'.
Posted Aug 29, 2022 17:08 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
You can't just say "it's a number" and call it a day. The math doesn't work out. You have to make some sort of compromise, and operate on some computable subset of the real numbers, which in practice is going to be a great deal smaller than the computable numbers. You can shout in ALL CAPS about how programmers should not be required to think about that compromise, but clearly somebody has to, at some point. If you just want to make the accountants happy, then use some sort of decimal fixed point (like SQL's DECIMAL type) or something similar. That's not "just a number" because it still does intermediate rounding, but at least it's doing the rounding that accountants probably want it to do.
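A minimal, untested sketch of what "some sort of decimal fixed point" can look like in C (the cents_t name and the basis-point helper are invented for the illustration; a real system would also have to worry about overflow and about the exact rounding rule the accountants want):

    #include <stdint.h>
    #include <stdio.h>

    /* Amounts stored as integer cents: addition stays exact, and rounding
     * only happens where the business rule explicitly asks for it. */
    typedef int64_t cents_t;

    /* Apply a rate given in basis points (1/100 of a percent), rounding
     * half away from zero; overflow handling is omitted in this sketch. */
    static cents_t apply_rate_bp(cents_t amount, int64_t basis_points)
    {
        int64_t num = amount * basis_points;
        return (num + (num < 0 ? -5000 : 5000)) / 10000;
    }

    int main(void)
    {
        cents_t price = 1999;                      /* 19.99 */
        cents_t vat = apply_rate_bp(price, 2000);  /* 20.00% */
        printf("price %lld.%02lld, VAT %lld.%02lld\n",
               (long long)(price / 100), (long long)(price % 100),
               (long long)(vat / 100), (long long)(vat % 100));
        return 0;
    }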
Posted Aug 29, 2022 18:26 UTC (Mon)
by Wol (subscriber, #4433)
[Link]
:-)
Pick mostly uses decimal fixed point (and makes the programmer think about it), but it really is "a number is 14sf to 4dp unless you tell me otherwise", which is sufficient for most purposes. It's a case of "if the sort of number matters, then I need to do something about it, but "number" covers most use cases".
Apologies if I'm maligning him, but wtarreau came across as complaining about Rust's memory checking and borrow checking and all the minutiae of system programming that causes real grief if it goes wrong.
People get too bogged down in the minutiae of what they're doing, and it can be very hard to step back and ask yourself "does this REALLY matter?". Quite often the answer is "no". If the computable subset is larger than your problem space (as it usually is), do you really want to be forced to pay strict attention to it? :-) And when the answer is "yes" then you do have to pay attention.
Cheers,
Wol
Posted Aug 29, 2022 23:09 UTC (Mon)
by notriddle (subscriber, #130608)
[Link]
No, it just doesn’t compile.
Posted Aug 30, 2022 7:03 UTC (Tue)
by milesrout (subscriber, #126894)
[Link]
There is no such thing as "just a number" in mathematics, even. An integer is a completely different sort of thing from a real number, or a rational number, or a complex number.
What is the result of calling the sine function on "a number"? What is the result of taking the square root of "a number"?
"Bigint" is a _very_ good default for a high-level programming or scripting language, for integers. But most of them keep using floating-point numbers despite their obvious flaws for a good reason: it's very hard to come up with anything else that will survive the kinds of operations people want to do on them.
You could use rational numbers, but there are lots of numbers people want to use that aren't rationals, and the fractional representations of rationals get huge very quickly when you use them. And even if you want to keep them only within a certain user-configured precision, knowing when to perform that reduction is not trivial if you want it to be performant (it shouldn't be after every operation).
Now there are other possible representations. You could represent numbers as functions that compute the value of the number to a given precision. You could represent numbers in many ways. What exactly a "number" is really isn't clear.
Posted Aug 29, 2022 19:26 UTC (Mon)
by tialaramex (subscriber, #21167)
[Link] (1 responses)
This might be more compelling with experimental results, but I'd be astonished if it actually made a discernible difference. Without such results I don't see any reason I shouldn't argue the exact opposite: that your solution is needlessly complicated and dangerous.
BurntSushi did this work because of ripgrep, which processes rather more data than an HTTP User-Agent header. I can well believe it's worth doing all that when running grep over an entire monorepo or a few weeks' log files or whatever, but I have my doubts for Mozilla/5.0 (X11; Linux x86_64; rv:103.0) Gecko/20100101 Firefox/103.0 which is, as I count it, 70 bytes.
Posted Aug 31, 2022 3:51 UTC (Wed)
by wtarreau (subscriber, #51152)
[Link]
Optimized greps, like those relying on the Boyer-Moore algorithm, are extremely effective at looking for patterns in large blocks (a complete file), because the initialization cost is amortized very quickly (often after just a few lines).
HTTP headers are painful because they're rarely large enough to amortize many optimizations, need to be processed at very high frequency, and as such suffer from even a function call or a likely()/unlikely() construct.
I wrote a log parser quite a long time ago, which I'm still using, that processes between 2 and 4 GB of logs per second, looking for some fields or producing statistics on output. It shared similar principles, and all the time was spent in fgets() looking for the LF! I reimplemented this in asm for i386 back then, then for x86_64, then added an option to use memchr() instead once glibc improved enough that it could sometimes be faster and compensate for the extra cost of the function call, and the differences can be significant. I then adopted similar mechanisms for HTTP headers and gained ~10% global performance, probably indicating +50-100% on the parsing alone, so that does count for quite a lot.
For the HTTP header example above, among the possibilities for going faster is taking into account the abysmally low probability of meeting the pattern. On average-sized headers it could be faster to run the fast search twice over the block, if you know it fits in L1 cache, looking for "\n " then "\n\t" in the whole block at once. Since most header blocks will be at least ~100 bytes long, this could amortize the cost of initializing the lookup function.
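A rough, untested sketch of that double scan (find_header_fold() is a made-up name, and memmem() is a GNU extension, so a portable version would need its own substring search):

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <string.h>

    /* Run the (possibly vectorized) substring search over the whole block
     * once per pattern and keep the earliest hit; returns NULL when the
     * block contains no folded (continued) header line. */
    static const char *find_header_fold(const char *blk, size_t len)
    {
        const char *sp  = memmem(blk, len, "\n ",  2);
        const char *tab = memmem(blk, len, "\n\t", 2);

        if (!sp)
            return tab;
        if (!tab)
            return sp;
        return sp < tab ? sp : tab;
    }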
But I definitely agree with you that in any case this must be measured and confronted with various pathological patterns (and that's what I'm doing all day long, trashing lots of apparently smart code changes that don't pass the performance test on real data).