-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JSON.stringify produces invalid UTF-16 #944
Comments
Prior discussion: https://esdiscuss.org/topic/code-points-vs-unicode-scalar-values#content-14 I’m in favor of making |
I couple follow-up thoughts to the old discussion.
|
I think it's at least worth pointing out the easy fix, which is to change the phrase "String in UTF-16 encoded JSON format" to "String in JSON format". |
If you need UTF-8, you need to replace unpaired surrogates in a string with U+FFFD. Most platform APIs will do this for you. Perhaps that's a primitive we should expose somehow... |
I’m curious, does escaping not just move the problem around? ES permits invalid Unicode, and even if unpaired surrogate code units are escaped, the result still describes an invalid Unicode string, doesn’t it? Maybe that’s still better than nothing since other agents can then safely decode. Is that the only goal? |
If you replace unpaired surrogates with U+FFFD, you no longer have code points that cannot be represented by UTF-8 (or UTF-16). I'm not really sure what you mean with "invalid Unicode". |
@annevk Oh, I misunderstood — I thought the idea was to replace with |
@bathos It describes a Unicode string that is not well-formed, but as mentioned in 3.9 of the standard (http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf), "Unicode strings need not contain well-formed code unit sequences under all conditions". It gives an example there of concatenating ill-formed UTF-16 sequences to create a well-formed UTF-16 sequence, like in my split/join examples. |
Thanks, @Maxdamantus. "Ill-formed" is probably the right term for what I meant, though either way, my comment was an error — I got this confused with something proposed elsewhere. Though re-reading the initial issue here, I guess my comment was still applicable after all? This does seem to be about escaping bad surrogate sequences to |
I imagine it would be undesirable at this point to produce replacement characters, since parsing the string that is currently produced works perfectly well for the relevant cases just as long as it never gets encoded into another UTF (or more accurately, gets interpreted as UTF-16) in the middle (i.e., I think generally it should be perfectly fine to encode arbitrary ES strings and that since the JSON encoding is supposedly a textual one based on Unicode characters (eg, As it currently is, |
There is now a stage 3 proposal fixing this. Once this lands this can be closed, I expect. |
JSON.stringify
is described as returning a "String in UTF-16 encoded JSON format representing an ECMAScript value", but due to the fact that ECMAScript strings can contain non-UTF-16 sequences of code units and thatQuoteJSONString
does not account for this, the UTF-16 ill-formedness of the value encoded is carried over to the encoding output.The effect of this is that the JSON string can not be correctly converted to another UTF such as UTF-8, so the conversion might fail or involve inserting replacement characters.
Example strange behaviour when invoking
js
(SpiderMonkey) ornode
:I propose something similar to the fragment below (not currently written in formal ECMAScript style) to be added to the specification for
QuoteJSONString
, before the final "Else" that currently exists.Note that this change would only have a minimal effect on the encoding of strings that are not well-formed UTF-16 code unit sequences (and will most likely fail to translate to other UTFs). Any well-formed UTF-16 code unit sequence (a sequence where every member is either a high surrogate code unit followed by a low surrogate code unit or is a non-surrogate code unit) is encoded in the same way as is currently specified.
The text was updated successfully, but these errors were encountered: