Escape non-printable characters in JSON output #1949

riemass · 2024-09-15T20:29:40Z

Fixes #1937.
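
For context, here is a minimal sketch of what escaping control characters in JSON string output can look like. The function `escape_json` is a hypothetical name and is not the code from this PR. It covers only the characters that RFC 8259 requires to be escaped: the quotation mark, the reverse solidus, and U+0000 through U+001F.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper for illustration, not CAF's actual implementation.
// Returns `in` as a quoted JSON string with mandatory escapes applied.
std::string escape_json(std::string_view in) {
  std::string out = "\"";
  for (char ch : in) {
    auto c = static_cast<unsigned char>(ch);
    switch (c) {
      case '"':  out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\b': out += "\\b"; break;
      case '\f': out += "\\f"; break;
      case '\n': out += "\\n"; break;
      case '\r': out += "\\r"; break;
      case '\t': out += "\\t"; break;
      default:
        if (c < 0x20) {
          // No short form exists for this control character: use \u00XX.
          char buf[7];
          std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
          out += buf;
        } else {
          // Printable ASCII and UTF-8 bytes (>= 0x80) pass through unchanged.
          out += ch;
        }
        break;
    }
  }
  out += '"';
  return out;
}
```

With this sketch, a string holding `a`, the byte 0x01, and `b` comes out as `"a\u0001b"`, while a tab uses the short form `\t`.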

codecov · 2024-09-15T20:48:33Z

Codecov Report

Attention: Patch coverage is 75.82418% with 22 lines in your changes missing coverage. Please review.

Project coverage is 71.12%. Comparing base (f7effcf) to head (a59eaab).
Report is 6 commits behind head on main.

Files with missing lines	Patch %	Lines
libcaf_core/caf/detail/json.cpp	81.25%	13 Missing and 2 partials ⚠️
libcaf_core/caf/json_reader.cpp	25.00%	6 Missing ⚠️
libcaf_core/caf/detail/print.hpp	66.66%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1949      +/-   ##
==========================================
+ Coverage   70.94%   71.12%   +0.18%     
==========================================
  Files         658      658              
  Lines       29031    29072      +41     
  Branches     3154     3158       +4     
==========================================
+ Hits        20595    20677      +82     
+ Misses       6649     6602      -47     
- Partials     1787     1793       +6

Neverlord · 2024-09-16T18:32:19Z

@riemass what about the JSON reader? Do we already support the \xxxx syntax there? I would like to see a round-trip test. If that would require touching the parser and you don't want to touch it, that's OK too. I would put it onto my todo list in that case.

riemass · 2024-09-18T10:53:29Z

I didn't think of it, good catch. I'll update the PR once ready.

riemass · 2024-09-21T12:32:46Z

@Neverlord, a question regarding the \u escapes. According to the spec, any code point may be escaped and written as either \uxxxx or \uxxxx\uxxxx, which corresponds to the way UTF-16 encodes code points.
How do we treat string encodings in CAF? Should I always translate the escaped code point to its variable-length UTF-8 encoding, or should the reader somehow test what encoding the system uses? Do some systems still use UTF-16 by default?
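
To make the \uxxxx\uxxxx form concrete, here is a sketch of how a reader could combine two escaped UTF-16 code units into a single code point. The helper names are assumptions for illustration, not CAF functions.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helpers for illustration; the names are not taken from CAF.
constexpr bool is_high_surrogate(std::uint16_t u) {
  return u >= 0xD800 && u <= 0xDBFF;
}

constexpr bool is_low_surrogate(std::uint16_t u) {
  return u >= 0xDC00 && u <= 0xDFFF;
}

// Combines the two 16-bit values of a `\uXXXX\uXXXX` sequence into one
// code point. Returns std::nullopt if they do not form a surrogate pair.
constexpr std::optional<std::uint32_t> combine_surrogates(std::uint16_t hi,
                                                          std::uint16_t lo) {
  if (!is_high_surrogate(hi) || !is_low_surrogate(lo))
    return std::nullopt;
  return 0x10000u + ((static_cast<std::uint32_t>(hi) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(lo) - 0xDC00u);
}
```

For example, `\uD834\uDD1E` combines to U+1D11E. A single `\uxxxx` outside the surrogate range already is the code point and needs no combining.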

Neverlord · 2024-09-22T18:31:48Z

Using local encoding would go against the spec (see RFC 8259, section 8.1):

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

If a user wants to render a received JSON string in local encoding, they should do the (most likely lossy) conversion themselves.

(...) might be escaped and written as either \uxxxx or \uxxxx\uxxxx

The way I read it, the latter syntax is only used if a character does not fit into a single 16-bit code unit in UTF-16. Either way, I think the output the API produces needs to be UTF-8 according to the spec.
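
A sketch of the conversion this implies: once the reader has decoded an escape into a code point, it appends the UTF-8 bytes regardless of the local encoding. `append_utf8` is a hypothetical name, not an existing CAF function.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of `cp` to `out`.
// Returns false (and leaves `out` untouched) for values that are not
// Unicode scalar values: lone surrogates or anything above U+10FFFF.
bool append_utf8(std::uint32_t cp, std::string& out) {
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return false;
  if (cp < 0x80) {
    // 1 byte: 0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return true;
}
```

Code points above U+FFFF, which need the \uxxxx\uxxxx form in JSON, end up as four bytes in the output. Rejecting lone surrogates is a design choice of this sketch; a parser could also report them as a parse error earlier.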