feat: vis-encode all path characters except visible ASCII #3

plobsing · 2025-05-09T01:41:23Z

This allows archive input files to include Unicode characters. It should also support the more esoteric corners of the ASCII set (e.g. backslashes, newlines, control codes), if the user can find a way to convince Bazel to use such a file. This latter feature probably isn't a good idea to actually use; but it is nice to have confidence that this ruleset won't choke if it comes up.

Generating an archive with non-ASCII characters does report a diagnostic, however this does not appear to correspond to a real issue — the file is still placed into the archive with the expected path and contents.

INFO: From Tar lib/tests/tar/7.tar:
tar: lib/tests/tar/srcdir/Unicode® support?🤞: Can't translate pathname 'lib/tests/tar/srcdir/Unicode® support?🤞' to UTF-8

Encoding of 7-bit ASCII could be performed in Bazel, however Starlark does not provide access to a string's bytes, only its codepoints, at least until the spec and implementation work to get a bytes type lands. So a second, out-of-process build-time pass is needed. Since a separate pass is needed anyways, we move most of the encoding responsibility to this external script (except cases that it cannot handle without interfering with the mtree file format: newline, space, and backslash).

We use gawk for this second pass. A prior attempt at this feature had done a similar external pass with the vis utility, but was not accepted because the vis tool is not expected to be available everywhere. gawk is provided through a BCR module of the same name.

The extended feature set of GNU Awk is necessary to implement these scripts succinctly; POSIX awk that might be assumed to be present on most systems is insufficient. At the very least,
it is difficult to do byte-by-byte encoding without --characters-as-bytes nor efficient, one-pass, table-based replacement without 4-arg split.

We also take this opportunity to canonicalize the paths derived from user-provided mtree specs when computing the unused inputs set. This should ensure that the comparison is exact, which substantially reduces the risks associated with this feature and should enable it to become a default.

This change was first proposed against bazel-lib as bazel-contrib/bazel-lib#985.
It has been ported to tar.bzl to reflect the new home of the functionality of interest.

Fixes #17

This allows archive input files to include Unicode characters. It should also support the more esoteric corners of the ASCII set (e.g. backslashes, newlines, control codes), if the user can find a way to convince Bazel to use such a file. This latter feature probably isn't a good idea to actually use; but it is nice to have confidence that this ruleset won't choke if it comes up. Generating an archive with non-ASCII characters reports a diagnostic, however this does not appear to correspond to a real issue — the file is still placed into the archive with the expected path and contents. ``` INFO: From Tar lib/tests/tar/7.tar: tar: lib/tests/tar/srcdir/Unicode® support?🤞: Can't translate pathname 'lib/tests/tar/srcdir/Unicode® support?🤞' to UTF-8 ``` Encoding of 7-bit ASCII could be performed in Bazel, however Starlark does not provide access to a string's bytes, only its codepoints, at least until the spec and implementation work to get a [`bytes` type](bazelbuild/starlark#112) lands. So a second, out-of-process build-time pass is needed. Since a separate pass is needed anyways, we move most of the encoding responsibility to this external script (except cases that it cannot handle without interfering with the mtree file format: newline, space, and backslash). We use `gawk` for this second pass. A [prior attempt](bazel-contrib/bazel-lib#795) at this feature had done a similar external pass with the `vis` utility, but has not been accepted because the `vis` tool is not expected to be available everywhere. `gawk` is provided through a [BCR module of the same name](https://registry.bazel.build/modules/gawk). The extended feature set of GNU Awk is necessary to implement these scripts succinctly; POSIX `awk` that might be assumed to be present on most systems is insufficient. At the very least, it is difficult to do byte-by-byte encoding without `--characters-as-bytes` nor efficient, one-pass, table-based replacement without 4-arg `split`. We also take this opportunity to canonicalize the paths derived from user-provided mtree specs when computing the unused inputs set. This should ensure that the comparison is exact, which substantially reduces the risks associated with this feature and should enable it to become a default. Fixes https://github.com/bazel-contrib/bazel-lib/issues/794

It doesn't like the BA occuring in hexdump content (i.e. representing 0xBA). It did find one typo though, so leaving it on wherever possible seems valuable.

Adding a terminal newline does change the size of the file, which shows up in archive listing golden content.

Rather than representing the gap corresponding to the "Delete" character with placeholder whitespace, we just end the line then and there. This is fine in this case because the gap comes at the end of the sequence, and logically the "Delete" character isn't really in the character set of interest, it just happens to be immediately adjacent and fills out the block to a nicely padded number.

This mirrors the dependency in MODULE.bazel: ``` bazel_dep(name = "gawk", version = "5.3.2.bcr.1") ``` `http_archive` definition, including overlay, mirrors content of the Bazel module in BCR.

Should fix issues with cyclic deps reports in WORKSPACE-mode e2e test.

alexeagle · 2025-05-19T04:18:48Z

MODULE.bazel

@@ -8,12 +8,14 @@ module(

 bazel_dep(name = "aspect_bazel_lib", version = "2.14.0")
 bazel_dep(name = "bazel_skylib", version = "1.4.1")
+bazel_dep(name = "gawk", version = "5.3.2.bcr.1")


is this introducing a new cc toolchain dependency, such that tar.bzl users must wait for this to compile (if a cache miss) and see any warning spam from gcc? (and if their cc toolchain is broken, not even be able to build)?

Should we make a gawk_prebuilt repo like bsdtar_prebuilt ?

is this introducing a new cc toolchain dependency,

Yes, this does introduce a CC dependency.

such that tar.bzl users must wait for this to compile (if a cache miss) and see any warning spam from gcc?

Gawk seems to compile relatively quickly (2s or just under on an M1 mac; <1s crit. path). The Bazel overlay configures it to compile silently ( https://github.com/bazelbuild/bazel-central-registry/blob/6335496d1d111d8d6cdf7bb572cd76cde5e440e8/modules/gawk/5.3.2.bcr.1/overlay/BUILD.bazel#L4 ).

(and if their cc toolchain is broken, not even be able to build)?

Yeah, those setups would be out of luck without a pre-built.

Should we make a gawk_prebuilt repo like bsdtar_prebuilt ?

I would be in favour; I do appreciate the speedup of prebuilts even if it is slight in this case. I suppose it might depend on our motivation for avoiding CC dependencies. Are we mostly interested in speeding up builds and eliminating console noise? Or are we specifically interested in supporting users who do not have a workin 8000 g C compiler? If the latter, definitely we need it.

If you think we need a bsdtar_prebuilt, I'd be happy to try to make that happen. Where do you think that should live? In bazel-lib alongside the bsdtar and coreutils toolchains we already pull from there? Let me know what you think.

Also, if we want to avoid taking a CC toolchain dependency, perhaps we should set up the E2E environments with intentionally non-functional toolchains to "shift left" this concern?

If it's fast and silent, I think that's sufficient to ship this as-is.

It might be nice to also support broken cc toolchain users, and in that case yes we'd want to cripple the ability for our e2e to compile it (we've been doing that with protoc). It's some effort to setup and run the release train for it however.

If it's fast and silent, I think that's sufficient to ship this as-is.

OK, I believe that standard is met.

It might be nice to also support broken cc toolchain users, and in that case yes we'd want to cripple the ability for our e2e to compile it (we've been doing that with protoc). It's some effort to setup and run the release train for it however.

Can we defer that support to a time after landing this? I can file a tracking bug either in this repo or bazel-lib to get a prebuilt, whichever repo you feel is more appropriate to use for that purpose.

plobsing mentioned this pull request May 9, 2025

feat: vis-encode all path characters except visible ASCII bazel-contrib/bazel-lib#985

Closed

plobsing and others added 6 commits May 9, 2025 00:56

Tame the typo-checker

7d70a45

It doesn't like the BA occuring in hexdump content (i.e. representing 0xBA). It did find one typo though, so leaving it on wherever possible seems valuable.

Accept buildifier fixes

9e05174

Accept end-of-file-fixer changes

2b913b6

Adding a terminal newline does change the size of the file, which shows up in archive listing golden content.

Provide a repo rule to pull @gawk into WORKSPACE-based builds

717513e

This mirrors the dependency in MODULE.bazel: ``` bazel_dep(name = "gawk", version = "5.3.2.bcr.1") ``` `http_archive` definition, including overlay, mirrors content of the Bazel module in BCR.

Update WORKSPACE.bazel

d876843

plobsing force-pushed the vis-encode-improvements branch 2 times, most recently from 7896e9a to 913048a Compare May 18, 2025 17:54

plobsing added 3 commits May 18, 2025 13:54

Accept buildifier fixes

fc19fe0

Add module docstring to satisfy buildifier-lint

a329faf

Reorder loads to follow toposortedness of dependencies

ff561c9

Should fix issues with cyclic deps reports in WORKSPACE-mode e2e test.

plobsing force-pushed the vis-encode-improvements branch from 913048a to ff561c9 Compare May 18, 2025 17:54

alexeagle reviewed May 19, 2025

View reviewed changes

plobsing requested a review from alexeagle June 4, 2025 06:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

feat: vis-encode all path characters except visible ASCII #3

feat: vis-encode all path characters except visible ASCII #3

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

feat: vis-encode all path characters except visible ASCII #3

Are you sure you want to change the base?

feat: vis-encode all path characters except visible ASCII #3

Uh oh!

Conversation

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!