8000 feat: vis-encode all path characters except visible ASCII by plobsing · Pull Request #3 · bazel-contrib/tar.bzl · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

feat: vis-encode all path characters except visible ASCII #3

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

plobsing
Copy link
@plobsing plobsing commented May 9, 2025

This allows archive input files to include Unicode characters. It should also support the more esoteric corners of the ASCII set (e.g. backslashes, newlines, control codes), if the user can find a way to convince Bazel to use such a file. This latter feature probably isn't a good idea to actually use; but it is nice to have confidence that this ruleset won't choke if it comes up.

Generating an archive with non-ASCII characters does report a diagnostic, however this does not appear to correspond to a real issue — the file is still placed into the archive with the expected path and contents.

INFO: From Tar lib/tests/tar/7.tar:
tar: lib/tests/tar/srcdir/Unicode® support?🤞: Can't translate pathname 'lib/tests/tar/srcdir/Unicode® support?🤞' to UTF-8

Encoding of 7-bit ASCII could be performed in Bazel, however Starlark does not provide access to a string's bytes, only its codepoints, at least until the spec and implementation work to get a bytes type lands. So a second, out-of-process build-time pass is needed. Since a separate pass is needed anyways, we move most of the encoding responsibility to this external script (except cases that it cannot handle without interfering with the mtree file format: newline, space, and backslash).

We use gawk for this second pass. A prior attempt at this feature had done a similar external pass with the vis utility, but was not accepted because the vis tool is not expected to be available everywhere. gawk is provided through a BCR module of the same name.

The extended feature set of GNU Awk is necessary to implement these scripts succinctly; POSIX awk that might be assumed to be present on most systems is insufficient. At the very least,
it is difficult to do byte-by-byte encoding without --characters-as-bytes nor efficient, one-pass, table-based replacement without 4-arg split.

We also take this opportunity to canonicalize the paths derived from user-provided mtree specs when computing the unused inputs set. This should ensure that the comparison is exact, which substantially reduces the risks associated with this feature and should enable it to become a default.

This change was first proposed against bazel-lib as bazel-contrib/bazel-lib#985.
It has been ported to tar.bzl to reflect the new home of the functionality of interest.

Fixes #17

This allows archive input files to include Unicode characters.
It should also support the more esoteric corners of the ASCII set
(e.g. backslashes, newlines, control codes), if the user can find a way
to convince Bazel to use such a file. This latter feature probably isn't
a good idea to actually use; but it is nice to have confidence that this
ruleset won't choke if it comes up.

Generating an archive with non-ASCII characters reports a diagnostic,
however this does not appear to correspond to a real issue — the file
is still placed into the archive with the expected path and contents.

```
INFO: From Tar lib/tests/tar/7.tar:
tar: lib/tests/tar/srcdir/Unicode® support?🤞: Can't translate pathname 'lib/tests/tar/srcdir/Unicode® support?🤞' to UTF-8
```
Encoding of 7-bit ASCII could be performed in Bazel, however Starlark does not provide
access to a string's bytes, only its codepoints, at least until the spec and implementation work
to get a [`bytes` type](bazelbuild/starlark#112) lands.
So a second, out-of-process build-time pass is needed. Since a separate pass
is needed anyways, we move most of the encoding responsibility to this external script
(except cases that it cannot handle without interfering with the mtree file format: newline, space, and backslash).

We use `gawk` for this second pass. A [prior attempt](bazel-contrib/bazel-lib#795)
at this feature had done a similar external pass with the `vis` utility, but has
not been accepted because the `vis` tool is not expected to be available everywhere.
`gawk` is provided through a [BCR module of the same
name](https://registry.bazel.build/modules/gawk).

The extended feature set of GNU Awk is necessary to implement these
scripts succinctly; POSIX `awk` that might be assumed to be present on
most systems is insufficient. At the very least,
it is difficult to do byte-by-byte encoding without
`--characters-as-bytes` nor efficient, one-pass, table-based replacement without
4-arg `split`.

We also take this opportunity to canonicalize the paths derived
from user-provided mtree specs when computing the unused
inputs set. This should ensure that the comparison is exact,
which substantially reduces the risks associated with this feature
and should enable it to become a default.

Fixes https://github.com/bazel-contrib/bazel-lib/issues/794
plobsing and others added 6 commits May 9, 2025 00:56
It doesn't like the BA occuring in hexdump content (i.e. representing
0xBA). It did find one typo though, so leaving it on wherever possible
seems valuable.
Adding a terminal newline does change the size of the file, which shows
up in archive listing golden content.
Rather than representing the gap corresponding to the "Delete" character
with placeholder whitespace, we just end the line then and there. This
is fine in this case because the gap comes at the end of the sequence,
and logically the "Delete" character isn't really in the character set
of interest, it just happens to be immediately adjacent and fills out
the block to a nicely padded number.
This mirrors the dependency in MODULE.bazel:

```
 bazel_dep(name = "gawk", version = "5.3.2.bcr.1")
```
`http_archive` definition, including overlay, mirrors content of the
Bazel module in BCR.
@plobsing plobsing force-pushed the vis-encode-improvements branch 2 times, most recently from 7896e9a to 913048a Compare May 18, 2025 17:54
@plobsing plobsing force-pushed the vis-encode-improvements branch from 913048a to ff561c9 Compare May 18, 2025 17:54
@@ -8,12 +8,14 @@ module(

bazel_dep(name = "aspect_bazel_lib", version = "2.14.0")
bazel_dep(name = "bazel_skylib", version = "1.4.1")
bazel_dep(name = "gawk", version = "5.3.2.bcr.1")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this introducing a new cc toolchain dependency, such that tar.bzl users must wait for this to compile (if a cache miss) and see any warning spam from gcc? (and if their cc toolchain is broken, not even be able to build)?

Should we make a gawk_prebuilt repo like bsdtar_prebuilt ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this introducing a new cc toolchain dependency,

Yes, this does introduce a CC dependency.

such that tar.bzl users must wait for this to compile (if a cache miss) and see any warning spam from gcc?

Gawk seems to compile relatively quickly (2s or just under on an M1 mac; <1s crit. path). The Bazel overlay configures it to compile silently ( https://github.com/bazelbuild/bazel-central-registry/blob/6335496d1d111d8d6cdf7bb572cd76cde5e440e8/modules/gawk/5.3.2.bcr.1/overlay/BUILD.bazel#L4 ).

(and if their cc toolchain is broken, not even be able to build)?

Yeah, those setups would be out of luck without a pre-built.

Should we make a gawk_prebuilt repo like bsdtar_prebuilt ?

I would be in favour; I do appreciate the speedup of prebuilts even if it is slight in this case. I suppose it might depend on our motivation for avoiding CC dependencies. Are we mostly interested in speeding up builds and eliminating console noise? Or are we specifically interested in supporting users who do not have a workin 8000 g C compiler? If the latter, definitely we need it.

If you think we need a bsdtar_prebuilt, I'd be happy to try to make that happen. Where do you think that should live? In bazel-lib alongside the bsdtar and coreutils toolchains we already pull from there? Let me know what you think.

Also, if we want to avoid taking a CC toolchain dependency, perhaps we should set up the E2E environments with intentionally non-functional toolchains to "shift left" this concern?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's fast and silent, I think that's sufficient to ship this as-is.

It might be nice to also support broken cc toolchain users, and in that case yes we'd want to cripple the ability for our e2e to compile it (we've been doing that with protoc). It's some effort to setup and run the release train for it however.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's fast and silent, I think that's sufficient to ship this as-is.

OK, I believe that standard is met.

It might be nice to also support broken cc toolchain users, and in that case yes we'd want to cripple the ability for our e2e to compile it (we've been doing that with protoc). It's some effort to setup and run the release train for it however.

Can we defer that support to a time after landing this? I can file a tracking bug either in this repo or bazel-lib to get a prebuilt, whichever repo you feel is more appropriate to use for that purpose.

@plobsing plobsing requested a review from alexeagle June 4, 2025 06:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: mtree_spec rule does not encode filenames with special characters
2 participants
0