-
-
Notifications
You must be signed in to change notification settings - Fork 2
feat: vis-encode all path characters except visible ASCII #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This allows archive input files to include Unicode characters. It should also support the more esoteric corners of the ASCII set (e.g. backslashes, newlines, control codes), if the user can find a way to convince Bazel to use such a file. This latter feature probably isn't a good idea to actually use; but it is nice to have confidence that this ruleset won't choke if it comes up. Generating an archive with non-ASCII characters reports a diagnostic, however this does not appear to correspond to a real issue — the file is still placed into the archive with the expected path and contents. ``` INFO: From Tar lib/tests/tar/7.tar: tar: lib/tests/tar/srcdir/Unicode® support?🤞: Can't translate pathname 'lib/tests/tar/srcdir/Unicode® support?🤞' to UTF-8 ``` Encoding of 7-bit ASCII could be performed in Bazel, however Starlark does not provide access to a string's bytes, only its codepoints, at least until the spec and implementation work to get a [`bytes` type](bazelbuild/starlark#112) lands. So a second, out-of-process build-time pass is needed. Since a separate pass is needed anyways, we move most of the encoding responsibility to this external script (except cases that it cannot handle without interfering with the mtree file format: newline, space, and backslash). We use `gawk` for this second pass. A [prior attempt](bazel-contrib/bazel-lib#795) at this feature had done a similar external pass with the `vis` utility, but has not been accepted because the `vis` tool is not expected to be available everywhere. `gawk` is provided through a [BCR module of the same name](https://registry.bazel.build/modules/gawk). The extended feature set of GNU Awk is necessary to implement these scripts succinctly; POSIX `awk` that might be assumed to be present on most systems is insufficient. At the very least, it is difficult to do byte-by-byte encoding without `--characters-as-bytes` nor efficient, one-pass, table-based replacement without 4-arg `split`. We also take this opportunity to canonicalize the paths derived from user-provided mtree specs when computing the unused inputs set. This should ensure that the comparison is exact, which substantially reduces the risks associated with this feature and should enable it to become a default. Fixes https://github.com/bazel-contrib/bazel-lib/issues/794
It doesn't like the BA occuring in hexdump content (i.e. representing 0xBA). It did find one typo though, so leaving it on wherever possible seems valuable.
Adding a terminal newline does change the size of the file, which shows up in archive listing golden content.
Rather than representing the gap corresponding to the "Delete" character with placeholder whitespace, we just end the line then and there. This is fine in this case because the gap comes at the end of the sequence, and logically the "Delete" character isn't really in the character set of interest, it just happens to be immediately adjacent and fills out the block to a nicely padded number.
This mirrors the dependency in MODULE.bazel: ``` bazel_dep(name = "gawk", version = "5.3.2.bcr.1") ``` `http_archive` definition, including overlay, mirrors content of the Bazel module in BCR.
7896e9a
to
913048a
Compare
Should fix issues with cyclic deps reports in WORKSPACE-mode e2e test.
913048a
to
ff561c9
Compare
@@ -8,12 +8,14 @@ module( | |||
|
|||
bazel_dep(name = "aspect_bazel_lib", version = "2.14.0") | |||
bazel_dep(name = "bazel_skylib", version = "1.4.1") | |||
bazel_dep(name = "gawk", version = "5.3.2.bcr.1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this introducing a new cc toolchain dependency, such that tar.bzl
users must wait for this to compile (if a cache miss) and see any warning spam from gcc? (and if their cc toolchain is broken, not even be able to build)?
Should we make a gawk_prebuilt repo like bsdtar_prebuilt ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this introducing a new cc toolchain dependency,
Yes, this does introduce a CC dependency.
such that
tar.bzl
users must wait for this to compile (if a cache miss) and see any warning spam from gcc?
Gawk seems to compile relatively quickly (2s or just under on an M1 mac; <1s crit. path). The Bazel overlay configures it to compile silently ( https://github.com/bazelbuild/bazel-central-registry/blob/6335496d1d111d8d6cdf7bb572cd76cde5e440e8/modules/gawk/5.3.2.bcr.1/overlay/BUILD.bazel#L4 ).
(and if their cc toolchain is broken, not even be able to build)?
Yeah, those setups would be out of luck without a pre-built.
Should we make a gawk_prebuilt repo like bsdtar_prebuilt ?
I would be in favour; I do appreciate the speedup of prebuilts even if it is slight in this case. I suppose it might depend on our motivation for avoiding CC dependencies. Are we mostly interested in speeding up builds and eliminating console noise? Or are we specifically interested in supporting users who do not have a workin 8000 g C compiler? If the latter, definitely we need it.
If you think we need a bsdtar_prebuilt
, I'd be happy to try to make that happen. Where do you think that should live? In bazel-lib
alongside the bsdtar and coreutils toolchains we already pull from there? Let me know what you think.
Also, if we want to avoid taking a CC toolchain dependency, perhaps we should set up the E2E environments with intentionally non-functional toolchains to "shift left" this concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's fast and silent, I think that's sufficient to ship this as-is.
It might be nice to also support broken cc toolchain users, and in that case yes we'd want to cripple the ability for our e2e to compile it (we've been doing that with protoc). It's some effort to setup and run the release train for it however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's fast and silent, I think that's sufficient to ship this as-is.
OK, I believe that standard is met.
It might be nice to also support broken cc toolchain users, and in that case yes we'd want to cripple the ability for our e2e to compile it (we've been doing that with protoc). It's some effort to setup and run the release train for it however.
Can we defer that support to a time after landing this? I can file a tracking bug either in this repo or bazel-lib
to get a prebuilt, whichever repo you feel is more appropriate to use for that purpose.
This allows archive input files to include Unicode characters. It should also support the more esoteric corners of the ASCII set (e.g. backslashes, newlines, control codes), if the user can find a way to convince Bazel to use such a file. This latter feature probably isn't a good idea to actually use; but it is nice to have confidence that this ruleset won't choke if it comes up.
Generating an archive with non-ASCII characters does report a diagnostic, however this does not appear to correspond to a real issue — the file is still placed into the archive with the expected path and contents.
Encoding of 7-bit ASCII could be performed in Bazel, however Starlark does not provide access to a string's bytes, only its codepoints, at least until the spec and implementation work to get a
bytes
type lands. So a second, out-of-process build-time pass is needed. Since a separate pass is needed anyways, we move most of the encoding responsibility to this external script (except cases that it cannot handle without interfering with the mtree file format: newline, space, and backslash).We use
gawk
for this second pass. A prior attempt at this feature had done a similar external pass with thevis
utility, but was not accepted because thevis
tool is not expected to be available everywhere.gawk
is provided through a BCR module of the same name.The extended feature set of GNU Awk is necessary to implement these scripts succinctly; POSIX
awk
that might be assumed to be present on most systems is insufficient. At the very least,it is difficult to do byte-by-byte encoding without
--characters-as-bytes
nor efficient, one-pass, table-based replacement without 4-argsplit
.We also take this opportunity to canonicalize the paths derived from user-provided mtree specs when computing the unused inputs set. This should ensure that the comparison is exact, which substantially reduces the risks associated with this feature and should enable it to become a default.
This change was first proposed against
bazel-lib
as bazel-contrib/bazel-lib#985.It has been ported to
tar.bzl
to reflect the new home of the functionality of interest.Fixes #17