feat: support chardet config file setting by rasa · Pull Request #457 · editorconfig-checker/editorconfig-checker · GitHub

feat: support chardet config file setting #457


Draft: rasa wants to merge 11 commits into main from rs/support-charset-setting

Conversation

rasa
Contributor
@rasa rasa commented Mar 26, 2025

Fixes #40

no longer applicable: Per [here](https://github.com//pull/457/files#diff-2a0547966abe4b6fcace630584fe01fee0d2498396cc80c8ee2a36c0d46fae28R202):

```go
// The below file fails the test, but it may not be a valid UTF-16LE file.
// For example, the Linux file command doesn't identify the file as
// "Unicode text, UTF-16, little-endian text"
// but simply
// "data"
// but since the file is from
// https://cs.opensource.google/go/x/text/+/master:encoding/testdata/
// I think it's correct to fail the test, and fix the chardet package.
{"candide-utf-16le.txt", "utf16le"},
```

@rasa rasa changed the title feat: Support chardet config file setting feat: support chardet config file setting Mar 26, 2025
@ccoVeille
Contributor

Please let me know when it's ready to review. I already have remarks, but I'm waiting for your go-ahead, because I'm trying to restrain myself.
(attached image: 395770858-718c5a79-c97e-4a60-bb00-c543221e328f.jpg)

@rasa
Contributor Author
rasa commented Mar 28, 2025

@ccoVeille Yeah, hold off for a bit. I'm gonna toss chardet and use our own code to determine if a file is latin1, utf-8, utf-8-bom, utf-16be or utf-16le.

Here's a start:

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

func detectEncoding(data []byte) string {
	if len(data) >= 2 {
		if bytes.HasPrefix(data, []byte{0xFE, 0xFF}) {
			return "utf-16be (with BOM)"
		}
		if bytes.HasPrefix(data, []byte{0xFF, 0xFE}) {
			return "utf-16le (with BOM)"
		}
	}

	if utf8.Valid(data) {
		return "utf-8"
	}

	if isValidUTF16LE(data) {
		return "utf-16le (no BOM)"
	}
	if isValidUTF16BE(data) {
		return "utf-16be (no BOM)"
	}

	if isLikelyLatin1(data) {
		return "latin1"
	}

	return "binary (unknown or invalid text)"
}

func isValidUTF16LE(data []byte) bool {
	if len(data)%2 != 0 {
		return false
	}
	u16 := make([]uint16, len(data)/2)
	for i := 0; i < len(u16); i++ {
		u16[i] = uint16(data[2*i]) | uint16(data[2*i+1])<<8
	}
	decoded := utf16.Decode(u16)
	for _, r := range decoded {
		if r == utf8.RuneError {
			return false
		}
	}
	return true
}

func isValidUTF16BE(data []byte) bool {
	if len(data)%2 != 0 {
		return false
	}
	u16 := make([]uint16, len(data)/2)
	for i := 0; i < len(u16); i++ {
		u16[i] = uint16(data[2*i+1]) | uint16(data[2*i])<<8
	}
	decoded := utf16.Decode(u16)
	for _, r := range decoded {
		if r == utf8.RuneError {
			return false
		}
	}
	return true
}

func isLikelyLatin1(data []byte) bool {
	const disallowed = "" +
		"\x00\x01\x02\x03\x04\x05\x06\x07\x08" + // C0 controls
		"\x0B" + // Vertical Tab
		"\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F" +
		"\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F" + // C1 controls
		"\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F"

	return !bytes.ContainsAny(data, disallowed)
}

func main() {
	data := []byte("Hello,\tworld!\n") // Replace with file or test content
	fmt.Println("Detected encoding:", detectEncoding(data))
}
```
See https://go.dev/play/p/lcye7XmZLJv

Based on my research, I think we should interpret `latin1` to mean `iso8859-1`, so if a file has any bytes in the 00-31 (except tab, lf, cr, and ff) or 128-159 range (`windows-1252` uses these), we would reject it as not `latin1`. If the user wants to allow those bytes, they need to use `charset = unset`.

Alternatively, we treat `latin1` as `unset`, and allow any byte stream, including valid utf8/16/32.

Thoughts?

(love the pic btw!)

@klaernie
Member

I would probably ignore the entire set of control characters entirely (0-32). If I recall correctly they always stayed the same in meaning throughout all the iso-8859 family and Unicode.

Also IIRC if we only find byte values <128 it could be both utf8 and latin1, so both should be accepted.
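
For illustration (a sketch of mine, not from the PR): a buffer containing only byte values below 128 is valid ASCII, valid UTF-8, and valid latin1 at the same time, so neither charset can be ruled out.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCII reports whether every byte is below 0x80. Such data is valid under
// both utf-8 and latin1, so a checker can accept it for either charset.
func isASCII(data []byte) bool {
	for _, b := range data {
		if b >= 0x80 {
			return false
		}
	}
	return true
}

func main() {
	data := []byte("plain ASCII text\n")
	fmt.Println(isASCII(data), utf8.Valid(data)) // true true
}
```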

@ccoVeille
Contributor

I looked for a lib for that, more to find examples and ideas than to import one.

I found this:

https://github.com/softlandia/cpd
License: Apache 2.0

I didn't check how they handle chars 0-32, but I can tell their test files are interesting. I like that it considers codepages I wouldn't have thought about.

Does anyone know another lib we could look at?

@rasa
Contributor Author
rasa commented Apr 1, 2025

I found this

https://github.com/softlandia/cpd License Apache 2.0

@ccoVeille Thank you for the suggestion. Unfortunately, of the ISO8859s, it only identifies ISO8859-5, not ISO8859-1, which is what I think is the best match for the latin1 config setting.

@rasa
Contributor Author
rasa commented Apr 1, 2025

I would probably ignore the entire set of control characters entirely (0-32). If I recall correctly they always stayed the same in meaning throughout all the iso-8859 family and Unicode.

@klaernie Sorry, I don't quite follow. The 0-32 characters (other than TAB, FF, LF, CR) are the best way to determine whether a file is text or binary. It's how dos2unix, and many other programs, determine this. And utf-16/32 use bytes 0-32, so I don't see how they have the "same meaning" as they do in a utf-8 or iso-8859-1 file.

Also IIRC if we only find byte values <128 it could be both utf8 and latin1, so both should be accepted.

That's true. The trick is to determine if a file is latin1 (aka iso-8859-1), or some other non-UTF encoding. The solution is clearly non-trivial, and the best solution I've found so far is uchardet.
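
For illustration (a sketch of the heuristic described above, not code from this PR): a dos2unix-style check treats a file as binary if it contains any C0 control byte other than TAB, LF, FF, or CR.

```go
package main

import "fmt"

// looksBinary is the classic text/binary heuristic: any C0 control byte other
// than TAB (0x09), LF (0x0A), FF (0x0C), or CR (0x0D) marks the data as binary.
func looksBinary(data []byte) bool {
	for _, b := range data {
		if b < 0x20 && b != '\t' && b != '\n' && b != '\f' && b != '\r' {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(looksBinary([]byte("hello\tworld\r\n"))) // false
	fmt.Println(looksBinary([]byte{0x00, 0x41, 0x42}))   // true
}
```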

@klaernie
Member
klaernie commented Apr 2, 2025

I would probably ignore the entire set of control characters entirely (0-32). If I recall correctly they always stayed the same in meaning throughout all the iso-8859 family and Unicode.

@klaernie Sorry, I don't quite follow. The 0-32 characters (other than TAB, FF, LF, CR) are the best way to determine whether a file is text or binary. It's how dos2unix, and many other programs, determine this. And utf-16/32 use bytes 0-32, so I don't see how they have the "same meaning" as they do in a utf-8 or iso-8859-1 file.

I was thinking of the first 32 Unicode codepoints being identical in meaning to the first 32 in ASCII - but I didn't consider that, despite being the first 32 codepoints, they are not represented as single bytes with the values 0-32 in utf16 and utf32. In utf8, however, this assumption would hold.
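
For illustration only (a sketch of mine, not from the PR): ordinary, non-control characters already produce byte values below 32 (zero bytes, in fact) in UTF-16, which is why the byte-level heuristic only works for single-byte encodings and UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

func main() {
	// "Hi" contains no control characters, yet its UTF-16LE byte stream is
	// half zero bytes, so "byte value < 32" does not imply a control character
	// outside of single-byte encodings and UTF-8.
	for _, u := range utf16.Encode([]rune("Hi")) {
		fmt.Printf("%02X %02X ", byte(u), byte(u>>8)) // little-endian byte order
	}
	fmt.Println()
	// Output: 48 00 69 00
}
```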

But no matter, you are indeed correct that this is the only chance to differentiate text from binary files. I think I should not spend too much time on GitHub right after getting out of bed, before my first coffee ;)

Thanks a lot for the effort you are putting into this!

@rasa
Contributor Author
rasa commented Apr 2, 2025

The new chardet library works much better than the old one. Here are the only failures:

--- FAIL: TestEqual (1.39s)
    encoding_test.go:283: Equal(): "iso88591.txt": expected: latin1, got: windows1255
    encoding_test.go:105: result={Encoding:Windows-1255 Confidence:0.99 Language:Hebrew} (first 4 bytes: 54686973)
    encoding_test.go:283: Equal(): "utf8-sdl.txt": expected: latin1, got: windows1254
    encoding_test.go:105: result={Encoding:Windows-1254 Confidence:0.5250663680561672 Language:Turkish} (first 4 bytes: 5554462d)
    encoding_test.go:283: Equal(): "utf8.txt": expected: utf8, got: windows1254
    encoding_test.go:105: result={Encoding:Windows-1254 Confidence:0.5132700345166831 Language:Turkish} (first 4 bytes: 7072656d)
FAIL
I need to drill into these.

@rasa rasa force-pushed the rs/support-charset-setting branch 2 times, most recently from 7624a20 to f554002 Compare April 4, 2025 18:09
@rasa
Contributor Author
rasa commented Apr 4, 2025

So I added the 159 testdata files from
https://gitlab.freedesktop.org/uchardet/uchardet/-/tree/master/test?ref_type=heads
of which our new detector, https://github.com/wlynxg/chardet, fails on 31:

testdata/uchardet/bg/windows-1251.txt: got ISO-8859-1, want Windows-1251 (confidence 0.99, language Bulgarian
testdata/uchardet/da/iso-8859-15.txt: got ISO-8859-1, want ISO-8859-15 (confidence 0.73, language 
testdata/uchardet/et/iso-8859-13.txt: got ISO-8859-1, want ISO-8859-13 (confidence 0.73, language 
testdata/uchardet/et/windows-1257.txt: got ISO-8859-1, want Windows-1257 (confidence 0.73, language 
testdata/uchardet/fi/iso-8859-1.txt: got MacRoman, want ISO-8859-1 (confidence 0.7159344894026975, language 
testdata/uchardet/he/iso-8859-8.txt: got ISO-8859-1, want ISO-8859-8 (confidence 0.99, language Hebrew
testdata/uchardet/he/windows-1255.txt: got ISO-8859-1, want Windows-1255 (confidence 0.9773686833361969, language Hebrew
testdata/uchardet/hr/iso-8859-13.txt: got ISO-8859-1, want ISO-8859-13 (confidence 0.73, language 
testdata/uchardet/hr/iso-8859-16.txt: got ISO-8859-1, want ISO-8859-16 (confidence 0.73, language 
testdata/uchardet/hr/iso-8859-2.txt: got ISO-8859-1, want ISO-8859-2 (confidence 0.73, language 
testdata/uchardet/hu/iso-8859-2.txt: got ISO-8859-1, want ISO-8859-2 (confidence 0.73, language 
testdata/uchardet/lt/iso-8859-10.txt: got ISO-8859-1, want ISO-8859-10 (confidence 0.73, language 
testdata/uchardet/lt/iso-8859-13.txt: got ISO-8859-1, want ISO-8859-13 (confidence 0.73, language 
testdata/uchardet/lt/iso-8859-4.txt: got ISO-8859-1, want ISO-8859-4 (confidence 0.73, language 
testdata/uchardet/lv/iso-8859-10.txt: got ISO-8859-1, want ISO-8859-10 (confidence 0.73, language 
testdata/uchardet/lv/iso-8859-13.txt: got ISO-8859-1, want ISO-8859-13 (confidence 0.73, language 
testdata/uchardet/lv/iso-8859-4.txt: got ISO-8859-1, want ISO-8859-4 (confidence 0.73, language 
testdata/uchardet/mt/iso-8859-3.txt: got ISO-8859-1, want ISO-8859-3 (confidence 0.73, language 
testdata/uchardet/no/iso-8859-15.txt: got ISO-8859-1, want ISO-8859-15 (confidence 0.73, language 
testdata/uchardet/pl/iso-8859-13.txt: got ISO-8859-1, want ISO-8859-13 (confidence 0.6365243902439024, language 
testdata/uchardet/pl/iso-8859-16.txt: got ISO-8859-1, want ISO-8859-16 (confidence 0.73, language 
testdata/uchardet/pl/iso-8859-2.txt: got ISO-8859-1, want ISO-8859-2 (confidence 0.73, language 
testdata/uchardet/ro/iso-8859-16.txt: got ISO-8859-1, want ISO-8859-16 (confidence 0.73, language 
testdata/uchardet/sk/iso-8859-2.txt: got ISO-8859-1, want ISO-8859-2 (confidence 0.6586976744186046, language 
testdata/uchardet/sl/iso-8859-16.txt: got ISO-8859-1, want ISO-8859-16 (confidence 0.73, language 
testdata/uchardet/sl/iso-8859-2.txt: got ISO-8859-1, want ISO-8859-2 (confidence 0.73, language 
testdata/uchardet/tr/iso-8859-3.txt: got ISO-8859-1, want ISO-8859-3 (confidence 0.73, language 
testdata/uchardet/tr/iso-8859-9.txt: got ISO-8859-1, want ISO-8859-9 (confidence 0.73, language 
superseded: That's over 20%, so I propose the following solution:

## Charset setting
Our current charset detector accurately identifies `utf-8`, `utf-8-bom`, `utf-16be`, and `utf-16le`
encodings, as well as files that are UTF32 encoded.
Unfortunately, it struggles to correctly identify `latin1` (aka ISO-8859-1) encoded files.
So, by default, we don't check if a file is `latin1` encoded. If you want to enable this check,
you will need to add the following to your configuration file:
```json
{
  ...
  "Charsets": {
    "Latin1": 50
  }
  ...
}
```
In the example above, the number `50` is the minimum confidence level (between 0 and 100)
required to report that a file is indeed `latin1` encoded. A higher number requires more confidence,
and a lower number requires less confidence.
A value of `0` disables the `latin1` charset check.
Since our charset detector accurately identifies `utf-8`, `utf-8-bom`, `utf-16be`, and `utf-16le`,
these checks are enabled by default, with a default confidence factor of 50. If you are seeing files
being identified incorrectly, you can disable a charset check by adding any of the
following entries to your configuration file:
```json
{
  ...
  "Charsets": {
    "UTF8": 0,
    "UTF8BOM": 0,
    "UTF16BE": 0,
    "UTF16LE": 0
  }
  ...
}
```
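
For illustration, here is a sketch of how such a threshold could be applied; the detection type, field names, and accepted helper below are hypothetical, not the PR's actual API:

```go
package main

import "fmt"

// detection is a hypothetical detector result; detectors such as
// wlynxg/chardet report a confidence between 0 and 1.
type detection struct {
	Charset    string
	Confidence float64 // 0.0 - 1.0
}

// accepted reports whether a detected charset meets the minimum confidence
// (0-100) configured under "Charsets"; a value of 0 disables the check.
func accepted(d detection, charsets map[string]int) (ok bool, checked bool) {
	minConf, found := charsets[d.Charset]
	if !found || minConf == 0 {
		return false, false // check disabled for this charset
	}
	return int(d.Confidence*100) >= minConf, true
}

func main() {
	cfg := map[string]int{"Latin1": 50}
	ok, checked := accepted(detection{Charset: "Latin1", Confidence: 0.73}, cfg)
	fmt.Println(ok, checked) // true true
}
```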

@klaernie
Member
klaernie commented Apr 4, 2025

Reading the tests in https://github.com/editorconfig/editorconfig-plugin-tests, I'm pretty sure either all the ISO-8859 variants are grouped as latin1, or the understanding of latin1 means pure ASCII.

So I think I would reduce the ISO8859 variants to all be treated as latin1 - assuming that a user specifying latin1 in .editorconfig intends to accept all variants of ISO8859, specifically the variant they use locally - no matter if it is ISO8859-1 or ISO8859-15.

The more important use case, IMHO, is that we correctly identify UTF8, 16 and 32 in both their endiannesses and detect a BOM. This will be the more frequent use case for people wanting to ensure their codebase is up to a modern standard and not introducing files containing what I'd call legacy encodings.

@rasa
Contributor Author
rasa commented Apr 4, 2025

@klaernie Good feedback. I thought of that too. But what led me to think latin1 means ISO-8859-1 is https://github.com/editorconfig/editorconfig/wiki/Character-Set-Support, which links latin1 to https://en.wikipedia.org/wiki/ISO/IEC_8859-1. That page says it's "Latin alphabet no. 1", and https://www.iana.org/assignments/character-sets/character-sets.xhtml also lists latin1 as an alias for ISO-8859-1.

But I think you're right. By default, we interpret latin1 to mean "any text file that's not a UTF* encoding". edit: But perhaps we add an option to interpret latin1 as ISO-8859-1. Perhaps something like:

outdated:
{ 
  ... 
  "Charsets": {
    "Latin1": ["ISO-8859-1"]
  }
  ... 
} 

And if the user adds this to their config file, we'll report charset mismatches for latin1.

Thoughts?

edit: Aside: I would assume identifying UTF16 files is 100% accurate, but the chardet library failed on two UTF16 files, so I added some extra checks here.

But note that https://en.wikipedia.org/wiki/Byte_order_mark#UTF-16 says my logic "can result in both false positives and false negatives."
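
To make the false-positive case concrete (an example of mine, not from the PR): a valid latin1 file that happens to start with the characters "ÿþ" (bytes 0xFF 0xFE) is indistinguishable from a UTF-16LE BOM by prefix sniffing alone.

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	// latin1 text beginning with "ÿþ" carries the exact byte prefix used as
	// the UTF-16LE BOM, so BOM sniffing alone would mislabel this file.
	latin1 := []byte{0xFF, 0xFE, 'o', 'o', 'p', 's'}
	fmt.Println(bytes.HasPrefix(latin1, []byte{0xFF, 0xFE})) // true
}
```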

I guess if the user has files that fail our checks, they can always exclude them.

@rasa rasa force-pushed the rs/support-charset-setting branch 2 times, most recently from b525ba2 to c8905c0 Compare April 7, 2025 00:40
@rasa rasa marked this pull request as ready for review April 7, 2025 00:41
@rasa rasa force-pushed the rs/support-charset-setting branch from c8905c0 to fb7a7c7 Compare April 7, 2025 00:43
@rasa rasa force-pushed the rs/support-charset-setting branch from fb7a7c7 to 6f1c4bc Compare April 7, 2025 01:11
@rasa
Contributor Author
rasa commented Apr 7, 2025

Note: I had to add "application/octet-stream" to validation.go's textRegexes list for things to work as expected.

Now it's in sync (again) with pkg/config/config.go.

I'm not sure how things worked without this.

codecov bot commented Apr 7, 2025

Codecov Report

Attention: Patch coverage is 86.40777% with 28 lines in your changes missing coverage. Please review.

Project coverage is 87.37%. Comparing base (af09b21) to head (1fed588).
Report is 48 commits behind head on main.

Files with missing lines Patch % Lines
pkg/encoding/encoding.go 86.13% 20 Missing and 8 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #457      +/-   ##
==========================================
+ Coverage   86.72%   87.37%   +0.65%     
==========================================
  Files          11       11              
  Lines        1017     1228     +211     
==========================================
+ Hits          882     1073     +191     
- Misses        102      120      +18     
- Partials       33       35       +2     


@klaernie
Member
klaernie commented Apr 7, 2025

@klaernie Good feedback. I thought of that too. But what led me to think latin1 means ISO-8859-1 is https://github.com/editorconfig/editorconfig/wiki/Character-Set-Support, which links latin1 to https://en.wikipedia.org/wiki/ISO/IEC_8859-1. That page says it's "Latin alphabet no. 1", and https://www.iana.org/assignments/character-sets/character-sets.xhtml also lists latin1 as an alias for ISO-8859-1.

I did not find the wiki page - that uncovers the hints I failed to notice.

So we should implement latin1 as strictly ISO-8859-1, and not support any of the other ISO-8859-* variants for now, as editorconfig itself does not list them as supported. According to the spec, we must ignore any value we do not implement and treat it as charset = unset, so there is no harm in not supporting the other variants.

This also underpins the sentiment I get from the editorconfig wiki - I read it as "use unicode as a first choice" - which I personally also find as the most reasonable.

I'll hopefully get to review the code later, but that might not be today.

@rasa
Contributor Author
rasa commented Apr 7, 2025

So we should implement latin1 as strictly ISO-8859-1, and not support any of the other ISO-8859-* variants for now, as editorconfig itself does not list them as supported. According to the spec, we must ignore any value we do not implement and treat it as charset = unset, so there is no harm in not supporting the other variants.

I hear you. I initially thought so as well, but you convinced me to be more lenient given our false positives in identifying iso-8859-1 files.

So, note that
https://github.com/editorconfig/editorconfig/wiki/Character-Set-Support#supported-character-sets says

Other character sets could be specified outside of this set and they would be ignored if not understood by the editor.

Since we are not an editor, I think we can include support for other character sets. The spec says that latin1 and the utf*s are what "all plugins should attempt to support at a minimum." It doesn't say we can't support others.

This also underpins the sentiment I get from the editorconfig wiki - I read it as "use unicode as a first choice" - which I personally also find as the most reasonable.

Agreed. And I can see many people using editorconfig-checker to help them migrate their legacy files to Unicode.

I'll hopefully get to review the code later, but that might not be today.

edit: I don't intend to make any more changes, but take your time. It's a big change.

superseded:

Take your time. I feel it's ready for review, but there are three small changes I am considering:

  1. Add a test for encoding.Supported which I somehow overlooked.
  2. Use sparse-checkout in our encoding package's Makefile, and
  3. Use golang/text's ascii encoder for ASCII file decoding.


const defaultConfidence = 1

const testResultsJson = "test-results.json"
Member

do I understand this correctly as being a test snapshot file? If so, why not reuse snaps - there the order of the tests would not matter?

Contributor Author

To be honest, I haven't worked with go-snaps before. Are you suggesting we should here?

Note: If EDITORCONFIG_ADD_NEW_FILES=1 is set in the environment, the test suite will scan for new files in testdata and add them to test-results.json. Otherwise the suite runs only on the testdata files listed in the file.

I found the file very helpful in my debugging, as I could run a git diff after a run to see if anything changed. That was a lot easier than scanning the log output for hundreds of files.

Member

okay, so basically you implement the reverse behaviour of snaps. Snaps will add new snapshots, but never change existing snapshots unless UPDATE_SNAPS is set to a true value.

In this use case I would think using snaps.MatchStandaloneJSON(t, someValue) would be better.
https://github.com/gkampitakis/go-snaps?tab=readme-ov-file#matchjson

This would mean:

  • always scan for testfiles
  • during teardown call snaps.MatchStandaloneJSON() for each test. One might argue that matching the snapshot inline would be easier, but right now the test architecture touches the central tests slice multiple times, if I understand it correctly.

snaps itself will then generate test failures when a previously created snapshot is not matched.
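
A rough sketch of what that could look like (the test name, result type, and scanning step are hypothetical; MatchStandaloneJSON and the UPDATE_SNAPS behaviour are as described above and in the linked go-snaps README):

```go
package encoding_test

import (
	"encoding/json"
	"testing"

	"github.com/gkampitakis/go-snaps/snaps"
)

// result is a hypothetical per-file outcome collected by the suite.
type result struct {
	File    string `json:"file"`
	Charset string `json:"charset"`
}

func TestDetectEncodings(t *testing.T) {
	// In the real suite these would come from always scanning testdata/.
	results := []result{
		{File: "testdata/utf8.txt", Charset: "utf8"},
		{File: "testdata/iso88591.txt", Charset: "latin1"},
	}
	b, err := json.Marshal(results)
	if err != nil {
		t.Fatal(err)
	}
	// go-snaps writes the snapshot on the first run and fails the test when a
	// later run no longer matches it, unless UPDATE_SNAPS=true is set.
	snaps.MatchStandaloneJSON(t, string(b))
}
```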

UnknownEncoding = "unknown"

// See https://spec.editorconfig.org/#supported-pairs
// CharsetUnset defines the value allowing for file encoding.

charsetFound = "utf8bom"
}

if !supported(charsetFound) {
Contributor Author

supportedUTFEncoding() would be a more apt name.

}

// We need to check for UTF16/32 encodings first, as
// UTF16/32 encoded first can be valid UTF8 files (surprisingly).
Contributor Author

s/first/files/

@rasa
Contributor Author
rasa commented Apr 20, 2025

I think the test files at
https://github.com/arthenica/libiconv/tree/master/tests and
https://github.com/pa-0/dos2unix/tree/master/dos2unix/test
may be of higher quality, so I am thinking of adding them to our testdata.

If code review hasn't started, lemme know, and I'll set this to draft status, and add them, as well as the comments noted above.

@@ -1,5 +1,6 @@
ifeq ($(OS),Windows_NT)
STDERR=con
EXEEXT=.exe
Member

where did that come from?

Contributor Author

@klaernie Windows executables need the .exe extension for Windows to execute them.

Comment on lines +48 to +53
testnorace: ## Run test suite without -race which requires cgo
go test -coverprofile=coverage.txt -covermode=atomic ./...
go test -trimpath -coverprofile=coverage.txt -covermode=atomic ./...
go vet ./...
@test -z $(shell gofmt -s -l . | tee $(STDERR)) || (echo "[ERROR] Fix formatting issues with 'gofmt'" && exit 1)

Member

Is that a requirement for the tests to pass, or just to work around a limitation on your dev machine?

Contributor Author

Not a requirement. Just needed to run the tests when CGO is not available.

@klaernie
Member

I think the test files at https://github.com/arthenica/libiconv/tree/master/tests and https://github.com/pa-0/dos2unix/tree/master/dos2unix/test may be of higher quality, so I am thinking of adding them to our testdata.

I wonder if there is an optimal set instead of collecting test files from everywhere. But currently I'm on neither side of the fence, so feel free to add them.

If code review hasn't started, lemme know, and I'll set this to draft status, and add them, as well as the comments noted above.

There is no binary state of code review, feel free to make changes as you see fit. We maintainers would be stupid to keep you from iterating towards the best solution - after all you're doing the hard work right now, and I'm very thankful for that!


Generally I'm a bit wary of the huge implementation of the test cases, since I still haven't wrapped my head around it fully. It seems fairly complicated, but probably isn't. But I think if you convert this from storing tests in a json file to matching snapshots it might become clearer.

@rasa rasa marked this pull request as draft April 30, 2025 04:34
@rasa
Contributor Author
rasa commented Apr 30, 2025

The code, I believe, is production ready, but I'm converting this to a draft to revise the test framework and explore using snapshots, as @klaernie suggested. Please be patient, as I haven't worked with this tooling before. I think using only the highest-quality test files makes sense too.

Development

Successfully merging this pull request may close these issues.

Support checking all files' text encodings, if declared to be UTF-8
3 participants