feat: support chardet config file setting #457
base: main
Conversation
@ccoVeille Yeah, hold off for a bit. I'm gonna toss chardet and use our own code to determine if a file is latin1, utf-8, utf-8-bom, utf-16be or utf-16le. Here's a start:

```
package main

import (
	// ...
)

func detectEncoding(data []byte) string {
	// ...
}

func isValidUTF16LE(data []byte) bool { /* ... */ }

func isValidUTF16BE(data []byte) bool { /* ... */ }

func isLikelyLatin1(data []byte) bool {
	// ...
}

func main() {
	// ...
}
```
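For illustration, here is a minimal sketch of what a BOM-plus-validity detector along those lines could look like. The helper name follows the skeleton above, but the heuristics (BOM prefixes, falling back to latin1) are my assumptions, not the code this PR actually ships:

```
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// detectEncoding guesses an encoding from BOMs first, then falls back to a
// UTF-8 validity check. Illustrative sketch only; note that pure-ASCII input
// is reported as utf-8 even though it is equally valid latin1.
func detectEncoding(data []byte) string {
	switch {
	case bytes.HasPrefix(data, []byte{0xEF, 0xBB, 0xBF}):
		return "utf-8-bom"
	case bytes.HasPrefix(data, []byte{0xFF, 0xFE}):
		return "utf-16le"
	case bytes.HasPrefix(data, []byte{0xFE, 0xFF}):
		return "utf-16be"
	case utf8.Valid(data):
		return "utf-8"
	default:
		return "latin1" // assume a single-byte encoding as a last resort
	}
}

func main() {
	fmt.Println(detectEncoding([]byte{0xEF, 0xBB, 0xBF, 'h', 'i'})) // utf-8-bom
}
```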
I would probably ignore the entire set of control characters (0-32). If I recall correctly, they kept the same meaning throughout the whole iso-8859 family and Unicode. Also, IIRC, if we only find byte values <128 it could be both utf8 and latin1, so both should be accepted.
I looked for a library for that, more for examples and ideas than for something to import. I found https://github.com/softlandia/cpd. I didn't check how they handle chars 0-32, but I can tell their test files are interesting, and I like that it considers code pages I wouldn't have thought about. Does anyone know another lib we could look at?
@ccoVeille Thank you for the suggestion. Unfortunately, of the ISO8859s, it only identifies ISO8859-5, not ISO8859-1, which is what I think is the best match for the …
@klaernie Sorry, I don't quite follow. The 0-32 characters (other than TAB, FF, LF, CR) are the best way to determine if a file is text or binary. It's how dos2unix, and many other programs, determine this. And utf-16/32 uses 0-32, so I don't see how they have the "same meaning" when in a utf-8 file, or an iso-8859-1 file.
That's true. The trick is to determine if a file is …
I was thinking of the meaning of the first 32 Unicode codepoints being identical to the first 32 in ASCII - but I didn't think about the fact that, despite being the first 32 codepoints, they are not represented as bytes with the values 0-32 in utf16 and utf32. In utf8, however, this assumption would hold. But no matter, you are indeed correct that this is the only chance to differentiate text from binary files. I think I should not spend too much time on GitHub having just gotten out of bed before my first coffee ;) Thanks a lot for the effort you are putting into this!
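For reference, a minimal sketch of the text-vs-binary heuristic discussed in this thread (illustrative only, not this PR's code; as noted above it only makes sense once the UTF-16/32 cases have already been handled, since those encodings legitimately contain low bytes):

```
package main

import "fmt"

// isProbablyBinary reports whether data contains control bytes below 32 other
// than TAB, LF, FF and CR - the heuristic dos2unix and similar tools use to
// tell text from binary. Sketch only; not the PR's implementation.
func isProbablyBinary(data []byte) bool {
	for _, b := range data {
		if b < 32 {
			switch b {
			case '\t', '\n', '\f', '\r':
				// control characters that legitimately appear in text files
			default:
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(isProbablyBinary([]byte("plain text\n")))      // false
	fmt.Println(isProbablyBinary([]byte{0x00, 0x01, 'h', 'i'})) // true
}
```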
The new chardet library works much better than the old one. Here are the only failures: …
Force-pushed from 7624a20 to f554002
So I added the 159 testdata files from …
(superseded) That's over 20%, so I propose the following solution: editorconfig-checker/README.md, lines 378 to 420 in dde3580
Reading the tests in https://github.com/editorconfig/editorconfig-plugin-tests, I'm pretty sure all the ISO-8859 variants are grouped as … So I think I would reduce the ISO8859 variants to all be treated as … The more important use case should IMHO be that we correctly identify UTF-8, -16 and -32 in both their endiannesses and detect a BOM. This will be the more frequent use case for people wanting to ensure their codebase is up to a modern standard and is not introducing files containing what I'd call legacy encodings.
@klaernie Good feedback. I thought of that too. But what led me to think … But I think you're right. By default, we interpret outdated …

{
  ...
  "Charsets": {
    "Latin1": ["ISO-8859-1"]
  }
  ...
}

Thoughts?

Edit: But note that https://en.wikipedia.org/wiki/Byte_order_mark#UTF-16 says my logic "can result in both false positives and false negatives." I guess if the user has files that fail our checks, they can always exclude them.
Force-pushed from b525ba2 to c8905c0
Force-pushed from c8905c0 to fb7a7c7
Force-pushed from fb7a7c7 to 6f1c4bc
Note: I had to add … Now it's in sync (again) with pkg/config/config.go. I'm not sure how things worked without this.
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## main #457 +/- ##
==========================================
+ Coverage 86.72% 87.37% +0.65%
==========================================
Files 11 11
Lines 1017 1228 +211
==========================================
+ Hits 882 1073 +191
- Misses 102 120 +18
- Partials 33 35 +2
View full report in Codecov by Sentry.
I did not find the wiki page - that uncovers the hints I failed to notice. So we should implement … This also underpins the sentiment I get from the editorconfig wiki - I read it as "use unicode as a first choice" - which I personally also find the most reasonable. I'll hopefully get to review the code later, but that might not be today.
I hear you. I initially thought so as well, but you convinced me to be more lenient given our false positives in identifying … So, note that …
Since we are not an editor, I think we can include support for other character sets. The spec says that …
Agreed. And I can see many people using editorconfig-checker to help them migrate their legacy files to Unicode.
Edit: I don't intend to make any more changes, but take your time. It's a big change.
(superseded) Take your time. I feel it's ready for review, but there are three small changes I am considering: …
const defaultConfidence = 1
const testResultsJson = "test-results.json"
do I understand this correctly as being a test snapshot file? If so, why not reuse snaps - there the order of the tests would not matter?
To be honest, I haven't worked with go-snaps before. Are you suggesting we should here?
Note: If EDITORCONFIG_ADD_NEW_FILES=1 is set in the environment, the test suite will scan for new files in testdata and add them to test-results.json. Otherwise the suite runs only on the testdata files listed in the file.
I found the file very helpful in my debugging, as I could run a git diff after a run to see if anything changed. That was a lot easier than scanning the log output for hundreds of files.
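Roughly, the workflow described above would look like this (using the variable and file names from this thread):

```
# run only against the files already listed in test-results.json
go test ./...

# also pick up new files under testdata/ and record them in test-results.json,
# then inspect what changed
EDITORCONFIG_ADD_NEW_FILES=1 go test ./...
git diff test-results.json
```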
Okay, so basically you implement the reverse behaviour of snaps. Snaps will add new snapshots, but never change existing snapshots unless UPDATE_SNAPS is set to a true value.
In this use case I would think using snaps.MatchStandaloneJSON(t, someValue) would be better.
https://github.com/gkampitakis/go-snaps?tab=readme-ov-file#matchjson
This would mean:
- always scan for testfiles
- during teardown, call snaps.MatchStandaloneJSON() for each test. Although one might argue that matching the snapshot inline would be easier; right now the test architecture touches the central tests slice multiple times, if I understand it correctly.
snaps itself will then generate test failures when a previously created snapshot is not matched.
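To make the suggestion concrete, here is a rough sketch of how that could look with go-snaps. Everything here is an assumption about this codebase: detectCharset is only a stand-in for the PR's real detection function, and the package name, test name, and testdata layout are made up for illustration.

```
package validation_test

import (
	"os"
	"path/filepath"
	"testing"
	"unicode/utf8"

	"github.com/gkampitakis/go-snaps/snaps"
)

// detectCharset is only a placeholder so the example compiles; the real test
// would call the PR's detection code instead.
func detectCharset(data []byte) string {
	if utf8.Valid(data) {
		return "utf-8"
	}
	return "unknown"
}

// TestCharsetSnapshots scans testdata and records one standalone JSON snapshot
// per file. go-snaps writes the snapshot on the first run and fails the test
// if a later run produces a different value, unless UPDATE_SNAPS is set.
func TestCharsetSnapshots(t *testing.T) {
	files, err := filepath.Glob(filepath.Join("testdata", "*"))
	if err != nil {
		t.Fatal(err)
	}
	for _, file := range files {
		t.Run(filepath.Base(file), func(t *testing.T) {
			data, err := os.ReadFile(file)
			if err != nil {
				t.Fatal(err)
			}
			snaps.MatchStandaloneJSON(t, map[string]string{
				"file":    filepath.Base(file),
				"charset": detectCharset(data),
			})
		})
	}
}
```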
UnknownEncoding = "unknown"
// See https://spec.editorconfig.org/#supported-pairs
// CharsetUnset defines the value allowing for file encoding.
We could use the constants at https://github.com/editorconfig/editorconfig-core-go/blob/7404f9a31780afcfa5bc3a62697476e720b39c5d/editorconfig.go#L36 instead.
charsetFound = "utf8bom"
}
if !supported(charsetFound) { |
supportedUTFEncoding() would be a more apt name.
}
// We need to check for UTF16/32 encodings first, as
// UTF16/32 encoded first can be valid UTF8 files (surprisingly).
s/first/files/
I think the test files at … If code review hasn't started, lemme know, and I'll set this to draft status and add them, as well as the comments noted above.
@@ -1,5 +1,6 @@
ifeq ($(OS),Windows_NT)
STDERR=con
EXEEXT=.exe
where did that come from?
@klaernie Windows executables need an .exe extension for Windows to execute them.
testnorace: ## Run test suite without -race which requires cgo
go test -coverprofile=coverage.txt -covermode=atomic ./...
go test -trimpath -coverprofile=coverage.txt -covermode=atomic ./...
go vet ./...
@test -z $(shell gofmt -s -l . | tee $(STDERR)) || (echo "[ERROR] Fix formatting issues with 'gofmt'" && exit 1)
Is that a requirement for the tests to pass, or just to work around a limitation on your dev machine?
Not a requirement. Just needed to run the tests when CGO is not available.
I wonder if there is an optimal set instead of collecting test files from everywhere. But currently I'm on neither side of the fence, so feel free to add them.
There is no binary state of code review; feel free to make changes as you see fit. We maintainers would be stupid to keep you from iterating towards the best solution - after all, you're doing the hard work right now, and I'm very thankful for that! Generally I'm a bit wary of the huge implementation of the test cases, since I still haven't wrapped my head around it fully. It seems fairly complicated, but probably isn't. But I think if you convert this from storing tests in a JSON file to matching snapshots it might become clearer.
The code, I believe, is production ready, but I'm converting this to a draft to revise the test framework and explore using snapshots, as @klaernie suggested. Please be patient, as I haven't worked with this tooling before. I think only using the highest quality test files makes sense too.
Fixes #40
no longer applicable
Per [here](https://github.com//pull/457/files#diff-2a0547966abe4b6fcace630584fe01fee0d2498396cc80c8ee2a36c0d46fae28R202):

```
// The below file fails the test, but it may not be a valid UTF-16LE file.
// For example, the Linux file command doesn't identify the file as
// "Unicode text, UTF-16, little-endian text"
// but simply
// "data"
// but since the file is from
// https://cs.opensource.google/go/x/text/+/master:encoding/testdata/
// I think it's correct to fail the test, and fix the chardet package.
{"candide-utf-16le.txt", "utf16le"},
```