8000 cut: add whitespace option for separating fields by TechHara · Pull Request #4232 · uutils/coreutils · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

cut: add whitespace option for separating fields #4232

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
Jan 1, 2023

Conversation

TechHara
Copy link
Contributor
@TechHara TechHara commented Dec 13, 2022

TLDR: "-w" flag in cut is an extension option implemented in FreeBSD. This option proves to be useful when cutting fields separated by one or more whitespaces. GNU coreutils does not currently implement this option. This PR implements -w option and adds test cases. The performance difference compared to FreeBSD's cut implementation is about 9x.

Example use case: "ls" or "ps" returns columns separated by multiple inconsistent whitespaces. We can use -w option to select columns, e.g. "ps | cut -w -f1" to print only pids.

Implementation: For best performance, I created a new function (cut_fields_whitespace) dedicated for -w option. Performance of "cut -f" before and after this PR is essentially the same, as show below:

Before

$ hyperfine -w3 "./target/release/coreutils cut -f2-4,8 data.tsv"
Benchmark 1: ./target/release/coreutils cut -f2-4,8 data.tsv
  Time (mean ± σ):      1.126 s ±  0.006 s    [User: 1.035 s, System: 0.090 s]
  Range (min … max):    1.118 s …  1.138 s    10 runs

$  hyperfine -w3 "./target/release/coreutils cut -f2-4,8 -d' ' data.tsv"
Benchmark 1: ./target/release/coreutils cut -f2-4,8 -d' ' data.tsv
  Time (mean ± σ):      1.026 s ±  0.003 s    [User: 0.940 s, System: 0.086 s]
  Range (min … max):    1.023 s …  1.031 s    10 runs

After

$ hyperfine -w3 "./target/release/coreutils cut -f2-4,8 data.tsv"
Benchmark 1: ./target/release/coreutils cut -f2-4,8 data.tsv
  Time (mean ± σ):      1.127 s ±  0.021 s    [User: 1.037 s, System: 0.089 s]
  Range (min … max):    1.110 s …  1.169 s    10 runs

$ hyperfine -w3 "./target/release/coreutils cut -f2-4,8 -d' ' data.tsv"
Benchmark 1: ./target/release/coreutils cut -f2-4,8 -d' ' data.tsv
  Time (mean ± σ):      1.045 s ±  0.026 s    [User: 0.956 s, System: 0.088 s]
  Range (min … max):    1.025 s …  1.103 s    10 runs

The performance of "cut -w -f" compared to FreeBSD is about 9x faster:

$ hyperfine -w3 "./target/release/coreutils cut -f2-4,8 -w data.tsv" "/usr/bin/cut -f2-4,8 -w data.tsv"
Benchmark 1: ./target/release/coreutils cut -f2-4,8 -w data.tsv
  Time (mean ± σ):      1.242 s ±  0.003 s    [User: 1.153 s, System: 0.088 s]
  Range (min … max):    1.240 s …  1.249 s    10 runs

Benchmark 2: /usr/bin/cut -f2-4,8 -w data.tsv
  Time (mean ± σ):     11.529 s ±  0.089 s    [User: 11.385 s, System: 0.131 s]
  Range (min … max):   11.451 s … 11.731 s    10 runs

Summary
  './target/release/coreutils cut -f2-4,8 -w data.tsv' ran
    9.28 ± 0.08 times faster than '/usr/bin/cut -f2-4,8 -w data.tsv'

@TechHara TechHara marked this pull request as draft December 13, 2022 05:33
@TechHara TechHara marked this pull request as ready for review December 13, 2022 05:48
Copy link
Contributor
@sylvestre sylvestre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not
could you please add a comment in
docs/src/extensions.md
about this difference

@sylvestre
Copy link
Contributor

btw, do you know where are the cut tests in freebsd?

@tertsdiepraam
Copy link
Member

I think they're here: https://github.com/freebsd/freebsd-src/tree/373ffc62c158e52cde86a5b934ab4a51307f9f2e/usr.bin/cut/tests

Though this is not very extensive.

@TechHara
Copy link
Contributor Author

@TechHara TechHara requested a review from sylvestre December 14, 2022 03:00
@TechHara TechHara requested a review from sylvestre December 14, 2022 09:12
@github-actions
Copy link

GNU testsuite comparison:

GNU test failed: tests/tail-2/inotify-dir-recreate. tests/tail-2/inotify-dir-recreate is passing on 'main'. Maybe you have to rebase?

@TechHara
Copy link
Contributor Author

Hi @sylvestre, is there anything else that needs to be addressed in the PR?

@github-actions
Copy link

GNU testsuite comparison:

GNU test failed: tests/tail-2/inotify-dir-recreate. tests/tail-2/inotify-dir-recreate is passing on 'main'. Maybe you have to rebase?

@TechHara
Copy link
Contributor Author

GNU testsuite comparison:

GNU test failed: tests/tail-2/inotify-dir-recreate. tests/tail-2/inotify-dir-recreate is passing on 'main'. Maybe you have to rebase?

This is not related to this PR and seems to be failing in other PRs as well

@TechHara
Copy link
Contributor Author

The CICD style/lint failures come from the code that I didn't modify. Not sure why they are failing in this PR but not in other PRs. As for spelling error, should I rename multispace to multi_space perhaps?

@sylvestre
Copy link
Contributor

for multispace, add it to the top of the file in the ignore spell list :)

@TechHara
Copy link
Contributor Author

for multispace, add it to the top of the file in the ignore spell list :)

done!

Copy link
Member
@tertsdiepraam tertsdiepraam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool feature! I left a few comments for a bit cleaner code.

@github-actions
Copy link

GNU testsuite comparison:

GNU test failed: tests/misc/timeout. tests/misc/timeout is passing on 'main'. Maybe you have to rebase?

@TechHara TechHara requested review from tertsdiepraam and removed request for sylvestre December 21, 2022 01:51
Copy link
Member
@tertsdiepraam tertsdiepraam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more small improvements and then I think this is ready!

@github-actions
Copy link

GNU testsuite comparison:

Congrats! The gnu test tests/tail-2/inotify-dir-recreate is no longer failing!

@TechHara
Copy link
Contributor Author

Hi @tertsdiepraam just wanted to check in to see if anything else needs to be addressed with this PR to be merged.

@tertsdiepraam
Copy link
Member

Yeah I think it's good enough now as far as adding the feature goes. I do think cut is due for a refactor that removes some of the duplicated code though.

@tertsdiepraam tertsdiepraam merged commit 36f3507 into uutils:main Jan 1, 2023
@tertsdiepraam
Copy link
Member
tertsdiepraam commented Jan 1, 2023

Oops forgot to squash again. @sylvestre I don't have permissions to remove the merge commit. Could you try to remove it from main. Then I'll do a squash merge instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0