GitHub - BioRadOpenSource/ish: Alignment-based filtering CLI tool

Fast record filtering by alignment-scores 🔥

ish is a CLI tool for searching for matches against records using different alignment methods.

Build

Install pixi
pixi run build
./ish --help

Pixi / Conda install

pixi global install -c conda-forge -c https://repo.prefix.dev/modular-community -c https://conda.modular.com/max ish
# Or
conda install -c conda-forge -c https://repo.prefix.dev/modular-community -c https://conda.modular.com/max ish

For best performance it's recommended to build from source.

Usage

❯ ./ish --help
ish
Search for inexact patterns in files.

ARGS:
	<ARGS (>=1)>...
		Pattern to search for, then any number of files or directories to search.
FLAGS:
	--help <Bool> [Default: False]
		Show help message

	--verbose <Bool> [Default: False]
		Verbose logging output.

OPTIONS:
	--scoring-matrix <String> [Default: ascii]
		The scoring matrix to use.
		ascii: does no encoding of input bytes, matches are 2, mismatch is -2.
		blosum62: encodes searched inputs as amino acids and uses the classic Blosum62 scoring matrix.
		actgn: encodes searched inputs as nucleotides, matches are 2, mismatch is -2, Ns match anything.
		actgn0: encodes searched inputs as nucleotides, matches are 2, mismatch is -2, Ns don't count toward score.


	--score <Float> [Default: 0.8]
		The min score needed to return a match. Results >= this value will be returned. The score is the found alignment score / the optimal score for the given scoring matrix and gap-open / gap-extend penalty.

	--gap-open <Int> [Default: 3]
		Score penalty for opening a gap.

	--gap-extend <Int> [Default: 1]
		Score penalty for extending a gap.

	--match-algo <String> [Default: striped-semi-global]
		The algorithm to use for matching: [striped-local, striped-semi-global]

	--record-type <String> [Default: line]
		The input record type: [line, fastx]

	--threads <Int> [Default: 10]
		The number of threads to use. Defaults to the number of physical cores.

	--batch-size <Int> [Default: 268435456]
		The number of bytes in a parallel processing batch. Note that this may use 2-3x this amount to account for intermediate transfer buffers.

	--max-gpus <Int> [Default: 0]
		The max number of GPUs to try to use. If set to 0 this will ignore any found GPUs. In general, if you have only one query then there won't be much using more than 1 GPU. GPUs won't always be faster than CPU parallelization depending on the profile of data you are working with.

	--output-file <String> [Default: /dev/stdout]
		The file to write the output to, defaults to stdout.

	--sg-ends-free <String> [Default: FFTT]
		The ends-free for semi-global alignment, if used. The free ends are: (query_start, query_end, target_start, target_end). These must be specified with a T or F, all four must be specified. By default this target ends are free.

# Some actual usage.
❯ ./ish blosum62 ./ish_bench_aligner.mojo 
./ish_bench_aligner.mojo:94             default_value=String("Blosum50"),
./ish_bench_aligner.mojo:96                 "Scoring matrix to use. Currently supports: [Blosum50,"
./ish_bench_aligner.mojo:97                 " Blosum62, ACTGN]"
./ish_bench_aligner.mojo:379     if matrix_name == "Blosum50":
./ish_bench_aligner.mojo:380         matrix = ScoringMatrix.blosum50()
./ish_bench_aligner.mojo:381     elif matrix_name == "Blosum62":
./ish_bench_aligner.mojo:382         matrix = ScoringMatrix.blosum62()
./ish_bench_aligner.mojo:390     ## Assuming we are using Blosum50 AA matrix for everything below this for now.

🔥 Note

The filepath:linenumber in the match allows you to cmd-click on the match and have vscode open the file at that location.

Match Methods

striped-semi-global: Striped Semi-global, SIMD accelerated, GPU accelerated when available, supports affine gaps and scoring matrices. Specify ends-free with the --sg-ends-free options.
striped-local: Striped Smith-Waterman, SIMD accelerated, supports affine gaps and scoring matrices.

Record Types

line: match against one line at a time, a-la grep
fastx: match against the sequence portion of FASTA or FASTQ records.

ish-aligner

This is a benchmarking tool based on parasail_aligner.

⚠️ Warning

ish-aligner and all variations of it are for development purposes only.

Future Work

Support multiple queries
Choose a better default between cpu and gpu / think about more. GPU crushes on big files / long running / many files, cpu is faster for small jobs
Add ability to not skip dotfiles

Rattler Build

For testing the build process for modular-community

pixi global install rattler-build
rattler-build build -c https://repo.prefix.dev/modular-community -c https://conda.modular.com/max -c conda-forge --skip-existing=all -r ./recipe.yaml

Name		Name	Last commit message	Last commit date
Latest commit History 107 Commits
.circleci		.circleci
.github		.github
benchmarking		benchmarking
ishlib		ishlib
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
NOTICE.txt		NOTICE.txt
README.md		README.md
ish_logo.png		ish_logo.png
main.mojo		main.mojo
pixi.lock		pixi.lock
pixi.toml		pixi.toml
recipe.yaml		recipe.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Build

Pixi / Conda install

Usage

Match Methods

Record Types

ish-aligner

Further Reading

Future Work

Rattler Build

About

Uh oh!

Releases 4

Packages

Languages

License

BioRadOpenSource/ish

Folders and files

Latest commit

History

Repository files navigation

Build

Pixi / Conda install

Usage

Match Methods

Record Types

ish-aligner

Further Reading

Future Work

Rattler Build

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Languages

Packages