
Tags: EwaMarek/cnvkit-cbsMethod

v0.9.6

Version 0.9.6
=============

Much-needed maintenance and bug fixes, for the most part. Some key dependencies
have changed, though this should be generally painless for you, and one or two
regressions introduced by recent optimizations have been fixed.

This will be the last CNVkit version to run on Python 2.7. The next major
release of pandas (0.25.0) will remove support for Python 2.7, and once that
happens it will become increasingly difficult to install future versions of
CNVkit on Python 2.7 -- so we're not going to try.

The segmentation method `flasso` depends on the R package `cghFLasso`, which is
unmaintained and has been removed from CRAN.  For now, `segment -m flasso` is
still supported if you already have `cghFLasso` installed. But given the above,
`flasso` will be removed from the next CNVkit version in favor of the HMM-based
methods.

Dependencies
------------

- Raised minimum pandas version from 0.18.1 to 0.20.1, and support up to 0.24.2,
  resolving some warnings and an error in pandas 0.22+. (etal#413; thanks @chapmanb)
- The soft dependency on `hmmlearn` is replaced with an explicit dependency on
  `pomegranate` for the HMM-based segmentation methods. This dependency will now
  be pulled in automatically when installing via `pip` or `conda`.
- The R package `cghFLasso` has been removed from CRAN, and therefore is no
  longer a dependency of CNVkit and will not be installed automatically through
  the standard `conda` installation method. (etal#419)

Commands
--------

`antitarget`:

- Be more specific in removing noncanonical chromosomes (e.g. alternate
  contigs, mitochondria) from the binned regions. This avoids skipping
  chromosomes of interest in some non-human genomes with non-numeric contig
  names, like yeast. (etal#388; credit for regexes to @brentp)
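
A minimal sketch of the idea, assuming a deny-list of name patterns rather than a "keep only numeric names" rule. The patterns below are hypothetical and chosen for illustration; they are not the regexes CNVkit actually uses.

```python
import re

# Hypothetical patterns, for illustration only; not CNVkit's actual regexes.
NONCANONICAL = re.compile(
    r"(_alt$|_random$|^chrUn_|_decoy$|^HLA-|^chrM$|^MT$|^chrEBV$)",
    re.IGNORECASE,
)


def is_canonical(contig_name):
    """Return True unless the name matches a known noncanonical pattern."""
    return NONCANONICAL.search(contig_name) is None


contigs = ["chr1", "chrX", "chrM", "chr1_KI270706v1_random",
           "chrUn_GL000195v1", "chrIV", "NC_001133.9"]
# Yeast-style names such as "chrIV" survive because nothing matches them.
kept = [c for c in contigs if is_canonical(c)]
```

Matching specific unwanted names keeps contigs with Roman-numeral or accession-style names, which a filter based on numeric names would drop.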

`coverage`:

- With `--count-reads`, use query aligned length to handle soft-clipped reads
  properly. Now the results with and without this option should be similar.
  (etal#411; thanks @desnar)
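
To illustrate why soft clipping matters here, the sketch below computes a read's aligned query length from its CIGAR string and compares it with the full read length. It parses CIGAR text directly to stay self-contained; pysam exposes a comparable value as `AlignedSegment.query_alignment_length`. The helper names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that consume query bases and are part of the alignment.
# Soft clips ("S") consume query bases but are not aligned, so they are excluded.
ALIGNED_QUERY_OPS = {"M", "I", "=", "X"}


def query_aligned_length(cigar):
    """Number of read bases inside the alignment (soft clips excluded)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar)
               if op in ALIGNED_QUERY_OPS)


def query_length(cigar):
    """Number of read bases stored in the record (soft clips included)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar)
               if op in ALIGNED_QUERY_OPS or op == "S")


clipped = "10S80M2I8M"
aligned = query_aligned_length(clipped)   # 80 + 2 + 8
full = query_length(clipped)              # 10 + 80 + 2 + 8
```

Counting `full` bases for a heavily clipped read would overstate how much of the bin it covers; counting `aligned` bases avoids that.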

`segment`:

- For `-m flasso`, partition array by chromosome to avoid edge effects. (etal#409, etal#412; thanks @giladmishne)
- Removed the deprecated option `--rlibpath`; use `--rscript-path` instead.
- Note that the HMM methods are still provisional. A stable, supported version
  of these methods will be provided in the next CNVkit release.

Python API
----------

- `do_scatter` now returns a figure (etal#408; thanks @jeremy9959)

Bug fixes
---------

- `scatter`: Whole chromosomes can once again be specified with `-c`. (In the
  previous release, a chromosome without coordinates would cause an IndexError.)
  (etal#393)
- `import-rna`: Option --max-log2 can now be specified by users. (Previously,
  only the default value of +3.0 worked.)
- VCF I/O (`skgenome.tabio`): Support GATK 4's VCF files that contain records
  with empty ALT alleles, substituting zero if ALT AD is missing. (etal#391; thanks
  @chapmanb)
- Due to a certain versioning-dependent interaction between numpy, pandas,
  cython, and conda (details [here](numpy/numpy#432)),
  CNVkit may have printed spurious RuntimeWarning messages which could be safely
  ignored. The current release attempts to silence these messages if they occur.
  (etal#390).

v0.9.5

Minor bugfix and usability improvement.

`autobin`:
    Ensure targets are non-empty and match BAM chrom names (closes etal#371)

`segment`:
    segment: Suppress help text for deprecated --rlibpath (etal#317)
    segment: Fix help text display (etal#380)

v0.9.4

Bump version to 0.9.4

v0.9.3

Version 0.9.3

This release fixes a single bug that caused the `segmetrics` command to crash
(etal#325).

Specifically, the command would crash unless at least one option from each of
the following option sets was specified:

- Location statistics: --mean, --median, --mode
- Spread statistics: --stdev, --sem, --mad, --mse, --iqr, --bivar
- Interval statistics: --ci, --pi

This bug would not be triggered by calling `cnvlib.do_segmetrics` through the
Python API, which is why it was not caught in automated testing.

v0.9.2

Version 0.9.2

This release contains a new command `import-rna` to infer coarse-grained copy
number from RNA expression data. (etal#151)

Three new HMM-based segmentation methods are offered: 'hmm', 'hmm-germline', and
'hmm-tumor'. These should be considered experimental and used with caution; the
implementations are likely to change in the next release.

The option `--male-reference` in the commands `batch`, `reference`, `fix`,
`call`, and `export` (at least) has been renamed to `--haploid-x-reference`
everywhere to reduce user confusion. A shim is in place so `--male-reference`
will continue to work.
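
One common way to build such a shim with `argparse` is to register the old spelling as an extra option string on the same argument. This is a sketch of that general technique, not necessarily how CNVkit implements it; the program name, `dest`, and help text are assumptions.

```python
import argparse

parser = argparse.ArgumentParser(prog="example")  # hypothetical program name
parser.add_argument(
    "-y", "--haploid-x-reference", "--male-reference",
    dest="male_reference", action="store_true",
    help="Assume the reference has a haploid X chromosome.")

# Both spellings set the same attribute, so old scripts keep working.
new_style = parser.parse_args(["--haploid-x-reference"])
old_style = parser.parse_args(["--male-reference"])
default = parser.parse_args([])
```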

Documentation, logging, and some error messages are improved.

Thanks to @chapmanb, @MajoroMask, and others for contributing to this release.

Dependencies
------------

- 'pandas' version 0.22 is supported.
- 'pysam' version 0.13.0 is supported.
- 'hmmlearn' version 0.2 is a run-time requirement to use the new HMM-based
  segmentation methods. The rest of CNVkit can be run without it. To ensure the
  right version is installed, install CNVkit with conda as usual, then install
  hmmlearn with pip within the CNVkit conda environment.
- Assume and require pip/setuptools for installation. (This is included with
  stock Python 2.7 and later.)

Scripts
-------

- New script "skg_convert.py" to convert between BED, GATK interval list, GFF,
  VCF, and tabular formats using the 'skgenome.tabio' sub-package, with options
  for simple post-processing.
- Removed the deprecated script refFlat2bed.py. (Use skg_convert.py instead.)

Commands
--------

`access`:

- Drop noncanonical, untargeted contigs/chromosomes by default. This affects
  analyses run from scratch with `batch`, too. (etal#169, etal#299)

`segment`:

- Three new methods can be specified with `-m`: `hmm`, `hmm-germline`, and
  `hmm-tumor`.
- With `-m flasso`, force a breakpoint at centromeres, as was already done for
  the default 'cbs' method.

`reference`:

- The option `--antitargets` is no longer required to build a flat reference.
  Previously, building a flat reference for WGS or TAS required creating an
  empty file to use as antitargets alongside the target BED.
- Print a warning if the sample sex inferred from targets does not match that of
  antitargets. (etal#281)

`scatter`:

- Removed the deprecated, invisible option `--background-marker`. (Use
  `--antitarget-marker` instead.)
- Trendlines should reflect small CNVs better, while preserving overall
  smoothing. The implementation now uses the Savitzky-Golay method instead of a
  Kaiser window, and the smoothing bandwidth is better-tuned. (This can also
  slightly improve outlier filtering in `segment`.)

`export seg`:

- Add option `--enumerate-chroms` to replace chromosome or contig names with
  sequential integers. Previously, this renumbering was always done, following
  some version of the SEG format. But since most tools don't require the contigs
  to be sequential integers, and this behavior causes trouble for users, it's
  now disabled by default. (etal#282)
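
The renumbering that `--enumerate-chroms` opts into can be pictured as follows. This is a sketch under the assumption that integers are assigned 1-based in order of first appearance; the function name and row layout are hypothetical.

```python
def enumerate_chroms(rows):
    """Replace chromosome names with 1-based integers, in order of appearance.

    `rows` is an iterable of (chromosome, start, end, log2) tuples.
    """
    ids = {}
    out = []
    for chrom, start, end, log2 in rows:
        if chrom not in ids:
            ids[chrom] = len(ids) + 1
        out.append((ids[chrom], start, end, log2))
    return out


segments = [("chr1", 0, 5000, 0.1), ("chr1", 5000, 9000, -0.8),
            ("chrX", 0, 7000, 0.0)]
renumbered = enumerate_chroms(segments)
```

With the option off, the chromosome column keeps the original names, which is what most downstream tools expect.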

`gainloss`/`genemetrics`:

- Rename `gainloss` command to `genemetrics`. A shim is in place so `cnvkit.py
  gainloss` will continue to work. (etal#278)
- Report segment- and bin-level weight and probes separately. (etal#107, etal#278)

Bug fixes
---------

- autobin: Require -g/--access for WGS (etal#289)
- batch: Use the "access" regions for the WGS workflow to choose bin size; these
  were previously being ignored, so bin sizes were too large, being based on the
  size of the whole genome, not just sequencing-accessible regions.
- call: Safely handle bins with zero weight when running `call --filter cn`.
  (bcbio/bcbio-nextgen#2112; thanks @chapmanb)
- coverage, guess_baits.py: Handle input BED files containing >4 columns. (etal#301)
- gainloss: Without `-s`, make 'depth' the weighted mean of bins, not just the
  first bin's value.
- segment: Ensure the .cns output file's columns are sorted properly (etal#291)
- vcfio: Don't crash if a record has no ALT values (etal#279)
- tabio:

    - Recognize BED format with decimal in chromosome name (etal#293)
    - Improvements to GFF/GTF/GFF3 parsing. The new options are mostly
      accessible through the Python API and the script 'skg_convert.py'. (etal#311)
    - In 'read_auto' (and all CNVkit commands that take regions as input),
      determine the file format first by checking the file extension and
      verifying the format of the first(-ish) line. Only if that doesn't work,
      fall back to the original method of testing the first(-ish) line against a
      brittle series of regular expressions. (etal#315)
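
The two-stage detection described for 'read_auto' can be sketched like this. The extension map, the line patterns, and the function names are all hypothetical; the real module recognizes more formats and uses its own rules.

```python
import os
import re

# Hypothetical extension map and line patterns, for illustration only.
BY_EXTENSION = {".bed": "bed", ".gff": "gff", ".gtf": "gff", ".vcf": "vcf"}
LINE_PATTERNS = [
    ("vcf", re.compile(r"^\S+\t\d+\t\S+\t[ACGTN]+\t\S+")),
    ("gff", re.compile(r"^\S+\t\S+\t\S+\t\d+\t\d+\t")),
    ("bed", re.compile(r"^\S+\t\d+\t\d+(\t|$)")),
]


def first_data_line(lines):
    """Return the first line that is not blank, a comment, or a track header."""
    for line in lines:
        if line.strip() and not line.startswith(("#", "track", "browser")):
            return line.rstrip("\n")
    return ""


def sniff_format(filename, lines):
    line = first_data_line(lines)
    patterns = dict(LINE_PATTERNS)
    # 1. Trust the extension only if the first data line agrees with it.
    fmt = BY_EXTENSION.get(os.path.splitext(filename)[1].lower())
    if fmt and patterns[fmt].match(line):
        return fmt
    # 2. Otherwise fall back to trying each pattern in turn.
    for fmt, pattern in LINE_PATTERNS:
        if pattern.match(line):
            return fmt
    raise ValueError("unrecognized format: %r" % filename)
```

For example, a GFF file misnamed `ann.bed` fails the BED check in step 1 and is still identified as GFF in step 2.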

Python API
----------

- cnvlib.write: Newly available at the top level to write tabular files (like
  .cnr and .cns), symmetric with 'cnvlib.read()'. The 'cnvlib.tabio' alias to
  'skgenome.tabio' has been removed; to read and write formats other than
  TSV-with-header ('tab'), import and use 'skgenome.tabio' directly.
- CopyNumArray.squash_genes: remove deprecated keyword argument 'squash_background'. Use 'squash_antitarget' instead.
- segmetrics: Move the functions supporting this command from 'cnvlib.command' to
  a new module 'cnvlib.segmetrics'.

v0.9.1

Version 0.9.1

Highlights: Useful enhancements and changes to plotting and segmentation, and a
new script for single-exon CNV testing. Plus, bug fixes and usability
improvements to avoid unexpected errors. (etal#250, etal#255, etal#262, etc.)

Dependencies
------------

- Compatible with the most recent pandas version 0.21.0
  (etal#273, etal#274; thanks @chapmanb)
- R dependencies were reduced to simplify installation

Scripts
-------

- Renamed "cnn_*.py" to "cnv_*.py"
- New script "cnv_ztest.py" to detect single-bin (e.g. single exon) deep
  deletions and high-level amplifications.
- In "cnv_updater.py", rename "Background" (i.e. off-target) bins to
  "Antitarget", in addition to adding a "depth" column if it's missing.

Commands
--------

`autobin`:

- Raise the maximum target/antitarget bin sizes to 50kb/1Mb.

`fix`:

- Allow specifying sample_id via ``--sample-id``/``-id``, in case the input
  coverage filenames do not have the expected form
  "sample_id.targetcoverage.cnn" and "sample_id.antitargetcoverage.cnn".
  (etal#269; thanks @chapmanb)

`segment`:

- Process each chromosome arm separately (with 'cbs' and 'haar', but not
  'flasso'). Centromere locations are guessed from the largest gap between
  sequencing-accessible regions, and are not necessarily the true locations,
  although they do match fairly well on the human genome.
- Logging of dropped bins is streamlined somewhat.
- New method `-m none` to only calculate arm-level segment means (for testing
  and experimentation).
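
The centromere guess described above amounts to finding the widest gap between consecutive accessible regions on a chromosome. A minimal sketch, assuming regions are given as (start, end) tuples for a single chromosome; the function name is hypothetical.

```python
def guess_centromere(regions):
    """Return (gap_start, gap_end) of the largest gap between regions.

    `regions` is a list of (start, end) tuples on one chromosome.
    Returns None if there are no gaps.
    """
    if not regions:
        return None
    regions = sorted(regions)
    best = None
    prev_end = regions[0][1]
    for start, end in regions[1:]:
        gap = start - prev_end
        if gap > 0 and (best is None or gap > best[1] - best[0]):
            best = (prev_end, start)
        # Track the furthest end seen so overlapping regions are handled.
        prev_end = max(prev_end, end)
    return best


accessible = [(10000, 50000), (60000, 120000),
              (3120000, 4000000), (4010000, 5000000)]
centromere = guess_centromere(accessible)
```

Bins on either side of the returned gap would then be segmented as separate arms.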

`scatter`:

- Highlight non-neutral segments from .call.cns. If segments have the columns
  'cn' and potentially also 'cn1' and 'cn2' (as added by the `call` command),
  use those fields to display copy number alterations, LOH and allelic imbalance
  with colorized segments (orange by default), and use gray for neutral
  segments. If a VCF is also given, the same is done for SNVs in the lower
  panel.  Otherwise, all segments are colorized as before. (etal#18, etal#157)
- New option `--by-bins` to display x-axis positions by sequential bin number on
  each chromosome, rather than genomic coordinates. This makes the plots much
  more useful with targeted amplicon sequencing data, or very small gene panels.
  (etal#63)
- Trend line (`--trend`) now accounts for bin weights, which generally results
  in a better fit.
- Improved interaction of -c and -g options:

    - Only apply the window margin (-w) if -g is used alone, or -c specifies a small
      chromosomal region with no genes.
    - Allow an empty gene list (-g '' or -g ',') to prevent highlighting and
      labeling of any genes / small non-genic "Selection" in the -c region.
    - If any gene in -g is not fully within the region specified by -c, name that
      gene and its coordinates in the error message.
    - If the -c region has size <=0, show a specific error message.
    - Handle NaN log2 values when calculating y-axis limits.

`heatmap`:

- Incorporate the `--by-bins` argument to match `scatter`. (etal#63)
- Warn if selected region contains no data for a sample. This helps troubleshoot
  if a chromosome name was mis-specified on the command line. (etal#268)

`export seg`:

- Change column headers to match DNAcopy output. The column headers generally
  don't matter in the SEG format, but the DNAcopy dataframe is considered the
  canonical form.

Python API
----------

- cnvlib.do_segment -- new keyword argument min_weight to drop bins with
  'weight' below the specified value. If not used, then only bins with weight 0
  will be dropped. This feature is not recommended for normal usage and is not
  available on the command line.
- cnvlib.do_scatter -- Remove deprecated keyword argument 'background_marker' in
  favor of 'antitarget_marker', corresponding to `scatter` options deprecated in
  v0.9.0.
- cnvlib.cnary.CopyNumArray: Add method 'smoothed', which calculates the
  trendline displayed by the `scatter` command.
- skgenome.tabio: Add read support for samtools 'dict' format, which resembles the
  plain-text SAM header and can contain chromosome names and sizes.
- skgenome.gary.GenomicArray: Add magic methods __bool__ (Py3) and __nonzero__
  (Py2) to ensure an empty GenomicArray, i.e. 0 rows, is treated as false-ish on
  both Python 2.7 and 3.x.
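
The pattern behind the last item is small enough to show in full. Python 3 consults `__bool__` for truth testing, while Python 2 consults `__nonzero__`, so a class that supports both defines one and aliases the other. The class below is a hypothetical stand-in, not the GenomicArray implementation.

```python
class RegionTable(object):
    """Hypothetical stand-in for a table-like container of genomic regions."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __bool__(self):  # consulted by Python 3
        return len(self.rows) > 0

    __nonzero__ = __bool__  # consulted by Python 2


empty = RegionTable([])
filled = RegionTable([("chr1", 0, 100)])
message = "has data" if filled else "no data"
```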

v0.9.0

Version 0.9.0
=============

In addition to bug fixes, documentation updates, and usability improvements,
this release includes some larger changes:

- The off-target bins in .cnn and .cnr files are now assigned the label
  "Antitarget" instead of "Background" in the "gene" column. The label
  "Background" in existing files will still be handled the same way, but new
  output files generated with CNVkit 0.9.0 and later will use the "Antitarget"
  label -- so, earlier versions of CNVkit may have problems with files produced
  by CNVkit 0.9.0. Some command line options and API keyword arguments similarly
  replace "background" with "antitarget", with shims in place for compatibility
  with existing scripts. (etal#171)

- The sub-packages 'genome' and 'tabio' are now in a separate top-level package
  'skgenome', still included in the CNVkit distribution. (See "Python API"
  below.) This does not affect the command-line usage of CNVkit, but clears the
  way to extract a scikit-genome package that can be installed and used
  separately from CNVkit for computing with genomic intervals.

Documentation
-------------

- Link to example VCF in the test suite
- Describe the 'breaks' command's output columns (etal#220)
- Show an example customizing a plot with pyplot (etal#196)

Dependencies
------------

- pysam: raise minimum to 0.10; support new version 0.11.2.1 (etal#218; thanks
  @chapmanb)
- pandas: support new version 0.20.1 (etal#215)
- numpy: support new version 1.13 (etal#235, etal#238)

Commands
--------

`batch`:

- Log the CNVkit version number at the start of the run
- Print a message at the end if no tumor/test samples specified. (etal#214)
- Clarify error messages for bad option combinations (etal#216)
- Removed deprecated, suppressed/invisible option `--split`. It was a shim in
  the 0.8 series to support old scripts.

`reference`:

- Ensure the inferred chromosomal sex matches between the targets and
  antitargets for the same sample. If the inferences do not match, prefer
  antitargets. (etal#234, etal#237)

`fix`:

- Warn & don't reweight bins if most antitargets have no/low coverage. This
  avoids a variety of surprising downstream problems when the input was
  specified as hybrid capture (the default), but is actually from
  targeted amplicon sequencing, or otherwise has no reads mapped to most
  off-target bins.

`segment`:

- Log the segmentation and p-value/q-value threshold

`call`:

- Add option --center-at
- Let --center w/o argument do 'median'

`diagram`:

- New option `--title` to add a custom title to the top of the generated figure
  (etal#239; thanks @micknudsen)

`export vcf`:

- When given a .cnr file corresponding to the usual segmented input file (.cns),
  emit the CIPOS and CIEND tags in the generated VCF. These indicate the
  "fuzzy" coordinates of segment breakpoints. Here, the ranges are simply the
  widths of the underlying bins adjacent to each segment breakpoint. These tags
  can help meta-methods aggregate/harmonize CNVkit's calls with those of other
  structural variant callers. (etal#72)
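
One plausible reading of this, sketched below, is that each breakpoint's interval spans the bin that touches it: the segment start may lie anywhere in the first bin, and the segment end anywhere in the last bin. In VCF, CIPOS and CIEND are pairs of offsets relative to POS and END. The function name, the input layout, and the choice to extend the intervals inward only are assumptions of this sketch.

```python
def breakpoint_intervals(segment, bins):
    """Return (CIPOS, CIEND) offset pairs for one segment.

    `segment` is (start, end); `bins` is a sorted, non-empty list of
    (start, end) bins covered by the segment.
    """
    seg_start, seg_end = segment
    first_bin, last_bin = bins[0], bins[-1]
    # Start breakpoint: from the segment start to the end of the first bin.
    cipos = (0, first_bin[1] - seg_start)
    # End breakpoint: from the start of the last bin to the segment end.
    ciend = (last_bin[0] - seg_end, 0)
    return cipos, ciend


cipos, ciend = breakpoint_intervals(
    (1000, 5000), [(1000, 1200), (2000, 2300), (4600, 5000)])
info = "CIPOS=%d,%d;CIEND=%d,%d" % (cipos + ciend)
```

Wide bins therefore yield wide intervals, which signals to downstream tools that the breakpoint is only coarsely resolved.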

`import-picard`:

- Don't accept directory as an argument (was deprecated).
- Be a little more flexible in filenames accepted: instead of requiring input
  files to be named `*.targetcoverage.???` or `*.antitargetcoverage.???`, strip
  the full suffix and default to 'targetcoverage.cnn' output suffix, or
  'antitargetcoverage.cnn' if input filename contains 'antitarget'. Works the
  same for filenames following the earlier convention, but now pretty safe for
  amplicon targets with arbitrary filenames, and slightly less spooky.
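
The naming rule can be sketched as follows. It interprets "strip the full suffix" as dropping everything after the first dot in the file name, which is an assumption; the function name is hypothetical.

```python
import os


def output_name(input_path):
    """Derive a .cnn output file name from a Picard coverage file name."""
    base = os.path.basename(input_path)
    sample_id = base.split(".", 1)[0]  # drop everything after the first dot
    if "antitarget" in base:
        suffix = "antitargetcoverage.cnn"
    else:
        suffix = "targetcoverage.cnn"
    return "%s.%s" % (sample_id, suffix)


names = [output_name(p) for p in (
    "picard/Sample1.targetcoverage.csv",
    "picard/Sample1.antitargetcoverage.csv",
    "run42_amplicons.hsmetrics.txt",
)]
```

Files that follow the earlier convention map to the same outputs as before, while arbitrarily named amplicon files default to the target suffix.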

Bug fixes
---------

- `antitarget`: Don't crash if -g/--access is not given (etal#207)
- `batch`: Don't crash in 'wgs' mode when given just targets (-t) without a
  FASTA reference genome sequence (-f)
- `call --filter ampdel`: Drop segments with copy number (`cn` field) between 0
  and 5, exclusive, as the documentation indicates. Previously, it was just
  merging adjacent segments with copy number 1--4, but not dropping them. (etal#222)
- `export cdt`: Match the CDT spec. Fix a regression in which columns could be
  swapped/misaligned versus the header. Add a dummy "EWEIGHT" row to ensure Java
  TreeView starts reading data from the correct line in the file.
- `export theta`: Don't crash on bins where reference is NaN. (etal#168)
- `metrics`, `descriptives`: Handle degenerate/trivial cases consistently. (etal#202)
- `segment`: Handle sample names that are integers with leading zeros (etal#213)
- `sex`: Don't crash if chrX and chrY are both missing (etal#236)
- VCF parsing (`call`, `scatter`, `segment`):
    - Safely handle small or empty VCF files that previously could trigger a
      crash during BAF calculation. Now, with an empty VCF an all-blank "baf"
      will be emitted. (etal#218, etal#224; thanks @chapmanb)
    - Improve handling of Mutect2 VCF files, somewhat. Mutect2 VCFs are still
      not recommended as input to CNVkit; try FreeBayes or GATK HaplotypeCaller
      instead. (etal#195)

Python API
----------

Moved sub-packages 'genome' and 'tabio' to separate top-level package 'skgenome'
(etal#201). The top-level `cnvlib` API is mostly the same otherwise, but supporting
modules were refactored to decouple `skgenome` from `cnvlib` and remove
redundancies. In particular:

- Split module `cnvlib.core` into `skgenome.tabio` and `cnvlib.cmdutil`
- Remove GenomicArray static method `row2label` in favor of functions `to_label`
  and `from_label` in new module `skgenome.rangelabel`.
- The SEG writer in 'tabio' now replaces chromosome names with 1-based integer
  indices, per SEG spec/convention. The `export seg` command now uses this
  writer directly.

Scripts
-------

- Remove the script `coverage_bin_size.py`, previously deprecated in favor of
  the `autobin` command.
- Add `skg_convert.py` to convert between tabular formats.
- Add `cnn_annotate.py` to replace the 'gene' field for each bin in a .cnn or
  .cnr file, given a gene annotation database like refFlat.txt. The need for
  this comes up occasionally when users notice at the end of an analysis that
  vendor-annotated targets are not the desired gene names.

v0.8.5

Version 0.8.5

New 'autobin' command, replacing the script `coverage_bin_size.py`. Fixed some
bugs and usability issues. Unit tests improved, especially for the
'cnvlib.genome' sub-package.

Dependencies
------------

- Pandas 0.18.1 is once again supported. Previously the minimum version was
  0.19.1. (bcbio/bcbio-nextgen#1836)
- Pysam minimum version is still 0.9.1.4, but slightly older versions in the
  0.9 series may still work too. (etal#192)

Commands
--------

`autobin`:

- New command, replacing and extending the script `coverage_bin_size.py`. The
  script is still included (and shares most of the same code), but is considered
  deprecated and will be removed in the 0.9.0 release. (etal#170)
- In 'amplicon' and 'hybrid' modes, ensure the regions sampled for coverage are
  the same in every run by setting the random seed. (etal#191)
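
Seeding makes the sampled subset reproducible, as in this sketch. The seed value, the function name, and the use of a dedicated `random.Random` instance are assumptions made for illustration.

```python
import random


def sample_regions(regions, k, seed=12345):
    """Pick up to k regions; the same seed always yields the same picks."""
    rng = random.Random(seed)  # local generator, leaves global state alone
    return rng.sample(regions, min(k, len(regions)))


regions = [("chr1", i * 1000, i * 1000 + 500) for i in range(100)]
first_run = sample_regions(regions, 10)
second_run = sample_regions(regions, 10)
```

Because both runs draw from a generator seeded identically, any bin size estimated from the sampled coverage is the same each time.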

`antitarget`, `autobin`, `batch`:

- Fix an issue in GenomicArray.subtract() that caused some of the expected
  output regions to be missing. In cases where this caused an entire chromosome
  to be lost, the `coverage_bin_size.py` script and the `autobin` and `batch`
  commands in `hybrid` mode would crash. (bcbio/bcbio-nextgen#1799)

`batch`, `diagram`:

- Fix creation of chromosomal diagrams with `--diagram` and the `diagram`
  command. (etal#190)

`export`:

- In `export seg`, use 1-based indexing in the SEG output. (etal#197)
- Fix `export cdt` format; previously it generated Java TreeView (jtv) output
  instead.

v0.8.4

Version 0.8.4

This minor release focuses on improving usability and fixing some bugs.
Documentation is updated (thanks @kyleabeauchamp for etal#186).

Dependencies
------------

- Raise minimum pandas version from 0.18.1 to 0.19.0
- Raise minimum matplotlib version to 1.3.1

Commands
--------

`fix`, `metrics`:

- Set PRNG seed to ensure reproducible results. The pipeline is now fully
  repeatable with identical results if run in serial, i.e. without `-p`.

`fix`, `reference`:

- Ensure bias smoothing window size is at least 5. This reduces the occurrence of
  0-log2, 0-spread bins on a 32-bin dataset, but doesn't eliminate it. (etal#181)

`fix`:

- Don't complain about mismatched sample IDs if antitargets are blank. This
  allows reusing a blank "MT" file in a shell loop for WGS and amplicon data.

`reference`:

- Make antitargets (antitarget.bed or \*.antitargetcoverage.cnn) an optional
  argument. Previously this argument was required, so processing WGS or amplicon
  data, which has no off-target regions or reads, required the user to create
  and provide a blank BED file or appropriately named, empty .cnn files. (etal#183)

`segment`:

- Don't log "Dropped 0 low-coverage bins". Only log when it actually drops bins.

`diagram`, `heatmap`:

- Add option `--no-shift-xy`.  Shifting X and Y according to reference and sample
  sex was done in diagram, but not heatmap. Now it's optional in both.

`heatmap`:

- Add a legend of log2 ratio colors to the plot. (etal#36)
- Add options `-x`/`--sample-sex` and `-y`/`--male-reference`. (etal#172)

`gender`/`sex`:

- Rename 'gender' command to 'sex', with shim for backward compatibility. (etal#182)
- In other commands, the `-g`/`--gender` argument is renamed to
  `-x`/`--sample-sex`, also with a compatibility shim. Argument values `x` and
  `y` are accepted in addition to `f`/`female` and `m`/`male`, respectively.
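
The accepted values can be pictured as a simple alias table. This sketch is illustrative; the function name is hypothetical, and the case-insensitive matching is an assumption.

```python
SEX_ALIASES = {
    "f": "female", "female": "female", "x": "female",
    "m": "male", "male": "male", "y": "male",
}


def normalize_sex(value):
    """Map any accepted spelling to 'female' or 'male'."""
    try:
        return SEX_ALIASES[value.strip().lower()]
    except KeyError:
        raise ValueError("unrecognized sample sex: %r" % value)


normalized = [normalize_sex(v) for v in ("x", "female", "m", "Y")]
```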

`import-picard`:

- Deprecate searching a directory tree for files. It was a vestige of early lab
  work, and makes a shaky assumption about Picard CalculateHsMetrics
  ``--PER_TARGET_COVERAGE`` output filenames.

API
---

- The ``do_*`` function implementations moved to their named modules. The
  ``do_*`` functions can still be called or imported from the `cnvlib` and
  `cnvlib.commands` modules.
- All parsing and serialization of "chr:start-end" genomic region labels is
  consolidated under a new module, `cnvlib.genome.rangelabel`. These functions
  are used in tabio.textcoord, GenomicArray.labels(), and elsewhere to ensure
  consistent behavior.
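
A round trip between "chr:start-end" labels and coordinate tuples might look like the sketch below. It assumes labels are 1-based and inclusive while internal coordinates are 0-based and half-open, a common convention that should be checked against the module itself; the function names are hypothetical.

```python
import re

LABEL = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_label(text):
    """Parse 'chr:start-end' into (chrom, start, end), 0-based half-open."""
    match = LABEL.match(text.strip())
    if match is None:
        raise ValueError("invalid region label: %r" % text)
    # Assumed convention: the label's start is 1-based, so subtract 1.
    return (match.group("chrom"),
            int(match.group("start")) - 1,
            int(match.group("end")))


def format_label(chrom, start, end):
    """Serialize 0-based half-open coordinates back to 'chr:start-end'."""
    return "%s:%d-%d" % (chrom, start + 1, end)


region = parse_label("chr7:55019017-55211628")
label = format_label(*region)
```

Keeping both directions in one place means every caller applies the same offset, which is the consistency the consolidation aims for.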

Internal
--------

- `cnvlib.genome`: Handle nested bins correctly in the `merge`, `flatten`, and
  `intersect` modules, functions and GenomicArray methods. Verified with
  thorough unit tests.
- VCF: If the paired normal sample's genotypes are all 0/0 or missing, fall back
  to `--zygosity-freq` (inference from b-allele frequency) rather than marking
  all variants as somatic.  Then infer and drop additional somatic SNVs based on
  genotype after parsing, and only if that wouldn't drop all records.  This
  allows CNVkit to safely distinguish somatic vs. germline in VCFs from Mutect2,
  though Mutect2 is still not recommended. (etal#184)

v0.8.3

Bump version to 0.8.3