1
94%
master: 94%

Ran 04 Mar 2019 11:06PM UTC

Files 65

Run time 15s

Committed 04 Mar 2019 10:35PM UTC

Coverage: 95.127% (+0.02%) from 95.106%

Job # 296.1

Build Type

push

travis-ci-com

Committed by

web-flow

Commit Message

Merge pull request #25 from broadinstitute/cluster-genomes

Add options to cluster input sequences and design on each cluster

This PR addresses #24.

Issue #24 gives background and reasons for this feature. In short, [`design.py`](https://github.com/broadinstitute/catch/blob/master/bin/design.py) can be slow when given a large number of highly divergent sequences (e.g., all sequences for all eight segments of influenza A virus). One solution is to cluster input sequences (alignment-free), solve a separate set cover instance on each cluster, and then merge the output probes from each cluster.

This PR adds the argument `--cluster-and-design-separately`. When provided, it produces a signature (or "sketch") of each input sequence using MinHash (similar to what is done in [Mash](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x)) and clusters the sequences by comparing their signatures. Clustering itself can be slow and memory-intensive, but using signatures enables fast pairwise comparison of sequences. Then, independently on each cluster, it generates candidate probes and runs a collection of filters on them (typically including [`set_cover_filter`](https://github.com/broadinstitute/catch/blob/master/catch/filter/set_cover_filter.py)). Finally, it merges the resulting probes (removing exact duplicates) and runs final filters (e.g., [`adapter_filter`](https://github.com/broadinstitute/catch/blob/master/catch/filter/adapter_filter.py)) on the merged set of probes.
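To make the signature idea concrete, here is a minimal sketch of MinHash-based clustering. It is not the code in this PR: the k-mer size, sketch size, hash function, similarity threshold, single-linkage clustering, and every function name below are assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(seq, k=10, sketch_size=50):
    """Bottom-s sketch: the sketch_size smallest distinct k-mer hashes, sorted."""
    hashes = sorted({kmer_hash(km) for km in kmers(seq, k)})
    return hashes[:sketch_size]


def estimate_jaccard(sig_a, sig_b, sketch_size=50):
    """Estimate Jaccard similarity of two k-mer sets from their sketches."""
    a, b = set(sig_a), set(sig_b)
    # Take the smallest hashes of the union and count how many are shared.
    union_bottom = sorted(a | b)[:sketch_size]
    if not union_bottom:
        return 0.0
    shared = sum(1 for h in union_bottom if h in a and h in b)
    return shared / len(union_bottom)


def cluster_by_signature(signatures, threshold=0.5, sketch_size=50):
    """Single-linkage clustering: link pairs whose estimated similarity is
    at least threshold, then return connected components as index lists."""
    n = len(signatures)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            sim = estimate_jaccard(signatures[i], signatures[j], sketch_size)
            if sim >= threshold:
                parent[find(i)] = find(j)

    components = {}
    for i in range(n):
        components.setdefault(find(i), []).append(i)
    return sorted(components.values())


# Toy input: two pairs of sequences that share no 10-mers across pairs.
seq_a = "A" * 30 + "C" * 30
seq_b = "G" * 30 + "T" * 30
signatures = [minhash_signature(s) for s in [seq_a, seq_a, seq_b, seq_b]]
clusters = cluster_by_signature(signatures)
print(clusters)
```

Each pairwise comparison touches only the fixed-size sketches, not the full sequences, which is why comparing signatures stays fast even for long genomes.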

Provided that clustering itself is not too costly, this generally improves overall runtime and memory usage, because solving several independent, smaller set cover instances requires fewer resources than solving the complete one. One downside is that it can increase the size of the resulting probe set (e.g., if there is homology between input sequences that are placed into different clusters).
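The trade-off can be seen in a toy model. The sketch below uses a plain greedy set cover and made-up probe and region names. It is not `set_cover_filter`, and the data is hypothetical. Two homologous regions land in different clusters, so a joint design needs one probe while the per-cluster design keeps two.

```python
def greedy_set_cover(universe, candidates):
    """Greedy set cover: repeatedly pick the candidate that covers the most
    still-uncovered elements. candidates maps a name to the set it covers."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Iterate in sorted order so ties are broken deterministically.
        best = max(sorted(candidates),
                   key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError("universe cannot be covered by the candidates")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical data: region a1 lies in cluster A and b1 in cluster B.
# The regions are homologous, so either probe could capture both of them.
regions_by_cluster = {"A": {"a1"}, "B": {"b1"}}
probe_origin = {"probe_A": "A", "probe_B": "B"}
coverage = {"probe_A": {"a1", "b1"}, "probe_B": {"a1", "b1"}}

# Joint design: one set cover instance over all regions and all candidates.
all_regions = set().union(*regions_by_cluster.values())
joint = greedy_set_cover(all_regions, coverage)

# Separate design: each cluster only sees its own regions and the candidate
# probes generated from its own sequences; results are merged as a set,
# which removes exact duplicates.
merged = set()
for cluster, regions in regions_by_cluster.items():
    local = {name: cov & regions for name, cov in coverage.items()
             if probe_origin[name] == cluster}
    merged.update(greedy_set_cover(regions, local))
merged = sorted(merged)

print(len(joint), len(merged))
```

Merging removes only exact duplicates, so two near-identical probes chosen in different clusters both survive, which is how the merged set ends up larger than the joint design.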

This PR also adds the ... (continued)

Run Details

1683 of 1876 branches covered (89.71%)

5193 of 5459 relevant lines covered (95.13%)

0.95 hits per line
