296
94%

Ran 04 Mar 2019 11:05PM UTC

Jobs 3

Files 65

Run time 1min

pending completion

Build # 296

Build Type

push

travis-ci-com

Committed by

web-flow

Commit Message

Merge pull request #25 from broadinstitute/cluster-genomes

Add options to cluster input sequences and design on each cluster

This PR addresses #24.

Issue #24 gives background and reasons for this feature. In short, [`design.py`](https://github.com/broadinstitute/catch/blob/master/bin/design.py) can be slow when given a large number of highly divergent sequences (e.g., all sequences for all eight segments of influenza A virus). One solution is to cluster the input sequences without aligning them, solve a separate set cover instance on each cluster, and then merge the output probes from all clusters.

This PR adds the argument `--cluster-and-design-separately`. When provided, it produces a signature (or "sketch") of each input sequence using MinHash (similar to what is done in [Mash](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x)) and clusters the sequences by comparing their signatures. Clustering itself can be slow and memory-intensive, but using signatures enables fast pairwise comparison of sequences. Then, independently on each cluster, it generates candidate probes and runs a collection of filters on them (typically including [`set_cover_filter`](https://github.com/broadinstitute/catch/blob/master/catch/filter/set_cover_filter.py)). Finally, it merges the resulting probes (removing exact duplicates) and runs final filters (e.g., [`adapter_filter`](https://github.com/broadinstitute/catch/blob/master/catch/filter/adapter_filter.py)) on the merged set of probes.
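To make the signature-based comparison concrete, here is a minimal sketch of MinHash signatures and threshold clustering. It is not the code from this PR: the function names, the k-mer length, the signature size, the use of `1 - Jaccard` as the distance, and the connected-components clustering rule are all assumptions chosen for illustration.

```python
import hashlib


def kmer_hashes(seq, k):
    """Return the set of 64-bit hash values of all k-mers in seq."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.sha1(seq[i:i + k].encode()).digest()
        hashes.add(int.from_bytes(digest[:8], "big"))
    return hashes


def minhash_signature(seq, k=4, size=8):
    """Bottom-`size` sketch: the `size` smallest k-mer hash values, sorted."""
    return sorted(kmer_hashes(seq, k))[:size]


def estimate_jaccard(sig_a, sig_b, size=8):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    # The `size` smallest hashes of the union act as a random sample of it.
    merged = sorted(set(sig_a) | set(sig_b))[:size]
    if not merged:
        return 0.0
    shared = set(sig_a) & set(sig_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def cluster_by_threshold(seqs, max_dist=0.5, k=4, size=8):
    """Group sequences whose estimated distance (1 - Jaccard) is <= max_dist.

    Assumption for illustration: clusters are connected components of the
    "close enough" graph, tracked with a small union-find structure.
    """
    sigs = [minhash_signature(s, k, size) for s in seqs]
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if 1.0 - estimate_jaccard(sigs[i], sigs[j], size) <= max_dist:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Two identical sequences and one that shares no 4-mers with them.
clusters = cluster_by_threshold(["AAAAAAAA", "AAAAAAAA", "CCCCCCCC"])
```

Each pairwise comparison touches only the fixed-size signatures, not the full sequences, which is why sketching keeps the comparison step cheap even for long genomes.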

Provided that clustering itself is not too expensive, this can improve overall runtime and memory usage, because solving independent, smaller set cover instances requires fewer resources than solving the complete one. One downside is that this can increase the size of the resulting probe set: each cluster is designed independently and only exact duplicates are removed when merging, so homology between input sequences that are placed into different clusters may be covered by separate probes.
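The trade-off can be seen on a toy set cover instance. The sketch below is purely illustrative and does not use catch's `set_cover_filter`: the probe names, the coverage data, and the plain greedy heuristic are hypothetical. Two sequences share a homologous region (region 0), so a probe from either sequence covers that region in both.

```python
def greedy_set_cover(universe, candidates):
    """Greedy heuristic: repeatedly pick the candidate that covers the most
    still-uncovered elements. Returns the chosen names in selection order."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Iterate names in sorted order so that ties break deterministically.
        best = max(sorted(candidates),
                   key=lambda name: len(candidates[name] & uncovered))
        if not candidates[best] & uncovered:
            raise ValueError("universe cannot be covered")
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen


# Hypothetical data: (sequence, region) pairs that each probe covers.
coverage = {
    "probe_from_s1": {("s1", 0), ("s2", 0)},  # homologous region 0
    "probe_from_s2": {("s1", 0), ("s2", 0)},  # same region, different probe
    "probe_s1_only": {("s1", 1)},
    "probe_s2_only": {("s2", 1)},
}
# Sequence from which each candidate probe was generated.
origin = {
    "probe_from_s1": "s1",
    "probe_from_s2": "s2",
    "probe_s1_only": "s1",
    "probe_s2_only": "s2",
}


def design(cluster):
    """Design probes using only candidates and targets from `cluster`."""
    universe = {t for cov in coverage.values() for t in cov if t[0] in cluster}
    candidates = {name: cov & universe
                  for name, cov in coverage.items()
                  if origin[name] in cluster}
    return greedy_set_cover(universe, candidates)


# One instance over both sequences versus one instance per cluster.
joint = design({"s1", "s2"})
separate = sorted(set(design({"s1"})) | set(design({"s2"})))
```

Here the joint design needs three probes, while the per-cluster designs merge to four, because each cluster picks its own probe for the shared region and the two probes are not exact duplicates.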

This PR also adds the ... (continued)

Run Details

1516 of 1709 branches covered (88.71%)

5193 of 5459 relevant lines covered (95.13%)

2.85 hits per line

Jobs

ID	Job ID	Ran	Coverage (%)	CI Job
1	296.1	04 Mar 2019 11:06PM UTC	95.13	Travis Job 296.1
2	296.2	04 Mar 2019 11:06PM UTC	95.13	Travis Job 296.2
3	296.3	04 Mar 2019 11:05PM UTC	95.13	Travis Job 296.3

Source Files on build 296

Detailed source file information is not available for this build.
