24937746010
80%

Ran 25 Apr 2026 06:34PM UTC

Jobs 1

Files 23

Run time 1min

Committed 25 Apr 2026 06:32PM UTC coverage: 60.497% (+0.8%) from 59.684%

Build # 24937746010

Build Type

push

github

Committed by

web-flow

Commit Message

v1.20.0: Transcript-aware peptide_mappings (#141) (#159)

Issue #141: ProteomeIndex.from_ensembl picked one longest protein-coding
transcript per gene and used best_t.id (an ENST) as protein_id, so
peptide_mappings.parquet:
  - silently collapsed transcript diversity away before the mapping
    table was built
  - mixed transcript-vs-protein semantics in the protein_id column
    depending on backend (ENST for Ensembl, FASTA-header strings for
    UniProt) so downstream consumers couldn't distinguish gene-level
    ambiguity from transcript-isoform ambiguity from same-protein
    repeated occurrences

Fix: index every protein-coding transcript per gene, expose
transcript_id and is_canonical_transcript as first-class mapping
columns, and use the ENSP (when pyensembl surfaces it) as protein_id.

- hitlist/proteome.py:
  - from_ensembl now iterates every protein-coding transcript per
    gene rather than only the longest.  Indexes by t.protein_id
    (ENSP) when available, falling back to t.id (ENST) for older
    pyensembl releases / species without protein_id surfaced.
  - Canonical transcript = longest valid protein-coding translation
    (a stable, pyensembl-version-independent definition).  Each
    proteins[] entry's meta records gene_name, gene_id, transcript_id
    (always the ENST), and is_canonical_transcript bool.
  - from_fasta records transcript_id='' and is_canonical_transcript=
    False so the meta schema is uniform across backends.
  - map_peptides now emits transcript_id and is_canonical_transcript
    columns alongside protein_id.  Empty-result schema also includes
    them.
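
The per-gene indexing described for hitlist/proteome.py can be sketched roughly as follows. This is a minimal illustration, not the actual hitlist code: the Transcript dataclass is a hypothetical stand-in for a pyensembl transcript, index_gene is an invented helper name, the IDs are made up, and breaking length ties in favor of the first transcript is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Transcript:
    """Hypothetical stand-in for a pyensembl transcript (illustration only)."""
    id: str                          # ENST
    gene_id: str
    gene_name: str
    protein_id: Optional[str]        # ENSP, or None when not surfaced
    protein_sequence: Optional[str]  # None or empty when there is no valid translation


def index_gene(transcripts: List[Transcript]) -> Dict[str, dict]:
    """Index every transcript with a translation and flag the longest as canonical."""
    valid = [t for t in transcripts if t.protein_sequence]
    if not valid:
        return {}
    # Assumption: on a length tie, the first transcript wins (max() keeps the first maximum).
    canonical = max(valid, key=lambda t: len(t.protein_sequence))
    proteins: Dict[str, dict] = {}
    for t in valid:
        key = t.protein_id or t.id  # ENSP when available, otherwise fall back to the ENST
        proteins[key] = {
            "sequence": t.protein_sequence,
            "meta": {
                "gene_name": t.gene_name,
                "gene_id": t.gene_id,
                "transcript_id": t.id,  # always the ENST
                "is_canonical_transcript": t is canonical,
            },
        }
    return proteins


# Made-up IDs: one transcript with an ENSP, one without, one with no translation.
example = index_gene([
    Transcript("ENST0001", "ENSG0001", "GENE1", "ENSP0001", "MKTAYIAK"),
    Transcript("ENST0002", "ENSG0001", "GENE1", None, "MKTA"),
    Transcript("ENST0003", "ENSG0001", "GENE1", "ENSP0003", None),
])
```

Here the second transcript is keyed by its ENST because no ENSP is available, and the third is skipped because it has no translation.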

- hitlist/mappings.py:
  - _MAPPING_COLUMNS adds transcript_id + is_canonical_transcript.
  - _flanking_rows_to_mapping_rows propagates the new columns when
    present (real Ensembl-backed builds) and fills defaults for
    legacy fixtures so the parquet schema stays uniform.
  - load_peptide_mappings exposes transcript_id and
    is_canonical_transcript as pyarrow-p... (continued)
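
The default-filling behaviour described for legacy fixtures might look something like the sketch below. The helper name and the dict-per-row representation are assumptions made for illustration; the default values mirror what the message states for from_fasta ('' and False), which may or may not be what _flanking_rows_to_mapping_rows actually uses.

```python
# Assumed defaults, mirroring the from_fasta values named in the commit message.
TRANSCRIPT_DEFAULTS = {"transcript_id": "", "is_canonical_transcript": False}


def with_transcript_columns(row: dict) -> dict:
    """Return a copy of row with the transcript columns present (hypothetical helper)."""
    out = dict(row)
    for column, default in TRANSCRIPT_DEFAULTS.items():
        out.setdefault(column, default)  # keep real values, fill only missing ones
    return out


legacy_row = {"peptide": "SIINFEKL", "protein_id": "P01012"}
ensembl_row = {
    "peptide": "SIINFEKL",
    "protein_id": "ENSP0001",
    "transcript_id": "ENST0001",
    "is_canonical_transcript": True,
}
rows = [with_transcript_columns(r) for r in (legacy_row, ensembl_row)]
```

Every row then carries the same keys, so a table built from them has one uniform schema regardless of backend.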

Coverage Stats

2533 of 4187 relevant lines covered (60.5%)

0.6 hits per line
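
As a quick check, the headline figures can be reproduced from the counts above. The variable names are illustrative, and the previous coverage value is taken from the "from 59.684%" note in the build header.

```python
covered, relevant = 2533, 4187
coverage = 100 * covered / relevant  # percent of relevant lines covered
previous = 59.684                    # prior build's coverage, from the build header
delta = coverage - previous          # change in percentage points

print(f"coverage: {coverage:.3f}% ({delta:+.1f}%) from {previous}%")
```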

Coverage Regressions

Lines	Coverage	∆	File
91	24.77	8.02%	mappings.py
70	80.78	0.0%	export.py
19	88.07	8.57%	proteome.py

Jobs

ID	Job ID	Ran	Files	Coverage
1	24937746010.1	25 Apr 2026 06:34PM UTC	23	60.5	GitHub Action Run

pirl-unc / hitlist / 24937746010