24791370603
80%

Ran 22 Apr 2026 04:59PM UTC

Jobs 1

Files 21

Run time 1min

Committed 22 Apr 2026 04:57PM UTC

Coverage: 50.2% (+0.06%) from 50.139%

Build # 24791370603

Build Type

push

github

Committed by

web-flow

Commit Message

v1.15.1: normalize mhc_restriction on ingest + length bounds on observations/binding (#121, #118) (#123)

## #121 — normalize_allele at ingest

Before this fix, the scanner and supplement ingest paths wrote raw
``mhc_restriction`` strings to the parquet without normalization.
Supplementary data (Gomez-Zepeda 2024 SK-MEL-37, PMID 38480730)
contained ``A*02:01`` without the ``HLA-`` prefix; the scanner
passes IEDB strings through verbatim, so it occasionally picks up
typos or unusual formats.

Effect: ``load_observations(mhc_restriction="HLA-A*02:01")`` would
miss rows stored as ``A*02:01``. The CLI ``--mhc-allele`` path
normalizes at query time so it hits both, but raw parquet inspection
and aggregators grouping on the raw column silently split one allele
across several groups.
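
The split described above can be reproduced with a few made-up rows.
This is only an illustrative sketch: the peptides and the row layout
are assumptions, not data taken from the hitlist parquet.

```python
from collections import Counter

# Made-up rows for illustration only; real rows live in observations.parquet.
rows = [
    {"peptide": "SLLMWITQC", "mhc_restriction": "HLA-A*02:01"},
    {"peptide": "GILGFVFTL", "mhc_restriction": "A*02:01"},  # missing "HLA-" prefix
    {"peptide": "NLVPMVATV", "mhc_restriction": "HLA-A*02:01"},
]

# An exact-match filter on the raw column misses the unprefixed row.
exact_hits = [r for r in rows if r["mhc_restriction"] == "HLA-A*02:01"]

# Grouping on the raw column reports one allele as two separate groups.
raw_counts = Counter(r["mhc_restriction"] for r in rows)

print(len(exact_hits))   # 2 of the 3 rows
print(dict(raw_counts))  # {'HLA-A*02:01': 2, 'A*02:01': 1}
```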

Fix: apply ``hitlist.curation.normalize_allele`` to
``mhc_restriction`` in both ingest paths. The function is already
``@cache``d on the unique vocabulary (~100k strings), so per-row cost
is ~100 ns once a given string has been seen.
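
A rough sketch of why the cached call is cheap per row follows.
``normalize_allele_sketch`` is a hypothetical stand-in that only adds a
missing ``HLA-`` prefix; the real ``hitlist.curation.normalize_allele``
presumably handles far more cases, and its signature is not shown in
this message.

```python
from functools import cache


@cache
def normalize_allele_sketch(raw: str) -> str:
    """Hypothetical stand-in: only adds a missing 'HLA-' prefix."""
    s = raw.strip()
    return s if s.startswith("HLA-") else f"HLA-{s}"


# A column with many rows but few unique strings, as in the ingest paths.
raw_column = ["A*02:01", "HLA-A*02:01", "A*02:01", "A*02:01"]
normalized = [normalize_allele_sketch(v) for v in raw_column]

# Only the first occurrence of each unique string runs the function body;
# every repeat is served from the cache.
info = normalize_allele_sketch.cache_info()
print(normalized)
print(info.hits, info.misses)  # 2 hits, 2 misses
```

With roughly 100k unique strings against millions of rows, almost every
call would be a cache hit under this scheme.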

## #118 — length_min / length_max on load_observations / load_binding

``load_bulk_peptides`` shipped ``length_min`` / ``length_max`` in
v1.14.3 (#108). The MHC peptide loaders needed the same bounds: every
training-set script was post-filtering with
``df[df["peptide"].str.len().between(8, 11)]`` after load.

- ``load_observations(length_min=8, length_max=11)`` → MHC-I window
- ``load_observations(length_min=12, length_max=25)`` → MHC-II window
- ``load_binding(length_min=9, length_max=9)`` → strictly 9-mers
- ``load_all_evidence(length_min=..., length_max=...)`` passes
  through to both underlying loaders

Implementation detail: observations.parquet / binding.parquet don't
carry an explicit ``length`` column (unlike the bulk parquet), so the
bound is applied post-read via ``peptide.str.len().between(lo, hi)``.
For the full 4.4M-row observations parquet this is ~100 ms, small
relative to the read. A future PR can add ``length`` as a stored Int64
column at build time and push the filter down; for now the sim... (continued)
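
The post-read bound described above can be sketched with pandas as
follows. ``filter_by_length`` and the three example peptides are
hypothetical; only the ``peptide.str.len().between(lo, hi)`` expression
and the idea of a stored Int64 ``length`` column come from this
message.

```python
import pandas as pd


def filter_by_length(df: pd.DataFrame, lo: int, hi: int) -> pd.DataFrame:
    """Keep rows whose peptide length lies in [lo, hi], both ends inclusive."""
    mask = df["peptide"].str.len().between(lo, hi)
    return df[mask]


# Made-up peptides of length 8, 9 and 13.
df = pd.DataFrame({"peptide": ["SIINFEKL", "SLLMWITQC", "PKYVKQNTLKLAT"]})

# MHC-I window applied after the read, since no length column is stored.
mhc_i = filter_by_length(df, 8, 11)

# What a stored column might look like if it were added at build time.
df_with_length = df.assign(length=df["peptide"].str.len().astype("Int64"))
```

``Series.between`` is inclusive on both ends by default, which matches
the ``length_min=9, length_max=9`` case selecting strictly 9-mers.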

Coverage Stats

1629 of 3245 relevant lines covered (50.2%)

0.5 hits per line

Coverage Regressions

Lines	Coverage	∆	File
21	27.67%	-0.18%	scanner.py
8	91.0%	-0.49%	observations.py
4	86.57%	+0.2%	supplement.py

Jobs

ID	Job ID	Ran	Files	Coverage
1	24791370603.1	22 Apr 2026 04:59PM UTC	21	50.2	GitHub Action Run

pirl-unc / hitlist / 24791370603