24700180143
80%

Ran 21 Apr 2026 02:04AM UTC

Jobs 1

Files 21

Run time 1min

Committed 21 Apr 2026 02:02AM UTC coverage: 48.394% (+0.08%) from 48.311%

Build # 24700180143

Build Type

push

github

Committed by

web-flow

Commit Message

v1.13.3: cache classifier functions so scanner isn't quadratic (#85) (#102)

Adds @functools.cache to the hot classifier functions in
hitlist.curation so repeated per-row lookups against a small unique
vocabulary hit the cache instead of re-running mhcgnomes / regex /
YAML-driven logic:

- classify_mhc_species(mhc_restriction)  — ~100k unique allele strings
- classify_allele_resolution(mhc_restriction)
- allele_to_all_serotypes(mhc_restriction)
- is_binding_assay(qualitative_measurement, assay_comments)
- classify_ms_row(process_type, disease, culture, tissue, cell_name,
  pmid, mhc_restriction, submission_id)  — full tuple as key
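
As a rough illustration of the pattern, the sketch below wraps a toy classifier in `functools.cache`. `classify_species_demo` and its prefix rules are hypothetical stand-ins, not the real `hitlist.curation` logic; the point is only that the function body runs once per unique input.

```python
import functools

CALLS = {"n": 0}  # counts how often the function body actually runs


@functools.cache
def classify_species_demo(mhc_restriction: str) -> str:
    """Hypothetical stand-in for an expensive per-row classifier."""
    CALLS["n"] += 1
    if mhc_restriction.startswith("HLA-"):
        return "human"
    if mhc_restriction.startswith("H2-"):
        return "mouse"
    return "other"


# Many rows, but only three unique allele strings.
rows = ["HLA-A*02:01", "H2-Kb", "HLA-A*02:01", "HLA-B*07:02", "H2-Kb"] * 1000
labels = [classify_species_demo(r) for r in rows]

# cache_info() reports hits and misses for the wrapped function.
info = classify_species_demo.cache_info()
print(CALLS["n"], info.hits, info.misses)
```

With 5,000 rows and three unique strings, the body runs three times and every other lookup is served from the cache.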

Also lifts `_cached_parse` from `maxsize=1024` to unbounded `@cache`:
1024 was too small for the actual vocabulary (~100k strings) and was
causing re-parse churn for alleles that cluster later in the scan
(e.g. species-specific alleles that appear next to cell-line rows).
Unbounded is fine — mhcgnomes parse results are small, total
vocabulary footprint is well under 100 MB.
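
The churn described above can be reproduced in miniature. This is a sketch under assumed, scaled-down numbers (a vocabulary of 8 against `maxsize=4`, not 100k against 1024), and `make_parser` is a hypothetical helper, not the real `_cached_parse`.

```python
import functools


def make_parser(maxsize):
    """Build a toy cached parser plus a counter of real (uncached) parses."""
    calls = {"n": 0}

    @functools.lru_cache(maxsize=maxsize)
    def parse(allele: str) -> str:
        calls["n"] += 1
        return allele.upper()  # placeholder for real parsing work

    return parse, calls


vocabulary = [f"allele-{i}" for i in range(8)]
scan = vocabulary * 100  # 800 lookups cycling through 8 unique strings

bounded, bounded_calls = make_parser(maxsize=4)       # cache smaller than vocabulary
unbounded, unbounded_calls = make_parser(maxsize=None)  # same behavior as @functools.cache

for allele in scan:
    bounded(allele)
    unbounded(allele)

print(bounded_calls["n"], unbounded_calls["n"])
```

Cycling through a vocabulary larger than the LRU capacity evicts each entry before it is reused, so the bounded parser re-parses on every lookup, while the unbounded one parses each unique string once.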

Why this is safe:
- All inputs are hashable scalars (str / int).
- classify_ms_row's return type is a dict, but the two known callers
  (scanner.py `record.update(...)`, supplement.py `**...`) copy its items
  without mutating it, so sharing the cached instance across rows is safe.
- Return types are otherwise immutable (str, bool, tuple).
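
The dict caveat can be sketched as follows. `classify_row_demo` is a hypothetical two-argument stand-in for `classify_ms_row`; the example assumes callers behave as described above and also shows what would go wrong if one mutated the returned dict in place.

```python
import functools


@functools.cache
def classify_row_demo(process_type: str, pmid: int) -> dict:
    """Hypothetical stand-in that returns a (mutable) dict."""
    return {"src": process_type, "pmid": pmid}


# Safe: update() copies the items into the caller's own dict.
record = {"peptide": "SIINFEKL"}
record.update(classify_row_demo("ms", 1))
record["src"] = "edited"  # changes the record only, not the cached dict
safe_after_update = classify_row_demo("ms", 1)["src"]

# Safe: ** unpacking also builds a fresh dict.
merged = {"peptide": "SIINFEKL", **classify_row_demo("ms", 1)}
merged["pmid"] = -1
safe_after_splat = classify_row_demo("ms", 1)["pmid"]

# Unsafe: mutating the returned object edits the shared cached instance,
# so every later call with the same arguments sees the change.
leaked = classify_row_demo("binding", 2)
leaked["src"] = "poisoned"
poisoned = classify_row_demo("binding", 2)["src"]

print(safe_after_update, safe_after_splat, poisoned)
```

If a future caller needs to modify the result, copying first (for example `dict(classify_row_demo(...))`) keeps the cached instance intact.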

This is proposal (a) from #85 — precompute over the unique vocabulary
and broadcast. A micro-benchmark on 100k cached calls (warmed cache)
measures 0.30 µs/call — essentially Python call overhead. On the full
IEDB scan that should knock ≥80% off the per-row classifier time
(most of the ~10 min scanner pass).
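
A warmed-cache micro-benchmark of this kind might look like the sketch below. It is not the benchmark used for the figure above: `classify_demo` and the key set are hypothetical, and the measured time includes loop overhead and will vary by machine and Python version.

```python
import functools
import timeit


@functools.cache
def classify_demo(value: str) -> bool:
    """Hypothetical stand-in for a cached classifier."""
    return value.startswith("HLA-")


keys = [f"HLA-A*{i:02d}:01" for i in range(100)]

# Warm the cache so the timed loop measures cache hits only.
for key in keys:
    classify_demo(key)


def run_once():
    for key in keys:
        classify_demo(key)


n_loops = 1000  # 1000 loops x 100 keys = 100k cached calls
total_seconds = timeit.timeit(run_once, number=n_loops)
per_call_us = total_seconds / (n_loops * len(keys)) * 1e6
print(f"{per_call_us:.3f} µs per cached call")
```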

The larger (b) pyarrow CSV, (c) ProcessPoolExecutor, (d) persistent
fingerprint-keyed scanner-output cache proposals from #85 are left as
follow-ups if more speedup is needed after this lands.

All 138 curation/scanner/supplement/aggregate/observations/proteome
tests pass unchanged.

Closes #85 (proposal a).

Coverage Stats

1492 of 3083 relevant lines covered (48.39%)

0.48 hits per line

Coverage Regressions

Lines	Coverage	∆	File
43	84.18	0.27%	curation.py

Jobs

ID	Job ID	Ran	Files	Coverage
1	24700180143.1	21 Apr 2026 02:04AM UTC	21	48.39	GitHub Action Run

pirl-unc / hitlist / 24700180143