pirl-unc / hitlist / 24700180143
80%

Build:
DEFAULT BRANCH: main
Ran 21 Apr 2026 02:04AM UTC
Jobs 1
Files 21
Run time 1min
21 Apr 2026 02:02AM UTC coverage: 48.394% (+0.08%) from 48.311%
24700180143

push

github

web-flow
v1.13.3: cache classifier functions so scanner isn't quadratic (#85) (#102)

Adds @functools.cache to the hot classifier functions in
hitlist.curation so repeated per-row lookups against a small unique
vocabulary hit the cache instead of re-running mhcgnomes / regex /
YAML-driven logic:

- classify_mhc_species(mhc_restriction)  — ~100k unique allele strings
- classify_allele_resolution(mhc_restriction)
- allele_to_all_serotypes(mhc_restriction)
- is_binding_assay(qualitative_measurement, assay_comments)
- classify_ms_row(process_type, disease, culture, tissue, cell_name,
  pmid, mhc_restriction, submission_id)  — full tuple as key
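The caching pattern listed above can be sketched as follows. This is a minimal illustration rather than the actual hitlist.curation code: the function body is a toy rule standing in for the mhcgnomes / regex / YAML-driven logic, and the `CALLS` counter is a hypothetical addition that only shows how often the body runs.

```python
import functools

CALLS = 0  # hypothetical counter, only here to show how often the body runs


@functools.cache
def classify_mhc_species(mhc_restriction: str) -> str:
    """Stand-in classifier; the real one is driven by mhcgnomes (not shown)."""
    global CALLS
    CALLS += 1
    # Toy rule, assumed purely for illustration.
    return "human" if mhc_restriction.startswith("HLA-") else "other"


rows = ["HLA-A*02:01", "H2-Kb", "HLA-A*02:01", "HLA-A*02:01", "H2-Kb"]
species = [classify_mhc_species(r) for r in rows]

# Five rows were classified, but the body ran once per unique string.
print(species, CALLS)
```

Because the cache key is the argument tuple, the cost of the scan scales with the number of unique inputs instead of the number of rows.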

Also changes `_cached_parse` from `maxsize=1024` to an unbounded
`@cache`. A 1024-entry cache was too small for the actual vocabulary
(~100k strings) and caused re-parse churn for alleles that cluster
later in the scan (e.g. species-specific alleles that appear next to
cell-line rows). Unbounded is fine: mhcgnomes parse results are small,
and the total vocabulary footprint is well under 100 MB.
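To illustrate why a bounded cache can churn when the vocabulary exceeds `maxsize`, here is a toy comparison. It assumes a worst-case access pattern (repeated sequential passes over 2000 strings), which is not the real scan order, and `make_parser` is a hypothetical helper rather than part of hitlist.

```python
import functools


def make_parser(maxsize):
    """Return a toy cached parser; maxsize=None means unbounded."""
    @functools.lru_cache(maxsize=maxsize)
    def parse(allele: str) -> str:
        return allele.upper()  # stand-in for an expensive parse
    return parse


vocabulary = [f"allele-{i}" for i in range(2000)]  # more strings than 1024

bounded = make_parser(1024)
unbounded = make_parser(None)  # same behavior as @functools.cache

for _ in range(3):  # three passes over the same vocabulary
    for allele in vocabulary:
        bounded(allele)
        unbounded(allele)

bounded_misses = bounded.cache_info().misses
unbounded_misses = unbounded.cache_info().misses
print(bounded_misses, unbounded_misses)
```

In this pattern the bounded cache evicts each entry before it is requested again, so every call re-parses, while the unbounded cache parses each string once.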

Why this is safe:
- All inputs are hashable scalars (str / int).
- classify_ms_row returns a dict, but the two known callers
  (scanner.py `record.update(...)`, supplement.py `**...`) copy its
  keys out without mutating it, so sharing the cached instance across
  rows is safe.
- Return types are otherwise immutable (str, bool, tuple).
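The dict caveat above can be shown with a small sketch. The two-argument signature here is a shortened, hypothetical stand-in for the real `classify_ms_row`, and the returned keys are invented for illustration.

```python
import functools


@functools.cache
def classify_ms_row(process_type: str, pmid: int) -> dict:
    """Toy version with a shortened, hypothetical signature."""
    return {"is_cell_line": process_type == "cell line", "pmid": pmid}


# Safe: update() and ** copy keys out of the cached dict.
record = {"peptide": "SIINFEKL"}
record.update(classify_ms_row("cell line", 123))
merged = {"peptide": "GILGFVFTL", **classify_ms_row("cell line", 123)}
record["is_cell_line"] = False  # changes record only, not the cached dict

still_clean = classify_ms_row("cell line", 123)["is_cell_line"]

# Unsafe: mutating the returned dict changes what every later caller sees.
classify_ms_row("cell line", 123)["is_cell_line"] = "oops"
polluted = classify_ms_row("cell line", 123)["is_cell_line"]
print(still_clean, polluted)
```

A future caller that mutates the result in place would need either a copy at the call site or an immutable return type.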

This is proposal (a) from #85: precompute over the unique vocabulary
and broadcast. A micro-benchmark of 100k calls against a warmed cache
measures 0.30 µs/call, essentially Python call overhead. On the full
IEDB scan that should cut the per-row classifier time (most of the
~10 min scanner pass) by ≥80%.
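A warmed-cache micro-benchmark of this kind might look like the sketch below. It is not the benchmark that was actually run: `classify` is a hypothetical stand-in, and the `lambda` wrapper adds its own call overhead, so timings will vary by machine and will not necessarily match the figure quoted above.

```python
import functools
import timeit


@functools.cache
def classify(value: str) -> bool:
    return value.startswith("HLA-")  # stand-in for the real classifier


classify("HLA-A*02:01")  # warm the cache so only cache hits are timed

n_calls = 100_000
seconds = timeit.timeit(lambda: classify("HLA-A*02:01"), number=n_calls)
us_per_call = seconds / n_calls * 1e6
print(f"{us_per_call:.2f} µs/call")
```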

The larger proposals from #85, (b) pyarrow CSV, (c) ProcessPoolExecutor,
and (d) a persistent fingerprint-keyed scanner-output cache, are left
as follow-ups if more speedup is needed after this lands.

All 138 curation/scanner/supplement/aggregate/observations/proteome
tests pass unchanged.

Closes #85 (proposal a).

1492 of 3083 relevant lines covered (48.39%)
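
As a quick check of the reported figures, the percentage and the change from the previous build (48.311%, as stated in the build header) can be recomputed from the line counts:

```python
covered, relevant = 1492, 3083

coverage_pct = round(covered / relevant * 100, 3)
previous_pct = 48.311  # previous build's coverage, from the build header
delta = round(coverage_pct - previous_pct, 2)

print(coverage_pct, delta)  # 48.394 0.08
```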

0.48 hits per line

Coverage Regressions

Lines  Coverage  ∆      File
43     84.18     0.27%  curation.py
Jobs
ID  Job ID         Ran                      Files  Coverage
1   24700180143.1  21 Apr 2026 02:04AM UTC  21     48.39
GitHub Action Run
  • Github Actions Build #24700180143
  • a2d31fca on github
  • Prev Build on main (#24700099692)
  • Next Build on main (#24754572661)