25613983293
80%

Ran 09 May 2026 11:01PM UTC

Jobs 1

Files 28

Run time 1min

Committed 09 May 2026 10:59PM UTC coverage: 74.898% (+0.4%) from 74.531%

Build # 25613983293

Build Type

push

github

Committed by

web-flow

Commit Message

v1.30.46: split gene/protein columns out of observations.parquet (closes #238 partial) (#243)

Drops ``gene_names``, ``gene_ids``, ``protein_ids``, and
``n_source_proteins`` from the observations.parquet schema.  The same
information has always been in ``peptide_mappings.parquet`` (one row
per peptide x protein); we were storing it in TWO places and merging
the long-form mapping back onto every observation row at build time.

## Wins (measured against the v1.30.45 build of the same source data)

| Metric | v1.30.45 | v1.30.46 | Delta |
|---|---|---|---|
| observations.parquet size | 192.2 MB | **117.6 MB** | **-39%, -74.5 MB** |
| Gene/protein columns combined size | 71.1 MB (38.8% of obs) | 0 MB | -71 MB |
| Gene-annotation merge step in build | full-frame merge | skipped | (eliminates the largest single transient memory step in build_observations) |
| ``hitlist pmhc --gene PRAME`` rows loaded | 4.4M (full corpus) | 257 (matched peptides only) | ~17,000x reduction |
| ``hitlist pmhc --gene PRAME`` parquet load time | ~6s | 0.8s | ~7.5x faster |

Build-side memory wins are harder to capture as a single number.  The
v1.30.45 build held the full peptide_mappings (~65 MB) and obs (~190 MB)
in pandas form simultaneously inside ``annotate_observations_with_genes``,
the largest transient memory step in the build.  Skipping that step
removes the single biggest memory spike in the build pipeline.
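
The derived figures in the table can be re-checked from the raw numbers it
reports.  The snippet below is only a small arithmetic sketch; the helper
names ``fold_reduction`` and ``percent_change`` are made up for this example
and are not part of hitlist.

```python
def fold_reduction(before: float, after: float) -> float:
    """How many times smaller `after` is than `before`."""
    return before / after


def percent_change(before: float, after: float) -> float:
    """Signed percentage change going from `before` to `after`."""
    return (after - before) / before * 100.0


# Figures copied from the table above (rounded as published there).
obs_mb_before, obs_mb_after = 192.2, 117.6
rows_before, rows_after = 4_400_000, 257
load_s_before, load_s_after = 6.0, 0.8

print(f"size change: {percent_change(obs_mb_before, obs_mb_after):.1f}%")
print(f"rows loaded: {fold_reduction(rows_before, rows_after):,.0f}x fewer")
print(f"load time:   {fold_reduction(load_s_before, load_s_after):.1f}x faster")
```

From the rounded inputs this gives roughly -38.8%, about 17,100x, and 7.5x,
in line with the table.  The same arithmetic puts 71.1 MB at about 37.0% of
192.2 MB, so the 38.8% quoted for the gene/protein columns may be relative to
a different denominator or unit.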

## What changed

### Build path (``builder.py``)

Skip ``annotate_observations_with_genes()``.  The peptide_mappings.parquet
sidecar continues to be built independently as it always was.

### Reader path (``observations.py``)

- ``load_observations`` now auto-attaches gene/protein columns from
  peptide_mappings.parquet when the caller requests them but the
  observations parquet does not carry them (v1.30.46 and later).  It
  joins only the matched-peptides slice, so the join is cheap on
  filtered loads and expensive only on full-corpus loads, where it
  costs the same as the old build-time merge.
- New entries in ``_DERI... (continued)
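
The filtered-load behaviour described in the first bullet can be sketched as
follows.  This is a minimal illustration in plain Python, not the code in
``observations.py``: lists of dicts stand in for the two parquet tables, and
the mapping column names (``peptide``, ``gene_name``, ``protein_id``), the
``;`` delimiter, the synthetic rows, and the ``attach_gene_columns`` helper
are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical long-form mapping: one row per peptide x protein.
# Column names and values are invented for illustration only.
MAPPINGS = [
    {"peptide": "PEPTIDEA", "gene_name": "GENE1", "protein_id": "PROT1"},
    {"peptide": "PEPTIDEA", "gene_name": "GENE2", "protein_id": "PROT2"},
    {"peptide": "PEPTIDEB", "gene_name": "GENE3", "protein_id": "PROT3"},
]


def attach_gene_columns(observations, mappings):
    """Attach gene/protein columns to already-filtered observation rows.

    Only mappings for peptides present in `observations` are aggregated,
    so the work scales with the filtered slice, not the full corpus.
    """
    wanted = {row["peptide"] for row in observations}

    genes = defaultdict(list)
    proteins = defaultdict(list)
    for m in mappings:
        pep = m["peptide"]
        if pep not in wanted:
            continue  # mapping belongs to a peptide that was filtered out
        if m["gene_name"] not in genes[pep]:
            genes[pep].append(m["gene_name"])
        if m["protein_id"] not in proteins[pep]:
            proteins[pep].append(m["protein_id"])

    attached = []
    for row in observations:
        pep = row["peptide"]
        attached.append({
            **row,
            "gene_names": ";".join(genes.get(pep, [])),
            "protein_ids": ";".join(proteins.get(pep, [])),
            "n_source_proteins": len(proteins.get(pep, [])),
        })
    return attached


# A filtered load that matched a single peptide.
filtered = [{"peptide": "PEPTIDEA", "mhc_allele": "HLA-A*02:01"}]
for row in attach_gene_columns(filtered, MAPPINGS):
    print(row)
```

Aggregating after the filter is what keeps the cost proportional to the
matched peptides; running the same helper over every observation row would
reproduce the cost of the old build-time merge.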

Coverage Stats

4234 of 5653 relevant lines covered (74.9%)

0.75 hits per line
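
The headline percentage follows directly from the line counts above.  A short
check, assuming the previous build's figure is the 74.531% quoted in the
commit line (variable names are illustrative only):

```python
covered, relevant = 4234, 5653   # line counts from Coverage Stats
previous_pct = 74.531            # prior build, from the "Committed" line

coverage_pct = covered / relevant * 100
delta = coverage_pct - previous_pct

print(f"coverage: {coverage_pct:.3f}%")
print(f"delta:    {delta:+.1f}%")
```

This reproduces the reported 74.898% and the rounded +0.4% change.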

Coverage Regressions

Lines	Coverage	∆	File
102	63.08	0.85%	builder.py
19	89.21	0.64%	pmhc_query.py
11	94.53	-0.63%	observations.py

Jobs

ID	Job ID	Ran	Files	Coverage
1	25613983293.1	09 May 2026 11:01PM UTC	28	74.9	GitHub Action Run

pirl-unc / hitlist / 25613983293