
pirl-unc / hitlist / 25492612956

07 May 2026 11:18AM UTC coverage: 72.255% (+0.1%) from 72.108%
v1.30.39: shrink build_observations memory ~7-10x via categorical compression (#232)

* v1.30.39: shrink build_observations memory ~7-10x via categorical compression

The full IEDB + CEDAR + supplementary observations frame is held in
RAM through ``build_peptide_mappings`` and the gene-annotation merge
(#105 atomic-rebuild path).  At ~4.4 M rows × ~30 string columns,
peak RSS hit ~40 GB on the build host, which is more than many dev
machines have.

Most of those columns repeat heavily (3-200 distinct values across
millions of rows), so they convert to ``category`` dtype with no
information loss.  Pyarrow stores categoricals as dictionary-encoded
columns regardless, so the on-disk parquet is byte-identical to a
non-categorical build — only in-RAM footprint changes.
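The lossless round trip is easy to check in pandas (illustrative values, not the project's columns):

```python
import pandas as pd

# Low-cardinality strings: millions of rows, a handful of distinct values.
s = pd.Series(["ELISA", "ICS", "multimer"] * 100_000)
c = s.astype("category")

# Same values after converting back; only the in-RAM representation differs
# (small integer codes plus one dictionary of distinct strings).
assert c.astype(object).equals(s)
ratio = s.memory_usage(deep=True) / c.memory_usage(deep=True)
print(f"~{ratio:.0f}x smaller in RAM")
```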

- New helper ``_compress_categoricals`` (in-place ``object → category``
  for a fixed list of low-cardinality columns).
- Applied at the natural memory peaks: post-partition (per source),
  post-concat (after IEDB+CEDAR merge), post-supplementary concat,
  and after the gene-annotation merge (which can revert categoricals
  to object).
- Eager ``del`` of the per-source scan frame and the per-source
  partition copies after they've been appended/concatenated, so
  intermediate buffers are reclaimed before the next allocation.
- Sequential row-count tracking via local ``n_ms`` / ``n_bd`` /
  ``n_supp_added`` so prints don't reach into already-freed lists.

Smoke test on a synthetic 1 M-row fixture (representative cardinalities):
  before: 526 MB → after: 69 MB (7.6x reduction).

For the full corpus the realized peak is expected to land in the
single-digit-GB range; actual numbers will come from running the
build on the upgraded path.
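The re-application after the gene-annotation merge (noted in the bullets above) guards against a real pandas behavior: merging a ``category`` key against a plain ``object`` key quietly demotes the result to ``object``. A small demonstration, with placeholder gene names:

```python
import pandas as pd

# Left frame with a category key (as after categorical compression);
# the annotation table carries a plain object key.
left = pd.DataFrame({"gene": pd.Categorical(["KRAS", "TP53", "KRAS"])})
right = pd.DataFrame({"gene": ["KRAS", "TP53"],
                      "annotation": ["g12", "p53"]})

merged = left.merge(right, on="gene", how="left")
print(merged["gene"].dtype)  # object: the category dtype did not survive
```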

* fix root cause: pyarrow streaming concat + vectorized dedup

Addresses three structural memory peaks in build_observations that
the categorical pass alone couldn't fix:

1. ``pd.concat`` 2x peak — the prior code held both per-source
   pandas frames AND allocat... (continued)

4107 of 5684 relevant lines covered (72.26%)

0.72 hits per line

