
pirl-unc / hitlist / 25492612956

Build:
  Default branch: main
  Ran: 07 May 2026 11:21AM UTC
  Jobs: 1
  Files: 28
  Run time: 1min

07 May 2026 11:18AM UTC coverage: 72.255% (+0.1%) from 72.108%
Build 25492612956 · push · github · web-flow
v1.30.39: shrink build_observations memory ~7-10x via categorical compression (#232)

* v1.30.39: shrink build_observations memory ~7-10x via categorical compression

The full IEDB + CEDAR + supplementary observations frame is held in
RAM through ``build_peptide_mappings`` and the gene-annotation merge
(#105 atomic-rebuild path).  At ~4.4 M rows × ~30 string columns,
peak RSS hit ~40 GB on the build host, which is more than many dev
machines have.

Most of those columns repeat heavily (3-200 distinct values across
millions of rows), so they convert to ``category`` dtype with no
information loss.  pyarrow stores categoricals as dictionary-encoded
columns anyway, so the on-disk parquet is byte-identical to a
non-categorical build; only the in-RAM footprint changes.
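The effect of that conversion can be sketched with a toy column (illustrative data, not the actual hitlist schema):

```python
import pandas as pd

# A low-cardinality string column: many rows, three distinct values,
# mimicking the heavily repeated columns described above (toy data,
# not the real observations frame).
n = 300_000
df = pd.DataFrame({"source": ["IEDB", "CEDAR", "supplementary"] * n})

obj_bytes = df["source"].memory_usage(deep=True)

# Lossless: same values, stored as integer codes plus a tiny dictionary.
df["source"] = df["source"].astype("category")
cat_bytes = df["source"].memory_usage(deep=True)

print(f"object: {obj_bytes / 1e6:.1f} MB -> category: {cat_bytes / 1e6:.1f} MB")
```

With only three categories the codes fit in ``int8``, so the per-row cost drops from a pointer plus a Python string to a single byte.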

- New helper ``_compress_categoricals`` (in-place ``object → category``
  for a fixed list of low-cardinality columns).
- Applied at the natural memory peaks: post-partition (per source),
  post-concat (after IEDB+CEDAR merge), post-supplementary concat,
  and after the gene-annotation merge (which can revert categoricals
  to object).
- Eager ``del`` of the per-source scan frame and the per-source
  partition copies after they've been appended/concatenated, so
  intermediate buffers are reclaimed before the next allocation.
- Sequential row-count tracking via local ``n_ms`` / ``n_bd`` /
  ``n_supp_added`` so prints don't reach into already-freed lists.

Smoke test on a synthetic 1 M-row fixture (representative cardinalities):
  before: 526 MB → after: 69 MB (7.6x reduction).

For the full corpus the realized peak is expected to land in the
single-digit GB range; actual numbers will come from running the
build on the upgraded path.

* fix root cause: pyarrow streaming concat + vectorized dedup

Addresses three structural memory peaks in build_observations that
the categorical pass alone couldn't fix:

1. ``pd.concat`` 2x peak — the prior code held both per-source
   pandas frames AND allocat... (continued)

4107 of 5684 relevant lines covered (72.26%)

0.72 hits per line

Coverage Regressions

Lines  Coverage  ∆       File
236    62.18     1.83%   builder.py
1      54.79     -0.14%  cli.py
Jobs

ID  Job ID         Ran                      Files  Coverage
1   25492612956.1  07 May 2026 11:21AM UTC  28     72.26
GitHub Action Run
Source Files on build 25492612956
Files: 28 · Changed: 3 · Source Changed: 0 · Coverage Changed: 3
  • Github Actions Build #25492612956
  • e869e83b on github
  • Prev Build on main (#25450907293)
  • Next Build on main (#25493184752)
© 2026 Coveralls, Inc