26973114525
80%

Ran 04 Jun 2026 07:02PM UTC

Jobs 1

Files 29

Run time 1min

Committed 04 Jun 2026 07:00PM UTC coverage: 76.817% (+0.03%) from 76.789%

Build # 26973114525

Build Type

push

github

Committed by

web-flow

Commit Message

v1.30.58: categorical dtypes for low-cardinality obs metadata (#263) (#269)

* v1.30.58: categorical dtypes for low-cardinality obs metadata (#263)

generate_observations_table() brings per-sample metadata in via the
ms_samples join and PMID dict-lookups, so those columns arrive as plain
object/string dtype (hundreds of MB of per-worker Python str overhead on
a ~4.4M-row table) instead of via the dictionary-encoded parquet path.
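The overhead described above can be reproduced on a small scale. This is a minimal sketch, assuming the observations table is a pandas DataFrame; the column values and the 100,000-row size are made up for illustration and are not taken from the real table.

```python
import pandas as pd

# Hypothetical stand-in for one low-cardinality metadata column.
# The real table has ~4.4M rows; 100,000 keeps this quick.
n_rows = 100_000
values = ["HLA-A*02:01", "HLA-B*07:02", "unknown"]

# Plain object dtype: one Python str object referenced per row.
as_object = pd.Series([values[i % 3] for i in range(n_rows)], dtype=object)

# Category dtype: small integer codes plus one copy of each distinct value.
as_category = as_object.astype("category")

# deep=True counts the Python string objects, not just the pointers.
object_bytes = as_object.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The exact byte counts depend on the pandas and Python versions, so none are quoted here; the category column should come out far smaller.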

- Downcast an audited allowlist of low-cardinality join/derived columns
  to category at the enrichment boundary (sample_mhc, mhc_restriction,
  mhc_class_label_severity, instrument_type, cell_line_name,
  condition_category, ...). Biggest wins: sample_mhc 479MB,
  mhc_class_label_severity 227MB, mhc_restriction 224MB -> ~9MB each.
  Companion to #262: these round-trip through Arrow IPC as dictionary
  columns that xdist workers mmap-share zero-copy.
- sample_label is deliberately NOT categoricalized: it is compared
  element-wise against cell_name (already categorical), and two
  categoricals with different category sets cannot be compared.
- Add cell_line_name + cell_type to builder._CATEGORICAL_BUILD_COLUMNS
  (the parquet write path) - low-cardinality, was missing.
- _apply_training_defaults now fills via _fillna_scalar_safe (widens the
  category set for out-of-category sentinels like 'not_applicable')
  instead of a raw fillna that raised on categorical columns.
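The comparison and fillna behaviour mentioned in the bullets above can be sketched as follows, assuming pandas. fillna_scalar_safe is a hypothetical stand-in written for this example; it is not the project's _fillna_scalar_safe, whose actual implementation is not shown in this commit message.

```python
import pandas as pd


def fillna_scalar_safe(series, value):
    """Fill missing values, widening the category set first if needed.

    Hypothetical helper for illustration only.
    """
    if isinstance(series.dtype, pd.CategoricalDtype):
        if value not in series.cat.categories:
            series = series.cat.add_categories([value])
    return series.fillna(value)


# 1. Comparing two categoricals with different category sets raises.
cell_name = pd.Series(["Line-1", "Line-2"], dtype="category")
other_cat = pd.Series(["Line-1", "Line-3"], dtype="category")
try:
    cell_name == other_cat
    mismatched_raises = False
except TypeError:
    mismatched_raises = True

# Keeping one side as plain strings makes the element-wise comparison work.
sample_label = pd.Series(["Line-1", "Line-3"], dtype=object)
matches = (cell_name == sample_label).tolist()

# 2. A raw fillna with an out-of-category sentinel raises on a categorical.
severity = pd.Series(["high", None, "low"], dtype="category")
try:
    severity.fillna("not_applicable")
    raw_fillna_raises = False
except (TypeError, ValueError):  # exception type varies by pandas version
    raw_fillna_raises = True

filled = fillna_scalar_safe(severity, "not_applicable")
print(mismatched_raises, matches, raw_fillna_raises, filled.tolist())
```

Widening the category set before filling keeps the column categorical, whereas casting back to object to fill would give up the memory savings.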

Excludes peptide / *_iri / free-text / semicolon multi-value columns,
mirroring the builder's existing exclusions.
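An allowlist-based downcast of this kind could look like the sketch below, assuming pandas. The function name, the constant name, and the sample data are hypothetical; only the column names come from the commit message, and the real allowlist is longer.

```python
import pandas as pd

# Hypothetical subset of the audited allowlist named in the commit message.
CATEGORICAL_ALLOWLIST = ("sample_mhc", "mhc_restriction", "instrument_type")


def downcast_allowlisted(df, allowlist=CATEGORICAL_ALLOWLIST):
    """Return a copy of df with allowlisted columns cast to category.

    Columns outside the allowlist (peptide, *_iri, free text, ...) are
    left untouched simply because they are never listed.
    """
    out = df.copy()
    for col in allowlist:
        if col not in out.columns:
            continue  # tolerate tables that lack some allowlisted columns
        if isinstance(out[col].dtype, pd.CategoricalDtype):
            continue  # already categorical, e.g. read from parquet
        out[col] = out[col].astype("category")
    return out


# Made-up rows for illustration.
obs = pd.DataFrame(
    {
        "peptide": ["SIINFEKL", "GILGFVFTL", "NLVPMVATV"],
        "sample_mhc": ["HLA-A*02:01", "HLA-A*02:01", "HLA-B*07:02"],
        "instrument_type": ["Orbitrap", "Orbitrap", "timsTOF"],
    }
)
downcast = downcast_allowlisted(obs)
print(downcast.dtypes)
```

An explicit allowlist, rather than a cardinality threshold computed at run time, keeps the set of converted columns stable and reviewable.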

* Document the three cell_* fields in the categorical test fixture

Clarify cell_name (raw IEDB catch-all) vs cell_line_name (line part,
hybrid suffix stripped) vs cell_type (tissue/type part), and make the
synthetic values coherent (a 'Line-1-B cell' hybrid -> line 'Line-1' +
type 'B cell').

Coverage Stats

4725 of 6151 relevant lines covered (76.82%)

0.77 hits per line
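The percentages reported on this page follow from the line counts. A quick arithmetic check, using only figures that appear on the page (the previous value 76.789 is itself rounded):

```python
covered, relevant = 4725, 6151

# Line coverage as a percentage of relevant lines.
coverage_pct = 100 * covered / relevant

# Change relative to the previous build's reported coverage.
previous_pct = 76.789
delta = coverage_pct - previous_pct

print(f"{coverage_pct:.3f}%")  # 76.817%
print(f"{coverage_pct:.2f}%")  # 76.82%
print(f"{delta:+.2f}%")        # +0.03%
```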

Coverage Regressions

Lines	Coverage	∆	File
157	81.47	0.11%	export.py
109	63.69	0.0%	builder.py

Jobs

ID	Job ID	Ran	Files	Coverage
1	26973114525.1	04 Jun 2026 07:02PM UTC	29	76.82	GitHub Action Run

pirl-unc / hitlist / 26973114525