24699661514

Ran 21 Apr 2026 01:44AM UTC

Jobs 1

Files 21

Run time 1min

Committed 21 Apr 2026 01:43AM UTC

Coverage: 47.309% (+1.8 percentage points, up from 45.555%)

Build # 24699661514

Build Type

push

github

Committed by

web-flow

Commit Message

v1.12.2: Wire bulk_proteomics.parquet into builder + harmonize metadata (#90)

Makes bulk proteomics a first-class sibling of observations.parquet /
binding.parquet. `hitlist data build` now emits a third parquet at
~/.hitlist/bulk_proteomics.parquet — a long-form table with rows for
both protein- and peptide-level measurements (distinguished by
`granularity`), with per-source acquisition metadata denormalized onto
every row so the file is self-contained for MS-bias modeling.
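
As a rough illustration of the long-form layout described above, the sketch below splits toy rows by `granularity`. It uses plain Python dicts rather than reading the real parquet. The `feature_id` and `abundance` fields, the literal values `"protein"` and `"peptide"`, and every data value are assumptions made for the example, not the actual hitlist schema.

```python
from collections import defaultdict

# Toy stand-ins for rows of bulk_proteomics.parquet. Only `granularity`,
# `cell_line_name` and `instrument` are column names taken from the commit
# message; `feature_id`, `abundance` and every value here are invented.
ROWS = [
    {"granularity": "protein", "feature_id": "PROT_A",
     "cell_line_name": "HeLa", "instrument": "toy-ms", "abundance": 12.5},
    {"granularity": "protein", "feature_id": "PROT_B",
     "cell_line_name": "HeLa", "instrument": "toy-ms", "abundance": 7.0},
    {"granularity": "peptide", "feature_id": "PEPTIDEK",
     "cell_line_name": "HeLa", "instrument": "toy-ms", "abundance": 3.2},
]


def split_by_granularity(rows):
    """Group long-form rows into one list per granularity level."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["granularity"]].append(row)
    return dict(groups)


by_level = split_by_granularity(ROWS)
for level, level_rows in sorted(by_level.items()):
    print(level, len(level_rows))
```

Because the acquisition metadata is repeated on every row, each subset can be analyzed without joining back to a separate sources table.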

Schema is harmonized with the per-sample schema in hitlist.export for
observations.parquet — same column names for instrument, instrument_type,
fragmentation, acquisition_mode, labeling, search_engine, fdr,
cell_line_name, sample_label, pmid, study_label, species — so the same
column list extracts from either index for joint analysis. Bulk-specific
prep fields (digestion, digestion_enzyme, fractionation, n_fractions,
quantification) sit alongside. See #89 for row-level digestion_enzyme
follow-up with non-tryptic digests.
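
To make the "same column list extracts from either index" point concrete, here is a minimal sketch that projects rows from both tables onto the shared columns named above. The rows are invented plain dicts; the `peptide` key on the observation row and all values are assumptions for illustration only.

```python
# Column names copied from the commit message; they are described as shared
# between observations.parquet and bulk_proteomics.parquet.
SHARED_COLUMNS = (
    "instrument", "instrument_type", "fragmentation", "acquisition_mode",
    "labeling", "search_engine", "fdr", "cell_line_name", "sample_label",
    "pmid", "study_label", "species",
)


def select_shared(row):
    """Keep only the harmonized columns; absent ones become None."""
    return {col: row.get(col) for col in SHARED_COLUMNS}


# Invented example rows: one per table, each with a table-specific extra field.
observation_row = {"peptide": "PEPTIDEK", "instrument": "toy-ms",
                   "cell_line_name": "HeLa", "species": "Homo sapiens"}
bulk_row = {"digestion_enzyme": "trypsin", "instrument": "toy-ms",
            "cell_line_name": "HeLa", "species": "Homo sapiens"}

joint = [select_shared(observation_row), select_shared(bulk_row)]
```

Both projected rows end up with identical keys, which is what allows them to be stacked for joint analysis.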

Also:
- Bekker-Jensen protein-level abundance (71,520 rows across A549 /
  HCT116 / HEK293 / HeLa / MCF7) joined into load_bulk_proteomics()
  alongside CCLE. Use abundance_percentile (rank within cell line) for
  cross-source comparisons; intensity values are not directly comparable
  because CCLE is TMT-normalized and BJ is label-free.
- sources.yaml enriched with pmid, study_label, species, fragmentation,
  acquisition_mode, labeling, n_fractions so the harmonized fields are
  curated, not inferred.
- Loaders (load_bulk_proteomics, load_bulk_peptides) prefer the built
  parquet when present (faster, with full metadata) and fall back to the
  packaged CSVs otherwise, so they keep working without `hitlist data build`.
- `cell_line` → `cell_line_name` column rename to match observations.
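
The first bullet's advice to compare sources by rank rather than raw intensity can be sketched as follows. This is one plausible definition of a within-cell-line percentile (share of values in the same group that are less than or equal to the row's value); the exact formula hitlist uses is not stated here, and the `source` and `abundance` fields and all values are assumptions.

```python
from bisect import bisect_right
from collections import defaultdict


def add_abundance_percentile(rows, value_key="abundance"):
    """Return copies of rows with a 0-100 rank within (source, cell line)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["source"], row["cell_line_name"])].append(row[value_key])
    for values in grouped.values():
        values.sort()

    ranked = []
    for row in rows:
        values = grouped[(row["source"], row["cell_line_name"])]
        # Share of group values that are <= this row's value.
        pct = 100.0 * bisect_right(values, row[value_key]) / len(values)
        ranked.append({**row, "abundance_percentile": pct})
    return ranked


# Invented intensities on deliberately different scales per source.
rows = [
    {"source": "CCLE", "cell_line_name": "HeLa", "abundance": 1.0},
    {"source": "CCLE", "cell_line_name": "HeLa", "abundance": 2.0},
    {"source": "CCLE", "cell_line_name": "HeLa", "abundance": 3.0},
    {"source": "CCLE", "cell_line_name": "HeLa", "abundance": 4.0},
    {"source": "BJ", "cell_line_name": "HeLa", "abundance": 1e6},
    {"source": "BJ", "cell_line_name": "HeLa", "abundance": 5e6},
]
percentiles = [r["abundance_percentile"] for r in add_abundance_percentile(rows)]
```

The raw values differ by orders of magnitude between the two toy sources, but the percentiles land on the same 0 to 100 scale.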

Bumps cache_is_valid to require all three parquets. Adds round-trip
tests for the harmonized columns, build artifact, and cross-source
loader behavior. Old loaders' `cell_line` filter arg keeps work... (continued)

Coverage Stats

1424 of 3010 relevant lines covered (47.31%)

0.47 hits per line
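
The headline figures can be cross-checked from the line counts reported on this page. The snippet below is only an arithmetic check; the rounding conventions (three decimals for coverage, one for the change) are an assumption inferred from how the numbers are displayed.

```python
# Figures copied from this build page.
covered_lines = 1424
relevant_lines = 3010
previous_coverage = 45.555  # percent, from the prior build

coverage = 100 * covered_lines / relevant_lines
delta = coverage - previous_coverage

print(f"coverage: {coverage:.3f}%")
print(f"change:   {delta:+.1f} points")
```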

Coverage Regressions

Lines	Coverage	∆	File
159	47.81	7.6%	builder.py

Jobs

ID	Job ID	Ran	Files	Coverage
1	24699661514.1	21 Apr 2026 01:44AM UTC	21	47.31	GitHub Action Run

pirl-unc / hitlist / 24699661514