
pirl-unc / tcrsift / 25895642089
84%

Build:
DEFAULT BRANCH: main
Ran 15 May 2026 01:44AM UTC
Jobs 4
Files 29
Run time 1min

15 May 2026 01:42AM UTC coverage: 76.211% (+0.4%) from 75.798%
25895642089

push

github

web-flow
Fix #58: lightweight Pgen estimator + publicness annotation + plots (#79)

* Fix #58: lightweight Pgen estimator + publicness annotation + plots

Adds a coarse Pgen estimator for TCR β/α CDR3 sequences without
introducing OLGA (~30MB of pre-trained models + runtime deps). The
estimator is calibrated for *ranking* not for absolute Pgens — good
enough to discount public-CDR3 DB matches in `annotate` and to
visualize where a clone falls on the public-vs-private spectrum.

New module `tcrsift.pgen` (~250 lines, ~3KB inlined marginals):

  log Pgen ≈ log P(V) + log P(J) + log P(length)
           + log P(n_inserted) + log P(CDR3 | length)
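
Because the five terms are additive in log space, the decomposition above can be illustrated with a few lines of Python. This is only a sketch: the function name, the dictionary keys, and every number below are hypothetical placeholders, not values produced by `tcrsift.pgen`.

```python
def log10_pgen_from_terms(terms):
    """Sum additive log10 terms into a single log10 Pgen estimate."""
    return sum(terms.values())


# Hypothetical per-term log10 values for one sequence (illustration only).
terms = {
    "v": -1.2,
    "j": -0.9,
    "length": -0.8,
    "n_inserted": -2.5,
    "cdr3_given_length": -16.9,
}

total = log10_pgen_from_terms(terms)

# The most negative term contributes most to a low Pgen.
dominant = min(terms, key=terms.get)
```

With these placeholder numbers the composition term dominates, which is the kind of diagnosis a per-term breakdown makes possible.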

The N-insertion term is the key new feature per discussion: sequences
with 0 non-templated nt are convergently generatable across donors
and dominate public-match noise. We model it as:
- β: sum of two geometric distributions (V-D + D-J junctions), p=0.18
- α: single geometric, p=0.22
- n_inserted estimated from observed CDR3 nt length minus typical
  templated V+J+D contribution (~21 nt β / ~24 nt α from IMGT means).
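
The N-insertion model above might be sketched as follows. Several points are assumptions rather than confirmed details of the module: the geometric distributions are taken to start at 0 insertions, "sum of two geometric distributions" is read as the sum of two independent variables with the same p (a negative binomial with r=2), the CDR3 nucleotide length is taken as 3 × the amino-acid length, and negative estimates are clamped to 0. All function names are hypothetical.

```python
import math

# Parameters quoted in the commit message.
P_GEOM = {"beta": 0.18, "alpha": 0.22}
TEMPLATED_NT = {"beta": 21, "alpha": 24}


def estimate_n_inserted(cdr3_aa, chain):
    """Estimate inserted nt as observed length minus templated length.

    Assumption: nt length = 3 * AA length, clamped at 0.
    """
    return max(0, 3 * len(cdr3_aa) - TEMPLATED_NT[chain])


def log10_p_n_inserted(n, chain):
    """log10 probability of n inserted nucleotides under the sketch model."""
    p = P_GEOM[chain]
    q = 1.0 - p
    if chain == "beta":
        # Sum of two i.i.d. geometrics on {0, 1, ...} (V-D plus D-J):
        # P(n) = (n + 1) * p^2 * q^n
        prob = (n + 1) * p * p * q ** n
    else:
        # Single geometric on {0, 1, ...}: P(n) = p * q^n
        prob = p * q ** n
    return math.log10(prob)
```

Under these assumptions a β sequence with zero insertions scores log10(0.18²) ≈ -1.49 on this term, while long junctions are penalized progressively.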

Bundled marginals (chosen from OLGA pre-trained / Britanova 2014 /
Pogorelyy 2018 / Robins 2009 estimates):
- Top-20 TRBV + 13 TRBJ + 15 TRAV + 10 TRAJ usage frequencies; tail
  for unlisted genes spread over remaining gene-pool count.
- CDR3 AA length: Gaussian per chain (β mean 14, α mean 13.5, σ=2.5).
- AA composition: uniform 1/20 per position baseline (OLGA uses
  position-specific matrices; we trade accuracy for size).
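
A minimal sketch of how the three marginals listed above could be evaluated. The usage table and pool size are illustrative placeholders, not the bundled frequencies; evaluating the Gaussian density at the integer length (rather than integrating over a bin) is also an assumption, as are all function names.

```python
import math

# Placeholder usage table: NOT the bundled values, illustration only.
TRBV_USAGE = {"TRBV20-1": 0.10, "TRBV5-1": 0.06, "TRBV7-9": 0.04}
TRBV_POOL_SIZE = 48  # hypothetical total number of TRBV genes

# Length parameters quoted in the commit message.
LENGTH_MEAN = {"beta": 14.0, "alpha": 13.5}
LENGTH_SIGMA = 2.5


def log10_gene_usage(gene, usage, pool_size):
    """Listed genes use their frequency; unlisted genes share the tail mass."""
    if gene in usage:
        return math.log10(usage[gene])
    tail_mass = 1.0 - sum(usage.values())
    n_unlisted = pool_size - len(usage)
    return math.log10(tail_mass / n_unlisted)


def log10_length(n_aa, chain):
    """Gaussian density at the CDR3 AA length (assumed unit-width bin)."""
    z = (n_aa - LENGTH_MEAN[chain]) / LENGTH_SIGMA
    density = math.exp(-0.5 * z * z) / (LENGTH_SIGMA * math.sqrt(2.0 * math.pi))
    return math.log10(density)


def log10_uniform_composition(n_aa):
    """Uniform 1/20 per position: n_aa * log10(1/20)."""
    return n_aa * math.log10(1.0 / 20.0)
```

The uniform composition term costs about 1.3 log10 units per residue regardless of sequence, so with this baseline it effectively acts as a second length penalty.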

API:
- `pgen_single(cdr3, v, j, chain, n_inserted=None) -> float`
  Single-sequence log10 Pgen.
- `pgen_components(...) -> dict[str, float]`
  Debug helper: returns the five additive terms so users can see
  which one is dominating.
- `compute_pgen(df, ...) -> pd.Series`
  Per-row scoring for a clonotype frame.
- `publicness_score(log10_pgen, ...) -> pd.Series`
  Maps log10 Pgen → [0,1] publicness. Default cutoffs are calibrated
  to this estimator (-30 / -18); pass OLGA cutoffs (-20 ... (continued)
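
One plausible way to map log10 Pgen onto [0,1] with the default cutoffs is a clamped linear ramp. The commit message does not state the functional form, so the ramp, the direction (higher Pgen means more public), and the function name below are all assumptions; the real function operates on a `pd.Series`, whereas this sketch takes a single float.

```python
def publicness_sketch(log10_pgen, low=-30.0, high=-18.0):
    """Clamped linear ramp: 0 at or below `low`, 1 at or above `high`.

    Assumed form; defaults are the cutoffs quoted for this estimator.
    """
    frac = (log10_pgen - low) / (high - low)
    return min(1.0, max(0.0, frac))
```

Under this assumed form, a sequence at log10 Pgen = -24 lands at the midpoint of the scale.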

6090 of 7991 relevant lines covered (76.21%)

3.05 hits per line

Coverage Regressions

Lines  Coverage  ∆       File
342    62.57     1.46%   plots.py
17     95.2      -1.27%  annotate.py
Jobs
ID  Job ID                       Ran                      Files  Coverage
1   python-3.9 - 25895642089.1   15 May 2026 01:44AM UTC  29     76.17
2   python-3.10 - 25895642089.2  15 May 2026 01:44AM UTC  29     76.2
3   python-3.12 - 25895642089.3  15 May 2026 01:44AM UTC  29     76.2
4   python-3.11 - 25895642089.4  15 May 2026 01:44AM UTC  29     76.2
Source Files on build 25895642089
  • Tree
  • List 29
  • Changed 3
  • Source Changed 0
  • Coverage Changed 3
Commit c0dc5cd6 on GitHub