28541272305
65%

Ran 01 Jul 2026 07:08PM UTC

Jobs 1

Files 16

Run time 1min

Committed 01 Jul 2026 07:07PM UTC coverage: 84.284% (-0.03%) from 84.318%

Build # 28541272305

Build Type

push

github

Committed by

freeeve

Commit Message

feat(terms)!: per-language stop words + decouple stemming from language

Stop-word removal is now keyed on the index language (18 Snowball languages)
instead of a fixed English list, and the RRTI stem filter is decoupled from the
language so an index can strip a language's stop words without stemming.

Header semantics (RRTI): the `language` byte is meaningful when bit0 (stemmed)
OR bit1 (stop-words) is set. The reader builds the stemmer only under bit0 but
reads the language under bit0 | bit1. Enabling either filter requires a language:
a filter set with no language is a build error (language == 0 does not fall back
to English). Defaults (no filter) stay byte-identical; the English list is the
unchanged 31-word set, so existing English-stopword indexes are unaffected.
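
The header rules above can be sketched as a small model. This is only an illustration of the stated semantics, not the project's code: FLAG_STEMMED is named in the commit message, while FLAG_STOPWORDS, encode_flags, decode_header and the integer language codes are hypothetical names chosen for the example.

```python
FLAG_STEMMED = 0x01    # bit0 (stemmed), as described in the commit message
FLAG_STOPWORDS = 0x02  # bit1 (stop-words); this constant name is an assumption


def encode_flags(language, stem, stopwords):
    """Build the flag bits; a filter without a language is an error."""
    if (stem or stopwords) and language is None:
        raise ValueError("a stem or stop-word filter requires a language")
    flags = 0
    if stem:
        flags |= FLAG_STEMMED
    if stopwords:
        flags |= FLAG_STOPWORDS
    return flags


def decode_header(flags, language_byte):
    """Return (language, build_stemmer, strip_stop_words) for a reader."""
    # The language byte is meaningful when bit0 OR bit1 is set.
    meaningful = bool(flags & (FLAG_STEMMED | FLAG_STOPWORDS))
    language = language_byte if meaningful else None
    # The stemmer is built only under bit0.
    build_stemmer = bool(flags & FLAG_STEMMED)
    strip_stop_words = bool(flags & FLAG_STOPWORDS)
    return language, build_stemmer, strip_stop_words
```

With this model, an index built with stop words but no stemming carries only bit1, and a reader still recovers the language without constructing a stemmer.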

Per-language lists live once in stopwords/<lang>.txt at the repo root (sorted,
lowercased, de-duplicated). Rust embeds them with include_str! and Go with
//go:embed -- the same physical files, so the two ports' lists are byte-identical
by construction. English is the fixed list; the other 17 are from NLTK, Tamil
from spaCy.
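
As a rough illustration of the "sorted, lowercased, de-duplicated" invariant, the following sketch normalizes a word list and checks whether a list is already in that form. It assumes one word per line in each file, which the commit message does not state; normalize_stop_words and is_normalized are hypothetical helper names.

```python
def normalize_stop_words(words):
    """Lowercase, de-duplicate and sort a list of stop words."""
    # Blank entries are dropped; surrounding whitespace is ignored.
    return sorted({w.strip().lower() for w in words if w.strip()})


def is_normalized(words):
    """True when the list is already sorted, lowercased and unique."""
    return list(words) == normalize_stop_words(words)
```

A check like is_normalized could run over each list in a test so that both ports keep embedding files in the same canonical form.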

API:
- Rust: stop_words(lang) / is_stop_word(t, lang); Tokenizer::with(language, stem,
  stopwords, case_fold) with the old new(..) kept as a stem = language.is_some()
  shim; spec() widened to (language, stem, stopwords, case_fold); from_header
  reads the language under bit0 | bit1; TermIndexConfig / TermSplitBuildConfig
  gain a stem field; the stream writer sets FLAG_STEMMED from stem and errors on
  a filter with no language.
- Go: all 18 TermLanguage constants + a stopwordFile map; termStopWordList /
  isTermStopWord(t, lang); TermTokenizer.language; NewTermTokenizerFull and
  WriteTermIndexFull, with the old *With funcs kept as shims;
  TermSplitBuildConfig.Stem.
- Python: TermBuilder / TermSplitSetBuilder take stem=None (defaults to
  "a language was given"); a ValueError when a filter has no language.
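
The stem default described for the Python builders, and mirrored by the Rust new(..) shim (stem = language.is_some()), can be sketched as below. resolve_stem is a hypothetical helper, and the language strings are placeholders rather than the library's actual values.

```python
def resolve_stem(language, stem=None):
    """Resolve the effective stem setting.

    stem=None means "not specified": stemming is then on exactly
    when a language was given. An explicit True/False is kept as-is.
    """
    if stem is None:
        return language is not None
    return stem
```

Under this rule, passing only a language keeps the old behaviour (stemming on), while passing a language with stem=False leaves the language available for stop-word removal only.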

Go multilingual stemming stays out of scope: only English ste... (continued)

Coverage Stats

33 of 38 new or added lines in 2 files covered. (86.84%)

1 existing line in 1 file now uncovered.

1507 of 1788 relevant lines covered (84.28%)

33.25 hits per line
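
The percentages reported above follow directly from the line counts. A short check of the arithmetic, with pct as a hypothetical helper:

```python
def pct(covered, total, digits=2):
    """Coverage percentage rounded to the given number of digits."""
    return round(100.0 * covered / total, digits)


new_lines = pct(33, 38)            # new or added lines covered
overall = pct(1507, 1788)          # all relevant lines covered
overall_3dp = pct(1507, 1788, 3)   # precision used in the build header
delta = round(84.284 - 84.318, 2)  # change from the previous build
```

These reproduce the 86.84%, 84.28%, 84.284% and -0.03% figures shown on the page.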

Uncovered Changes

Lines	Coverage	∆	File
5	85.25	-0.4%	terms.go

Coverage Regressions

Lines	Coverage	∆	File
1	85.25	-0.4%	terms.go

Jobs

ID	Job ID	Ran	Files	Coverage
1	28541272305.1	01 Jul 2026 07:08PM UTC	16	84.28	GitHub Action Run

freeeve / roaringrange / 28541272305