
jzombie / rust-triplets / 24927961805

25 Apr 2026 09:37AM UTC coverage: 95.473% (-0.06%) from 95.528%

push

github

web-flow
Add optional denoise (#81)

* Bump rustls-webpki in the cargo group across 1 directory

Bumps the cargo group with 1 update in the / directory: [rustls-webpki](https://github.com/rustls/webpki).


Updates `rustls-webpki` from 0.103.12 to 0.103.13
- [Release notes](https://github.com/rustls/webpki/releases)
- [Commits](https://github.com/rustls/webpki/compare/v/0.103.12...v/0.103.13)

---
updated-dependencies:
- dependency-name: rustls-webpki
  dependency-version: 0.103.13
  dependency-type: indirect
  dependency-group: cargo
...

Signed-off-by: dependabot[bot] <support@github.com>

* Prototype denoiser implementation

Co-authored-by: Copilot <copilot@github.com>

* Add more tests

Co-authored-by: Copilot <copilot@github.com>

* Add support for linearized data

Co-authored-by: Copilot <copilot@github.com>

* cargo fmt --all

* Remove user-configurable line-level control

Co-authored-by: Copilot <copilot@github.com>

* Draft README update

* Use single-line assertion

* Draft tagline and description

Co-authored-by: Copilot <copilot@github.com>

* Add line break

* Make markdown stripping optional

Co-authored-by: Copilot <copilot@github.com>

* Add link to GFM

Co-authored-by: Copilot <copilot@github.com>

* Fix doc comment

Co-authored-by: Copilot <copilot@github.com>

* Debug flaky CI tests

* Prepare for 0.19.0-alpha

* Add denoiser example to README

Co-authored-by: Copilot <copilot@github.com>

* Improve InMemory source implementation

Co-authored-by: Copilot <copilot@github.com>

* Include links to individual sources

Co-authored-by: Copilot <copilot@github.com>

* Reorder sources

* Update default source verbiage

* Add ability to construct DataRecord from text

Co-authored-by: Copilot <copilot@github.com>

* Prototype iterative wave expansion

Co-authored-by: Copilot <copilot@github.com>

* cargo fmt --all

* Migrate to preprocessor architecture

Co-authored-by: Copilot <copilot@github.com>

* Add more tests

Co-authored-by: Copilot <copilot@gi... (continued)

1028 of 1083 new or added lines in 6 files covered. (94.92%)

1 existing line in 1 file now uncovered.

18708 of 19595 relevant lines covered (95.47%)

3961.16 hits per line

Source File

99.63
/src/preprocessor/backends/denoiser_preprocessor.rs
//! OCR denoising and markdown-formatting cleanup for text chunks.
//!
//! The main entry point is [`denoise_text`], which applies a configurable set of
//! line-level (or whole-block) filters to strip digit-heavy OCR noise and
//! markdown table formatting that is useless for text embeddings.
//!
//! For use in a preprocessing pipeline, wrap a [`crate::config::DenoiserConfig`]
//! in a [`DenoiserPreprocessor`] and register it with
//! [`crate::config::ChunkingStrategy::register_preprocessor`].

use line_ending::LineEnding;

use crate::config::DenoiserConfig;
use crate::preprocessor::TextPreprocessor;

// ---------------------------------------------------------------------------
// Private helpers
// ---------------------------------------------------------------------------

/// Count ASCII digit and alphabetical characters in `text`.
fn count_digit_alpha(text: &str) -> (usize, usize) {
    let mut digits = 0usize;
    let mut alpha = 0usize;
    for ch in text.chars() {
        if ch.is_ascii_digit() {
            digits += 1;
        } else if ch.is_alphabetic() {
            alpha += 1;
        }
    }
    (digits, alpha)
}

/// Compute the ratio `digits / (digits + alpha)` for `text`.
///
/// Returns `0.0` when there are no alphanumeric characters.
fn digit_ratio(text: &str) -> f32 {
    let (digits, alpha) = count_digit_alpha(text);
    let total = digits + alpha;
    if total == 0 {
        0.0 // uncovered in this build
    } else {
        digits as f32 / total as f32
    }
}
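
The ratio is easy to check by hand on a short line. The sketch below is a standalone restatement of the two helpers above; the name `digit_ratio_sketch` is illustrative and not part of the crate. For `42 524 NOVEX` it counts 5 digits and 5 letters, so the ratio is 5 / (5 + 5) = 0.5, which is above the 0.35 threshold used in the tests.

```rust
/// Illustrative re-statement of `digit_ratio`; not the crate's own function.
/// Digits are ASCII-only, letters use `char::is_alphabetic`, everything else
/// (spaces, punctuation, symbols) is ignored.
fn digit_ratio_sketch(text: &str) -> f32 {
    let digits = text.chars().filter(|c| c.is_ascii_digit()).count();
    let alpha = text.chars().filter(|c| c.is_alphabetic()).count();
    let total = digits + alpha;
    if total == 0 {
        // No digits and no letters: treat the line as having ratio zero.
        0.0
    } else {
        digits as f32 / total as f32
    }
}
```

Because punctuation is ignored, a token such as `$4.2B` contributes two digits and one letter, not five characters.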

/// Keep tokens from a digit-heavy line while preserving numeric context.
///
/// ## Algorithm: iterative wave expansion from alpha-token seeds
///
/// Complexity: $O(N^2)$ worst case (N waves × O(N) scan per wave), but in
/// practice $O(N)$ with a small constant — lines are short (10–40 tokens)
/// and the ratio threshold terminates expansion after 2–5 waves.
///
/// 1. **Seed** the keep-set with all alpha-bearing tokens.
/// 2. **Each wave** rescues the immediate neighbors (±1 position) of every
///    currently-kept token that are not yet kept.  Before committing the
///    wave, the combined digit-ratio of the *new candidate set*
///    (current keep ∪ wave) is checked.  If the ratio stays ≤
///    `max_digit_ratio` the wave is accepted and expansion continues;
///    otherwise the wave is rejected and expansion stops.
/// 3. Repeat until no new neighbors exist or a wave is rejected.
///
/// Any token type (bare numbers, `—`, `$12.5M`, `+3%`, …) is eligible for
/// rescue — the ratio check is the sole gate.  Pure-symbol tokens adjacent
/// to alpha tokens can therefore be rescued when they carry contextual
/// meaning (e.g. `—` used as a minus sign, `&` in a company name).
///
/// Returns the space-joined result; may be empty.
fn strip_digit_tokens(line: &str, max_digit_ratio: f32) -> String {
    let tokens: Vec<&str> = line.split_whitespace().collect();
    if tokens.is_empty() {
        return String::new(); // uncovered in this build
    }
    let n = tokens.len();

    let has_alpha: Vec<bool> = tokens
        .iter()
        .map(|t| t.chars().any(|c| c.is_alphabetic()))
        .collect();

    // Seed: all alpha-bearing tokens.
    let mut keep: Vec<bool> = has_alpha.clone();

    // Pre-compute (digits, alpha) per token for ratio checks.
    let char_counts: Vec<(usize, usize)> = tokens.iter().map(|t| count_digit_alpha(t)).collect();

    // Running totals for the current keep-set.
    let (mut d, mut a) = (0usize, 0usize);
    for (i, &k) in keep.iter().enumerate() {
        if k {
            d += char_counts[i].0;
            a += char_counts[i].1;
        }
    }

    // Wave expansion: each iteration rescues ±1 neighbors of kept tokens.
    loop {
        // Collect the indices of tokens that would be added in this wave.
        let wave: Vec<usize> = (0..n)
            .filter(|&i| !keep[i] && ((i > 0 && keep[i - 1]) || (i + 1 < n && keep[i + 1])))
            .collect();

        if wave.is_empty() {
            break;
        }

        // Compute the ratio if we were to accept this wave.
        let (wd, wa): (usize, usize) = wave.iter().fold((0, 0), |(ad, aa), &i| {
            (ad + char_counts[i].0, aa + char_counts[i].1)
        });
        let new_d = d + wd;
        let new_a = a + wa;
        let new_total = new_d + new_a;
        let new_ratio = if new_total == 0 {
            0.0 // uncovered in this build
        } else {
            new_d as f32 / new_total as f32
        };

        if new_ratio > max_digit_ratio {
            break; // This wave would push ratio over threshold — stop.
        }

        // Accept the wave.
        for &i in &wave {
            keep[i] = true;
        }
        d = new_d;
        a = new_a;
    }

    tokens
        .iter()
        .enumerate()
        .filter(|&(i, _)| keep[i])
        .map(|(_, t)| *t)
        .collect::<Vec<_>>()
        .join(" ")
}
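
To make the wave behaviour easier to follow, here is a condensed, standalone sketch of the same seed-and-expand loop that also reports how many waves were accepted. The names `counts` and `wave_expand_trace` are hypothetical and exist only for this illustration. It is meant to mirror the logic of `strip_digit_tokens` above, not replace it.

```rust
/// Count (ASCII digits, alphabetic characters) in a single token.
fn counts(token: &str) -> (usize, usize) {
    let digits = token.chars().filter(|c| c.is_ascii_digit()).count();
    let alpha = token.chars().filter(|c| c.is_alphabetic()).count();
    (digits, alpha)
}

/// Hypothetical tracing variant of the wave expansion: returns the kept
/// text together with the number of waves that were accepted.
fn wave_expand_trace(line: &str, max_digit_ratio: f32) -> (String, usize) {
    let tokens: Vec<&str> = line.split_whitespace().collect();
    let n = tokens.len();
    let per_token: Vec<(usize, usize)> = tokens.iter().map(|t| counts(t)).collect();

    // Seed: every token that contains at least one letter.
    let mut keep: Vec<bool> = per_token.iter().map(|&(_, alpha)| alpha > 0).collect();
    let (mut d, mut a) = (0usize, 0usize);
    for i in 0..n {
        if keep[i] {
            d += per_token[i].0;
            a += per_token[i].1;
        }
    }

    let mut accepted = 0usize;
    loop {
        // Candidates: not yet kept, but directly next to a kept token.
        let wave: Vec<usize> = (0..n)
            .filter(|&i| !keep[i] && ((i > 0 && keep[i - 1]) || (i + 1 < n && keep[i + 1])))
            .collect();
        if wave.is_empty() {
            break;
        }
        let new_d = d + wave.iter().map(|&i| per_token[i].0).sum::<usize>();
        let new_a = a + wave.iter().map(|&i| per_token[i].1).sum::<usize>();
        let total = new_d + new_a;
        let ratio = if total == 0 { 0.0 } else { new_d as f32 / total as f32 };
        if ratio > max_digit_ratio {
            break; // Rejected: the whole wave is discarded, not just part of it.
        }
        for &i in &wave {
            keep[i] = true;
        }
        d = new_d;
        a = new_a;
        accepted += 1;
    }

    let kept = tokens
        .iter()
        .enumerate()
        .filter(|&(i, _)| keep[i])
        .map(|(_, t)| *t)
        .collect::<Vec<_>>()
        .join(" ")
        ;
    (kept, accepted)
}
```

For `345 397 DELTA CORP Detroit, Mich. 10689 (0.8) 1069` with a threshold of 0.35, the seeds hold 20 letters. The first wave adds `397` and `10689` (8 digits, ratio 8/28 ≈ 0.29) and is accepted. The second wave would add `345` and `(0.8)` (13 digits, ratio 13/33 ≈ 0.39) and is rejected as a whole, so the result is `397 DELTA CORP Detroit, Mich. 10689`.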

/// Returns `true` when `line` is a GFM pipe-table row.
///
/// A line qualifies when its trimmed form starts with `'|'` and contains at
/// least one additional `'|'` (i.e. a two-column table or a single column with
/// a closing delimiter).  This covers:
///
/// - Header rows: `| Name | Age |`
/// - Separator rows: `|------|-----|`, `|:----:|:---:|`
/// - Data rows: `| Alice | 30 |`
fn is_markdown_table_line(line: &str) -> bool {
    let trimmed = line.trim();
    trimmed.starts_with('|') && trimmed.matches('|').count() >= 2
}

/// Returns `true` when `line` is a GFM pipe-table separator row.
///
/// A separator row contains only `|`, `-`, `:`, and whitespace (e.g.
/// `|------|-----|` or `|:----:|:---:|`).  These rows carry no textual
/// content and should be dropped entirely.
fn is_markdown_table_separator(line: &str) -> bool {
    let trimmed = line.trim();
    trimmed.starts_with('|')
        && trimmed.matches('|').count() >= 2
        && trimmed
            .chars()
            .all(|c| c == '|' || c == '-' || c == ':' || c == ' ' || c == '\t')
}

/// Strip GFM pipe-table delimiters from a header or data row and return the
/// concatenated cell text.
///
/// Each `|`-delimited cell is trimmed; empty cells (from leading/trailing
/// pipes) are discarded.  The surviving cells are joined with a single space.
///
/// Example: `"| Name | Age |"` → `"Name Age"`
fn strip_table_pipes(line: &str) -> String {
    line.split('|')
        .map(|cell| cell.trim())
        .filter(|cell| !cell.is_empty())
        .collect::<Vec<_>>()
        .join(" ")
}
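
The three table helpers above combine into a simple three-way decision per line. The following sketch folds them into one function for illustration; the `TableLine` enum and `classify_table_line` are hypothetical names, not part of the crate.

```rust
/// Hypothetical result type for the illustration below.
#[derive(Debug, PartialEq)]
enum TableLine {
    /// Only `|`, `-`, `:` and whitespace: carries no text.
    Separator,
    /// Header or data row, with the cell text joined by single spaces.
    Row(String),
    /// Not a pipe-table row (no leading `|`, or fewer than two pipes).
    NotTable,
}

/// Classify a line using the same rules as the helpers above.
fn classify_table_line(line: &str) -> TableLine {
    let trimmed = line.trim();
    let is_row = trimmed.starts_with('|') && trimmed.matches('|').count() >= 2;
    if !is_row {
        return TableLine::NotTable;
    }
    if trimmed.chars().all(|c| matches!(c, '|' | '-' | ':' | ' ' | '\t')) {
        return TableLine::Separator;
    }
    let cells: Vec<&str> = trimmed
        .split('|')
        .map(str::trim)
        .filter(|cell| !cell.is_empty())
        .collect();
    TableLine::Row(cells.join(" "))
}
```

Note that a row written without a leading pipe, such as `Name | Age | City`, is classified as `NotTable` under these rules and would pass through with its pipes intact.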

// ---------------------------------------------------------------------------
// Public API
// ---------------------------------------------------------------------------

/// Apply OCR denoising and markdown-table stripping to a block of text.
///
/// Returns `Some(cleaned)` with the (possibly stripped) text, or `None` when
/// the entire block should be dropped and no chunks should be produced.
///
/// Line endings are first normalized with [`LineEnding::normalize`].  Each
/// line is then evaluated through three gates in order:
///
/// 1. **Markdown table formatting** — GFM pipe-table rows (trimmed form
///    starts with `'|'` and contains at least one more `'|'`) are handled in
///    two ways.  *Separator rows* (containing only `|`, `-`, `:`, and
///    whitespace) carry no textual content and are dropped.  *Header and data
///    rows* have their pipe delimiters stripped and the extracted cell text is
///    evaluated by gates 2 and 3 like any other line, preserving useful text
///    from inside table cells.
///
/// 2. **No alphabetical characters** — lines that contain zero alphabetical
///    characters (all-numeric rows, symbol/dash-only rows, OCR separator
///    artifacts) are dropped.
///
/// 3. **High digit ratio** — lines whose `digit / (digit + alpha)` ratio
///    exceeds `config.max_digit_ratio` are *stripped* using iterative wave
///    expansion from alpha-token seeds.  Neighboring tokens are rescued
///    progressively outward as long as the cumulative ratio stays ≤
///    `max_digit_ratio`.  Any token type — numbers, `—`, `+3%`, `$12B` —
///    is eligible; the ratio check is the sole gate.  If no tokens survive,
///    the line is dropped.
///
/// `None` is returned only when every line is removed.
///
/// When `config.enabled` is `false` the function returns `Some(text.to_string())`
/// unconditionally.
pub fn denoise_text(text: &str, config: &DenoiserConfig) -> Option<String> {
    if !config.enabled {
        return Some(text.to_string());
    }

    let normalized = LineEnding::normalize(text);
    let mut cleaned_lines: Vec<String> = Vec::new();
    for line in normalized.lines() {
        // Gate 1: markdown formatting.
        // If `strip_markdown` is active, separator rows (containing only |, -, :,
        // and whitespace) carry no textual content and are dropped. Header and data
        // rows have their pipe delimiters stripped; the extracted cell text then
        // passes through gates 2 and 3 like any other line.
        let table_stripped = if config.strip_markdown && is_markdown_table_line(line) {
            if is_markdown_table_separator(line) {
                continue;
            }
            Some(strip_table_pipes(line))
        } else {
            None
        };
        let effective = table_stripped.as_deref().unwrap_or(line);

        // Gate 2: no alphabetical characters → drop (all-numeric lines,
        //         symbol-only rows, OCR column-separator artifacts, etc.).
        let (_, alpha) = count_digit_alpha(effective);
        if alpha == 0 {
            continue;
        }

        // Gate 3: digit-heavy line → iterative wave expansion to rescue
        //         adjacent tokens within the ratio budget.
        if digit_ratio(effective) > config.max_digit_ratio {
            let retained = strip_digit_tokens(effective, config.max_digit_ratio);
            if !retained.is_empty() {
                cleaned_lines.push(retained);
            }
            // else: drop the line entirely
        } else {
            cleaned_lines.push(effective.to_string());
        }
    }
    if cleaned_lines.is_empty() {
        None
    } else {
        Some(cleaned_lines.join("\n"))
    }
}
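
The gate ordering and the `None`-when-empty contract can be seen in a much smaller form. The sketch below is a simplified stand-in that implements only gates 1 and 2; gate 3 is deliberately left out, and `str::lines` is assumed as a substitute for the line-ending normalization. The function name `drop_noise_lines` is hypothetical.

```rust
/// Hypothetical, simplified stand-in for `denoise_text`: gates 1 and 2 only.
/// Digit-heavy lines that still contain letters pass through untouched here.
fn drop_noise_lines(text: &str) -> Option<String> {
    let mut kept: Vec<String> = Vec::new();
    // Assumption: `str::lines` stands in for the crate's normalization step.
    for line in text.lines() {
        let trimmed = line.trim();
        let is_table_row = trimmed.starts_with('|') && trimmed.matches('|').count() >= 2;

        // Gate 1: drop separator rows, strip pipes from header and data rows.
        let effective = if is_table_row {
            if trimmed.chars().all(|c| matches!(c, '|' | '-' | ':' | ' ' | '\t')) {
                continue;
            }
            trimmed
                .split('|')
                .map(str::trim)
                .filter(|cell| !cell.is_empty())
                .collect::<Vec<_>>()
                .join(" ")
        } else {
            line.to_string()
        };

        // Gate 2: drop lines without any alphabetic character
        // (this also removes blank lines).
        if !effective.chars().any(char::is_alphabetic) {
            continue;
        }
        kept.push(effective);
    }

    // Mirror the contract above: `None` only when every line was removed.
    if kept.is_empty() {
        None
    } else {
        Some(kept.join("\n"))
    }
}
```

Running gate 1 before gate 2 matters: a data row such as `| Alice | 30 |` is first reduced to `Alice 30`, and only then checked for letters.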

// ---------------------------------------------------------------------------
// DenoiserPreprocessor
// ---------------------------------------------------------------------------

/// A [`TextPreprocessor`] that applies OCR denoising and markdown-table
/// cleanup to section text before chunking.
///
/// Wraps a [`DenoiserConfig`] and delegates to [`denoise_text`].
///
/// # Example
///
/// ```rust
/// use triplets::{ChunkingStrategy, DenoiserConfig, DenoiserPreprocessor};
///
/// let mut strategy = ChunkingStrategy::default();
/// strategy.register_preprocessor(DenoiserPreprocessor::new(DenoiserConfig {
///     enabled: true,
///     max_digit_ratio: 0.35,
///     strip_markdown: true,
/// }));
/// ```
pub struct DenoiserPreprocessor {
    /// Configuration controlling the denoising behaviour.
    pub config: DenoiserConfig,
}

impl DenoiserPreprocessor {
    /// Create a new `DenoiserPreprocessor` with the given configuration.
    pub fn new(config: DenoiserConfig) -> Self {
        Self { config }
    }
}

impl TextPreprocessor for DenoiserPreprocessor {
    fn process(&self, text: &str) -> Option<String> {
        denoise_text(text, &self.config)
    }
}
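
Since `process` returns `Option<String>`, a natural way to chain several preprocessors is to stop at the first `None`. How `ChunkingStrategy` actually sequences registered preprocessors is not shown in this file, so the sketch below is an assumption: the `Preprocess` trait is a local stand-in with the same method signature, and `TrimStage`, `RequireAlphaStage`, `example_chain`, and `run_chain` are hypothetical names.

```rust
/// Local stand-in with the same method signature as `process` above.
trait Preprocess {
    fn process(&self, text: &str) -> Option<String>;
}

/// Hypothetical stage: trims surrounding whitespace, never drops text.
struct TrimStage;

impl Preprocess for TrimStage {
    fn process(&self, text: &str) -> Option<String> {
        Some(text.trim().to_string())
    }
}

/// Hypothetical stage: drops text that contains no alphabetic character.
struct RequireAlphaStage;

impl Preprocess for RequireAlphaStage {
    fn process(&self, text: &str) -> Option<String> {
        if text.chars().any(char::is_alphabetic) {
            Some(text.to_string())
        } else {
            None
        }
    }
}

/// Build a small two-stage chain for the example.
fn example_chain() -> Vec<Box<dyn Preprocess>> {
    vec![Box::new(TrimStage), Box::new(RequireAlphaStage)]
}

/// Run stages in order; the first `None` drops the text and skips the rest.
fn run_chain(stages: &[Box<dyn Preprocess>], text: &str) -> Option<String> {
    let mut current = text.to_string();
    for stage in stages {
        current = stage.process(&current)?;
    }
    Some(current)
}
```

With this shape, a stage that returns `None` acts as a filter, while a stage that returns `Some` acts as a transform, which matches the two roles `denoise_text` plays.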

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------

#[cfg(test)]
mod tests {
    use super::*;
    use indoc::indoc;

    fn denoiser_enabled() -> DenoiserConfig {
        DenoiserConfig {
            enabled: true,
            max_digit_ratio: 0.35,
            strip_markdown: true,
        }
    }

    // -----------------------------------------------------------------------
    // DenoiserPreprocessor trait impl
    // -----------------------------------------------------------------------

    #[test]
    fn denoiser_preprocessor_process_delegates_to_denoise_text() {
        let p = DenoiserPreprocessor::new(DenoiserConfig {
            enabled: true,
            max_digit_ratio: 0.35,
            strip_markdown: true,
        });
        let noisy = "42 524 10788 143 1995 190 394";
        assert_eq!(
            p.process(noisy),
            denoise_text(noisy, &p.config),
            "process() must delegate to denoise_text"
        );
    }

    #[test]
    fn denoiser_preprocessor_disabled_returns_text_unchanged() {
        let p = DenoiserPreprocessor::new(DenoiserConfig::default()); // enabled = false
        let input = "42 524 NOVEX INDUSTRIES 10,788.0 14.3";
        assert_eq!(p.process(input), Some(input.to_string()));
    }

    #[test]
    fn denoiser_preprocessor_drops_digit_heavy_block() {
        let p = DenoiserPreprocessor::new(DenoiserConfig {
            enabled: true,
            max_digit_ratio: 0.35,
            strip_markdown: true,
        });
        assert_eq!(
            p.process("42 524 10788 143 1995 190 394 13611 358 6444 266"),
            None
        );
    }

    // -----------------------------------------------------------------------
    // is_markdown_table_line helper
    // -----------------------------------------------------------------------

    #[test]
    fn markdown_table_line_detects_separator_row() {
        assert!(is_markdown_table_line("|------|-----|"));
        assert!(is_markdown_table_line("|:----:|:---:|"));
        assert!(is_markdown_table_line(
            "|----------------|-----------|----------|"
        ));
        assert!(
            is_markdown_table_line("  |---|---|  "),
            "leading/trailing spaces ok"
        );
    }

    #[test]
    fn markdown_table_line_detects_header_and_data_rows() {
        assert!(is_markdown_table_line("| Name | Age |"));
        assert!(is_markdown_table_line("| Annual revenue | $4.2B | +12% |"));
        assert!(is_markdown_table_line("| Alice | 30 | NYC |"));
    }

    #[test]
    fn markdown_table_line_detects_single_column_with_closing_pipe() {
        assert!(is_markdown_table_line("| Value |"));
        assert!(is_markdown_table_line("|---|"));
    }

    #[test]
    fn markdown_table_line_rejects_prose_with_single_pipe() {
        assert!(!is_markdown_table_line("Choose option A | option B"));
        assert!(!is_markdown_table_line("See foo | bar for details."));
    }

    #[test]
    fn markdown_table_line_rejects_line_without_leading_pipe() {
        assert!(!is_markdown_table_line("Name | Age | City"));
        assert!(!is_markdown_table_line("--- | --- | ---"));
    }

    #[test]
    fn markdown_table_line_rejects_plain_text() {
        assert!(!is_markdown_table_line("The quick brown fox."));
        assert!(!is_markdown_table_line("42 524 NOVEX INDUSTRIES"));
        assert!(!is_markdown_table_line(""));
    }

    // -----------------------------------------------------------------------
    // is_markdown_table_separator helper
    // -----------------------------------------------------------------------

    #[test]
    fn markdown_table_separator_detects_dash_and_colon_rows() {
        assert!(is_markdown_table_separator("|------|-----|"));
        assert!(is_markdown_table_separator("|:----:|:---:|"));
        assert!(is_markdown_table_separator("|:-----|-----:|"));
        assert!(is_markdown_table_separator(
            "|----------------|-----------|----------|"
        ));
        assert!(
            is_markdown_table_separator("| ---- | ---- |"),
            "dashes with spaces ok"
        );
        assert!(
            is_markdown_table_separator("  |---|---|  "),
            "leading/trailing spaces ok"
        );
    }

    #[test]
    fn markdown_table_separator_rejects_rows_with_text_or_digits() {
        assert!(!is_markdown_table_separator("| Name | Age |"), "has alpha");
        assert!(
            !is_markdown_table_separator("| 2023 | $3.8B | +10% |"),
            "has digit"
        );
        assert!(
            !is_markdown_table_separator("| Annual revenue | $4.2B |"),
            "has alpha"
        );
    }

    // -----------------------------------------------------------------------
    // strip_table_pipes helper
    // -----------------------------------------------------------------------

    #[test]
    fn strip_table_pipes_extracts_cell_text() {
        assert_eq!(strip_table_pipes("| Name | Age |"), "Name Age");
        assert_eq!(
            strip_table_pipes("| Annual revenue | $4.2B | +12% |"),
            "Annual revenue $4.2B +12%"
        );
        assert_eq!(strip_table_pipes("| Widget A |"), "Widget A");
        assert_eq!(
            strip_table_pipes("| Metric         | Value     | Change  |"),
            "Metric Value Change"
        );
    }

    // -----------------------------------------------------------------------
    // Disabled / no-op
    // -----------------------------------------------------------------------

    #[test]
    fn denoise_disabled_returns_text_unchanged() {
        let cfg = DenoiserConfig::default();
        let input = "42 524 NOVEX INDUSTRIES 10,788.0 14.3";
        assert_eq!(denoise_text(input, &cfg), Some(input.to_string()));
    }

    #[test]
    fn denoise_disabled_leaves_markdown_table_unchanged() {
        let cfg = DenoiserConfig::default();
        let input = "| Name | Age |\n|------|-----|\n| Alice | 30 |";
        assert_eq!(denoise_text(input, &cfg), Some(input.to_string()));
    }

    #[test]
    fn denoise_enabled_but_strip_markdown_false_leaves_tables_intact() {
        let mut cfg = denoiser_enabled();
        cfg.strip_markdown = false;

        let input = indoc! {"
            | Name | Age |
            |------|-----|
            | Alice | 30 |
        "};
        let expected = "| Name | Age |\n| Alice | 30 |";
        assert_eq!(denoise_text(input.trim(), &cfg), Some(expected.to_string()));
    }

    #[test]
    fn denoise_enabled_with_strip_markdown_strips_tables_and_preserves_headings() {
        let cfg = denoiser_enabled();
        let input = indoc! {"
            ### User Demographics

            | Name | Age |
            |------|-----|
            | Alice | 30 |

            Some bold **text** and `code` here.
        "};
        let expected = indoc! {"
            ### User Demographics
            Name Age
            Alice 30
            Some bold **text** and `code` here.
        "}
        .trim();
        assert_eq!(denoise_text(input.trim(), &cfg), Some(expected.to_string()));
    }

    // -----------------------------------------------------------------------
    // Single digit-heavy line / below-threshold pass-through
    // -----------------------------------------------------------------------

    #[test]
    fn denoise_drops_digit_heavy_single_line() {
        let cfg = denoiser_enabled();
        assert_eq!(
            denoise_text("42 524 10788 143 1995 190 394 13611 358 6444 266", &cfg),
            None
        );
    }

    #[test]
    fn denoise_below_threshold_preserves_numbers_and_symbols() {
        let cfg = denoiser_enabled();
        let input = "Q3 revenue grew 12% to $4.2B, up from $3.8B in Q2 (a 10.5% increase).";
        assert_eq!(denoise_text(input, &cfg), Some(input.to_string()));
    }

    #[test]
    fn denoise_pure_text_passes_through() {
        let cfg = denoiser_enabled();
        let input = "The quick brown fox jumps over the lazy dog";
        assert_eq!(denoise_text(input, &cfg), Some(input.to_string()));
    }

    // -----------------------------------------------------------------------
    // Line-level: basic drop / pass-through
    // -----------------------------------------------------------------------

    #[test]
    fn denoise_empty_input_returns_none_when_enabled() {
        let cfg = denoiser_enabled();
        assert_eq!(denoise_text("", &cfg), None);
    }

    #[test]
    fn denoise_line_level_returns_none_when_all_lines_dropped() {
        let cfg = denoiser_enabled();
        let input = indoc! {"
            42 524 10788
            143 1995 190
            394 13611 358
        "};
        assert_eq!(denoise_text(input.trim(), &cfg), None);
    }

    #[test]
    fn denoise_line_level_preserves_clean_lines_unchanged() {
        let cfg = denoiser_enabled();
        let clean = "Climate change drives ocean temperatures higher each decade.";
        assert_eq!(
            denoise_text(clean, &cfg).expect("clean text should be kept"),
            clean
        );
    }

    #[test]
    fn denoise_line_level_below_threshold_line_preserves_numbers_and_symbols() {
        let cfg = denoiser_enabled();
        let input = "See section 3.1 (page 42) for details on the Q2 results.";
        assert_eq!(
            denoise_text(input, &cfg).expect("below-threshold line must be kept"),
            input
        );
    }

    #[test]
    fn denoise_line_level_clean_lines_with_numbers_preserved_junk_stripped() {
        let cfg = denoiser_enabled();
        let input = indoc! {"
            Revenue grew 8% to $2.1B in FY2025 (vs $1.9B prior year).
            42 9871 3302 19283 4710 22913 5518 30021 6627 41132 7736 52243
            Net income rose 15% YoY, reaching $310M by Q4-2025.
        "};
        let result = denoise_text(input.trim(), &cfg).expect("should not be None");
        assert!(result.contains("Revenue grew 8% to $2.1B in FY2025 (vs $1.9B prior year)."));
        assert!(result.contains("Net income rose 15% YoY, reaching $310M by Q4-2025."));
        assert!(!result.contains("9871"));
        assert_eq!(result.lines().count(), 2);
    }

    // -----------------------------------------------------------------------
    // Line-level: mixed content
    // -----------------------------------------------------------------------

    #[test]
    fn denoise_line_level_mixed_content_same_line() {
        let cfg = denoiser_enabled();
        let input = indoc! {"
            42 524 NOVEX INDUSTRIES Springfield 10788 143 1995 190 394 13611 358
            343 294 ZETA POWER Riverside 10758 31 1283 267 189 45432 175
        "};
        let result = denoise_text(input.trim(), &cfg).expect("should not be None");
        for word in &[
            "NOVEX",
            "INDUSTRIES",
            "Springfield",
            "ZETA",
            "POWER",
            "Riverside",
        ] {
            assert!(result.contains(word), "'{word}' must survive");
        }
        assert!(result.contains("524"));
        assert!(result.contains("10788"));
        assert!(result.contains("294"));
        assert!(result.contains("10758"));
        assert!(!result.contains("45432"));
        assert!(!result.contains("13611"));
        assert_eq!(result.lines().count(), 2);
    }

    #[test]
    fn denoise_line_level_drops_lines_with_no_alpha_tokens() {
        let cfg = denoiser_enabled();
        let input = indoc! {"
            42 524 10788 143 1995
            — — (0.8) (203.5) 473
            NOVEX INDUSTRIES Springfield
        "};
        let result =
            denoise_text(input.trim(), &cfg).expect("should not be None — text line survives");
        assert!(result.contains("NOVEX"));
        assert!(!result.contains("10788"));
        assert!(!result.contains("(0.8)"));
        assert_eq!(result.lines().count(), 1);
    }

    #[test]
    fn denoise_line_level_retains_text_from_mixed_line() {
        let cfg = denoiser_enabled();
        let input = "42 524 NOVEX INDUSTRIES Springfield 10788 143 1995 190 394 13611 358 6444 266";
        let result = denoise_text(input, &cfg).expect("should not be None");
        for word in &["NOVEX", "INDUSTRIES", "Springfield"] {
            assert!(result.contains(word), "'{word}' must survive");
        }
        assert!(result.contains("10788"));
        assert!(!result.contains("13611"));
    }

    // -----------------------------------------------------------------------
    // Interleaved sequences
    // -----------------------------------------------------------------------

    #[test]
    fn denoise_line_level_text_sandwiched_between_junk_tokens() {
        let cfg = denoiser_enabled();
        let input = "42 NOVEX 524 INDUSTRIES 10788 143 1995 190";
        let result = denoise_text(input, &cfg).expect("should not be None");
        assert!(result.contains("NOVEX"));
        assert!(result.contains("INDUSTRIES"));
        assert!(!result.contains("10788"));
        assert!(!result.contains("524"));
    }

    #[test]
    fn denoise_line_level_repeated_junk_text_interleaving() {
        let cfg = denoiser_enabled();
        let input = "42 ZETA 524 POWER 10758 Riverside 31 GRID 1283 GROUP 267 Holdings 45432 Corp";
        let result = denoise_text(input, &cfg).expect("should not be None");
        let text_tokens = [
            "ZETA",
            "POWER",
            "Riverside",
            "GRID",
            "GROUP",
            "Holdings",
            "Corp",
        ];
        let num_tokens = ["42", "524", "10758", "31", "1283", "267", "45432"];
        for word in &text_tokens {
            assert!(result.contains(word));
        }
        for num in &num_tokens {
            assert!(!result.contains(num));
        }
        let mut last_pos = 0usize;
        for word in &text_tokens {
            let pos = result.find(word).unwrap();
            assert!(pos >= last_pos);
            last_pos = pos;
        }
    }

    // -----------------------------------------------------------------------
    // Symbol-heavy edge cases
    // -----------------------------------------------------------------------

    #[test]
    fn denoise_line_level_parenthesized_negatives_and_dashes_stripped() {
        let cfg = denoiser_enabled();
        let input = "345 397 DELTA CORP Detroit, Mich. 10689 (0.8) 1069 302 — 18214 336 17590 182";
        let result = denoise_text(input, &cfg).expect("should not be None");
        assert!(result.contains("DELTA"));
        assert!(result.contains("CORP"));
        assert!(result.contains("Detroit"));
        assert!(result.contains("Mich."));
        assert!(result.contains("397"));
        assert!(result.contains("10689"));
        assert!(!result.contains("(0.8)"));
        assert!(!result.contains("18214"));
        assert_eq!(result, "397 DELTA CORP Detroit, Mich. 10689");
    }

    #[test]
    fn denoise_line_level_comma_formatted_numbers_stripped() {
        let cfg = denoiser_enabled();
        let input =
            "42 524 NOVEX INDUSTRIES Springfield 10,788.0 14.3 1,995.0 190 39.4 13,611.0 358";
        let result = denoise_text(input, &cfg).expect("should not be None");
        for word in &["NOVEX", "INDUSTRIES", "Springfield"] {
            assert!(result.contains(word));
        }
        assert!(result.contains("10,788.0"));
        for num in &["1,995.0", "13,611.0"] {
            assert!(!result.contains(num));
        }
    }

    #[test]
    fn denoise_neighbor_rescue_falls_back_when_ratio_still_exceeds_threshold() {
        let cfg = denoiser_enabled();
        let input = "1234 word 5678";
        let result = denoise_text(input, &cfg).expect("should not be None");
        assert_eq!(result, "word");
        assert!(!result.contains("1234"));
        assert!(!result.contains("5678"));
1✔
750
    }
1✔
751

752
    #[test]
753
    fn denoise_line_level_symbol_only_line_is_dropped() {
1✔
754
        let cfg = denoiser_enabled();
1✔
755
        assert_eq!(denoise_text("— — — (0.8) (203.5) 473 42 524", &cfg), None);
1✔
756
    }
1✔
757

758
    #[test]
759
    fn denoise_line_level_ordinal_tokens_are_kept() {
1✔
760
        let cfg = denoiser_enabled();
1✔
761
        let input = "3rd Quarter performance review 2nd half summary";
1✔
762
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
763
        assert!(result.contains("3rd"));
1✔
764
        assert!(result.contains("2nd"));
1✔
765
        assert!(result.contains("Quarter"));
1✔
766
    }
1✔
767

768
    #[test]
769
    fn denoise_line_level_dense_interleave_with_symbols() {
1✔
770
        let cfg = denoiser_enabled();
1✔
771
        let input = "42 (524) ZETA 10,758.0 — POWER 31.5 Riverside, 1283 Corp.";
1✔
772
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
773
        for word in &["ZETA", "POWER", "Riverside,", "Corp."] {
4✔
774
            assert!(result.contains(word));
4✔
775
        }
776
        for junk in &["42", "(524)", "10,758.0", "1283"] {
4✔
777
            assert!(!result.contains(junk));
4✔
778
        }
779
        assert_eq!(result, "ZETA POWER Riverside, Corp.");
1✔
780
    }
1✔
781

782
    #[test]
783
    fn denoise_line_level_multiple_em_dashes_all_stripped() {
1✔
784
        let cfg = denoiser_enabled();
1✔
785
        let input = "— 42 NOVEX — 524 INDUSTRIES — 10789 —";
1✔
786
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
787
        assert!(!result.contains("10789"));
1✔
788
        assert!(result.contains("42"));
1✔
789
        assert!(result.contains("524"));
1✔
790
        assert_eq!(result, "42 NOVEX — 524 INDUSTRIES —");
1✔
791
    }
1✔
792

793
    #[test]
794
    fn denoise_line_level_multiple_parenthesized_values_rescued() {
1✔
795
        let cfg = denoiser_enabled();
1✔
796
        let input = "(0.8) NOVEX (1.2) INDUSTRIES (3.4) 10789";
1✔
797
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
798
        assert!(result.contains("(0.8)"));
1✔
799
        assert!(result.contains("(1.2)"));
1✔
800
        assert!(result.contains("(3.4)"));
1✔
801
        assert!(!result.contains("10789"));
1✔
802
        assert_eq!(result, "(0.8) NOVEX (1.2) INDUSTRIES (3.4)");
1✔
803
    }
1✔
804

805
    #[test]
806
    fn denoise_line_level_mixed_symbol_trash_repeated() {
1✔
807
        let cfg = denoiser_enabled();
1✔
808
        let input = "— (0.8) 42 ZETA — (1.5) 524 POWER — (2.3) 10758 Corp —";
1✔
809
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
810
        assert!(!result.contains('—'));
1✔
811
        assert!(!result.contains("(0.8)"));
1✔
812
        assert!(!result.contains("(1.5)"));
1✔
813
        assert!(!result.contains("(2.3)"));
1✔
814
        assert!(!result.contains("10758"));
1✔
815
        assert_eq!(result, "ZETA POWER Corp");
1✔
816
    }
1✔
817

818
    #[test]
819
    fn denoise_line_level_multiple_symbol_trash_multiline_exact_output() {
1✔
820
        let cfg = denoiser_enabled();
1✔
821
        let input = indoc! {"
1✔
822
            — 42 NOVEX — 524 INDUSTRIES — 10789 —
1✔
823
            (0.8) ZETA (1.2) POWER (3.4) 10758
1✔
824
        "};
1✔
825
        let result = denoise_text(input.trim(), &cfg).expect("should not be None");
1✔
826
        assert_eq!(result, "42 NOVEX — 524 INDUSTRIES —\nZETA POWER");
1✔
827
    }
1✔
828

829
    // -----------------------------------------------------------------------
830
    // Markdown table handling
831
    // -----------------------------------------------------------------------
832

833
    #[test]
834
    fn denoise_line_level_pure_markdown_table_separator_dropped_text_extracted() {
1✔
835
        let cfg = denoiser_enabled();
1✔
836
        let input = indoc! {"
1✔
837
            | Metric         | Value     | Change  |
1✔
838
            |----------------|-----------|----------|
1✔
839
            | Annual revenue | $4.2B     | +12%     |
1✔
840
            | Operating cost | $2.1B     | +8%      |
1✔
841
            | Net income     | $310M     | +15%     |
1✔
842
        "};
1✔
843
        let result = denoise_text(input.trim(), &cfg).expect("cell text should survive");
1✔
844
        assert_eq!(
1✔
845
            result,
846
            "Metric Value Change\nAnnual revenue $4.2B +12%\nOperating cost $2.1B +8%\nNet income $310M +15%"
847
        );
848
    }
1✔
849

850
    #[test]
851
    fn denoise_line_level_single_markdown_table_row_pipes_stripped() {
1✔
852
        let cfg = denoiser_enabled();
1✔
853
        assert_eq!(
1✔
854
            denoise_text("|----------------|-----------|----------|", &cfg),
1✔
855
            None
856
        );
857
        assert_eq!(
1✔
858
            denoise_text("| Metric | Value | Change |", &cfg),
1✔
859
            Some("Metric Value Change".to_string())
1✔
860
        );
861
        assert_eq!(
1✔
862
            denoise_text("| Annual revenue | $4.2B | +12% |", &cfg),
1✔
863
            Some("Annual revenue $4.2B +12%".to_string())
1✔
864
        );
865
    }
1✔
866

867
    #[test]
868
    fn denoise_line_level_markdown_table_embedded_in_prose() {
1✔
869
        let cfg = denoiser_enabled();
1✔
870
        let input = indoc! {"
1✔
871
            Revenue grew steadily over the past three fiscal years.
1✔
872
            | Year | Revenue | Growth |
1✔
873
            |------|---------|--------|
1✔
874
            | 2023 | $3.8B   | +10%   |
1✔
875
            | 2024 | $4.2B   | +12%   |
1✔
876
            Management expects the trend to continue.
1✔
877
        "};
1✔
878
        let result = denoise_text(input.trim(), &cfg).expect("should not be None");
1✔
879
        assert!(result.contains("Revenue grew"));
1✔
880
        assert!(result.contains("Management expects"));
1✔
881
        assert!(result.contains("Year Revenue Growth"));
1✔
882
        assert!(!result.contains("---|"));
1✔
883
        assert_eq!(result.lines().count(), 5);
1✔
884
    }
1✔
885

886
    #[test]
887
    fn denoise_line_level_markdown_table_various_separator_styles() {
1✔
888
        let cfg = denoiser_enabled();
1✔
889
        let input = indoc! {"
1✔
890
            Only this prose line should survive.
1✔
891
            |------|------|
1✔
892
            |:----:|:----:|
1✔
893
            |:-----|-----:|
1✔
894
            | ---- | ---- |
1✔
895
        "};
1✔
896
        let result = denoise_text(input.trim(), &cfg).expect("should not be None");
1✔
897
        assert!(result.contains("Only this prose line"));
1✔
898
        assert!(!result.contains("---"));
1✔
899
        assert_eq!(result.lines().count(), 1);
1✔
900
    }
1✔
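The separator styles exercised by the test above can be recognised with a small predicate. The following is a minimal sketch only, not the crate's implementation: the function name `is_table_separator_sketch` is hypothetical, and the rule it applies (at least one pipe, every cell made of dashes with optional alignment colons) is an assumption inferred from the test inputs.

```rust
/// Hypothetical check: does this line look like a Markdown table separator row?
fn is_table_separator_sketch(line: &str) -> bool {
    let trimmed = line.trim();
    if !trimmed.contains('|') {
        return false; // plain rules such as "---" are out of scope here
    }
    // Drop the optional outer border pipes, then inspect each cell.
    trimmed.trim_matches('|').split('|').all(|cell| {
        // Allow alignment colons on either side (lenient: repeats are accepted).
        let dashes = cell.trim().trim_matches(':');
        !dashes.is_empty() && dashes.chars().all(|c| c == '-')
    })
}
```

Under these assumptions a header row such as `| Metric | Value | Change |` and a prose line containing a single pipe are both rejected, because at least one cell holds something other than dashes.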
901

902
    #[test]
903
    fn denoise_line_level_markdown_table_numeric_cells_dropped() {
1✔
904
        let cfg = denoiser_enabled();
1✔
905
        let input = indoc! {"
1✔
906
            | ID   | Score | Rank |
1✔
907
            |------|-------|------|
1✔
908
            | 1001 | 98.5  | 1    |
1✔
909
            | 1002 | 87.3  | 2    |
1✔
910
            | 1003 | 76.0  | 3    |
1✔
911
        "};
1✔
912
        let result = denoise_text(input.trim(), &cfg).expect("header row text must survive");
1✔
913
        assert_eq!(result, "ID Score Rank");
1✔
914
    }
1✔
915

916
    #[test]
917
    fn denoise_line_level_markdown_table_single_column() {
1✔
918
        let cfg = denoiser_enabled();
1✔
919
        let input = indoc! {"
1✔
920
            Plain sentence before the table.
1✔
921
            | Item       |
1✔
922
            |------------|
1✔
923
            | Widget A   |
1✔
924
            | Widget B   |
1✔
925
            Plain sentence after the table.
1✔
926
        "};
1✔
927
        let result = denoise_text(input.trim(), &cfg).expect("prose and cell text must survive");
1✔
928
        assert!(result.contains("Plain sentence before"));
1✔
929
        assert!(result.contains("Plain sentence after"));
1✔
930
        assert!(result.contains("Item"));
1✔
931
        assert!(result.contains("Widget A"));
1✔
932
        assert!(result.contains("Widget B"));
1✔
933
        assert!(!result.contains("---"));
1✔
934
        assert!(!result.contains('|'));
1✔
935
        assert_eq!(result.lines().count(), 5);
1✔
936
    }
1✔
937

938
    #[test]
939
    fn denoise_line_level_single_pipe_in_prose_is_not_a_table_row() {
1✔
940
        let cfg = denoiser_enabled();
1✔
941
        let input = "Use the syntax foo | bar to combine options.";
1✔
942
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
943
        assert_eq!(result, input);
1✔
944
    }
1✔
945

946
    #[test]
947
    fn denoise_line_level_borderless_table_separator_dropped_data_survives() {
1✔
948
        let cfg = denoiser_enabled();
1✔
949
        let input = indoc! {"
1✔
950
            Name | Age | City
1✔
951
            -----|-----|------
1✔
952
            Alice | 30 | Denver
1✔
953
            Bob | 42 | Tulsa
1✔
954
        "};
1✔
955
        let result = denoise_text(input.trim(), &cfg).expect("should not be None");
1✔
956
        assert!(!result.contains("-----"));
1✔
957
        assert!(result.contains("Name"));
1✔
958
        assert!(result.contains("Alice"));
1✔
959
        assert!(result.contains("Bob"));
1✔
960
    }
1✔
961

962
    // -----------------------------------------------------------------------
963
    // Full OCR table block
964
    // -----------------------------------------------------------------------
965

966
    #[test]
967
    fn denoise_full_table_block_retains_company_names() {
1✔
968
        let cfg = denoiser_enabled();
1✔
969
        let input = indoc! {"
1✔
970
            42 524 NOVEX INDUSTRIES Springfield 10788 143 1995 190 394 13611 358 6444 266
1✔
971
            343 294 ZETA POWER Riverside 10758 31 1283 267 189 45432 175 8675 235
1✔
972
            344 442 OCEAN FORGE Denver 10707 699 3910 403 13380 361 5851 285
1✔
973
            345 397 DELTA FINANCIAL Detroit 10689 8 1069 302 1820 18214 336 17590 182
1✔
974
            346 397 APEX HOLDINGS Brentwood 10648 458 2035 473 6728 450 8775 450
1✔
975
            347 379 VEGA SYSTEMS Tulsa 10627 377 1517 231 6190 4672 172 11423 194
1✔
976
            348 225 CREST BRANDS Atlanta 10589 128 5720 369 14179 349 4091 328
1✔
977
            349 555 TITAN CHEMICAL Kingsport 10476 236 8570 332 793 15159 334 5704 288
1✔
978
            350 540 AIR PRODUCTS & LOGISTICS Allentown 10323 166 20991 182 113 26859 252 13539 169
1✔
979
            351 399 NORTHLAND FINANCIAL FOR MEMBERS Minneapolis 10312 265 25302 155 2972 116524 79 13694 165
1✔
980
        "};
1✔
981
        let result = denoise_text(input.trim(), &cfg).expect("block should not be dropped");
1✔
982
        for name in &[
21✔
983
            "NOVEX",
1✔
984
            "INDUSTRIES",
1✔
985
            "ZETA",
1✔
986
            "POWER",
1✔
987
            "OCEAN",
1✔
988
            "FORGE",
1✔
989
            "DELTA",
1✔
990
            "FINANCIAL",
1✔
991
            "APEX",
1✔
992
            "HOLDINGS",
1✔
993
            "VEGA",
1✔
994
            "SYSTEMS",
1✔
995
            "CREST",
1✔
996
            "BRANDS",
1✔
997
            "TITAN",
1✔
998
            "CHEMICAL",
1✔
999
            "AIR",
1✔
1000
            "PRODUCTS",
1✔
1001
            "LOGISTICS",
1✔
1002
            "NORTHLAND",
1✔
1003
            "MEMBERS",
1✔
1004
        ] {
1✔
1005
            assert!(result.contains(name));
21✔
1006
        }
1007
        for loc in &[
10✔
1008
            "Springfield",
1✔
1009
            "Riverside",
1✔
1010
            "Denver",
1✔
1011
            "Detroit",
1✔
1012
            "Brentwood",
1✔
1013
            "Tulsa",
1✔
1014
            "Atlanta",
1✔
1015
            "Kingsport",
1✔
1016
            "Allentown",
1✔
1017
            "Minneapolis",
1✔
1018
        ] {
1✔
1019
            assert!(result.contains(loc));
10✔
1020
        }
1021
        for junk in &["45432", "13539", "116524"] {
3✔
1022
            assert!(!result.contains(junk));
3✔
1023
        }
1024
        assert!(result.contains("PRODUCTS & LOGISTICS"));
1✔
1025
        assert!(result.contains("10788"));
1✔
1026
        assert!(result.contains("10312"));
1✔
1027
        assert_eq!(result.lines().count(), input.trim().lines().count());
1✔
1028
        assert_eq!(
1✔
1029
            result,
1030
            indoc! {"
1031
                42 524 NOVEX INDUSTRIES Springfield 10788 143
1032
                294 ZETA POWER Riverside 10758
1033
                442 OCEAN FORGE Denver 10707
1034
                397 DELTA FINANCIAL Detroit 10689
1035
                397 APEX HOLDINGS Brentwood 10648
1036
                379 VEGA SYSTEMS Tulsa 10627
1037
                225 CREST BRANDS Atlanta 10589
1038
                555 TITAN CHEMICAL Kingsport 10476
1039
                350 540 AIR PRODUCTS & LOGISTICS Allentown 10323 166
1040
                351 399 NORTHLAND FINANCIAL FOR MEMBERS Minneapolis 10312 265 25302"
1041
            }
1042
        );
1043
    }
1✔
1044

1045
    // -----------------------------------------------------------------------
1046
    // Financial punctuation: +, -, —, $, %
1047
    // -----------------------------------------------------------------------
1048

1049
    #[test]
1050
    fn denoise_financial_punctuation_below_threshold_passes_through_unchanged() {
1✔
1051
        let cfg = denoiser_enabled();
1✔
1052
        let input = "Operating cash: $4.2B (+12% YoY) — net debt fell -$1.1B; margin: 23%.";
1✔
1053
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
1054
        assert_eq!(result, input);
1✔
1055
    }
1✔
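The pass-through behaviour tested above depends on a digit-ratio gate. The Unicode tests later in this file document the ratio as digit characters over digit plus alphabetic characters, with 0.35 as the limit. The sketch below assumes exactly that formula; the helper names `digit_ratio` and `needs_denoising` are hypothetical, and lines with neither digits nor letters (which the crate appears to drop) are outside its scope.

```rust
/// Hypothetical helper: fraction of digit chars among digit + alphabetic chars.
/// Returns None when the line contains neither kind of character.
fn digit_ratio(line: &str) -> Option<f64> {
    let digits = line.chars().filter(|c| c.is_ascii_digit()).count();
    let alpha = line.chars().filter(|c| c.is_alphabetic()).count();
    let total = digits + alpha;
    if total == 0 {
        None
    } else {
        Some(digits as f64 / total as f64)
    }
}

/// Assumed gate: only lines strictly above the threshold are rewritten.
fn needs_denoising(line: &str, max_digit_ratio: f64) -> bool {
    digit_ratio(line).map_or(false, |r| r > max_digit_ratio)
}
```

For the financial sentence in the test above this gives 8 digits against 35 letters, a ratio of about 0.19, so the line stays below 0.35 and would be left untouched.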
1056

1057
    #[test]
1058
    fn denoise_em_dash_operator_on_gate3_line_is_rescued() {
1✔
1059
        let cfg = denoiser_enabled();
1✔
1060
        let input = "REVENUE — COSTS NET 42 524 10788";
1✔
1061
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
1062
        assert!(result.contains('—'));
1✔
1063
        assert!(result.contains("42"));
1✔
1064
        assert!(result.contains("524"));
1✔
1065
        assert!(!result.contains("10788"));
1✔
1066
        assert_eq!(result, "REVENUE — COSTS NET 42 524");
1✔
1067
    }
1✔
1068

1069
    #[test]
1070
    fn denoise_sign_percent_tokens_on_gate3_line_are_rescued() {
1✔
1071
        let cfg = denoiser_enabled();
1✔
1072
        let input = "REVENUE GROWTH +12% EARNINGS -8% COSTS 42 524 10788 5520 3918";
1✔
1073
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
1074
        assert!(result.contains("+12%"));
1✔
1075
        assert!(result.contains("-8%"));
1✔
1076
        assert!(result.contains("42"));
1✔
1077
        assert!(result.contains("524"));
1✔
1078
        assert!(result.contains("10788"));
1✔
1079
        assert!(!result.contains("5520"));
1✔
1080
        assert!(!result.contains("3918"));
1✔
1081
        assert_eq!(
1✔
1082
            result,
1083
            "REVENUE GROWTH +12% EARNINGS -8% COSTS 42 524 10788"
1084
        );
1085
    }
1✔
1086

1087
    // -----------------------------------------------------------------------
1088
    // Unicode and non-ASCII symbol survival
1089
    //
1090
    // The wave-expansion algorithm uses `char::is_alphabetic()`, which
1091
    // correctly returns `true` for CJK ideographs, Greek letters, Arabic,
1092
    // Hebrew, and other Unicode scripts.  Non-alphabetic Unicode characters
1093
    // (currency symbols, emoji, diacritics-only clusters) carry zero alpha
1094
    // count and are treated identically to em-dashes: they survive whenever
1095
    // the ratio budget allows their wave to be accepted.
1096
    // -----------------------------------------------------------------------
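To make the classification in the comment above concrete, this standalone sketch counts alphabetic and digit characters per token with `char::is_alphabetic()`. The helper name `alpha_digit_counts` is hypothetical, and counting digits with `char::is_ascii_digit()` is an assumption; the crate's own counting may differ.

```rust
/// Hypothetical helper: returns (alphabetic_chars, digit_chars) for a token.
/// Assumption: digits are counted with `is_ascii_digit`.
fn alpha_digit_counts(token: &str) -> (usize, usize) {
    let alpha = token.chars().filter(|c| c.is_alphabetic()).count();
    let digits = token.chars().filter(|c| c.is_ascii_digit()).count();
    (alpha, digits)
}
```

CJK, Greek, Hebrew, and accented Latin tokens report a non-zero alphabetic count and can therefore act as seeds. Currency symbols, emoji, and em-dashes report zero for both counts, so they add nothing to either side of the ratio.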
1097

1098
    /// A line that is below the digit threshold must pass through
1099
    /// byte-for-byte regardless of what non-ASCII characters it contains.
1100
    #[test]
1101
    fn unicode_below_threshold_passes_through_unchanged() {
1✔
1102
        let cfg = denoiser_enabled();
1✔
1103
        // Euro, yen, rupee, emoji: digit ratio is 13 / (13 + 32) ≈ 0.29, below 0.35.
1104
        let input = "Operating cash: €4.2B (+12% YoY) ¥310M ₹28B 🚀 — net margin: 23% (prev 21%).";
1✔
1105
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
1106
        assert_eq!(
1✔
1107
            result, input,
1108
            "below-threshold line with unicode symbols must be byte-identical"
1109
        );
1110
    }
1✔
1111

1112
    /// CJK ideographs are classified as `alphabetic` by Rust's Unicode
1113
    /// tables.  A block where CJK tokens seed the keep-set should rescue
1114
    /// their immediate numeric neighbours exactly like ASCII alpha tokens do.
1115
    ///
1116
    /// Tokens and wave math:
1117
    ///   input = "12 34 56 北京工业有限公司 78 90"
1118
    ///   d=10  a=8  ratio=0.56 → gate 3 triggered
1119
    ///   seed  = {北京工业有限公司(3)}, d=0 a=8
1120
    ///   wave1 = {56(2), 78(4)}  wd=4  new_ratio=4/12≈0.33 ≤ 0.35 → accept
1121
    ///   wave2 = {34(1), 90(5)}  wd=4  new_ratio=8/16=0.50 > 0.35 → reject
1122
    ///   result = "56 北京工业有限公司 78"
1123
    #[test]
1124
    fn cjk_ideographs_act_as_alpha_seeds_rescuing_adjacent_numbers() {
1✔
1125
        let cfg = denoiser_enabled();
1✔
1126
        let input = "12 34 56 北京工业有限公司 78 90";
1✔
1127
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
1128
        assert!(
1✔
1129
            result.contains("北京工业有限公司"),
1✔
1130
            "CJK token must survive as alpha seed"
1131
        );
1132
        assert!(
1✔
1133
            result.contains("56"),
1✔
1134
            "'56' adjacent to CJK — rescued in wave 1"
1135
        );
1136
        assert!(
1✔
1137
            result.contains("78"),
1✔
1138
            "'78' adjacent to CJK — rescued in wave 1"
1139
        );
1140
        assert!(!result.contains("34"), "'34' two hops from CJK — stripped");
1✔
1141
        assert!(!result.contains("90"), "'90' two hops from CJK — stripped");
1✔
1142
        assert!(!result.contains("12"), "'12' isolated — stripped");
1✔
1143
        assert_eq!(result, "56 北京工业有限公司 78");
1✔
1144
    }
1✔
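The wave math in the doc comment above can be reproduced with a short standalone function. This is a sketch inferred from the documented examples, not the crate's implementation: the name `wave_expand_sketch`, the whitespace tokenization, and the use of `is_ascii_digit` for digit counts are all assumptions. Seeds are tokens with at least one alphabetic character. Each wave adds the dropped tokens directly adjacent to the current keep-set, and is accepted only if the running digit ratio stays at or below the threshold.

```rust
/// Hypothetical sketch of the wave expansion described in the doc comments.
/// Not the crate's implementation; names and tokenization are assumptions.
fn wave_expand_sketch(line: &str, max_digit_ratio: f64) -> String {
    let tokens: Vec<&str> = line.split_whitespace().collect();

    // Per-token (digit_chars, alphabetic_chars).
    let stats: Vec<(usize, usize)> = tokens
        .iter()
        .map(|t| {
            let d = t.chars().filter(|c| c.is_ascii_digit()).count();
            let a = t.chars().filter(|c| c.is_alphabetic()).count();
            (d, a)
        })
        .collect();

    // Seed the keep-set with every token holding at least one alphabetic char.
    let mut keep: Vec<bool> = stats.iter().map(|&(_, a)| a > 0).collect();
    let mut digits = 0usize;
    let mut alpha = 0usize;
    for (i, &(d, a)) in stats.iter().enumerate() {
        if keep[i] {
            digits += d;
            alpha += a;
        }
    }

    loop {
        // Next wave: every dropped token directly adjacent to a kept token.
        let wave: Vec<usize> = (0..tokens.len())
            .filter(|&i| {
                !keep[i] && ((i > 0 && keep[i - 1]) || (i + 1 < tokens.len() && keep[i + 1]))
            })
            .collect();
        if wave.is_empty() {
            break;
        }

        // Wave tokens have no alphabetic chars, so only the digit count grows.
        let wave_digits: usize = wave.iter().map(|&i| stats[i].0).sum();
        let ratio = (digits + wave_digits) as f64 / (digits + wave_digits + alpha) as f64;
        if ratio > max_digit_ratio {
            break; // reject this wave and everything beyond it
        }
        for i in wave {
            keep[i] = true;
        }
        digits += wave_digits;
    }

    // Emit the kept tokens in their original order.
    let mut kept = Vec::new();
    for (i, t) in tokens.iter().enumerate() {
        if keep[i] {
            kept.push(*t);
        }
    }
    kept.join(" ")
}
```

Stopping at the first rejected wave, rather than trying individual tokens from it, is what keeps the rescued tokens contiguous with their seeds in the documented results.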
1145

1146
    /// Greek letters are `alphabetic` and act as alpha seeds.  Mixed
1147
    /// Greek/ASCII alpha tokens should seed the keep-set and rescue the
1148
    /// first wave of adjacent numeric tokens within the ratio budget.
1149
    ///
1150
    ///   input = "α β Revenue 42 524 10788"
1151
    ///   d=10  a=9  ratio≈0.53 → gate 3
1152
    ///   seed  = {α(0), β(1), Revenue(2)}, d=0 a=9
1153
    ///   wave1 = {42(3)}  wd=2  ratio=2/11≈0.18 ≤ 0.35 → accept
1154
    ///   wave2 = {524(4)} wd=3  ratio=5/14≈0.36 > 0.35 → reject
1155
    ///   result = "α β Revenue 42"
1156
    #[test]
1157
    fn greek_letters_act_as_alpha_seeds() {
1✔
1158
        let cfg = denoiser_enabled();
1✔
1159
        let input = "α β Revenue 42 524 10788";
1✔
1160
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
1161
        assert!(result.contains('α'), "Greek α must survive as alpha seed");
1✔
1162
        assert!(result.contains('β'), "Greek β must survive as alpha seed");
1✔
1163
        assert!(result.contains("Revenue"), "ASCII alpha token must survive");
1✔
1164
        assert!(result.contains("42"), "'42' adjacent to Revenue — rescued");
1✔
1165
        assert!(!result.contains("524"), "'524' in rejected wave — stripped");
1✔
1166
        assert!(!result.contains("10788"), "'10788' unreachable — stripped");
1✔
1167
        assert_eq!(result, "α β Revenue 42");
1✔
1168
    }
1✔
1169

1170
    /// Arabic script characters are `alphabetic`.  A token composed entirely
1171
    /// of Arabic letters seeds the keep-set just like any Latin token and
1172
    /// rescues its immediate numeric neighbour in wave 1.
1173
    ///
1174
    ///   input = "إيرادات 42 99999"
1175
    ///   d=7 a=7 ratio=0.50 → gate 3
1176
    ///   seed  = {إيرادات(0)}, d=0 a=7
1177
    ///   wave1 = {42(1)}     wd=2  ratio=2/9≈0.22 ≤ 0.35 → accept
1178
    ///   wave2 = {99999(2)}  wd=5  ratio=7/14=0.50 > 0.35 → reject
1179
    ///   result = "إيرادات 42"
1180
    #[test]
1181
    fn arabic_script_acts_as_alpha_seed() {
1✔
1182
        let cfg = denoiser_enabled();
1✔
1183
        let input = "إيرادات 42 99999";
1✔
1184
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
1185
        assert!(
1✔
1186
            result.contains("إيرادات"),
1✔
1187
            "Arabic token must survive as alpha seed"
1188
        );
1189
        assert!(result.contains("42"), "adjacent number rescued in wave 1");
1✔
1190
        assert!(!result.contains("99999"), "non-adjacent number stripped");
1✔
1191
        assert_eq!(result, "إيرادات 42");
1✔
1192
    }
1✔
1193

1194
    /// Hebrew script characters are `alphabetic` and seed the keep-set,
1195
    /// rescuing their immediate numeric neighbour in wave 1.
1196
    ///
1197
    ///   input = "הכנסות 42 99999"
1198
    ///   d=7 a=6 ratio=0.54 → gate 3
1199
    ///   seed  = {הכנסות(0)}, d=0 a=6
1200
    ///   wave1 = {42(1)}     wd=2  ratio=2/8=0.25 ≤ 0.35 → accept
1201
    ///   wave2 = {99999(2)}  wd=5  ratio=7/13≈0.54 > 0.35 → reject
1202
    ///   result = "הכנסות 42"
1203
    #[test]
1204
    fn hebrew_script_acts_as_alpha_seed() {
1✔
1205
        let cfg = denoiser_enabled();
1✔
1206
        let input = "הכנסות 42 99999";
1✔
1207
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
1208
        assert!(
1✔
1209
            result.contains("הכנסות"),
1✔
1210
            "Hebrew token must survive as alpha seed"
1211
        );
1212
        assert!(result.contains("42"), "adjacent number rescued in wave 1");
1✔
1213
        assert!(!result.contains("99999"), "non-adjacent number stripped");
1✔
1214
        assert_eq!(result, "הכנסות 42");
1✔
1215
    }
1✔
1216

1217
    /// Accented/extended Latin characters (`é`, `ü`, `ñ`, etc.) are
1218
    /// `alphabetic` — a token like `Société` seeds the keep-set normally.
1219
    #[test]
1220
    fn accented_latin_acts_as_alpha_seed() {
1✔
1221
        let cfg = denoiser_enabled();
1✔
1222
        // Société(a=7) and Ünternehmen(a=11) are seeds; digit tokens rescued
1223
        // by wave expansion within budget.
1224
        let input = "12 34 56 Société 78 Ünternehmen 90 1234 5678";
1✔
1225
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
1226
        assert!(
1✔
1227
            result.contains("Société"),
1✔
1228
            "accented token must survive as alpha seed"
1229
        );
1230
        assert!(
1✔
1231
            result.contains("Ünternehmen"),
1✔
1232
            "umlaut token must survive as alpha seed"
1233
        );
1234
    }
1✔
1235

1236
    /// A bare currency symbol (`€`, `¥`, `₹`) is NOT `alphabetic`: it
1237
    /// carries zero alpha chars and zero digit chars.  On a gate-3 line the
1238
    /// symbol itself is free, like an em-dash, so a token such as `€100`
1239
    /// costs the digit budget only its three digits when its wave is accepted.
1240
    ///
1241
    ///   input = "REVENUE €100 COSTS 42 524 10788 5520"
1242
    ///   d=17 a=12 ratio≈0.59 → gate 3
1243
    ///   seed  = {REVENUE(0), COSTS(2)}, d=0 a=12
1244
    ///   wave1 = {€100(1), 42(3)}  wd=3+2=5  ratio=5/17≈0.29 ≤ 0.35 → accept
1245
    ///   wave2 = {524(4)}           wd=3      ratio=8/20=0.40 > 0.35 → reject
1246
    ///   result = "REVENUE €100 COSTS 42"
1247
    #[test]
1248
    fn euro_sign_token_rescued_by_wave_expansion() {
1✔
1249
        let cfg = denoiser_enabled();
1✔
1250
        let input = "REVENUE €100 COSTS 42 524 10788 5520";
1✔
1251
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
1252
        assert!(result.contains("REVENUE"), "alpha seed must survive");
1✔
1253
        assert!(result.contains("COSTS"), "alpha seed must survive");
1✔
1254
        assert!(
1✔
1255
            result.contains("€100"),
1✔
1256
            "€100 adjacent to REVENUE — rescued in wave 1"
1257
        );
1258
        assert!(
1✔
1259
            result.contains("42"),
1✔
1260
            "'42' adjacent to COSTS — rescued in wave 1"
1261
        );
1262
        assert!(
1✔
1263
            !result.contains("524"),
1✔
1264
            "'524' in rejected wave 2 — stripped"
1265
        );
1266
        assert!(!result.contains("10788"), "unreachable — stripped");
1✔
1267
        assert_eq!(result, "REVENUE €100 COSTS 42");
1✔
1268
    }
1✔
1269

1270
    /// Yen (¥) behaves identically to the euro sign test above.
1271
    #[test]
1272
    fn yen_sign_token_rescued_by_wave_expansion() {
1✔
1273
        let cfg = denoiser_enabled();
1✔
1274
        let input = "PROFIT ¥500 LOSS 42 524 13000 5520";
1✔
1275
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
1276
        assert!(result.contains("¥500"), "¥500 adjacent to PROFIT — rescued");
1✔
1277
        assert!(result.contains("42"), "42 adjacent to LOSS — rescued");
1✔
1278
        assert!(!result.contains("524"), "stripped");
1✔
1279
    }
1✔
1280

1281
    /// Indian rupee (₹) token rescued by wave expansion.
1282
    ///
1283
    /// `₹200` contains digits but no alpha chars, so it is not a seed but IS
1284
    /// eligible for rescue.  With two alpha-heavy seeds providing enough budget
1285
    /// the wave that rescues `₹200` is accepted.
1286
    ///
1287
    ///   input = "INCOME PROFIT ₹200 9999 9999 9999"
1288
    ///   d=15 a=12 ratio=0.556 → gate 3
1289
    ///   seed  = {INCOME(0), PROFIT(1)}, d=0 a=12
1290
    ///   wave1 = {₹200(2)}   wd=3  ratio=3/15=0.20 ≤ 0.35 → accept
1291
    ///   wave2 = {9999(3)}   wd=4  ratio=7/19≈0.37 > 0.35 → reject
1292
    ///   result = "INCOME PROFIT ₹200"
1293
    #[test]
1294
    fn rupee_sign_token_rescued_by_wave_expansion() {
1✔
1295
        let cfg = denoiser_enabled();
1✔
1296
        let input = "INCOME PROFIT ₹200 9999 9999 9999";
1✔
1297
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
1298
        assert!(
1✔
1299
            result.contains("₹200"),
1✔
1300
            "₹200 adjacent to alpha seeds — rescued in wave 1"
1301
        );
1302
        assert!(!result.contains("9999"), "non-adjacent numbers stripped");
1✔
1303
        assert_eq!(result, "INCOME PROFIT ₹200");
1✔
1304
    }
1✔
1305

1306
    /// An emoji character is NOT `alphabetic` (zero alpha, zero digit chars).
1307
    /// On a gate-3 line it is rescued at zero cost to the digit budget —
1308
    /// identical behaviour to em-dashes.
1309
    ///
1310
    ///   input = "REVENUE 🚀 COSTS 42 524 10788"
1311
    ///   d=10 a=12 ratio≈0.45 → gate 3
1312
    ///   seed  = {REVENUE(0), COSTS(2)}, d=0 a=12
1313
    ///   wave1 = {🚀(1), 42(3)}  wd=0+2=2   ratio=2/14≈0.14 ≤ 0.35 → accept
1314
    ///   wave2 = {524(4)}         wd=3        ratio=5/17≈0.29 ≤ 0.35 → accept
1315
    ///   wave3 = {10788(5)}       wd=5        ratio=10/22≈0.45 > 0.35 → reject
1316
    ///   result = "REVENUE 🚀 COSTS 42 524"
1317
    #[test]
1318
    fn emoji_rescued_by_wave_expansion_on_gate3_line() {
1✔
1319
        let cfg = denoiser_enabled();
1✔
1320
        let input = "REVENUE 🚀 COSTS 42 524 10788";
1✔
1321
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
1322
        assert!(
1✔
1323
            result.contains('🚀'),
1✔
1324
            "emoji between alpha tokens must be rescued (zero digit cost)"
1325
        );
1326
        assert!(result.contains("42"), "rescued in wave 1");
1✔
1327
        assert!(result.contains("524"), "rescued in wave 2");
1✔
1328
        assert!(!result.contains("10788"), "rejected wave 3 — stripped");
1✔
1329
        assert_eq!(result, "REVENUE 🚀 COSTS 42 524");
1✔
1330
    }
1✔
1331

1332
    /// Multiple different non-ASCII symbol types on the same gate-3 line:
1333
    /// emoji, currency, and CJK all handled correctly together.
1334
    #[test]
1335
    fn mixed_unicode_symbols_on_gate3_line() {
1✔
1336
        let cfg = denoiser_enabled();
1✔
1337
        // 北京(a=2), CORP(a=4): seeds; €(d=0,a=0) and 🌏(d=0,a=0) rescued at
1338
        // zero digit cost in wave 1; isolated numerics stripped.
1339
        let input = "42 € 北京 🌏 CORP 524 10788 99999";
1✔
1340
        let result = denoise_text(input, &cfg).expect("should not be None");
1✔
1341
        assert!(result.contains("北京"), "CJK alpha seed must survive");
1✔
1342
        assert!(result.contains("CORP"), "ASCII alpha seed must survive");
1✔
1343
        assert!(result.contains('€'), "euro sign rescued at zero cost");
1✔
1344
        assert!(result.contains("🌏"), "globe emoji rescued at zero cost");
1✔
1345
        assert!(!result.contains("99999"), "isolated number stripped");
1✔
1346
    }
1✔
1347

1348
    // -----------------------------------------------------------------------
1349
    // Linearized XBRL passthrough
1350
    // -----------------------------------------------------------------------
1351

1352
    #[test]
1353
    fn linearized_xbrl_single_metric_line_passes_through_unchanged() {
1✔
1354
        let line = concat!(
1✔
1355
            "label=Net income | dir=up | traj=non_monotonic | path=mostly_upward",
1356
            " | recent=up_bias | reg=growth_with_resets | cons=erratic | turn=high_turn",
1357
            " | run=clustered_runs | end=recovering_off_peak | rec=weak_recovery | dd=extreme",
1358
            " | shock=repeated_shock | pol=upside_shocks | flip=false | sig=UUDUUUDU-t4",
1359
            " | first=30.00B | last=42.10B | filing_quality=score=100.0 grade=A transitions=12",
1360
            " scale_issues=0 uom_issues=0",
1361
        );
1362
        let cfg = denoiser_enabled();
1✔
1363
        assert_eq!(denoise_text(line, &cfg), Some(line.to_string()));
1✔
1364
    }
1✔
1365

1366
    #[test]
1367
    fn linearized_xbrl_full_aapl_block_content_preserved() {
1✔
1368
        let input = indoc! {"
1✔
1369
            ### AAPL
            periods=2026Q1,2025Q4,2025Q3,2025Q2,2025Q1,2024Q4,2024Q3,2024Q2,2024Q1,2023Q4,2023Q3,2023Q2,2023Q1

            label=Net income | dir=up | traj=non_monotonic | path=mostly_upward | recent=up_bias | reg=growth_with_resets | cons=erratic | turn=high_turn | run=clustered_runs | end=recovering_off_peak | rec=weak_recovery | dd=extreme | shock=repeated_shock | pol=upside_shocks | flip=false | sig=UUDUUUDU-t4 | first=30.00B | last=42.10B | filing_quality=score=100.0 grade=A transitions=12 scale_issues=0 uom_issues=0

            label=Operating income | dir=up | traj=non_monotonic | path=mostly_upward | recent=up_bias | reg=growth_with_resets | cons=erratic | turn=high_turn | run=clustered_runs | end=recovering_off_peak | rec=weak_recovery | dd=severe | shock=repeated_shock | pol=upside_shocks | flip=false | sig=UUDUUUDU-t4 | first=36.02B | last=50.85B | filing_quality=score=100.0 grade=A transitions=12 scale_issues=0 uom_issues=0"
        };
        let cfg = denoiser_enabled();
        assert_eq!(
            denoise_text(input, &cfg),
            Some(indoc! {"
                ### AAPL
                periods=2026Q1,2025Q4,2025Q3,2025Q2,2025Q1,2024Q4,2024Q3,2024Q2,2024Q1,2023Q4,2023Q3,2023Q2,2023Q1
                label=Net income | dir=up | traj=non_monotonic | path=mostly_upward | recent=up_bias | reg=growth_with_resets | cons=erratic | turn=high_turn | run=clustered_runs | end=recovering_off_peak | rec=weak_recovery | dd=extreme | shock=repeated_shock | pol=upside_shocks | flip=false | sig=UUDUUUDU-t4 | first=30.00B | last=42.10B | filing_quality=score=100.0 grade=A transitions=12 scale_issues=0 uom_issues=0
                label=Operating income | dir=up | traj=non_monotonic | path=mostly_upward | recent=up_bias | reg=growth_with_resets | cons=erratic | turn=high_turn | run=clustered_runs | end=recovering_off_peak | rec=weak_recovery | dd=severe | shock=repeated_shock | pol=upside_shocks | flip=false | sig=UUDUUUDU-t4 | first=36.02B | last=50.85B | filing_quality=score=100.0 grade=A transitions=12 scale_issues=0 uom_issues=0"
            }.to_string())
        );
    }
}