zbraniecki / icu4x / 6815798908 (push, via GitHub web-flow)

09 Nov 2023 05:17PM UTC coverage: 72.607% (-2.4%) from 75.01%

Implement `Any/BufferProvider` for some smart pointers (#4255)

Allows storing them as a `Box<dyn Any/BufferProvider>` without using a
wrapper type that implements the trait.

44281 of 60987 relevant lines covered (72.61%)
201375.86 hits per line

Source file: /components/normalizer/src/lib.rs (82.0% covered)
// This file is part of ICU4X. For terms of use, please see the file
// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

// https://github.com/unicode-org/icu4x/blob/main/docs/process/boilerplate.md#library-annotations
#![cfg_attr(not(any(test, feature = "std")), no_std)]
#![cfg_attr(
    not(test),
    deny(
        clippy::indexing_slicing,
        clippy::unwrap_used,
        clippy::expect_used,
        clippy::panic,
        clippy::exhaustive_structs,
        clippy::exhaustive_enums,
        missing_debug_implementations,
    )
)]
#![warn(missing_docs)]

//! Normalizing text into Unicode Normalization Forms.
//!
//! This module is published as its own crate ([`icu_normalizer`](https://docs.rs/icu_normalizer/latest/icu_normalizer/))
//! and as part of the [`icu`](https://docs.rs/icu/latest/icu/) crate. See the latter for more details on the ICU4X project.
//!
//! # Implementation notes
//!
//! The normalizer operates on a lazy iterator over Unicode scalar values (Rust `char`) internally,
//! and iterating over guaranteed-valid UTF-8, potentially-invalid UTF-8, and potentially-invalid
//! UTF-16 is a step that doesn’t leak into the normalizer internals. Ill-formed byte sequences are
//! treated as U+FFFD.
//!
//! The normalizer data layout is not based on the ICU4C design at all. Instead, the normalization
//! data layout is a clean-slate design optimized for the concept of fusing the NFD decomposition
//! into the collator. That is, the decomposing normalizer is a by-product of the collator-motivated
//! data layout.
//!
//! Notably, the decomposition data structure is optimized for a starter decomposing to itself,
//! which is the most common case, and for a starter decomposing to a starter and a non-starter
//! on the Basic Multilingual Plane. Notably, in this case, the collator makes use of the
//! knowledge that the second character of such a decomposition is a non-starter. Therefore,
//! decomposition into two starters is handled by a generic fallback path that looks up the
//! decomposition from an array by offset and length instead of baking a BMP starter pair directly
//! into a trie value.
//!
//! The decompositions into non-starters are hard-coded. At present in Unicode, these appear
//! to be special cases falling into three categories:
//!
//! 1. Deprecated combining marks.
//! 2. Particular Tibetan vowel signs.
//! 3. NFKD only: half-width kana voicing marks.
//!
//! Hopefully Unicode never adds more decompositions into non-starters (other than a character
//! decomposing to itself), but if it does, a code update is needed instead of a mere data update.
//!
//! The composing normalizer builds on the decomposing normalizer by performing the canonical
//! composition post-processing per spec. As an optimization, though, the composing normalizer
//! attempts to pass through already-normalized text consisting of starters that never combine
//! backwards and that map to themselves if followed by a character whose decomposition starts
//! with a starter that never combines backwards.
//!
//! As a difference with ICU4C, the composing normalizer has only the simplest possible
//! passthrough (only one inversion list lookup per character in the best case) and the full
//! decompose-then-canonically-compose behavior, whereas ICU4C has other paths between these
//! extremes. The ICU4X collator doesn't make use of the FCD concept at all in order to avoid
//! doing the work of checking whether the FCD condition holds.

extern crate alloc;

mod error;
pub mod properties;
pub mod provider;

pub use crate::error::NormalizerError;

#[doc(no_inline)]
pub use NormalizerError as Error;

use crate::provider::CanonicalDecompositionDataV1Marker;
use crate::provider::CompatibilityDecompositionSupplementV1Marker;
use crate::provider::DecompositionDataV1;
#[cfg(feature = "experimental")]
use crate::provider::Uts46DecompositionSupplementV1Marker;
use alloc::string::String;
use alloc::vec::Vec;
use core::char::REPLACEMENT_CHARACTER;
use core::str::from_utf8_unchecked;
use icu_collections::char16trie::Char16Trie;
use icu_collections::char16trie::Char16TrieIterator;
use icu_collections::char16trie::TrieResult;
use icu_collections::codepointtrie::CodePointTrie;
use icu_properties::CanonicalCombiningClass;
use icu_provider::prelude::*;
use provider::CanonicalCompositionsV1Marker;
use provider::CanonicalDecompositionTablesV1Marker;
use provider::CompatibilityDecompositionTablesV1Marker;
use provider::DecompositionSupplementV1;
use provider::DecompositionTablesV1;
use smallvec::SmallVec;
use utf16_iter::Utf16CharsEx;
use utf8_iter::Utf8CharsEx;
use write16::Write16;
use zerofrom::ZeroFrom;
use zerovec::{zeroslice, ZeroSlice};

#[derive(Debug)]
enum SupplementPayloadHolder {
    Compatibility(DataPayload<CompatibilityDecompositionSupplementV1Marker>),
    #[cfg(feature = "experimental")]
    Uts46(DataPayload<Uts46DecompositionSupplementV1Marker>),
}

impl SupplementPayloadHolder {
    fn get(&self) -> &DecompositionSupplementV1 {
        match self {
            SupplementPayloadHolder::Compatibility(d) => d.get(),
            #[cfg(feature = "experimental")]
            SupplementPayloadHolder::Uts46(d) => d.get(),
        }
    }
}

/// Number of iterations allowed on the fast path before flushing.
/// Since a typical UTF-16 iteration advances over a 2-byte BMP
/// character, this means two memory pages.
/// Intel Core i7-4770 had the best results between 2 and 4 pages
/// when testing powers of two. Apple M1 didn't seem to care
/// about 1, 2, 4, or 8 pages.
///
/// Curiously, the `str` case does not appear to benefit from
/// similar flushing, though the tested monomorphization never
/// passes an error through from `Write`.
const UTF16_FAST_PATH_FLUSH_THRESHOLD: usize = 4096;

/// Marker for starters that decompose to themselves but may
/// combine backwards under canonical composition.
/// (Main trie only; not used in the supplementary trie.)
const BACKWARD_COMBINING_STARTER_MARKER: u32 = 1;

/// Magic marker trie value for characters whose decomposition
/// starts with a non-starter. The actual decomposition is
/// hard-coded.
const SPECIAL_NON_STARTER_DECOMPOSITION_MARKER: u32 = 2;

/// `u16` version of the previous marker value.
const SPECIAL_NON_STARTER_DECOMPOSITION_MARKER_U16: u16 = 2;

/// Marker that a complex decomposition isn't round-trippable
/// under re-composition.
const NON_ROUND_TRIP_MARKER: u16 = 1;

/// Checks if a trie value carries a (non-zero) canonical
/// combining class.
fn trie_value_has_ccc(trie_value: u32) -> bool {
    (trie_value & 0xFFFFFF00) == 0xD800
}

/// Checks if the trie signifies a special non-starter decomposition.
fn trie_value_indicates_special_non_starter_decomposition(trie_value: u32) -> bool {
    trie_value == SPECIAL_NON_STARTER_DECOMPOSITION_MARKER
}

/// Checks if a trie value signifies a character whose decomposition
/// starts with a non-starter.
fn decomposition_starts_with_non_starter(trie_value: u32) -> bool {
    trie_value_has_ccc(trie_value)
        || trie_value_indicates_special_non_starter_decomposition(trie_value)
}

/// Extracts a canonical combining class (possibly zero) from a trie value.
///
/// # Panics
///
/// The trie value must not be one that signifies a special non-starter
/// decomposition. (Debug-only)
fn ccc_from_trie_value(trie_value: u32) -> CanonicalCombiningClass {
    if trie_value_has_ccc(trie_value) {
        CanonicalCombiningClass(trie_value as u8)
    } else {
        debug_assert_ne!(trie_value, SPECIAL_NON_STARTER_DECOMPOSITION_MARKER);
        CanonicalCombiningClass::NotReordered
    }
}

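The `0xD800 | ccc` packing that `trie_value_has_ccc` and `ccc_from_trie_value` decode can be illustrated with a small self-contained sketch. This is an illustration only: a plain `u8` stands in for `icu_properties::CanonicalCombiningClass`, and the packed value is constructed by hand rather than taken from real trie data.

```rust
// Mirrors trie_value_has_ccc / ccc_from_trie_value above: a trie value of
// the form 0xD8cc carries canonical combining class 0xcc in its low byte.
fn has_ccc(trie_value: u32) -> bool {
    (trie_value & 0xFFFF_FF00) == 0xD800
}

fn ccc_of(trie_value: u32) -> u8 {
    if has_ccc(trie_value) {
        trie_value as u8 // the class lives in the low byte
    } else {
        0 // Not_Reordered
    }
}

fn main() {
    // 230 is the combining class of e.g. U+0301 COMBINING ACUTE ACCENT.
    let packed = 0xD800u32 | 230;
    assert!(has_ccc(packed));
    assert_eq!(ccc_of(packed), 230);
    // A starter that decomposes to itself has trie value 0.
    assert_eq!(ccc_of(0), 0);
}
```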
/// The tail (everything after the first character) of the NFKD form U+FDFA
/// as 16-bit units.
static FDFA_NFKD: [u16; 17] = [
    0x644, 0x649, 0x20, 0x627, 0x644, 0x644, 0x647, 0x20, 0x639, 0x644, 0x64A, 0x647, 0x20, 0x648,
    0x633, 0x644, 0x645,
];

/// Marker value for U+FDFA in NFKD
const FDFA_MARKER: u16 = 3;

// These constants originate from page 143 of Unicode 14.0
/// Syllable base
const HANGUL_S_BASE: u32 = 0xAC00;
/// Lead jamo base
const HANGUL_L_BASE: u32 = 0x1100;
/// Vowel jamo base
const HANGUL_V_BASE: u32 = 0x1161;
/// Trail jamo base (deliberately off by one to account for the absence of a trail)
const HANGUL_T_BASE: u32 = 0x11A7;
/// Lead jamo count
const HANGUL_L_COUNT: u32 = 19;
/// Vowel jamo count
const HANGUL_V_COUNT: u32 = 21;
/// Trail jamo count (deliberately off by one to account for the absence of a trail)
const HANGUL_T_COUNT: u32 = 28;
/// Vowel jamo count times trail jamo count
const HANGUL_N_COUNT: u32 = 588;
/// Syllable count
const HANGUL_S_COUNT: u32 = 11172;

/// One past the conjoining jamo block
const HANGUL_JAMO_LIMIT: u32 = 0x1200;

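These constants drive the purely arithmetic Hangul syllable handling from the Unicode Standard's conjoining jamo algorithm. A self-contained sketch of the decomposition direction (constant names are shortened forms of the `HANGUL_*` constants above):

```rust
// Arithmetic decomposition of a precomposed Hangul syllable into
// lead (L), vowel (V), and optional trail (T) jamo.
const S_BASE: u32 = 0xAC00;
const L_BASE: u32 = 0x1100;
const V_BASE: u32 = 0x1161;
const T_BASE: u32 = 0x11A7; // off by one: T index 0 means "no trail"
const T_COUNT: u32 = 28;
const N_COUNT: u32 = 588; // V_COUNT * T_COUNT
const S_COUNT: u32 = 11172;

fn decompose_hangul(s: u32) -> Option<(u32, u32, Option<u32>)> {
    let s_index = s.wrapping_sub(S_BASE);
    if s_index >= S_COUNT {
        return None; // not a precomposed syllable
    }
    let l = L_BASE + s_index / N_COUNT;
    let v = V_BASE + (s_index % N_COUNT) / T_COUNT;
    let t = s_index % T_COUNT;
    Some((l, v, if t == 0 { None } else { Some(T_BASE + t) }))
}

fn main() {
    // U+AC00 HANGUL SYLLABLE GA = U+1100 + U+1161 (no trail)
    assert_eq!(decompose_hangul(0xAC00), Some((0x1100, 0x1161, None)));
    // U+AC01 HANGUL SYLLABLE GAG additionally has trail U+11A8
    assert_eq!(decompose_hangul(0xAC01), Some((0x1100, 0x1161, Some(0x11A8))));
}
```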
/// If `opt` is `Some`, unwrap it. If `None`, panic if debug assertions
/// are enabled and return `default` if debug assertions are not enabled.
///
/// Use this only if the only reason why `opt` could be `None` is bogus
/// data from the provider.
#[inline(always)]
fn unwrap_or_gigo<T>(opt: Option<T>, default: T) -> T {
    if let Some(val) = opt {
        val
    } else {
        // GIGO case
        debug_assert!(false);
        default
    }
}

/// Convert a `u32` _obtained from data provider data_ to `char`.
#[inline(always)]
fn char_from_u32(u: u32) -> char {
    unwrap_or_gigo(core::char::from_u32(u), REPLACEMENT_CHARACTER)
}

/// Convert a `u16` _obtained from data provider data_ to `char`.
#[inline(always)]
fn char_from_u16(u: u16) -> char {
    char_from_u32(u32::from(u))
}

const EMPTY_U16: &ZeroSlice<u16> = zeroslice![];

const EMPTY_CHAR: &ZeroSlice<char> = zeroslice![];

#[inline(always)]
fn in_inclusive_range(c: char, start: char, end: char) -> bool {
    u32::from(c).wrapping_sub(u32::from(start)) <= (u32::from(end) - u32::from(start))
}

#[inline(always)]
fn in_inclusive_range32(u: u32, start: u32, end: u32) -> bool {
    u.wrapping_sub(start) <= (end - start)
}

#[inline(always)]
fn in_inclusive_range16(u: u16, start: u16, end: u16) -> bool {
    u.wrapping_sub(start) <= (end - start)
}

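The `wrapping_sub` trick in these range helpers folds both bounds checks into a single comparison: if the value is below `start`, the subtraction wraps around to a huge unsigned number that fails the `<=` test. A quick standalone demonstration:

```rust
// Branch-reduced inclusive range check, as in the helpers above.
fn in_inclusive_range32(u: u32, start: u32, end: u32) -> bool {
    u.wrapping_sub(start) <= (end - start)
}

fn main() {
    // The conjoining jamo V..T range used elsewhere in this file.
    assert!(in_inclusive_range32(0x1161, 0x1161, 0x11C2)); // at start
    assert!(in_inclusive_range32(0x11C2, 0x1161, 0x11C2)); // at end
    assert!(!in_inclusive_range32(0x1160, 0x1161, 0x11C2)); // below: wraps, fails
    assert!(!in_inclusive_range32(0x11C3, 0x1161, 0x11C2)); // above
}
```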
/// Performs canonical composition (including Hangul) on a pair of
/// characters or returns `None` if these characters don't compose.
/// Composition exclusions are taken into account.
#[inline]
fn compose(iter: Char16TrieIterator, starter: char, second: char) -> Option<char> {
    let v = u32::from(second).wrapping_sub(HANGUL_V_BASE);
    if v >= HANGUL_JAMO_LIMIT - HANGUL_V_BASE {
        return compose_non_hangul(iter, starter, second);
    }
    if v < HANGUL_V_COUNT {
        let l = u32::from(starter).wrapping_sub(HANGUL_L_BASE);
        if l < HANGUL_L_COUNT {
            let lv = l * HANGUL_N_COUNT + v * HANGUL_T_COUNT;
            // Safe, because the inputs are known to be in range.
            return Some(unsafe { char::from_u32_unchecked(HANGUL_S_BASE + lv) });
        }
        return None;
    }
    if in_inclusive_range(second, '\u{11A8}', '\u{11C2}') {
        let lv = u32::from(starter).wrapping_sub(HANGUL_S_BASE);
        if lv < HANGUL_S_COUNT && lv % HANGUL_T_COUNT == 0 {
            let lvt = lv + (u32::from(second) - HANGUL_T_BASE);
            // Safe, because the inputs are known to be in range.
            return Some(unsafe { char::from_u32_unchecked(HANGUL_S_BASE + lvt) });
        }
    }
    None
}

/// Performs (non-Hangul) canonical composition on a pair of characters
/// or returns `None` if these characters don't compose. Composition
/// exclusions are taken into account.
fn compose_non_hangul(mut iter: Char16TrieIterator, starter: char, second: char) -> Option<char> {
    // To make the trie smaller, the pairs are stored second character first.
    // Given how this method is used, it's known at each call site whether
    // `second` is or isn't a starter, so we could potentially split the
    // trie into two tries depending on whether `second` is a starter.
    match iter.next(second) {
        TrieResult::NoMatch => None,
        TrieResult::NoValue => match iter.next(starter) {
            TrieResult::NoMatch => None,
            TrieResult::FinalValue(i) => {
                if let Some(c) = char::from_u32(i as u32) {
                    Some(c)
                } else {
                    // GIGO case
                    debug_assert!(false);
                    None
                }
            }
            TrieResult::NoValue | TrieResult::Intermediate(_) => {
                // GIGO case
                debug_assert!(false);
                None
            }
        },
        TrieResult::FinalValue(_) | TrieResult::Intermediate(_) => {
            // GIGO case
            debug_assert!(false);
            None
        }
    }
}

/// Struct for holding together a character and the value
/// looked up for it from the NFD trie in a more explicit
/// way than an anonymous pair.
/// Also holds a flag about the supplementary-trie provenance.
#[derive(Debug, PartialEq, Eq)]
struct CharacterAndTrieValue {
    character: char,
    trie_val: u32,
    from_supplement: bool,
}

impl CharacterAndTrieValue {
    #[inline(always)]
    pub fn new(c: char, trie_value: u32) -> Self {
        CharacterAndTrieValue {
            character: c,
            trie_val: trie_value,
            from_supplement: false,
        }
    }
    #[inline(always)]
    pub fn new_from_supplement(c: char, trie_value: u32) -> Self {
        CharacterAndTrieValue {
            character: c,
            trie_val: trie_value,
            from_supplement: true,
        }
    }
    #[inline(always)]
    pub fn starter_and_decomposes_to_self(&self) -> bool {
        if self.trie_val > BACKWARD_COMBINING_STARTER_MARKER {
            return false;
        }
        // Hangul syllables get 0 as their trie value
        u32::from(self.character).wrapping_sub(HANGUL_S_BASE) >= HANGUL_S_COUNT
    }
    #[inline(always)]
    pub fn can_combine_backwards(&self) -> bool {
        decomposition_starts_with_non_starter(self.trie_val)
            || self.trie_val == BACKWARD_COMBINING_STARTER_MARKER
            || in_inclusive_range32(self.trie_val, 0x1161, 0x11C2)
    }
    #[inline(always)]
    pub fn potential_passthrough(&self) -> bool {
        self.potential_passthrough_impl(BACKWARD_COMBINING_STARTER_MARKER)
    }
    #[inline(always)]
    pub fn potential_passthrough_and_cannot_combine_backwards(&self) -> bool {
        self.potential_passthrough_impl(0)
    }
    #[inline(always)]
    fn potential_passthrough_impl(&self, bound: u32) -> bool {
        // This method looks badly branchy, but most characters
        // take the first return.
        if self.trie_val <= bound {
            return true;
        }
        if self.from_supplement {
            return false;
        }
        let trail_or_complex = (self.trie_val >> 16) as u16;
        if trail_or_complex == 0 {
            return false;
        }
        let lead = self.trie_val as u16;
        if lead == 0 {
            return true;
        }
        if lead == NON_ROUND_TRIP_MARKER {
            return false;
        }
        if (trail_or_complex & 0x7F) == 0x3C
            && in_inclusive_range16(trail_or_complex, 0x0900, 0x0BFF)
        {
            // Nukta
            return false;
        }
        if in_inclusive_range(self.character, '\u{FB1D}', '\u{FB4E}') {
            // Hebrew presentation forms
            return false;
        }
        if in_inclusive_range(self.character, '\u{1F71}', '\u{1FFB}') {
            // Polytonic Greek with oxia
            return false;
        }
        // To avoid more branchiness, 4 characters that decompose to
        // a BMP starter followed by a BMP non-starter are excluded
        // from being encoded directly into the trie value and are
        // handled as complex decompositions instead. These are:
        // U+0F76 TIBETAN VOWEL SIGN VOCALIC R
        // U+0F78 TIBETAN VOWEL SIGN VOCALIC L
        // U+212B ANGSTROM SIGN
        // U+2ADC FORKING
        true
    }
}

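As `potential_passthrough_impl` above (and `decomposing_next` further down) unpacks it, a decomposition into two BMP characters stores the starter in the low 16 bits of the trie value and the non-starter in the high 16 bits. A sketch with a synthetic value, constructed here purely for illustration rather than taken from real normalizer data:

```rust
// Split a packed decomposition trie value into its low (lead/starter)
// and high (trail/non-starter) 16-bit halves, as the code above does.
fn unpack(trie_val: u32) -> (u16, u16) {
    (trie_val as u16, (trie_val >> 16) as u16)
}

fn main() {
    // Hypothetical packed value for a character decomposing to
    // 'e' (U+0065, starter) followed by U+0301 (non-starter).
    let trie_val: u32 = (0x0301u32 << 16) | 0x0065;
    let (lead, trail_or_complex) = unpack(trie_val);
    assert_eq!(lead, 0x0065);
    assert_eq!(trail_or_complex, 0x0301);
    // lead > NON_ROUND_TRIP_MARKER (1) and trail != 0 selects the
    // two-BMP-character case in decomposing_next().
    assert!(lead > 1 && trail_or_complex != 0);
}
```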
/// Pack a `char` and a `CanonicalCombiningClass` in
/// 32 bits (the former in the lower 24 bits and the
/// latter in the high 8 bits). The latter can be
/// initialized to 0xFF upon creation, in which case
/// it can actually be set later by calling
/// `set_ccc_from_trie_if_not_already_set`. This is
/// a micro optimization to avoid the Canonical
/// Combining Class trie lookup when there is only
/// one combining character in a sequence. This type
/// is intentionally non-`Copy` to get compiler help
/// in making sure that the class is set on the
/// instance on which it is intended to be set
/// and not on a temporary copy.
///
/// Note that 0xFF won't be assigned to an actual
/// canonical combining class per definition D104
/// in The Unicode Standard.
//
// NOTE: The Pernosco debugger has special knowledge
// of this struct. Please do not change the bit layout
// or the crate-module-qualified name of this struct
// without coordination.
#[derive(Debug)]
struct CharacterAndClass(u32);

impl CharacterAndClass {
    pub fn new(c: char, ccc: CanonicalCombiningClass) -> Self {
        CharacterAndClass(u32::from(c) | (u32::from(ccc.0) << 24))
    }
    pub fn new_with_placeholder(c: char) -> Self {
        CharacterAndClass(u32::from(c) | ((0xFF) << 24))
    }
    pub fn new_with_trie_value(c_tv: CharacterAndTrieValue) -> Self {
        Self::new(c_tv.character, ccc_from_trie_value(c_tv.trie_val))
    }
    pub fn new_starter(c: char) -> Self {
        CharacterAndClass(u32::from(c))
    }
    pub fn character(&self) -> char {
        // Safe, because the low 24 bits came from a `char`
        // originally.
        unsafe { char::from_u32_unchecked(self.0 & 0xFFFFFF) }
    }
    pub fn ccc(&self) -> CanonicalCombiningClass {
        CanonicalCombiningClass((self.0 >> 24) as u8)
    }
    pub fn character_and_ccc(&self) -> (char, CanonicalCombiningClass) {
        (self.character(), self.ccc())
    }
    pub fn set_ccc_from_trie_if_not_already_set(&mut self, trie: &CodePointTrie<u32>) {
        if self.0 >> 24 != 0xFF {
            return;
        }
        let scalar = self.0 & 0xFFFFFF;
        self.0 = ((ccc_from_trie_value(trie.get32_u32(scalar)).0 as u32) << 24) | scalar;
    }
}

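The bit layout of `CharacterAndClass` can be exercised with a standalone sketch. For illustration, a plain `u8` stands in for `CanonicalCombiningClass`, and safe `char::from_u32` replaces the `unsafe` unchecked conversion:

```rust
// Mirrors CharacterAndClass above: char in the low 24 bits, combining
// class in the high 8 bits (every char fits in 21 bits).
struct CharAndClass(u32);

impl CharAndClass {
    fn new(c: char, ccc: u8) -> Self {
        CharAndClass(u32::from(c) | (u32::from(ccc) << 24))
    }
    fn character(&self) -> char {
        // The low 24 bits came from a char, so this cannot be a
        // surrogate or an out-of-range value.
        char::from_u32(self.0 & 0xFF_FFFF).unwrap()
    }
    fn ccc(&self) -> u8 {
        (self.0 >> 24) as u8
    }
}

fn main() {
    // U+0301 COMBINING ACUTE ACCENT has combining class 230.
    let cc = CharAndClass::new('\u{0301}', 230);
    assert_eq!(cc.character(), '\u{0301}');
    assert_eq!(cc.ccc(), 230);
}
```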
// This function exists as a borrow check helper.
#[inline(always)]
fn sort_slice_by_ccc(slice: &mut [CharacterAndClass], trie: &CodePointTrie<u32>) {
    // We don't look up the canonical combining class for starters
    // or for single combining characters between starters. When
    // there's more than one combining character between starters,
    // we look up the canonical combining class for each character
    // exactly once.
    if slice.len() < 2 {
        return;
    }
    slice
        .iter_mut()
        .for_each(|cc| cc.set_ccc_from_trie_if_not_already_set(trie));
    slice.sort_by_key(|cc| cc.ccc());
}

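Sorting combining marks by class must preserve the relative order of marks with equal classes (Unicode's Canonical Ordering Algorithm), which works here because Rust's `sort_by_key` is a stable sort. A minimal illustration with (character, class) pairs:

```rust
// Canonical ordering of combining marks: lower combining class first,
// equal classes keep their original relative order (stable sort).
fn canonical_order(marks: &mut [(char, u8)]) {
    marks.sort_by_key(|&(_, ccc)| ccc);
}

fn main() {
    // U+0301 COMBINING ACUTE ACCENT has class 230;
    // U+0323 COMBINING DOT BELOW has class 220 and must sort first.
    let mut marks = [('\u{0301}', 230u8), ('\u{0323}', 220u8)];
    canonical_order(&mut marks);
    assert_eq!(marks, [('\u{0323}', 220), ('\u{0301}', 230)]);
}
```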
/// An iterator adaptor that turns an `Iterator` over `char` into
/// a lazily-decomposed `char` sequence.
#[derive(Debug)]
pub struct Decomposition<'data, I>
where
    I: Iterator<Item = char>,
{
    delegate: I,
    buffer: SmallVec<[CharacterAndClass; 17]>, // Enough to hold NFKD for U+FDFA
    /// The index of the next item to be read from `buffer`.
    /// The purpose of this index is to avoid having to move
    /// the rest upon every read.
    buffer_pos: usize,
    // At the start of `next()` if not `None`, this is a pending unnormalized
    // starter. When `Decomposition` appears alone, this is never a non-starter.
    // However, when `Decomposition` appears inside a `Composition`, this
    // may become a non-starter before `decomposing_next()` is called.
    pending: Option<CharacterAndTrieValue>, // None at end of stream
    trie: &'data CodePointTrie<'data, u32>,
    supplementary_trie: Option<&'data CodePointTrie<'data, u32>>,
    scalars16: &'data ZeroSlice<u16>,
    scalars24: &'data ZeroSlice<char>,
    supplementary_scalars16: &'data ZeroSlice<u16>,
    supplementary_scalars24: &'data ZeroSlice<char>,
    half_width_voicing_marks_become_non_starters: bool,
    /// The lowest character for which either of the following does
    /// not hold:
    /// 1. Decomposes to self.
    /// 2. Decomposition starts with a non-starter
    decomposition_passthrough_bound: u32, // never above 0xC0
}

impl<'data, I> Decomposition<'data, I>
where
    I: Iterator<Item = char>,
{
    /// Constructs a decomposing iterator adapter from a delegate
    /// iterator and references to the necessary data, without
    /// supplementary data.
    ///
    /// Use `DecomposingNormalizer::normalize_iter()` instead unless
    /// there's a good reason to use this constructor directly.
    ///
    /// Public but hidden in order to be able to use this from the
    /// collator.
    #[doc(hidden)]
    pub fn new(
        delegate: I,
        decompositions: &'data DecompositionDataV1,
        tables: &'data DecompositionTablesV1,
    ) -> Self {
        Self::new_with_supplements(delegate, decompositions, None, tables, None, 0xC0)
    }

    /// Constructs a decomposing iterator adapter from a delegate
    /// iterator and references to the necessary data, including
    /// supplementary data.
    ///
    /// Use `DecomposingNormalizer::normalize_iter()` instead unless
    /// there's a good reason to use this constructor directly.
    fn new_with_supplements(
        delegate: I,
        decompositions: &'data DecompositionDataV1,
        supplementary_decompositions: Option<&'data DecompositionSupplementV1>,
        tables: &'data DecompositionTablesV1,
        supplementary_tables: Option<&'data DecompositionTablesV1>,
        decomposition_passthrough_bound: u8,
    ) -> Self {
        let half_width_voicing_marks_become_non_starters =
            if let Some(supplementary) = supplementary_decompositions {
                supplementary.half_width_voicing_marks_become_non_starters()
            } else {
                false
            };
        let mut ret = Decomposition::<I> {
            delegate,
            buffer: SmallVec::new(), // Normalized
            buffer_pos: 0,
            // Initialize with a placeholder starter in case
            // the real stream starts with a non-starter.
            pending: Some(CharacterAndTrieValue::new('\u{FFFF}', 0)),
            trie: &decompositions.trie,
            supplementary_trie: supplementary_decompositions.map(|s| &s.trie),
            scalars16: &tables.scalars16,
            scalars24: &tables.scalars24,
            supplementary_scalars16: if let Some(supplementary) = supplementary_tables {
                &supplementary.scalars16
            } else {
                EMPTY_U16
            },
            supplementary_scalars24: if let Some(supplementary) = supplementary_tables {
                &supplementary.scalars24
            } else {
                EMPTY_CHAR
            },
            half_width_voicing_marks_become_non_starters,
            decomposition_passthrough_bound: u32::from(decomposition_passthrough_bound),
        };
        let _ = ret.next(); // Remove the U+FFFF placeholder
        ret
    }

    fn push_decomposition16(
        &mut self,
        low: u16,
        offset: usize,
        slice16: &ZeroSlice<u16>,
    ) -> (char, usize) {
        let len = usize::from(low >> 13) + 2;
        let (starter, tail) = slice16
            .get_subslice(offset..offset + len)
            .and_then(|slice| slice.split_first())
            .map_or_else(
                || {
                    // GIGO case
                    debug_assert!(false);
                    (REPLACEMENT_CHARACTER, EMPTY_U16)
                },
                |(first, trail)| (char_from_u16(first), trail),
            );
        if low & 0x1000 != 0 {
            // All the rest are combining
            self.buffer.extend(
                tail.iter()
                    .map(|u| CharacterAndClass::new_with_placeholder(char_from_u16(u))),
            );
            (starter, 0)
        } else {
            let mut i = 0;
            let mut combining_start = 0;
            for u in tail.iter() {
                let ch = char_from_u16(u);
                let trie_value = self.trie.get(ch);
                self.buffer.push(CharacterAndClass::new_with_trie_value(
                    CharacterAndTrieValue::new(ch, trie_value),
                ));
                i += 1;
                // Half-width kana and iota subscript don't occur in the tails
                // of these multicharacter decompositions.
                if decomposition_starts_with_non_starter(trie_value) {
                    combining_start = i;
                }
            }
            (starter, combining_start)
        }
    }

    fn push_decomposition32(
        &mut self,
        low: u16,
        offset: usize,
        slice32: &ZeroSlice<char>,
    ) -> (char, usize) {
        let len = usize::from(low >> 13) + 1;
        let (starter, tail) = slice32
            .get_subslice(offset..offset + len)
            .and_then(|slice| slice.split_first())
            .unwrap_or_else(|| {
                // GIGO case
                debug_assert!(false);
                (REPLACEMENT_CHARACTER, EMPTY_CHAR)
            });
        if low & 0x1000 != 0 {
            // All the rest are combining
            self.buffer
                .extend(tail.iter().map(CharacterAndClass::new_with_placeholder));
            (starter, 0)
        } else {
            let mut i = 0;
            let mut combining_start = 0;
            for ch in tail.iter() {
                let trie_value = self.trie.get(ch);
                self.buffer.push(CharacterAndClass::new_with_trie_value(
                    CharacterAndTrieValue::new(ch, trie_value),
                ));
                i += 1;
                // Half-width kana and iota subscript don't occur in the tails
                // of these multicharacter decompositions.
                if decomposition_starts_with_non_starter(trie_value) {
                    combining_start = i;
                }
            }
            (starter, combining_start)
        }
    }

    #[inline(always)]
    fn attach_trie_value(&self, c: char) -> CharacterAndTrieValue {
        if let Some(supplementary) = self.supplementary_trie {
            if let Some(value) = self.attach_supplementary_trie_value(c, supplementary) {
                return value;
            }
        }

        CharacterAndTrieValue::new(c, self.trie.get(c))
    }

    #[inline(never)]
    fn attach_supplementary_trie_value(
        &self,
        c: char,
        supplementary: &CodePointTrie<u32>,
    ) -> Option<CharacterAndTrieValue> {
        let voicing_mark = u32::from(c).wrapping_sub(0xFF9E);
        if voicing_mark <= 1 && self.half_width_voicing_marks_become_non_starters {
            return Some(CharacterAndTrieValue::new(
                if voicing_mark == 0 {
                    '\u{3099}'
                } else {
                    '\u{309A}'
                },
                0xD800 | u32::from(CanonicalCombiningClass::KanaVoicing.0),
            ));
        }
        let trie_value = supplementary.get32(u32::from(c));
        if trie_value != 0 {
            return Some(CharacterAndTrieValue::new_from_supplement(c, trie_value));
        }
        None
    }

722
    fn delegate_next_no_pending(&mut self) -> Option<CharacterAndTrieValue> {
5,443✔
723
        debug_assert!(self.pending.is_none());
5,443✔
724
        let c = self.delegate.next()?;
5,443✔
725

726
        // TODO(#2384): Measure if this check is actually an optimization even in the
        // non-supplementary case, or if this check should go inside the supplementary
        // `if` below.
        if u32::from(c) < self.decomposition_passthrough_bound {
            return Some(CharacterAndTrieValue::new(c, 0));
        }

        Some(self.attach_trie_value(c))
    }

    fn delegate_next(&mut self) -> Option<CharacterAndTrieValue> {
        if let Some(pending) = self.pending.take() {
            // Only happens as part of `Composition` and as part of
            // the contiguous-buffer methods of `DecomposingNormalizer`.
            // I.e. does not happen as part of standalone iterator
            // usage of `Decomposition`.
            Some(pending)
        } else {
            self.delegate_next_no_pending()
        }
    }

    fn decomposing_next(&mut self, c_and_trie_val: CharacterAndTrieValue) -> char {
        let (starter, combining_start) = {
            let c = c_and_trie_val.character;
            let hangul_offset = u32::from(c).wrapping_sub(HANGUL_S_BASE); // SIndex in the spec
            if hangul_offset >= HANGUL_S_COUNT {
                let decomposition = c_and_trie_val.trie_val;
                if decomposition <= BACKWARD_COMBINING_STARTER_MARKER {
                    // The character is its own decomposition
                    (c, 0)
                } else {
                    let trail_or_complex = (decomposition >> 16) as u16;
                    let lead = decomposition as u16;
                    if lead > NON_ROUND_TRIP_MARKER && trail_or_complex != 0 {
                        // Decomposition into two BMP characters: starter and non-starter
                        let starter = char_from_u16(lead);
                        let combining = char_from_u16(trail_or_complex);
                        self.buffer
                            .push(CharacterAndClass::new_with_placeholder(combining));
                        (starter, 0)
                    } else if lead > NON_ROUND_TRIP_MARKER {
                        if lead != FDFA_MARKER {
                            debug_assert_ne!(
                                lead, SPECIAL_NON_STARTER_DECOMPOSITION_MARKER_U16,
                                "Should not reach this point with non-starter marker"
                            );
                            // Decomposition into one BMP character
                            let starter = char_from_u16(lead);
                            (starter, 0)
                        } else {
                            // Special case for the NFKD form of U+FDFA.
                            self.buffer.extend(FDFA_NFKD.map(|u| {
                                // Safe, because `FDFA_NFKD` is known not to contain
                                // surrogates.
                                CharacterAndClass::new_starter(unsafe {
                                    core::char::from_u32_unchecked(u32::from(u))
                                })
                            }));
                            ('\u{0635}', 17)
                        }
                    } else {
                        // Complex decomposition
                        // Format for 16-bit value:
                        // 15..13: length minus two for 16-bit case and length minus one for
                        //         the 32-bit case. Length 8 needs to fit in three bits in
                        //         the 16-bit case, and this way the value is future-proofed
                        //         up to 9 in the 16-bit case. Zero is unused and length one
                        //         in the 16-bit case goes directly into the trie.
                        //     12: 1 if all trailing characters are guaranteed non-starters,
                        //         0 if no guarantees about non-starterness.
                        //         Note: The bit choice is this way around to allow for
                        //         dynamically falling back to not having this but instead
                        //         having one more bit for length by merely choosing
                        //         different masks.
                        //  11..0: Start offset in storage. The offset is to the logical
                        //         sequence of scalars16, scalars32, supplementary_scalars16,
                        //         supplementary_scalars32.
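                        //
                        // Worked example (hypothetical value, not taken from real
                        // data): a 16-bit value of 0x5123 would decode as
                        //   bits 15..13 = 0b010 = 2 => 16-bit decomposition of length 2 + 2 = 4
                        //   bit  12     = 1         => trailing characters guaranteed non-starters
                        //   bits 11..0  = 0x123     => offset 291 into the logical scalar storage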
                        let offset = usize::from(trail_or_complex & 0xFFF);
                        if offset < self.scalars16.len() {
                            self.push_decomposition16(trail_or_complex, offset, self.scalars16)
                        } else if offset < self.scalars16.len() + self.scalars24.len() {
                            self.push_decomposition32(
                                trail_or_complex,
                                offset - self.scalars16.len(),
                                self.scalars24,
                            )
                        } else if offset
                            < self.scalars16.len()
                                + self.scalars24.len()
                                + self.supplementary_scalars16.len()
                        {
                            self.push_decomposition16(
                                trail_or_complex,
                                offset - (self.scalars16.len() + self.scalars24.len()),
                                self.supplementary_scalars16,
                            )
                        } else {
                            self.push_decomposition32(
                                trail_or_complex,
                                offset
                                    - (self.scalars16.len()
                                        + self.scalars24.len()
                                        + self.supplementary_scalars16.len()),
                                self.supplementary_scalars24,
                            )
                        }
                    }
                }
            } else {
                // Hangul syllable
                // The math here comes from page 144 of Unicode 14.0
                let l = hangul_offset / HANGUL_N_COUNT;
                let v = (hangul_offset % HANGUL_N_COUNT) / HANGUL_T_COUNT;
                let t = hangul_offset % HANGUL_T_COUNT;
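                // Worked example: U+AC01 (HANGUL SYLLABLE GAG) has
                // hangul_offset = 1, so l = 0, v = 0, t = 1, giving the
                // canonical decomposition U+1100 U+1161 U+11A8.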

                // The unsafe blocks here are OK, because the values stay
                // within the Hangul jamo block and, therefore, the scalar
                // value range by construction.
                self.buffer.push(CharacterAndClass::new_starter(unsafe {
                    core::char::from_u32_unchecked(HANGUL_V_BASE + v)
                }));
                let first = unsafe { core::char::from_u32_unchecked(HANGUL_L_BASE + l) };
                if t != 0 {
                    self.buffer.push(CharacterAndClass::new_starter(unsafe {
                        core::char::from_u32_unchecked(HANGUL_T_BASE + t)
                    }));
                    (first, 2)
                } else {
                    (first, 1)
                }
            }
        };
        // Either we're inside `Composition` or `self.pending.is_none()`.

        self.gather_and_sort_combining(combining_start);
        starter
    }

    fn gather_and_sort_combining(&mut self, combining_start: usize) {
        // Not a `for` loop to avoid holding a mutable reference to `self` across
        // the loop body.
        while let Some(ch_and_trie_val) = self.delegate_next() {
            if trie_value_has_ccc(ch_and_trie_val.trie_val) {
                self.buffer
                    .push(CharacterAndClass::new_with_trie_value(ch_and_trie_val));
            } else if trie_value_indicates_special_non_starter_decomposition(
                ch_and_trie_val.trie_val,
            ) {
                // The Tibetan special cases are starters that decompose into non-starters.
                let mapped = match ch_and_trie_val.character {
                    '\u{0340}' => {
                        // COMBINING GRAVE TONE MARK
                        CharacterAndClass::new('\u{0300}', CanonicalCombiningClass::Above)
                    }
                    '\u{0341}' => {
                        // COMBINING ACUTE TONE MARK
                        CharacterAndClass::new('\u{0301}', CanonicalCombiningClass::Above)
                    }
                    '\u{0343}' => {
                        // COMBINING GREEK KORONIS
                        CharacterAndClass::new('\u{0313}', CanonicalCombiningClass::Above)
                    }
                    '\u{0344}' => {
                        // COMBINING GREEK DIALYTIKA TONOS
                        self.buffer.push(CharacterAndClass::new(
                            '\u{0308}',
                            CanonicalCombiningClass::Above,
                        ));
                        CharacterAndClass::new('\u{0301}', CanonicalCombiningClass::Above)
                    }
                    '\u{0F73}' => {
                        // TIBETAN VOWEL SIGN II
                        self.buffer.push(CharacterAndClass::new(
                            '\u{0F71}',
                            CanonicalCombiningClass::CCC129,
                        ));
                        CharacterAndClass::new('\u{0F72}', CanonicalCombiningClass::CCC130)
                    }
                    '\u{0F75}' => {
                        // TIBETAN VOWEL SIGN UU
                        self.buffer.push(CharacterAndClass::new(
                            '\u{0F71}',
                            CanonicalCombiningClass::CCC129,
                        ));
                        CharacterAndClass::new('\u{0F74}', CanonicalCombiningClass::CCC132)
                    }
                    '\u{0F81}' => {
                        // TIBETAN VOWEL SIGN REVERSED II
                        self.buffer.push(CharacterAndClass::new(
                            '\u{0F71}',
                            CanonicalCombiningClass::CCC129,
                        ));
                        CharacterAndClass::new('\u{0F80}', CanonicalCombiningClass::CCC130)
                    }
                    _ => {
                        // GIGO case
                        debug_assert!(false);
                        CharacterAndClass::new_with_placeholder(REPLACEMENT_CHARACTER)
                    }
                };
                self.buffer.push(mapped);
            } else {
                self.pending = Some(ch_and_trie_val);
                break;
            }
        }
        // Slicing succeeds by construction; we've always ensured that `combining_start`
        // is in permissible range.
        #[allow(clippy::indexing_slicing)]
        sort_slice_by_ccc(&mut self.buffer[combining_start..], self.trie);
    }
}

impl<'data, I> Iterator for Decomposition<'data, I>
where
    I: Iterator<Item = char>,
{
    type Item = char;

    fn next(&mut self) -> Option<char> {
        if let Some(ret) = self.buffer.get(self.buffer_pos).map(|c| c.character()) {
            self.buffer_pos += 1;
            if self.buffer_pos == self.buffer.len() {
                self.buffer.clear();
                self.buffer_pos = 0;
            }
            return Some(ret);
        }
        debug_assert_eq!(self.buffer_pos, 0);
        let c_and_trie_val = self.pending.take()?;
        Some(self.decomposing_next(c_and_trie_val))
    }
}

/// An iterator adaptor that turns an `Iterator` over `char` into
/// a lazily-decomposed and then canonically composed `char` sequence.
#[derive(Debug)]
pub struct Composition<'data, I>
where
    I: Iterator<Item = char>,
{
    /// The decomposing part of the normalizer that operates before
    /// the canonical composition is performed on its output.
    decomposition: Decomposition<'data, I>,
    /// Non-Hangul canonical composition data.
    canonical_compositions: Char16Trie<'data>,
    /// To make `next()` yield in cases where there's a non-composing
    /// starter in the decomposition buffer, we put it here to let it
    /// wait for the next `next()` call (or a jump forward within the
    /// `next()` call).
    unprocessed_starter: Option<char>,
    /// The lowest character for which any one of the following does
    /// not hold:
    /// 1. Roundtrips via decomposition and recomposition.
    /// 2. Decomposition starts with a non-starter
    /// 3. Is not a backward-combining starter
    composition_passthrough_bound: u32,
}

impl<'data, I> Composition<'data, I>
where
    I: Iterator<Item = char>,
{
    fn new(
        decomposition: Decomposition<'data, I>,
        canonical_compositions: Char16Trie<'data>,
        composition_passthrough_bound: u16,
    ) -> Self {
        Self {
            decomposition,
            canonical_compositions,
            unprocessed_starter: None,
            composition_passthrough_bound: u32::from(composition_passthrough_bound),
        }
    }

    /// Performs canonical composition (including Hangul) on a pair of
    /// characters or returns `None` if these characters don't compose.
    /// Composition exclusions are taken into account.
    #[inline(always)]
    pub fn compose(&self, starter: char, second: char) -> Option<char> {
        compose(self.canonical_compositions.iter(), starter, second)
    }
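    //
    // Illustratively, composing U+0041 (A) with U+0300 (COMBINING GRAVE
    // ACCENT) yields Some('\u{00C0}'), while a pair subject to a composition
    // exclusion, or one with no canonical composite, yields `None`.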

    /// Performs (non-Hangul) canonical composition on a pair of characters
    /// or returns `None` if these characters don't compose. Composition
    /// exclusions are taken into account.
    #[inline(always)]
    fn compose_non_hangul(&self, starter: char, second: char) -> Option<char> {
        compose_non_hangul(self.canonical_compositions.iter(), starter, second)
    }
}

impl<'data, I> Iterator for Composition<'data, I>
where
    I: Iterator<Item = char>,
{
    type Item = char;

    #[inline]
    fn next(&mut self) -> Option<char> {
        let mut undecomposed_starter = CharacterAndTrieValue::new('\u{0}', 0); // The compiler can't figure out that this gets overwritten before use.
        if self.unprocessed_starter.is_none() {
            // The loop is only broken out of as goto forward
            #[allow(clippy::never_loop)]
            loop {
                if let Some((character, ccc)) = self
                    .decomposition
                    .buffer
                    .get(self.decomposition.buffer_pos)
                    .map(|c| c.character_and_ccc())
                {
                    self.decomposition.buffer_pos += 1;
                    if self.decomposition.buffer_pos == self.decomposition.buffer.len() {
                        self.decomposition.buffer.clear();
                        self.decomposition.buffer_pos = 0;
                    }
                    if ccc == CanonicalCombiningClass::NotReordered {
                        // Previous decomposition contains a starter. This must
                        // now become the `unprocessed_starter` for it to have
                        // a chance to compose with the upcoming characters.
                        //
                        // E.g. parenthesized Hangul in NFKC comes through here,
                        // but suitable composition exclusion could exercise this
                        // in NFC.
                        self.unprocessed_starter = Some(character);
                        break; // We already have a starter, so skip taking one from `pending`.
                    }
                    return Some(character);
                }
                debug_assert_eq!(self.decomposition.buffer_pos, 0);
                undecomposed_starter = self.decomposition.pending.take()?;
                if u32::from(undecomposed_starter.character) < self.composition_passthrough_bound
                    || undecomposed_starter.potential_passthrough()
                {
                    // TODO(#2385): In the NFC case (moot for NFKC and UTS46), if the upcoming
                    // character is not below `decomposition_passthrough_bound` but is
                    // below `composition_passthrough_bound`, we read from the trie
                    // unnecessarily.
                    if let Some(upcoming) = self.decomposition.delegate_next_no_pending() {
                        let cannot_combine_backwards = u32::from(upcoming.character)
                            < self.composition_passthrough_bound
                            || !upcoming.can_combine_backwards();
                        self.decomposition.pending = Some(upcoming);
                        if cannot_combine_backwards {
                            // Fast-track succeeded!
                            return Some(undecomposed_starter.character);
                        }
                    } else {
                        // End of stream
                        return Some(undecomposed_starter.character);
                    }
                }
                break; // Not actually looping
            }
        }
        let mut starter = '\u{0}'; // The compiler can't figure out this gets overwritten before use.

        // The point of having this boolean is to have only one call site to
        // `self.decomposition.decomposing_next`, which is hopefully beneficial for
        // code size under inlining.
        let mut attempt_composition = false;
        loop {
            if let Some(unprocessed) = self.unprocessed_starter.take() {
                debug_assert_eq!(undecomposed_starter, CharacterAndTrieValue::new('\u{0}', 0));
                debug_assert_eq!(starter, '\u{0}');
                starter = unprocessed;
            } else {
                debug_assert_eq!(self.decomposition.buffer_pos, 0);
                let next_starter = self.decomposition.decomposing_next(undecomposed_starter);
                if !attempt_composition {
                    starter = next_starter;
                } else if let Some(composed) = self.compose(starter, next_starter) {
                    starter = composed;
                } else {
                    // This is our yield point. We'll pick this up above in the
                    // next call to `next()`.
                    self.unprocessed_starter = Some(next_starter);
                    return Some(starter);
                }
            }
            // We first loop by index to avoid moving the contents of `buffer`, but
            // if there's a discontiguous match, we'll start modifying `buffer` instead.
            loop {
                let (character, ccc) = if let Some((character, ccc)) = self
                    .decomposition
                    .buffer
                    .get(self.decomposition.buffer_pos)
                    .map(|c| c.character_and_ccc())
                {
                    (character, ccc)
                } else {
                    self.decomposition.buffer.clear();
                    self.decomposition.buffer_pos = 0;
                    break;
                };
                if let Some(composed) = self.compose(starter, character) {
                    starter = composed;
                    self.decomposition.buffer_pos += 1;
                    continue;
                }
                let mut most_recent_skipped_ccc = ccc;
                {
                    let _ = self
                        .decomposition
                        .buffer
                        .drain(0..self.decomposition.buffer_pos);
                }
                self.decomposition.buffer_pos = 0;
                if most_recent_skipped_ccc == CanonicalCombiningClass::NotReordered {
                    // We failed to compose a starter. Discontiguous match not allowed.
                    // We leave the starter in `buffer` for `next()` to find.
                    return Some(starter);
                }
                let mut i = 1; // We have skipped one non-starter.
                while let Some((character, ccc)) = self
                    .decomposition
                    .buffer
                    .get(i)
                    .map(|c| c.character_and_ccc())
                {
                    if ccc == CanonicalCombiningClass::NotReordered {
                        // Discontiguous match not allowed.
                        return Some(starter);
                    }
                    debug_assert!(ccc >= most_recent_skipped_ccc);
                    if ccc != most_recent_skipped_ccc {
                        // Using the non-Hangul version as a micro-optimization, since
                        // we already rejected the case where `second` is a starter
                        // above, and conjoining jamo are starters.
                        if let Some(composed) = self.compose_non_hangul(starter, character) {
                            self.decomposition.buffer.remove(i);
                            starter = composed;
                            continue;
                        }
                    }
                    most_recent_skipped_ccc = ccc;
                    i += 1;
                }
                break;
            }

            debug_assert_eq!(self.decomposition.buffer_pos, 0);

            if !self.decomposition.buffer.is_empty() {
                return Some(starter);
            }
            // Now we need to check if composition with an upcoming starter is possible.
            #[allow(clippy::unwrap_used)]
            if self.decomposition.pending.is_some() {
                // We know that `pending_starter` decomposes to start with a starter.
                // Otherwise, it would have been moved to `self.decomposition.buffer`
                // by `self.decomposing_next()`. We do this set lookup here in order
                // to get an opportunity to go back to the fast track.
                // Note that this check has to happen _after_ checking that `pending`
                // holds a character, because this flag isn't defined to be meaningful
                // when `pending` isn't holding a character.
                let pending = self.decomposition.pending.as_ref().unwrap();
                if u32::from(pending.character) < self.composition_passthrough_bound
                    || !pending.can_combine_backwards()
                {
                    // Won't combine backwards anyway.
                    return Some(starter);
                }
                // Consume what we peeked. `unwrap` OK, because we checked `is_some()`
                // above.
                undecomposed_starter = self.decomposition.pending.take().unwrap();
                // The following line is OK, because we're about to loop back
                // to `self.decomposition.decomposing_next(c);`, which will
                // restore the between-`next()`-calls invariant of `pending`
                // before this function returns.
                attempt_composition = true;
                continue;
            }
            // End of input
            return Some(starter);
        }
    }
}

macro_rules! composing_normalize_to {
    ($(#[$meta:meta])*,
     $normalize_to:ident,
     $write:path,
     $slice:ty,
     $prolog:block,
     $always_valid_utf:literal,
     $as_slice:ident,
     $fast:block,
     $text:ident,
     $sink:ident,
     $composition:ident,
     $composition_passthrough_bound:ident,
     $undecomposed_starter:ident,
     $pending_slice:ident,
     $len_utf:ident,
    ) => {
        $(#[$meta])*
        pub fn $normalize_to<W: $write + ?Sized>(
            &self,
            $text: $slice,
            $sink: &mut W,
        ) -> core::fmt::Result {
            $prolog
            let mut $composition = self.normalize_iter($text.chars());
            for cc in $composition.decomposition.buffer.drain(..) {
                $sink.write_char(cc.character())?;
            }

            // Try to get the compiler to hoist the bound to a register.
            let $composition_passthrough_bound = $composition.composition_passthrough_bound;
            'outer: loop {
                debug_assert_eq!($composition.decomposition.buffer_pos, 0);
                let mut $undecomposed_starter =
                    if let Some(pending) = $composition.decomposition.pending.take() {
                        pending
                    } else {
                        return Ok(());
                    };
                // Allowing indexed slicing, because a failure would be a code bug and
                // not a data issue.
                #[allow(clippy::indexing_slicing)]
                if u32::from($undecomposed_starter.character) < $composition_passthrough_bound ||
                    $undecomposed_starter.potential_passthrough()
                {
                    // We don't know if a `REPLACEMENT_CHARACTER` occurred in the slice or
                    // was returned in response to an error by the iterator. Assume the
                    // latter for correctness even though it pessimizes the former.
                    if $always_valid_utf || $undecomposed_starter.character != REPLACEMENT_CHARACTER {
                        let $pending_slice = &$text[$text.len() - $composition.decomposition.delegate.$as_slice().len() - $undecomposed_starter.character.$len_utf()..];
                        // The `$fast` block must either:
                        // 1. Return due to reaching EOF
                        // 2. Leave a starter with its trie value in `$undecomposed_starter`
                        //    and, if there is still more input, leave the next character
                        //    and its trie value in `$composition.decomposition.pending`.
                        $fast
                    }
                }
                // Fast track above, full algorithm below
                let mut starter = $composition
                    .decomposition
                    .decomposing_next($undecomposed_starter);
                'bufferloop: loop {
                    // We first loop by index to avoid moving the contents of `buffer`, but
                    // if there's a discontiguous match, we'll start modifying `buffer` instead.
                    loop {
                        let (character, ccc) = if let Some((character, ccc)) = $composition
                            .decomposition
                            .buffer
                            .get($composition.decomposition.buffer_pos)
                            .map(|c| c.character_and_ccc())
                        {
                            (character, ccc)
                        } else {
                            $composition.decomposition.buffer.clear();
                            $composition.decomposition.buffer_pos = 0;
                            break;
                        };
                        if let Some(composed) = $composition.compose(starter, character) {
                            starter = composed;
                            $composition.decomposition.buffer_pos += 1;
                            continue;
                        }
                        let mut most_recent_skipped_ccc = ccc;
                        if most_recent_skipped_ccc == CanonicalCombiningClass::NotReordered {
                            // We failed to compose a starter. Discontiguous match not allowed.
                            // Write the current `starter` we've been composing, make the unmatched
                            // starter in the buffer the new `starter` (we know it's been decomposed)
                            // and process the rest of the buffer with that as the starter.
                            $sink.write_char(starter)?;
                            starter = character;
                            $composition.decomposition.buffer_pos += 1;
                            continue 'bufferloop;
                        } else {
                            {
                                let _ = $composition
                                    .decomposition
                                    .buffer
                                    .drain(0..$composition.decomposition.buffer_pos);
                            }
                            $composition.decomposition.buffer_pos = 0;
                        }
                        let mut i = 1; // We have skipped one non-starter.
                        while let Some((character, ccc)) = $composition
                            .decomposition
                            .buffer
                            .get(i)
                            .map(|c| c.character_and_ccc())
                        {
                            if ccc == CanonicalCombiningClass::NotReordered {
                                // Discontiguous match not allowed.
                                $sink.write_char(starter)?;
                                for cc in $composition.decomposition.buffer.drain(..i) {
                                    $sink.write_char(cc.character())?;
                                }
                                starter = character;
                                {
                                    let removed = $composition.decomposition.buffer.remove(0);
                                    debug_assert_eq!(starter, removed.character());
                                }
                                debug_assert_eq!($composition.decomposition.buffer_pos, 0);
                                continue 'bufferloop;
                            }
                            debug_assert!(ccc >= most_recent_skipped_ccc);
                            if ccc != most_recent_skipped_ccc {
                                // Using the non-Hangul version as a micro-optimization, since
                                // we already rejected the case where `second` is a starter
                                // above, and conjoining jamo are starters.
                                if let Some(composed) =
                                    $composition.compose_non_hangul(starter, character)
                                {
                                    $composition.decomposition.buffer.remove(i);
                                    starter = composed;
                                    continue;
                                }
                            }
                            most_recent_skipped_ccc = ccc;
                            i += 1;
                        }
                        break;
                    }
                    debug_assert_eq!($composition.decomposition.buffer_pos, 0);
105✔
1349

1350
                    if !$composition.decomposition.buffer.is_empty() {
105✔
1351
                        $sink.write_char(starter)?;
9✔
1352
                        for cc in $composition.decomposition.buffer.drain(..) {
18✔
1353
                            $sink.write_char(cc.character())?;
9✔
1354
                        }
1355
                        // We had non-empty buffer, so can't compose with upcoming.
1356
                        continue 'outer;
1357
                    }
1358
                    // Now we need to check if composition with an upcoming starter is possible.
1359
                    if $composition.decomposition.pending.is_some() {
96✔
1360
                        // We know that `pending_starter` decomposes to start with a starter.
1361
                        // Otherwise, it would have been moved to `composition.decomposition.buffer`
1362
                        // by `composition.decomposing_next()`. We do this set lookup here in order
1363
                        // to get an opportunity to go back to the fast track.
1364
                        // Note that this check has to happen _after_ checking that `pending`
1365
                        // holds a character, because this flag isn't defined to be meaningful
1366
                        // when `pending` isn't holding a character.
1367
                        let pending = $composition.decomposition.pending.as_ref().unwrap();
70✔
1368
                        if u32::from(pending.character) < $composition.composition_passthrough_bound
134✔
1369
                            || !pending.can_combine_backwards()
64✔
1370
                        {
1371
                            // Won't combine backwards anyway.
1372
                            $sink.write_char(starter)?;
70✔
1373
                            continue 'outer;
1374
                        }
1375
                        let pending_starter = $composition.decomposition.pending.take().unwrap();
×
1376
                        let decomposed = $composition.decomposition.decomposing_next(pending_starter);
×
1377
                        if let Some(composed) = $composition.compose(starter, decomposed) {
×
1378
                            starter = composed;
×
1379
                        } else {
1380
                            $sink.write_char(starter)?;
×
1381
                            starter = decomposed;
×
1382
                        }
1383
                        continue 'bufferloop;
1384
                    }
1385
                    // End of input
1386
                    $sink.write_char(starter)?;
26✔
1387
                    return Ok(());
26✔
1388
                } // 'bufferloop
1389
            }
1390
        }
1,264✔
1391
    };
1392
}

macro_rules! decomposing_normalize_to {
    ($(#[$meta:meta])*,
     $normalize_to:ident,
     $write:path,
     $slice:ty,
     $prolog:block,
     $as_slice:ident,
     $fast:block,
     $text:ident,
     $sink:ident,
     $decomposition:ident,
     $decomposition_passthrough_bound:ident,
     $undecomposed_starter:ident,
     $pending_slice:ident,
     $outer:lifetime, // loop labels use lifetime tokens
    ) => {
        $(#[$meta])*
        pub fn $normalize_to<W: $write + ?Sized>(
            &self,
            $text: $slice,
            $sink: &mut W,
        ) -> core::fmt::Result {
            $prolog

            let mut $decomposition = self.normalize_iter($text.chars());

            // Try to get the compiler to hoist the bound to a register.
            let $decomposition_passthrough_bound = $decomposition.decomposition_passthrough_bound;
            $outer: loop {
                for cc in $decomposition.buffer.drain(..) {
                    $sink.write_char(cc.character())?;
                }
                debug_assert_eq!($decomposition.buffer_pos, 0);
                let mut $undecomposed_starter = if let Some(pending) = $decomposition.pending.take() {
                    pending
                } else {
                    return Ok(());
                };
                // Allowing indexed slicing, because a failure would be a code bug and
                // not a data issue.
                #[allow(clippy::indexing_slicing)]
                if $undecomposed_starter.starter_and_decomposes_to_self() {
                    // Don't bother including `undecomposed_starter` in a contiguous buffer
                    // write: Just write it right away:
                    $sink.write_char($undecomposed_starter.character)?;

                    let $pending_slice = $decomposition.delegate.$as_slice();
                    $fast
                }
                let starter = $decomposition.decomposing_next($undecomposed_starter);
                $sink.write_char(starter)?;
            }
        }
    };
}

macro_rules! normalizer_methods {
    () => {
        /// Normalize a string slice into a `String`.
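        ///
        /// For context, NFD maps U+00E9 to the two-scalar sequence
        /// U+0065 U+0301; the two spellings are canonically equivalent but
        /// not byte-identical, which is why normalization is needed before
        /// comparing strings. A std-only sketch of that fact (this snippet
        /// does not call the normalizer itself):
        ///
        /// ```
        /// let composed = "\u{00E9}"; // é as a single scalar
        /// let decomposed = "e\u{0301}"; // e + COMBINING ACUTE ACCENT
        /// assert_ne!(composed, decomposed);
        /// assert_eq!(composed.chars().count(), 1);
        /// assert_eq!(decomposed.chars().count(), 2);
        /// ```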
1453
        pub fn normalize(&self, text: &str) -> String {
3,168✔
1454
            let mut ret = String::new();
3,168✔
1455
            ret.reserve(text.len());
3,168✔
1456
            let _ = self.normalize_to(text, &mut ret);
3,168✔
1457
            ret
1458
        }
3,168✔
1459

1460
        /// Check whether a string slice is normalized.
1461
        pub fn is_normalized(&self, text: &str) -> bool {
16✔
1462
            let mut sink = IsNormalizedSinkStr::new(text);
16✔
1463
            if self.normalize_to(text, &mut sink).is_err() {
16✔
1464
                return false;
4✔
1465
            }
1466
            sink.finished()
12✔
1467
        }
16✔
1468

1469
        /// Normalize a slice of potentially-invalid UTF-16 into a `Vec`.
1470
        ///
1471
        /// Unpaired surrogates are mapped to the REPLACEMENT CHARACTER
1472
        /// before normalizing.
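        ///
        /// This is the same unpaired-surrogate policy that
        /// `String::from_utf16_lossy` applies. A std-only sketch of that
        /// policy (this snippet does not call the normalizer itself):
        ///
        /// ```
        /// // 0xD800 is an unpaired high surrogate.
        /// let lossy = String::from_utf16_lossy(&[0x0061, 0xD800]);
        /// assert_eq!(lossy, "a\u{FFFD}");
        /// ```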
        pub fn normalize_utf16(&self, text: &[u16]) -> Vec<u16> {
            let mut ret = Vec::new();
            let _ = self.normalize_utf16_to(text, &mut ret);
            ret
        }

        /// Check whether a slice of potentially-invalid UTF-16 is normalized.
        ///
        /// Unpaired surrogates are treated as the REPLACEMENT CHARACTER.
        pub fn is_normalized_utf16(&self, text: &[u16]) -> bool {
            let mut sink = IsNormalizedSinkUtf16::new(text);
            if self.normalize_utf16_to(text, &mut sink).is_err() {
                return false;
            }
            sink.finished()
        }

        /// Normalize a slice of potentially-invalid UTF-8 into a `String`.
        ///
        /// Ill-formed byte sequences are mapped to the REPLACEMENT CHARACTER
        /// according to the WHATWG Encoding Standard.
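        ///
        /// The substitution is the same kind of replacement-character
        /// mapping that `String::from_utf8_lossy` performs (used here only
        /// as a std-only illustration; this snippet does not call the
        /// normalizer itself):
        ///
        /// ```
        /// // 0xFF can never appear in well-formed UTF-8.
        /// let lossy = String::from_utf8_lossy(&[0x61, 0xFF, 0x62]);
        /// assert_eq!(lossy, "a\u{FFFD}b");
        /// ```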
        pub fn normalize_utf8(&self, text: &[u8]) -> String {
            let mut ret = String::new();
            ret.reserve(text.len());
            let _ = self.normalize_utf8_to(text, &mut ret);
            ret
        }

        /// Check whether a slice of potentially-invalid UTF-8 is normalized.
        ///
        /// Ill-formed byte sequences are mapped to the REPLACEMENT CHARACTER
        /// according to the WHATWG Encoding Standard before checking.
        pub fn is_normalized_utf8(&self, text: &[u8]) -> bool {
            let mut sink = IsNormalizedSinkUtf8::new(text);
            if self.normalize_utf8_to(text, &mut sink).is_err() {
                return false;
            }
            sink.finished()
        }
    };
}

/// A normalizer for performing decomposing normalization.
#[derive(Debug)]
pub struct DecomposingNormalizer {
    decompositions: DataPayload<CanonicalDecompositionDataV1Marker>,
    supplementary_decompositions: Option<SupplementPayloadHolder>,
    tables: DataPayload<CanonicalDecompositionTablesV1Marker>,
    supplementary_tables: Option<DataPayload<CompatibilityDecompositionTablesV1Marker>>,
    decomposition_passthrough_bound: u8, // never above 0xC0
    composition_passthrough_bound: u16,  // never above 0x0300
}

impl DecomposingNormalizer {
    /// NFD constructor using compiled data.
    ///
    /// ✨ *Enabled with the `compiled_data` Cargo feature.*
    ///
    /// [📚 Help choosing a constructor](icu_provider::constructors)
    #[cfg(feature = "compiled_data")]
    pub const fn new_nfd() -> Self {
        const _: () = assert!(
            crate::provider::Baked::SINGLETON_NORMALIZER_NFDEX_V1
                .scalars16
                .const_len()
                + crate::provider::Baked::SINGLETON_NORMALIZER_NFDEX_V1
                    .scalars24
                    .const_len()
                <= 0xFFF,
            "NormalizerError::FutureExtension"
        );

        DecomposingNormalizer {
            decompositions: DataPayload::from_static_ref(
                crate::provider::Baked::SINGLETON_NORMALIZER_NFD_V1,
            ),
            supplementary_decompositions: None,
            tables: DataPayload::from_static_ref(
                crate::provider::Baked::SINGLETON_NORMALIZER_NFDEX_V1,
            ),
            supplementary_tables: None,
            decomposition_passthrough_bound: 0xC0,
            composition_passthrough_bound: 0x0300,
        }
    }

    icu_provider::gen_any_buffer_data_constructors!(
        locale: skip,
        options: skip,
        error: NormalizerError,
        #[cfg(skip)]
        functions: [
            new_nfd,
            try_new_nfd_with_any_provider,
            try_new_nfd_with_buffer_provider,
            try_new_nfd_unstable,
            Self,
        ]
    );

    #[doc = icu_provider::gen_any_buffer_unstable_docs!(UNSTABLE, Self::new_nfd)]
    pub fn try_new_nfd_unstable<D>(provider: &D) -> Result<Self, NormalizerError>
    where
        D: DataProvider<CanonicalDecompositionDataV1Marker>
            + DataProvider<CanonicalDecompositionTablesV1Marker>
            + ?Sized,
    {
        let decompositions: DataPayload<CanonicalDecompositionDataV1Marker> =
            provider.load(Default::default())?.take_payload()?;
        let tables: DataPayload<CanonicalDecompositionTablesV1Marker> =
            provider.load(Default::default())?.take_payload()?;

        if tables.get().scalars16.len() + tables.get().scalars24.len() > 0xFFF {
            // The data is from a future where there exists a normalization flavor whose
            // complex decompositions take more than 0xFFF but fewer than 0x1FFF code points
            // of space. If a good use case for such a decomposition flavor arises, we can
            // dynamically change the bit masks so that the length mask becomes 0x1FFF instead
            // of 0xFFF and the all-non-starters mask becomes 0 instead of 0x1000. However,
            // since for now the masks are hard-coded, error out.
            return Err(NormalizerError::FutureExtension);
        }

        Ok(DecomposingNormalizer {
            decompositions,
            supplementary_decompositions: None,
            tables,
            supplementary_tables: None,
            decomposition_passthrough_bound: 0xC0,
            composition_passthrough_bound: 0x0300,
        })
    }

    /// NFKD constructor using compiled data.
    ///
    /// ✨ *Enabled with the `compiled_data` Cargo feature.*
    ///
    /// [📚 Help choosing a constructor](icu_provider::constructors)
    #[cfg(feature = "compiled_data")]
    pub const fn new_nfkd() -> Self {
        const _: () = assert!(
            crate::provider::Baked::SINGLETON_NORMALIZER_NFDEX_V1
                .scalars16
                .const_len()
                + crate::provider::Baked::SINGLETON_NORMALIZER_NFDEX_V1
                    .scalars24
                    .const_len()
                + crate::provider::Baked::SINGLETON_NORMALIZER_NFKDEX_V1
                    .scalars16
                    .const_len()
                + crate::provider::Baked::SINGLETON_NORMALIZER_NFKDEX_V1
                    .scalars24
                    .const_len()
                <= 0xFFF,
            "NormalizerError::FutureExtension"
        );

        const _: () = assert!(
            crate::provider::Baked::SINGLETON_NORMALIZER_NFKD_V1.passthrough_cap <= 0x0300,
            "NormalizerError::ValidationError"
        );

        let decomposition_capped =
            if crate::provider::Baked::SINGLETON_NORMALIZER_NFKD_V1.passthrough_cap < 0xC0 {
                crate::provider::Baked::SINGLETON_NORMALIZER_NFKD_V1.passthrough_cap
            } else {
                0xC0
            };
        let composition_capped =
            if crate::provider::Baked::SINGLETON_NORMALIZER_NFKD_V1.passthrough_cap < 0x0300 {
                crate::provider::Baked::SINGLETON_NORMALIZER_NFKD_V1.passthrough_cap
            } else {
                0x0300
            };

        DecomposingNormalizer {
            decompositions: DataPayload::from_static_ref(
                crate::provider::Baked::SINGLETON_NORMALIZER_NFD_V1,
            ),
            supplementary_decompositions: Some(SupplementPayloadHolder::Compatibility(
                DataPayload::from_static_ref(crate::provider::Baked::SINGLETON_NORMALIZER_NFKD_V1),
            )),
            tables: DataPayload::from_static_ref(
                crate::provider::Baked::SINGLETON_NORMALIZER_NFDEX_V1,
            ),
            supplementary_tables: Some(DataPayload::from_static_ref(
                crate::provider::Baked::SINGLETON_NORMALIZER_NFKDEX_V1,
            )),
            decomposition_passthrough_bound: decomposition_capped as u8,
            composition_passthrough_bound: composition_capped,
        }
    }

    icu_provider::gen_any_buffer_data_constructors!(
        locale: skip,
        options: skip,
        error: NormalizerError,
        #[cfg(skip)]
        functions: [
            new_nfkd,
            try_new_nfkd_with_any_provider,
            try_new_nfkd_with_buffer_provider,
            try_new_nfkd_unstable,
            Self,
        ]
    );

    #[doc = icu_provider::gen_any_buffer_unstable_docs!(UNSTABLE, Self::new_nfkd)]
    pub fn try_new_nfkd_unstable<D>(provider: &D) -> Result<Self, NormalizerError>
    where
        D: DataProvider<CanonicalDecompositionDataV1Marker>
            + DataProvider<CompatibilityDecompositionSupplementV1Marker>
            + DataProvider<CanonicalDecompositionTablesV1Marker>
            + DataProvider<CompatibilityDecompositionTablesV1Marker>
            + ?Sized,
    {
        let decompositions: DataPayload<CanonicalDecompositionDataV1Marker> =
            provider.load(Default::default())?.take_payload()?;
        let supplementary_decompositions: DataPayload<
            CompatibilityDecompositionSupplementV1Marker,
        > = provider.load(Default::default())?.take_payload()?;
        let tables: DataPayload<CanonicalDecompositionTablesV1Marker> =
            provider.load(Default::default())?.take_payload()?;
        let supplementary_tables: DataPayload<CompatibilityDecompositionTablesV1Marker> =
            provider.load(Default::default())?.take_payload()?;

        if tables.get().scalars16.len()
            + tables.get().scalars24.len()
            + supplementary_tables.get().scalars16.len()
            + supplementary_tables.get().scalars24.len()
            > 0xFFF
        {
            // The data is from a future where there exists a normalization flavor whose
            // complex decompositions take more than 0xFFF but fewer than 0x1FFF code points
            // of space. If a good use case for such a decomposition flavor arises, we can
            // dynamically change the bit masks so that the length mask becomes 0x1FFF instead
            // of 0xFFF and the all-non-starters mask becomes 0 instead of 0x1000. However,
            // since for now the masks are hard-coded, error out.
            return Err(NormalizerError::FutureExtension);
        }

        let cap = supplementary_decompositions.get().passthrough_cap;
        if cap > 0x0300 {
            return Err(NormalizerError::ValidationError);
        }
        let decomposition_capped = cap.min(0xC0);
        let composition_capped = cap.min(0x0300);

        Ok(DecomposingNormalizer {
            decompositions,
            supplementary_decompositions: Some(SupplementPayloadHolder::Compatibility(
                supplementary_decompositions,
            )),
            tables,
            supplementary_tables: Some(supplementary_tables),
            decomposition_passthrough_bound: decomposition_capped as u8,
            composition_passthrough_bound: composition_capped,
        })
    }

    #[doc(hidden)]
    #[cfg(all(feature = "experimental", feature = "compiled_data"))]
    pub const fn new_uts46_decomposed_without_ignored_and_disallowed() -> Self {
        const _: () = assert!(
            crate::provider::Baked::SINGLETON_NORMALIZER_NFDEX_V1
                .scalars16
                .const_len()
                + crate::provider::Baked::SINGLETON_NORMALIZER_NFDEX_V1
                    .scalars24
                    .const_len()
                + crate::provider::Baked::SINGLETON_NORMALIZER_NFKDEX_V1
                    .scalars16
                    .const_len()
                + crate::provider::Baked::SINGLETON_NORMALIZER_NFKDEX_V1
                    .scalars24
                    .const_len()
                <= 0xFFF,
            "NormalizerError::FutureExtension"
        );

        const _: () = assert!(
            crate::provider::Baked::SINGLETON_NORMALIZER_UTS46D_V1.passthrough_cap <= 0x0300,
            "NormalizerError::ValidationError"
        );

        let decomposition_capped =
            if crate::provider::Baked::SINGLETON_NORMALIZER_UTS46D_V1.passthrough_cap < 0xC0 {
                crate::provider::Baked::SINGLETON_NORMALIZER_UTS46D_V1.passthrough_cap
            } else {
                0xC0
            };
        let composition_capped =
            if crate::provider::Baked::SINGLETON_NORMALIZER_UTS46D_V1.passthrough_cap < 0x0300 {
                crate::provider::Baked::SINGLETON_NORMALIZER_UTS46D_V1.passthrough_cap
            } else {
                0x0300
            };

        DecomposingNormalizer {
            decompositions: DataPayload::from_static_ref(
                crate::provider::Baked::SINGLETON_NORMALIZER_NFD_V1,
            ),
            supplementary_decompositions: Some(SupplementPayloadHolder::Uts46(
                DataPayload::from_static_ref(
                    crate::provider::Baked::SINGLETON_NORMALIZER_UTS46D_V1,
                ),
            )),
            tables: DataPayload::from_static_ref(
                crate::provider::Baked::SINGLETON_NORMALIZER_NFDEX_V1,
            ),
            supplementary_tables: Some(DataPayload::from_static_ref(
                crate::provider::Baked::SINGLETON_NORMALIZER_NFKDEX_V1,
            )),
            decomposition_passthrough_bound: decomposition_capped as u8,
            composition_passthrough_bound: composition_capped,
        }
    }

    /// UTS 46 decomposed constructor (testing only)
    ///
    /// This is a special building block normalization for IDNA. It is the decomposed counterpart of
    /// ICU4C's UTS 46 normalization with two exceptions: characters that UTS 46 disallows and
    /// ICU4C maps to U+FFFD and characters that UTS 46 maps to the empty string normalize as in
    /// NFD in this normalization. In both cases, the UTS 46 processing that precedes
    /// normalization is expected to deal with these characters. Making the disallowed characters
    /// behave like this is beneficial to data size, and this normalizer implementation cannot
    /// deal with a character normalizing to the empty string, which doesn't happen in NFD or
    /// NFKD as of Unicode 14.
    ///
    /// Warning: In this normalization, U+0345 COMBINING GREEK YPOGEGRAMMENI exhibits a behavior
    /// that no character in Unicode exhibits in NFD, NFKD, NFC, or NFKC: Case folding turns
    /// U+0345 from a reordered character into a non-reordered character before reordering happens.
    /// Therefore, the output of this normalization may differ for different inputs that are
    /// canonically equivalent with each other if they differ by how U+0345 is ordered relative
    /// to other reorderable characters.
    ///
    /// Public for testing only.
    #[doc(hidden)]
    #[cfg(feature = "experimental")]
    pub fn try_new_uts46_decomposed_without_ignored_and_disallowed_unstable<D>(
        provider: &D,
    ) -> Result<Self, NormalizerError>
    where
        D: DataProvider<CanonicalDecompositionDataV1Marker>
            + DataProvider<Uts46DecompositionSupplementV1Marker>
            + DataProvider<CanonicalDecompositionTablesV1Marker>
            + DataProvider<CompatibilityDecompositionTablesV1Marker>
            // UTS 46 tables merged into CompatibilityDecompositionTablesV1Marker
            + ?Sized,
    {
        let decompositions: DataPayload<CanonicalDecompositionDataV1Marker> =
            provider.load(Default::default())?.take_payload()?;
        let supplementary_decompositions: DataPayload<Uts46DecompositionSupplementV1Marker> =
            provider.load(Default::default())?.take_payload()?;
        let tables: DataPayload<CanonicalDecompositionTablesV1Marker> =
            provider.load(Default::default())?.take_payload()?;
        let supplementary_tables: DataPayload<CompatibilityDecompositionTablesV1Marker> =
            provider.load(Default::default())?.take_payload()?;

        if tables.get().scalars16.len()
            + tables.get().scalars24.len()
            + supplementary_tables.get().scalars16.len()
            + supplementary_tables.get().scalars24.len()
            > 0xFFF
        {
            // The data is from a future where there exists a normalization flavor whose
            // complex decompositions take more than 0xFFF but fewer than 0x1FFF code points
            // of space. If a good use case for such a decomposition flavor arises, we can
            // dynamically change the bit masks so that the length mask becomes 0x1FFF instead
            // of 0xFFF and the all-non-starters mask becomes 0 instead of 0x1000. However,
            // since for now the masks are hard-coded, error out.
            return Err(NormalizerError::FutureExtension);
        }

        let cap = supplementary_decompositions.get().passthrough_cap;
        if cap > 0x0300 {
            return Err(NormalizerError::ValidationError);
        }
        let decomposition_capped = cap.min(0xC0);
        let composition_capped = cap.min(0x0300);

        Ok(DecomposingNormalizer {
            decompositions,
            supplementary_decompositions: Some(SupplementPayloadHolder::Uts46(
                supplementary_decompositions,
            )),
            tables,
            supplementary_tables: Some(supplementary_tables),
            decomposition_passthrough_bound: decomposition_capped as u8,
            composition_passthrough_bound: composition_capped,
        })
    }

    /// Wraps a delegate iterator into a decomposing iterator
    /// adapter by using the data already held by this normalizer.
    pub fn normalize_iter<I: Iterator<Item = char>>(&self, iter: I) -> Decomposition<I> {
        Decomposition::new_with_supplements(
            iter,
            self.decompositions.get(),
            self.supplementary_decompositions.as_ref().map(|s| s.get()),
            self.tables.get(),
            self.supplementary_tables.as_ref().map(|s| s.get()),
            self.decomposition_passthrough_bound,
        )
    }

    normalizer_methods!();

    decomposing_normalize_to!(
1881
        /// Normalize a string slice into a `Write` sink.
1882
        ,
1883
        normalize_to,
1884
        core::fmt::Write,
1885
        &str,
1886
        {
1887
        },
1888
        as_str,
1889
        {
1890
            let decomposition_passthrough_byte_bound = if decomposition_passthrough_bound == 0xC0 {
3,032✔
1891
                0xC3u8
3,014✔
1892
            } else {
1893
                decomposition_passthrough_bound.min(0x80) as u8
9✔
1894
            };
1895
            // The attribute belongs on an inner statement, but Rust doesn't allow it there.
1896
            #[allow(clippy::unwrap_used)]
1897
            'fast: loop {
1898
                let mut code_unit_iter = decomposition.delegate.as_str().as_bytes().iter();
11,605✔
1899
                'fastest: loop {
1900
                    if let Some(&upcoming_byte) = code_unit_iter.next() {
11,989✔
1901
                        if upcoming_byte < decomposition_passthrough_byte_bound {
10,216✔
1902
                            // Fast-track succeeded!
1903
                            continue 'fastest;
1904
                        }
1905
                        decomposition.delegate = pending_slice[pending_slice.len() - code_unit_iter.as_slice().len() - 1..].chars();
9,832✔
1906
                        break 'fastest;
1907
                    }
1908
                    // End of stream
1909
                    sink.write_str(pending_slice)?;
1,773✔
1910
                    return Ok(());
1,773✔
1911
                }
1912

1913
                // `unwrap()` OK, because the slice is valid UTF-8 and we know there
1914
                // is an upcoming byte.
                let upcoming = decomposition.delegate.next().unwrap();
                let upcoming_with_trie_value = decomposition.attach_trie_value(upcoming);
                if upcoming_with_trie_value.starter_and_decomposes_to_self() {
                    continue 'fast;
                }
                let consumed_so_far_slice = &pending_slice[..pending_slice.len()
                    - decomposition.delegate.as_str().len()
                    - upcoming.len_utf8()];
                sink.write_str(consumed_so_far_slice)?;

                // Now let's figure out if we got a starter or a non-starter.
                if decomposition_starts_with_non_starter(
                    upcoming_with_trie_value.trie_val,
                ) {
                    // Let this trie value be reprocessed in case it is
                    // one of the rare decomposing ones.
                    decomposition.pending = Some(upcoming_with_trie_value);
                    decomposition.gather_and_sort_combining(0);
                    continue 'outer;
                }
                undecomposed_starter = upcoming_with_trie_value;
                debug_assert!(decomposition.pending.is_none());
                break 'fast;
            }
        },
        text,
        sink,
        decomposition,
        decomposition_passthrough_bound,
        undecomposed_starter,
        pending_slice,
        'outer,
    );

    decomposing_normalize_to!(
        /// Normalize a slice of potentially-invalid UTF-8 into a `Write` sink.
        ///
        /// Ill-formed byte sequences are mapped to the REPLACEMENT CHARACTER
        /// according to the WHATWG Encoding Standard.
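        ///
        /// # Example
        ///
        /// A minimal usage sketch (assumes the NFD constructor from this
        /// crate's compiled data; the byte slice is well-formed UTF-8 for "á"):
        ///
        /// ```
        /// let normalizer = icu_normalizer::DecomposingNormalizer::new_nfd();
        /// let mut out = String::new();
        /// normalizer.normalize_utf8_to(b"\xC3\xA1", &mut out).unwrap();
        /// assert_eq!(out, "a\u{0301}");
        /// ```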
        ,
        normalize_utf8_to,
        core::fmt::Write,
        &[u8],
        {
        },
        as_slice,
        {
            let decomposition_passthrough_byte_bound = decomposition_passthrough_bound.min(0x80) as u8;
            // The attribute belongs on an inner statement, but Rust doesn't allow it there.
            #[allow(clippy::unwrap_used)]
            'fast: loop {
                let mut code_unit_iter = decomposition.delegate.as_slice().iter();
                'fastest: loop {
                    if let Some(&upcoming_byte) = code_unit_iter.next() {
                        if upcoming_byte < decomposition_passthrough_byte_bound {
                            // Fast-track succeeded!
                            continue 'fastest;
                        }
                        break 'fastest;
                    }
                    // End of stream
                    sink.write_str(unsafe { from_utf8_unchecked(pending_slice) })?;
                    return Ok(());
                }
                decomposition.delegate = pending_slice[pending_slice.len() - code_unit_iter.as_slice().len() - 1..].chars();

                // `unwrap()` OK, because the slice is valid UTF-8 and we know there
                // is an upcoming byte.
                let upcoming = decomposition.delegate.next().unwrap();
                let upcoming_with_trie_value = decomposition.attach_trie_value(upcoming);
                if upcoming_with_trie_value.starter_and_decomposes_to_self() {
                    if upcoming != REPLACEMENT_CHARACTER {
                        continue 'fast;
                    }
                    // We might have an error, so fall out of the fast path.

                    // Since the U+FFFD might signify an error, we can't
                    // assume `upcoming.len_utf8()` for the backoff length.
                    let mut consumed_so_far = pending_slice[..pending_slice.len() - decomposition.delegate.as_slice().len()].chars();
                    let back = consumed_so_far.next_back();
                    debug_assert_eq!(back, Some(REPLACEMENT_CHARACTER));
                    let consumed_so_far_slice = consumed_so_far.as_slice();
                    sink.write_str(unsafe { from_utf8_unchecked(consumed_so_far_slice) })?;

                    // We could call `gather_and_sort_combining` here and
                    // `continue 'outer`, but this should be better for code
                    // size.
                    undecomposed_starter = upcoming_with_trie_value;
                    debug_assert!(decomposition.pending.is_none());
                    break 'fast;
                }
                let consumed_so_far_slice = &pending_slice[..pending_slice.len()
                    - decomposition.delegate.as_slice().len()
                    - upcoming.len_utf8()];
                sink.write_str(unsafe { from_utf8_unchecked(consumed_so_far_slice) })?;

                // Now let's figure out if we got a starter or a non-starter.
                if decomposition_starts_with_non_starter(
                    upcoming_with_trie_value.trie_val,
                ) {
                    // Let this trie value be reprocessed in case it is
                    // one of the rare decomposing ones.
                    decomposition.pending = Some(upcoming_with_trie_value);
                    decomposition.gather_and_sort_combining(0);
                    continue 'outer;
                }
                undecomposed_starter = upcoming_with_trie_value;
                debug_assert!(decomposition.pending.is_none());
                break 'fast;
            }
        },
        text,
        sink,
        decomposition,
        decomposition_passthrough_bound,
        undecomposed_starter,
        pending_slice,
        'outer,
    );

    decomposing_normalize_to!(
        /// Normalize a slice of potentially-invalid UTF-16 into a `Write16` sink.
        ///
        /// Unpaired surrogates are mapped to the REPLACEMENT CHARACTER
        /// before normalizing.
        ,
        normalize_utf16_to,
        write16::Write16,
        &[u16],
        {
            sink.size_hint(text.len())?;
        },
        as_slice,
        {
            let mut code_unit_iter = decomposition.delegate.as_slice().iter();
            // The purpose of the counter is to flush once in a while. If we flush
            // too often, there is too much flushing overhead. If we flush too rarely,
            // the flush starts reading from too far behind compared to the hot
            // recently-read memory.
            let mut counter = UTF16_FAST_PATH_FLUSH_THRESHOLD;
            'fast: loop {
                counter -= 1;
                if let Some(&upcoming_code_unit) = code_unit_iter.next() {
                    let mut upcoming32 = u32::from(upcoming_code_unit);
                    if upcoming32 < decomposition_passthrough_bound && counter != 0 {
                        continue 'fast;
                    }
                    // The loop is only broken out of as a forward goto.
                    #[allow(clippy::never_loop)]
                    'surrogateloop: loop {
                        let surrogate_base = upcoming32.wrapping_sub(0xD800);
                        if surrogate_base > (0xDFFF - 0xD800) {
                            // Not a surrogate
                            break 'surrogateloop;
                        }
                        if surrogate_base <= (0xDBFF - 0xD800) {
                            let iter_backup = code_unit_iter.clone();
                            if let Some(&low) = code_unit_iter.next() {
                                if in_inclusive_range16(low, 0xDC00, 0xDFFF) {
                                    upcoming32 = (upcoming32 << 10) + u32::from(low)
                                        - (((0xD800u32 << 10) - 0x10000u32) + 0xDC00u32);
                                    break 'surrogateloop;
                                } else {
                                    code_unit_iter = iter_backup;
                                }
                            }
                        }
                        // Unpaired surrogate
                        let slice_to_write = &pending_slice
                            [..pending_slice.len() - code_unit_iter.as_slice().len() - 1];
                        sink.write_slice(slice_to_write)?;
                        undecomposed_starter =
                            CharacterAndTrieValue::new(REPLACEMENT_CHARACTER, 0);
                        debug_assert!(decomposition.pending.is_none());
                        // We could instead call `gather_and_sort_combining` and `continue 'outer`,
                        // but this is assumed to be better for code size.
                        break 'fast;
                    }
                    // Not an unpaired surrogate
                    let upcoming = unsafe { char::from_u32_unchecked(upcoming32) };
                    let upcoming_with_trie_value =
                        decomposition.attach_trie_value(upcoming);
                    if upcoming_with_trie_value.starter_and_decomposes_to_self() && counter != 0 {
                        continue 'fast;
                    }
                    let consumed_so_far_slice = &pending_slice[..pending_slice.len()
                        - code_unit_iter.as_slice().len()
                        - upcoming.len_utf16()];
                    sink.write_slice(consumed_so_far_slice)?;

                    // Now let's figure out if we got a starter or a non-starter.
                    if decomposition_starts_with_non_starter(
                        upcoming_with_trie_value.trie_val,
                    ) {
                        // Sync with the main iterator
                        decomposition.delegate = code_unit_iter.as_slice().chars();
                        // Let this trie value be reprocessed in case it is
                        // one of the rare decomposing ones.
                        decomposition.pending = Some(upcoming_with_trie_value);
                        decomposition.gather_and_sort_combining(0);
                        continue 'outer;
                    }
                    undecomposed_starter = upcoming_with_trie_value;
                    debug_assert!(decomposition.pending.is_none());
                    break 'fast;
                }
                // End of stream
                sink.write_slice(pending_slice)?;
                return Ok(());
            }
            // Sync the main iterator
            decomposition.delegate = code_unit_iter.as_slice().chars();
        },
        text,
        sink,
        decomposition,
        decomposition_passthrough_bound,
        undecomposed_starter,
        pending_slice,
        'outer,
    );
}

/// A normalizer for performing composing normalization.
#[derive(Debug)]
pub struct ComposingNormalizer {
    decomposing_normalizer: DecomposingNormalizer,
    canonical_compositions: DataPayload<CanonicalCompositionsV1Marker>,
}

impl ComposingNormalizer {
    /// NFC constructor using compiled data.
    ///
    /// ✨ *Enabled with the `compiled_data` Cargo feature.*
    ///
    /// [📚 Help choosing a constructor](icu_provider::constructors)
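    ///
    /// # Example
    ///
    /// A minimal usage sketch (uses the `normalize` convenience method
    /// generated by `normalizer_methods!`):
    ///
    /// ```
    /// let normalizer = icu_normalizer::ComposingNormalizer::new_nfc();
    /// assert_eq!(normalizer.normalize("a\u{0301}"), "á");
    /// ```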
    #[cfg(feature = "compiled_data")]
    pub const fn new_nfc() -> Self {
        ComposingNormalizer {
            decomposing_normalizer: DecomposingNormalizer::new_nfd(),
            canonical_compositions: DataPayload::from_static_ref(
                crate::provider::Baked::SINGLETON_NORMALIZER_COMP_V1,
            ),
        }
    }

    icu_provider::gen_any_buffer_data_constructors!(
        locale: skip,
        options: skip,
        error: NormalizerError,
        #[cfg(skip)]
        functions: [
            new_nfc,
            try_new_nfc_with_any_provider,
            try_new_nfc_with_buffer_provider,
            try_new_nfc_unstable,
            Self,
        ]
    );

    #[doc = icu_provider::gen_any_buffer_unstable_docs!(UNSTABLE, Self::new_nfc)]
    pub fn try_new_nfc_unstable<D>(provider: &D) -> Result<Self, NormalizerError>
    where
        D: DataProvider<CanonicalDecompositionDataV1Marker>
            + DataProvider<CanonicalDecompositionTablesV1Marker>
            + DataProvider<CanonicalCompositionsV1Marker>
            + ?Sized,
    {
        let decomposing_normalizer = DecomposingNormalizer::try_new_nfd_unstable(provider)?;

        let canonical_compositions: DataPayload<CanonicalCompositionsV1Marker> =
            provider.load(Default::default())?.take_payload()?;

        Ok(ComposingNormalizer {
            decomposing_normalizer,
            canonical_compositions,
        })
    }

    /// NFKC constructor using compiled data.
    ///
    /// ✨ *Enabled with the `compiled_data` Cargo feature.*
    ///
    /// [📚 Help choosing a constructor](icu_provider::constructors)
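    ///
    /// # Example
    ///
    /// A minimal usage sketch; U+FB01 LATIN SMALL LIGATURE FI is a
    /// compatibility character that NFKC maps to "fi":
    ///
    /// ```
    /// let normalizer = icu_normalizer::ComposingNormalizer::new_nfkc();
    /// assert_eq!(normalizer.normalize("\u{FB01}"), "fi");
    /// ```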
    #[cfg(feature = "compiled_data")]
    pub const fn new_nfkc() -> Self {
        ComposingNormalizer {
            decomposing_normalizer: DecomposingNormalizer::new_nfkd(),
            canonical_compositions: DataPayload::from_static_ref(
                crate::provider::Baked::SINGLETON_NORMALIZER_COMP_V1,
            ),
        }
    }

    icu_provider::gen_any_buffer_data_constructors!(
        locale: skip,
        options: skip,
        error: NormalizerError,
        #[cfg(skip)]
        functions: [
            new_nfkc,
            try_new_nfkc_with_any_provider,
            try_new_nfkc_with_buffer_provider,
            try_new_nfkc_unstable,
            Self,
        ]
    );

    #[doc = icu_provider::gen_any_buffer_unstable_docs!(UNSTABLE, Self::new_nfkc)]
    pub fn try_new_nfkc_unstable<D>(provider: &D) -> Result<Self, NormalizerError>
    where
        D: DataProvider<CanonicalDecompositionDataV1Marker>
            + DataProvider<CompatibilityDecompositionSupplementV1Marker>
            + DataProvider<CanonicalDecompositionTablesV1Marker>
            + DataProvider<CompatibilityDecompositionTablesV1Marker>
            + DataProvider<CanonicalCompositionsV1Marker>
            + ?Sized,
    {
        let decomposing_normalizer = DecomposingNormalizer::try_new_nfkd_unstable(provider)?;

        let canonical_compositions: DataPayload<CanonicalCompositionsV1Marker> =
            provider.load(Default::default())?.take_payload()?;

        Ok(ComposingNormalizer {
            decomposing_normalizer,
            canonical_compositions,
        })
    }

    /// See [`Self::try_new_uts46_without_ignored_and_disallowed_unstable`].
    #[cfg(all(feature = "experimental", feature = "compiled_data"))]
    pub const fn new_uts46_without_ignored_and_disallowed() -> Self {
        ComposingNormalizer {
            decomposing_normalizer:
                DecomposingNormalizer::new_uts46_decomposed_without_ignored_and_disallowed(),
            canonical_compositions: DataPayload::from_static_ref(
                crate::provider::Baked::SINGLETON_NORMALIZER_COMP_V1,
            ),
        }
    }

    /// 🚧 \[Experimental\] UTS 46 constructor
    ///
    /// This is a special building block normalization for IDNA that implements parts of the Map
    /// step and the following Normalize step. The caller is responsible for performing the
    /// "disallowed", "ignored", and "deviation" parts of the Map step before passing data to
    /// this normalizer such that disallowed and ignored characters aren't passed to this
    /// normalizer.
    ///
    /// This is ICU4C's UTS 46 normalization with two exceptions: characters that UTS 46 disallows
    /// and ICU4C maps to U+FFFD and characters that UTS 46 maps to the empty string normalize as
    /// in NFC in this normalization. Making the disallowed characters behave like this is
    /// beneficial for data size, and this normalizer implementation cannot deal with a character
    /// normalizing to the empty string, which doesn't happen in NFC or NFKC as of Unicode 14.
    ///
    /// Warning: In this normalization, U+0345 COMBINING GREEK YPOGEGRAMMENI exhibits a behavior
    /// that no character in Unicode exhibits in NFD, NFKD, NFC, or NFKC: Case folding turns
    /// U+0345 from a reordered character into a non-reordered character before reordering happens.
    /// Therefore, the output of this normalization may differ for different inputs that are
    /// canonically equivalent with each other if they differ by how U+0345 is ordered relative
    /// to other reorderable characters.
    ///
    /// NOTE: This method remains experimental until the suitability of this feature as part of
    /// IDNA processing has been demonstrated.
    ///
    /// <div class="stab unstable">
    /// 🚧 This code is experimental; it may change at any time, in breaking or non-breaking ways,
    /// including in SemVer minor releases. It can be enabled with the "experimental" Cargo feature
    /// of the icu meta-crate. Use with caution.
    /// <a href="https://github.com/unicode-org/icu4x/issues/2614">#2614</a>
    /// </div>
    #[cfg(feature = "experimental")]
    pub fn try_new_uts46_without_ignored_and_disallowed_unstable<D>(
        provider: &D,
    ) -> Result<Self, NormalizerError>
    where
        D: DataProvider<CanonicalDecompositionDataV1Marker>
            + DataProvider<Uts46DecompositionSupplementV1Marker>
            + DataProvider<CanonicalDecompositionTablesV1Marker>
            + DataProvider<CompatibilityDecompositionTablesV1Marker>
            // UTS 46 tables are merged into CompatibilityDecompositionTablesV1Marker
            + DataProvider<CanonicalCompositionsV1Marker>
            + ?Sized,
    {
        let decomposing_normalizer =
            DecomposingNormalizer::try_new_uts46_decomposed_without_ignored_and_disallowed_unstable(
                provider,
            )?;

        let canonical_compositions: DataPayload<CanonicalCompositionsV1Marker> =
            provider.load(Default::default())?.take_payload()?;

        Ok(ComposingNormalizer {
            decomposing_normalizer,
            canonical_compositions,
        })
    }

    /// Wraps a delegate iterator into a composing iterator
    /// adapter by using the data already held by this normalizer.
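    ///
    /// # Example
    ///
    /// A minimal usage sketch driving the iterator adapter directly:
    ///
    /// ```
    /// let normalizer = icu_normalizer::ComposingNormalizer::new_nfc();
    /// let composed: String = normalizer.normalize_iter("a\u{0301}".chars()).collect();
    /// assert_eq!(composed, "á");
    /// ```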
    pub fn normalize_iter<I: Iterator<Item = char>>(&self, iter: I) -> Composition<I> {
        Composition::new(
            Decomposition::new_with_supplements(
                iter,
                self.decomposing_normalizer.decompositions.get(),
                self.decomposing_normalizer
                    .supplementary_decompositions
                    .as_ref()
                    .map(|s| s.get()),
                self.decomposing_normalizer.tables.get(),
                self.decomposing_normalizer
                    .supplementary_tables
                    .as_ref()
                    .map(|s| s.get()),
                self.decomposing_normalizer.decomposition_passthrough_bound,
            ),
            ZeroFrom::zero_from(&self.canonical_compositions.get().canonical_compositions),
            self.decomposing_normalizer.composition_passthrough_bound,
        )
    }

    normalizer_methods!();

    composing_normalize_to!(
        /// Normalize a string slice into a `Write` sink.
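        ///
        /// # Example
        ///
        /// A minimal usage sketch; `String` implements `core::fmt::Write`,
        /// so it can serve as the sink:
        ///
        /// ```
        /// let normalizer = icu_normalizer::ComposingNormalizer::new_nfc();
        /// let mut out = String::new();
        /// normalizer.normalize_to("a\u{0301}", &mut out).unwrap();
        /// assert_eq!(out, "á");
        /// ```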
        ,
        normalize_to,
        core::fmt::Write,
        &str,
        {},
        true,
        as_str,
        {
            // Let's hope LICM hoists this outside `'outer`.
            let composition_passthrough_byte_bound = if composition_passthrough_bound == 0x300 {
                0xCCu8
            } else {
                // We can make this fancy if a normalization other than NFC where looking at
                // non-ASCII lead bytes is worthwhile is ever introduced.
                composition_passthrough_bound.min(0x80) as u8
            };
            // This is basically an `Option` discriminant for `undecomposed_starter`,
            // but making it a boolean so that writes in the tightest loop are as
            // simple as possible (and potentially as peel-hoistable as possible).
            // Furthermore, this reduces `unwrap()` later.
            let mut undecomposed_starter_valid = true;
            // The annotation really belongs on inner statements, but Rust doesn't
            // allow it there.
            #[allow(clippy::unwrap_used)]
            'fast: loop {
                let mut code_unit_iter = composition.decomposition.delegate.as_str().as_bytes().iter();
                'fastest: loop {
                    if let Some(&upcoming_byte) = code_unit_iter.next() {
                        if upcoming_byte < composition_passthrough_byte_bound {
                            // Fast-track succeeded!
                            undecomposed_starter_valid = false;
                            continue 'fastest;
                        }
                        composition.decomposition.delegate = pending_slice[pending_slice.len() - code_unit_iter.as_slice().len() - 1..].chars();
                        break 'fastest;
                    }
                    // End of stream
                    sink.write_str(pending_slice)?;
                    return Ok(());
                }
                // `unwrap()` OK, because the slice is valid UTF-8 and we know there
                // is an upcoming byte.
                let upcoming = composition.decomposition.delegate.next().unwrap();
                let upcoming_with_trie_value = composition.decomposition.attach_trie_value(upcoming);
                if upcoming_with_trie_value.potential_passthrough_and_cannot_combine_backwards() {
                    // Can't combine backwards, hence a plain (non-backwards-combining)
                    // starter albeit past `composition_passthrough_bound`

                    // Fast-track succeeded!
                    undecomposed_starter = upcoming_with_trie_value;
                    undecomposed_starter_valid = true;
                    continue 'fast;
                }
                // We need to fall off the fast path.
                composition.decomposition.pending = Some(upcoming_with_trie_value);
                let consumed_so_far_slice = if undecomposed_starter_valid {
                    &pending_slice[..pending_slice.len() - composition.decomposition.delegate.as_str().len() - upcoming.len_utf8() - undecomposed_starter.character.len_utf8()]
                } else {
                    // Slicing and `unwrap` OK, because we've evidently read enough previously.
                    let mut consumed_so_far = pending_slice[..pending_slice.len() - composition.decomposition.delegate.as_str().len() - upcoming.len_utf8()].chars();
                    // `unwrap` OK, because we previously managed to read the previous character.
                    undecomposed_starter = composition.decomposition.attach_trie_value(consumed_so_far.next_back().unwrap());
                    undecomposed_starter_valid = true;
                    consumed_so_far.as_str()
                };
                sink.write_str(consumed_so_far_slice)?;
                break 'fast;
            }
            debug_assert!(undecomposed_starter_valid);
        },
        text,
        sink,
        composition,
        composition_passthrough_bound,
        undecomposed_starter,
        pending_slice,
        len_utf8,
    );

    composing_normalize_to!(
        /// Normalize a slice of potentially-invalid UTF-8 into a `Write` sink.
        ///
        /// Ill-formed byte sequences are mapped to the REPLACEMENT CHARACTER
        /// according to the WHATWG Encoding Standard.
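        ///
        /// # Example
        ///
        /// A minimal usage sketch; the lone lead byte at the end is
        /// ill-formed and comes out as U+FFFD:
        ///
        /// ```
        /// let normalizer = icu_normalizer::ComposingNormalizer::new_nfc();
        /// let mut out = String::new();
        /// normalizer.normalize_utf8_to(b"a\xCC\x81 \xC3", &mut out).unwrap();
        /// assert_eq!(out, "á \u{FFFD}");
        /// ```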
        ,
2425
        normalize_utf8_to,
2426
        core::fmt::Write,
2427
        &[u8],
2428
        {},
2429
        false,
2430
        as_slice,
2431
        {
2432
            // This is basically an `Option` discriminant for `undecomposed_starter`,
2433
            // but making it a boolean so that writes in the tightest loop are as
2434
            // simple as possible (and potentially as peel-hoistable as possible).
2435
            // Furthermore, this reduces `unwrap()` later.
2436
            let mut undecomposed_starter_valid = true;
14✔
2437
            'fast: loop {
14✔
2438
                if let Some(upcoming) = composition.decomposition.delegate.next() {
26✔
2439
                    if u32::from(upcoming) < composition_passthrough_bound {
17✔
2440
                        // Fast-track succeeded!
2441
                        undecomposed_starter_valid = false;
9✔
2442
                        continue 'fast;
2443
                    }
2444
                    // TODO(#2006): Annotate as unlikely
2445
                    if upcoming == REPLACEMENT_CHARACTER {
8✔
2446
                        // Can't tell if this is an error or a literal U+FFFD in
2447
                        // the input. Assuming the former to be sure.
2448

2449
                        // Since the U+FFFD might signify an error, we can't
2450
                        // assume `upcoming.len_utf8()` for the backoff length.
2451
                        let mut consumed_so_far = pending_slice[..pending_slice.len() - composition.decomposition.delegate.as_slice().len()].chars();
2✔
2452
                        let back = consumed_so_far.next_back();
2✔
2453
                        debug_assert_eq!(back, Some(REPLACEMENT_CHARACTER));
2✔
2454
                        let consumed_so_far_slice = consumed_so_far.as_slice();
2✔
2455
                        sink.write_str(unsafe{ from_utf8_unchecked(consumed_so_far_slice)})?;
2✔
2456
                        undecomposed_starter = CharacterAndTrieValue::new(REPLACEMENT_CHARACTER, 0);
2✔
2457
                        undecomposed_starter_valid = true;
2✔
2458
                        composition.decomposition.pending = None;
2✔
2459
                        break 'fast;
2460
                    }
2461
                    let upcoming_with_trie_value = composition.decomposition.attach_trie_value(upcoming);
6✔
2462
                    if upcoming_with_trie_value.potential_passthrough_and_cannot_combine_backwards() {
6✔
2463
                        // Can't combine backwards, hence a plain (non-backwards-combining)
2464
                        // starter albeit past `composition_passthrough_bound`
2465

2466
                        // Fast-track succeeded!
2467
                        undecomposed_starter = upcoming_with_trie_value;
3✔
2468
                        undecomposed_starter_valid = true;
3✔
2469
                        continue 'fast;
2470
                    }
2471
                    // We need to fall off the fast path.
2472
                    composition.decomposition.pending = Some(upcoming_with_trie_value);
3✔
2473
                    // Annotation belongs really on inner statement, but Rust doesn't
2474
                    // allow it there.
2475
                    #[allow(clippy::unwrap_used)]
2476
                    let consumed_so_far_slice = if undecomposed_starter_valid {
6✔
2477
                        &pending_slice[..pending_slice.len() - composition.decomposition.delegate.as_slice().len() - upcoming.len_utf8() - undecomposed_starter.character.len_utf8()]
3✔
2478
                    } else {
2479
                        // slicing and unwrap OK, because we've just evidently read enough previously.
2480
                        let mut consumed_so_far = pending_slice[..pending_slice.len() - composition.decomposition.delegate.as_slice().len() - upcoming.len_utf8()].chars();
×
2481
                        // `unwrap` OK, because we've previously manage to read the previous character
2482
                        undecomposed_starter = composition.decomposition.attach_trie_value(consumed_so_far.next_back().unwrap());
×
2483
                        undecomposed_starter_valid = true;
×
2484
                        consumed_so_far.as_slice()
×
2485
                    };
2486
                    sink.write_str(unsafe { from_utf8_unchecked(consumed_so_far_slice)})?;
3✔
2487
                    break 'fast;
2488
                }
2489
                // End of stream
2490
                sink.write_str(unsafe {from_utf8_unchecked(pending_slice) })?;
9✔
2491
                return Ok(());
9✔
2492
            }
2493
            debug_assert!(undecomposed_starter_valid);
5✔
2494
        },
2495
        text,
2496
        sink,
2497
        composition,
2498
        composition_passthrough_bound,
2499
        undecomposed_starter,
2500
        pending_slice,
2501
        len_utf8,
2502
    );
2503

2504
    composing_normalize_to!(
2505
        /// Normalize a slice of potentially-invalid UTF-16 into a `Write16` sink.
2506
        ///
2507
        /// Unpaired surrogates are mapped to the REPLACEMENT CHARACTER
2508
        /// before normalizing.
2509
        ,
2510
        normalize_utf16_to,
2511
        write16::Write16,
2512
        &[u16],
2513
        {
2514
            sink.size_hint(text.len())?;
12✔
2515
        },
2516
        false,
2517
        as_slice,
2518
        {
2519
            let mut code_unit_iter = composition.decomposition.delegate.as_slice().iter();
15✔
2520
            let mut upcoming32;
2521
            // This is basically an `Option` discriminant for `undecomposed_starter`,
2522
            // but making it a boolean so that writes to it are  are as
2523
            // simple as possible.
2524
            // Furthermore, this removes the need for `unwrap()` later.
2525
            let mut undecomposed_starter_valid;
2526
            // The purpose of the counter is to flush once in a while. If we flush
2527
            // too much, there is too much flushing overhead. If we flush too rarely,
2528
            // the flush starts reading from too far behind compared to the hot
2529
            // recently-read memory.
2530
            let mut counter = UTF16_FAST_PATH_FLUSH_THRESHOLD;
15✔
2531
            // The purpose of this trickiness is to avoid writing to
2532
            // `undecomposed_starter_valid` from the tightest loop. Writing to it
2533
            // from there destroys performance.
2534
            let mut counter_reference = counter - 1;
15✔
2535
            'fast: loop {
                counter -= 1;
                if let Some(&upcoming_code_unit) = code_unit_iter.next() {
                    upcoming32 = u32::from(upcoming_code_unit); // may be surrogate
                    if upcoming32 < composition_passthrough_bound && counter != 0 {
                        // No need for a surrogate or U+FFFD check, because
                        // `composition_passthrough_bound` cannot be higher than
                        // U+0300.
                        // Fast-track succeeded!
                        continue 'fast;
                    }
                    // If `counter` equals `counter_reference`, the `continue 'fast`
                    // line above has not executed and `undecomposed_starter` is still
                    // valid.
                    undecomposed_starter_valid = counter == counter_reference;
                    // The loop is only broken out of as a forward goto.
                    #[allow(clippy::never_loop)]
                    'surrogateloop: loop {
                        let surrogate_base = upcoming32.wrapping_sub(0xD800);
                        if surrogate_base > (0xDFFF - 0xD800) {
                            // Not a surrogate
                            break 'surrogateloop;
                        }
                        if surrogate_base <= (0xDBFF - 0xD800) {
                            let iter_backup = code_unit_iter.clone();
                            if let Some(&low) = code_unit_iter.next() {
                                if in_inclusive_range16(low, 0xDC00, 0xDFFF) {
                                    upcoming32 = (upcoming32 << 10) + u32::from(low)
                                        - (((0xD800u32 << 10) - 0x10000u32) + 0xDC00u32);
                                    break 'surrogateloop;
                                } else {
                                    code_unit_iter = iter_backup;
                                }
                            }
                        }
                        // Unpaired surrogate
                        let slice_to_write = &pending_slice
                            [..pending_slice.len() - code_unit_iter.as_slice().len() - 1];
                        sink.write_slice(slice_to_write)?;
                        undecomposed_starter =
                            CharacterAndTrieValue::new(REPLACEMENT_CHARACTER, 0);
                        undecomposed_starter_valid = true;
                        composition.decomposition.pending = None;
                        break 'fast;
                    }
                    // Not an unpaired surrogate
                    let upcoming = unsafe { char::from_u32_unchecked(upcoming32) };
                    let upcoming_with_trie_value =
                        composition.decomposition.attach_trie_value(upcoming);
                    if upcoming_with_trie_value.potential_passthrough_and_cannot_combine_backwards()
                        && counter != 0
                    {
                        // Can't combine backwards, hence a plain (non-backwards-combining)
                        // starter albeit past `composition_passthrough_bound`.

                        // Fast-track succeeded!
                        undecomposed_starter = upcoming_with_trie_value;
                        // Cause `undecomposed_starter_valid` to be set to true.
                        // This regresses English performance on Haswell by 11%
                        // compared to commenting out this assignment to
                        // `counter_reference`.
                        counter_reference = counter - 1;
                        continue 'fast;
                    }
                    // We need to fall off the fast path.
                    composition.decomposition.pending = Some(upcoming_with_trie_value);
                    // The annotation really belongs on the inner statement, but Rust
                    // doesn't allow it there.
                    #[allow(clippy::unwrap_used)]
                    let consumed_so_far_slice = if undecomposed_starter_valid {
                        &pending_slice[..pending_slice.len()
                            - code_unit_iter.as_slice().len()
                            - upcoming.len_utf16()
                            - undecomposed_starter.character.len_utf16()]
                    } else {
                        // Slicing and `unwrap()` OK, because we've evidently read
                        // enough previously.
                        let mut consumed_so_far = pending_slice[..pending_slice.len()
                            - code_unit_iter.as_slice().len()
                            - upcoming.len_utf16()]
                            .chars();
                        // `unwrap()` OK, because we previously managed to read the
                        // previous character.
                        undecomposed_starter = composition
                            .decomposition
                            .attach_trie_value(consumed_so_far.next_back().unwrap());
                        undecomposed_starter_valid = true;
                        consumed_so_far.as_slice()
                    };
                    sink.write_slice(consumed_so_far_slice)?;
                    break 'fast;
                }
                // End of stream
                sink.write_slice(pending_slice)?;
                return Ok(());
            }
            debug_assert!(undecomposed_starter_valid);
            // Sync the main iterator
            composition.decomposition.delegate = code_unit_iter.as_slice().chars();
        },
        text,
        sink,
        composition,
        composition_passthrough_bound,
        undecomposed_starter,
        pending_slice,
        len_utf16,
    );
}

struct IsNormalizedSinkUtf16<'a> {
    expect: &'a [u16],
}

impl<'a> IsNormalizedSinkUtf16<'a> {
    pub fn new(slice: &'a [u16]) -> Self {
        IsNormalizedSinkUtf16 { expect: slice }
    }
    pub fn finished(&self) -> bool {
        self.expect.is_empty()
    }
}

impl<'a> Write16 for IsNormalizedSinkUtf16<'a> {
    fn write_slice(&mut self, s: &[u16]) -> core::fmt::Result {
        // We know that if we get a slice, it's a pass-through,
        // so we can compare addresses. Indexing is OK, because
        // an indexing failure would be a code bug rather than
        // an input or data issue.
        #[allow(clippy::indexing_slicing)]
        if s.as_ptr() == self.expect.as_ptr() {
            self.expect = &self.expect[s.len()..];
            Ok(())
        } else {
            Err(core::fmt::Error {})
        }
    }

    fn write_char(&mut self, c: char) -> core::fmt::Result {
        let mut iter = self.expect.chars();
        if iter.next() == Some(c) {
            self.expect = iter.as_slice();
            Ok(())
        } else {
            Err(core::fmt::Error {})
        }
    }
}

struct IsNormalizedSinkUtf8<'a> {
    expect: &'a [u8],
}

impl<'a> IsNormalizedSinkUtf8<'a> {
    pub fn new(slice: &'a [u8]) -> Self {
        IsNormalizedSinkUtf8 { expect: slice }
    }
    pub fn finished(&self) -> bool {
        self.expect.is_empty()
    }
}

impl<'a> core::fmt::Write for IsNormalizedSinkUtf8<'a> {
    fn write_str(&mut self, s: &str) -> core::fmt::Result {
        // We know that if we get a slice, it's a pass-through,
        // so we can compare addresses. Indexing is OK, because
        // an indexing failure would be a code bug rather than
        // an input or data issue.
        #[allow(clippy::indexing_slicing)]
        if s.as_ptr() == self.expect.as_ptr() {
            self.expect = &self.expect[s.len()..];
            Ok(())
        } else {
            Err(core::fmt::Error {})
        }
    }

    fn write_char(&mut self, c: char) -> core::fmt::Result {
        let mut iter = self.expect.chars();
        if iter.next() == Some(c) {
            self.expect = iter.as_slice();
            Ok(())
        } else {
            Err(core::fmt::Error {})
        }
    }
}

struct IsNormalizedSinkStr<'a> {
    expect: &'a str,
}

impl<'a> IsNormalizedSinkStr<'a> {
    pub fn new(slice: &'a str) -> Self {
        IsNormalizedSinkStr { expect: slice }
    }
    pub fn finished(&self) -> bool {
        self.expect.is_empty()
    }
}

impl<'a> core::fmt::Write for IsNormalizedSinkStr<'a> {
    fn write_str(&mut self, s: &str) -> core::fmt::Result {
        // We know that if we get a slice, it's a pass-through,
        // so we can compare addresses. Indexing is OK, because
        // an indexing failure would be a code bug rather than
        // an input or data issue.
        #[allow(clippy::indexing_slicing)]
        if s.as_ptr() == self.expect.as_ptr() {
            self.expect = &self.expect[s.len()..];
            Ok(())
        } else {
            Err(core::fmt::Error {})
        }
    }

    fn write_char(&mut self, c: char) -> core::fmt::Result {
        let mut iter = self.expect.chars();
        if iter.next() == Some(c) {
            self.expect = iter.as_str();
            Ok(())
        } else {
            Err(core::fmt::Error {})
        }
    }
}
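The UTF-16 fast path above combines a surrogate pair into a code point with a single folded expression: `(high << 10) + low - (((0xD800 << 10) - 0x10000) + 0xDC00)`. As a standalone sketch (not part of this crate; `combine_surrogates` is a hypothetical helper name), the same arithmetic can be checked in isolation:

```rust
/// Illustrative only: the surrogate-pair combination used in the fast path,
/// equivalent to `0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)`
/// with all constants folded into one subtraction.
fn combine_surrogates(high: u32, low: u32) -> u32 {
    // Preconditions matching the checks in the fast path above.
    debug_assert!((0xD800..=0xDBFF).contains(&high));
    debug_assert!((0xDC00..=0xDFFF).contains(&low));
    (high << 10) + low - (((0xD800u32 << 10) - 0x10000u32) + 0xDC00u32)
}

fn main() {
    // U+1F600 is encoded in UTF-16 as the pair 0xD83D 0xDE00.
    assert_eq!(combine_surrogates(0xD83D, 0xDE00), 0x1F600);
    // The lowest and highest supplementary code points.
    assert_eq!(combine_surrogates(0xD800, 0xDC00), 0x10000);
    assert_eq!(combine_surrogates(0xDBFF, 0xDFFF), 0x10FFFF);
}
```

Folding the constants lets the combination happen in one add and one subtract after the shift, which is why the fast path can afford to do it inline.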