JuliaLang / julia / #37596

pending completion
#37596

push

local

web-flow
🤖 [master] Bump the Pkg stdlib from 2c04d5a98 to b044bf6a2 (#50851)

Co-authored-by: Dilum Aluthge <dilum@aluthge.com>

71913 of 84418 relevant lines covered (85.19%)

32144286.87 hits per line

Source File

1.69
/stdlib/Unicode/src/Unicode.jl
# This file is a part of Julia. License is MIT: https://julialang.org/license

module Unicode

export graphemes, isequal_normalized

"""
    Unicode.julia_chartransform(c::Union{Char,Integer})

Map the Unicode character (`Char`) or codepoint (`Integer`) `c` to the corresponding
"equivalent" character or codepoint, respectively, according to the custom equivalence
used within the Julia parser (in addition to NFC normalization).

For example, `'µ'` (U+00B5 micro) is treated as equivalent to `'μ'` (U+03BC mu) by
Julia's parser, so `julia_chartransform` performs this transformation while leaving
other characters unchanged:
```jldoctest
julia> Unicode.julia_chartransform('\u00B5')
'μ': Unicode U+03BC (category Ll: Letter, lowercase)

julia> Unicode.julia_chartransform('x')
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
```

`julia_chartransform` is mainly useful for passing to the [`Unicode.normalize`](@ref)
function in order to mimic the normalization used by the Julia parser:
```jldoctest
julia> s = "\u00B5o\u0308"
"µö"

julia> s2 = Unicode.normalize(s, compose=true, stable=true, chartransform=Unicode.julia_chartransform)
"μö"

julia> collect(s2)
2-element Vector{Char}:
 'μ': Unicode U+03BC (category Ll: Letter, lowercase)
 'ö': Unicode U+00F6 (category Ll: Letter, lowercase)

julia> s2 == string(Meta.parse(s))
true
```
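
As a further illustration (a minimal sketch, not a doctest; it assumes the parser's
character map still contains the U+00B5 → U+03BC entry shown above), the `Integer`
method returns a codepoint rather than a `Char`:
```julia
using Unicode

# Integer codepoints are converted to UInt32, looked up, and returned as UInt32.
mapped = Unicode.julia_chartransform(0x00B5)

# Codepoints without a custom equivalence are returned unchanged.
unchanged = Unicode.julia_chartransform(UInt32('x'))
```
Here `mapped == 0x000003bc` and `unchanged == 0x00000078`; the variable names are
only for illustration.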

!!! compat "Julia 1.8"
    This function was introduced in Julia 1.8.
"""
function julia_chartransform end
julia_chartransform(codepoint::UInt32) = get(Base.Unicode._julia_charmap, codepoint, codepoint)
julia_chartransform(codepoint::Integer) = julia_chartransform(UInt32(codepoint))
julia_chartransform(char::Char) = Char(julia_chartransform(UInt32(char)))

"""
    Unicode.normalize(s::AbstractString; keywords...)
    Unicode.normalize(s::AbstractString, normalform::Symbol)

Normalize the string `s`. By default, canonical composition (`compose=true`) is performed without ensuring
Unicode versioning stability (`stable=false`), which produces the shortest possible equivalent string
but may introduce composition characters not present in earlier Unicode versions.

Alternatively, one of the four "normal forms" of the Unicode standard can be specified:
`normalform` can be `:NFC`, `:NFD`, `:NFKC`, or `:NFKD`.  Normal forms C
(canonical composition) and D (canonical decomposition) convert different visually identical
representations of the same abstract string into a single canonical form, with form C being
more compact.  Normal forms KC and KD additionally canonicalize "compatibility equivalents":
they convert characters that are abstractly similar but visually distinct into a single
canonical choice (e.g. they expand ligatures into the individual characters), with form KC
being more compact.

Alternatively, finer control and additional transformations may be obtained by calling
`Unicode.normalize(s; keywords...)`, where any number of the following boolean keyword
options (which all default to `false` except for `compose`) are specified:

* `compose=false`: do not perform canonical composition
* `decompose=true`: do canonical decomposition instead of canonical composition
  (`compose=true` is ignored if present)
* `compat=true`: compatibility equivalents are canonicalized
* `casefold=true`: perform Unicode case folding, e.g. for case-insensitive string comparison
* `newline2lf=true`, `newline2ls=true`, or `newline2ps=true`: convert various newline
  sequences (LF, CRLF, CR, NEL) into a linefeed (LF), line-separation (LS), or
  paragraph-separation (PS) character, respectively
* `stripmark=true`: strip diacritical marks (e.g. accents)
* `stripignore=true`: strip Unicode's "default ignorable" characters (e.g. the soft hyphen
  or the left-to-right marker)
* `stripcc=true`: strip control characters; horizontal tabs and form feeds are converted to
  spaces; newlines are also converted to spaces unless a newline-conversion flag was
  specified
* `rejectna=true`: throw an error if unassigned code points are found
* `stable=true`: enforce Unicode versioning stability (never introduce characters missing from earlier Unicode versions)

You can also use the `chartransform` keyword (which defaults to `identity`) to pass an arbitrary
*function* mapping `Integer` codepoints to codepoints, which is called on each
character in `s` as it is processed, in order to perform arbitrary additional normalizations.
For example, by passing `chartransform=Unicode.julia_chartransform`, you can apply a few Julia-specific
character normalizations that are performed by Julia when parsing identifiers (in addition to
NFC normalization: `compose=true, stable=true`).

For example, the normal form `:NFKC` corresponds to the keyword options `compose=true, compat=true, stable=true`.
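
The correspondence between normal-form symbols and keyword options can be checked
directly. The following is a small sketch rather than a doctest; the sample string is
an arbitrary choice containing the ligature U+FB01 and a decomposed accented letter:
```julia
using Unicode

s = "\uFB01ance\u0301"   # "fi" ligature, then "ance" with a combining acute accent

# :NFKC should match the equivalent keyword combination.
nfkc_symbol   = Unicode.normalize(s, :NFKC)
nfkc_keywords = Unicode.normalize(s, compose=true, compat=true, stable=true)

# :NFC composes the accent but leaves the ligature alone.
nfc = Unicode.normalize(s, :NFC)
```
Under these assumptions `nfkc_symbol == nfkc_keywords`, both with the ligature expanded
to the two letters `fi`, whereas `nfc` still begins with the ligature codepoint.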

# Examples
```jldoctest
julia> "é" == Unicode.normalize("é") #LHS: Unicode U+00e9, RHS: U+0065 & U+0301
true

julia> "μ" == Unicode.normalize("µ", compat=true) #LHS: Unicode U+03bc, RHS: Unicode U+00b5
true

julia> Unicode.normalize("JuLiA", casefold=true)
"julia"

julia> Unicode.normalize("JúLiA", stripmark=true)
"JuLiA"
```
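
A custom `chartransform` can be any function on codepoints. As a hedged sketch (the
helper `ascii_apostrophe` is hypothetical and not part of this module), the following
maps the right single quotation mark U+2019 to an ASCII apostrophe during normalization:
```julia
using Unicode

# Hypothetical helper: receives a codepoint and returns the codepoint to use instead.
ascii_apostrophe(c::Integer) = c == 0x2019 ? UInt32(0x27) : UInt32(c)

t = Unicode.normalize("it\u2019s", chartransform=ascii_apostrophe)
```
With the default `compose=true`, `t` is expected to be the ASCII string `"it's"`.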

!!! compat "Julia 1.8"
    The `chartransform` keyword argument requires Julia 1.8.
"""
function normalize end
normalize(s::AbstractString, nf::Symbol) = Base.Unicode.normalize(s, nf)
normalize(s::AbstractString; kwargs...) = Base.Unicode.normalize(s; kwargs...)

"""
    Unicode.isassigned(c) -> Bool

Return `true` if the given char or integer is an assigned Unicode code point.

# Examples
```jldoctest
julia> Unicode.isassigned(101)
true

julia> Unicode.isassigned('\\x01')
true
```
"""
isassigned(c) = Base.Unicode.isassigned(c)

"""
    graphemes(s::AbstractString) -> GraphemeIterator

Return an iterator over substrings of `s` that correspond to the extended graphemes in the
string, as defined by Unicode UAX #29. (Roughly, these are what users would perceive as
single characters, even though they may contain more than one codepoint; for example a
letter combined with an accent mark is a single grapheme.)
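
For instance (a minimal sketch rather than a doctest), a string with a combining mark
has fewer graphemes than codepoints:
```julia
using Unicode

s = "noe\u0308l"                      # 'e' followed by a combining diaeresis (U+0308)

ncodepoints = length(s)               # counts Chars
ngraphemes  = length(graphemes(s))    # counts extended graphemes
parts       = collect(graphemes(s))   # substrings, one per grapheme
```
Here `ncodepoints == 5` and `ngraphemes == 4`, and `parts[3]` holds both codepoints
of the accented letter.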
"""
graphemes(s::AbstractString) = Base.Unicode.GraphemeIterator{typeof(s)}(s)

"""
    graphemes(s::AbstractString, m:n) -> SubString

Return a [`SubString`](@ref) of `s` consisting of the `m`-th
through `n`-th graphemes of the string `s`, where the second
argument `m:n` is an integer-valued [`AbstractUnitRange`](@ref).

Loosely speaking, this corresponds to the `m:n`-th user-perceived
"characters" in the string.  For example:

```jldoctest
julia> s = graphemes("exposé", 3:6)
"posé"

julia> collect(s)
5-element Vector{Char}:
 'p': ASCII/Unicode U+0070 (category Ll: Letter, lowercase)
 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
 's': ASCII/Unicode U+0073 (category Ll: Letter, lowercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 '́': Unicode U+0301 (category Mn: Mark, nonspacing)
```
This consists of the 3rd to *7th* codepoints ([`Char`](@ref)s) in `"exposé"`,
because the grapheme `"é"` is actually *two* Unicode codepoints
(an `'e'` followed by an acute-accent combining character U+0301).

Because finding grapheme boundaries requires iteration over the
string contents, the `graphemes(s, m:n)` function requires time
proportional to the length of the string (number of codepoints)
before the end of the substring.
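
The boundary cases can be sketched as follows (an illustration of the behavior
implemented by the checks in this method, not a doctest):
```julia
using Unicode

empty_result = graphemes("abc", 2:1)    # an empty range gives an empty substring

# A starting index below 1 is rejected with an ArgumentError.
bad_start = try
    graphemes("abc", 0:1)
    nothing
catch err
    err
end

# A range that extends past the last grapheme raises a BoundsError.
too_long = try
    graphemes("abc", 2:5)
    nothing
catch err
    err
end
```
Here `empty_result == ""`, `bad_start isa ArgumentError`, and `too_long isa BoundsError`.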

!!! compat "Julia 1.9"
    The `m:n` argument of `graphemes` requires Julia 1.9.
"""
function graphemes(s::AbstractString, r::AbstractUnitRange{<:Integer})
    m, n = Int(first(r)), Int(last(r))
    m > 0 || throw(ArgumentError("starting index $m is not ≥ 1"))
    n < m && return @view s[1:0]
    c0 = eltype(s)(0x00000000)
    state = Ref{Int32}(0)
    count = 0
    i, iprev, ilast = 1, 1, lastindex(s)
    # find the start of the m-th grapheme
    while i ≤ ilast && count < m
        @inbounds c = s[i]
        count += Base.Unicode.isgraphemebreak!(state, c0, c)
        c0 = c
        i, iprev = nextind(s, i), i
    end
    start = iprev
    count < m && throw(BoundsError(s, i))
    # find the end of the n-th grapheme
    while i ≤ ilast
        @inbounds c = s[i]
        count += Base.Unicode.isgraphemebreak!(state, c0, c)
        count > n && break
        c0 = c
        i, iprev = nextind(s, i), i
    end
    count < n && throw(BoundsError(s, i))
    return @view s[start:iprev]
end

using Base.Unicode: utf8proc_error, UTF8PROC_DECOMPOSE, UTF8PROC_CASEFOLD, UTF8PROC_STRIPMARK

function _decompose_char!(codepoint::Union{Integer,Char}, dest::Vector{UInt32}, options::Integer)
    ret = @ccall utf8proc_decompose_char(codepoint::UInt32, dest::Ptr{UInt32}, length(dest)::Int, options::Cint, C_NULL::Ptr{Cint})::Int
    ret < 0 && utf8proc_error(ret)
    return ret
end

"""
    isequal_normalized(s1::AbstractString, s2::AbstractString; casefold=false, stripmark=false, chartransform=identity)

Return whether `s1` and `s2` are canonically equivalent Unicode strings.  If `casefold=true`,
ignores case (performs Unicode case-folding); if `stripmark=true`, strips diacritical marks
and other combining characters.

As with [`Unicode.normalize`](@ref), you can also pass an arbitrary
function via the `chartransform` keyword (mapping `Integer` codepoints to codepoints)
to perform custom normalizations, such as [`Unicode.julia_chartransform`](@ref).

# Examples

For example, the string `"noël"` can be constructed in two canonically equivalent ways
in Unicode, depending on whether `"ë"` is formed from a single codepoint U+00EB or
from the ASCII character `'e'` followed by the U+0308 combining-diaeresis character.

```jldoctest
julia> s1 = "no\u00EBl"
"noël"

julia> s2 = "noe\u0308l"
"noël"

julia> s1 == s2
false

julia> isequal_normalized(s1, s2)
true

julia> isequal_normalized(s1, "noel", stripmark=true)
true

julia> isequal_normalized(s1, "NOËL", casefold=true)
true
```
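
The `chartransform` keyword can be sketched in the same way (an illustration rather
than a doctest), using the micro sign U+00B5 and the Greek letter mu U+03BC, which
are not canonically equivalent but are identified by Julia's parser:
```julia
using Unicode

micro, mu = "\u00B5", "\u03BC"

plain       = isequal_normalized(micro, mu)
transformed = isequal_normalized(micro, mu, chartransform=Unicode.julia_chartransform)
```
Here `plain` is `false` and `transformed` is `true`.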
"""
function isequal_normalized(s1::AbstractString, s2::AbstractString; casefold::Bool=false, stripmark::Bool=false, chartransform=identity)
    # decompose codepoint `c` into buffer `d` (growing `d` if needed) and advance the iterator over `s`
    function decompose_next_char!(c, state, d, options, s)
        n = _decompose_char!(c, d, options)
        if n > length(d) # may be possible in future Unicode versions?
            n = _decompose_char!(c, resize!(d, n), options)
        end
        return 1, n, iterate(s, state)
    end
    options = UTF8PROC_DECOMPOSE
    casefold && (options |= UTF8PROC_CASEFOLD)
    stripmark && (options |= UTF8PROC_STRIPMARK)
    i1,i2 = iterate(s1),iterate(s2)
    d1,d2 = Vector{UInt32}(undef, 4), Vector{UInt32}(undef, 4) # codepoint buffers
    n1 = n2 = 0 # lengths of codepoint buffers
    j1 = j2 = 1 # indices in d1, d2
    # compare the decomposed codepoints of s1 and s2 pairwise, refilling each buffer once it is exhausted
    while true
        if j1 > n1
            i1 === nothing && return i2 === nothing && j2 > n2
            j1, n1, i1 = decompose_next_char!(chartransform(UInt32(i1[1])), i1[2], d1, options, s1)
        end
        if j2 > n2
            i2 === nothing && return false
            j2, n2, i2 = decompose_next_char!(chartransform(UInt32(i2[1])), i2[2], d2, options, s2)
        end
        d1[j1] == d2[j2] || return false
        j1 += 1; j2 += 1
    end
end

end