JuliaLang / julia / #38002

06 Feb 2025 06:14AM UTC coverage: 20.322% (-2.4%) from 22.722%

push

local

web-flow
bpart: Fully switch to partitioned semantics (#57253)

This is the final PR in the binding partitions series (modulo bugs and
tweaks), i.e. it closes #54654 and thus closes #40399, which was the
original design sketch.

This activates the full designed semantics for binding partitions,
in particular allowing safe replacement of const bindings, including
struct redefinitions. This thus closes
timholy/Revise.jl#18 and also closes #38584.

The biggest semantic change here is probably that this gets rid of the
notion of "resolvedness" of a binding. Previously, a lot of the behavior
of our implementation depended on when bindings were "resolved", which
could happen at basically an arbitrary point (in the compiler, in REPL
completion, in a different thread), making a lot of the semantics around
bindings ill- or at least implementation-defined. There are several
related issues in the bugtracker, so this closes #14055 closes #44604
closes #46354 closes #30277

It is also the last step to close #24569.
It also supports bindings for undef->defined transitions and thus closes
#53958 closes #54733 - however, this is not activated yet for
performance reasons and may need some further optimization.

Since resolvedness no longer exists, we need to replace it with some
hopefully more well-defined semantics. I will describe the semantics
below, but before I do I will make two notes:

1. There are a number of cases where these semantics will behave
slightly differently than the old semantics absent some other task going
around resolving random bindings.
2. The new behavior (except for the replacement stuff) was generally
permissible under the old semantics if the bindings happened to be
resolved at the right time.

With all that said, there are essentially three "strengths" of bindings:

1. Implicit Bindings: Anything implicitly obtained from `using Mod`, "no
binding", plus slightly more exotic corner cases around conflicts

2. Weakly declared bindin... (continued)

11 of 111 new or added lines in 7 files covered. (9.91%)

1273 existing lines in 68 files now uncovered.

9908 of 48755 relevant lines covered (20.32%)

105126.48 hits per line

Source File

7.56
/base/strings/unicode.jl
1
# This file is a part of Julia. License is MIT: https://julialang.org/license
2

3
# Various Unicode functionality from the utf8proc library
4
module Unicode
5

6
import Base: show, ==, hash, string, Symbol, isless, length, eltype,
7
             convert, isvalid, ismalformed, isoverlong, iterate,
8
             AnnotatedString, AnnotatedChar, annotated_chartransform,
9
             @assume_effects, annotations
10

11
# whether codepoints are valid Unicode scalar values, i.e. 0-0xd7ff, 0xe000-0x10ffff
12

13
"""
14
    isvalid(value) -> Bool
15

16
Return `true` if the given value is valid for its type, which currently can be either
17
`AbstractChar` or `String` or `SubString{String}`.
18

19
# Examples
20
```jldoctest
21
julia> isvalid(Char(0xd800))
22
false
23

24
julia> isvalid(SubString(String(UInt8[0xfe,0x80,0x80,0x80,0x80,0x80]),1,2))
25
false
26

27
julia> isvalid(Char(0xd799))
28
true
29
```
30
"""
31
isvalid(value)
32

33
"""
34
    isvalid(T, value) -> Bool
35

36
Return `true` if the given value is valid for that type. Types currently can
37
be either `AbstractChar` or `String`. Values for `AbstractChar` can be of type `AbstractChar` or [`UInt32`](@ref).
38
Values for `String` can be of that type, `SubString{String}`, `Vector{UInt8}`,
39
or a contiguous subarray thereof.
40

41
# Examples
42
```jldoctest
43
julia> isvalid(Char, 0xd800)
44
false
45

46
julia> isvalid(String, SubString("thisisvalid",1,5))
47
true
48

49
julia> isvalid(Char, 0xd799)
50
true
51
```
52

53
!!! compat "Julia 1.6"
54
    Support for subarray values was added in Julia 1.6.
55
"""
56
isvalid(T,value)
57

58
isvalid(c::AbstractChar) = !ismalformed(c) & !isoverlong(c) & ((c ≤ '\ud7ff') | ('\ue000' ≤ c) & (c ≤ '\U10ffff'))
7,991✔
59
isvalid(::Type{<:AbstractChar}, c::Unsigned) = ((c ≤  0xd7ff ) | ( 0xe000  ≤ c) & (c ≤  0x10ffff ))
×
60
isvalid(::Type{T}, c::Integer) where {T<:AbstractChar}  = isvalid(T, Unsigned(c))
×
61
isvalid(::Type{<:AbstractChar}, c::AbstractChar)     = isvalid(c)
×
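
The range checks in the `isvalid` methods above can be exercised directly. The following is a minimal sketch that uses only the exported `isvalid`; the sample code points are illustrative choices made here, not values from the source file.

```julia
# Code points on either side of the surrogate gap and of the upper bound.
# These sample values are illustrative choices, not taken from the source.
samples = UInt32[0x0041, 0xd7ff, 0xd800, 0xdfff, 0xe000, 0x10ffff, 0x110000]

for cp in samples
    # isvalid(Char, cp) tests whether cp is a Unicode scalar value.
    println("U+", string(cp, base = 16, pad = 4), " => ", isvalid(Char, cp))
end

valid = [isvalid(Char, cp) for cp in samples]
```

Surrogates (0xd800 through 0xdfff) and anything above 0x10ffff are rejected; everything else in the sample is accepted.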
62

63
# utf8 category constants
64
const UTF8PROC_CATEGORY_CN = 0
65
const UTF8PROC_CATEGORY_LU = 1
66
const UTF8PROC_CATEGORY_LL = 2
67
const UTF8PROC_CATEGORY_LT = 3
68
const UTF8PROC_CATEGORY_LM = 4
69
const UTF8PROC_CATEGORY_LO = 5
70
const UTF8PROC_CATEGORY_MN = 6
71
const UTF8PROC_CATEGORY_MC = 7
72
const UTF8PROC_CATEGORY_ME = 8
73
const UTF8PROC_CATEGORY_ND = 9
74
const UTF8PROC_CATEGORY_NL = 10
75
const UTF8PROC_CATEGORY_NO = 11
76
const UTF8PROC_CATEGORY_PC = 12
77
const UTF8PROC_CATEGORY_PD = 13
78
const UTF8PROC_CATEGORY_PS = 14
79
const UTF8PROC_CATEGORY_PE = 15
80
const UTF8PROC_CATEGORY_PI = 16
81
const UTF8PROC_CATEGORY_PF = 17
82
const UTF8PROC_CATEGORY_PO = 18
83
const UTF8PROC_CATEGORY_SM = 19
84
const UTF8PROC_CATEGORY_SC = 20
85
const UTF8PROC_CATEGORY_SK = 21
86
const UTF8PROC_CATEGORY_SO = 22
87
const UTF8PROC_CATEGORY_ZS = 23
88
const UTF8PROC_CATEGORY_ZL = 24
89
const UTF8PROC_CATEGORY_ZP = 25
90
const UTF8PROC_CATEGORY_CC = 26
91
const UTF8PROC_CATEGORY_CF = 27
92
const UTF8PROC_CATEGORY_CS = 28
93
const UTF8PROC_CATEGORY_CO = 29
94

95
# strings corresponding to the category constants
96
const category_strings = [
97
    "Other, not assigned",
98
    "Letter, uppercase",
99
    "Letter, lowercase",
100
    "Letter, titlecase",
101
    "Letter, modifier",
102
    "Letter, other",
103
    "Mark, nonspacing",
104
    "Mark, spacing combining",
105
    "Mark, enclosing",
106
    "Number, decimal digit",
107
    "Number, letter",
108
    "Number, other",
109
    "Punctuation, connector",
110
    "Punctuation, dash",
111
    "Punctuation, open",
112
    "Punctuation, close",
113
    "Punctuation, initial quote",
114
    "Punctuation, final quote",
115
    "Punctuation, other",
116
    "Symbol, math",
117
    "Symbol, currency",
118
    "Symbol, modifier",
119
    "Symbol, other",
120
    "Separator, space",
121
    "Separator, line",
122
    "Separator, paragraph",
123
    "Other, control",
124
    "Other, format",
125
    "Other, surrogate",
126
    "Other, private use",
127
    "Invalid, too high",
128
    "Malformed, bad data",
129
]
130

131
const UTF8PROC_STABLE    = (1<<1)
132
const UTF8PROC_COMPAT    = (1<<2)
133
const UTF8PROC_COMPOSE   = (1<<3)
134
const UTF8PROC_DECOMPOSE = (1<<4)
135
const UTF8PROC_IGNORE    = (1<<5)
136
const UTF8PROC_REJECTNA  = (1<<6)
137
const UTF8PROC_NLF2LS    = (1<<7)
138
const UTF8PROC_NLF2PS    = (1<<8)
139
const UTF8PROC_NLF2LF    = (UTF8PROC_NLF2LS | UTF8PROC_NLF2PS)
140
const UTF8PROC_STRIPCC   = (1<<9)
141
const UTF8PROC_CASEFOLD  = (1<<10)
142
const UTF8PROC_CHARBOUND = (1<<11)
143
const UTF8PROC_LUMP      = (1<<12)
144
const UTF8PROC_STRIPMARK = (1<<13)
145

146
############################################################################
147

148
utf8proc_error(result) = error(unsafe_string(ccall(:utf8proc_errmsg, Cstring, (Cssize_t,), result)))
×
149

150
# static wrapper around user callback function
151
utf8proc_custom_func(codepoint::UInt32, callback::Any) =
×
152
    UInt32(callback(codepoint))::UInt32
153

154
function utf8proc_decompose(str, options, buffer, nwords, chartransform::typeof(identity))
×
155
    ret = ccall(:utf8proc_decompose, Int, (Ptr{UInt8}, Int, Ptr{UInt8}, Int, Cint),
×
156
                str, sizeof(str), buffer, nwords, options)
157
    ret < 0 && utf8proc_error(ret)
×
158
    return ret
×
159
end
160
function utf8proc_decompose(str, options, buffer, nwords, chartransform::F) where F
×
161
    ret = ccall(:utf8proc_decompose_custom, Int, (Ptr{UInt8}, Int, Ptr{UInt8}, Int, Cint, Ptr{Cvoid}, Ref{F}),
×
162
                str, sizeof(str), buffer, nwords, options,
163
                @cfunction(utf8proc_custom_func, UInt32, (UInt32, Ref{F})), chartransform)
×
164
    ret < 0 && utf8proc_error(ret)
×
165
    return ret
×
166
end
167

168
function utf8proc_map(str::Union{String,SubString{String}}, options::Integer, chartransform::F = identity) where F
×
169
    nwords = utf8proc_decompose(str, options, C_NULL, 0, chartransform)
×
170
    buffer = Base.StringVector(nwords*4)
×
171
    nwords = utf8proc_decompose(str, options, buffer, nwords, chartransform)
×
172
    nbytes = ccall(:utf8proc_reencode, Int, (Ptr{UInt8}, Int, Cint), buffer, nwords, options)
×
173
    nbytes < 0 && utf8proc_error(nbytes)
×
174
    return String(resize!(buffer, nbytes))
×
175
end
176

177
"""
178
`Dict` of `original codepoint => replacement codepoint` normalizations
179
to perform on Julia identifiers, to canonicalize characters that
180
are both easily confused and easily inputted by accident.
181

182
!!! warning
183
    When this table is updated, also update the corresponding table in `src/flisp/julia_charmap.h`.
184
"""
185
const _julia_charmap = Dict{UInt32,UInt32}(
186
    0x025B => 0x03B5, # latin small letter open e -> greek small letter epsilon
187
    0x00B5 => 0x03BC, # micro sign -> greek small letter mu
188
    0x00B7 => 0x22C5, # middot char -> dot operator (#25098)
189
    0x0387 => 0x22C5, # Greek interpunct -> dot operator (#25098)
190
    0x2212 => 0x002D, # minus -> hyphen-minus (#26193)
191
    0x210F => 0x0127, # hbar -> small letter h with stroke (#48870)
192
)
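
As a rough illustration of how this table is used from user code, the sketch below goes through `Unicode.julia_chartransform` from the `Unicode` standard library. It assumes Julia 1.8 or later, where that function is available; the sample string is made up for the example.

```julia
using Unicode

# Assumes Julia 1.8 or later, where `Unicode.julia_chartransform` exists.
micro = '\u00b5'                              # MICRO SIGN
mu = Unicode.julia_chartransform(micro)       # mapped through the table

# Apply the same mapping while normalizing a whole string.
s = Unicode.normalize("\u00b5m", chartransform = Unicode.julia_chartransform)

println(mu == '\u03bc', " ", s == "\u03bcm")
```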
193

194
utf8proc_map(s::AbstractString, flags::Integer, chartransform::F = identity) where F = utf8proc_map(String(s), flags, chartransform)
×
195

196
# Documented in Unicode module
197
function normalize(
×
198
    s::AbstractString;
199
    stable::Bool=false,
200
    compat::Bool=false,
201
    compose::Bool=true,
202
    decompose::Bool=false,
203
    stripignore::Bool=false,
204
    rejectna::Bool=false,
205
    newline2ls::Bool=false,
206
    newline2ps::Bool=false,
207
    newline2lf::Bool=false,
208
    stripcc::Bool=false,
209
    casefold::Bool=false,
210
    lump::Bool=false,
211
    stripmark::Bool=false,
212
    chartransform=identity,
213
)
214
    flags = 0
×
215
    stable && (flags = flags | UTF8PROC_STABLE)
×
216
    compat && (flags = flags | UTF8PROC_COMPAT)
×
217
    # TODO: error if compose & decompose?
218
    if decompose
×
219
        flags = flags | UTF8PROC_DECOMPOSE
×
220
    elseif compose
×
221
        flags = flags | UTF8PROC_COMPOSE
×
222
    elseif compat || stripmark
×
223
        throw(ArgumentError("compat=true or stripmark=true require compose=true or decompose=true"))
×
224
    end
225
    stripignore && (flags = flags | UTF8PROC_IGNORE)
×
226
    rejectna && (flags = flags | UTF8PROC_REJECTNA)
×
227
    newline2ls + newline2ps + newline2lf > 1 && throw(ArgumentError("only one newline conversion may be specified"))
×
228
    newline2ls && (flags = flags | UTF8PROC_NLF2LS)
×
229
    newline2ps && (flags = flags | UTF8PROC_NLF2PS)
×
230
    newline2lf && (flags = flags | UTF8PROC_NLF2LF)
×
231
    stripcc && (flags = flags | UTF8PROC_STRIPCC)
×
232
    casefold && (flags = flags | UTF8PROC_CASEFOLD)
×
233
    lump && (flags = flags | UTF8PROC_LUMP)
×
234
    stripmark && (flags = flags | UTF8PROC_STRIPMARK)
×
235
    utf8proc_map(s, flags, chartransform)
×
236
end
237

238
function normalize(s::AbstractString, nf::Symbol)
×
239
    utf8proc_map(s, nf === :NFC ? (UTF8PROC_STABLE | UTF8PROC_COMPOSE) :
×
240
                    nf === :NFD ? (UTF8PROC_STABLE | UTF8PROC_DECOMPOSE) :
241
                    nf === :NFKC ? (UTF8PROC_STABLE | UTF8PROC_COMPOSE
242
                                   | UTF8PROC_COMPAT) :
243
                    nf === :NFKD ? (UTF8PROC_STABLE | UTF8PROC_DECOMPOSE
244
                                   | UTF8PROC_COMPAT) :
245
                    throw(ArgumentError(":$nf is not one of :NFC, :NFD, :NFKC, :NFKD")))
246
end
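
The two `normalize` methods above can be tried out through the `Unicode` standard library. This is a small sketch rather than an exhaustive tour of the keyword flags; the input strings are chosen for illustration.

```julia
using Unicode

composed = "\u00e9"                               # precomposed e-acute
decomposed = Unicode.normalize(composed, :NFD)    # 'e' followed by U+0301
recomposed = Unicode.normalize(decomposed, :NFC)

# Keyword form: each Bool keyword sets one utf8proc option flag.
folded = Unicode.normalize("JuLiA", casefold = true)
stripped = Unicode.normalize("J\u00faLiA", stripmark = true)

# Requesting two newline conversions at once is rejected.
conflict = try
    Unicode.normalize("a\nb", newline2ls = true, newline2ps = true)
    false
catch err
    err isa ArgumentError
end
```

Note that `stripmark=true` works here only because `compose` defaults to `true`; with both `compose=false` and `decompose=false` the keyword method throws an `ArgumentError`.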
247

248
############################################################################
249

250
## character column width function ##
251
"""
252
    textwidth(c)
253

254
Give the number of columns needed to print a character.
255

256
# Examples
257
```jldoctest
258
julia> textwidth('α')
259
1
260

261
julia> textwidth('⛵')
262
2
263
```
264
"""
265
function textwidth(c::AbstractChar)
×
266
    ismalformed(c) && return 1
×
267
    i = codepoint(c)
×
268
    i < 0x7f && return Int(i >= 0x20) # ASCII fast path
×
269
    Int(ccall(:utf8proc_charwidth, Cint, (UInt32,), i))
×
270
end
271

272
function textwidth(c::Char)
273
    b = bswap(reinterpret(UInt32, c)) # from isascii(c)
21✔
274
    b < 0x7f && return Int(b >= 0x20) # ASCII fast path
21✔
275
    ismalformed(c) && return 1
×
276
    Int(ccall(:utf8proc_charwidth, Cint, (UInt32,), c))
×
277
end
278

279
"""
280
    textwidth(s::AbstractString)
281

282
Give the number of columns needed to print a string.
283

284
# Examples
285
```jldoctest
286
julia> textwidth("March")
287
5
288
```
289
"""
290
textwidth(s::AbstractString) = mapreduce(textwidth, +, s; init=0)
143✔
291

292
textwidth(s::AnnotatedString) = textwidth(s.string)
×
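
A few edge cases of the `textwidth` methods above, as a hedged sketch: the inputs are picked here to hit the ASCII fast path, the wide-character path, and the malformed-character branch.

```julia
# Control characters take the ASCII fast path and report zero columns.
w_newline = textwidth('\n')
w_tabbed = textwidth("a\tb")       # 1 + 0 + 1

# Wide East Asian characters take two columns each.
w_cjk = textwidth("日本語")

# A malformed character (here a lone 0xff byte) is counted as one column.
bad = "\xff"[1]
w_bad = textwidth(bad)

println((w_newline, w_tabbed, w_cjk, w_bad))
```

Because a tab contributes zero columns, callers that align output containing tabs need to expand them before measuring.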
293

294
"""
295
    lowercase(c::AbstractChar)
296

297
Convert `c` to lowercase.
298

299
See also [`uppercase`](@ref), [`titlecase`](@ref).
300

301
# Examples
302
```jldoctest
303
julia> lowercase('A')
304
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
305

306
julia> lowercase('Ö')
307
'ö': Unicode U+00F6 (category Ll: Letter, lowercase)
308
```
309
"""
310
lowercase(c::T) where {T<:AbstractChar} = isascii(c) ? ('A' <= c <= 'Z' ? c + 0x20 : c) :
×
311
    T(ccall(:utf8proc_tolower, UInt32, (UInt32,), c))
312

313
lowercase(c::AnnotatedChar) = AnnotatedChar(lowercase(c.char), annotations(c))
×
314

315
"""
316
    uppercase(c::AbstractChar)
317

318
Convert `c` to uppercase.
319

320
See also [`lowercase`](@ref), [`titlecase`](@ref).
321

322
# Examples
323
```jldoctest
324
julia> uppercase('a')
325
'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)
326

327
julia> uppercase('ê')
328
'Ê': Unicode U+00CA (category Lu: Letter, uppercase)
329
```
330
"""
UNCOV
331
uppercase(c::T) where {T<:AbstractChar} = isascii(c) ? ('a' <= c <= 'z' ? c - 0x20 : c) :
×
332
    T(ccall(:utf8proc_toupper, UInt32, (UInt32,), c))
333

334
uppercase(c::AnnotatedChar) = AnnotatedChar(uppercase(c.char), annotations(c))
×
335

336
"""
337
    titlecase(c::AbstractChar)
338

339
Convert `c` to titlecase. This may differ from uppercase for digraphs,
340
compare the example below.
341

342
See also [`uppercase`](@ref), [`lowercase`](@ref).
343

344
# Examples
345
```jldoctest
346
julia> titlecase('a')
347
'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)
348

349
julia> titlecase('dž')
350
'Dž': Unicode U+01C5 (category Lt: Letter, titlecase)
351

352
julia> uppercase('dž')
353
'DŽ': Unicode U+01C4 (category Lu: Letter, uppercase)
354
```
355
"""
356
titlecase(c::T) where {T<:AbstractChar} = isascii(c) ? ('a' <= c <= 'z' ? c - 0x20 : c) :
×
357
    T(ccall(:utf8proc_totitle, UInt32, (UInt32,), c))
358

359
titlecase(c::AnnotatedChar) = AnnotatedChar(titlecase(c.char), annotations(c))
×
360

361
############################################################################
362

363
# returns UTF8PROC_CATEGORY code in 0:31 giving Unicode category (30 = too high, 31 = malformed)
364
function category_code(c::AbstractChar)
365
    !ismalformed(c) ? category_code(UInt32(c)) : Cint(31)
3,421✔
366
end
367

368
function category_code(x::Integer)
369
    x ≤ 0x10ffff ? (@assume_effects :foldable @ccall utf8proc_category(UInt32(x)::UInt32)::Cint) : Cint(30)
3,421✔
370
end
371

372
# more human-readable representations of the category code
373
function category_abbrev(c::AbstractChar)
×
374
    ismalformed(c) && return "Ma"
×
375
    c ≤ '\U10ffff' || return "In"
×
376
    unsafe_string(ccall(:utf8proc_category_string, Cstring, (UInt32,), c))
×
377
end
378

379
category_string(c) = category_strings[category_code(c)+1]
×
380

381
isassigned(c) = UTF8PROC_CATEGORY_CN < category_code(c) <= UTF8PROC_CATEGORY_CO
×
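
The category helpers above are internal and not exported, so the sketch below reaches them through `Base.Unicode`. The alias `U` and the sample characters are assumptions made for this example only.

```julia
# `category_code`, `category_abbrev`, and `category_string` are internal,
# non-exported helpers, so they are qualified with `Base.Unicode` here.
const U = Base.Unicode

for c in ('a', 'A', '1', ' ', '$')
    println(repr(c), ": ", U.category_code(c), " ",
            U.category_abbrev(c), " ", U.category_string(c))
end

# A lone 0xff byte yields a malformed character, which maps to code 31.
bad = "\xff"[1]
println(U.category_code(bad), " ", U.category_abbrev(bad), " ", U.category_string(bad))
```

The code doubles as a zero-based index into `category_strings`, which is why `category_string` adds 1 before indexing.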
382

383
## libc character class predicates ##
384

385
"""
386
    islowercase(c::AbstractChar) -> Bool
387

388
Tests whether a character is a lowercase letter (according to the Unicode
389
standard's `Lowercase` derived property).
390

391
See also [`isuppercase`](@ref).
392

393
# Examples
394
```jldoctest
395
julia> islowercase('α')
396
true
397

398
julia> islowercase('Γ')
399
false
400

401
julia> islowercase('❤')
402
false
403
```
404
"""
405
islowercase(c::AbstractChar) = ismalformed(c) ? false :
×
406
    Bool(@assume_effects :foldable @ccall utf8proc_islower(UInt32(c)::UInt32)::Cint)
×
407

408
# true for Unicode upper and mixed case
409

410
"""
411
    isuppercase(c::AbstractChar) -> Bool
412

413
Tests whether a character is an uppercase letter (according to the Unicode
414
standard's `Uppercase` derived property).
415

416
See also [`islowercase`](@ref).
417

418
# Examples
419
```jldoctest
420
julia> isuppercase('γ')
421
false
422

423
julia> isuppercase('Γ')
424
true
425

426
julia> isuppercase('❤')
427
false
428
```
429
"""
430
isuppercase(c::AbstractChar) = ismalformed(c) ? false :
×
431
    Bool(@assume_effects :foldable @ccall utf8proc_isupper(UInt32(c)::UInt32)::Cint)
×
432

433
"""
434
    iscased(c::AbstractChar) -> Bool
435

436
Tests whether a character is cased, i.e. is lower-, upper- or title-cased.
437

438
See also [`islowercase`](@ref), [`isuppercase`](@ref).
439
"""
440
function iscased(c::AbstractChar)
×
441
    cat = category_code(c)
×
442
    return cat == UTF8PROC_CATEGORY_LU ||
×
443
           cat == UTF8PROC_CATEGORY_LT ||
444
           cat == UTF8PROC_CATEGORY_LL
445
end
446

447

448
"""
449
    isdigit(c::AbstractChar) -> Bool
450

451
Tests whether a character is an ASCII decimal digit (`0`-`9`).
452

453
See also: [`isletter`](@ref).
454

455
# Examples
456
```jldoctest
457
julia> isdigit('❤')
458
false
459

460
julia> isdigit('9')
461
true
462

463
julia> isdigit('α')
464
false
465
```
466
"""
467
isdigit(c::AbstractChar) = (c >= '0') & (c <= '9')
332,952✔
468

469
"""
470
    isletter(c::AbstractChar) -> Bool
471

472
Test whether a character is a letter.
473
A character is classified as a letter if it belongs to the Unicode general
474
category Letter, i.e. a character whose category code begins with 'L'.
475

476
See also: [`isdigit`](@ref).
477

478
# Examples
479
```jldoctest
480
julia> isletter('❤')
481
false
482

483
julia> isletter('α')
484
true
485

486
julia> isletter('9')
487
false
488
```
489
"""
490
isletter(c::AbstractChar) = UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_LO
×
491

492
"""
493
    isnumeric(c::AbstractChar) -> Bool
494

495
Tests whether a character is numeric.
496
A character is classified as numeric if it belongs to the Unicode general category Number,
497
i.e. a character whose category code begins with 'N'.
498

499
Note that this broad category includes characters such as ¾ and ௰.
500
Use [`isdigit`](@ref) to check whether a character is a decimal digit between 0 and 9.
501

502
# Examples
503
```jldoctest
504
julia> isnumeric('௰')
505
true
506

507
julia> isnumeric('9')
508
true
509

510
julia> isnumeric('α')
511
false
512

513
julia> isnumeric('❤')
514
false
515
```
516
"""
517
isnumeric(c::AbstractChar) = UTF8PROC_CATEGORY_ND <= category_code(c) <= UTF8PROC_CATEGORY_NO
3✔
518

519
# following C++, only control characters from the Latin-1 subset return true
520

521
"""
522
    iscntrl(c::AbstractChar) -> Bool
523

524
Tests whether a character is a control character.
525
Control characters are the non-printing characters of the Latin-1 subset of Unicode.
526

527
# Examples
528
```jldoctest
529
julia> iscntrl('\\x01')
530
true
531

532
julia> iscntrl('a')
533
false
534
```
535
"""
536
iscntrl(c::AbstractChar) = c <= '\x1f' || '\x7f' <= c <= '\u9f'
15,966✔
537

538
"""
539
    ispunct(c::AbstractChar) -> Bool
540

541
Tests whether a character belongs to the Unicode general category Punctuation, i.e. a
542
character whose category code begins with 'P'.
543

544
!!! note
545
    This behavior is different from the `ispunct` function in C.
546

547
# Examples
548
```jldoctest
549
julia> ispunct('α')
550
false
551

552
julia> ispunct('=')
553
false
554

555
julia> ispunct('/')
556
true
557

558
julia> ispunct(';')
559
true
560
```
561
"""
562
ispunct(c::AbstractChar) = UTF8PROC_CATEGORY_PC <= category_code(c) <= UTF8PROC_CATEGORY_PO
×
563

564
# \u85 is the Unicode Next Line (NEL) character
565

566
"""
567
    isspace(c::AbstractChar) -> Bool
568

569
Tests whether a character is any whitespace character. Includes ASCII characters '\\t',
570
'\\n', '\\v', '\\f', '\\r', and ' ', Latin-1 character U+0085, and characters in Unicode
571
category Zs.
572

573
# Examples
574
```jldoctest
575
julia> isspace('\\n')
576
true
577

578
julia> isspace('\\r')
579
true
580

581
julia> isspace(' ')
582
true
583

584
julia> isspace('\\x20')
585
true
586
```
587
"""
588
@inline isspace(c::AbstractChar) =
4,156✔
589
    c == ' ' || '\t' <= c <= '\r' || c == '\u85' ||
590
    '\ua0' <= c && category_code(c) == UTF8PROC_CATEGORY_ZS
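
To make the branches of the `isspace` definition above concrete, here is a small sketch with one character per branch. The expected values follow from reading the predicate: only category Zs is consulted for non-Latin-1 characters, so U+2028 (category Zl) is not treated as whitespace.

```julia
# One sample character per branch of the predicate (illustrative choices).
checks = [
    ('\t',     isspace('\t')),      # ASCII range '\t' through '\r'
    ('\u85',   isspace('\u85')),    # NEL, special-cased
    ('\ua0',   isspace('\ua0')),    # NO-BREAK SPACE, category Zs
    ('\u2003', isspace('\u2003')),  # EM SPACE, category Zs
    ('\u2028', isspace('\u2028')),  # LINE SEPARATOR, category Zl
    ('a',      isspace('a')),       # ordinary letter
]

for (c, result) in checks
    println(repr(c), " => ", result)
end
```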
591

592
"""
593
    isprint(c::AbstractChar) -> Bool
594

595
Tests whether a character is printable, including spaces, but not a control character.
596

597
# Examples
598
```jldoctest
599
julia> isprint('\\x01')
600
false
601

602
julia> isprint('A')
603
true
604
```
605
"""
606
isprint(c::AbstractChar) = UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_ZS
3,418✔
607

608
# true in principle if a printer would use ink
609

610
"""
611
    isxdigit(c::AbstractChar) -> Bool
612

613
Test whether a character is a valid hexadecimal digit. Note that this does not
614
include `x` (as in the standard `0x` prefix).
615

616
# Examples
617
```jldoctest
618
julia> isxdigit('a')
619
true
620

621
julia> isxdigit('x')
622
false
623
```
624
"""
625
isxdigit(c::AbstractChar) = '0'<=c<='9' || 'a'<=c<='f' || 'A'<=c<='F'
×
626

627
## uppercase, lowercase, and titlecase transformations ##
628

629
"""
630
    uppercase(s::AbstractString)
631

632
Return `s` with all characters converted to uppercase.
633

634
See also [`lowercase`](@ref), [`titlecase`](@ref), [`uppercasefirst`](@ref).
635

636
# Examples
637
```jldoctest
638
julia> uppercase("Julia")
639
"JULIA"
640
```
641
"""
642
uppercase(s::AbstractString) = map(uppercase, s)
132✔
643
uppercase(s::AnnotatedString) = annotated_chartransform(uppercase, s)
×
644

645
"""
646
    lowercase(s::AbstractString)
647

648
Return `s` with all characters converted to lowercase.
649

650
See also [`uppercase`](@ref), [`titlecase`](@ref), [`lowercasefirst`](@ref).
651

652
# Examples
653
```jldoctest
654
julia> lowercase("STRINGS AND THINGS")
655
"strings and things"
656
```
657
"""
658
lowercase(s::AbstractString) = map(lowercase, s)
2,500✔
659
lowercase(s::AnnotatedString) = annotated_chartransform(lowercase, s)
×
660

661
"""
662
    titlecase(s::AbstractString; [wordsep::Function], strict::Bool=true) -> String
663

664
Capitalize the first character of each word in `s`;
665
if `strict` is true, every other character is
666
converted to lowercase, otherwise they are left unchanged.
667
By default, all non-letters beginning a new grapheme are considered as word separators;
668
a predicate can be passed as the `wordsep` keyword to determine
669
which characters should be considered as word separators.
670
See also [`uppercasefirst`](@ref) to capitalize only the first
671
character in `s`.
672

673
See also [`uppercase`](@ref), [`lowercase`](@ref), [`uppercasefirst`](@ref).
674

675
# Examples
676
```jldoctest
677
julia> titlecase("the JULIA programming language")
678
"The Julia Programming Language"
679

680
julia> titlecase("ISS - international space station", strict=false)
681
"ISS - International Space Station"
682

683
julia> titlecase("a-a b-b", wordsep = c->c==' ')
684
"A-a B-b"
685
```
686
"""
687
function titlecase(s::AbstractString; wordsep::Function = !isletter, strict::Bool=true)
×
688
    startword = true
×
689
    state = Ref{Int32}(0)
×
690
    c0 = eltype(s)(0x00000000)
×
691
    b = IOBuffer()
×
692
    for c in s
×
693
        # Note: It would be better to have a word iterator following UAX#29,
694
        # similar to our grapheme iterator, but utf8proc does not yet have
695
        # this information.  At the very least we shouldn't break inside graphemes.
696
        if isgraphemebreak!(state, c0, c) && wordsep(c)
×
697
            print(b, c)
×
698
            startword = true
×
699
        else
700
            print(b, startword ? titlecase(c) : strict ? lowercase(c) : c)
×
701
            startword = false
×
702
        end
703
        c0 = c
×
704
    end
×
705
    return String(take!(b))
×
706
end
707

708
# TODO: improve performance characteristics, room for a ~10x improvement.
709
function titlecase(s::AnnotatedString; wordsep::Function = !isletter, strict::Bool=true)
×
710
    initial_state = (; startword = true, state = Ref{Int32}(0),
×
711
             c0 = eltype(s)(zero(UInt32)), wordsep, strict)
712
    annotated_chartransform(s, initial_state) do c, state
×
713
        if isgraphemebreak!(state.state, state.c0, c) && state.wordsep(c)
×
714
            state = Base.setindex(state, true, :startword)
×
715
            cnew = c
×
716
        else
717
            cnew = state.startword ? titlecase(c) : state.strict ? lowercase(c) : c
×
718
            state = Base.setindex(state, false, :startword)
×
719
        end
720
        state = Base.setindex(state, c, :c0)
×
721
        cnew, state
×
722
    end
723
end
724

725
"""
726
    uppercasefirst(s::AbstractString) -> String
727

728
Return `s` with the first character converted to uppercase (technically "title
729
case" for Unicode). See also [`titlecase`](@ref) to capitalize the first
730
character of every word in `s`.
731

732
See also [`lowercasefirst`](@ref), [`uppercase`](@ref), [`lowercase`](@ref),
733
[`titlecase`](@ref).
734

735
# Examples
736
```jldoctest
737
julia> uppercasefirst("python")
738
"Python"
739
```
740
"""
741
function uppercasefirst(s::AbstractString)
×
742
    isempty(s) && return ""
×
743
    c = s[1]
×
744
    c′ = titlecase(c)
×
745
    c == c′ ? convert(String, s) :
×
746
    string(c′, SubString(s, nextind(s, 1)))
747
end
748

749
# TODO: improve performance characteristics, room for a ~5x improvement.
750
function uppercasefirst(s::AnnotatedString)
×
751
    annotated_chartransform(s, true) do c, state
×
752
        if state
×
753
            (titlecase(c), false)
×
754
        else
755
            (c, state)
×
756
        end
757
    end
758
end
759

760
"""
761
    lowercasefirst(s::AbstractString)
762

763
Return `s` with the first character converted to lowercase.
764

765
See also [`uppercasefirst`](@ref), [`uppercase`](@ref), [`lowercase`](@ref),
766
[`titlecase`](@ref).
767

768
# Examples
769
```jldoctest
770
julia> lowercasefirst("Julia")
771
"julia"
772
```
773
"""
774
function lowercasefirst(s::AbstractString)
×
775
    isempty(s) && return ""
×
776
    c = s[1]
×
777
    c′ = lowercase(c)
×
778
    c == c′ ? convert(String, s) :
×
779
    string(c′, SubString(s, nextind(s, 1)))
780
end
781

782
# TODO: improve performance characteristics, room for a ~5x improvement.
783
function lowercasefirst(s::AnnotatedString)
×
784
    annotated_chartransform(s, true) do c, state
×
785
        if state
×
786
            (lowercase(c), false)
×
787
        else
788
            (c, state)
×
789
        end
790
    end
791
end
792

793
############################################################################
794
# iterators for grapheme segmentation
795

796
isgraphemebreak(c1::AbstractChar, c2::AbstractChar) =
×
797
    ismalformed(c1) || ismalformed(c2) ||
798
    ccall(:utf8proc_grapheme_break, Bool, (UInt32, UInt32), c1, c2)
799

800
# Stateful grapheme break required by Unicode-9 rules: the string
801
# must be processed in sequence, with state initialized to Ref{Int32}(0).
802
# Requires utf8proc v2.0 or later.
803
function isgraphemebreak!(state::Ref{Int32}, c1::AbstractChar, c2::AbstractChar)
×
804
    if ismalformed(c1) || ismalformed(c2)
×
805
        state[] = 0
×
806
        return true
×
807
    end
808
    ccall(:utf8proc_grapheme_break_stateful, Bool,
×
809
          (UInt32, UInt32, Ref{Int32}), c1, c2, state)
810
end
811

812
struct GraphemeIterator{S<:AbstractString}
813
    s::S # original string (for generation of SubStrings)
814
end
815

816
# Documented in Unicode module
817
graphemes(s::AbstractString) = GraphemeIterator{typeof(s)}(s)
×
818

819
eltype(::Type{GraphemeIterator{S}}) where {S} = SubString{S}
×
820
eltype(::Type{GraphemeIterator{SubString{S}}}) where {S} = SubString{S}
×
821

822
function length(g::GraphemeIterator{S}) where {S}
×
823
    c0 = eltype(S)(0x00000000)
×
824
    n = 0
×
825
    state = Ref{Int32}(0)
×
826
    for c in g.s
×
827
        n += isgraphemebreak!(state, c0, c)
×
828
        c0 = c
×
829
    end
×
830
    return n
×
831
end
832

833
function iterate(g::GraphemeIterator, i_=(Int32(0),firstindex(g.s)))
×
834
    s = g.s
×
835
    statei, i = i_
×
836
    state = Ref{Int32}(statei)
×
837
    j = i
×
838
    y = iterate(s, i)
×
839
    y === nothing && return nothing
×
840
    c0, k = y
×
841
    while k <= ncodeunits(s) # loop until next grapheme is s[i:j]
×
842
        c, ℓ = iterate(s, k)::NTuple{2,Any}
×
843
        isgraphemebreak!(state, c0, c) && break
×
844
        j = k
×
845
        k = ℓ
×
846
        c0 = c
×
847
    end
×
848
    return (SubString(s, i, j), (state[], k))
×
849
end
850

851
==(g1::GraphemeIterator, g2::GraphemeIterator) = g1.s == g2.s
×
852
hash(g::GraphemeIterator, h::UInt) = hash(g.s, h)
×
853
isless(g1::GraphemeIterator, g2::GraphemeIterator) = isless(g1.s, g2.s)
×
854

855
show(io::IO, g::GraphemeIterator{S}) where {S} = print(io, "length-$(length(g)) GraphemeIterator{$S} for \"$(g.s)\"")
×
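
The grapheme iterator defined above is exposed as `graphemes` in the `Unicode` standard library. A minimal usage sketch, with an input string chosen for illustration:

```julia
using Unicode

s = "e\u0301a"                 # 'e' + COMBINING ACUTE ACCENT, then 'a'
gs = collect(graphemes(s))     # Vector of SubString{String}

println(length(s), " code points, ", length(graphemes(s)), " graphemes")
```

`length(s)` counts code points, while `length(graphemes(s))` counts user-perceived characters by walking the string with the stateful break test.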
856

857
############################################################################
858

859
end # module