JuliaLang / julia / 1503

12 Apr 2026 10:07PM UTC coverage: 77.837% (-0.09%) from 77.924%
1503

push

buildkite

web-flow
Port `StringView` to Base (#60526)

This PR ports
[StringViews.jl](https://github.com/JuliaStrings/StringViews.jl) to
Base, as discussed in #60037. Closes #60037.

_Tests have been written by a chat-bot and reviewed by me_

## Decisions for reviewers before reviewing
* Can we use `@inbounds` with a `StringView` wrapping an unknown array
type? The code currently does, but I'm not against removing it. If we do
remove it, we may need to rework a bunch of string-related code:
strings and string views share code paths, so removing the
`@inbounds` annotations could degrade `String` performance.

## Motivation
We generally want to avoid adding more types to Base when it can live in
a package, so this PR requires some justification:

### StringViews are the more fundamental string
`String` and `SubString` are abstractions over an underlying byte
array, and almost all of their operations are defined in terms of loading
bytes from that array. This abstraction of strings as "arrays in a
trench coat" is made explicit by string views.
Hence, `String` and `SubString` can be implemented in terms of string
views, but not the other way around. It is better for Base to contain
the foundations and for packages to build implementations on top of them,
rather than the reverse.
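
To make the "arrays in a trench coat" picture concrete, here is a small sketch that uses only `String` functions already present in Base (it deliberately does not use the new `StringView` type). It shows that character-level answers are derived from the underlying bytes:

```julia
s = "aé€"                    # 1-, 2- and 3-byte UTF-8 characters
bytes = codeunits(s)         # read-only byte view of the string's data

nbytes = ncodeunits(s)       # number of bytes, not characters
nchars = length(s)           # obtained by scanning the bytes

# Byte 3 is the continuation byte of 'é', so it is not a valid character index.
valid_at_2 = isvalid(s, 2)
valid_at_3 = isvalid(s, 3)
start_of_3 = thisind(s, 3)   # start of the character that contains byte 3
after_2 = nextind(s, 2)      # start of the character that follows 'é'

# Copy the bytes into a fresh vector and rebuild an equal string from them.
roundtrip = String(Vector{UInt8}(bytes))
```

Every value above is computed from `bytes` alone, which is the sense in which a byte array is the more fundamental object.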

One hint that the relationship is inverted is the
implementation of `StringViews.jl`: it re-implements tonnes of internal
Base string functions, instead of calling into generic, foundational
methods in Base.
We also have comments like this in Base: `# duck-type s so that external
UTF-8 string packages like StringViews can hook in`, where the sensible
thing would be to _define_ `getindex` for UTF-8-encoded strings in terms of
this method, instead of encouraging use of Base internals.
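
As a rough illustration of what defining character access in terms of the bytes could look like, the sketch below decodes one character from any one-based `AbstractVector{UInt8}`. The name `decode_at` is hypothetical and is not part of Base or of this PR. It is also a simplification: unlike Base's `getindex`, it returns an invalid `Char` for a continuation byte instead of throwing a `StringIndexError`.

```julia
# Hypothetical helper (not in Base): decode the character that starts at byte `i`.
# Assumes `bytes` uses one-based indexing.
function decode_at(bytes::AbstractVector{UInt8}, i::Int)
    b = bytes[i]                       # bounds-checked load of the lead byte
    u = UInt32(b) << 24                # a Char stores its UTF-8 bytes left-aligned
    # ASCII bytes and bytes >= 0xf8 always form a one-byte Char
    (0x80 <= b <= 0xf7) || return reinterpret(Char, u)
    # Simplification: a continuation byte is returned as an invalid Char
    b < 0xc0 && return reinterpret(Char, u)
    needed = b < 0xe0 ? 1 : (b < 0xf0 ? 2 : 3)   # continuation bytes announced by the lead byte
    shift = 16
    for k in 1:needed
        j = i + k
        j > length(bytes) && break     # truncated sequence: stop early
        c = bytes[j]
        c & 0xc0 == 0x80 || break      # not a continuation byte: stop early
        u |= UInt32(c) << shift
        shift -= 8
    end
    return reinterpret(Char, u)
end

b = Vector{UInt8}("aé€😀")
chars = [decode_at(b, i) for i in (1, 2, 4, 7)]   # byte offsets where characters start
```

Because the function only loads bytes by index, it works for a plain `Vector{UInt8}` as well as for other one-based byte vectors.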

Because the string abstraction and string views overlap so strongly,
this PR adds only ~100 LOC in `base/strings/stringview.jl`,
whereas `src/` in `StringViews.jl` has ~800 LO... (continued)

75 of 93 new or added lines in 8 files covered. (80.65%)

132 existing lines in 11 files now uncovered.

65347 of 83954 relevant lines covered (77.84%)

24288390.47 hits per line

Source File

88.81
/base/strings/string.jl
1
# This file is a part of Julia. License is MIT: https://julialang.org/license
2

3
"""
4
    StringView{T <: AbstractVector{UInt8}} <: AbstractString
5

6
An `AbstractString` representation of any `vector` of `UInt8` data,
7
interpreted as UTF-8 encoded Unicode.
8
Similar to `String`, the underlying data may be invalid UTF-8.
9

10
`StringView(v::AbstractVector{UInt8})::StringView` does not make a copy of
11
or modify `v`. Use `codeunits` to get `v` from the `StringView`.
12
After construction, `v` may be mutated, which will be reflected in
13
the resulting `StringView`.
14

15
!!! compat "Julia 1.14"
16
    The `StringView` type requires at least Julia 1.14.
17

18
# Examples
19
```jldoctest
20
julia> arr = [0x61, 0xf0, 0x63, 0x64];
21

22
julia> s = StringView(arr)
23
"a\\xf0cd"
24

25
julia> codeunits(s) === arr
26
true
27

28
julia> arr[2] = Int('b'); s
29
"abcd"
30
```
31
"""
32
struct StringView{T <: AbstractVector{UInt8}} <: AbstractString
33
    data::T
34

NEW
35
    function StringView{T}(data::T) where {T <: AbstractVector{UInt8}}
×
36
        # For now, StringViews code assumes one-based indexing
NEW
37
        require_one_based_indexing(data)
×
38

39
        # Prevent someone constructing e.g. a `StringView{AbstractVector{UInt8}}`,
40
        # the existence of which will complicate the implementation and provide
41
        # no usability benefit.
NEW
42
        if !isconcretetype(T)
×
NEW
43
            throw(ArgumentError("StringView must be parameterized with a concrete type"))
×
44
        end
45

NEW
46
        new{T}(data)
×
47
    end
48
end
49

50

51
"""
52
    StringIndexError(str, i)
53

54
An error occurred when trying to access `str` at index `i` that is not valid.
55
"""
56
struct StringIndexError <: Exception
57
    string::AbstractString
10✔
58
    index::Int
59
end
60
@noinline string_index_err((@nospecialize s::AbstractString), i::Integer) =
2✔
61
    throw(StringIndexError(s, Int(i)))
62
function showerror(io::IO, exc::StringIndexError)
8✔
63
    s = exc.string
8✔
64
    print(io, "StringIndexError: ", "invalid index [$(exc.index)]")
8✔
65
    if firstindex(s) <= exc.index <= ncodeunits(s)
8✔
66
        iprev = thisind(s, exc.index)
8✔
67
        inext = nextind(s, iprev)
8✔
68
        escprev = escape_string(s[iprev:iprev])
8✔
69
        if inext <= ncodeunits(s)
8✔
70
            escnext = escape_string(s[inext:inext])
6✔
71
            print(io, ", valid nearby indices [$iprev]=>'$escprev', [$inext]=>'$escnext'")
6✔
72
        else
73
            print(io, ", valid nearby index [$iprev]=>'$escprev'")
2✔
74
        end
75
    end
76
end
77

78
@inline between(b::T, lo::T, hi::T) where {T<:Integer} = (lo ≤ b) & (b ≤ hi)
885,293,414✔
79

80
"""
81
    String <: AbstractString
82

83
The default string type in Julia, used by e.g. string literals.
84

85
`String`s are immutable sequences of `Char`s. A `String` is stored internally as
86
a contiguous byte array, and while they are interpreted as being UTF-8 encoded,
87
they can be composed of any byte sequence. Use [`isvalid`](@ref) to validate
88
that the underlying byte sequence is valid as UTF-8.
89
"""
90
String
91

92
## constructors and conversions ##
93

94
# String constructor docstring from boot.jl, workaround for #16730
95
# and the unavailability of @doc in boot.jl context.
96
"""
97
    String(v::AbstractVector{UInt8})
98

99
Create a new `String` object using the data buffer from byte vector `v`.
100
If `v` is a `Vector{UInt8}` it will be truncated to zero length and future
101
modification of `v` cannot affect the contents of the resulting string.
102
To avoid truncation of `Vector{UInt8}` data, use `String(copy(v))`; for other
103
`AbstractVector` types, `String(v)` already makes a copy.
104

105
When possible, the memory of `v` will be used without copying when the `String`
106
object is created. This is guaranteed to be the case for byte vectors returned
107
by [`take!`](@ref) on a writable [`IOBuffer`](@ref) and by calls to
108
[`read(io, nb)`](@ref). This allows zero-copy conversion of I/O data to strings.
109
In other cases, `Vector{UInt8}` data may be copied, but `v` is truncated anyway
110
to guarantee consistent behavior.
111
"""
112
String(v::AbstractVector{UInt8}) = unsafe_takestring(copyto!(StringMemory(length(v)), v))
14,360,742✔
113

114
function String(v::Vector{UInt8})
4,089✔
115
    len = length(v)
22,816,275✔
116
    len == 0 && return ""
22,816,275✔
117
    ref = v.ref
22,723,286✔
118
    if ref.ptr_or_offset == ref.mem.ptr
22,723,290✔
119
        str = ccall(:jl_genericmemory_to_string, Ref{String}, (Any, Int), ref.mem, len)
22,723,285✔
120
    else
121
        str = ccall(:jl_pchar_to_string, Ref{String}, (Ptr{UInt8}, Int), ref, len)
5✔
122
    end
123
    # optimized empty!(v); sizehint!(v, 0) calls
124
    setfield!(v, :size, (0,))
22,723,290✔
125
    setfield!(v, :ref, memoryref(Memory{UInt8}()))
22,723,286✔
126
    return str
22,723,290✔
127
end
128

129
"""
130
    unsafe_takestring(m::Memory{UInt8})::String
131

132
Create a `String` from `m`, changing the interpretation of the contents of `m`.
133
This is done without copying, if possible. Thus, any access to `m` after
134
calling this function, either to read or to write, is undefined behavior.
135
"""
136
function unsafe_takestring(m::Memory{UInt8})
137
    isempty(m) ? "" : ccall(:jl_genericmemory_to_string, Ref{String}, (Any, Int), m, length(m))
14,362,624✔
138
end
139

140
"""
141
    takestring!(x) -> String
142

143
Create a string from the content of `x`, emptying `x`.
144

145
# Examples
146
```jldoctest
147
julia> v = [0x61, 0x62, 0x63];
148

149
julia> s = takestring!(v)
150
"abc"
151

152
julia> isempty(v)
153
true
154
```
155
"""
156
takestring!(v::Vector{UInt8}) = String(v)
×
157

158
"""
159
    unsafe_string(p::Ptr{UInt8}, [length::Integer])
160
    unsafe_string(p::Cstring)
161

162
Copy a string from the address of a C-style (NUL-terminated) string encoded as UTF-8.
163
(The pointer can be safely freed afterwards.) If `length` is specified
164
(the length of the data in bytes), the string does not have to be NUL-terminated.
165

166
This function is labeled "unsafe" because it will crash if `p` is not
167
a valid memory address to data of the requested length.
168
"""
169
function unsafe_string(p::Union{Ptr{UInt8},Ptr{Int8}}, len::Integer)
876✔
170
    p == C_NULL && throw(ArgumentError("cannot convert NULL to string"))
5,465,816✔
171
    ccall(:jl_pchar_to_string, Ref{String}, (Ptr{UInt8}, Int), p, len)
5,465,814✔
172
end
173
function unsafe_string(p::Union{Ptr{UInt8},Ptr{Int8}})
2,772✔
174
    p == C_NULL && throw(ArgumentError("cannot convert NULL to string"))
6,866,730✔
175
    ccall(:jl_cstr_to_string, Ref{String}, (Ptr{UInt8},), p)
6,866,730✔
176
end
177

178
# This is `@assume_effects :total !:consistent @ccall jl_alloc_string(n::Csize_t)::Ref{String}`,
179
# but the macro is not available at this time in bootstrap, so we write it manually.
180
const _string_n_override = 0x04ee
181
@eval _string_n(n::Integer) = $(Expr(:foreigncall, QuoteNode(:jl_alloc_string), Ref{String},
141,309,159✔
182
    :(Core.svec(Csize_t)), 1, QuoteNode((:ccall, _string_n_override, false)), :(convert(Csize_t, n))))
183

184
"""
185
    String(s::AbstractString)
186

187
Create a new `String` from an existing `AbstractString`.
188
"""
189
String(s::AbstractString) = print_to_string(s)
1,074✔
190
@assume_effects :total String(s::Symbol) = unsafe_string(unsafe_convert(Ptr{UInt8}, s))
6,329,763✔
191

192
unsafe_wrap(::Type{Memory{UInt8}}, s::String) = ccall(:jl_string_to_genericmemory, Ref{Memory{UInt8}}, (Any,), s)
24,332,135✔
193
unsafe_wrap(::Type{Vector{UInt8}}, s::String) = wrap(Array, unsafe_wrap(Memory{UInt8}, s))
120,204✔
194

195
Vector{UInt8}(s::CodeUnits{UInt8,String}) = copyto!(Vector{UInt8}(undef, length(s)), s)
63,062✔
196
Vector{UInt8}(s::String) = Vector{UInt8}(codeunits(s))
63,022✔
197
Array{UInt8}(s::String)  = Vector{UInt8}(codeunits(s))
×
198

199
String(s::CodeUnits{UInt8,String}) = s.s
2✔
200

201
## low-level functions ##
202

203
pointer(s::String) = unsafe_convert(Ptr{UInt8}, s)
1,731,105,753✔
204
pointer(s::String, i::Integer) = pointer(s) + Int(i)::Int - 1
839,377,372✔
205

206
ncodeunits(s::String) = Core.sizeof(s)
1,305,210,907✔
207
codeunit(s::String) = UInt8
2,129,112✔
208

209
codeunit(s::String, i::Integer) = codeunit(s, Int(i)::Int)
4✔
210
@assume_effects :foldable @inline function codeunit(s::String, i::Int)
17,267✔
211
    @boundscheck checkbounds(s, i)
824,843,793✔
212
    b = GC.@preserve s unsafe_load(pointer(s, i))
824,843,841✔
213
    return b
820,036,443✔
214
end
215

216
## comparison ##
217

218
@assume_effects :total _memcmp(a::String, b::String) = @invoke _memcmp(a::Union{Ptr{UInt8},AbstractString},b::Union{Ptr{UInt8},AbstractString})
407,944✔
219

220
_memcmp(a::Union{Ptr{UInt8},AbstractString}, b::Union{Ptr{UInt8},AbstractString}) = _memcmp(a, b, min(sizeof(a), sizeof(b)))
985,313✔
221
function _memcmp(a::Union{Ptr{UInt8},AbstractString}, b::Union{Ptr{UInt8},AbstractString}, len::Int)
35✔
222
    GC.@preserve a b begin
1,208,628✔
223
        pa = unsafe_convert(Ptr{UInt8}, a)
1,208,628✔
224
        pb = unsafe_convert(Ptr{UInt8}, b)
1,208,628✔
225
        memcmp(pa, pb, len % Csize_t) % Int
1,208,628✔
226
    end
227
end
228

229
function cmp(a::String, b::String)
2✔
230
    al, bl = sizeof(a), sizeof(b)
407,944✔
231
    c = _memcmp(a, b)
407,944✔
232
    return c < 0 ? -1 : c > 0 ? +1 : cmp(al,bl)
630,273✔
233
end
234

235
==(a::String, b::String) = a===b
34,217,670✔
236

237
typemin(::Type{String}) = ""
×
238
typemin(::String) = typemin(String)
×
239

240
## thisind, nextind ##
241

242
@propagate_inbounds thisind(s::String, i::Int) = _thisind_str(s, i)
149,613,053✔
243

244
# s should be String, StringView, or SubString{String}
245
@inline function _thisind_str(s, i::Int)
7,417✔
246
    i == 0 && return 0
75,282,739✔
247
    n = ncodeunits(s)
75,166,242✔
248
    i == n + 1 && return i
75,166,242✔
249
    @boundscheck between(i, 1, n) || throw(BoundsError(s, i))
75,166,228✔
250
    @inbounds b = codeunit(s, i)
75,166,228✔
251
    (b & 0xc0 == 0x80) & (i-1 > 0) || return i
141,002,266✔
252
    (@noinline function _thisind_continued(s, i, n) # mark the rest of the function as a slow-path
8,602,608✔
253
        local b
64,634✔
254
        @inbounds b = codeunit(s, i-1)
64,634✔
255
        between(b, 0b11000000, 0b11110111) && return i-1
64,634✔
256
        (b & 0xc0 == 0x80) & (i-2 > 0) || return i
38,574✔
257
        @inbounds b = codeunit(s, i-2)
38,574✔
258
        between(b, 0b11100000, 0b11110111) && return i-2
38,574✔
259
        (b & 0xc0 == 0x80) & (i-3 > 0) || return i
11✔
260
        @inbounds b = codeunit(s, i-3)
11✔
261
        between(b, 0b11110000, 0b11110111) && return i-3
11✔
262
        return i
×
263
    end)(s, i, n)
264
end
265

266
@propagate_inbounds nextind(s::String, i::Int) = _nextind_str(s, i)
93,537,272✔
267

268
# s should be String or SubString{String}
269
@inline function _nextind_str(s, i::Int)
4,076✔
270
    i == 0 && return 1
127,413,480✔
271
    n = ncodeunits(s)
127,400,914✔
272
    @boundscheck between(i, 1, n) || throw(BoundsError(s, i))
127,400,914✔
273
    @inbounds l = codeunit(s, i)
127,400,914✔
274
    between(l, 0x80, 0xf7) || return i+1
254,619,978✔
275
    (@noinline function _nextind_continued(s, i, n, l) # mark the rest of the function as a slow-path
195,841✔
276
        if l < 0xc0
13,995✔
277
            # handle invalid codeunit index by scanning back to the start of this index
278
            # (which may be the same as this index)
279
            i′ = @inbounds thisind(s, i)
×
280
            i′ >= i && return i+1
×
281
            i = i′
×
282
            @inbounds l = codeunit(s, i)
×
283
            (l < 0x80) | (0xf8 ≤ l) && return i+1
×
284
            @assert l >= 0xc0 "invalid codeunit"
×
285
        end
286
        # first continuation byte
287
        (i += 1) > n && return i
13,995✔
288
        @inbounds b = codeunit(s, i)
13,995✔
289
        b & 0xc0 ≠ 0x80 && return i
13,995✔
290
        ((i += 1) > n) | (l < 0xe0) && return i
13,995✔
291
        # second continuation byte
292
        @inbounds b = codeunit(s, i)
13,945✔
293
        b & 0xc0 ≠ 0x80 && return i
13,945✔
294
        ((i += 1) > n) | (l < 0xf0) && return i
13,945✔
295
        # third continuation byte
296
        @inbounds b = codeunit(s, i)
×
297
        return ifelse(b & 0xc0 ≠ 0x80, i, i+1)
×
298
    end)(s, i, n, l)
299
end
300

301
## checking UTF-8 & ASCII validity ##
302
#=
303
    The UTF-8 Validation is performed by a shift based DFA.
304
    ┌───────────────────────────────────────────────────────────────────┐
305
    │    UTF-8 DFA State Diagram    ┌──────────────2──────────────┐     │
306
    │                               ├────────3────────┐           │     │
307
    │                 ┌──────────┐  │     ┌─┐        ┌▼┐          │     │
308
    │      ASCII      │  UTF-8   │  ├─5──►│9├───1────► │          │     │
309
    │                 │          │  │     ├─┤        │ │         ┌▼┐    │
310
    │                 │  ┌─0─┐   │  ├─6──►│8├─1,7,9──►4├──1,7,9──► │    │
311
    │      ┌─0─┐      │  │   │   │  │     ├─┤        │ │         │ │    │
312
    │      │   │      │ ┌▼───┴┐  │  ├─11─►│7├──7,9───► │ ┌───────►3├─┐  │
313
    │     ┌▼───┴┐     │ │     │  ▼  │     └─┘        └─┘ │       │ │ │  │
314
    │     │  0  ├─────┘ │  1  ├─► ──┤                    │  ┌────► │ │  │
315
    │     └─────┘       │     │     │     ┌─┐            │  │    └─┘ │  │
316
    │                   └──▲──┘     ├─10─►│5├─────7──────┘  │        │  │
317
    │                      │        │     ├─┤               │        │  │
318
    │                      │        └─4──►│6├─────1,9───────┘        │  │
319
    │          INVALID     │              └─┘                        │  │
320
    │           ┌─*─┐      └──────────────────1,7,9──────────────────┘  │
321
    │          ┌▼───┴┐                                                  │
322
    │          │  2  ◄─── All undefined transitions result in state 2   │
323
    │          └─────┘                                                  │
324
    └───────────────────────────────────────────────────────────────────┘
325

326
        Validation States
327
            0 -> _UTF8_DFA_ASCII is the start state and will only stay in this state if the string is only ASCII characters
328
                        If the DFA ends in this state the string is ASCII only
329
            1 -> _UTF8_DFA_ACCEPT is the valid complete character state of the DFA once it has encountered a UTF-8 Unicode character
330
            2 -> _UTF8_DFA_INVALID is only reached by invalid bytes and once in this state it will not change
331
                    as seen by all 2s in that column of table below
332
            3 -> One valid continuation byte needed to return to state 1
333
        4,5,6 -> Two valid continuation bytes needed to return to state 1
334
        7,8,9 -> Three valid continuation bytes needed to return to state 1
335

336
                        Current State
337
                    0̲  1̲  2̲  3̲  4̲  5̲  6̲  7̲  8̲  9̲
338
                0 | 0  1  2  2  2  2  2  2  2  2
339
                1 | 2  2  2  1  3  2  3  2  4  4
340
                2 | 3  3  2  2  2  2  2  2  2  2
341
                3 | 4  4  2  2  2  2  2  2  2  2
342
                4 | 6  6  2  2  2  2  2  2  2  2
343
    Character   5 | 9  9  2  2  2  2  2  2  2  2     <- Next State
344
    Class       6 | 8  8  2  2  2  2  2  2  2  2
345
                7 | 2  2  2  1  3  3  2  4  4  2
346
                8 | 2  2  2  2  2  2  2  2  2  2
347
                9 | 2  2  2  1  3  2  3  4  4  2
348
               10 | 5  5  2  2  2  2  2  2  2  2
349
               11 | 7  7  2  2  2  2  2  2  2  2
350

351
           Shifts | 0  4 10 14 18 24  8 20 12 26
352

353
    The shifts that represent each state were derived using the SMT solver Z3, so that each row, once
354
    encoded, yields the correct shift.
355

356
    Each character class row encodes 10 states with the shifts defined above. Shifting the bits of a row by
357
    the current state and then masking the result with 0b11110 gives the shift for the new state.
358

359

360
=#
361

362
#State type used by UTF-8 DFA
363
const _UTF8DFAState = UInt32
364
# Fill the table with 256 _UTF8DFAState (UInt32) values representing the DFA transitions for all bytes
365
const _UTF8_DFA_TABLE = let # let block rather than function doesn't pollute base
366
    num_classes=12
367
    num_states=10
368
    bit_per_state = 6
369

370
    # These shifts were derived using a SMT solver
371
    state_shifts = [0, 4, 10, 14, 18, 24, 8, 20, 12, 26]
372

373
    character_classes = [   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
374
                            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
375
                            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
376
                            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
377
                            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
378
                            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
379
                            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
380
                            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
381
                            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
382
                            9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
383
                            7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
384
                            7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
385
                            8, 8, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
386
                            2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
387
                            10, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3,
388
                            11, 6, 6, 6, 5, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8 ]
389

390
    # These are the rows discussed in comments above
391
    state_arrays = [ 0  1  2  2  2  2  2  2  2  2;
392
                     2  2  2  1  3  2  3  2  4  4;
393
                     3  3  2  2  2  2  2  2  2  2;
394
                     4  4  2  2  2  2  2  2  2  2;
395
                     6  6  2  2  2  2  2  2  2  2;
396
                     9  9  2  2  2  2  2  2  2  2;
397
                     8  8  2  2  2  2  2  2  2  2;
398
                     2  2  2  1  3  3  2  4  4  2;
399
                     2  2  2  2  2  2  2  2  2  2;
400
                     2  2  2  1  3  2  3  4  4  2;
401
                     5  5  2  2  2  2  2  2  2  2;
402
                     7  7  2  2  2  2  2  2  2  2]
403

404
    #This converts the state_arrays into the shift encoded _UTF8DFAState
405
    class_row = zeros(_UTF8DFAState, num_classes)
406

407
    for i = 1:num_classes
408
        row = _UTF8DFAState(0)
409
        for j in 1:num_states
410
            #Calculate the shift required for the next state
411
            to_shift = UInt8((state_shifts[state_arrays[i,j]+1]) )
412
            #Shift the next state into the position of the current state
413
            row = row | (_UTF8DFAState(to_shift) << state_shifts[j])
414
        end
415
        class_row[i]=row
416
    end
417

418
    map(c->class_row[c+1],character_classes)
×
419
end
420

421

422
const _UTF8_DFA_ASCII = _UTF8DFAState(0) #This state represents the start and end of any valid string
423
const _UTF8_DFA_ACCEPT = _UTF8DFAState(4) #This state represents the start and end of any valid string
424
const _UTF8_DFA_INVALID = _UTF8DFAState(10) # If the state machine is ever in this state just stop
425

426
# The dfa step is broken out so that it may be used in other functions. The mask was calculated to work with state shifts above
427
@inline _utf_dfa_step(state::_UTF8DFAState, byte::UInt8) = @inbounds (_UTF8_DFA_TABLE[byte+1] >> state) & _UTF8DFAState(0x0000001E)
109,615✔
428

429
@inline function _isvalid_utf8_dfa(state::_UTF8DFAState, bytes::AbstractVector{UInt8}, first::Int = firstindex(bytes), last::Int = lastindex(bytes))
10,496✔
430
    for i = first:last
51,762✔
431
       @inbounds state = _utf_dfa_step(state, bytes[i])
109,615✔
432
    end
167,468✔
433
    return (state)
51,762✔
434
end
435

436
@inline function  _find_nonascii_chunk(chunk_size,cu::AbstractVector{CU}, first,last) where {CU}
437
    n=first
20✔
438
    while n <= last - chunk_size
80✔
439
        _isascii(cu,n,n+chunk_size-1) || return n
60✔
440
        n += chunk_size
60✔
441
    end
60✔
442
    n= last-chunk_size+1
20✔
443
    _isascii(cu,n,last) || return n
20✔
444
    return nothing
20✔
445
end
446

447
##
448

449
# Classifications of string
450
    # 0: neither valid ASCII nor UTF-8
451
    # 1: valid ASCII
452
    # 2: valid UTF-8
453
 byte_string_classify(s::AbstractString) = byte_string_classify(codeunits(s))
48✔
454

455

456
function byte_string_classify(bytes::AbstractVector{UInt8})
48✔
457
    chunk_size = 1024
41,345✔
458
    chunk_threshold =  chunk_size + (chunk_size ÷ 2)
41,345✔
459
    n = length(bytes)
41,345✔
460
    if n > chunk_threshold
41,345✔
461
        start = _find_nonascii_chunk(chunk_size,bytes,1,n)
20✔
462
        isnothing(start) && return 1
20✔
463
    else
464
        _isascii(bytes,1,n) && return 1
41,325✔
465
        start = 1
40,498✔
466
    end
467
    return _byte_string_classify_nonascii(bytes,start,n)
40,498✔
468
end
469

470
function _byte_string_classify_nonascii(bytes::AbstractVector{UInt8}, first::Int, last::Int)
40,498✔
471
    chunk_size = 256
40,498✔
472

473
    start = first
40,498✔
474
    stop = min(last,first + chunk_size - 1)
40,498✔
475
    state = _UTF8_DFA_ACCEPT
40,498✔
476
    while start <= last
50,090✔
477
        # try to process ascii chunks
478
        while state == _UTF8_DFA_ACCEPT
40,498✔
479
            _isascii(bytes,start,stop) || break
40,498✔
480
            (start = start + chunk_size) <= last || break
×
481
            stop = min(last,stop + chunk_size)
×
482
        end
×
483
        # Process non ascii chunk
484
        state = _isvalid_utf8_dfa(state,bytes,start,stop)
98,351✔
485
        state == _UTF8_DFA_INVALID && return 0
40,498✔
486

487
        start = start + chunk_size
9,592✔
488
        stop = min(last,stop + chunk_size)
9,592✔
489
    end
9,592✔
490
    return ifelse(state == _UTF8_DFA_ACCEPT,2,0)
9,592✔
491
end
492

493
isvalid(::Type{String}, bytes::AbstractVector{UInt8}) = (@inline byte_string_classify(bytes)) ≠ 0
41,463✔
494
isvalid(::Type{String}, s::AbstractString) =  (@inline byte_string_classify(s)) ≠ 0
48✔
495

496
@inline isvalid(s::AbstractString) = @inline isvalid(String, codeunits(s))
945✔
497

498
is_valid_continuation(c) = c & 0xc0 == 0x80
859✔
499

500
## required core functionality ##
501

502
@inline function iterate(s::Union{String, StringView}, i::Int=firstindex(s))
2,829✔
503
    (i % UInt) - 1 < ncodeunits(s) || return nothing
514,634,301✔
504
    b = @inbounds codeunit(s, i)
430,504,213✔
505
    u = UInt32(b) << 24
430,504,213✔
506
    between(b, 0x80, 0xf7) || return reinterpret(Char, u), i+1
860,356,716✔
507
    return @noinline iterate_continued(s, i, u)
651,710✔
508
end
509

510
# duck-type s so that external UTF-8 string packages like StringViews can hook in
511
function iterate_continued(s, i::Int, u::UInt32)
274,627✔
512
    @label begin
274,627✔
513
        u < 0xc0000000 && (i += 1; break)
274,627✔
514
        n = ncodeunits(s)
259,769✔
515
        # first continuation byte
516
        (i += 1) > n && break
259,769✔
517
        @inbounds b = codeunit(s, i)
256,599✔
518
        b & 0xc0 == 0x80 || break
256,599✔
519
        u |= UInt32(b) << 16
247,763✔
520
        # second continuation byte
521
        ((i += 1) > n) | (u < 0xe0000000) && break
247,763✔
522
        @inbounds b = codeunit(s, i)
121,546✔
523
        b & 0xc0 == 0x80 || break
121,546✔
524
        u |= UInt32(b) << 8
121,546✔
525
        # third continuation byte
526
        ((i += 1) > n) | (u < 0xf0000000) && break
121,546✔
527
        @inbounds b = codeunit(s, i)
2,067✔
528
        b & 0xc0 == 0x80 || break
2,067✔
529
        u |= UInt32(b); i += 1
2,067✔
530
    end
531
    return reinterpret(Char, u), i
274,627✔
532
end
533

534
@propagate_inbounds function getindex(s::Union{String, StringView}, i::Int)
2,076✔
535
    b = codeunit(s, i)
128,069,842✔
536
    u = UInt32(b) << 24
128,069,842✔
537
    between(b, 0x80, 0xf7) || return reinterpret(Char, u)
255,934,367✔
538
    return getindex_continued(s, i, u)
205,249✔
539
end
540

541
# duck-type s so that external UTF-8 string packages like StringViews can hook in
542
function getindex_continued(s, i::Int, u::UInt32)
24,542✔
543
    @label begin
24,542✔
544
        if u < 0xc0000000
24,542✔
545
            # called from `getindex` which checks bounds
546
            @inbounds isvalid(s, i) && break
×
547
            string_index_err(s, i)
×
548
        end
549
        n = ncodeunits(s)
24,542✔
550

551
        (i += 1) > n && break
24,542✔
552
        @inbounds b = codeunit(s, i) # cont byte 1
24,542✔
553
        b & 0xc0 == 0x80 || break
24,542✔
554
        u |= UInt32(b) << 16
24,542✔
555

556
        ((i += 1) > n) | (u < 0xe0000000) && break
24,542✔
557
        @inbounds b = codeunit(s, i) # cont byte 2
24,492✔
558
        b & 0xc0 == 0x80 || break
24,492✔
559
        u |= UInt32(b) << 8
24,492✔
560

561
        ((i += 1) > n) | (u < 0xf0000000) && break
24,492✔
562
        @inbounds b = codeunit(s, i) # cont byte 3
21✔
563
        b & 0xc0 == 0x80 || break
21✔
564
        u |= UInt32(b)
21✔
565
    end
566
    return reinterpret(Char, u)
24,542✔
567
end
568

569
function getindex(s::Union{String, StringView}, r::AbstractUnitRange{<:Integer})
8✔
570
    span = (Int(first(r))::Int):(Int(last(r)))::Int
8✔
571
    return s[span]
8✔
572
end
573

574
@inline function getindex(s::String, r::UnitRange{Int})
1,248✔
575
    isempty(r) && return ""
2,515,891✔
576
    i, j = first(r), last(r)
1,398,810✔
577
    @boundscheck begin
2,472,937✔
578
        checkbounds(s, r)
2,472,937✔
579
        @inbounds isvalid(s, i) || string_index_err(s, i)
2,472,937✔
580
        @inbounds isvalid(s, j) || string_index_err(s, j)
2,472,937✔
581
    end
582
    j = nextind(s, j) - 1
4,944,696✔
583
    n = j - i + 1
2,472,937✔
584
    ss = _string_n(n)
2,472,937✔
585
    GC.@preserve s ss unsafe_copyto!(pointer(ss), pointer(s, i), n)
2,472,937✔
586
    return ss
2,472,937✔
587
end
588

589
# nothrow because we know the start and end indices are valid
590
@assume_effects :nothrow function length(s::String)
105,212✔
591
    return length_continued(s, 1, ncodeunits(s), ncodeunits(s))
105,212✔
592
end
593

NEW
594
function length(s::StringView)
×
NEW
595
    return length_continued(s, 1, ncodeunits(s), ncodeunits(s))
×
596
end
597

598
# effects needed because @inbounds
599
@assume_effects :consistent :effect_free @inline function length(s::String, i::Int, j::Int)
600
    _length(s, i, j)
169,204✔
601
end
602

NEW
603
@inline function length(s::StringView, i::Int, j::Int)
×
NEW
604
    _length(s, i, j)
×
605
end
606

607
@inline function _length(s::Union{String, StringView}, i::Int, j::Int)
608
    @boundscheck begin
113,504✔
609
        0 < i ≤ ncodeunits(s)+1 || throw(BoundsError(s, i))
113,504✔
610
        0 ≤ j < ncodeunits(s)+1 || throw(BoundsError(s, j))
113,504✔
611
    end
612
    j < i && return 0
113,504✔
613
    @inbounds i, k = thisind(s, i), i
113,804✔
614
    c = j - i + (i == k)
56,902✔
615
    @inbounds length_continued(s, i, j, c)
56,902✔
616
end
617

618
@assume_effects :terminates_globally @propagate_inbounds function length_continued(s::String, i::Int, n::Int, c::Int)
4✔
619
    _length_continued(s, i, n, c)
162,120✔
620
end
621

NEW
622
@propagate_inbounds function length_continued(s::StringView, i::Int, n::Int, c::Int)
×
NEW
623
    _length_continued(s, i, n, c)
×
624
end
625

626

627
@propagate_inbounds function _length_continued(s::Union{String, StringView}, i::Int, n::Int, c::Int)
4✔
628
    i < n || return c
163,532✔
629
    b = codeunit(s, i)
160,696✔
630
    while true
909,636✔
631
        while true
3,409,188✔
632
            (i += 1) ≤ n || return c
18,908,829✔
633
            0xc0 ≤ b ≤ 0xf7 && break
18,590,033✔
634
            b = codeunit(s, i)
17,839,795✔
635
        end
17,839,795✔
636
        l = b
10✔
637
        b = codeunit(s, i) # cont byte 1
750,238✔
638
        c -= (x = b & 0xc0 == 0x80)
750,238✔
639
        x & (l ≥ 0xe0) || continue
750,238✔
640

641
        (i += 1) ≤ n || return c
60,470✔
642
        b = codeunit(s, i) # cont byte 2
57,874✔
643
        c -= (x = b & 0xc0 == 0x80)
57,874✔
644
        x & (l ≥ 0xf0) || continue
115,748✔
645

646
        (i += 1) ≤ n || return c
×
647
        b = codeunit(s, i) # cont byte 3
×
648
        c -= (b & 0xc0 == 0x80)
×
649
    end
748,940✔
650
end
651

652
## overload methods for efficiency ##
653

654
isvalid(s::String, i::Int) = checkbounds(Bool, s, i) && thisind(s, i) == i
117,699,991✔
655

656
isascii(s::String) = isascii(codeunits(s))
6,333,515✔
657

658
# don't assume effects for general integers since we cannot know their implementation
659
@assume_effects :foldable repeat(c::Char, r::BitInteger) = @invoke repeat(c::Char, r::Integer)
8,315,475✔
660

661
"""
662
    repeat(c::AbstractChar, r::Integer)::String
663

664
Repeat a character `r` times. This can equivalently be accomplished by calling
665
[`c^r`](@ref :^(::Union{AbstractString, AbstractChar}, ::Integer)).
666

667
# Examples
668
```jldoctest
669
julia> repeat('A', 3)
670
"AAA"
671
```
672
"""
673
function repeat(c::AbstractChar, r::Integer)
8,548,418✔
674
    r < 0 && throw(ArgumentError("can't repeat a character $r times"))
8,548,900✔
675
    r = UInt(r)::UInt
8,548,892✔
676
    c = Char(c)::Char
8,548,892✔
677
    r == 0 && return ""
8,548,892✔
678
    u = bswap(reinterpret(UInt32, c))
8,396,338✔
679
    n = 4 - (leading_zeros(u | 0xff) >> 3)
8,396,338✔
680
    s = _string_n(n*r)
8,396,338✔
681
    p = pointer(s)
8,396,336✔
682
    GC.@preserve s if n == 1
8,396,336✔
683
        memset(p, u % UInt8, r)
8,395,742✔
684
    elseif n == 2
594✔
685
        p16 = reinterpret(Ptr{UInt16}, p)
14✔
686
        for i = 1:r
14✔
687
            unsafe_store!(p16, u % UInt16, i)
40✔
688
        end
40✔
689
    elseif n == 3
580✔
690
        b1 = (u >> 0) % UInt8
570✔
691
        b2 = (u >> 8) % UInt8
570✔
692
        b3 = (u >> 16) % UInt8
570✔
693
        for i = 0:r-1
570✔
694
            unsafe_store!(p, b1, 3i + 1)
3,116✔
695
            unsafe_store!(p, b2, 3i + 2)
3,116✔
696
            unsafe_store!(p, b3, 3i + 3)
3,116✔
697
        end
3,116✔
698
    elseif n == 4
10✔
699
        p32 = reinterpret(Ptr{UInt32}, p)
10✔
700
        for i = 1:r
10✔
701
            unsafe_store!(p32, u, i)
22✔
702
        end
8,396,358✔
703
    end
704
    return s
8,396,336✔
705
end