• Home
  • Features
  • Pricing
  • Docs
  • Announcements
  • Sign In

realm / realm-core / 2075

27 Feb 2024 04:12PM UTC coverage: 90.97% (+0.05%) from 90.925%
2075

push

Evergreen

web-flow
Eliminate copies when accessing values from Bson types (#7377)

Returning things by value performs a deep copy, which is very expensive when
those things are also bson containers.

Re-align the naming with the convention names for the functions rather than
being weird and different.

93914 of 173104 branches covered (54.25%)

82 of 82 new or added lines in 9 files covered. (100.0%)

66 existing lines in 16 files now uncovered.

238508 of 262184 relevant lines covered (90.97%)

5724419.94 hits per line

Source File
Press 'n' to go to next uncovered line, 'b' for previous

93.68
/src/realm/index_string.hpp
1
/*************************************************************************
2
 *
3
 * Copyright 2016 Realm Inc.
4
 *
5
 * Licensed under the Apache License, Version 2.0 (the "License");
6
 * you may not use this file except in compliance with the License.
7
 * You may obtain a copy of the License at
8
 *
9
 * http://www.apache.org/licenses/LICENSE-2.0
10
 *
11
 * Unless required by applicable law or agreed to in writing, software
12
 * distributed under the License is distributed on an "AS IS" BASIS,
13
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
 * See the License for the specific language governing permissions and
15
 * limitations under the License.
16
 *
17
 **************************************************************************/
18

19
#ifndef REALM_INDEX_STRING_HPP
20
#define REALM_INDEX_STRING_HPP
21

22
#include <cstring>
23
#include <memory>
24
#include <array>
25
#include <set>
26

27
#include <realm/array.hpp>
28
#include <realm/column_integer.hpp>
29
#include <realm/search_index.hpp>
30

31
/*
32
The StringIndex class is used for both type_String and all integral types, such as type_Bool, type_Timestamp and
33
type_Int. When used for integral types, the 64-bit integer is simply casted to a string of 8 bytes through a
34
pretty simple "wrapper layer" in all public methods.
35

36
The StringIndex data structure is like an "inversed" B+ tree where the leafs contain row indexes and the non-leafs
37
contain 4-byte chunks of payload. Imagine a table with following strings:
38

39
       hello, kitty, kitten, foobar, kitty, foobar
40

41
The topmost level of the index tree contains prefixes of the payload strings of length <= 4. The next level contains
42
prefixes of the remaining parts of the strings. Unnecessary levels of the tree are optimized away; the prefix "foob"
43
is shared only by rows that are identical ("foobar"), so "ar" is not needed to be stored in the tree.
44

45
       hell   kitt      foob
46
        |      /\        |
47
        0     en  y    {3, 5}
48
              |    \
49
           {1, 4}   2
50

51
Each non-leafs consists of two integer arrays of the same length, one containing payload and the other containing
52
references to the sublevel nodes.
53

54
The leafs can be either a single value or a Column. If the reference in its parent node has its least significant
55
bit set, then the remaining upper bits specify the row index at which the string is stored. If the bit is clear,
56
it must be interpreted as a reference to a Column that stores the row indexes at which the string is stored.
57

58
If a Column is used, then all row indexes are guaranteed to be sorted increasingly, which means you an search in it
59
using our binary search functions such as upper_bound() and lower_bound(). Each duplicate value will be stored in
60
the same Column, but Columns may contain more than just duplicates if the depth of the tree exceeds the value
61
`s_max_offset` This is to avoid stack overflow problems with many of our recursive functions if we have two very
62
long strings that have a long common prefix but differ in the last couple bytes. If a Column stores more than just
63
duplicates, then the list is kept sorted in ascending order by string value and within the groups of common
64
strings, the rows are sorted in ascending order.
65
*/
66

67
namespace realm {
68

69
class Spec;
70
class Timestamp;
71
class ClusterColumn;
72

73
template <class T>
74
class BPlusTree;
75

76
/// Each StringIndex node contains an array of this type
77
class IndexArray : public Array {
78
public:
79
    IndexArray(Allocator& allocator)
80
        : Array(allocator)
81
    {
7,872,540✔
82
    }
7,872,540✔
83

84
    ObjKey index_string_find_first(const Mixed& value, const ClusterColumn& column) const;
85
    void index_string_find_all(std::vector<ObjKey>& result, const Mixed& value, const ClusterColumn& column,
86
                               bool case_insensitive = false) const;
87
    FindRes index_string_find_all_no_copy(const Mixed& value, const ClusterColumn& column,
88
                                          InternalFindResult& result) const;
89
    size_t index_string_count(const Mixed& value, const ClusterColumn& column) const;
90
    void index_string_find_all_prefix(std::set<int64_t>& result, StringData str) const
91
    {
66✔
92
        _index_string_find_all_prefix(result, str, NodeHeader::get_header_from_data(m_data));
66✔
93
    }
66✔
94

95
private:
96
    template <IndexMethod>
97
    int64_t from_list(const Mixed& value, InternalFindResult& result_ref, const IntegerColumn& key_values,
98
                      const ClusterColumn& column) const;
99

100
    void from_list_all(const Mixed& value, std::vector<ObjKey>& result, const IntegerColumn& rows,
101
                       const ClusterColumn& column) const;
102

103
    void from_list_all_ins(StringData value, std::vector<ObjKey>& result, const IntegerColumn& rows,
104
                           const ClusterColumn& column) const;
105

106
    template <IndexMethod method>
107
    int64_t index_string(const Mixed& value, InternalFindResult& result_ref, const ClusterColumn& column) const;
108

109
    void index_string_all(const Mixed& value, std::vector<ObjKey>& result, const ClusterColumn& column) const;
110

111
    void index_string_all_ins(StringData value, std::vector<ObjKey>& result, const ClusterColumn& column) const;
112
    void _index_string_find_all_prefix(std::set<int64_t>& result, StringData str, const char* header) const;
113
};
114

115
// 16 is the biggest element size of any non-string/binary Realm type
116
constexpr size_t string_conversion_buffer_size = 16;
117
using StringConversionBuffer = std::array<char, string_conversion_buffer_size>;
118
static_assert(sizeof(UUID::UUIDBytes) <= string_conversion_buffer_size,
119
              "if you change the size of a UUID then also change the string index buffer space");
120

121

122
class StringIndex : public SearchIndex {
123
public:
124
    StringIndex(const ClusterColumn& target_column, Allocator&);
125
    StringIndex(ref_type, ArrayParent*, size_t ndx_in_parent, const ClusterColumn& target_column, Allocator&);
126
    ~StringIndex() noexcept
127
    {
7,864,476✔
128
    }
7,864,476✔
129

130
    static bool type_supported(realm::DataType type)
131
    {
96,948✔
132
        return (type == type_Int || type == type_String || type == type_Bool || type == type_Timestamp ||
96,948✔
133
                type == type_ObjectId || type == type_Mixed || type == type_UUID);
69,063✔
134
    }
96,948✔
135

136
    // StringIndex interface:
137

138
    bool is_empty() const override;
139
    bool is_fulltext_index() const
140
    {
732✔
141
        return this->m_target_column.tokenize();
732✔
142
    }
732✔
143

144
    void insert(ObjKey key, const Mixed& value) final;
145
    void set(ObjKey key, const Mixed& new_value) final;
146
    void erase(ObjKey key) final;
147
    void erase_list(ObjKey key, const Lst<String>&);
148
    // Erase without getting value from parent column (useful when string stored
149
    // does not directly match string in parent, like with full-text indexing)
150
    void erase_string(ObjKey key, StringData value);
151

152
    ObjKey find_first(const Mixed& value) const final;
153
    void find_all(std::vector<ObjKey>& result, Mixed value, bool case_insensitive = false) const final;
154
    FindRes find_all_no_copy(Mixed value, InternalFindResult& result) const final;
155
    size_t count(const Mixed& value) const final;
156
    void insert_bulk(const ArrayUnsigned* keys, uint64_t key_offset, size_t num_values, ArrayPayload& values) final;
157
    void insert_bulk_list(const ArrayUnsigned* keys, uint64_t key_offset, size_t num_values,
158
                          ArrayInteger& ref_array) final;
159

160
    void find_all_fulltext(std::vector<ObjKey>& result, StringData value) const;
161

162
    void clear() override;
163
    bool has_duplicate_values() const noexcept override;
164

165
    void verify() const final;
166
#ifdef REALM_DEBUG
167
    template <class T>
168
    void verify_entries(const ClusterColumn& column) const;
169
    void print() const final;
170
#endif
171

172
    typedef int32_t key_type;
173

174
    // s_max_offset specifies the number of levels of recursive string indexes
175
    // allowed before storing everything in lists. This is to avoid nesting
176
    // to too deep of a level. Since every SubStringIndex stores 4 bytes, this
177
    // means that a StringIndex is helpful for strings of a common prefix up to
178
    // 4 times this limit (200 bytes shared). Lists are stored in sorted order,
179
    // so strings sharing a common prefix of more than this limit will use a
180
    // binary search of approximate complexity log2(n) from `std::lower_bound`.
181
    static const size_t s_max_offset = 200; // max depth * s_index_key_length
182
    static const size_t s_index_key_length = 4;
183
    static key_type create_key(StringData) noexcept;
184
    static key_type create_key(StringData, size_t) noexcept;
185

186
private:
187
    // m_array is a compact representation for storing the children of this StringIndex.
188
    // Children can be:
189
    // 1) an ObjKey
190
    // 2) a reference to a list which stores ObjKeys (for duplicate strings).
191
    // 3) a reference to a sub-index
192
    // m_array[0] is always a reference to a values array which stores the 4 byte chunk
193
    // of payload data for quick string chunk comparisons. The array stored
194
    // at m_array[0] lines up with the indices of values in m_array[1] so for example
195
    // starting with an empty StringIndex:
196
    // StringColumn::insert(key=42, value="test_string") would result with
197
    // get_array_from_ref(m_array[0])[0] == create_key("test") and
198
    // m_array[1] == 42
199
    // In this way, m_array which stores one child has a size of two.
200
    // Children are type 1 (ObjKey) if the LSB of the value is set.
201
    // To get the actual key value, shift value down by one.
202
    // If the LSB of the value is 0 then the value is a reference and can be either
203
    // type 2, or type 3 (no shifting in either case).
204
    // References point to a list if the context header flag is NOT set.
205
    // If the header flag is set, references point to a sub-StringIndex (nesting).
206
    std::unique_ptr<IndexArray> m_array;
207

208
    struct inner_node_tag {
209
    };
210
    StringIndex(inner_node_tag, Allocator&);
211
    StringIndex(const ClusterColumn& target_column, std::unique_ptr<IndexArray> root)
212
        : SearchIndex(target_column, root.get())
213
        , m_array(std::move(root))
214
    {
7,874,934✔
215
    }
7,874,934✔
216

217
    static std::unique_ptr<IndexArray> create_node(Allocator&, bool is_leaf);
218

219
    void insert_with_offset(ObjKey key, StringData index_data, const Mixed& value, size_t offset);
220
    void insert_row_list(size_t ref, size_t offset, StringData value);
221
    void insert_to_existing_list(ObjKey key, Mixed value, IntegerColumn& list);
222
    void insert_to_existing_list_at_lower(ObjKey key, Mixed value, IntegerColumn& list,
223
                                          const IntegerColumnIterator& lower);
224
    key_type get_last_key() const;
225

226
    struct NodeChange {
227
        size_t ref1;
228
        size_t ref2;
229
        enum ChangeType { change_None, change_InsertBefore, change_InsertAfter, change_Split } type;
230
        NodeChange(ChangeType t, size_t r1 = 0, size_t r2 = 0)
231
            : ref1(r1)
232
            , ref2(r2)
233
            , type(t)
234
        {
10,155,348✔
235
        }
10,155,348✔
236
        NodeChange()
237
            : ref1(0)
238
            , ref2(0)
239
            , type(change_None)
240
        {
×
241
        }
×
242
    };
243

244
    // B-Tree functions
245
    void TreeInsert(ObjKey obj_key, key_type, size_t offset, StringData index_data, const Mixed& value);
246
    NodeChange do_insert(ObjKey, key_type, size_t offset, StringData index_data, const Mixed& value);
247
    /// Returns true if there is room or it can join existing entries
248
    bool leaf_insert(ObjKey obj_key, key_type, size_t offset, StringData index_data, const Mixed& value,
249
                     bool noextend = false);
250
    void node_insert_split(size_t ndx, size_t new_ref);
251
    void node_insert(size_t ndx, size_t ref);
252
    void do_delete(ObjKey key, StringData, size_t offset);
253

254
    Mixed get(ObjKey key) const;
255
    void node_add_key(ref_type ref);
256

257
#ifdef REALM_DEBUG
258
    static void dump_node_structure(const Array& node, std::ostream&, int level);
259
#endif
260
};
261

262
class SortedListComparator {
263
public:
264
    SortedListComparator(const ClusterTree* cluster_tree, ColKey column_key, IndexType type)
265
        : m_column(cluster_tree, column_key, type)
266
    {
×
267
    }
×
268
    SortedListComparator(const ClusterColumn& column)
269
        : m_column(column)
270
    {
4,199,040✔
271
    }
4,199,040✔
272

273
    bool operator()(int64_t key_value, const Mixed& b);
274
    bool operator()(const Mixed& a, int64_t key_value);
275

276
    IntegerColumn::const_iterator find_start_of_unsorted(const Mixed& value, const IntegerColumn& key_values) const;
277
    IntegerColumn::const_iterator find_end_of_unsorted(const Mixed& value, const IntegerColumn& key_values,
278
                                                       IntegerColumn::const_iterator begin) const;
279

280
private:
281
    const ClusterColumn m_column;
282
};
283

284

285
// Implementation:
286
inline StringIndex::StringIndex(const ClusterColumn& target_column, Allocator& alloc)
287
    : StringIndex(target_column, create_node(alloc, true)) // Throws
288
{
681,534✔
289
}
681,534✔
290

291
inline StringIndex::StringIndex(ref_type ref, ArrayParent* parent, size_t ndx_in_parent,
292
                                const ClusterColumn& target_column, Allocator& alloc)
293
    : StringIndex(target_column, std::make_unique<IndexArray>(alloc))
294
{
7,190,112✔
295
    REALM_ASSERT_EX(Array::get_context_flag_from_header(alloc.translate(ref)), ref, size_t(alloc.translate(ref)));
7,190,112✔
296
    m_array->init_from_ref(ref);
7,190,112✔
297
    set_parent(parent, ndx_in_parent);
7,190,112✔
298
}
7,190,112✔
299

300
inline StringIndex::StringIndex(inner_node_tag, Allocator& alloc)
301
    : StringIndex(ClusterColumn(nullptr, {}, IndexType::General), create_node(alloc, false)) // Throws
302
{
81✔
303
}
81✔
304

305
// Byte order of the key is *reversed*, so that for the integer index, the least significant
306
// byte comes first, so that it fits little-endian machines. That way we can perform fast
307
// range-lookups and iterate in order, etc, as future features. This, however, makes the same
308
// features slower for string indexes. Todo, we should reverse the order conditionally, depending
309
// on the column type.
310
inline StringIndex::key_type StringIndex::create_key(StringData str) noexcept
311
{
13,398,633✔
312
    key_type key = 0;
13,398,633✔
313

6,659,217✔
314
    if (str.size() >= 4)
13,398,633✔
315
        goto four;
12,499,056✔
316
    if (str.size() < 2) {
899,577✔
317
        if (str.size() == 0)
497,064✔
318
            goto none;
×
319
        goto one;
497,064✔
320
    }
497,064✔
321
    if (str.size() == 2)
402,513✔
322
        goto two;
192,516✔
323
    goto three;
209,997✔
324

103,557✔
325
// Create 4 byte index key
103,557✔
326
// (encoded like this to allow literal comparisons
103,557✔
327
// independently of endianness)
103,557✔
328
four:
12,499,293✔
329
    key |= (key_type(static_cast<unsigned char>(str[3])) << 0);
12,499,293✔
330
three:
12,710,601✔
331
    key |= (key_type(static_cast<unsigned char>(str[2])) << 8);
12,710,601✔
332
two:
12,897,834✔
333
    key |= (key_type(static_cast<unsigned char>(str[1])) << 16);
12,897,834✔
334
one:
13,389,183✔
335
    key |= (key_type(static_cast<unsigned char>(str[0])) << 24);
13,389,183✔
336
none:
13,392,717✔
337
    return key;
13,392,717✔
338
}
13,389,183✔
339

340
// Index works as follows: All non-NULL values are stored as if they had appended an 'X' character at the end. So
341
// "foo" is stored as if it was "fooX", and "" (empty string) is stored as "X". And NULLs are stored as empty strings.
342
inline StringIndex::key_type StringIndex::create_key(StringData str, size_t offset) noexcept
343
{
13,955,403✔
344
    if (str.is_null())
13,955,403✔
345
        return 0;
575,952✔
346

6,648,450✔
347
    if (offset > str.size())
13,379,451✔
UNCOV
348
        return 0;
×
349

6,648,450✔
350
    // for very short strings
6,648,450✔
351
    size_t tail = str.size() - offset;
13,379,451✔
352
    if (tail <= sizeof(key_type) - 1) {
13,379,451✔
353
        char buf[sizeof(key_type)];
942,054✔
354
        memset(buf, 0, sizeof(key_type));
942,054✔
355
        buf[tail] = 'X';
942,054✔
356
        memcpy(buf, str.data() + offset, tail);
942,054✔
357
        return create_key(StringData(buf, tail + 1));
942,054✔
358
    }
942,054✔
359
    // else fallback
6,177,129✔
360
    return create_key(str.substr(offset));
12,437,397✔
361
}
12,437,397✔
362

363
inline ObjKey StringIndex::find_first(const Mixed& value) const
364
{
1,292,283✔
365
    // Use direct access method
635,757✔
366
    return m_array->index_string_find_first(value, m_target_column);
1,292,283✔
367
}
1,292,283✔
368

369
inline void StringIndex::find_all(std::vector<ObjKey>& result, Mixed value, bool case_insensitive) const
370
{
57,783✔
371
    // Use direct access method
29,019✔
372
    return m_array->index_string_find_all(result, value, m_target_column, case_insensitive);
57,783✔
373
}
57,783✔
374

375
inline FindRes StringIndex::find_all_no_copy(Mixed value, InternalFindResult& result) const
376
{
657,702✔
377
    return m_array->index_string_find_all_no_copy(value, m_target_column, result);
657,702✔
378
}
657,702✔
379

380
inline size_t StringIndex::count(const Mixed& value) const
381
{
12,864✔
382
    // Use direct access method
6,432✔
383
    return m_array->index_string_count(value, m_target_column);
12,864✔
384
}
12,864✔
385

386
} // namespace realm
387

388
#endif // REALM_INDEX_STRING_HPP
STATUS · Troubleshooting · Open an Issue · Sales · Support · CAREERS · ENTERPRISE · START FREE · SCHEDULE DEMO
ANNOUNCEMENTS · TWITTER · TOS & SLA · Supported CI Services · What's a CI service? · Automated Testing

© 2026 Coveralls, Inc