• Home
  • Features
  • Pricing
  • Docs
  • Announcements
  • Sign In

vigna / dsi-bitstream-rs / 18145139209

30 Sep 2025 10:26PM UTC coverage: 43.572% (-5.2%) from 48.731%
18145139209

push

github

vigna
Unified test

7 of 11 new or added lines in 2 files covered. (63.64%)

323 existing lines in 12 files now uncovered.

1559 of 3578 relevant lines covered (43.57%)

2347492.34 hits per line

Source File
Press 'n' to go to next uncovered line, 'b' for previous

79.28
/src/codes/vbyte.rs
1
/*
2
 * SPDX-FileCopyrightText: 2023 Tommaso Fontana
3
 * SPDX-FileCopyrightText: 2023 Sebastiano Vigna
4
 *
5
 * SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later
6
 */
7

8
//! Variable-length byte codes.
9
//!
10
//! These codes represent a natural number as a sequence of bytes, the length of
11
//! the sequence depends on the magnitude of the number. They are used in many
12
//! contexts, and they are known under a plethora of different names such
13
//! “vbyte”, “varint”, “[variable-length
14
//! quantity](https://en.wikipedia.org/wiki/Variable-length_quantity)”, “LEB”,
15
//! and so on.
16
//!
17
//! There are several variants of their definition, but their implied
18
//! distribution is always ≈ 1/*x*<sup>8/7</sup>
19
//!
20
//! The supported range is [0 . . 2⁶⁴).
21
//!
22
//! # Definition
23
//!
24
//! Since there are a few slightly different variants used in production code
25
//! and in the literature, before going into the details of this implementation
26
//! we will try to define a clear taxonomy by explaining in detail the four
27
//! three properties that define such variants.
28
//!
29
//! The simplest variable-length byte code encodes a number with a binary
30
//! representation of *k* bits using ⌈*k* / 7⌉ bytes. The binary representation
31
//! is left-padded with zeros so to obtain exactly ⌈*k* / 7⌉ blocks of 7 bits.
32
//! Each block is stored in a byte, prefixed with a continuation bit which is
33
//! one for all blocks except for the last one.
34
//!
35
//! ## Endianness
36
//!
37
//! The first property is the endianness of the bytes: in big-endian codes, the
38
//! first byte contains the highest (most significant) bits, whereas in
39
//! little-endian codes, the first byte contains the lowest (less significant)
40
//! bits.
41
//!
42
//! The advantage of the big-endian variant is that is lexicographical, that is,
43
//! comparing lexicographically a stream of encoded natural numbers will give
44
//! the same results as comparing lexicographically the encoded numbers, much
45
//! like it happens for [UTF-8 encoding](https://en.wikipedia.org/wiki/UTF-8).
46
//!
47
//! ## Completeness
48
//!
49
//! This basic representation discussed above is not *complete*, as there are
50
//! sequences that are not used. For example, zero can be written in many ways
51
//! (e.g., `0x00` or `0x80 0x00` ), but we are using only the single-byte
52
//! representation. Incompleteness leads to a (small) loss in compression.
53
//!
54
//! To have completeness, one can offset the representation in *k* bits by the
55
//! maximum number representable using *k* − 1 bits. That is, we represent the
56
//! interval [0..2⁷) with one byte, then the interval [2⁷..2⁷ + 2¹⁴] with two
57
//! bytes, the interval [2⁷ + 2¹⁴..2⁷ + 2¹⁴ + 2²¹] with three bytes, and so on.
58
//!
59
//! ## Grouping
60
//!
61
//! In the basic representation, the continuation bit is the most significant
62
//! bit of each byte. However, one can gather all continuation bits in the first
63
//! byte ([as UTF-8 does](https://en.wikipedia.org/wiki/UTF-8)). This approach
64
//! makes it possible to compute the length of the code using a call to
65
//! [`usize::leading_ones`] on the first negated byte, which usually maps to a
66
//! negation and a call to a fast instruction for the detection of the most
67
//! significant bit, improving branch prediction.
68
//!
69
//! Note that if the code is grouped, choosing a code with the same endianness
70
//! as your hardware can lead to a performance improvement, as after the first
71
//! byte the rest of the code can be read with a
72
//![`read_exact`](std::io::Read::read_exact).
73
//!
74
//! ## Sign
75
//!
76
//! It is possible to extend the codes to represent signed integers. Two
77
//! possible approaches are using a [bijective mapping](crate::codes::ToInt)
78
//! between the integers and the natural numbers, or defining a specialized
79
//! code.
80
//!
81
//! # Implementations
82
//!
83
//! We provide two unsigned, ungrouped, complete representations, one
84
//! [big-endian](VByteBeRead) and one [little-endian](VByteLeRead). We recommend
85
//! in general big-endian version as it is lexicographical. However, the
86
//! big-endian version needs a small buffer when writing, so on some hardware
87
//! the little-endian version might be faster.
88
//!
89
//! Note that the endianness of the code is independent from the endianness of
90
//! the underlying bit stream, as it just depend on the order in which bytes are
91
//! written.
92
//!
93
//! Since this code is byte-aligned, we provide also convenient, fast methods
94
//! [`vbyte_write`] and [`vbyte_read`] that can be used on types implementing
95
//! [`std::io::Read`] and [`std::io::Write`]. There are also endianness-specific
96
//! methods [`vbyte_write_be`], [`vbyte_write_le`], [`vbyte_read_be`], and
97
//! [`vbyte_read_le`].
98
//!
99
//! [`LEB128`]: https://en.wikipedia.org/wiki/LEB128
100
//! [`VLQ`]: https://en.wikipedia.org/wiki/Variable-length_quantity
101
//!
102
//! # Examples
103
//!
104
//! - The [LEB128](https://en.wikipedia.org/wiki/LEB128) code used by LLVM is a
105
//!   little-endian incomplete ungrouped representation. There is both a signed
106
//!   and an unsigned version; the signed version represent negative numbers
107
//!   using two's complement.
108
//!
109
//! - The [code used by
110
//!   git](https://github.com/git/git/blob/7fb6aefd2aaffe66e614f7f7b83e5b7ab16d4806/varint.c)
111
//!   is a big-endian complete ungrouped representation.
112
//!
113
//! - [This implementation in
114
//!   folly](https://github.com/facebook/folly/blob/dd4a5eb057afbc3c7c7da71801df2ee3c61c47d1/folly/Varint.h#L109)
115
//!   is a little-endian incomplete ungrouped representation.
116

117
use crate::traits::*;
118

119
/// Return the length of the variable-length byte code for `value` in bytes.
120
#[must_use]
121
#[inline(always)]
122
pub fn byte_len_vbyte(mut value: u64) -> usize {
2,106,600✔
123
    let mut len = 1;
4,213,200✔
UNCOV
124
    loop {
×
125
        value >>= 7;
6,282,448✔
126
        if value == 0 {
6,282,448✔
127
            return len;
2,106,600✔
128
        }
UNCOV
129
        value -= 1;
×
UNCOV
130
        len += 1;
×
131
    }
132
}
133

134
/// Return the length of the variable-length byte code for `value` in bits.
135
#[must_use]
136
#[inline(always)]
137
pub fn bit_len_vbyte(value: u64) -> usize {
9,448✔
138
    8 * byte_len_vbyte(value)
9,448✔
139
}
140

141
/// Trait for reading big-endian variable-length byte codes.
142
///
143
/// Note that the endianness of the code is independent
144
/// from the endianness of the underlying bit stream.
145
pub trait VByteBeRead<E: Endianness>: BitRead<E> {
146
    fn read_vbyte_be(&mut self) -> Result<u64, Self::Error>;
147
}
148

149
/// Trait for reading little-endian variable-length byte codes.
150
///
151
/// Note that the endianness of the code is independent
152
/// from the endianness of the underlying bit stream.
153
pub trait VByteLeRead<E: Endianness>: BitRead<E> {
154
    fn read_vbyte_le(&mut self) -> Result<u64, Self::Error>;
155
}
156

157
impl<E: Endianness, B: BitRead<E>> VByteBeRead<E> for B {
158
    #[inline(always)]
159
    fn read_vbyte_be(&mut self) -> Result<u64, Self::Error> {
4,424✔
160
        let mut byte = self.read_bits(8)?;
13,272✔
UNCOV
161
        let mut value = byte & 0x7F;
×
162
        while (byte >> 7) != 0 {
9,620✔
163
            value += 1;
5,196✔
164
            byte = self.read_bits(8)?;
10,392✔
165
            value = (value << 7) | (byte & 0x7F);
5,196✔
166
        }
167
        Ok(value)
4,424✔
168
    }
169
}
170

171
impl<E: Endianness, B: BitRead<E>> VByteLeRead<E> for B {
172
    #[inline(always)]
173
    fn read_vbyte_le(&mut self) -> Result<u64, Self::Error> {
4,660✔
174
        let mut result = 0;
9,320✔
175
        let mut shift = 0;
9,320✔
176
        loop {
×
177
            let byte = self.read_bits(8)?;
34,632✔
UNCOV
178
            result += (byte & 0x7F) << shift;
×
UNCOV
179
            if (byte >> 7) == 0 {
×
180
                break;
4,660✔
181
            }
UNCOV
182
            shift += 7;
×
183
            result += 1 << shift;
×
184
        }
UNCOV
185
        Ok(result)
×
186
    }
187
}
188

189
/// Trait for write big-endian variable-length byte codes.
190
///
191
/// Note that the endianness of the code is independent
192
/// from the endianness of the underlying bit stream.
193
pub trait VByteBeWrite<E: Endianness>: BitWrite<E> {
194
    fn write_vbyte_be(&mut self, value: u64) -> Result<usize, Self::Error>;
195
}
196
/// Trait for write little-endian variable-length byte codes.
197
///
198
/// Note that the endianness of the code is independent
199
/// from the endianness of the underlying bit stream.
200
pub trait VByteLeWrite<E: Endianness>: BitWrite<E> {
201
    fn write_vbyte_le(&mut self, value: u64) -> Result<usize, Self::Error>;
202
}
203

204
impl<E: Endianness, B: BitWrite<E>> VByteBeWrite<E> for B {
205
    #[inline(always)]
206
    fn write_vbyte_be(&mut self, mut value: u64) -> Result<usize, Self::Error> {
4,392✔
207
        let mut buf = [0u8; 10];
8,784✔
208
        let mut pos = buf.len() - 1;
8,784✔
209
        buf[pos] = (value & 0x7F) as u8;
4,392✔
210
        value >>= 7;
4,392✔
211
        while value != 0 {
14,256✔
212
            value -= 1;
4,932✔
213
            pos -= 1;
4,932✔
214
            buf[pos] = 0x80 | (value & 0x7F) as u8;
4,932✔
215
            value >>= 7;
4,932✔
216
        }
217
        let bytes_to_write = buf.len() - pos;
13,176✔
218
        for &byte in &buf[pos..] {
13,716✔
219
            self.write_bits(byte as u64, 8)?;
27,972✔
220
        }
221
        Ok(bytes_to_write * 8)
4,392✔
222
    }
223
}
224

225
impl<E: Endianness, B: BitWrite<E>> VByteLeWrite<E> for B {
226
    #[inline(always)]
227
    fn write_vbyte_le(&mut self, mut value: u64) -> Result<usize, Self::Error> {
4,510✔
228
        let mut len = 1;
9,020✔
UNCOV
229
        loop {
×
230
            let byte = (value & 0x7F) as u8;
20,572✔
231
            value >>= 7;
10,286✔
232
            if value != 0 {
10,286✔
233
                self.write_bits((byte | 0x80) as u64, 8)?;
17,328✔
234
            } else {
235
                self.write_bits(byte as u64, 8)?;
4,510✔
236
                break;
4,510✔
237
            }
238
            value -= 1;
5,776✔
239
            len += 1;
×
240
        }
UNCOV
241
        Ok(len * 8)
×
242
    }
243
}
244

245
/// Write a natural number to a byte stream using variable-length byte codes and
246
/// return the number of bytes written.
247
///
248
/// This method just delegates to the correct endianness-specific method.
249
#[inline(always)]
250
pub fn vbyte_write<E: Endianness, W: std::io::Write>(
2,097,188✔
251
    value: u64,
252
    writer: &mut W,
253
) -> std::io::Result<usize> {
254
    if core::any::TypeId::of::<E>() == core::any::TypeId::of::<BigEndian>() {
2,097,188✔
255
        vbyte_write_be(value, writer)
3,145,782✔
256
    } else {
257
        vbyte_write_le(value, writer)
1,048,594✔
258
    }
259
}
260

261
/// Encode a natural number to a big-endian byte stream using variable-length
262
/// byte codes and return the number of bytes written.
263
#[inline(always)]
264
pub fn vbyte_write_be<W: std::io::Write>(mut value: u64, w: &mut W) -> std::io::Result<usize> {
1,048,594✔
265
    let mut buf = [0u8; 10];
2,097,188✔
266
    let mut pos = buf.len() - 1;
2,097,188✔
267
    buf[pos] = (value & 0x7F) as u8;
1,048,594✔
268
    value >>= 7;
1,048,594✔
269
    while value != 0 {
5,209,764✔
270
        value -= 1;
2,080,585✔
271
        pos -= 1;
2,080,585✔
272
        buf[pos] = 0x80 | (value & 0x7F) as u8;
2,080,585✔
273
        value >>= 7;
2,080,585✔
274
    }
275
    let bytes_to_write = buf.len() - pos;
3,145,782✔
276
    w.write_all(&buf[pos..])?;
3,145,782✔
277
    Ok(bytes_to_write)
1,048,594✔
278
}
279

280
/// Encode a natural number to a little-endian byte stream using variable-length
281
/// byte codes and return the number of bytes written.
282
#[inline(always)]
283
pub fn vbyte_write_le<W: std::io::Write>(mut value: u64, writer: &mut W) -> std::io::Result<usize> {
1,048,594✔
284
    let mut len = 1;
2,097,188✔
UNCOV
285
    loop {
×
286
        let byte = (value & 0x7F) as u8;
6,258,358✔
287
        value >>= 7;
3,129,179✔
288
        if value != 0 {
3,129,179✔
289
            writer.write_all(&[byte | 0x80])?;
6,241,755✔
290
        } else {
291
            writer.write_all(&[byte])?;
1,048,594✔
292
            break;
1,048,594✔
293
        }
294
        value -= 1;
2,080,585✔
295
        len += 1;
×
296
    }
UNCOV
297
    Ok(len)
×
298
}
299

300
#[inline(always)]
301
/// Decode a natural number from a byte stream using variable-length byte codes.
302
///
303
/// This method just delegates to the correct endianness-specific method.
304
pub fn vbyte_read<E: Endianness, R: std::io::Read>(reader: &mut R) -> std::io::Result<u64> {
2,097,188✔
305
    if core::any::TypeId::of::<E>() == core::any::TypeId::of::<BigEndian>() {
2,097,188✔
306
        vbyte_read_be(reader)
2,097,188✔
307
    } else {
308
        vbyte_read_le(reader)
1,048,594✔
309
    }
310
}
311

312
/// Decode a natural number from a big-endian byte stream using variable-length
313
/// byte codes.
314
#[inline(always)]
315
pub fn vbyte_read_be<R: std::io::Read>(reader: &mut R) -> std::io::Result<u64> {
1,048,594✔
316
    let mut buf = [0u8; 1];
2,097,188✔
UNCOV
317
    let mut value: u64;
×
318
    reader.read_exact(&mut buf)?;
3,145,782✔
319
    value = (buf[0] & 0x7F) as u64;
1,048,594✔
320
    while (buf[0] >> 7) != 0 {
3,129,179✔
321
        value += 1;
2,080,585✔
322
        reader.read_exact(&mut buf)?;
6,241,755✔
323
        value = (value << 7) | ((buf[0] & 0x7F) as u64);
2,080,585✔
324
    }
325
    Ok(value)
1,048,594✔
326
}
327

328
/// Decode a natural number from a little-endian byte stream using
329
/// variable-length byte codes.
330
#[inline(always)]
331
pub fn vbyte_read_le<R: std::io::Read>(reader: &mut R) -> std::io::Result<u64> {
1,048,594✔
332
    let mut result = 0;
2,097,188✔
333
    let mut shift = 0;
2,097,188✔
334
    let mut buffer = [0; 1];
2,097,188✔
UNCOV
335
    loop {
×
336
        reader.read_exact(&mut buffer)?;
9,387,537✔
337
        let byte = buffer[0];
3,129,179✔
UNCOV
338
        result += ((byte & 0x7F) as u64) << shift;
×
UNCOV
339
        if (byte >> 7) == 0 {
×
340
            break;
1,048,594✔
341
        }
UNCOV
342
        shift += 7;
×
343
        result += 1 << shift;
×
344
    }
UNCOV
345
    Ok(result)
×
346
}
347

348
#[cfg(test)]
349
mod test {
350
    use super::*;
351

352
    const UPPER_BOUND_1: u64 = 128;
353
    const UPPER_BOUND_2: u64 = 128_u64.pow(2) + UPPER_BOUND_1;
354
    const UPPER_BOUND_3: u64 = 128_u64.pow(3) + UPPER_BOUND_2;
355
    const UPPER_BOUND_4: u64 = 128_u64.pow(4) + UPPER_BOUND_3;
356
    const UPPER_BOUND_5: u64 = 128_u64.pow(5) + UPPER_BOUND_4;
357
    const UPPER_BOUND_6: u64 = 128_u64.pow(6) + UPPER_BOUND_5;
358
    const UPPER_BOUND_7: u64 = 128_u64.pow(7) + UPPER_BOUND_6;
359
    const UPPER_BOUND_8: u64 = 128_u64.pow(8) + UPPER_BOUND_7;
360

361
    macro_rules! impl_tests {
362
        ($test_name:ident, $E:ty) => {
363
            #[test]
364
            fn $test_name() {
365
                const MAX: usize = 1 << 20;
366
                const MIN: usize = 0;
367
                let mut buffer = std::io::Cursor::new(Vec::with_capacity(128));
368
                let mut lens = Vec::new();
369

370
                for i in MIN..MAX {
371
                    lens.push(vbyte_write::<$E, _>(i as _, &mut buffer).unwrap());
372
                }
373
                buffer.set_position(0);
374
                for (i, l) in (MIN..MAX).zip(lens.iter()) {
375
                    let j = vbyte_read::<$E, _>(&mut buffer).unwrap();
376
                    assert_eq!(byte_len_vbyte(i as _), *l);
377
                    assert_eq!(j, i as u64);
378
                }
379

380
                let values = [
381
                    0,
382
                    UPPER_BOUND_1 - 1,
383
                    UPPER_BOUND_1 + 1,
384
                    UPPER_BOUND_2 - 1,
385
                    UPPER_BOUND_2 + 1,
386
                    UPPER_BOUND_3 - 1,
387
                    UPPER_BOUND_3 + 1,
388
                    UPPER_BOUND_4 - 1,
389
                    UPPER_BOUND_4 + 1,
390
                    UPPER_BOUND_5 - 1,
391
                    UPPER_BOUND_5 + 1,
392
                    UPPER_BOUND_6 - 1,
393
                    UPPER_BOUND_6 + 1,
394
                    UPPER_BOUND_7 - 1,
395
                    UPPER_BOUND_7 + 1,
396
                    UPPER_BOUND_8 - 1,
397
                    UPPER_BOUND_8 + 1,
398
                    u64::MAX,
399
                ];
400

401
                let tell: u64 = buffer.position();
402
                for &i in values.iter() {
403
                    vbyte_write::<$E, _>(i, &mut buffer).unwrap();
404
                }
405
                buffer.set_position(tell);
406
                for &i in values.iter() {
407
                    assert_eq!(i, vbyte_read::<$E, _>(&mut buffer).unwrap());
408
                }
409
            }
410
        };
411
    }
412

413
    impl_tests!(test_vbytes_be, BE);
414
    impl_tests!(test_vbytes_le, LE);
415
}
STATUS · Troubleshooting · Open an Issue · Sales · Support · CAREERS · ENTERPRISE · START FREE · SCHEDULE DEMO
ANNOUNCEMENTS · TWITTER · TOS & SLA · Supported CI Services · What's a CI service? · Automated Testing

© 2026 Coveralls, Inc