• Home
  • Features
  • Pricing
  • Docs
  • Announcements
  • Sign In

vortex-data / vortex / 17040114484

18 Aug 2025 12:07PM UTC coverage: 86.042% (-1.9%) from 87.913%
17040114484

Pull #4215

github

web-flow
Merge 4600ca5c4 into cb1a92920
Pull Request #4215: Ji/vectors

132 of 1817 new or added lines in 42 files covered. (7.26%)

125 existing lines in 26 files now uncovered.

56932 of 66168 relevant lines covered (86.04%)

611735.65 hits per line

Source File
Press 'n' to go to next uncovered line, 'b' for previous

0.0
/vortex-vector/src/view.rs
1
// SPDX-License-Identifier: Apache-2.0
2
// SPDX-FileCopyrightText: Copyright the Vortex contributors
3

4
use std::fmt::Display;
5

6
use vortex_buffer::ByteBuffer;
7
use vortex_error::VortexExpect;
8

9
use crate::PIPELINE_STEP_COUNT;
10
use crate::bits::{BitView, BitViewMut};
11
use crate::types::{Element, VType};
12

13
pub struct View<'a> {
14
    /// The physical type of the vector, which defines how the elements are stored.
15
    pub(super) vtype: VType,
16
    /// A pointer to the allocated elements buffer.
17
    /// Alignment is at least the size of the element type.
18
    /// The capacity of the elements buffer is N * size_of::<T>() where T is the element type.
19
    pub(super) elements: *const u8,
20
    /// The validity mask for the vector, indicating which elements in the buffer are valid.
21
    /// This value can be `None` if the expected DType is `NonNullable`.
22
    pub(super) validity: Option<BitView<'a>>,
23
    // A selection mask over the elements and validity of the vector.
24
    pub(super) len: usize,
25

26
    /// Additional buffers of data used by the vector, such as string data.
27
    #[allow(dead_code)]
28
    pub(super) data: Vec<ByteBuffer>,
29

30
    /// Marker defining the lifetime of the contents of the vector.
31
    pub(super) _marker: std::marker::PhantomData<&'a ()>,
32
}
33

34
impl<'a> View<'a> {
35
    #[inline(always)]
NEW
36
    pub fn len(&self) -> usize {
×
NEW
37
        self.len
×
NEW
38
    }
×
39

NEW
40
    pub fn is_empty(&self) -> bool {
×
NEW
41
        self.len == 0
×
NEW
42
    }
×
43

NEW
44
    pub fn as_slice<T>(&self) -> &'a [T]
×
NEW
45
    where
×
NEW
46
        T: Element,
×
47
    {
NEW
48
        debug_assert_eq!(self.vtype, T::vtype(), "Invalid type for canonical view");
×
49
        // SAFETY: We assume that the elements are of type T and that the view is valid.
NEW
50
        unsafe { std::slice::from_raw_parts(self.elements.cast(), PIPELINE_STEP_COUNT) }
×
NEW
51
    }
×
52

53
    /// Re-interpret cast the vector into a new type where the element has the same width.
54
    #[inline(always)]
NEW
55
    pub fn reinterpret_as<E: Element>(&mut self) {
×
NEW
56
        assert_eq!(
×
NEW
57
            self.vtype.byte_width(),
×
58
            size_of::<E>(),
NEW
59
            "Cannot reinterpret {} as {}",
×
60
            self.vtype,
NEW
61
            E::vtype()
×
62
        );
NEW
63
        self.vtype = E::vtype();
×
NEW
64
    }
×
65
}
66

67
/// A vector is the atomic unit of canonical data in Vortex.
68
///
69
/// Like with our canonical arrays, it is physically typed, not logically typed. So each DType
70
/// has a well-defined physical representation in vector form.
71
///
72
/// I'm not quite sure on sizing yet. We could follow DuckDB and make vectors 2k elements. We
73
/// could follow FastLanes and make them 1k elements. Or we could do something interesting where
74
/// we pick the largest power of two that lets us fit ~3-4 vectors into the L1 cache. For example,
75
/// strings may be 1k elements, but u8 integers may be 8k elements. Many compute functions operate
76
/// over two inputs of the same type, so this could be interesting.
77
///
78
/// Interestingly, Vectors don't own their own data. In fact, I'm considering calling them 'views'.
79
/// This also solves our problem of zero-copy export by allowing us to pass down an output buffer
80
/// from an external source. This tends to work well since these external sources typically agree
81
/// with us on the physical representation of data, e.g. DuckDB, Arrow, Numpy, etc.
82
///
83
/// ## Why is this a single type-erased struct?
84
///
85
/// If we used generics at this level, we would taint all functions that use this type with a
86
/// generic type. To remove the generic, we'd need to introduce a trait, at which point we're
87
/// forced into both dynamic dispatch, and boxed return types. We also cannot down-cast a dynamic
88
/// trait that holds borrowed data since `Any` requires a static lifetime.
89
///
90
/// ## How do we handle custom encodings, e.g. FSST, RoaringBitMap, ZStd, etc.?
91
///
92
/// I could imagine a VarBinView vector (i.e. it has 16-byte views in its elements buffer), but
93
/// is able to delay decompression of the data buffers. These could be stored as child arrays and
94
/// decompressed on-demand, since this is now an opaque operation (and then call export on the
95
/// child data arrays using a slices mask? We'd be masking binary data... that sounds slow).
96
/// In conclusion... I'm not really sure yet.
97
///
98
/// What about dictionary arrays? Are they even important at this level?
99
/// I have done a "medium amount" of thinking on this, and I think we can actually get away with
100
/// not supporting dictionary encoding at the vector level. Vortex arrays actually define the
101
/// encoding tree, and therefore can decide how to implement a compute function. So where it's
102
/// possible to return a dictionary encoded thing, e.g. to DuckDB, we would have some sort of
103
/// Vortex Array -> DuckDB function that would check for top-level dictionaries and handle the
104
/// conversion at that layer.
105

106
/// ## Can we re-use parts of the pipeline to avoid common-subexpression elimination?
107
///
108
/// This gets tricky... Suppose we start with a Vortex expression. We can then pass that naively
109
/// through pipeline construction. This now represents a physical execution plan. At this point,
110
/// we could in theory optimize the pipeline by removing common sub-expressions, such as
111
/// canonicalizing the same field multiple times to pass into two comparison operators.
112
///
113
/// We'd then need some way to buffer the intermediate results as both expressions are driven.
114
/// Maybe this works? Not sure yet.
115
pub struct ViewMut<'a> {
116
    /// The physical type of the vector, which defines how the elements are stored.
117
    pub(super) vtype: VType,
118
    /// A pointer to the allocated elements buffer.
119
    /// Alignment is at least the size of the element type.
120
    /// The capacity of the elements buffer is N * size_of::<T>() where T is the element type.
121
    // TODO(ngates): it would be nice to guarantee _wider_ alignment, ideally 128 bytes, so that
122
    //  we can use aligned load/store instructions for wide SIMD lanes.
123
    pub(super) elements: *mut u8,
124
    /// The validity mask for the vector, indicating which elements in the buffer are valid.
125
    /// This value can be `None` if the expected DType is `NonNullable`.
126
    pub(super) validity: Option<BitViewMut<'a>>,
127

128
    /// Additional buffers of data used by the vector, such as string data.
129
    // TODO(ngates): ideally these buffers are compressed somehow? E.g. using FSST?
130
    #[allow(dead_code)]
131
    pub(super) data: Vec<ByteBuffer>,
132

133
    /// Marker defining the lifetime of the contents of the vector.
134
    pub(super) _marker: std::marker::PhantomData<&'a mut ()>,
135
}
136

137
impl<'a> ViewMut<'a> {
NEW
138
    pub fn new<E: Element>(elements: &'a mut [E], validity: Option<BitViewMut<'a>>) -> Self {
×
NEW
139
        assert_eq!(elements.len(), PIPELINE_STEP_COUNT);
×
NEW
140
        Self {
×
NEW
141
            vtype: E::vtype(),
×
NEW
142
            elements: elements.as_mut_ptr().cast(),
×
NEW
143
            validity,
×
NEW
144
            data: vec![],
×
NEW
145
            _marker: Default::default(),
×
NEW
146
        }
×
NEW
147
    }
×
148

149
    /// Re-interpret cast the vector into a new type where the element has the same width.
150
    #[inline(always)]
NEW
151
    pub fn reinterpret_as<E: Element>(&mut self) -> ViewMut<'a> {
×
NEW
152
        assert_eq!(
×
NEW
153
            self.vtype.byte_width(),
×
154
            size_of::<E>(),
NEW
155
            "Cannot reinterpret {} as {}",
×
156
            self.vtype,
NEW
157
            E::vtype()
×
158
        );
NEW
159
        Self {
×
NEW
160
            vtype: E::vtype(),
×
NEW
161
            elements: self.elements,
×
NEW
162
            validity: self.validity.take(),
×
NEW
163
            data: self.data.clone(),
×
NEW
164
            _marker: Default::default(),
×
NEW
165
        }
×
NEW
166
    }
×
167

168
    /// Returns an immutable slice of the elements in the vector.
169
    #[inline(always)]
NEW
170
    pub fn as_slice<E: Element>(&self) -> &'a [E] {
×
NEW
171
        debug_assert_eq!(self.vtype, E::vtype(), "Invalid type for canonical view");
×
NEW
172
        unsafe { std::slice::from_raw_parts(self.elements.cast::<E>(), PIPELINE_STEP_COUNT) }
×
NEW
173
    }
×
174

175
    /// Returns a mutable slice of the elements in the vector, allowing for modification.
176
    /// FIXME(ngates): test the performance if we return &mut [E; N] instead of &[E].
177
    #[inline(always)]
NEW
178
    pub fn as_slice_mut<E: Element>(&mut self) -> &'a mut [E] {
×
NEW
179
        debug_assert_eq!(self.vtype, E::vtype(), "Invalid type for canonical view");
×
NEW
180
        unsafe { std::slice::from_raw_parts_mut(self.elements.cast::<E>(), PIPELINE_STEP_COUNT) }
×
NEW
181
    }
×
182

183
    /// Access the validity mask of the vector.
184
    ///
185
    /// ## Panics
186
    ///
187
    /// Panics if the vector does not support validity, i.e. if the DType was non-nullable when
188
    /// it was created.
NEW
189
    pub fn validity(&mut self) -> &mut BitViewMut<'a> {
×
NEW
190
        self.validity
×
NEW
191
            .as_mut()
×
NEW
192
            .vortex_expect("Vector does not support validity")
×
NEW
193
    }
×
194

NEW
195
    pub fn add_buffer(&mut self, buffer: ByteBuffer) {
×
NEW
196
        self.data.push(buffer);
×
NEW
197
    }
×
198

199
    /// Flatten the view by bringing the selected elements of the mask to the beginning of
200
    /// the elements buffer.
201
    ///
202
    /// FIXME(ngates): also need to select validity bits.
NEW
203
    pub fn select_mask<E: Element + Display>(&mut self, mask: &BitView) {
×
NEW
204
        assert_eq!(
×
205
            self.vtype,
NEW
206
            E::vtype(),
×
NEW
207
            "ViewMut::flatten_mask: type mismatch"
×
208
        );
209

NEW
210
        match mask.true_count() {
×
NEW
211
            0 => {
×
NEW
212
                // If the mask has no true bits, we set the length to 0.
×
NEW
213
            }
×
NEW
214
            PIPELINE_STEP_COUNT => {
×
NEW
215
                // If the mask has N true bits, we copy all elements.
×
NEW
216
            }
×
NEW
217
            n if n > 3 * PIPELINE_STEP_COUNT / 4 => {
×
218
                // High density: use iter_zeros to compact by removing gaps
NEW
219
                let slice = self.as_slice_mut::<E>();
×
NEW
220
                let mut write_idx = 0;
×
NEW
221
                let mut read_idx = 0;
×
222

NEW
223
                mask.iter_zeros(|zero_idx| {
×
224
                    // Copy elements from read_idx to zero_idx (exclusive) to write_idx
NEW
225
                    let count = zero_idx - read_idx;
×
NEW
226
                    unsafe {
×
NEW
227
                        // SAFETY: We assume that the elements are of type E and that the view is valid.
×
NEW
228
                        // Using memmove for potentially overlapping regions
×
NEW
229
                        std::ptr::copy(
×
NEW
230
                            slice.as_ptr().add(read_idx),
×
NEW
231
                            slice.as_mut_ptr().add(write_idx),
×
NEW
232
                            count,
×
NEW
233
                        );
×
NEW
234
                        write_idx += count;
×
NEW
235
                    }
×
NEW
236
                    read_idx = zero_idx + 1;
×
NEW
237
                });
×
238

239
                // Copy any remaining elements after the last zero
NEW
240
                unsafe {
×
NEW
241
                    std::ptr::copy(
×
NEW
242
                        slice.as_ptr().add(read_idx),
×
NEW
243
                        slice.as_mut_ptr().add(write_idx),
×
NEW
244
                        PIPELINE_STEP_COUNT - read_idx,
×
NEW
245
                    );
×
NEW
246
                }
×
247
            }
248
            _ => {
NEW
249
                let mut offset = 0;
×
NEW
250
                let slice = self.as_slice_mut::<E>();
×
NEW
251
                mask.iter_ones(|idx| {
×
NEW
252
                    unsafe {
×
NEW
253
                        // SAFETY: We assume that the elements are of type E and that the view is valid.
×
NEW
254
                        let value = *slice.get_unchecked(idx);
×
NEW
255
                        // TODO(joe): use ptr increment (not offset).
×
NEW
256
                        *slice.get_unchecked_mut(offset) = value;
×
NEW
257

×
NEW
258
                        offset += 1;
×
NEW
259
                    }
×
NEW
260
                });
×
261
            }
262
        }
NEW
263
    }
×
264
}
STATUS · Troubleshooting · Open an Issue · Sales · Support · CAREERS · ENTERPRISE · START FREE · SCHEDULE DEMO
ANNOUNCEMENTS · TWITTER · TOS & SLA · Supported CI Services · What's a CI service? · Automated Testing

© 2026 Coveralls, Inc