17040114484

Committed 18 Aug 2025 12:07PM UTC coverage: 86.042% (-1.9%) from 87.913%

Build # 17040114484

Build Type

Pull #4215

github

Committed by

web-flow

Commit Message

Merge 4600ca5c4 into cb1a92920

Pull Request Pull Request #4215: Ji/vectors

Run Details

132 of 1817 new or added lines in 42 files covered. (7.26%)

125 existing lines in 26 files now uncovered.

56932 of 66168 relevant lines covered (86.04%)

611735.65 hits per line

Source File
Press 'n' to go to next uncovered line, 'b' for previous

0.0

/vortex-vector/src/view.rs

// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright the Vortex contributors

use std::fmt::Display;

use vortex_buffer::ByteBuffer;
use vortex_error::VortexExpect;

use crate::PIPELINE_STEP_COUNT;
use crate::bits::{BitView, BitViewMut};
use crate::types::{Element, VType};

pub struct View<'a> {
    /// The physical type of the vector, which defines how the elements are stored.
    pub(super) vtype: VType,
    /// A pointer to the allocated elements buffer.
    /// Alignment is at least the size of the element type.
    /// The capacity of the elements buffer is N * size_of::<T>() where T is the element type.
    pub(super) elements: *const u8,
    /// The validity mask for the vector, indicating which elements in the buffer are valid.
    /// This value can be `None` if the expected DType is `NonNullable`.
    pub(super) validity: Option<BitView<'a>>,
    // A selection mask over the elements and validity of the vector.
    pub(super) len: usize,

    /// Additional buffers of data used by the vector, such as string data.
    #[allow(dead_code)]
    pub(super) data: Vec<ByteBuffer>,

    /// Marker defining the lifetime of the contents of the vector.
    pub(super) _marker: std::marker::PhantomData<&'a ()>,
}

impl<'a> View<'a> {
    #[inline(always)]
    pub fn len(&self) -> usize {
        self.len
    }

    pub fn is_empty(&self) -> bool {
        self.len == 0
    }

    pub fn as_slice<T>(&self) -> &'a [T]
    where
        T: Element,
    {
        debug_assert_eq!(self.vtype, T::vtype(), "Invalid type for canonical view");
        // SAFETY: We assume that the elements are of type T and that the view is valid.
        unsafe { std::slice::from_raw_parts(self.elements.cast(), PIPELINE_STEP_COUNT) }
    }

    /// Re-interpret cast the vector into a new type where the element has the same width.
    #[inline(always)]
    pub fn reinterpret_as<E: Element>(&mut self) {
        assert_eq!(
            self.vtype.byte_width(),
            size_of::<E>(),
            "Cannot reinterpret {} as {}",
            self.vtype,
            E::vtype()
        );
        self.vtype = E::vtype();
    }
}

/// A vector is the atomic unit of canonical data in Vortex.
///
/// Like with our canonical arrays, it is physically typed, not logically typed. So each DType
/// has a well-defined physical representation in vector form.
///
/// I'm not quite sure on sizing yet. We could follow DuckDB and make vectors 2k elements. We
/// could follow FastLanes and make them 1k elements. Or we could do something interesting where
/// we pick the largest power of two that lets us fit ~3-4 vectors into the L1 cache. For example,
/// strings may be 1k elements, but u8 integers may be 8k elements. Many compute functions operate
/// over two inputs of the same type, so this could be interesting.
///
/// Interestingly, Vectors don't own their own data. In fact, I'm considering calling them 'views'.
/// This also solves our problem of zero-copy export by allowing us to pass down an output buffer
/// from an external source. This tends to work well since these external sources typically agree
/// with us on the physical representation of data, e.g. DuckDB, Arrow, Numpy, etc.
///
/// ## Why is this a single type-erased struct?
///
/// If we used generics at this level, we would taint all functions that use this type with a
/// generic type. To remove the generic, we'd need to introduce a trait, at which point we're
/// forced into both dynamic dispatch, and boxed return types. We also cannot down-cast a dynamic
/// trait that holds borrowed data since `Any` requires a static lifetime.
///
/// ## How do we handle custom encodings, e.g. FSST, RoaringBitMap, ZStd, etc.?
///
/// I could imagine a VarBinView vector (i.e. it has 16-byte views in its elements buffer), but
/// is able to delay decompression of the data buffers. These could be stored as child arrays and
/// decompressed on-demand, since this is now an opaque operation (and then call export on the
/// child data arrays using a slices mask? We'd be masking binary data... that sounds slow).
/// In conclusion... I'm not really sure yet.
///
/// What about dictionary arrays? Are they even important at this level?
/// I have done a "medium amount" of thinking on this, and I think we can actually get away with
/// not supporting dictionary encoding at the vector level. Vortex arrays actually define the
/// encoding tree, and therefore can decide how to implement a compute function. So where it's
/// possible to return a dictionary encoded thing, e.g. to DuckDB, we would have some sort of
/// Vortex Array -> DuckDB function that would check for top-level dictionaries and handle the
/// conversion at that layer.

/// ## Can we re-use parts of the pipeline to avoid common-subexpression elimination?
///
/// This gets tricky... Suppose we start with a Vortex expression. We can then pass that naively
/// through pipeline construction. This now represents a physical execution plan. At this point,
/// we could in theory optimize the pipeline by removing common sub-expressions, such as
/// canonicalizing the same field multiple times to pass into two comparison operators.
///
/// We'd then need some way to buffer the intermediate results as both expressions are driven.
/// Maybe this works? Not sure yet.
pub struct ViewMut<'a> {
    /// The physical type of the vector, which defines how the elements are stored.
    pub(super) vtype: VType,
    /// A pointer to the allocated elements buffer.
    /// Alignment is at least the size of the element type.
    /// The capacity of the elements buffer is N * size_of::<T>() where T is the element type.
    // TODO(ngates): it would be nice to guarantee _wider_ alignment, ideally 128 bytes, so that
    //  we can use aligned load/store instructions for wide SIMD lanes.
    pub(super) elements: *mut u8,
    /// The validity mask for the vector, indicating which elements in the buffer are valid.
    /// This value can be `None` if the expected DType is `NonNullable`.
    pub(super) validity: Option<BitViewMut<'a>>,

    /// Additional buffers of data used by the vector, such as string data.
    // TODO(ngates): ideally these buffers are compressed somehow? E.g. using FSST?
    #[allow(dead_code)]
    pub(super) data: Vec<ByteBuffer>,

    /// Marker defining the lifetime of the contents of the vector.
    pub(super) _marker: std::marker::PhantomData<&'a mut ()>,
}

impl<'a> ViewMut<'a> {
    pub fn new<E: Element>(elements: &'a mut [E], validity: Option<BitViewMut<'a>>) -> Self {
        assert_eq!(elements.len(), PIPELINE_STEP_COUNT);
        Self {
            vtype: E::vtype(),
            elements: elements.as_mut_ptr().cast(),
            validity,
            data: vec![],
            _marker: Default::default(),
        }
    }

    /// Re-interpret cast the vector into a new type where the element has the same width.
    #[inline(always)]
    pub fn reinterpret_as<E: Element>(&mut self) -> ViewMut<'a> {
        assert_eq!(
            self.vtype.byte_width(),
            size_of::<E>(),
            "Cannot reinterpret {} as {}",
            self.vtype,
            E::vtype()
        );
        Self {
            vtype: E::vtype(),
            elements: self.elements,
            validity: self.validity.take(),
            data: self.data.clone(),
            _marker: Default::default(),
        }
    }

    /// Returns an immutable slice of the elements in the vector.
    #[inline(always)]
    pub fn as_slice<E: Element>(&self) -> &'a [E] {
        debug_assert_eq!(self.vtype, E::vtype(), "Invalid type for canonical view");
        unsafe { std::slice::from_raw_parts(self.elements.cast::<E>(), PIPELINE_STEP_COUNT) }
    }

    /// Returns a mutable slice of the elements in the vector, allowing for modification.
    /// FIXME(ngates): test the performance if we return &mut [E; N] instead of &[E].
    #[inline(always)]
    pub fn as_slice_mut<E: Element>(&mut self) -> &'a mut [E] {
        debug_assert_eq!(self.vtype, E::vtype(), "Invalid type for canonical view");
        unsafe { std::slice::from_raw_parts_mut(self.elements.cast::<E>(), PIPELINE_STEP_COUNT) }
    }

    /// Access the validity mask of the vector.
    ///
    /// ## Panics
    ///
    /// Panics if the vector does not support validity, i.e. if the DType was non-nullable when
    /// it was created.
    pub fn validity(&mut self) -> &mut BitViewMut<'a> {
        self.validity
            .as_mut()
            .vortex_expect("Vector does not support validity")
    }

    pub fn add_buffer(&mut self, buffer: ByteBuffer) {
        self.data.push(buffer);
    }

    /// Flatten the view by bringing the selected elements of the mask to the beginning of
    /// the elements buffer.
    ///
    /// FIXME(ngates): also need to select validity bits.
    pub fn select_mask<E: Element + Display>(&mut self, mask: &BitView) {
        assert_eq!(
            self.vtype,
            E::vtype(),
            "ViewMut::flatten_mask: type mismatch"
        );

        match mask.true_count() {
            0 => {
                // If the mask has no true bits, we set the length to 0.
            }
            PIPELINE_STEP_COUNT => {
                // If the mask has N true bits, we copy all elements.
            }
            n if n > 3 * PIPELINE_STEP_COUNT / 4 => {
                // High density: use iter_zeros to compact by removing gaps
                let slice = self.as_slice_mut::<E>();
                let mut write_idx = 0;
                let mut read_idx = 0;

                mask.iter_zeros(|zero_idx| {
                    // Copy elements from read_idx to zero_idx (exclusive) to write_idx
                    let count = zero_idx - read_idx;
                    unsafe {
                        // SAFETY: We assume that the elements are of type E and that the view is valid.
                        // Using memmove for potentially overlapping regions
                        std::ptr::copy(
                            slice.as_ptr().add(read_idx),
                            slice.as_mut_ptr().add(write_idx),
                            count,
                        );
                        write_idx += count;
                    }
                    read_idx = zero_idx + 1;
                });

                // Copy any remaining elements after the last zero
                unsafe {
                    std::ptr::copy(
                        slice.as_ptr().add(read_idx),
                        slice.as_mut_ptr().add(write_idx),
                        PIPELINE_STEP_COUNT - read_idx,
                    );
                }
            }
            _ => {
                let mut offset = 0;
                let slice = self.as_slice_mut::<E>();
                mask.iter_ones(|idx| {
                    unsafe {
                        // SAFETY: We assume that the elements are of type E and that the view is valid.
                        let value = *slice.get_unchecked(idx);
                        // TODO(joe): use ptr increment (not offset).
                        *slice.get_unchecked_mut(offset) = value;

                        offset += 1;
                    }
                });
            }
        }
    }
}

1	// SPDX-License-Identifier: Apache-2.0
2	// SPDX-FileCopyrightText: Copyright the Vortex contributors
3
4	use std::fmt::Display;
5
6	use vortex_buffer::ByteBuffer;
7	use vortex_error::VortexExpect;
8
9	use crate::PIPELINE_STEP_COUNT;
10	use crate::bits::{BitView, BitViewMut};
11	use crate::types::{Element, VType};
12
13	pub struct View<'a> {
14	/// The physical type of the vector, which defines how the elements are stored.
15	pub(super) vtype: VType,
16	/// A pointer to the allocated elements buffer.
17	/// Alignment is at least the size of the element type.
18	/// The capacity of the elements buffer is N * size_of::<T>() where T is the element type.
19	pub(super) elements: *const u8,
20	/// The validity mask for the vector, indicating which elements in the buffer are valid.
21	/// This value can be `None` if the expected DType is `NonNullable`.
22	pub(super) validity: Option<BitView<'a>>,
23	// A selection mask over the elements and validity of the vector.
24	pub(super) len: usize,
25
26	/// Additional buffers of data used by the vector, such as string data.
27	#[allow(dead_code)]
28	pub(super) data: Vec<ByteBuffer>,
29
30	/// Marker defining the lifetime of the contents of the vector.
31	pub(super) _marker: std::marker::PhantomData<&'a ()>,
32	}
33
34	impl<'a> View<'a> {
35	#[inline(always)]
NEW 36	pub fn len(&self) -> usize {	×
NEW 37	self.len	×
NEW 38	}	×
39
NEW 40	pub fn is_empty(&self) -> bool {	×
NEW 41	self.len == 0	×
NEW 42	}	×
43
NEW 44	pub fn as_slice<T>(&self) -> &'a [T]	×
NEW 45	where	×
NEW 46	T: Element,	×
47	{
NEW 48	debug_assert_eq!(self.vtype, T::vtype(), "Invalid type for canonical view");	×
49	// SAFETY: We assume that the elements are of type T and that the view is valid.
NEW 50	unsafe { std::slice::from_raw_parts(self.elements.cast(), PIPELINE_STEP_COUNT) }	×
NEW 51	}	×
52
53	/// Re-interpret cast the vector into a new type where the element has the same width.
54	#[inline(always)]
NEW 55	pub fn reinterpret_as<E: Element>(&mut self) {	×
NEW 56	assert_eq!(	×
NEW 57	self.vtype.byte_width(),	×
58	size_of::<E>(),
NEW 59	"Cannot reinterpret {} as {}",	×
60	self.vtype,
NEW 61	E::vtype()	×
62	);
NEW 63	self.vtype = E::vtype();	×
NEW 64	}	×
65	}
66
67	/// A vector is the atomic unit of canonical data in Vortex.
68	///
69	/// Like with our canonical arrays, it is physically typed, not logically typed. So each DType
70	/// has a well-defined physical representation in vector form.
71	///
72	/// I'm not quite sure on sizing yet. We could follow DuckDB and make vectors 2k elements. We
73	/// could follow FastLanes and make them 1k elements. Or we could do something interesting where
74	/// we pick the largest power of two that lets us fit ~3-4 vectors into the L1 cache. For example,
75	/// strings may be 1k elements, but u8 integers may be 8k elements. Many compute functions operate
76	/// over two inputs of the same type, so this could be interesting.
77	///
78	/// Interestingly, Vectors don't own their own data. In fact, I'm considering calling them 'views'.
79	/// This also solves our problem of zero-copy export by allowing us to pass down an output buffer
80	/// from an external source. This tends to work well since these external sources typically agree
81	/// with us on the physical representation of data, e.g. DuckDB, Arrow, Numpy, etc.
82	///
83	/// ## Why is this a single type-erased struct?
84	///
85	/// If we used generics at this level, we would taint all functions that use this type with a
86	/// generic type. To remove the generic, we'd need to introduce a trait, at which point we're
87	/// forced into both dynamic dispatch, and boxed return types. We also cannot down-cast a dynamic
88	/// trait that holds borrowed data since `Any` requires a static lifetime.
89	///
90	/// ## How do we handle custom encodings, e.g. FSST, RoaringBitMap, ZStd, etc.?
91	///
92	/// I could imagine a VarBinView vector (i.e. it has 16-byte views in its elements buffer), but
93	/// is able to delay decompression of the data buffers. These could be stored as child arrays and
94	/// decompressed on-demand, since this is now an opaque operation (and then call export on the
95	/// child data arrays using a slices mask? We'd be masking binary data... that sounds slow).
96	/// In conclusion... I'm not really sure yet.
97	///
98	/// What about dictionary arrays? Are they even important at this level?
99	/// I have done a "medium amount" of thinking on this, and I think we can actually get away with
100	/// not supporting dictionary encoding at the vector level. Vortex arrays actually define the
101	/// encoding tree, and therefore can decide how to implement a compute function. So where it's
102	/// possible to return a dictionary encoded thing, e.g. to DuckDB, we would have some sort of
103	/// Vortex Array -> DuckDB function that would check for top-level dictionaries and handle the
104	/// conversion at that layer.
105
106	/// ## Can we re-use parts of the pipeline to avoid common-subexpression elimination?
107	///
108	/// This gets tricky... Suppose we start with a Vortex expression. We can then pass that naively
109	/// through pipeline construction. This now represents a physical execution plan. At this point,
110	/// we could in theory optimize the pipeline by removing common sub-expressions, such as
111	/// canonicalizing the same field multiple times to pass into two comparison operators.
112	///
113	/// We'd then need some way to buffer the intermediate results as both expressions are driven.
114	/// Maybe this works? Not sure yet.
115	pub struct ViewMut<'a> {
116	/// The physical type of the vector, which defines how the elements are stored.
117	pub(super) vtype: VType,
118	/// A pointer to the allocated elements buffer.
119	/// Alignment is at least the size of the element type.
120	/// The capacity of the elements buffer is N * size_of::<T>() where T is the element type.
121	// TODO(ngates): it would be nice to guarantee _wider_ alignment, ideally 128 bytes, so that
122	// we can use aligned load/store instructions for wide SIMD lanes.
123	pub(super) elements: *mut u8,
124	/// The validity mask for the vector, indicating which elements in the buffer are valid.
125	/// This value can be `None` if the expected DType is `NonNullable`.
126	pub(super) validity: Option<BitViewMut<'a>>,
127
128	/// Additional buffers of data used by the vector, such as string data.
129	// TODO(ngates): ideally these buffers are compressed somehow? E.g. using FSST?
130	#[allow(dead_code)]
131	pub(super) data: Vec<ByteBuffer>,
132
133	/// Marker defining the lifetime of the contents of the vector.
134	pub(super) _marker: std::marker::PhantomData<&'a mut ()>,
135	}
136
137	impl<'a> ViewMut<'a> {
NEW 138	pub fn new<E: Element>(elements: &'a mut [E], validity: Option<BitViewMut<'a>>) -> Self {	×
NEW 139	assert_eq!(elements.len(), PIPELINE_STEP_COUNT);	×
NEW 140	Self {	×
NEW 141	vtype: E::vtype(),	×
NEW 142	elements: elements.as_mut_ptr().cast(),	×
NEW 143	validity,	×
NEW 144	data: vec![],	×
NEW 145	_marker: Default::default(),	×
NEW 146	}	×
NEW 147	}	×
148
149	/// Re-interpret cast the vector into a new type where the element has the same width.
150	#[inline(always)]
NEW 151	pub fn reinterpret_as<E: Element>(&mut self) -> ViewMut<'a> {	×
NEW 152	assert_eq!(	×
NEW 153	self.vtype.byte_width(),	×
154	size_of::<E>(),
NEW 155	"Cannot reinterpret {} as {}",	×
156	self.vtype,
NEW 157	E::vtype()	×
158	);
NEW 159	Self {	×
NEW 160	vtype: E::vtype(),	×
NEW 161	elements: self.elements,	×
NEW 162	validity: self.validity.take(),	×
NEW 163	data: self.data.clone(),	×
NEW 164	_marker: Default::default(),	×
NEW 165	}	×
NEW 166	}	×
167
168	/// Returns an immutable slice of the elements in the vector.
169	#[inline(always)]
NEW 170	pub fn as_slice<E: Element>(&self) -> &'a [E] {	×
NEW 171	debug_assert_eq!(self.vtype, E::vtype(), "Invalid type for canonical view");	×
NEW 172	unsafe { std::slice::from_raw_parts(self.elements.cast::<E>(), PIPELINE_STEP_COUNT) }	×
NEW 173	}	×
174
175	/// Returns a mutable slice of the elements in the vector, allowing for modification.
176	/// FIXME(ngates): test the performance if we return &mut [E; N] instead of &[E].
177	#[inline(always)]
NEW 178	pub fn as_slice_mut<E: Element>(&mut self) -> &'a mut [E] {	×
NEW 179	debug_assert_eq!(self.vtype, E::vtype(), "Invalid type for canonical view");	×
NEW 180	unsafe { std::slice::from_raw_parts_mut(self.elements.cast::<E>(), PIPELINE_STEP_COUNT) }	×
NEW 181	}	×
182
183	/// Access the validity mask of the vector.
184	///
185	/// ## Panics
186	///
187	/// Panics if the vector does not support validity, i.e. if the DType was non-nullable when
188	/// it was created.
NEW 189	pub fn validity(&mut self) -> &mut BitViewMut<'a> {	×
NEW 190	self.validity	×
NEW 191	.as_mut()	×
NEW 192	.vortex_expect("Vector does not support validity")	×
NEW 193	}	×
194
NEW 195	pub fn add_buffer(&mut self, buffer: ByteBuffer) {	×
NEW 196	self.data.push(buffer);	×
NEW 197	}	×
198
199	/// Flatten the view by bringing the selected elements of the mask to the beginning of
200	/// the elements buffer.
201	///
202	/// FIXME(ngates): also need to select validity bits.
NEW 203	pub fn select_mask<E: Element + Display>(&mut self, mask: &BitView) {	×
NEW 204	assert_eq!(	×
205	self.vtype,
NEW 206	E::vtype(),	×
NEW 207	"ViewMut::flatten_mask: type mismatch"	×
208	);
209
NEW 210	match mask.true_count() {	×
NEW 211	0 => {	×
NEW 212	// If the mask has no true bits, we set the length to 0.	×
NEW 213	}	×
NEW 214	PIPELINE_STEP_COUNT => {	×
NEW 215	// If the mask has N true bits, we copy all elements.	×
NEW 216	}	×
NEW 217	n if n > 3 * PIPELINE_STEP_COUNT / 4 => {	×
218	// High density: use iter_zeros to compact by removing gaps
NEW 219	let slice = self.as_slice_mut::<E>();	×
NEW 220	let mut write_idx = 0;	×
NEW 221	let mut read_idx = 0;	×
222
NEW 223	mask.iter_zeros(\|zero_idx\| {	×
224	// Copy elements from read_idx to zero_idx (exclusive) to write_idx
NEW 225	let count = zero_idx - read_idx;	×
NEW 226	unsafe {	×
NEW 227	// SAFETY: We assume that the elements are of type E and that the view is valid.	×
NEW 228	// Using memmove for potentially overlapping regions	×
NEW 229	std::ptr::copy(	×
NEW 230	slice.as_ptr().add(read_idx),	×
NEW 231	slice.as_mut_ptr().add(write_idx),	×
NEW 232	count,	×
NEW 233	);	×
NEW 234	write_idx += count;	×
NEW 235	}	×
NEW 236	read_idx = zero_idx + 1;	×
NEW 237	});	×
238
239	// Copy any remaining elements after the last zero
NEW 240	unsafe {	×
NEW 241	std::ptr::copy(	×
NEW 242	slice.as_ptr().add(read_idx),	×
NEW 243	slice.as_mut_ptr().add(write_idx),	×
NEW 244	PIPELINE_STEP_COUNT - read_idx,	×
NEW 245	);	×
NEW 246	}	×
247	}
248	_ => {
NEW 249	let mut offset = 0;	×
NEW 250	let slice = self.as_slice_mut::<E>();	×
NEW 251	mask.iter_ones(\|idx\| {	×
NEW 252	unsafe {	×
NEW 253	// SAFETY: We assume that the elements are of type E and that the view is valid.	×
NEW 254	let value = *slice.get_unchecked(idx);	×
NEW 255	// TODO(joe): use ptr increment (not offset).	×
NEW 256	*slice.get_unchecked_mut(offset) = value;	×
NEW 257		×
NEW 258	offset += 1;	×
NEW 259	}	×
NEW 260	});	×
261	}
262	}
NEW 263	}	×
264	}

vortex-data / vortex / 17040114484

Source File Press 'n' to go to next uncovered line, 'b' for previous

Source File
Press 'n' to go to next uncovered line, 'b' for previous