
shader-slang / slang-rhi / 21595320431 / 1

Build 21595320431 · default branch: main
Ran 02 Feb 2026 03:15PM UTC · Files: 158 · Run time: 4s
02 Feb 2026 03:04PM UTC coverage: 39.024% (-0.01%) from 39.037%
Job 21595320431.1 · push · github · web-flow
PyTorch-style caching allocator for the CUDA backend with proper multi-stream support. (#626)

* Add PyTorch-style caching allocator for GPU memory

Implements a caching allocator model that associates memory pages with
CUDA streams, enabling efficient memory reuse without expensive
cuMemAlloc/cuMemFree calls on every allocation.

Key features:
- Page-level stream tracking (m_stream set once, never transfers)
- Lazy CUDA event creation (only for multi-stream scenarios)
- PageCache stores freed pages for reuse instead of freeing to CUDA
- HeapCachingConfig for programmatic configuration
- Environment variable support (SLANG_RHI_ALLOCATOR_*)

This follows PyTorch's caching allocator design where:
- Block ownership remains with original allocation stream
- Events only created when current_stream != block.stream
- Memory reclaimed only after all stream events complete
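The page-cache idea described above can be sketched as follows. This is a minimal, hypothetical illustration, not the actual slang-rhi implementation: `StreamHandle`, `Page`, and `PageCache` here are stand-ins, and the real backend would hold driver stream handles and fall back to `cuMemAlloc` on a cache miss.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Stand-in for a CUstream handle.
using StreamHandle = std::uintptr_t;

// A cached GPU memory page. Per the design above, `stream` is set once at
// first allocation and never transfers to another stream.
struct Page {
    void* data;
    std::size_t size;
    StreamHandle stream;
};

// Parks freed pages per owning stream and hands them back on the next
// allocation from that stream, skipping a cuMemFree/cuMemAlloc round trip.
class PageCache {
public:
    // Return a cached page of at least `size` owned by `stream`, or nullptr
    // on a cache miss (the caller then falls back to the driver allocator).
    Page* tryAcquire(StreamHandle stream, std::size_t size) {
        auto& freeList = m_free[stream];
        for (auto it = freeList.begin(); it != freeList.end(); ++it) {
            if ((*it)->size >= size) {
                Page* page = *it;
                freeList.erase(it);
                return page;
            }
        }
        return nullptr;
    }

    // Store a retired page for reuse instead of freeing it to CUDA.
    void release(Page* page) { m_free[page->stream].push_back(page); }

private:
    std::map<StreamHandle, std::vector<Page*>> m_free;
};
```

Keying the free list by the owning stream is what lets reuse stay synchronization-free in the common case: a page handed back to the stream that produced it is ordered by that stream anyway.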

* Add lazy events optimization for single-stream workloads

- Skip event creation for single-stream submits (PyTorch-style)
- Use cuStreamQuery for non-blocking completion checks
- Add cuStreamQuery to dynamic CUDA API loading
- Add tests for lazy events and rapid alloc/free patterns

This optimization eliminates event overhead for single-stream workloads,
matching PyTorch's behavior where events are only created for cross-stream
synchronization.

* Fix lazy events: remove signalFenceCount check

The internal event is for command buffer retirement, not for user fences.
User fences use setCurrentValue(), which is a separate mechanism.
Single-stream retirement works with cuStreamQuery() regardless of user fences.

* Add test-caching-allocator.cpp to CMakeLists.txt

* Wire up multi-stream page tracking

- Set current stream when creating command encoder (not just at submit)
- Add Page::notifyUse() virtual hook called on every allocation
- Implement PageImpl::notifyUse() to call recordStreamUse() for cross-stream usage
- This enables proper PyTorch-style multi-stream memory synchronization:
  wh... (continued)
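The notifyUse() hook described above can be sketched as follows. This is a hypothetical illustration under the stated design, not the real PageImpl: `TrackedPage` and `extraStreams` are invented names, and the recorded streams stand in for the per-stream events that must complete before the page is reclaimed.

```cpp
#include <cstdint>
#include <set>

using StreamHandle = std::uintptr_t;

// On every allocation the page is notified with the current stream, and
// cross-stream use is recorded only when that stream differs from the
// page's owning stream (the same current_stream != block.stream test
// PyTorch applies).
struct TrackedPage {
    StreamHandle ownerStream;            // set once at allocation
    std::set<StreamHandle> extraStreams; // streams that must signal an event
                                         // before the page can be reused

    void notifyUse(StreamHandle currentStream) {
        if (currentStream != ownerStream)
            extraStreams.insert(currentStream); // recordStreamUse()
    }
};
```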

3945 of 11918 branches covered (33.1%)

Branch coverage included in aggregate %.

12277 of 29651 relevant lines covered (41.41%)

26729.97 hits per line

Source Files on job macos-aarch64 - 21595320431.1
Commit 7305dbb3 on github

© 2026 Coveralls, Inc