shader-slang / slang-rhi / build 21595320431

02 Feb 2026 03:04PM UTC coverage: 69.182% (-0.07% from 69.255%)

push · github · web-flow
PyTorch-style caching allocator for the CUDA backend with proper multi-stream support. (#626)

* Add PyTorch-style caching allocator for GPU memory

Implements a caching allocator model that associates memory pages with
CUDA streams, enabling efficient memory reuse without expensive
cuMemAlloc/cuMemFree calls on every allocation.

Key features (see the sketch after this list):
- Page-level stream tracking (m_stream set once, never transfers)
- Lazy CUDA event creation (only for multi-stream scenarios)
- PageCache stores freed pages for reuse instead of freeing to CUDA
- HeapCachingConfig for programmatic configuration
- Environment variable support (SLANG_RHI_ALLOCATOR_*)
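
A rough C++ sketch of how these pieces could fit together. Page, PageCache, HeapCachingConfig, m_stream, and the SLANG_RHI_ALLOCATOR_* prefix are taken from this description; the member layout, the size-keyed cache, and the specific environment variables below are illustrative assumptions, not the actual implementation:

    #include <cuda.h>
    #include <cstdlib>
    #include <map>
    #include <vector>

    struct HeapCachingConfig
    {
        bool enableCaching = true;            // keep freed pages for reuse
        size_t maxCachedBytes = 1ull << 30;   // cap on memory parked in the cache

        // Illustrative only: read overrides from SLANG_RHI_ALLOCATOR_* variables.
        static HeapCachingConfig fromEnvironment()
        {
            HeapCachingConfig cfg;
            if (const char* v = std::getenv("SLANG_RHI_ALLOCATOR_DISABLE_CACHING"))
                cfg.enableCaching = (std::atoi(v) == 0);
            if (const char* v = std::getenv("SLANG_RHI_ALLOCATOR_MAX_CACHED_BYTES"))
                cfg.maxCachedBytes = std::strtoull(v, nullptr, 10);
            return cfg;
        }
    };

    struct Page
    {
        CUdeviceptr ptr = 0;
        size_t size = 0;
        CUstream m_stream = nullptr;          // owning stream, set once at allocation
        std::vector<CUevent> pendingEvents;   // created lazily, only for cross-stream use
    };

    // Freed pages are parked here, keyed by size, instead of going back to cuMemFree.
    struct PageCache
    {
        std::multimap<size_t, Page*> freePagesBySize;

        Page* tryReuse(size_t size)
        {
            auto it = freePagesBySize.lower_bound(size);
            if (it == freePagesBySize.end())
                return nullptr;               // cache miss: caller falls back to cuMemAlloc
            Page* page = it->second;
            freePagesBySize.erase(it);
            return page;
        }

        void store(Page* page) { freePagesBySize.emplace(page->size, page); }
    };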

This follows PyTorch's caching allocator design (sketched after this list), where:
- Block ownership remains with original allocation stream
- Events only created when current_stream != block.stream
- Memory reclaimed only after all stream events complete
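
In code, the three rules above could look roughly like this. The sketch builds on the Page struct above; recordStreamUse and the reclaim check are illustrative names, only the rules themselves come from the commit message:

    // Rule 1: m_stream is never reassigned, so ownership stays with the allocating stream.
    // Rule 2: an event is recorded only when the page is touched from another stream.
    void recordStreamUse(Page* page, CUstream currentStream)
    {
        if (currentStream == page->m_stream)
            return;                                   // same stream: no event needed
        CUevent event = nullptr;
        cuEventCreate(&event, CU_EVENT_DISABLE_TIMING);
        cuEventRecord(event, currentStream);          // mark the cross-stream use
        page->pendingEvents.push_back(event);
    }

    // Rule 3: a freed page becomes reusable only once every recorded event has completed.
    bool isPageReclaimable(Page* page)
    {
        for (auto it = page->pendingEvents.begin(); it != page->pendingEvents.end();)
        {
            if (cuEventQuery(*it) != CUDA_SUCCESS)
                return false;                         // some stream may still touch this page
            cuEventDestroy(*it);
            it = page->pendingEvents.erase(it);
        }
        return true;                                  // safe to hand the page out again
    }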

* Add lazy events optimization for single-stream workloads

- Skip event creation for single-stream submits (PyTorch-style)
- Use cuStreamQuery for non-blocking completion checks
- Add cuStreamQuery to dynamic CUDA API loading
- Add tests for lazy events and rapid alloc/free patterns

This optimization eliminates event overhead for single-stream workloads,
matching PyTorch's behavior where events are only created for cross-stream
synchronization.
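
A sketch of the single-stream fast path. cuStreamQuery is the real driver API call named above; the wrapper function and its name are illustrative:

    // Non-blocking retirement check for a command buffer submitted on a single stream.
    // No CUevent is created on this path; cuStreamQuery reports whether all work
    // previously enqueued on the stream has finished.
    bool isSingleStreamSubmitRetired(CUstream stream)
    {
        CUresult result = cuStreamQuery(stream);
        if (result == CUDA_SUCCESS)
            return true;                      // all prior work on the stream has completed
        if (result == CUDA_ERROR_NOT_READY)
            return false;                     // still running; poll again later
        return false;                         // any other error: treat as not retired here
    }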

* Fix lazy events: remove signalFenceCount check

The internal event is for command buffer retirement, not for user fences.
User fences use setCurrentValue(), which is a separate mechanism.
Single-stream retirement works with cuStreamQuery() regardless of user fences.

* Add test-caching-allocator.cpp to CMakeLists.txt

* Wire up multi-stream page tracking

- Set current stream when creating command encoder (not just at submit)
- Add Page::notifyUse() virtual hook called on every allocation (see the sketch below)
- Implement PageImpl::notifyUse() to call recordStreamUse() for cross-stream usage
- This enables proper PyTorch-style multi-stream memory synchronization:
  wh... (continued)
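
A standalone sketch of how the hook described above could fit together. Page::notifyUse(), PageImpl, and recordStreamUse() are named in this commit; the bodies and declarations below are assumptions:

    #include <cuda.h>

    // Base page type: notifyUse() is a virtual hook invoked on every allocation that
    // hands out memory from this page; the default is a no-op.
    struct Page
    {
        CUstream m_stream = nullptr;          // owning stream, set once
        virtual ~Page() = default;
        virtual void notifyUse(CUstream /*currentStream*/) {}
    };

    // Stand-in for the cross-stream event recording sketched earlier.
    void recordStreamUse(Page* page, CUstream currentStream);

    // CUDA backend page: forwards cross-stream uses to the event-tracking path.
    struct PageImpl : public Page
    {
        void notifyUse(CUstream currentStream) override
        {
            // The command encoder sets the current stream when it is created (not just
            // at submit), so each suballocation is attributed to the stream using it.
            if (currentStream != m_stream)
                recordStreamUse(this, currentStream);
        }
    };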

10853 of 18688 branches covered (58.07%)

Branch coverage included in aggregate %.

201 of 302 new or added lines in 6 files covered (66.56%).

16 existing lines in 2 files now uncovered.

33052 of 44775 relevant lines covered (73.82%)

228616.78 hits per line

Source File: /src/cuda-driver-api.cpp (51.8% covered; source not available)
