
shader-slang / slang-rhi / build 21595320431

Build
Default branch: main
Ran: 02 Feb 2026 03:09PM UTC
Jobs: 3
Files: 240
Run time: 1min
Coverage: 69%

02 Feb 2026 03:04PM UTC · coverage: 69.182% (-0.07%) from 69.255%
Build 21595320431 · commit 7305dbb3 · push · github · web-flow

PyTorch-style caching allocator for the CUDA backend with proper multi-stream support. (#626)

* Add PyTorch-style caching allocator for GPU memory

Implements a caching allocator model that associates memory pages with
CUDA streams, enabling efficient memory reuse without expensive
cuMemAlloc/cuMemFree calls on every allocation.

Key features:
- Page-level stream tracking (m_stream set once, never transfers)
- Lazy CUDA event creation (only for multi-stream scenarios)
- PageCache stores freed pages for reuse instead of freeing to CUDA
- HeapCachingConfig for programmatic configuration
- Environment variable support (SLANG_RHI_ALLOCATOR_*)

This follows PyTorch's caching allocator design where:
- Block ownership remains with original allocation stream
- Events only created when current_stream != block.stream
- Memory reclaimed only after all stream events complete
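
The page/stream-ownership model above can be sketched roughly as follows. This is a hypothetical illustration, not the actual slang-rhi implementation; every name besides Page, PageCache, and m_stream (m_ptr, m_size, m_usedStreams, m_pendingEvents, tryReuse) is invented:

    // Hypothetical sketch of the page/stream-ownership model described above.
    #include <cuda.h>
    #include <map>
    #include <set>
    #include <vector>

    struct Page
    {
        CUdeviceptr m_ptr = 0;
        size_t m_size = 0;
        CUstream m_stream = nullptr;          // set once at allocation, never transfers
        std::set<CUstream> m_usedStreams;     // other streams that touched this page
        std::vector<CUevent> m_pendingEvents; // lazily created, cross-stream only
    };

    // Freed pages are cached by size instead of being released with cuMemFree.
    struct PageCache
    {
        std::multimap<size_t, Page*> m_freePages;

        // Reuse a cached page only once every recorded cross-stream event has
        // completed; cuEventQuery never blocks.
        Page* tryReuse(size_t size)
        {
            for (auto it = m_freePages.lower_bound(size); it != m_freePages.end(); ++it)
            {
                Page* page = it->second;
                bool ready = true;
                for (CUevent e : page->m_pendingEvents)
                    if (cuEventQuery(e) != CUDA_SUCCESS)
                        ready = false; // still in flight on another stream
                if (!ready)
                    continue;
                for (CUevent e : page->m_pendingEvents)
                    cuEventDestroy(e);
                page->m_pendingEvents.clear();
                m_freePages.erase(it);
                return page;
            }
            return nullptr; // caller falls back to cuMemAlloc for a new page
        }
    };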

* Add lazy events optimization for single-stream workloads

- Skip event creation for single-stream submits (PyTorch-style)
- Use cuStreamQuery for non-blocking completion checks
- Add cuStreamQuery to dynamic CUDA API loading
- Add tests for lazy events and rapid alloc/free patterns

This optimization eliminates event overhead for single-stream workloads,
matching PyTorch's behavior where events are only created for cross-stream
synchronization.
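
A minimal sketch of such a non-blocking retirement check, assuming an invented helper isRetired (the real code paths differ):

    // Hypothetical retirement check. On a single-stream submit no event was
    // recorded, so completion is polled with cuStreamQuery, which never blocks.
    #include <cuda.h>

    bool isRetired(CUstream stream, CUevent crossStreamEvent /* null if single-stream */)
    {
        if (crossStreamEvent)
            return cuEventQuery(crossStreamEvent) == CUDA_SUCCESS; // multi-stream path

        // Single-stream path: CUDA_SUCCESS means all work submitted to the
        // stream so far has completed; CUDA_ERROR_NOT_READY means it has not.
        return cuStreamQuery(stream) == CUDA_SUCCESS;
    }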

* Fix lazy events: remove signalFenceCount check

The internal event is for command buffer retirement, not for user fences.
User fences use setCurrentValue() which is a separate mechanism.
Single-stream retirement works with cuStreamQuery() regardless of user fences.

* Add test-caching-allocator.cpp to CMakeLists.txt

* Wire up multi-stream page tracking

- Set current stream when creating command encoder (not just at submit)
- Add Page::notifyUse() virtual hook called on every allocation
- Implement PageImpl::notifyUse() to call recordStreamUse() for cross-stream usage
- This enables proper PyTorch-style multi-stream memory synchronization:
  wh... (continued)
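
Continuing the hypothetical Page sketch from above, the cross-stream hook might look roughly like this; recordStreamUse and notifyUse are named in the commit, but these bodies and onFree are assumptions:

    // Hypothetical: remembering cross-stream use is cheap; events are only
    // created later, when the page is freed back to the cache.
    void recordStreamUse(Page* page, CUstream currentStream)
    {
        if (currentStream != page->m_stream)
            page->m_usedStreams.insert(currentStream); // same-stream fast path records nothing
    }

    // On free, record one event per using stream; the page becomes reusable
    // only after every event completes (see PageCache::tryReuse above).
    void onFree(Page* page)
    {
        for (CUstream s : page->m_usedStreams)
        {
            CUevent e;
            cuEventCreate(&e, CU_EVENT_DISABLE_TIMING); // lazy: created only here
            cuEventRecord(e, s);                        // marks the cross-stream use
            page->m_pendingEvents.push_back(e);
        }
        page->m_usedStreams.clear();
    }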

10853 of 18688 branches covered (58.07%)

Branch coverage included in aggregate %.

201 of 302 new or added lines in 6 files covered (66.56%).

16 existing lines in 2 files now uncovered.

33052 of 44775 relevant lines covered (73.82%)

228616.78 hits per line

New Missed Lines in Diff

Lines  Coverage  ∆        File
1      51.8%     -0.02%   src/cuda-driver-api.cpp
4      81.91%    +1.04%   src/cuda/cuda-command.cpp
8      30.3%     -1.52%   src/heap.h
88     54.08%    -32.79%  src/cuda/cuda-heap.cpp

Uncovered Existing Lines

Lines  Coverage  ∆        File
5      81.91%    +1.04%   src/cuda/cuda-command.cpp
11     54.08%    -32.79%  src/cuda/cuda-heap.cpp
Jobs

ID  Job ID                          Ran                      Files  Coverage
1   macos-aarch64 - 21595320431.1   02 Feb 2026 03:14PM UTC  158    39.02%
2   windows-x86_64 - 21595320431.2  02 Feb 2026 03:11PM UTC  217    68.52%
3   linux-x86_64 - 21595320431.3    02 Feb 2026 03:08PM UTC  167    57.98%
Source Files on build 21595320431: 240 total, 34 changed (10 source changed, 34 coverage changed)