kubeflow / trainer
master: 55%

LAST BUILD BRANCH: pr-29
DEFAULT BRANCH: master
Repo Added 20 Mar 2025 01:49PM UTC
Build 1067 Last
Files 26
LAST BUILD ON BRANCH fix-resource-allocation

02 Sep 2025 12:21PM UTC coverage: 54.734% (+2.6%) from 52.136%
17403349307

Pull #2653

github

jskswamy
fix(runtime): prevent launcher config override when runLauncherAsNode is true

Previously, when runLauncherAsNode was set to true, the launcher
container would receive the full trainer configuration including
image, command, args, and environment variables, which could
unintentionally override the launcher's original configuration.

This change ensures that launcher containers only receive resource
allocations (CPU, memory, GPU) when runLauncherAsNode is true, while
preserving their original image, command, and args. Node containers
continue to receive the full trainer configuration as expected.

The fix avoids override issues and improves the separation of
concerns between launcher and node container configurations in
MPI-based training jobs.

- Refactor Trainer method to apply full config only to Node containers
- Apply resources separately to both Node and Launcher containers
- Add comprehensive test case to verify launcher behavior
- Update existing test expectations to match correct behavior

Signed-off-by: Krishnaswamy Subramanian <subramk@thoughtworks.com>
Pull Request #2653: feat(runtime): add support for launcher resource allocation in MPI jobs

35 of 38 new or added lines in 1 file covered. (92.11%)

1081 of 1975 relevant lines covered (54.73%)

0.65 hits per line
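
The percentages reported above follow directly from the line counts. As a quick check, the sketch below recomputes them; the `percent` helper is a hypothetical name introduced here for illustration and is not part of Coveralls or the repository.

```go
package main

import "fmt"

// percent returns covered/relevant as a percentage.
// It returns 0 when there are no relevant lines.
func percent(covered, relevant int) float64 {
	if relevant == 0 {
		return 0
	}
	return 100 * float64(covered) / float64(relevant)
}

func main() {
	// New or added lines in this build: 35 of 38.
	fmt.Printf("new lines: %.2f%%\n", percent(35, 38))
	// All relevant lines in the repository: 1081 of 1975.
	fmt.Printf("overall:   %.2f%%\n", percent(1081, 1975))
	// Change relative to the previous reported coverage, in percentage points.
	fmt.Printf("delta:     %+.1f\n", 54.734-52.136)
}
```

The hits-per-line figure is not recomputed because the page does not list the total hit count.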

Source Files on fix-resource-allocation
  • Tree
  • List 26
  • Changed 1
  • Source Changed 0
  • Coverage Changed 1

Recent builds

Build 17403349307
  Branch: fix-resource-allocation
  Commit: fix(runtime): prevent launcher config override when runLauncherAsNode is true
    Previously, when runLauncherAsNode was set to true, the launcher container would receive the full trainer configuration including image, command, args, and environment ...
  Type: Pull #2653
  Ran: 02 Sep 2025 12:46PM UTC
  Committer: jskswamy
  Via: github
  Coverage: 54.73

Build 17027046590
  Branch: fix-resource-allocation
  Commit: Apply resources appropriately to both launcher and node containers
    The Trainer method has been updated to apply resources appropriately to both the launcher and node containers based on this flag. Key changes include: - Added the `isRunLauncherA...
  Type: Pull #2653
  Ran: 17 Aug 2025 11:27PM UTC
  Committer: jskswamy
  Via: github
  Coverage: 50.15

Build 15338735909
  Branch: fix-resource-allocation
  Commit: Apply resources appropriately to both launcher and node containers
    The Trainer method has been updated to apply resources appropriately to both the launcher and node containers based on this flag. Key changes include: - Added the `isRunLauncherA...
  Type: Pull #2653
  Ran: 30 May 2025 03:17PM UTC
  Committer: jskswamy
  Via: github
  Coverage: 30.45
See All Builds (1039)


© 2025 Coveralls, Inc