Jobs: 3 | Files: 131 | Run time: 1min | Trigger: push (GitHub)
Tensor Parallelism -- Review Version (#776)

Summary: Pull Request resolved: https://github.com/pytorch/opacus/pull/776

We add a module named `grad_sample_module_fast_gradient_clipping_tp` to support 1D tensor parallelism in Opacus, and demonstrate its effectiveness on a toy example (examples/tp_toy.py) and a Llama 2 model (examples/tp_llama.py). The key contribution is determining, when merging per-sample gradient norms computed on different local devices, whether the merge should be an all-reduce or should simply keep the norm from device 0. Specifically, in the following exception cases:

1. The parameter is not a DTensor.
2. The module is an nn.Embedding under RowwiseParallel.
3. The parameter is an nn.Linear bias under RowwiseParallel.

the per-sample norms from the local devices should not be merged; instead, the norm from device 0 is kept (see the sketch below).

Currently, we do not support:

1. Vanilla Opacus mode (non-GC).
2. Sequence Parallelism (usually applied to layer normalization).
3. Modified/customized placements of module outputs, which might lead to incorrect norm calculation.
4. Trainable parameters that are plain tensors or replicated, because of optimizer constraints.

Reviewed By: aparna-aketi

Differential Revision: D79062242

fbshipit-source-id: ffaf071d1
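The merge rule described above can be written down as a minimal sketch. This is not the Opacus implementation: the function `merge_per_sample_norms`, its arguments, and the way the parallel style is passed in are hypothetical, and the imports assume a recent PyTorch where `DTensor` and `RowwiseParallel` are available under `torch.distributed.tensor`.

```python
# Hypothetical sketch of the norm-merge rule, not the Opacus code.
import torch
import torch.distributed as dist
from torch.distributed.tensor import DTensor
from torch.distributed.tensor.parallel import RowwiseParallel


def merge_per_sample_norms(
    module: torch.nn.Module,
    param_name: str,
    param: torch.nn.Parameter,
    local_sq_norm: torch.Tensor,  # per-sample squared grad norms on this TP rank
    parallel_style,               # the ParallelStyle applied to `module`, or None
    tp_group: dist.ProcessGroup,
) -> torch.Tensor:
    """Combine per-sample squared grad norms across a 1D tensor-parallel group.

    Default case: each rank holds only a shard of the parameter, so the local
    squared norms are partial sums and are merged with an all-reduce.  In the
    three exception cases from the summary (plain tensor, nn.Embedding under
    RowwiseParallel, nn.Linear bias under RowwiseParallel), the full norm is
    already available, so only device 0's value is kept.
    """
    rowwise = isinstance(parallel_style, RowwiseParallel)
    keep_rank0_only = (
        not isinstance(param, DTensor)                              # exception 1
        or (rowwise and isinstance(module, torch.nn.Embedding))     # exception 2
        or (rowwise and isinstance(module, torch.nn.Linear)
            and param_name.endswith("bias"))                        # exception 3
    )

    if keep_rank0_only and dist.get_rank(tp_group) != 0:
        # Non-zero ranks contribute nothing, so the sum below equals rank 0's value.
        local_sq_norm = torch.zeros_like(local_sq_norm)

    dist.all_reduce(local_sq_norm, op=dist.ReduceOp.SUM, group=tp_group)
    return local_sq_norm
```

Zeroing the non-zero ranks before the all-reduce keeps the collective identical in both cases, so every rank ends up with the same per-sample norms regardless of which rule applied.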
13 of 32 new or added lines in 3 files covered. (40.63%)
5556 of 6902 relevant lines covered (80.5%)
1.79 hits per line
| Lines | Coverage | ∆ | File |
|---|---|---|---|
| 19 | 36.67% | | opacus/grad_sample/grad_sample_module_fast_gradient_clipping_tp.py |
| ID | Job ID | Ran | Files | Coverage | |
|---|---|---|---|---|---|
| 1 | run-3 - 17571477754.1 | | 70 | 47.81% | GitHub Action Run |
| 2 | run-1 - 17571477754.2 | | 130 | 80.28% | GitHub Action Run |
| 3 | run-2 - 17571477754.3 | | 130 | 80.27% | GitHub Action Run |