Gradient Clipping¶
The process of adding differential privacy to a model involves bounding its sensitivity prior to
applying the Gaussian mechanism. This is achieved by clipping the per-sample gradients.
Normally, for a parameterized layer, if you have a tensor of parameters of size [m, n],
the size of the gradients will match it: the gradients get aggregated over the batch.
Here, we will instead keep them per-sample, i.e., we will have a tensor of size [b_sz, m, n],
where the slice [i, :, :] corresponds to the per-sample gradients for the i-th example in the batch.
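To make the shape concrete, here is a minimal sketch in plain PyTorch (not using Opacus) that builds such a per-sample gradient tensor for a single Linear layer by backpropagating one example at a time; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

# Illustrative sizes: batch of 4 examples, weight of shape [m, n] = [2, 3].
b_sz, n, m = 4, 3, 2
layer = nn.Linear(n, m, bias=False)  # layer.weight has shape [m, n]
x = torch.randn(b_sz, n)

# Backpropagate each example separately to get its own gradient.
per_sample_grads = []
for i in range(b_sz):
    layer.zero_grad()
    loss = layer(x[i]).sum()
    loss.backward()
    per_sample_grads.append(layer.weight.grad.clone())

# Stack into the [b_sz, m, n] tensor described above; the slice
# grad_sample[i] is the gradient for the i-th example.
grad_sample = torch.stack(per_sample_grads)
print(grad_sample.shape)  # torch.Size([4, 2, 3])
```

In practice Opacus computes this tensor in a single backward pass via hooks; the loop here is only to show what the resulting `grad_sample` tensor contains.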
Per-sample gradient clipping has to be achieved under the following constraints:
1. The norm of the grad_sample of the loss with respect to all model parameters has
to be clipped as if all the per-sample gradients were concatenated into a single vector. If C
is the clipping threshold, this ensures the total norm over all parameters is at most C.
Example
>>> T = torch.cat([p.grad_sample.reshape(B, -1) for p in model.parameters()], dim=1)
T will have shape [B, N_TOTAL_PARAMS]. The total L2 norm of each row of T
cannot be greater than C.
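As an illustration of this constraint, the sketch below (with made-up shapes, not the Opacus API) rescales each row of such a T so that its L2 norm is at most C; rows already under the threshold are left untouched.

```python
import torch

# Made-up shapes: B examples, each row is one example's flattened gradient.
B, N_TOTAL_PARAMS, C = 4, 10, 1.0
T = torch.randn(B, N_TOTAL_PARAMS) * 5.0

# Per-example norm, then a per-example scale factor capped at 1 so that
# rows with norm <= C are not changed.
per_sample_norms = T.norm(2, dim=1)               # shape [B]
clip_factor = (C / (per_sample_norms + 1e-6)).clamp(max=1.0)
T_clipped = T * clip_factor.unsqueeze(1)
```

The small epsilon guards against division by zero for an all-zero gradient row; the `clamp(max=1.0)` is what makes this a clip rather than a normalization.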
2. This clipping should not backpropagate. This means that clipping in layer i+1
should not affect computing the gradient of layer i. To make sure this is followed,
we will first compute the grad_sample of all layers without clipping. In a second pass, we will
go back to the per-sample gradients, clip them, and accumulate them in .grad
(thus replacing the “real” gradients).
Notes
There is only a single .backward() call as the second pass just works on top of the stored grad_sample.
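The second pass can be sketched as follows. This is an assumed re-implementation of the idea, not Opacus internals: given stored per-sample gradients for every parameter, compute one clip factor per example from the total norm across all parameters (constraint 1), scale, and reduce over the batch into what would become .grad.

```python
import torch

# Pretend stored per-sample gradients for two parameters,
# of shapes [2, 3] and [2], for a batch of 4 examples.
C = 1.0
b_sz = 4
grad_samples = [torch.randn(b_sz, 2, 3), torch.randn(b_sz, 2)]

# One clip factor per example, from the total norm across ALL parameters.
flat = torch.cat([g.reshape(b_sz, -1) for g in grad_samples], dim=1)
clip_factor = (C / (flat.norm(2, dim=1) + 1e-6)).clamp(max=1.0)

# Scale each example's gradients, then average over the batch
# (the loss_reduction='mean' case) to produce the .grad replacement.
grads = []
for g in grad_samples:
    clipped = g * clip_factor.view(b_sz, *([1] * (g.dim() - 1)))
    grads.append(clipped.mean(dim=0))
```

Because this works on tensors that were already stored during the single backward pass, no gradient flows through the clipping itself, which is exactly constraint 2.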
- class opacus.per_sample_gradient_clip.PerSampleGradientClipper(module, norm_clipper, batch_first=True, loss_reduction='mean')
Class to define a per-sample gradient clipper for a module. Per-sample gradient clipping bounds the sensitivity of the computation before applying the Gaussian mechanism.
Attaches to a module, and clips all grad_sample in the backward pass. It then puts them in each parameter’s .grad.
- Parameters
  - module (GradSampleModule) – Module to which backward hooks are added and for which per-sample gradients are clipped
  - norm_clipper (NormClipper) – A norm clipper object of class NormClipper, which encapsulates different clipping strategies (such as flat clipping for the entire model, or per-layer clipping)
  - batch_first (bool) – Flag to indicate whether the first dimension of the input tensor to the corresponding module represents the batch, for example of shape [batch_size, …, …]. Set to True if the batch appears in the first dimension; otherwise set to False (batch_first=False implies that the batch is always in the second dimension)
  - loss_reduction (str) – Indicates if the loss reduction (for aggregating the gradients) is a sum or a mean operation. Can take values sum or mean
- clip_and_accumulate()
Clips and sums up per-sample gradients into an accumulator. When this function is called
N >= 1 times on mini-batches of size B (which could be smaller on the final batch), a call to pre_step()
will populate the .grad field with the average gradient over the entire batch of size (N-1)*B + b,
with b <= B.
- Return type
  None
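The accumulation and averaging described above can be sketched as follows. This is assumed logic rather than Opacus internals: each call sums the (already clipped) per-sample gradients of one mini-batch into an accumulator, and the final division yields the average over the virtual batch of size (N-1)*B + b.

```python
import torch

torch.manual_seed(0)
B, b = 8, 5
# N = 3 mini-batches of per-sample gradients for one parameter of shape [3];
# the last batch is smaller (b <= B).
batches = [torch.randn(B, 3), torch.randn(B, 3), torch.randn(b, 3)]

accumulator = torch.zeros(3)
total_examples = 0
for g in batches:
    # clip_and_accumulate: sum this batch's per-sample gradients.
    accumulator += g.sum(dim=0)
    total_examples += g.shape[0]

# pre_step: average over the whole virtual batch of (N-1)*B + b examples.
grad = accumulator / total_examples
print(total_examples)  # 21
```

Deferring the division until pre_step() is what makes the result an average over all (N-1)*B + b examples rather than an average of per-batch averages, which would weight the smaller final batch incorrectly.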
- pre_step()
Prepares the .grad field of the parameters and provides statistics on the maximum gradient norm,
which should be used to scale noise in the privacy engine (opacus.privacy_engine.PrivacyEngine).
This function is called before the optimizer step().