The process of adding differential privacy to a model involves bounding its sensitivity prior to
applying the Gaussian mechanism. This is achieved by clipping the per-sample gradients.
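To make the connection concrete, here is a minimal sketch (toy values, not the Opacus API) of the Gaussian mechanism applied to an already-clipped gradient sum: once each per-sample gradient norm is bounded by a threshold C, noise with standard deviation proportional to C suffices.

```python
import torch

# Toy values (assumptions for illustration, not from the library): the
# clipping threshold C bounds the sensitivity of the gradient sum.
C, sigma = 1.0, 0.5
summed_grad = torch.randn(3, 5)  # pretend: sum of clipped per-sample gradients

# Gaussian mechanism: add noise calibrated to the sensitivity bound C.
noisy_grad = summed_grad + torch.normal(0.0, sigma * C, size=summed_grad.shape)
print(tuple(noisy_grad.shape))
```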
Normally, for a parameterized layer with a parameter tensor of size [m, n], the gradient tensor
will have the same size, because gradients are aggregated over the batch. Here, we will keep
them per-sample, i.e., we will have a tensor of size [b_sz, m, n], where slice
[i, :, :] corresponds to the per-sample gradients for the i-th example in the batch.
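The per-sample layout can be sketched as follows (a naive loop over the batch, not the hooks Opacus uses, with assumed toy dimensions):

```python
import torch

b_sz, m, n = 4, 3, 5
layer = torch.nn.Linear(n, m, bias=False)  # weight has shape [m, n]
x = torch.randn(b_sz, n)

# Compute one gradient per example by backpropagating each example alone.
per_sample = []
for i in range(b_sz):
    layer.zero_grad()
    loss = layer(x[i]).sum()
    loss.backward()
    per_sample.append(layer.weight.grad.clone())

# Stack so that slice [i, :, :] is the gradient for the i-th example.
grad_sample = torch.stack(per_sample)  # shape [b_sz, m, n]
print(tuple(grad_sample.shape))
```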
Per-sample gradient clipping has to be achieved under the following constraints:
1. The norm of the grad_sample of the loss with respect to all model parameters has
to be clipped so that, if they were all put together in a single vector, its total norm
would be bounded. If C is the clipping threshold, this ensures the total norm will be at most C:
>>> T = torch.cat([p.grad_sample.reshape(B, -1) for p in model.parameters()], dim=1)
T will have shape [B, N_TOTAL_PARAMS]. The total L2 norm of each row of T
cannot be greater than C.
2. This clipping should not backpropagate. This means that clipping the gradients of layer
i+1 should not affect computing the gradients of layer i. To make sure this is followed,
we will first compute the grad_sample of all layers without clipping. In a second pass, we will
go back to the per-sample gradients, clip them, and accumulate them in .grad
(thus replacing the "real" gradients).
There is only a single .backward() call, as the second pass simply works on top of the stored grad_sample.
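The clipping step of the two constraints above can be sketched like this (an illustrative standalone snippet with made-up shapes, not the Opacus implementation): flatten each parameter's per-sample gradients into one [B, N_TOTAL_PARAMS] matrix, then rescale each row so its L2 norm is at most C.

```python
import torch

B, C = 4, 1.0
# Toy per-sample gradients for two parameters (shapes are assumptions).
grad_samples = [torch.randn(B, 3, 5), torch.randn(B, 7)]

# One row per example, concatenating all parameters' flattened gradients.
T = torch.cat([g.reshape(B, -1) for g in grad_samples], dim=1)
per_sample_norm = T.norm(2, dim=1)

# Scale factor min(1, C / norm): rows already below C are left untouched.
scale = (C / (per_sample_norm + 1e-6)).clamp(max=1.0)
clipped = [g * scale.view(B, *([1] * (g.dim() - 1))) for g in grad_samples]

# After clipping, every example's total gradient norm is at most C.
total_norm = torch.cat([g.reshape(B, -1) for g in clipped], dim=1).norm(2, dim=1)
print(bool((total_norm <= C + 1e-4).all()))
```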
PerSampleGradientClipper(module, norm_clipper, batch_first=True, loss_reduction='mean')
Class that defines a per-sample gradient clipper for a module. Per-sample gradient clipping bounds the sensitivity of the computation before applying the Gaussian mechanism.
It attaches to a module and clips all grad_sample tensors in the backward pass. It then puts them in each parameter's .grad.
module (nn.Module) – Module to which backward hooks are added and for which per-sample gradients are clipped
batch_first (bool) – Flag to indicate whether the first dimension of the input tensor to the corresponding module represents the batch, for example of shape [batch_size, ..., ...]. Set to True if the batch appears in the first dimension, else set to False (batch_first=False implies that the batch is always in the second dimension).
loss_reduction (str) – Indicates whether the loss reduction (for aggregating the gradients) is a sum or a mean operation. Can take the values "sum" or "mean".
clip_and_accumulate()
Clips and sums up per-sample gradients into an accumulator. When this function is called
N >= 1 times on mini-batches of size B (the final batch can be smaller), a call to
pre_step() will populate the .grad field with the average gradient over the entire batch of size
(N-1) * B + b, with b <= B.
- Return type: None
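The batch-size bookkeeping described above can be sketched in plain Python (toy numbers; clip_and_accumulate itself is not invoked here): sums are accumulated over N mini-batches of size B, the last of size b <= B, then divided by the total (N-1) * B + b.

```python
# Two full mini-batches of size B = 4, then a final batch of b = 2.
N, B, b = 3, 4, 2
batches = [[1.0] * B, [2.0] * B, [3.0] * b]

accumulator = 0.0
count = 0
for batch in batches:
    accumulator += sum(batch)  # analogous to clip_and_accumulate: sum clipped grads
    count += len(batch)

assert count == (N - 1) * B + b  # 2 * 4 + 2 = 10 examples in total
average = accumulator / count    # analogous to pre_step: average over the full batch
print(average)
```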
pre_step()
Prepares the .grad field of the parameters and provides statistics on the maximum gradient norm, which should be used to scale noise in the privacy engine (opacus.privacy_engine.PrivacyEngine). This function is called before the optimizer step.
Sets the function to be called after clipping to the given callable (for example, for clipping stats collection).
Deletes the added attributes, grad_sample and summed_grad.
These two attributes are automatically deleted when pre_step or
clip_and_accumulate are properly called. This is a safety measure to avoid further issues if regular use has not been followed.