Gradient Clipping

Adding differential privacy to a model involves bounding its sensitivity before applying the Gaussian mechanism. This is achieved by clipping the per-sample gradients. Normally, for a parameterized layer with a parameter tensor of size [m, n], the gradient has the same size, which means the gradients are aggregated over the batch. Here, we keep them per-sample: we have a tensor of size [b_sz, m, n], where the slice [i, :, :] holds the gradient for the i-th example in the batch.
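To make the shapes concrete, here is a minimal pure-Python sketch (no PyTorch) for a single linear layer y = W x with a squared-error loss. The function name per_sample_grads and the toy data are illustrative assumptions and are not part of Opacus. It builds the [b_sz, m, n] per-sample gradients as nested lists and shows that averaging over the batch dimension gives an ordinary [m, n] gradient.

```python
def per_sample_grads(W, xs, ts):
    """Per-sample gradients of 0.5 * ||W x - t||^2 with respect to W.

    Returns a nested list of shape [b_sz, m, n]; entry [i] is the gradient
    for the i-th example only.
    """
    grads = []
    for x, t in zip(xs, ts):
        # Forward pass for one example: y = W x
        y = [sum(w * xj for w, xj in zip(row, x)) for row in W]
        # dL/dy = y - t, and dL/dW is the outer product of dL/dy and x
        err = [yi - ti for yi, ti in zip(y, t)]
        grads.append([[e * xj for xj in x] for e in err])
    return grads


W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]   # m = 2, n = 3
xs = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]  # b_sz = 2
ts = [[0.0, 0.0], [1.0, 1.0]]

grad_sample = per_sample_grads(W, xs, ts)
shape = (len(grad_sample), len(grad_sample[0]), len(grad_sample[0][0]))

# Averaging over the batch dimension gives the usual [m, n] gradient
# of a mean-reduced loss.
b_sz = len(grad_sample)
grad = [[sum(g[r][c] for g in grad_sample) / b_sz for c in range(3)]
        for r in range(2)]
```

Keeping the leading batch dimension is what makes it possible to bound each example's contribution separately before aggregation.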

Per-sample gradient clipping has to be achieved under the following constraints:

1. The norm of the grad_sample of the loss with respect to all model parameters has to be clipped as if those gradients were concatenated into a single vector. If C is the clipping threshold, this ensures that the total norm is at most C.


>>> import torch
>>> T = torch.cat([p.grad_sample.flatten(start_dim=1) for p in model.parameters()], dim=1)

T will have shape [B, N_TOTAL_PARAMS]. The total L2 norm of each row of T cannot be greater than C.
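The row-wise constraint can be sketched in plain Python as follows. The helper name clip_rows, the toy matrix, and the small constant that guards against division by zero are assumptions made for illustration; this is not the library's implementation.

```python
import math


def clip_rows(T, C):
    """Scale each row of T so that its L2 norm is at most C.

    T is a list of rows (one per example), each holding the flattened
    per-sample gradient over all parameters.
    """
    clipped = []
    for row in T:
        norm = math.sqrt(sum(v * v for v in row))
        # Rows already within the threshold are left unchanged. The 1e-6
        # term avoids dividing by zero and is an assumption of this sketch.
        factor = min(1.0, C / (norm + 1e-6))
        clipped.append([v * factor for v in row])
    return clipped


T = [[3.0, 4.0], [0.3, 0.4]]  # row norms: 5.0 and 0.5
clipped = clip_rows(T, C=1.0)
norms = [math.sqrt(sum(v * v for v in row)) for row in clipped]
```

Scaling a whole row by a single factor preserves the direction of that example's gradient and only shrinks its magnitude.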

2. This clipping should not backpropagate. This means that clipping in layer i+1 should not affect the gradient computed for layer i. To guarantee this, we first compute the grad_sample of all layers without clipping. In a second pass, we go back to the per-sample gradients, clip them, and accumulate them in .grad (thus replacing the “real” gradients).


There is only a single .backward() call as the second pass just works on top of the stored grad_sample.
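The second pass can be illustrated with a small stdlib-only sketch. It assumes the first pass has already stored unclipped per-sample gradients for every parameter; the function name clip_and_store, the dictionary layout, and the 1e-6 stabilizer are hypothetical and chosen for illustration only.

```python
import math


def clip_and_store(grad_samples, C):
    """Second pass: clip stored per-sample gradients and average them.

    grad_samples maps a parameter name to a list of per-sample gradients
    (one flat list per example). Returns the clipped, batch-averaged
    gradients per parameter, plus the unclipped per-example norms.
    """
    names = list(grad_samples)
    b_sz = len(grad_samples[names[0]])

    # The per-example norm spans every parameter and is computed from the
    # stored, unclipped values before anything is modified.
    norms = [
        math.sqrt(sum(v * v for n in names for v in grad_samples[n][i]))
        for i in range(b_sz)
    ]
    factors = [min(1.0, C / (norm + 1e-6)) for norm in norms]

    grad = {}
    for n in names:
        size = len(grad_samples[n][0])
        grad[n] = [
            sum(factors[i] * grad_samples[n][i][k] for i in range(b_sz)) / b_sz
            for k in range(size)
        ]
    return grad, norms


grad_samples = {"weight": [[3.0, 0.0], [0.0, 0.0]], "bias": [[4.0], [0.5]]}
grad, norms = clip_and_store(grad_samples, C=1.0)
```

Because the scaling factors are derived only from values stored after the backward pass, clipping one layer cannot change the gradient computed for another.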

class opacus.per_sample_gradient_clip.PerSampleGradientClipper(module, norm_clipper, batch_first=True, loss_reduction='mean')

Class to define a per-sample gradient clipper for a module. Per-sample gradient clipping bounds the sensitivity of the computation before applying the Gaussian mechanism.

Attaches to a module, and clips all grad_sample in the backward pass. It then puts them in each parameter’s .grad.

  • module (GradSampleModule) – Module to which backward hooks are added and for which per-sample gradients are clipped

  • norm_clipper (NormClipper) – A NormClipper object that encapsulates the clipping strategy (such as flat clipping for the entire model, or per-layer clipping)

  • batch_first (bool) – Flag to indicate if the input tensor to the corresponding module has the first dimension represent the batch, for example of shape [batch_size, …, …]. Set to True if batch appears in first dimension else set to False (batch_first=False implies that the batch is always in the second dimension).

  • loss_reduction (str) – Indicates whether the loss reduction (for aggregating the gradients) is a sum or a mean operation. Can take the values sum or mean.


Clips and sums up per-sample gradients into an accumulator. When this function is called N >= 1 times on mini-batches of size B (the final batch may be smaller), a call to pre_step() will populate the .grad field with the average gradient over the entire batch of size (N - 1) * B + b, with b <= B.

Return type



Prepares the .grad field of the parameters and provides statistics on the maximum gradient norm, which should be used to scale noise in the privacy engine (opacus.privacy_engine.PrivacyEngine). This function is called before the optimizer step().

Return type

Tuple[Tensor, int]


The maximum gradient norm per batch (repeated in batch dimension as a tensor) and the batch size
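The interplay between accumulation and pre_step() can be sketched with a toy class. ToyClipper is hypothetical and only mirrors the behaviour described above. It operates on flat Python lists rather than module parameters, and its pre_step returns the clipping threshold as a plain float instead of a tensor, which is a simplifying assumption about what the maximum gradient norm refers to.

```python
import math


class ToyClipper:
    """Hypothetical accumulator mirroring the documented call sequence."""

    def __init__(self, max_norm):
        self.max_norm = max_norm
        self.summed_grad = None  # running sum of clipped per-sample gradients
        self.n_samples = 0       # examples seen since the last pre_step
        self.grad = None         # filled in by pre_step

    def clip_and_accumulate(self, grad_sample):
        """Clip each row of one mini-batch and add it to the running sum."""
        for row in grad_sample:
            norm = math.sqrt(sum(v * v for v in row))
            factor = min(1.0, self.max_norm / (norm + 1e-6))
            if self.summed_grad is None:
                self.summed_grad = [0.0] * len(row)
            self.summed_grad = [
                s + factor * v for s, v in zip(self.summed_grad, row)
            ]
            self.n_samples += 1

    def pre_step(self):
        """Average the accumulated sum into .grad and reset the accumulator."""
        total = self.n_samples
        self.grad = [s / total for s in self.summed_grad]
        self.summed_grad, self.n_samples = None, 0
        return self.max_norm, total


clipper = ToyClipper(max_norm=1.0)
clipper.clip_and_accumulate([[0.5, 0.0], [0.0, 0.5]])  # N = 2 calls, B = 2
clipper.clip_and_accumulate([[0.0, 0.2]])              # final batch, b = 1
max_norm, batch_size = clipper.pre_step()
```

Here the reported batch size is (N - 1) * B + b = 3, and the averaged gradient divides the accumulated sum by that total rather than by the size of the last mini-batch.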


Sets the function to be called after clipping to the input callable parameter (for example, to collect clipping statistics).


on_batch_clip_func (Callable[…, None]) – Function to be called after clipping

Return type



Deletes the added attributes, grad_sample and summed_grad.

These two attributes are deleted automatically when pre_step or clip_and_accumulate are called properly. Deleting them explicitly is a safety measure to avoid further issues when the regular call sequence has not been followed.