Gradient Clipping

The process of adding differential privacy to a model involves bounding its sensitivity prior to applying the Gaussian mechanism. This is achieved by clipping the per-sample gradients. Normally, for a parameterized layer, if you have a tensor of parameters of size [m, n], the size of the gradients will match it. This means that they get aggregated over the batch. Here, we will keep them per-sample, i.e., we will have a tensor of size [b_sz, m, n], where the slice [i, :, :] corresponds to the per-example gradients for the i-th example in the batch.
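To make the shapes concrete, here is a minimal, self-contained sketch in plain PyTorch (not Opacus internals; Opacus fills grad_sample efficiently with backward hooks rather than a per-example loop):

    import torch

    # Build per-sample gradients for one linear layer by backpropagating one
    # example at a time. Opacus avoids this loop, but the result is the same.
    torch.manual_seed(0)
    b_sz, m, n = 4, 3, 5
    layer = torch.nn.Linear(n, m)          # weight has shape [m, n]
    x, y = torch.randn(b_sz, n), torch.randn(b_sz, m)

    grad_sample = torch.zeros(b_sz, m, n)  # per-sample gradients of layer.weight
    for i in range(b_sz):
        layer.zero_grad()
        loss_i = torch.nn.functional.mse_loss(layer(x[i:i+1]), y[i:i+1])
        loss_i.backward()
        grad_sample[i] = layer.weight.grad  # slice [i, :, :] belongs to example i

    print(grad_sample.shape)               # torch.Size([4, 3, 5]) == [b_sz, m, n]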

Per-sample gradient clipping has to be achieved under the following constraints:

1. The norm of the grad_sample of the loss with respect to all model parameters has to be clipped as if the per-sample gradients were all concatenated into a single vector. If C is the clipping threshold, this ensures the total norm will be at most C. For example:


>>> T = torch.cat([p.grad_sample.flatten(start_dim=1) for p in model.parameters()], dim=1)

T will have shape [B, N_TOTAL_PARAMS]. The total L2 norm of each row of T cannot be greater than C.

2. This clipping should not backpropagate. That is, clipping in layer i+1 should not affect computing the gradient of layer i. To make sure this holds, we first compute the grad_sample of all layers without clipping. In a second pass, we go back to the per-sample gradients, clip them, and accumulate them in .grad (thus replacing the “real” gradients); see the sketch after the note below.


Note: there is only a single .backward() call, as the second pass simply works on top of the stored grad_sample.
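To make the second pass concrete, here is a minimal sketch of flat clipping applied on top of already-stored grad_sample tensors. The function clip_and_accumulate_flat is an illustrative name, not the Opacus API; the library's own logic lives in the clipper class documented below.

    import torch

    def clip_and_accumulate_flat(params, C, batch_size):
        # Illustrative sketch, not the Opacus implementation.
        # Per-sample norm across ALL parameters, as if each example's
        # gradients were concatenated into a single vector.
        per_param_norms = [
            p.grad_sample.flatten(start_dim=1).norm(2, dim=1) for p in params
        ]
        per_sample_norms = torch.stack(per_param_norms, dim=1).norm(2, dim=1)  # [B]

        # Shrink every example whose total norm exceeds the threshold C.
        clip_factor = (C / (per_sample_norms + 1e-6)).clamp(max=1.0)           # [B]

        # Scale each example's gradient, average over the batch, and write the
        # result into .grad (matching loss_reduction='mean'). No extra backward
        # pass happens here; we only reuse the stored grad_sample.
        for p in params:
            p.grad = torch.einsum("i,i...", clip_factor, p.grad_sample) / batch_size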

class opacus.per_sample_gradient_clip.PerSampleGradientClipper(module, norm_clipper, batch_first=True, loss_reduction='mean')

Class to define a per-sample gradient clipper for a module. Per-sample gradient clipping bounds the sensitivity of the computation before applying the Gaussian mechanism.

Attaches to a module, and clips all grad_sample in the backward pass. It then puts them in each parameter’s .grad.

Parameters

  • module (Module) – Module to which backward hooks are added and for which per-sample gradients are clipped

  • norm_clipper (NormClipper) – A norm clipper object of class NormClipper which encapsulates different clipping strategies (such as flat clipping for the entire model, or per-layer clipping)

  • batch_first (bool) – Flag to indicate whether the first dimension of the input tensor to the corresponding module represents the batch, for example a shape of [batch_size, …, …]. Set to True if the batch appears in the first dimension; batch_first=False implies that the batch is in the second dimension.

  • loss_reduction (str) – Indicates whether the loss reduction (for aggregating the gradients) is a sum or a mean operation. Can take the values sum or mean.
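A sketch of typical usage follows. It assumes the ConstantFlatClipper from opacus.utils.clipping that shipped alongside this class in pre-1.0 Opacus releases; exact entry points may vary by version.

    import torch
    from opacus.per_sample_gradient_clip import PerSampleGradientClipper
    from opacus.utils.clipping import ConstantFlatClipper  # assumed available

    model = torch.nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Constructing the clipper attaches the backward hooks that record
    # per-sample gradients into p.grad_sample.
    clipper = PerSampleGradientClipper(
        module=model,
        norm_clipper=ConstantFlatClipper(flat_value=1.0),  # flat clipping, C = 1.0
        batch_first=True,
        loss_reduction="mean",
    )

    x, y = torch.randn(16, 10), torch.randn(16, 2)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                            # populates p.grad_sample, unclipped

    clipper.clip_and_accumulate()              # second pass: clip per-sample grads
    max_norm, batch_size = clipper.pre_step()  # write averaged result into p.grad
    optimizer.step()
    optimizer.zero_grad()
    clipper.close()                            # remove the hooks when finished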


clip_and_accumulate()

Clips and sums up per-sample gradients into an accumulator. When this function is called N >= 1 times on mini-batches of size B (the final batch may be smaller), a call to pre_step() will populate the .grad field with the average gradient over the entire batch of size (N-1) * B + b, with b <= B.

Return type

None

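Under these semantics, three calls on micro-batches of sizes 32, 32, and 16 accumulate a virtual batch of (3 - 1) * 32 + 16 = 80 examples before a single optimizer step. A hedged sketch, reusing model, optimizer, and clipper from the snippet above (micro_batches is an illustrative iterable of (x, y) pairs):

    for x, y in micro_batches:         # e.g. sizes 32, 32, 16
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()                # fills p.grad_sample for this micro-batch
        clipper.clip_and_accumulate()  # clip, then add into the accumulator

    # A single pre_step() averages the accumulator over all 80 examples into .grad.
    max_norm, batch_size = clipper.pre_step()
    optimizer.step()
    optimizer.zero_grad()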

close()

Removes backward hooks from the module.

Return type

None


pre_step()

Prepares the .grad field of the parameters and provides statistics on the maximum gradient norm, which should be used to scale noise in the privacy engine (opacus.privacy_engine.PrivacyEngine). This function is called before the optimizer step().

Returns

The maximum gradient norm per batch (repeated along the batch dimension as a tensor) and the batch size.

Return type

Tuple[Tensor, int]
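As a rough, hedged illustration of how these statistics could be consumed (this mirrors the idea, not the actual PrivacyEngine code): noise is drawn with a standard deviation proportional to the clipping threshold and, for a mean loss reduction, scaled down by the batch size.

    # Hypothetical consumer of pre_step()'s outputs; not the PrivacyEngine source.
    max_norm, batch_size = clipper.pre_step()
    sigma = 1.0  # noise multiplier, chosen by the privacy accountant
    for p in model.parameters():
        noise = torch.normal(0.0, sigma * float(max_norm.max()), size=p.grad.shape)
        p.grad += noise / batch_size  # matches loss_reduction='mean' scaling
    optimizer.step()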


set_on_batch_clip_func(on_batch_clip_func)

Sets the function to be called after clipping to the input callable parameter (for example, clipping stats collection).


Parameters

  • on_batch_clip_func (Callable[…, None]) – Function to be called after clipping

Return type

None

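For example, a stats-collection callback might look like the sketch below. The exact keyword arguments Opacus passes to the callback depend on the version, so this one accepts anything; collect_clipping_stats is an illustrative name.

    clip_events = []

    def collect_clipping_stats(*args, **kwargs):
        # Record whatever the clipper reports for each clipped batch, e.g. to
        # diagnose how often gradients actually hit the clipping threshold.
        clip_events.append(kwargs)

    clipper.set_on_batch_clip_func(collect_clipping_stats)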

zero_grad()

Deletes the added attributes, grad_sample and summed_grad.

The two mentioned attributes are automatically deleted when pre_step or clip_and_accumulate are properly called. This is a safety measure to avoid further issues if the regular usage flow has not been followed.