# Gradient Sample Module

Extends nn.Module so that its parameter tensors have an extra field called .grad_sample.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

Adds hooks to model to save activations and backprop values. The hooks will:

1. save activations into param.activations during the forward pass
2. compute per-sample gradients into param.grad_sample during the backward pass

Call remove_hooks(model) to disable this.

Parameters
• model – the model to which hooks are added

• loss_type – either “mean” or “sum” depending on whether backpropped loss was averaged or summed over batch (default: “mean”)

• batch_dim – the batch dimension (default: 0)

Return type

None
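The forward/backward hook mechanism described above can be sketched in plain Python. This is a stand-in, not the real Opacus or PyTorch API: `ToyModule` and its hook lists are hypothetical, and the "layer" here is a toy elementwise `y = 2x` whose per-sample weight gradient is just `grad_out * x`.

```python
class ToyModule:
    """Minimal stand-in for a module with forward/backward hooks."""
    def __init__(self):
        self.forward_hooks = []
        self.backward_hooks = []
        self.activations = []    # filled by the forward hook
        self.grad_sample = None  # filled by the backward hook

    def forward(self, x):
        for hook in self.forward_hooks:
            hook(self, x)
        return [2 * v for v in x]  # toy computation: y = w * x with w = 2

    def backward(self, grad_out):
        for hook in self.backward_hooks:
            hook(self, grad_out)

def save_activations(module, x):
    module.activations.append(x)

def compute_grad_sample(module, grad_out):
    # per-sample gradient of y = w * x w.r.t. w is grad_out * x, one per sample
    x = module.activations.pop()
    module.grad_sample = [g * v for g, v in zip(grad_out, x)]

module = ToyModule()
module.forward_hooks.append(save_activations)
module.backward_hooks.append(compute_grad_sample)

y = module.forward([1.0, 2.0, 3.0])
module.backward([1.0, 1.0, 1.0])
```

After one forward/backward pass, `module.grad_sample` holds one gradient per sample rather than their sum, which is the whole point of the hooks.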

capture_backprops_hook(module, _forward_input, forward_output, loss_reduction, batch_first)[source]

Captures backprops in the backward pass and stores per-sample gradients.

Deletes .grad_sample from this module’s parameters.

Why del? Normally, zero_grad() would call p.grad.zero_() and keep the allocation. Normal grads can do this because their shape is always the same. Grad samples do not behave like this, because they accumulate over the batch dimension. If you have batch_size=32 and a parameter of shape (12, 16), and you backprop twice, you should expect grad_samples of size [64, 12, 16]. Backprop once more and you get [96, 12, 16], and so on. So when you zero out, you should be left with nothing, so you can start over.
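The shape growth described above can be checked with a shape-level sketch (plain tuples standing in for tensors; `backprop_once` is a hypothetical helper, not part of Opacus):

```python
# Simulate how grad_sample grows along the batch dim with repeated backprops.
param_shape = (12, 16)
batch_size = 32

def backprop_once(grad_sample_shape):
    """Concatenate a new (32, 12, 16) chunk along dim 0 (shapes only)."""
    if grad_sample_shape is None:
        return (batch_size,) + param_shape
    return (grad_sample_shape[0] + batch_size,) + grad_sample_shape[1:]

shape = None
for _ in range(3):
    shape = backprop_once(shape)
# after three backward passes the batch dim has grown to 96
```

This is why zeroing in place cannot work: the buffer from the previous round has the wrong leading dimension, so it must be deleted outright.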

disable_hooks()[source]

Globally disable all hooks installed by this library. Why is this needed? As per https://github.com/pytorch/pytorch/issues/25723, there is a bug in Autograd that makes removing hooks do nothing if the graph was already constructed. For this reason, we have this method to at least turn them off.

Return type

None
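The disable/enable mechanism can be sketched as a flag that every hook consults before doing any work. This is a sketch only; names like `capture_hook` are hypothetical, though Opacus keeps a similar boolean on the module rather than at module level.

```python
# Hooks stay attached to the graph, but a shared flag makes them no-ops.
hooks_enabled = True
captured = []

def capture_hook(value):
    if not hooks_enabled:
        return  # hook still fires, but does nothing
    captured.append(value)

def disable_hooks():
    global hooks_enabled
    hooks_enabled = False

def enable_hooks():
    global hooks_enabled
    hooks_enabled = True

capture_hook("a")
disable_hooks()
capture_hook("b")   # ignored while disabled
enable_hooks()
capture_hook("c")
```

This design sidesteps the Autograd bug: nothing needs to be detached from the graph, the hooks simply return early.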

enable_hooks()[source]

The opposite of disable_hooks(). Hooks are always enabled unless you explicitly disable them, so you only need to call this to re-enable hooks after a disable_hooks() call.

Return type

None

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
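The note above can be illustrated without PyTorch: in a minimal `HookedModule` stand-in (hypothetical class, mirroring how `nn.Module.__call__` wraps `forward`), calling the instance runs the hooks, while calling `forward` directly silently skips them.

```python
class HookedModule:
    """Stand-in showing why you call module(x) rather than module.forward(x)."""
    def __init__(self):
        self.hooks_ran = []

    def forward(self, x):
        return x + 1

    def __call__(self, x):
        # __call__ runs registered hooks around forward
        self.hooks_ran.append("pre")
        out = self.forward(x)
        self.hooks_ran.append("post")
        return out

m = HookedModule()
direct = m.forward(1)   # same result, but hooks are silently skipped
wrapped = m(1)          # hooks run
```

For GradSampleModule this matters doubly: calling `forward` directly would skip exactly the hooks that capture activations and per-sample gradients.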

classmethod is_supported(module)[source]

Checks if this module is supported.

Return type

bool

parametrized_modules()[source]

Recursively iterates over all submodules, returning those that have parameters (as opposed to “wrapper modules” that just organize modules).

Return type

Iterable[Module]
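The recursion described above can be sketched with a plain tree of stand-in modules (`Node` and `parametrized` are hypothetical names, not the Opacus implementation):

```python
class Node:
    """Stand-in for nn.Module: children plus an optional parameter list."""
    def __init__(self, name, params=(), children=()):
        self.name = name
        self.params = list(params)
        self.children = list(children)

def parametrized(node):
    """Yield nodes that hold parameters, skipping pure 'wrapper' nodes."""
    if node.params:
        yield node
    for child in node.children:
        yield from parametrized(child)

tree = Node("container", children=[
    Node("linear1", params=["w1", "b1"]),
    Node("wrapper", children=[Node("linear2", params=["w2"])]),
])
names = [n.name for n in parametrized(tree)]
```

The containers ("container", "wrapper") organize submodules but own no parameters, so only the leaf layers are yielded.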

rearrange_grad_samples(module, backprops, loss_reduction, batch_first)[source]

Rearranges activations and grad_samples based on the loss reduction and batch dimension.

Parameters
• module (Module) – the module for which per-sample gradients are computed

• backprops (Tensor) – the captured backprops

• loss_reduction (str) – either “mean” or “sum” depending on whether backpropped loss was averaged or summed over batch

• batch_first (bool) – True if the batch dimension is first

Return type

None

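The two adjustments can be sketched in plain Python. `rearrange` below is a hypothetical helper, not the Opacus code: a mean-reduced loss divides every per-sample gradient by the batch size, so the backprops must be multiplied back by it; and if the batch dimension is not first, it is moved to position 0.

```python
def rearrange(backprops, loss_reduction, batch_first):
    """Sketch: undo mean reduction, then put the batch dim first."""
    n = len(backprops) if batch_first else len(backprops[0])
    if loss_reduction == "mean":
        # a "mean" loss scaled each sample's grad by 1/n; multiply it back
        backprops = [[v * n for v in row] for row in backprops]
    if not batch_first:
        # transpose T x B -> B x T so the batch dim leads
        backprops = [list(col) for col in zip(*backprops)]
    return backprops

# one time step, batch of two samples, mean-reduced loss
out = rearrange([[1.0, 2.0]], loss_reduction="mean", batch_first=False)
```

With loss_reduction="sum" the scaling step is skipped, since each sample's contribution already arrives unscaled.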
remove_hooks()[source]

Removes hooks added by add_hooks()

Return type

None

to_standard_module()[source]

Returns the standard nn.Module wrapped by this, eliminating all traces of grad samples and hooks

Return type

Module

Returns

The wrapped module

trainable_modules()[source]

Recursively iterates over all submodules, returning those that have parameters and are trainable (i.e. they require grad).

Return type

Iterable[Module]
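The trainability filter reduces to a check over parameter flags, sketched here with a stand-in parameter class (`P` and `is_trainable` are hypothetical names):

```python
class P:
    """Stand-in for a parameter tensor with a requires_grad flag."""
    def __init__(self, requires_grad=True):
        self.requires_grad = requires_grad

def is_trainable(params):
    """A module counts as trainable if any of its parameters wants a grad."""
    return any(p.requires_grad for p in params)

frozen = [P(False), P(False)]   # e.g. a frozen backbone layer
active = [P(False), P(True)]    # partially frozen still counts as trainable
```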

zero_grad(set_to_none=False)[source]

Sets gradients of all model parameters to zero. See the similar function under torch.optim.Optimizer for more context.

Parameters

set_to_none (bool) – instead of setting to zero, set the grads to None. See torch.optim.Optimizer.zero_grad() for details.
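The difference between the two modes can be sketched with stand-in parameters (`Param` and this `zero_grad` are illustrative, not the PyTorch implementation): zeroing overwrites the buffer with zeros, while set_to_none drops the buffer entirely.

```python
class Param:
    """Stand-in for a parameter whose .grad is a plain list."""
    def __init__(self):
        self.grad = [0.5, -1.0]

def zero_grad(params, set_to_none=False):
    for p in params:
        if set_to_none:
            p.grad = None            # free the buffer
        else:
            p.grad = [0.0] * len(p.grad)  # keep the allocation, zero it

a, b = Param(), Param()
zero_grad([a], set_to_none=False)
zero_grad([b], set_to_none=True)
```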

Computes per-sample gradients for convolutional layers.

Return type

None

Computes per-sample gradients for the LSTMLinear layer. The DPLSTM class is written using this layer as its building block.

Return type

None

Computes per-sample gradients for the SequenceBias layer.

Return type

None

Computes per-sample gradients for the nn.Embedding layer.

Return type

None

Computes per-sample gradients for GroupNorm.

Return type

None

Computes per-sample gradients for InstanceNorm layers.

Return type

None

Computes per-sample gradients for LayerNorm.

Return type

None

Computes per-sample gradients for the nn.Linear layer.

Return type

None
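For a linear layer y = x @ W.T + b, the per-sample weight gradient is the outer product of each sample's backprops with its activations, and the per-sample bias gradient is the backprops themselves. The sketch below (pure-Python stand-in for the einsum that Opacus uses; `linear_grad_sample` is a hypothetical name) makes this concrete:

```python
def linear_grad_sample(activations, backprops):
    """Per-sample gradients for a linear layer y = x @ W.T + b.
    activations: n x in_features, backprops: n x out_features."""
    gs_weight = [
        [[bi * aj for aj in a] for bi in b]   # outer product per sample
        for a, b in zip(activations, backprops)
    ]
    gs_bias = [list(b) for b in backprops]    # bias grad is backprops itself
    return gs_weight, gs_bias

A = [[1.0, 2.0], [3.0, 4.0]]   # n=2 samples, in_features=2
B = [[10.0], [20.0]]           # out_features=1
gs_w, gs_b = linear_grad_sample(A, B)
```

Each `gs_w[i]` has the weight's shape (out_features x in_features), stacked along a leading batch dimension of size n.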

Creates a grad_sample attribute in the given parameter, or adds to it if the grad_sample attribute already exists.

Parameters
• param (Tensor) – Parameter to which grad_sample will be added

• grad_sample (Tensor) – Per-sample gradients tensor. Must be of the same shape as param with extra batch dimension

Return type

None

Creates a grad_sample attribute in the given parameter, or appends to it if the grad_sample attribute already exists.

Parameters
• param (Tensor) – Parameter to which grad_sample will be added

• grad_sample (Tensor) – Per-sample gradients tensor. Must be of the same shape as param with extra batch dimension

• batch_dim (int) – Position of the batch dimension in the shape of grad_sample

Return type

None
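The difference between the two helpers above can be sketched with plain lists (hypothetical `create_or_accumulate` / `create_or_extend` functions, dict-backed instead of attribute-backed): accumulating sums a new contribution into existing per-sample grads, while extending concatenates along the batch dimension.

```python
def create_or_accumulate(store, key, grad_sample):
    """Sum a new contribution into the existing per-sample grads."""
    if key in store:
        store[key] = [a + b for a, b in zip(store[key], grad_sample)]
    else:
        store[key] = list(grad_sample)

def create_or_extend(store, key, grad_sample):
    """Concatenate new per-sample grads along the batch dim."""
    if key in store:
        store[key] = store[key] + list(grad_sample)
    else:
        store[key] = list(grad_sample)

acc, ext = {}, {}
create_or_accumulate(acc, "w", [1.0, 2.0])
create_or_accumulate(acc, "w", [10.0, 20.0])  # elementwise sum, same length
create_or_extend(ext, "w", [1.0, 2.0])
create_or_extend(ext, "w", [10.0, 20.0])      # batch dim grows
```

Accumulation fits a parameter that receives several contributions within one backward pass (e.g. a weight reused at every time step); extension fits repeated backward passes over new batches.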

Registers the decorated function as the grad_sampler of target_class_or_classes, i.e. the function that will be invoked every time you want to compute a per-sample gradient of target_class_or_classes. The signature of every grad_sampler is always the same:
>>> @register_grad_sampler(nn.MyCustomClass)
... def compute_my_custom_class_grad_sample(layer, activations, backprops):
...     ...

It may help to take a look at the existing grad_samplers inside Opacus, under opacus.grad_sample.
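Such a decorator can be implemented as a plain registry mapping layer classes to sampler functions. The sketch below is illustrative, not the Opacus source; `GRAD_SAMPLERS`, `MyCustomLayer`, and the sampler body are all hypothetical.

```python
# Registry mapping layer classes to their grad sampler functions.
GRAD_SAMPLERS = {}

def register_grad_sampler(*target_classes):
    """Decorator factory: record fn as the sampler for each target class."""
    def decorator(fn):
        for cls in target_classes:
            GRAD_SAMPLERS[cls] = fn
        return fn
    return decorator

class MyCustomLayer:
    pass

@register_grad_sampler(MyCustomLayer)
def compute_my_custom_grad_sample(layer, activations, backprops):
    # hypothetical body; a real sampler returns per-parameter grad samples
    return {"weight": list(zip(backprops, activations))}

# lookup happens by layer class at backward time
sampler = GRAD_SAMPLERS[MyCustomLayer]
out = sampler(MyCustomLayer(), [1, 2], [3, 4])
```

Registering by class keeps dispatch trivial: when a backward hook fires on a layer, the module's type indexes straight into the registry.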