Critical flaws in ‘Avoid Catastrophic Forgetting with Rank-1 Fisher from Diffusion Models’

Avoid Catastrophic Forgetting with Rank-1 Fisher from Diffusion Models starts from an attractive problem: Elastic Weight Consolidation (EWC) needs a tractable local metric for old-task sensitivity, but the full Fisher is too large to use directly. The proposed solution is to exploit a near rank-1 empirical Fisher that appears in diffusion models at low SNR. If that direction were an old-task sensitivity direction, rank-1 EWC would replace diagonal EWC with a low-dimensional constraint that still retains cross-parameter structure.

This note analyzes the gap between that proposal and the argument in the paper. The paper proves a rank-1 structure in a low-SNR limit, then uses a different object in the implemented continual-learning method. The mismatch appears in four places: the low-SNR map is task-independent, the practical method averages over timesteps outside the stated framework, the linearity assumption is used as exact linearity, and the true Fisher in the same Gaussian regime is isotropic rather than rank-1.

The central question is simple. If an EWC matrix is supposed to protect an old task, which old-task quantity does the rank-1 direction preserve?

1. What EWC needs from an importance matrix

EWC regularizes the new-task objective by adding a quadratic penalty around the old-task solution:

\[ L_B(\theta ) + \frac {\lambda }{2} (\theta -\theta _A^\star )^\top F_A (\theta -\theta _A^\star ). \]

In the original Bayesian derivation, \(F_A\) approximates the local curvature of the old-task posterior around \(\theta _A^\star \) (Kirkpatrick et al.). In practical empirical-Fisher EWC, the same form is used with an empirical second moment of old-task gradients. Either way, the matrix is not merely a compression device. It is the local quadratic object through which old-task preservation is imposed.

For a rank-1 EWC penalty,

\[ F_A \approx \lambda _1 vv^\top , \]

all protected parameter movement lies along one direction \(v\). The method can only be justified as continual learning if \(v\) is a direction whose movement changes the old task in a task-dependent way. If the direction comes from a task-independent part of the diffusion objective, then the penalty is a generic stabilization term, not an old-task preservation term.
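To make the geometry concrete, here is a minimal sketch of the rank-1 penalty, assuming a toy three-parameter model. The function name `rank1_ewc_penalty` and all numeric values are illustrative and not taken from the paper. With \(F_A \approx \lambda _1 vv^\top \), the quadratic form collapses to \(\frac {\lambda }{2}\lambda _1 (v^\top (\theta -\theta _A^\star ))^2\), so any movement orthogonal to \(v\) is free.

```python
def rank1_ewc_penalty(theta, theta_star, v, lam1, lam):
    """Return (lam / 2) * lam1 * (v . (theta - theta_star))**2.

    This is the EWC quadratic form with F_A replaced by lam1 * v v^T.
    All arguments are plain Python lists or floats (toy setting).
    """
    delta = [t - s for t, s in zip(theta, theta_star)]
    projection = sum(vi * di for vi, di in zip(v, delta))
    return 0.5 * lam * lam1 * projection * projection


# Hypothetical old-task optimum and protected direction.
theta_star = [0.0, 0.0, 0.0]
v = [1.0, 0.0, 0.0]

# Moving along v is penalized.
moved_along_v = rank1_ewc_penalty([0.5, 0.0, 0.0], theta_star, v, lam1=4.0, lam=1.0)

# Moving orthogonally to v costs nothing, however large the step.
moved_orthogonal = rank1_ewc_penalty([0.0, 3.0, -2.0], theta_star, v, lam1=4.0, lam=1.0)

print(moved_along_v, moved_orthogonal)
```

The penalty sees only the scalar projection onto \(v\). Whether that protects the old task therefore depends entirely on what \(v\) encodes.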

This distinction is the point at which the rank-1 diffusion argument breaks.

2. Low SNR removes the task

Consider the standard variance-preserving diffusion forward process,

\[ x_t = \sqrt {\bar \alpha _t}x_0 + \sqrt {1-\bar \alpha _t}\epsilon , \qquad \epsilon \sim \mathcal N(0,I). \]

At low SNR, \(\bar \alpha _t\) is close to zero, so

\[ x_t \approx \epsilon . \]

In this limit, the noisy marginal approaches an isotropic Gaussian independently of the data distribution. The optimal score is dominated by the Gaussian score

\[ s_t^\star (x_t) \approx - \frac {x_t}{1-\bar \alpha _t}. \]

Equivalently, the corresponding noise-prediction target is dominated by the schedule-dependent scaling

\[ \epsilon ^\star (x_t,t) \approx \frac {x_t}{\sqrt {1-\bar \alpha _t}}. \]

The leading-order map is therefore the same for all old data distributions with the same forward noising process. The task enters only through lower-order corrections as signal returns. Hence the low-SNR theorem identifies a distribution-independent denoising scale, not an old-task preservation map.
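The task-independence can be checked in closed form in a toy case. The sketch below assumes one-dimensional Gaussian data \(x_0 \sim \mathcal N(\mu , s^2)\), which is an illustrative assumption and not the paper's setting. In that case the optimal noise prediction is \(\mathbb E[\epsilon \mid x_t] = \sqrt {1-\bar \alpha _t}\,(x_t - \sqrt {\bar \alpha _t}\mu )/(\bar \alpha _t s^2 + 1-\bar \alpha _t)\). The two "tasks" and the evaluation point are hypothetical.

```python
import math


def optimal_eps(x_t, alpha_bar, mu, s):
    """E[eps | x_t] when x_0 ~ N(mu, s^2) and x_t = a*x_0 + b*eps (1-D toy case)."""
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return b * (x_t - a * mu) / (a * a * s * s + b * b)


def universal_eps(x_t, alpha_bar):
    """Leading-order low-SNR target x_t / sqrt(1 - alpha_bar); no data parameters."""
    return x_t / math.sqrt(1.0 - alpha_bar)


x_t = 1.0
task_a = (3.0, 0.5)   # hypothetical old task: mean 3, std 0.5
task_b = (-2.0, 2.0)  # hypothetical other task: mean -2, std 2

gaps = []
for alpha_bar in (0.5, 1e-2, 1e-4, 1e-6):
    eps_a = optimal_eps(x_t, alpha_bar, *task_a)
    eps_b = optimal_eps(x_t, alpha_bar, *task_b)
    gaps.append(abs(eps_a - eps_b))
    print(alpha_bar, eps_a, eps_b, universal_eps(x_t, alpha_bar))
```

In this toy case the gap between the two tasks' optimal targets shrinks on the order of \(\sqrt {\bar \alpha _t}\), and both converge to the same data-free map.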

This is not a weak task signal; at leading order there is no task-dependent signal at all. Within the low-SNR framework used to motivate the continual-learning penalty, the rank-1 direction describes a universal Gaussian denoising scale, identical for every data distribution sharing the same forward process. EWC is then asked to remember something that, in this framework, was never learned as an old-task fact. The inconsistency is structural: the object that supplies the rank-1 direction contains no old-task content, while the continual-learning interpretation requires exactly such content.

3. The all-timestep implementation leaves the framework

The implementation does not use only the low-SNR object. In Section 3.3, the paper states that the practical expectation is taken over the joint diffusion-training sampling process: sample data, sample a timestep, construct \(x_t\), and compute the gradient. It then says that “averaging gradients over timesteps” gives the practical surrogate.

This is the step where the implemented method recovers access to task information. Mid-SNR and high-SNR denoising still depend on data geometry: class-conditional structure, local image statistics, and distribution-specific non-Gaussianity. If the rank-1 EWC term helps in continual learning, that help comes from these task-dependent terms, from generative distillation, or from their mixture with low-SNR gradients.

The paper makes the all-timestep averaging explicit. The problem is not that the implementation is hidden. The problem is that the interpretation is misaligned with its own framework. Low SNR gives the rank-1 argument and removes the task. Higher SNR restores the task and removes the clean low-SNR rank-1 derivation. The practical method averages them and then uses the low-SNR theorem to explain the result.

This deviation is not a minor implementation detail. It is the reason the method can work at all. A pure low-SNR rank-1 penalty protects the task-independent Gaussian denoising scale. The all-timestep surrogate reintroduces task-dependent denoising outside the theorem.

Section 3.3 calls this a practical surrogate. Nearby, the paper frames the same rank-1 Fisher construction as enabling a more effective application of EWC to continual learning and as effectively constraining replay-induced drift. That framing is not a computational justification: evaluating a restricted set of timesteps is not more expensive than averaging over all of them. Hence the justification is empirical, namely that the all-timestep mixture works better than the object analyzed by the theory. But then the implementation is no longer the method justified by the low-SNR framework. For that reason, a reader can plausibly interpret this bridge as intentionally misleading: the working method is rhetorically attached to the low-SNR theorem, but the bridge succeeds by leaving the theorem’s regime.

4. Approximately linear is not linear

The derivation also relies on a linearity story for the score or noise-prediction network. A linear surrogate proves a statement about the surrogate. For the real neural network one needs a perturbation argument.

Suppose the relevant per-sample gradient decomposes as

\[ g(x) = g_{\mathrm {lin}}(x) + e(x), \]

where \(g_{\mathrm {lin}}\) is the gradient induced by the linear surrogate and \(e\) is the nonlinear remainder. Then the empirical Fisher-like matrix is

\[ \mathbb E[gg^\top ] = \mathbb E[g_{\mathrm {lin}}g_{\mathrm {lin}}^\top ] + \mathbb E[g_{\mathrm {lin}}e^\top ] + \mathbb E[eg_{\mathrm {lin}}^\top ] + \mathbb E[ee^\top ]. \]

The rank-1 term remains the leading eigenspace only if the perturbation

\[ R = \mathbb E[g_{\mathrm {lin}}e^\top ] + \mathbb E[eg_{\mathrm {lin}}^\top ] + \mathbb E[ee^\top ] \]

is small relative to the eigengap of \(\mathbb E[g_{\mathrm {lin}}g_{\mathrm {lin}}^\top ]\). A sufficient form would be a Davis-Kahan-style condition such as

\[ \|R\|_{\mathrm {op}} \ll \lambda _1(G_{\mathrm {lin}})-\lambda _2(G_{\mathrm {lin}}), \qquad G_{\mathrm {lin}} = \mathbb E[g_{\mathrm {lin}}g_{\mathrm {lin}}^\top ]. \]

This would imply stability of the leading eigenspace under the nonlinear remainder. Alternatively, the paper could directly measure the nonlinear residual and eigengap, or prove that the pullback from output space to parameter space preserves the low-dimensional gradient structure.

Without such a step, “approximately linear” has no definite consequence for the empirical Fisher. A U-Net gradient contains time embeddings, activations, normalization layers, conditioning pathways, and local convolutional routes. Even if the denoising function is close to a scaling map in output space, the parameter gradients need not stay collinear after the pullback through the network.

This part has a concrete repair: state the nonlinear model as a perturbation of the linear surrogate, bound or measure the residual in operator norm, and compare it to the empirical eigengap. That would turn the linear calculation into a stability theorem rather than an analogy.
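The measurement side of that repair can be sketched on synthetic gradients. Everything below is an assumption made for illustration: the gradients are random draws, the linear part is exactly collinear with a fixed direction `v`, and the remainder `e` is small isotropic noise. The bound \(\sin \theta \le 2\|R\|_{\mathrm {op}}/(\lambda _1-\lambda _2)\) is one common Davis-Kahan variant for a single leading eigenvector; other variants differ in the constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 8  # number of synthetic samples, number of parameters

# Synthetic "linear surrogate" gradients: every sample is a multiple of v.
v = np.zeros(p)
v[0] = 1.0
coeff = rng.normal(size=(n, 1))
g_lin = coeff * v

# Synthetic nonlinear remainder (small isotropic noise, purely illustrative).
e = 0.05 * rng.normal(size=(n, p))
g = g_lin + e

G_lin = g_lin.T @ g_lin / n  # E[g_lin g_lin^T]
G = g.T @ g / n              # E[g g^T]
R = G - G_lin                # cross terms plus E[e e^T]

eig_lin = np.linalg.eigvalsh(G_lin)  # eigenvalues in ascending order
gap = eig_lin[-1] - eig_lin[-2]      # lambda_1 - lambda_2 of G_lin
R_op = np.linalg.norm(R, ord=2)      # operator (spectral) norm

# Angle between the leading eigenvector of G and the surrogate direction v.
top = np.linalg.eigh(G)[1][:, -1]
cos_theta = abs(top @ v)
sin_theta = np.sqrt(max(0.0, 1.0 - cos_theta**2))
bound = 2.0 * R_op / gap

print(R_op, gap, sin_theta, bound)
```

On a real model, `g_lin` and `e` would have to come from an explicit linearization of the network, which is the step the note argues is missing.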

5. The true Fisher is isotropic in the same regime

The Fisher terminology creates a second collision. Consider the simplest conditional Gaussian linear model,

\[ x \sim \mathcal N(0,\Sigma _x), \qquad y \mid x,A \sim \mathcal N(Ax,\sigma ^2 I_m), \]

with \(A\in \mathbb R^{m\times d}\). For this model, the log-likelihood score with respect to \(A\) is

\[ \nabla _A \log p_A(y\mid x) = \frac {1}{\sigma ^2}(y-Ax)x^\top . \]

Let \(r=y-Ax\). Under the model distribution, \(r\) has mean zero and covariance \(\sigma ^2 I_m\). Therefore the true Fisher is

\[ F_{\mathrm {true}} = \mathbb E_{x}\mathbb E_{y\sim p_A(\cdot \mid x)} \left [ \operatorname {vec}\!\left ( \frac {1}{\sigma ^2}rx^\top \right ) \operatorname {vec}\!\left ( \frac {1}{\sigma ^2}rx^\top \right )^\top \right ] = \frac {1}{\sigma ^2}\Sigma _x \otimes I_m. \]

At low SNR, the input is isotropic Gaussian noise, so \(\Sigma _x=\tau ^2 I_d\). Thus

\[ F_{\mathrm {true}} = \frac {\tau ^2}{\sigma ^2}I_{dm}. \]

There is no distinguished leading eigenvector: every direction is an eigenvector with the same eigenvalue. Every perturbation \(\Delta A\) of a given Frobenius norm changes the predictive distribution by the same KL amount:

\[ \mathbb E_x D_{\mathrm {KL}} \left ( \mathcal N(Ax,\sigma ^2 I) \, \middle \| \, \mathcal N((A+\Delta A)x,\sigma ^2 I) \right ) = \frac {\tau ^2}{2\sigma ^2} \|\Delta A\|_F^2. \]
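The identity follows in two steps from the model above. For fixed \(x\), the two Gaussians share the covariance \(\sigma ^2 I\), so their KL divergence is the squared mean shift scaled by \(1/(2\sigma ^2)\). Averaging over \(x\) with \(\Sigma _x=\tau ^2 I_d\) then gives the Frobenius norm:

\[ D_{\mathrm {KL}} \left ( \mathcal N(Ax,\sigma ^2 I) \, \middle \| \, \mathcal N((A+\Delta A)x,\sigma ^2 I) \right ) = \frac {\|\Delta A\,x\|_2^2}{2\sigma ^2}, \qquad \mathbb E_x \|\Delta A\,x\|_2^2 = \operatorname {tr}\!\left ( \Delta A\,\Sigma _x\,\Delta A^\top \right ) = \tau ^2 \|\Delta A\|_F^2. \]

In this linear-Gaussian model the expression is exact for any \(\Delta A\), not only to second order.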

The true Fisher is therefore isotropic in the analyzed low-SNR Gaussian regime. It has no preferred rank-1 direction.
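The isotropy can also be checked numerically. The following is a minimal Monte Carlo sketch under the stated model; the dimensions and the values of `tau` and `sigma` are arbitrary illustrative choices. The score is flattened row-major, whereas the formula above uses \(\operatorname {vec}\). That changes only the ordering of coordinates, which is irrelevant when the target is a multiple of the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 2             # input and output dimensions (illustrative)
tau, sigma = 1.5, 0.7   # input scale and observation noise scale (illustrative)
n = 200_000             # Monte Carlo samples

# Low-SNR regime: isotropic inputs, Sigma_x = tau^2 I_d.
x = tau * rng.normal(size=(n, d))
# Residual r = y - A x with y sampled from the model itself, so r ~ N(0, sigma^2 I_m).
r = sigma * rng.normal(size=(n, m))

# Per-sample score (1 / sigma^2) r x^T, flattened to length m * d.
scores = (r[:, :, None] * x[:, None, :]).reshape(n, m * d) / sigma**2

F_mc = scores.T @ scores / n                      # Monte Carlo true Fisher
F_exact = (tau**2 / sigma**2) * np.eye(m * d)     # closed form from the text

eigs = np.linalg.eigvalsh(F_mc)
spread = (eigs[-1] - eigs[0]) / eigs.mean()       # relative eigenvalue spread

print(np.round(eigs, 3), spread)
```

Because `r` is sampled from the model rather than taken from a training loss, this estimates the true Fisher. All eigenvalues cluster around \(\tau ^2/\sigma ^2\), with no dominant direction.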

The rank-1 object in the paper is an empirical second moment of surrogate-loss gradients, not the true Fisher/KL metric. This is precisely the distinction emphasized by Kunstner, Balles, and Hennig: the empirical Fisher is not generally an empirical estimate of the Fisher, and it need not capture curvature. Diffusion MSE and denoising-score objectives are outside the special negative-log-likelihood classification setting where the vocabulary can hide part of the distinction.

Thus the low-SNR calculation does not derive a rank-1 Fisher-information geometry. It separates two objects: the empirical gradient second moment becomes rank-1 in their argument precisely where the true Fisher is isotropic.

6. What an aligned argument would need

Only one of these problems has a direct mathematical repair. The others require changing the claim being made or adding evidence for a different mechanism.

First, separate the low-SNR theorem from the continual-learning mechanism. The theorem describes empirical-gradient alignment in a task-independent limit. The method uses all timesteps and distillation. These should not be presented as the same justification.

Second, test timestep-restricted EWC matrices. If low SNR is the theoretical source of rank-1 preservation, then a low-SNR-only matrix should be compared against mid-SNR, high-SNR, and all-timestep matrices under the same replay/distillation setup. This would locate where old-task preservation actually enters.

Third, prove or measure nonlinear eigenspace stability. The linear surrogate can support the neural-network claim only after the residual terms are controlled relative to the eigengap.

Fourth, stop using the true-Fisher interpretation for the rank-1 object. The result concerns an empirical second moment of surrogate-loss gradients. That object may still be useful, but it should be analyzed differently. For this reason, one might use Wiest’s IEWC framework, which derives Elastic Weight Consolidation and the empirical Fisher in a more direct optimization-preservation framework.
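The timestep-restricted comparison in the second point needs a partition of timesteps by SNR. A minimal sketch of that bookkeeping follows. It assumes a linear beta schedule with commonly used DDPM-style defaults, and the SNR thresholds 0.1 and 10 are arbitrary; neither choice is taken from the paper. The helper names `alpha_bar_linear` and `snr_buckets` are hypothetical.

```python
def alpha_bar_linear(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    values, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
        values.append(prod)
    return values


def snr_buckets(alpha_bar, low_thresh=0.1, high_thresh=10.0):
    """Split timestep indices by SNR(t) = alpha_bar_t / (1 - alpha_bar_t)."""
    buckets = {"low": [], "mid": [], "high": []}
    for t, ab in enumerate(alpha_bar):
        snr = ab / (1.0 - ab)
        if snr < low_thresh:
            buckets["low"].append(t)
        elif snr > high_thresh:
            buckets["high"].append(t)
        else:
            buckets["mid"].append(t)
    return buckets


buckets = snr_buckets(alpha_bar_linear())
for name, steps in buckets.items():
    print(name, len(steps), steps[0], steps[-1])
```

Each bucket would then supply the timesteps for its own gradient second-moment matrix, to be compared against the all-timestep matrix under the same replay and distillation setup.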

7. Review context

The OpenReview page lists this work as an ICLR 2026 poster. ICLR is one of the main venues in machine learning.

That matters because the errors above are not hidden in implementation details. Rather, they concern the central chain from theorem to algorithm to empirical claim. If we inspect that chain, the theorem is about a task-independent limit, whereas the algorithm depends on leaving that limit. Moreover, the linear argument lacks the perturbation step needed for the nonlinear model, and the true Fisher calculation points precisely in the opposite direction.

This, however, is not just a failure of the authors. It is also one instance of a broader problem in modern machine learning research. A plausible empirical story, a mathematical calculation, and a working implementation can sound correct even when they contain multiple critical inconsistencies. For that reason, review is the mechanism that should look at the details, spot issues, and enforce alignment with proper scientific practice. As the cost of publishing falls, and with it the apparent average quality of published work, careful review becomes more important. Here, that mechanism clearly did not do its job.