# Gradient Routing for Continual Learning
This project tests whether sparsity and sparse gradient routing can help with continual learning. The original idea was to pretrain a model, introduce some sparsity, and route task gradients non-uniformly so that only a small subset of parameters adapts for each new task.
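To make the core idea concrete, here is a minimal sketch of per-task sparse gradient routing, assuming a PyTorch model and random per-task parameter masks. The function names and the random mask assignment are illustrative assumptions, not the project's actual routing scheme.

```python
# Minimal sketch of per-task sparse gradient routing (hypothetical names;
# the project's real routing scheme may assign masks differently).
# After backward(), gradients are masked so that each task only updates
# its assigned subset of parameters.
import torch

def make_task_masks(model, num_tasks, keep_frac=0.1, seed=0):
    """Assign each task a random sparse mask over every parameter tensor."""
    gen = torch.Generator().manual_seed(seed)
    masks = []
    for _ in range(num_tasks):
        task_mask = {
            name: (torch.rand(p.shape, generator=gen) < keep_frac).float()
            for name, p in model.named_parameters()
        }
        masks.append(task_mask)
    return masks

def route_gradients(model, task_mask):
    """Zero out gradients outside the current task's parameter subset."""
    for name, p in model.named_parameters():
        if p.grad is not None:
            p.grad.mul_(task_mask[name].to(p.grad.device))

# Usage inside a training step (sketch):
#   loss.backward()
#   route_gradients(model, masks[task_id])
#   optimizer.step()
```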
On a PermutedMNIST benchmark with eight sequential tasks, plain all-layer sparse routing turned out to be insufficient as a standalone mechanism. The reason was that mild sparsity produced only 2–3% near-zero weights, and routing gradients across all layers still reused the same parameter support across tasks. However, once the idea was refined and localized, it became a useful auxiliary mechanism.
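The failure mode above can be checked with two quick diagnostics: the fraction of near-zero weights, and the overlap between the update supports used by different tasks. The snippet below is an illustrative sketch of such checks (thresholds and helper names are assumptions, not the project's exact analysis code).

```python
# Rough diagnostics for the failure mode described above (illustrative only).
import torch

def near_zero_fraction(model, eps=1e-3):
    """Fraction of weights with magnitude below eps (~0.02-0.03 here)."""
    total, near_zero = 0, 0
    for p in model.parameters():
        total += p.numel()
        near_zero += (p.abs() < eps).sum().item()
    return near_zero / total

def support_overlap(mask_a, mask_b):
    """Jaccard overlap between two boolean update supports for two tasks."""
    a, b = mask_a.bool(), mask_b.bool()
    inter = (a & b).sum().item()
    union = (a | b).sum().item()
    return inter / max(union, 1)
```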
The version that works freezes the upper trunk after pretraining and adapts only a sparse first composition layer plus the classifier. Two variants were tested: power-law routing on the first layer, and a support-aware version that explicitly avoids previously-used weights and boosts dormant ones. Both improved over the baseline; a sketch of the routing logic follows the results table.
| Method | Final avg. accuracy | Avg. forgetting | Task-1 retention |
|---|---|---|---|
| SSL baseline | 0.614 ± 0.028 | 0.363 ± 0.031 | 0.281 ± 0.083 |
| Power first-layer composer | 0.650 ± 0.012 | 0.278 ± 0.014 | 0.529 ± 0.011 |
| Support-aware first-layer composer | 0.645 ± 0.010 | 0.284 ± 0.012 | 0.496 ± 0.012 |
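The sketch below illustrates the first-layer composer setup under some assumptions: the adapted modules are named `composer` and `classifier`, the power-law variant scores entries by weight magnitude raised to an exponent, and the support-aware variant down-weights previously-used entries. These names and scoring rules are placeholders, not the project's exact training code.

```python
# Sketch of the first-layer composer variants (hypothetical names/scoring).
# The trunk is frozen after pretraining; only the first composition layer
# and the classifier receive gradients, and the first layer's update is
# routed through a sparse per-task mask.
import torch

def freeze_trunk(model, adapt_names=("composer", "classifier")):
    """Freeze everything except the composition layer and the classifier."""
    for name, p in model.named_parameters():
        p.requires_grad = any(key in name for key in adapt_names)

def power_law_mask(weight, keep_frac=0.05, alpha=1.0, used=None):
    """Sample a sparse update mask over `weight`, favoring large-magnitude
    entries via a power-law score. If `used` is given (support-aware
    variant), previously-used entries are suppressed and dormant ones
    boosted."""
    scores = weight.detach().abs().flatten().pow(alpha) + 1e-8
    if used is not None:
        used_f = used.flatten().float()
        scores = scores * (0.1 * used_f + 2.0 * (1.0 - used_f))
    k = max(1, int(keep_frac * scores.numel()))
    idx = torch.multinomial(scores / scores.sum(), k, replacement=False)
    mask = torch.zeros_like(scores)
    mask[idx] = 1.0
    return mask.view_as(weight)
```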
I also added a SplitMNIST class-incremental benchmark from the published literature and explored a task-blind online slot mechanism, in which sparse computation is discovered on the fly via novelty detection. This reached 0.825 final accuracy with 0.215 forgetting, against a dense baseline's 0.199 accuracy and 0.998 forgetting.
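A very rough sketch of the task-blind slot idea is below: a new slot (a fresh sparse parameter subset) is opened whenever incoming features look unlike every existing slot prototype. The class name, distance-based novelty rule, and threshold are assumptions made for illustration, not the mechanism's exact algorithm.

```python
# Rough sketch of a task-blind online slot router (illustrative only).
import torch

class SlotRouter:
    def __init__(self, novelty_threshold=2.0, momentum=0.99):
        self.prototypes = []              # one running-mean feature per slot
        self.threshold = novelty_threshold
        self.momentum = momentum

    def route(self, feats):
        """Return a slot index for a batch of features (batch, feat_dim),
        allocating a new slot when the batch looks novel."""
        mean = feats.mean(dim=0).detach()
        if not self.prototypes:
            self.prototypes.append(mean.clone())
            return 0
        dists = torch.stack([(mean - p).norm() for p in self.prototypes])
        slot = int(dists.argmin())
        if dists[slot] > self.threshold:  # novelty detected: open a new slot
            self.prototypes.append(mean.clone())
            return len(self.prototypes) - 1
        # otherwise update the matched prototype and reuse its slot
        self.prototypes[slot].mul_(self.momentum).add_(mean, alpha=1 - self.momentum)
        return slot
```

Each slot would then own its own sparse mask over the adapted layer, so routing stays task-blind: the slot index stands in for the unknown task identity.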
The takeaway is that sparse gradient routing is most effective as an auxiliary mechanism: once sparse computation is localized to the right layer and discovered online, it can meaningfully reduce forgetting even without task identity.