# Gradient Routing for Continual Learning
This project tests whether sparsity and sparse gradient routing can help with continual learning. The original idea was to pretrain a model, introduce some sparsity, and route task gradients non-uniformly so that only a small subset of parameters adapts for each new task.
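To make the core idea concrete, here is a minimal sketch of per-task sparse gradient routing, assuming a PyTorch model and random per-task parameter masks. The function names and the random mask assignment are illustrative assumptions, not the project's actual routing scheme.

```python
# Minimal sketch of per-task sparse gradient routing (hypothetical names;
# the project's real routing scheme may assign masks differently).
# After backward(), gradients are masked so that each task only updates
# its assigned subset of parameters.
import torch

def make_task_masks(model, num_tasks, keep_frac=0.1, seed=0):
    """Assign each task a random sparse mask over every parameter tensor."""
    gen = torch.Generator().manual_seed(seed)
    masks = []
    for _ in range(num_tasks):
        task_mask = {
            name: (torch.rand(p.shape, generator=gen) < keep_frac).float()
            for name, p in model.named_parameters()
        }
        masks.append(task_mask)
    return masks

def route_gradients(model, task_mask):
    """Zero out gradients outside the current task's parameter subset."""
    for name, p in model.named_parameters():
        if p.grad is not None:
            p.grad.mul_(task_mask[name].to(p.grad.device))

# Usage inside a training step (sketch):
#   loss.backward()
#   route_gradients(model, masks[task_id])
#   optimizer.step()
```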
On a PermutedMNIST benchmark with eight sequential tasks, plain all-layer sparse routing turned out to be insufficient as a standalone mechanism. The reason was that mild sparsity produced only 2–3% near-zero weights, and routing gradients across all layers still reused the same parameter support across tasks. However, once the idea was refined and localized, it became a useful auxiliary mechanism.
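The failure mode above can be checked with two quick diagnostics: the fraction of near-zero weights, and the overlap between the update supports used by different tasks. The snippet below is an illustrative sketch of such checks (thresholds and helper names are assumptions, not the project's exact analysis code).

```python
# Rough diagnostics for the failure mode described above (illustrative only).
import torch

def near_zero_fraction(model, eps=1e-3):
    """Fraction of weights with magnitude below eps (~0.02-0.03 here)."""
    total, near_zero = 0, 0
    for p in model.parameters():
        total += p.numel()
        near_zero += (p.abs() < eps).sum().item()
    return near_zero / total

def support_overlap(mask_a, mask_b):
    """Jaccard overlap between two boolean update supports for two tasks."""
    a, b = mask_a.bool(), mask_b.bool()
    inter = (a & b).sum().item()
    union = (a | b).sum().item()
    return inter / max(union, 1)
```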
The version that works freezes the upper trunk after pretraining and adapts only a sparse first composition layer plus the classifier. Two variants were tested: power-law routing on the first layer, and a support-aware version that explicitly avoids previously-used weights and boosts dormant ones. Both improved over the baseline; a sketch of the routing logic follows the results table.
| Method | Final avg. accuracy | Avg. forgetting | Task-1 retention |
|---|---|---|---|
| SSL baseline | 0.614 ± 0.028 | 0.363 ± 0.031 | 0.281 ± 0.083 |
| Power first-layer composer | 0.650 ± 0.012 | 0.278 ± 0.014 | 0.529 ± 0.011 |
| Support-aware first-layer composer | 0.645 ± 0.010 | 0.284 ± 0.012 | 0.496 ± 0.012 |
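The sketch below illustrates the first-layer composer setup under some assumptions: the adapted modules are named `composer` and `classifier`, the power-law variant scores entries by weight magnitude raised to an exponent, and the support-aware variant down-weights previously-used entries. These names and scoring rules are placeholders, not the project's exact training code.

```python
# Sketch of the first-layer composer variants (hypothetical names/scoring).
# The trunk is frozen after pretraining; only the first composition layer
# and the classifier receive gradients, and the first layer's update is
# routed through a sparse per-task mask.
import torch

def freeze_trunk(model, adapt_names=("composer", "classifier")):
    """Freeze everything except the composition layer and the classifier."""
    for name, p in model.named_parameters():
        p.requires_grad = any(key in name for key in adapt_names)

def power_law_mask(weight, keep_frac=0.05, alpha=1.0, used=None):
    """Sample a sparse update mask over `weight`, favoring large-magnitude
    entries via a power-law score. If `used` is given (support-aware
    variant), previously-used entries are suppressed and dormant ones
    boosted."""
    scores = weight.detach().abs().flatten().pow(alpha) + 1e-8
    if used is not None:
        used_f = used.flatten().float()
        scores = scores * (0.1 * used_f + 2.0 * (1.0 - used_f))
    k = max(1, int(keep_frac * scores.numel()))
    idx = torch.multinomial(scores / scores.sum(), k, replacement=False)
    mask = torch.zeros_like(scores)
    mask[idx] = 1.0
    return mask.view_as(weight)
```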
I also added a SplitMNIST class-incremental benchmark from the published literature and explored a task-blind online slot mechanism, in which sparse computation is discovered on the fly via novelty detection. This reached 0.825 final accuracy with 0.215 forgetting, against a dense baseline's 0.199 accuracy and 0.998 forgetting.
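A very rough sketch of the task-blind slot idea is below: a new slot (a fresh sparse parameter subset) is opened whenever incoming features look unlike every existing slot prototype. The class name, distance-based novelty rule, and threshold are assumptions made for illustration, not the mechanism's exact algorithm.

```python
# Rough sketch of a task-blind online slot router (illustrative only).
import torch

class SlotRouter:
    def __init__(self, novelty_threshold=2.0, momentum=0.99):
        self.prototypes = []              # one running-mean feature per slot
        self.threshold = novelty_threshold
        self.momentum = momentum

    def route(self, feats):
        """Return a slot index for a batch of features (batch, feat_dim),
        allocating a new slot when the batch looks novel."""
        mean = feats.mean(dim=0).detach()
        if not self.prototypes:
            self.prototypes.append(mean.clone())
            return 0
        dists = torch.stack([(mean - p).norm() for p in self.prototypes])
        slot = int(dists.argmin())
        if dists[slot] > self.threshold:  # novelty detected: open a new slot
            self.prototypes.append(mean.clone())
            return len(self.prototypes) - 1
        # otherwise update the matched prototype and reuse its slot
        self.prototypes[slot].mul_(self.momentum).add_(mean, alpha=1 - self.momentum)
        return slot
```

Each slot would then own its own sparse mask over the adapted layer, so routing stays task-blind: the slot index stands in for the unknown task identity.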
The takeaway is that sparse gradient routing is most effective as an auxiliary mechanism: once sparse computation is localized to the right layer and discovered online, it can meaningfully reduce forgetting even without task identity.