This study explores the capabilities of deep state-space models (Deep SSMs) for in-context learning on autoregressive tasks, demonstrating capabilities comparable to those of transformers. By incorporating local self-attention into a single structured state-space model layer, the model can reproduce the predictions of an implicit linear model trained with one step of gradient descent. The diagonal linear recurrent layer acts as a gradient accumulator, improving performance on in-context linear regression tasks. The study highlights the importance of local self-attention and multiplicative interactions in recurrent architectures, paving the way for effective training on more general tasks, and offers insight into the mechanisms underlying the expressiveness of deep state-space models.
https://arxiv.org/abs/2410.11687
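
To make the gradient-accumulator view concrete, here is a minimal numpy sketch (not the authors' implementation): it assumes scalar targets, an initial weight of zero, and a simple elementwise product standing in for the paper's local self-attention. Under these assumptions, a diagonal linear recurrence with unit eigenvalues sums the per-token terms y_i * x_i, and a linear readout against the query matches the prediction of one explicit gradient-descent step on the least-squares loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# In-context examples (x_i, y_i) from a random linear task, plus a query x_q.
d, n = 8, 32
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true
x_q = rng.normal(size=d)
eta = 0.1  # step size of the implicit gradient step

# (1) Explicit one-step gradient descent on the least-squares loss from w_0 = 0:
#     w_1 = eta * sum_i y_i x_i, prediction = w_1 . x_q
w_1 = eta * (y[:, None] * X).sum(axis=0)
pred_gd = w_1 @ x_q

# (2) Diagonal linear recurrence as a gradient accumulator:
#     a multiplicative interaction forms u_i = y_i * x_i per token (a stand-in
#     for the local self-attention in the paper), and the diagonal recurrence
#     h_i = lam * h_{i-1} + u_i with lam = 1 simply accumulates these terms.
lam = np.ones(d)           # unit eigenvalues -> pure accumulation
h = np.zeros(d)
for x_i, y_i in zip(X, y):
    u_i = y_i * x_i        # token-local multiplicative interaction
    h = lam * h + u_i      # diagonal linear recurrent update
pred_ssm = eta * h @ x_q   # linear readout against the query

print(pred_gd, pred_ssm)   # identical up to floating point
assert np.allclose(pred_gd, pred_ssm)
```

The toy example only shows why unit-eigenvalue diagonal recurrences can implement the summation in a single gradient step; the paper's trained models learn this behavior rather than having it hard-coded.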