In recent years, there has been renewed interest in traditional recurrent neural networks (RNNs) such as LSTMs and GRUs, driven by the scalability limitations of Transformers on long sequences. Researchers have found that by removing the hidden-state dependencies from these models' gates, the recurrence becomes parallelizable: it can be evaluated with a parallel scan (sketched below) rather than trained sequentially with backpropagation through time (BPTT). The resulting minimal versions (minLSTM and minGRU) use significantly fewer parameters and train about 175 times faster than their standard counterparts on sequences of length 512. Surprisingly, these stripped-down RNNs match the empirical performance of much more recent sequence models.
https://arxiv.org/abs/2410.01201
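
For concreteness, here is a minimal sketch of the minGRU idea, assuming the formulation reported in the paper: z_t = σ(Linear(x_t)), h̃_t = Linear(x_t), and h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t. Because neither the gate nor the candidate state depends on h_{t−1}, the recurrence is affine in h and can be computed for all time steps at once with an associative scan. The weight names and the JAX-based scan below are illustrative, not the authors' implementation (which, per the paper, runs the scan in log-space for numerical stability).

```python
# Minimal minGRU sketch (illustrative names, not the authors' code).
#   z_t       = sigmoid(x_t @ W_z)
#   h_tilde_t = x_t @ W_h
#   h_t       = (1 - z_t) * h_{t-1} + z_t * h_tilde_t
# Since z_t and h_tilde_t depend only on x_t, h_t = a_t * h_{t-1} + b_t is a linear
# recurrence that an associative (parallel) scan can evaluate without a sequential loop.
import jax
import jax.numpy as jnp

def min_gru(x, W_z, W_h, h0):
    """x: (T, d_in), W_z / W_h: (d_in, d_hidden), h0: (d_hidden,). Returns (T, d_hidden)."""
    z = jax.nn.sigmoid(x @ W_z)   # gates computed from the inputs only
    h_tilde = x @ W_h             # candidate states, also input-only
    a = 1.0 - z                   # h_t = a_t * h_{t-1} + b_t
    b = z * h_tilde

    def combine(left, right):
        # Compose two affine updates h -> a*h + b (left applied first, then right).
        a_l, b_l = left
        a_r, b_r = right
        return a_l * a_r, b_l * a_r + b_r

    A, B = jax.lax.associative_scan(combine, (a, b), axis=0)
    return A * h0 + B             # hidden state at every time step, computed in parallel

# Toy usage; shapes are arbitrary, chosen only to echo the 512-step example above.
key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
x = jax.random.normal(k1, (512, 16))
W_z = 0.1 * jax.random.normal(k2, (16, 64))
W_h = 0.1 * jax.random.normal(k3, (16, 64))
h = min_gru(x, W_z, W_h, jnp.zeros(64))
print(h.shape)  # (512, 64)
```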