In a tweet at the end of 2024, Harold Aptroot expressed disappointment that the SHLX instruction takes 3 cycles on Alder Lake rather than the expected 1 cycle. SHLX is a left-shift instruction from the BMI2 instruction set. Surprisingly, the latency depends on how the shift count register was initialized: setting it with a 64-bit instruction that takes an immediate operand results in the slower 3-cycle latency, while setting it with a 32-bit instruction, or with a 64-bit instruction that has no immediate, gives the normal 1-cycle latency. This dependence is puzzling, since SHLX at 64-bit operand size only considers the bottom 6 bits of the shift count, so the way the rest of the register was written should not matter. Stay tuned for updates on this perplexing performance anomaly.
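To make the "bottom 6 bits" point concrete, here is a minimal sketch that models the count masking in Python. It is a software model only, not a measurement and not hardware code. The helper name `shlx64` is hypothetical, and the sketch assumes 64-bit operand size.

```python
MASK64 = (1 << 64) - 1  # keep results within a 64-bit register width


def shlx64(value: int, count: int) -> int:
    """Model a 64-bit SHLX: shift left by the low 6 bits of count."""
    shift = count & 0x3F  # only bits 0..5 of the count are used
    return (value << shift) & MASK64


if __name__ == "__main__":
    # Counts that agree in their low 6 bits give the same result.
    for count in (1, 65, 0x1_0000_0001):
        print(hex(count), hex(shlx64(1, count)))
```

The three counts differ only above their low six bits, so the model returns the same value for each. That is why the result of the shift gives no obvious reason for the latency to depend on how the count register was initialized.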
https://tavianator.com/2025/shlx.html