The key takeaway is to time your code each time you try a different trick to speed it up. Free up all 8 general-purpose CPU registers so your code can keep more values in registers. Avoid complex instructions and unnecessary compare instructions. On the P4 processor, use ADD/SUB instead of INC/DEC. ADC and SBB can give quick speedups. Consider using MMX for adding or subtracting large numbers. BSWAP, ROL, ROR, RCL, and RCR offer clever ways to handle data. Unrolling loops and aligning code and data can also boost performance. Pass parameters in registers rather than pushing them onto the stack. Consider using smaller registers where that might be faster. Overall, choose instructions that execute efficiently. Note: some of these recommendations may be controversial or surprising.
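As a rough illustration of the "time every trick" advice, here is a minimal sketch in C rather than assembly, comparing a plain summing loop with a version unrolled by four. The names sum_simple, sum_unrolled, time_call, compare_sums, and the array size N are all hypothetical and do not come from the original tips. The timing uses the standard clock() function, which is coarse, and an optimizing compiler may unroll or simplify either loop on its own, so treat the numbers as a starting point for measurement rather than a verdict.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

#define N 4096  /* hypothetical array size for the demo */

typedef unsigned long (*sum_fn)(const unsigned *, size_t);

/* Baseline: one addition per loop iteration. */
static unsigned long sum_simple(const unsigned *data, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Unrolled by four, with separate accumulators and a cleanup loop
   for the elements left over when n is not a multiple of four. */
static unsigned long sum_unrolled(const unsigned *data, size_t n)
{
    unsigned long t0 = 0, t1 = 0, t2 = 0, t3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        t0 += data[i];
        t1 += data[i + 1];
        t2 += data[i + 2];
        t3 += data[i + 3];
    }
    for (; i < n; i++)
        t0 += data[i];
    return t0 + t1 + t2 + t3;
}

/* Calls fn `repeats` times and returns the elapsed CPU time in seconds.
   The volatile sink forces each result to be stored; the compiler may
   still optimize the calls themselves, so this is only a rough harness. */
static double time_call(sum_fn fn, const unsigned *data, size_t n,
                        int repeats, unsigned long *result)
{
    volatile unsigned long sink = 0;
    clock_t start = clock();
    for (int k = 0; k < repeats; k++)
        sink = fn(data, n);
    clock_t end = clock();
    *result = sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Times both versions on the same data and checks that they agree. */
void compare_sums(void)
{
    static unsigned data[N];
    unsigned long a = 0, b = 0;

    for (size_t i = 0; i < N; i++)
        data[i] = (unsigned)i;

    double ta = time_call(sum_simple, data, N, 10000, &a);
    double tb = time_call(sum_unrolled, data, N, 10000, &b);

    assert(a == b);  /* a speedup is worthless if the answer changes */
    printf("simple:   %.4f s\nunrolled: %.4f s\n", ta, tb);
}
```

To try it, call compare_sums() from main and build once with optimization off and once with it on. The two runs can rank the loops differently, which is exactly why each change needs to be measured rather than assumed.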
https://masm32.com/masmcode/marklarson/index.htm