Bytes are all you need: Transformers operating directly on file bytes

The common approach to image classification using deep learning involves decoding images into an RGB tensor, but researchers have investigated performing classification directly on file bytes without the need for decoding. This allows for the creation of models that can operate on multiple input modalities and has applications in privacy-preserving inference. The model, called ByteFormer, achieves 77.33% classification accuracy when training and testing directly on TIFF file bytes and 95.42% on WAV files, and can perform inference on obfuscated input representations without any loss of accuracy. The code for ByteFormer is available for public use.

To top