The article examines how OpenAI's GPT-4o model represents images internally as embedding vectors. The author works through hypothetical designs for the vision encoder, weighing strategies such as feeding raw pixels directly against using a convolutional neural network, drawing on CNN architectures like YOLO for comparison. An experiment tests the model's ability to identify shapes and colors on grids of varying sizes, producing surprising results that challenge the initial hypothesis. Ultimately, the author proposes a pyramid strategy that encodes the image at several levels of granularity, which could explain the 170 tokens charged for each high-resolution image tile.
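As a purely illustrative sketch of the token accounting behind a pyramid strategy (the grid sizes below are assumptions for illustration, not the article's confirmed architecture), note that one decomposition lands exactly on 170: a single global embedding plus a 13×13 grid of patch embeddings.

```python
# Hypothetical token accounting for a multi-level pyramid encoding.
# The grid sizes are illustrative assumptions, not GPT-4o's known design.
def pyramid_tokens(grid_sizes):
    """Total embedding count if each pyramid level contributes an n x n grid."""
    return sum(n * n for n in grid_sizes)

# One decomposition that sums to exactly 170 tokens per tile:
# a 1x1 global summary embedding plus a 13x13 grid of local embeddings.
print(pyramid_tokens([1, 13]))  # 1 + 169 = 170
```

Other level combinations (e.g. adding intermediate grids) would change the total, which is why the exact per-tile token count can hint at the underlying spatial layout.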
https://www.oranlooney.com/post/gpt-cnn/