Information Theory mathematically describes the concept of “information” by expressing “surprise” as the minimum number of symbols required to communicate the outcome of a random process. According to Shannon’s Source Coding Theorem, to unambiguously communicate samples drawn from a distribution, one needs, on average, at least as many symbols per sample as the distribution’s entropy. In this post, we look at the theorem, which says that no matter what code function one chooses, the expected code word length can be no smaller than the entropy of the categorical distribution. That is, the entropy gives us a lower bound on how efficiently we can compress the information in the distribution.
https://mbernste.github.io/posts/sourcecoding/
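
To make the bound concrete, here is a minimal sketch (not from the post) that builds a Huffman code for a small categorical distribution and compares its expected code word length to the entropy. The distribution `p` and the heapq-based construction are illustrative assumptions; for this particular dyadic distribution the code meets the entropy bound exactly.

```python
import heapq
import math

def entropy(p):
    """Shannon entropy (in bits) of a categorical distribution given as {symbol: probability}."""
    return -sum(pi * math.log2(pi) for pi in p.values() if pi > 0)

def huffman_code_lengths(p):
    """Return the code word length (in bits) that Huffman coding assigns to each symbol."""
    # Each heap entry: (subtree probability, tie-breaker, {symbol: depth within subtree})
    heap = [(pi, i, {sym: 0}) for i, (sym, pi) in enumerate(p.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees; every symbol in them gets one bit deeper.
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical example distribution (an assumption, not taken from the post)
p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
lengths = huffman_code_lengths(p)
expected_length = sum(p[s] * lengths[s] for s in p)

print(f"Entropy:              {entropy(p):.3f} bits")    # 1.750
print(f"Expected code length: {expected_length:.3f} bits")  # 1.750, matching the lower bound
```

For distributions whose probabilities are not powers of 1/2, the expected Huffman code length exceeds the entropy, but the Source Coding Theorem guarantees it can never fall below it.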