Exploring HN by mapping and analyzing 40M posts and comments for fun

The author built a semantic map of all Hacker News posts using text embeddings: numeric vectors that place text in a high-dimensional space so that items with similar meaning sit close together, which makes semantic search and analysis possible. The write-up follows the pipeline from collecting the data with a Node.js service, to generating embeddings with the BGE-M3 model, to applying UMAP for dimensionality reduction. Along the way the author dealt with challenges such as non-textual titles and link rot. Two practical details stand out: RunPod was used for cost-effective GPU computation, and db-rpc for efficient database connections. The data and source code are available for others to explore, and the author invites collaboration.
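To make the idea of embedding-based search concrete, here is a minimal sketch in Python. It is not the author's code: the three-dimensional vectors in `TOY_EMBEDDINGS` are made up by hand for illustration (real BGE-M3 embeddings have far more dimensions and come from the model), and the helper names `cosine_similarity` and `search` are hypothetical. It only shows the general technique of ranking texts by cosine similarity to a query vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, corpus, top_k=2):
    """Return the top_k titles whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), title)
              for title, vec in corpus.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:top_k]]

# Hand-made toy vectors (an assumption for illustration, not model output).
TOY_EMBEDDINGS = {
    "Show HN: A tiny Rust web server": [0.9, 0.1, 0.0],
    "Ask HN: Best Rust learning resources?": [0.8, 0.2, 0.1],
    "The history of sourdough baking": [0.0, 0.1, 0.9],
}

# A query vector pointing in the "Rust" direction of this toy space.
results = search([1.0, 0.0, 0.0], TOY_EMBEDDINGS)
print(results)
```

In this toy space the two Rust-related titles rank ahead of the baking one because their vectors point in nearly the same direction as the query, which is the property semantic search relies on.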

https://blog.wilsonl.in/hackerverse/
