The author documents how they tackled the One Billion Row Challenge in CUDA. Their solution runs in 16.8 seconds on a V100 and, by the author's account, is the first of its kind: it uses hand-written kernels only, with no cudf. Starting from a C++ baseline, they partition the workload, prepare file offsets for the partitions, and implement custom atomicMin and atomicMax functions for floats in the CUDA kernel. CUDA device code cannot use std::string or std::map, which forces unconventional workarounds such as keeping the city names sorted and locating each city by binary search. Despite these hurdles, the CUDA solution is significantly faster than the baseline.
https://tspeterkim.github.io/posts/cuda-1brc
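The two workarounds mentioned in the summary can be illustrated with a minimal host-side C++ sketch. This is not the author's code. It assumes a hypothetical fixed-width `City` record in place of std::string, a hand-written `compare_names` in place of strcmp, and `std::atomic<float>` with a compare-and-swap loop standing in for what a CUDA kernel would typically build on `atomicCAS`. All names here (`City`, `kMaxNameLen`, `compare_names`, `find_city`, `atomic_min_float`, `atomic_max_float`) are invented for illustration.

```cpp
#include <atomic>

// Assumption: city names fit in a fixed-width buffer, since device code
// cannot use std::string.
constexpr int kMaxNameLen = 100;

struct City {
    char name[kMaxNameLen];
};

// Hand-written string comparison (stand-in for strcmp).
// Returns <0, 0, or >0 like strcmp.
int compare_names(const char* a, const char* b) {
    while (*a != '\0' && *a == *b) {
        ++a;
        ++b;
    }
    return static_cast<unsigned char>(*a) - static_cast<unsigned char>(*b);
}

// Binary search over an array of cities sorted by name, replacing a
// std::map lookup. Returns the index of the city, or -1 if absent.
int find_city(const City* cities, int n, const char* name) {
    int lo = 0;
    int hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = compare_names(cities[mid].name, name);
        if (cmp == 0) return mid;
        if (cmp < 0) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}

// Atomic minimum for floats via a compare-and-swap retry loop.
// On failure, compare_exchange_weak reloads `cur`, so the loop re-checks
// whether `val` is still smaller before retrying. NaN is not handled.
void atomic_min_float(std::atomic<float>& target, float val) {
    float cur = target.load();
    while (val < cur && !target.compare_exchange_weak(cur, val)) {
    }
}

// Atomic maximum for floats, same pattern with the comparison flipped.
void atomic_max_float(std::atomic<float>& target, float val) {
    float cur = target.load();
    while (val > cur && !target.compare_exchange_weak(cur, val)) {
    }
}
```

In this sketch, each worker would look up the slot for a city with `find_city` and then fold a temperature reading into that slot's running minimum and maximum with the two atomic helpers. The lookup costs O(log n) comparisons per row instead of a hash or tree lookup, which is the trade-off the sorted-array approach accepts.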