Data scientists are realizing the importance of databases for scalable analytics, but many struggle to make the leap. Contrary to popular belief, working with a database is not much different from working with a data.frame: if you know how to write dplyr, you already know how to work with a database, because dbplyr translates dplyr pipelines into SQL. Scaling an analysis does require some relational fundamentals — what an RDBMS is, primary and foreign keys, normalization, schemas, views, indexes — and a sense of when a full RDBMS is actually needed. That said, tools like Apache Arrow and DuckDB offer viable alternatives to a traditional database server: Parquet files organized with Hive-style partitioning support efficient analytical workflows. For analytics, consider Parquet with Hive partitioning queried via DuckDB or Arrow instead of a mainstream database like Postgres.
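As a rough sketch of that Parquet-plus-DuckDB workflow in R (the dataset path and the column names `year`, `region`, and `amount` are hypothetical, chosen only for illustration):

```r
library(DBI)
library(dplyr)
library(duckdb)

# Connect to an in-process DuckDB database -- no server to install or run.
con <- dbConnect(duckdb::duckdb())

# Expose a directory of Hive-partitioned Parquet files as a lazy table.
# The path and schema here are hypothetical.
sales <- tbl(con, sql(
  "SELECT * FROM read_parquet('data/sales/**/*.parquet', hive_partitioning = true)"
))

# Ordinary dplyr verbs are translated to SQL and executed by DuckDB,
# which scans only the partitions and columns the query needs.
sales |>
  filter(year == 2023) |>
  group_by(region) |>
  summarise(total = sum(amount, na.rm = TRUE)) |>
  collect()

dbDisconnect(con, shutdown = TRUE)
```

The `collect()` call at the end is the only point where results materialize as a data.frame; everything before it stays in the database engine, which is what makes the same code scale well beyond in-memory data.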
https://josiahparry.com/posts/2024-05-16-databases-for-ds