Monday, December 12, 2022

Doing mundane things with SQL and DuckDB

I've been learning more and more SQL, both the language and its implementations, and, as always, looking for the extreme cases of "you shouldn't be doing this in SQL, but here we go...". I dive into those because they're usually the most fun and enlightening parts.

One db I'm very fond of is DuckDB. The philosophy behind it just clicks with me:

In-process, OLAP, zero dependencies (tinyverse), quite a lot of innovation, but also lots of practical features in a very DWIM style.

Installing it is a breeze. And it works. I couldn't make it "not work" even if I tried (I'm using Nix, and I know plenty of ways of making things not work).

- `pip install duckdb` works. It fetches the binary along the way, and it works. In Nix! (Good luck installing any ML library with binaries this way.)

- Put it in your `deps.edn`, getting it from Maven: `{org.duckdb/duckdb_jdbc {:mvn/version "0.6.0"}}`.

- `nix-shell -p duckdb` also works.

Once installed, DuckDB has plenty of cool DWIM-y features, like querying a Parquet file straight from an S3 bucket. Just like that. I mean... we're typically used to a super slow iteration loop to bring the data close to the db, and then the cycle of create table, drop table, create table... DuckDB does it with a single function call. Today we'll be talking about `read_csv_auto`.
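A minimal sketch of what that looks like (the bucket and path here are made up for illustration; `httpfs`, `read_parquet`, and the S3 settings are real DuckDB features):

```sql
-- one-time setup: the httpfs extension handles s3:// and https:// URLs
INSTALL httpfs;
LOAD httpfs;
SET s3_region='us-east-1';

-- query a remote Parquet file directly, no import step
-- (hypothetical bucket and path)
SELECT *
FROM read_parquet('s3://my-bucket/events/2022/data.parquet')
LIMIT 10;
```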

Even in the simplest version, being able to read a csv into a table is already great.
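For instance, turning a local CSV into a table is a single statement (the file and table names here are hypothetical):

```sql
-- read_csv_auto sniffs the delimiter, header, and column types for you
CREATE TABLE sales AS
SELECT * FROM read_csv_auto('us_retail_sales.csv');
```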

For comparison: recently I wanted to import the data from https://github.com/cathytanimura/sql_book/ , the companion repo of a very good SQL book I'm reading these days. To figure out a way to import the data from all chapters into Postgres, I ended up writing a docker-compose file, found some bugs in the repo (which I PR'd fixes for), and finally had my data loaded in Postgres.

Today, I wanted to try something on duckdb. The thought process went like this:

"Oh, I wonder if [some sql question] would work. I could try it with the data from that sql analytics book.  Oh, I think I wrote some readme on how to load the data... (15 minutes later)... where am I again?"

Then I thought... oh wait, maybe DuckDB...

1 minute later:

So yeah....

`select * from 'https://raw.githubusercontent.com/cathytanimura/sql_book/master/Chapter%203%3A%20Time%20Series%20Analysis/us_retail_sales.csv';` just worked.

Internally, this used the `httpfs` extension to fetch the file over HTTPS, and `read_csv_auto` to figure out everything else. Read more in the DuckDB docs.
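The one-liner above is roughly equivalent to spelling those steps out yourself (same URL as in the query above; `INSTALL`/`LOAD` and `read_csv_auto` are the real DuckDB commands):

```sql
-- what the bare SELECT over an https:// URL does under the hood
INSTALL httpfs;
LOAD httpfs;

SELECT *
FROM read_csv_auto('https://raw.githubusercontent.com/cathytanimura/sql_book/master/Chapter%203%3A%20Time%20Series%20Analysis/us_retail_sales.csv');
```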
