miércoles, 21 de diciembre de 2022

Trying things out with duckdb's dbgen

REPLs are great, but they are only a part of the "have fun" experience.

Sometimes, you need data to work on a repl and to be able to simulate the "real world".

In clojure, I've used specmonstah and I quite like it. It can build nested structures nicely, and once you've configured the db schema and the specs of the individual properties, you can use it as a building block from your code to generate trees of data.

For sql, I got pretty wild with generate_series and `create table XXX as select`, and you can get quite far with those.
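A sketch of the kind of thing I mean (the table and columns are made up; it should work the same in postgres and duckdb):

create table fake_events as
select i as id,
       date '2022-01-01' + i as created_at,
       (random() * 100)::int as amount
from generate_series(1, 1000) t(i);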

But recently I discovered the `dbgen` function in DuckDB and I'm pretty happy with it: this function (which comes bundled in any duckdb binary) can generate TPC-H sample databases, with the scale factor as a parameter!

I find myself using it more and more to test some sql I have in mind when I'm not sure whether it would work or not. It's really good that it gets you from "I wonder what would happen if..." to "aha, I see" in a shorter time than any of my previous approaches to scaffolding sample data for sql tests.

The usage is pretty simple:

`call dbgen(sf=0.1)`, and off you go!
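For reference, a quick session sketch (the follow-up queries are just examples; any of the standard TPC-H tables works):

call dbgen(sf=0.1);

-- the usual TPC-H tables (lineitem, orders, customer, ...) are now in place
select count(*) from lineitem;
select o_orderpriority, count(*) from orders group by 1 order by 1;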

domingo, 18 de diciembre de 2022

Going meta until it hurts, with Brian Cantwell Smith

I realized today I haven't mentioned Brian Cantwell Smith in this blog before. I'm not intimately familiar with his work, but I've read parts of his papers and thesis, and watched some of his talks.

And I can tell you, dear reader... if you're into meta/circular concepts, you'll have a great time reading his work.

Here are some links. The first one is a recent talk by him; the rest are works I've read that made some impression on me. Look for more of his talks. All of them are pretty deep, and not the run-of-the-mill talks you find on youtube.

  • https://student.cs.uwaterloo.ca/~cs492/11public_html/p18-smith.pdf <- Limits of Correctness. I loved it; it's the lightest of the links here
  • https://www.youtube.com/watch?v=0eE-CTX96v8
  • https://www.ics.uci.edu/~jajones/INF102-S18/readings/17_Smith84.pdf
  • https://dspace.mit.edu/handle/1721.1/15961

I find these themes pretty dense, but it's enjoyable to have a peek from time to time. Similar to Dennett's, Spolsky's, and Hofstadter's works.


Have fun


Update primary key and associated foreign keys in sql

If you have two tables with a foreign key relationship and you want to update the value of a primary key, postgresql will fail, telling you that there are associated tables with foreign keys pointing to that value, so you can't break the consistency.


update customer set id=2 where id=1;

ERROR:  23503: update or delete on table "customer" violates foreign key constraint "xxxxxx_id_fkey" on table "xxxxx"
DETAIL:  Key (id)=(1) is still referenced from table "xxxxx".
LOCATION:  ri_ReportViolation, ri_triggers.c:2797

Oh, I know what I'll do (you think), I'll change both inside a transaction:

begin; update customer set id=2 where id=1;...
ERROR:  23503: update or delete on table "customer" violates foreign key constraint "xxxxxx_id_fkey" on table "xxxxx"
DETAIL:  Key (id)=(1) is still referenced from table "xxxxx".
LOCATION:  ri_ReportViolation, ri_triggers.c:2797

Oh, the same :( So even inside a transaction, we can't have inconsistent states between statements.

So.... The trick to let postgres update both columns of the two tables at once is to use a CTE. Common Table Expressions have many different uses, and one of them is to bundle multiple queries into a single statement. When used with updates/inserts, it allows these kinds of tricks.

with c as (update customer set id=42 where id=1),
     s as (update subscription set cus_id=42 where cus_id=1)
select 1;

As a_horse_with_no_name explains here:

As (non-deferrable) foreign keys are evaluated on statement level and both the primary and foreign key are changed in the same statement, this works.
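A minimal sketch to reproduce the whole thing (the schema here is made up, since the real tables above are anonymized):

create table customer (id int primary key);
create table subscription (id int primary key, cus_id int references customer (id));
insert into customer values (1);
insert into subscription values (100, 1);

-- both updates live in the same statement, so the FK check only runs once, at the end
with c as (update customer set id=42 where id=1),
     s as (update subscription set cus_id=42 where cus_id=1)
select 1;

select * from subscription;  -- cus_id is now 42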

lunes, 12 de diciembre de 2022

Sum scan in sql

`scan` is an adverb in apl/j/k which reduces an array over an operation while keeping all the intermediate results. It's used in lots of very creative ways in the array programming languages. Here are some examples of scan (`\`):

:x:1+!9
1 2 3 4 5 6 7 8 9
+\x
1 3 6 10 15 21 28 36 45
*\x
1 2 6 24 120 720 5040 40320 362880

In most of those programming languages, scan acts as an adverb, which is a higher order function that modifies a verb. In clojure, you can get similar results with `reductions`.

Anyway, the topic of today is how to accumulate intermediate results like these, in sql.

Well, in the AoC 2022 (day 1), the first part was about summing numbers by paragraphs. 

Some people on the duckdb discord were sharing their solutions, and it's kinda normal to use the `row_number() over () as idx` idiom to add ascending numbers to any projection.

Another trick, to indicate the paragraph number of each line (for later group_by purposes), is what Teggy writes:

 `select sum((i.value is null)::int) over (order by num)`

Here I show generate_series, row_number, and the sum scan based on "is null".
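Something like this minimal sketch (the data is made up with generate_series; only the shape of the AoC input matters):

with lines as (
    -- fake "input file": a null every 4th row plays the role of the blank separator line
    select row_number() over () as num,
           case when i % 4 = 0 then null else i * 100 end as value
    from generate_series(1, 12) t(i)
)
select num,
       value,
       -- the "sum scan": running count of blank lines seen so far = the paragraph number
       sum((value is null)::int) over (order by num) as paragraph
from lines;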


Doing mundane things with SQL and Duckdb

I've been learning more and more sql, both the language and the implementations, and, as always, looking for the extreme cases of "you shouldn't be doing this in sql, but here we go...", and diving into them, as they're usually the most fun and enlightening parts.

One db I'm very fond of is duckdb. The philosophy behind it just clicks with me:

In-process, OLAP, zero-deps (tinyverse), quite a lot of innovation, but also lots of practical stuff in a very dwim style. 

Installing it is a breeze. And it works. I couldn't make it "not work" even if I tried (I'm using nix, and I know plenty of ways of making things not work).

- `pip install duckdb` works. It gets the binary along the way, and it works. In nix! (good luck with this approach installing any ML library with binaries)

- Put it in your `deps.edn`, getting it from maven: `{org.duckdb/duckdb_jdbc {:mvn/version "0.6.0"}}`.

- `nix-shell -p duckdb`. Also works.

Once there, duckdb has plenty of cool dwim-y features, like getting a parquet file straight from an s3 bucket. Just like that. I mean... we're typically used to a super slow iteration speed to bring the data close to the db, and then the iteration of create table, drop table, create table... duckdb does it with just a function call. Today we'll be talking about `read_csv_auto`.

Even in the simplest version, being able to read a csv into a table is already great.
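A minimal sketch (the file name is made up):

create table sales as select * from read_csv_auto('us_retail_sales.csv');
describe sales;  -- column names and types are inferred from the file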

For comparison, recently I wanted to import the data from https://github.com/cathytanimura/sql_book/ , from a very good SQL book I'm reading nowadays. To figure out a way to import the data from all chapters into a postgres, I wrote a docker-compose, found some bugs in the repo (which I PR'd fixes for), and finally got my data loaded in postgres.

Today, I wanted to try something on duckdb. The thought process went like this:

"Oh, I wonder if [some sql question] would work. I could try it with the data from that sql analytics book.  Oh, I think I wrote some readme on how to load the data... (15 minutes later)... where am I again?"

 Then, I thought...  Oh wait, maybe duckdb...

1 minute later:

 


 So yeah....

`select * from 'https://raw.githubusercontent.com/cathytanimura/sql_book/master/Chapter%203%3A%20Time%20Series%20Analysis/us_retail_sales.csv';` just worked.

Internally, this used the httpfs extension and read_csv_auto to figure everything out. Read more in the duckdb docs.