Showing posts with the label sql. Show all posts

Thursday, February 16, 2023

SQL window functions

SQL's window functions are such a huge topic that some see them as a DSL inside SQL itself.

You can do amazing things in SQL without window functions, but when you need sophisticated rankings, or sorted subgroups, or you find yourself fighting `group by + having + order by`, window functions are usually the answer.
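For instance, the classic "top row per group" question, which otherwise needs a self-join or a correlated subquery, is direct with `rank()`. A minimal sketch (the `employee` table and its columns are made up):

```sql
-- highest-paid employee per department
select name, department, salary
from (select name, department, salary,
             rank() over (partition by department
                          order by salary desc) as r
      from employee) as t
where r = 1;
```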

Bruce Momjian's talk about window functions has been my default resource whenever I need a refresher.

But today I found a teaser of an upcoming book on window functions that looks very, very good. I'll keep an eye on that page. Also, the pic is from that site, so it deserves a link from my high-traffic blog (you're welcome).


Tuesday, January 17, 2023

DuckDB's arg_max

In kdb+ there's a great join primitive called `aj`, for as-of join. This way of joining data is a neat way to get the "latest record as of a point in time" in a join.

These kinds of operations are already quite painful to do in plain SQL when grouping. Ever had to get the latest invoice of each customer? You probably had to nest a query on top of another one, or use a fancy `distinct on`.
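For the record, the two painful ways, sketched against a hypothetical `invoice(id, customer_id, created_at)` table:

```sql
-- 1. nest a query on top of another one
select * from invoice i
where created_at = (select max(created_at)
                    from invoice
                    where customer_id = i.customer_id);

-- 2. the fancy "distinct on" (postgres flavoured)
select distinct on (customer_id) *
from invoice
order by customer_id, created_at desc;
```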

I realized, though, that some analytical engines have a function called `arg_max` that can alleviate the pain of nested group-bys. And DuckDB already implements it! (discussion here)

First, let's create a table with a few values ready to group. Then we'll want to get, for every group, the id of the row with the max value. At that point, `arg_max` just does the right thing.

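Something like this (a toy `invoice` table, in DuckDB syntax):

```sql
create table invoice as
select * from (values (1, 1, date '2022-01-10'),
                      (2, 1, date '2022-03-01'),
                      (3, 2, date '2022-02-15'))
       as t(id, customer_id, created_at);

-- for each customer, the id of their latest invoice
select customer_id,
       arg_max(id, created_at) as latest_invoice_id
from invoice
group by all;
```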

Btw, look at that nice DuckDB `group by all`.

Wednesday, December 21, 2022

Trying things out with duckdb's dbgen

REPLs are great, but they are only a part of the "have fun" experience.

Sometimes you need data to work on in a REPL, to be able to simulate the "real world".

In Clojure, I've used specmonstah and I quite like it. It builds nested structures nicely, and once you've configured the db schema and the specs of the individual properties, you can use it as a building block from your code to generate trees of data.

For SQL, I got pretty wild with `generate_series` and `create table XXX as select`, and you can get quite far with those.
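A sketch of that scaffolding trick, the way I'd throw together a quick fake table (names and distributions invented on the spot):

```sql
-- 1000 fake sales rows: sequential ids, 7 customers, random amounts
create table sales as
select i as id,
       i % 7 as customer_id,
       round(random() * 100, 2) as amount
from generate_series(1, 1000) as t(i);
```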

But recently I discovered the `dbgen` function in DuckDB and I'm pretty happy with it: this function (bundled in any duckdb binary) can generate TPC-H sample databases, with the scale factor as a parameter!

I find myself using it more and more to test some SQL I have in mind when I'm not sure whether it would work. It gets you from "I wonder what would happen if..." to "aha, I see" in a shorter time than any of my previous approaches to scaffolding sample data for SQL tests.

The usage is pretty simple:

`call dbgen(sf=0.1)`, and off you go!
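After that call you have the whole TPC-H schema to poke at; something like:

```sql
call dbgen(sf=0.1);
show tables;            -- customer, lineitem, nation, orders, part, ...
select count(*) from lineitem;
```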

Sunday, December 18, 2022

Update primary key and associated foreign keys in sql

If you have two tables with a foreign-key relationship and you want to update the value of a primary key, PostgreSQL will fail, telling you that there are tables with foreign keys pointing at that value, so you can't break the consistency.
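A minimal pair of tables to reproduce the situation (names are illustrative):

```sql
create table customer (id int primary key);
create table subscription (
    id     int primary key,
    cus_id int references customer (id)
);

insert into customer values (1);
insert into subscription values (10, 1);
```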


update customer set id=2 where id=1;

ERROR:  23503: update or delete on table "subscription" violates foreign key constraint "xxxxxx_id_fkey" on table "xxxxx"
DETAIL:  Key (id)=(1) is still referenced from table "xxxxx".
LOCATION:  ri_ReportViolation, ri_triggers.c:2797

Oh, I know what I'll do (you think): I'll change both inside a transaction:

begin; update customer set id=2 where id=1;...
ERROR:  23503: update or delete on table "subscription" violates foreign key constraint "xxxxxx_id_fkey" on table "xxxxx"
DETAIL:  Key (id)=(1) is still referenced from table "xxxxx".
LOCATION:  ri_ReportViolation, ri_triggers.c:2797

Oh, the same :( So even inside a transaction we can't have inconsistent intermediate states: non-deferred constraints are checked per statement, not at commit.

So... the trick to let Postgres update both tables at once is to use a CTE. Common Table Expressions have many different uses, and one of them is to bundle multiple queries into a single statement. When used with updates/inserts, it allows this kind of trick.

with c as (update customer set id=42 where id=1),
     s as (update subscription set cus_id=42 where cus_id=1)
select 1;

As a_horse_with_no_name explains here:

As (non-deferrable) foreign keys are evaluated on statement level and both the primary and foreign key are changed in the same statement, this works.

Monday, December 12, 2022

Sum scan in sql

`scan` is an adverb in APL/J/K used to reduce an array over an operation while keeping all the intermediate results. It's used in lots of very creative ways in the array programming languages. Here are some examples of scan (`\`):

x:1+!9
x
1 2 3 4 5 6 7 8 9
+\x
1 3 6 10 15 21 28 36 45
*\x
1 2 6 24 120 720 5040 40320 362880

In most of those languages, scan acts as an adverb: a higher-order function that modifies a verb. In Clojure, you can get similar results with `reductions`.

Anyway, the topic of today is how to accumulate intermediate results like these, in sql.

Well, in the AoC 2022 (day 1), the first part was about summing numbers by paragraphs. 

Some people at the DuckDB Discord are sharing their solutions, and it's quite common to use the `row_number() over () as idx` idiom to add ascending numbers to any projection.

Another trick, to tag each line with its paragraph number (for future group-by purposes), is what Teggy writes:

`select sum((i.value is null)::int) over (order by num)`

Here I show generate_series, row_number, and the sum scan based on "is null".
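The running-sum part, the SQL counterpart of `+\`, can be sketched like this:

```sql
-- sum scan over 1..9, same shape as +\x above
select i,
       sum(i) over (order by i) as running
from generate_series(1, 9) as t(i);
-- running: 1 3 6 10 15 21 28 36 45
```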


Doing mundane things with SQL and Duckdb

I've been learning more and more SQL: the language, the implementations, and, as always, looking for the extreme cases of "you shouldn't be doing this in SQL, but here we go...", diving into them, as they're usually the most fun and enlightening parts.

One db I'm very fond of is duckdb. The philosophy behind it just clicks with me:

In-process, OLAP, zero-deps (tinyverse), quite a lot of innovation, but also lots of practical stuff in a very dwim style. 

Installing it is a breeze. And it works. I couldn't make it "not work" even if I tried (I'm using nix, and I know plenty of ways of making things not work).

- `pip install duckdb` works. It gets the binary along the way, and it works. In nix! (good luck with this approach installing any ML library with binaries)

- Put it in your `deps.edn`, getting it from Maven: `{org.duckdb/duckdb_jdbc {:mvn/version "0.6.0"}}`.

- `nix-shell -p duckdb` also works.

Once there, DuckDB has plenty of cool dwim-y features, like fetching a parquet file straight from an s3 bucket. Just like that. I mean... we're typically used to a super slow iteration loop to bring the data close to the db, and then the cycle of create table, drop table, create table... DuckDB does it with just a function call. Today we'll be talking about `read_csv_auto`.

Even in the simplest version, being able to read a csv into a table is already great.

For comparison, I recently wanted to import the data from https://github.com/cathytanimura/sql_book/ , from a very good SQL book I'm reading nowadays. To import the data from all chapters into Postgres, I ended up writing a docker-compose, found some bugs in the repo (which I PR'd fixes for), and finally got my data loaded into Postgres.

Today, I wanted to try something on duckdb. The thought process went like this:

"Oh, I wonder if [some sql question] would work. I could try it with the data from that sql analytics book.  Oh, I think I wrote some readme on how to load the data... (15 minutes later)... where am I again?"

Then, I thought... Oh wait, maybe duckdb...

1 minute later:



So yeah....

`select * from 'https://raw.githubusercontent.com/cathytanimura/sql_book/master/Chapter%203%3A%20Time%20Series%20Analysis/us_retail_sales.csv';` just worked.

Internally, this used the httpfs extension and `read_csv_auto` to figure everything out. Read more in the duckdb docs.
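Spelled out explicitly, the same thing looks like this (recent DuckDB versions can autoload httpfs, but loading it by hand doesn't hurt):

```sql
install httpfs;
load httpfs;

-- read_csv_auto sniffs the delimiter, the header and the column types
select *
from read_csv_auto('https://raw.githubusercontent.com/cathytanimura/sql_book/master/Chapter%203%3A%20Time%20Series%20Analysis/us_retail_sales.csv')
limit 5;
```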