martes, 17 de enero de 2023

DuckDB's arg_max

In kdb+ there's a great way of showing aggregated information called `aj` for asof join. This way of joining data is a neat way to get the "latest record" in a join. 

These kinds of operations are already quite painful to do in plain sql when grouping by. Ever had to get the latest invoice of each customer? probably you had to nest a query on top of another one, or use a fancy "distinct on".

I realized though, that some analytical engines have something called `arg_max`, that can alleviate the pain when doing nested group by. And Duckdb already implements it! (discussion here)

First, let's create a table with a few values ready to group and get the max. Then, we'll want to get the id of the max of every group.   At that point, arg_max just does the right thing.

 

Btw, look at that nice Duckdb's group by all

D create table test as SELECT i as i ,random() as r, i%4 as gr from generate_series(1,100) t(i);
D select * from test;
┌───────┬──────────────────────┬───────┐
│ i │ r │ gr │
│ int64 │ double │ int64 │
├───────┼──────────────────────┼───────┤
│ 1 │ 0.14984029697949075 │ 1 │
│ 2 │ 0.20274345139105418 │ 2 │
│ 3 │ 0.07478489612107744 │ 3 │
│ 4 │ 0.782259951294926 │ 0 │
│ 5 │ 0.2284793616804479 │ 1 │
│ 6 │ 0.041695246948323966 │ 2 │
│ 7 │ 0.7573347926971816 │ 3 │
│ 8 │ 0.73962961294214 │ 0 │
│ 9 │ 0.7779207988125088 │ 1 │
│ 10 │ 0.9389072358466003 │ 2 │
│ · │ · │ · │
│ · │ · │ · │
│ · │ · │ · │
│ 91 │ 0.6364273847165581 │ 3 │
│ 92 │ 0.18678856715252357 │ 0 │
│ 93 │ 0.6967327018959291 │ 1 │
│ 94 │ 0.9226525455998845 │ 2 │
│ 95 │ 0.8744859222496129 │ 3 │
│ 96 │ 0.868602423665254 │ 0 │
│ 97 │ 0.2760435713632134 │ 1 │
│ 98 │ 0.03305368778134084 │ 2 │
│ 99 │ 0.6663320953180856 │ 3 │
│ 100 │ 0.36892018499060536 │ 0 │
├───────┴──────────────────────┴───────┤
│ 100 rows (20 shown) 3 columns │
└──────────────────────────────────────┘
D SELECT gr,max(r) from test group by all; -- did you notice this group by all? super cool!
┌───────┬────────────────────┐
│ gr │ max(r) │
│ int64 │ double │
├───────┼────────────────────┤
│ 0 │ 0.9624206805514219 │
│ 1 │ 0.9922432776056797 │
│ 2 │ 0.9643812561324754 │
│ 3 │ 0.9202900638152589 │
└───────┴────────────────────┘
-- How could we get the id of the one that has the max r?
-- https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/7630564#7630564.
-- not easy. But in some cases, arg_max can just do the job
D SELECT gr,max(r),arg_max(i,r) from test group by gr;
┌───────┬────────────────────┬───────────────┐
│ gr │ max(r) │ arg_max(i, r) │
│ int64 │ double │ int64 │
├───────┼────────────────────┼───────────────┤
│ 0 │ 0.9624206805514219 │ 60 │
│ 1 │ 0.9922432776056797 │ 61 │
│ 2 │ 0.9643812561324754 │ 22 │
│ 3 │ 0.9202900638152589 │ 23 │
└───────┴────────────────────┴───────────────┘
view raw duckdb.sql hosted with ❤ by GitHub