Thursday, November 30, 2023

Nesting group by in postgres

Since my Crystal Reports days (around 2005) I've been used to the idea that "a group by flattens all your columns into one value". It seemed inescapable (short of big hacks).

But lately, with complex types in postgres, and smarter aggregation functions, I've grown to create nested structures in group bys when I want to retain information about individual records inside the group.

It probably violates a few normalization rules, but  who cares?

So here it is:


select country,
       count(*),
       jsonb_pretty(jsonb_agg((id, status, deleted_at) order by deleted_at desc)) as statuses
from customer
group by country having count(*) > 1
order by count desc;

I think it's super cool to be able to generate a result with all the nested information about the members of each group, and even more, to be able to order inside the groups!

I imagine there's a lot of cool stuff you can do later with UNNEST-style functions, which would let us recreate the records as tables, to do further queries.
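For instance (a sketch, reusing the customer table above; here I aggregate with jsonb_build_object so the keys have readable names), jsonb_array_elements can unpack the aggregated array back into rows:

```sql
-- Aggregate per group, then re-expand each group with jsonb_array_elements.
select g.country,
       elem->>'id'     as id,
       elem->>'status' as status
from (
  select country,
         jsonb_agg(jsonb_build_object('id', id, 'status', status)) as members
  from customer
  group by country
) g,
lateral jsonb_array_elements(g.members) as elem;
```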

Or maybe this is just another hack, like the ones I was writing in 2005. But anyhow, it's very useful for data exploration.

Happy hacking

Monday, October 23, 2023

Just use Postgres for everything!

I just discovered this post https://www.amazingcto.com/postgres-for-everything/ which is short and to the point.

Copying it verbatim, because it deserves to be read even if you don't click my links

"

One way to simplify your stack and reduce the moving parts, speed up development, lower the risk and deliver more features in your startup is “Use Postgres for everything”. Postgres can replace - up to millions of users - many backend technologies, Kafka, RabbitMQ, Mongo and Redis among them.


Use Postgres for caching instead of Redis with UNLOGGED tables and TEXT as a JSON data type. Use stored procedures to add and enforce an expiry date for the data just like in Redis.


Use Postgres as a message queue with SKIP LOCKED instead of Kafka (if you only need a message queue).


Use Postgres with Timescale as a data warehouse.


Use Postgres with JSONB to store Json documents in a database, search and index them - instead of Mongo.


Use Postgres as a cron demon to take actions at certain times, like sending mails, with pg_cron adding events to a message queue.


Use Postgres for Geospacial queries.


Use Postgres for Fulltext Search instead of Elastic.


Use Postgres to generate JSON in the database, write no server side code and directly give it to the API.


Use Postgres with a GraphQL adapter to deliver GraphQL if needed.


There I’ve said it, just use Postgres for everything.

"


Also, there's a link to radical simplicity, which seems like another iteration of use-boring-technology.

This gist about using pg for basically everything is also great: https://gist.github.com/cpursley/c8fb81fe8a7e5df038158bdfe0f06dbb

Monday, October 2, 2023

Idempotency is so great

Once I'm presented with a problem, there are a few angles I instantly look at it from, to get a sense of which kinds of solutions might work, and whether the problem allows for unknowns as I go, or it has to be right from the start.

- One of them is idempotency. 

- Another is "does it allow for layered development/deployment?". Can I keep figuring out the rest of the owl as I'm drawing it?

- Steel thread vs gaining resolution as we go through

I remember Christopher Alexander's quote:

The difference between the novice and the master is simply that the novice has not learnt, yet, how to do things in such a way that he can afford to make small mistakes. The master knows that the sequence of his actions will always allow him to cover his mistakes a little further down the line. It is this simple but essential knowledge which gives the work of a master carpenter its wonderful, smooth, relaxed, and almost unconcerned simplicity.

https://ericlathrop.com/2021/04/idempotence-now-prevents-pain-later/

https://news.ycombinator.com/item?id=31027403

https://www.berkansasmaz.com/every-programmer-should-know-idempotency/

https://newsletter.casewhen.xyz/p/data-explained-idempotence

https://www.youtube.com/watch?v=GRr4xeMn1uU

https://danluu.com/simple-architectures/ (links to some brandur posts about idempotency and jobs)

Friday, September 29, 2023

A Few C++ Talks?

Here are a few talks from the C++ community that, when I watched them, I thought I'd rewatch, because they either are very cool even though I don't grok C++, or they cover general programming concepts. Here's the list:


- https://www.youtube.com/watch?v=V5SCJIWIPjk Optimizing for Change in C++ - Ben Deane

- https://www.youtube.com/watch?v=2ouxETt75R4 Easy to Use, Hard to Misuse - Ben Deane

- https://www.youtube.com/watch?v=sWgDk-o-6ZE CppCon 2015: Sean Parent "Better Code: Data Structures"

- https://www.youtube.com/watch?v=W2tWOdzgXHA&list=PLM5v5JsFsgP21eB4z2mIL8upkvT00Tw9B Sean Parent "Seasoning" talking about algorithms

- https://www.youtube.com/watch?v=JELcdZLre3s Function Composition in Programming Languages - Conor Hoekstra - CppNorth 2023

Tuesday, September 26, 2023

Naive option picker (fzf) in pure shell

I've been a big fan of option pickers since... forever. I've used ratmenu, ratmen, dmenu, dzen, percol, fzf, helm, vertico, consult...


For me, it's so fundamental that I need this functionality in every piece of code/script I write.

I finally got to a working version of a very small take on it.

As always, it's pure shell, but a few advanced features give you a succinct function that works predictably. The /dev/tty part was the trickiest to get working, as I didn't know how to get input from a pipe and, at the same time, read keystrokes from the keyboard.



Monday, September 4, 2023

Yet another example of "code the stupidest thing first"

I've read a post called "Code the shortest path first", and I couldn't agree more, really. The core idea is well known to the reader of this blog (me, mostly). It's not even about YAGNI, or "The Wrong Abstraction". It's about having that "tracer bullet" or "steel thread" that guides you through the implementation phases. Also, if the feature looks too big from the beginning, you can try "SPIDR".

Also, what I find is that my first code is usually highly compressed, and that gives the implementation a first soul that lives on through its later phases.

Thursday, August 31, 2023

postgres jsonb_diff and array_unique

I recently copypasted a couple of functions from stackoverflow for my postgres projects that are worth mentioning, because they show a pattern for writing plpgsql functions:
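The originals came from StackOverflow; here's a minimal sketch of the jsonb_diff shape (not the exact copied version) just to show the plpgsql pattern of looping over jsonb_each and accumulating a result:

```sql
CREATE OR REPLACE FUNCTION jsonb_diff(a jsonb, b jsonb)
RETURNS jsonb AS $$
DECLARE
  result jsonb := '{}'::jsonb;
  k text;
  v jsonb;
BEGIN
  -- Keep the keys of "a" whose value is missing or different in "b".
  FOR k, v IN SELECT key, value FROM jsonb_each(a) LOOP
    IF (b -> k) IS DISTINCT FROM v THEN
      result := result || jsonb_build_object(k, v);
    END IF;
  END LOOP;
  RETURN result;
END;
$$ LANGUAGE plpgsql;
```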



Clojure and aplisms

The more I do APL and K, the more I see forks and hooks everywhere else.

That's just a couple of little helpers I've added in some clojure projects. One is "apply-if", which, more than from APL, I'm taking from any procedural language where you can do `x = if p(x) then f(x) else x`. We used to do this a lot in lua (using ands and ors), but here in a combinator style.


Second one is an attempt at the fork pattern.

Look at how the avg looks: it's verbose, but it captures the idea very close to how APL does it.



Tuesday, August 22, 2023

2 classes of system designers

Here's a devastating quote from Andy Kinslow (NATO Software Engineering Conference, 1968):


There are two classes of system designers. The first, if given five problems will solve them one at a time.

The second will come back and announce that these aren’t the real problems, and will eventually propose a solution to the single problem which underlies the original five.

This is the ‘system type’ who is great during the initial stages of a design project. However, you had better get rid of him after the first six months if you want to get a working system.


I found it in the Software Engineering conference pdf, but I didn't know anything about that "Andy Kinslow". Looking for that person led me to some interesting findings, like an APL paper, and finding out that Andy Kinslow pioneered work on time-sharing systems, following leads of Corbató and Strachey.

Wednesday, August 16, 2023

Copy table from another db in postgres

This one is somehow a followup to "Duplicate row in plain sql". What I needed in this case was to copy a table from one database (db_dev) to another (db_test).

Both dbs live in the same physical postgres, and the main user has rights over everything, so the process was painless and took five minutes to set up (accounting for reading the docs).


The trick is to use postgres_fdw and create a schema "dev" inside db_test, so I can access the tables in db_dev.
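That setup can be sketched like this (server, host, and user names are made up for illustration):

```sql
create extension if not exists postgres_fdw;

-- Run inside db_test: point a foreign server at db_dev on the same cluster.
create server dev_server foreign data wrapper postgres_fdw
  options (dbname 'db_dev', host 'localhost');

create user mapping for current_user server dev_server
  options (user 'main_user');

-- Expose db_dev's public tables under a local "dev" schema.
create schema dev;
import foreign schema public from server dev_server into dev;
```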


Now you can easily do your  `insert into table_to_copy select * from dev.table_to_copy;`

Thursday, June 29, 2023

Duplicate row in plain sql

Another no-magic-here post, but quite a useful one when working with data in dev and needing to massage it.

Have you ever found the need to duplicate a row, only with a slight modification? SQL doesn't have a standard way to do it, and while some rdbms accept some sort of "SELECT * EXCEPT(id) FROM xyz", you might not have it available in your db (postgres doesn't).

The clearest solution I found is here:

It's all about creating a temporary table, and using "like" and "insert into" to avoid having to write out all the fields:

CREATE TEMP TABLE tmp (like web_book);
INSERT INTO tmp SELECT * FROM web_book WHERE id = 3;
UPDATE tmp SET id = nextval('web_book_id_seq');
INSERT INTO web_book SELECT * from tmp;


In fact, we could simplify it further (at least in postgres) with:

CREATE TEMP TABLE tmp as SELECT * FROM web_book WHERE id = 3;
UPDATE tmp SET id = nextval('web_book_id_seq');
INSERT INTO web_book SELECT * from tmp;

Friday, June 23, 2023

"delete limit X" is not possible in postgres...

And how to fix it.

 

So I was pretty sure that I had done some sort of large-ish destructive (as in side-effect-y) operations in batches of small numbers.  And today I had to delete a bunch of rows from a bunch of tables.

It turns out that postgres doesn't support "LIMIT" in UPDATE or DELETE statements.

First of all, the misunderstanding: what I remembered as doing destructive operations in batches came from calling postgres functions via select.

Let's say you have to kill the db connections to a db:

SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname='db_1234567890abcdef';

This being a select, you can do things like:

SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname like 'db_%' limit 100;

So yes, you are batching operations, but it's because "selects" are not always side-effect-free.

So now, back to our real problem of doing updates or deletes in batches.

It's not that it hasn't been asked or thought about. Here's one of the mailing list threads about it. Tom Lane's answer highlights some of the issues from the correctness perspective.


Ok, but if I really want to do it... what's the best way? Crunchydata has a blog post explaining multiple ways of doing it, but the main takeaway is to use CTEs to select the amount you want (using limit), and then delete by id "where exists" or "where id in".

WITH rows AS (
  SELECT something
  FROM big_table
  LIMIT 10
)
DELETE FROM big_table
WHERE something IN (SELECT something FROM rows);

Cool stuff. Nothing magical, but it's a good pattern to be familiar with.
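To actually delete everything in batches, one way (a sketch; it assumes big_table has an id column) is to wrap that CTE in a loop and stop when a batch comes back empty:

```sql
do $$
declare
  n bigint;
begin
  loop
    with batch as (
      select id from big_table limit 1000
    )
    delete from big_table where id in (select id from batch);
    get diagnostics n = row_count;
    exit when n = 0;
    commit;  -- transaction control in DO blocks needs postgres 11+
  end loop;
end $$;
```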

Wednesday, May 17, 2023

Naming, Notation, and Idioms conventions

Variable naming is a big point of discussion in communities, projects, PRs...

I'm on the "terse" side of the spectrum, and I think variable names often hinder more than help. I lean toward "more inline anonymous things" rather than "everything-has-to-have-a-very-specific-name-in-the-toplevel-of-a-file". Of course names should convey their meaning, and if they fail to do so it's bad naming, but I want to talk about the more subtle process of choosing active/passive names, short/long names, etc., which conveys how one thinks about the code while navigating it.

Here are some of the concepts that inspire me on naming variables and organizing code "on the small".

Miscommunicating via overconstraining and overcommunicating: Sometimes we create a piece of code for a particular use case, and the more descriptively we name the variables, the less obvious it is that the functionality can be used in other situations. In strongly typed languages this interchangeability of use cases and discoverability is already killed by the language "features", but in the languages I like to use it's perfectly ok to apply the same functions to objects of different semantic meanings, even of different type or structure.

Huffmanization: The "huffman coding" principle, meaning that things that are mentioned more often should have shorter names.

Perl: Has so many good tips about naming and how to understand the dimensions of concepts in programming. To name something, "topicalization" is introducing a topic you want to talk about, and then being able to refer to it with another (shorter) name (usually a pronoun) for some time, until that topic is out of scope.

APL, SmallerCodeBetterCode, Suggestivity: Showing the guts of the algorithms inline has its benefits (see the Notation as a Tool of Thought paper), and the apl family is a great showcase of that. Variable names and the symbols used for operators have great importance there too. Also, not binding names keeps the algorithms generic and reusable (at the human level, binding an algorithm not to a use case but to the concrete thing it does to a data structure has way more suggestivity than a triple nested structure with its own accessors).

K: The great k style.pdf has so much unconventional knowledge that I can only suggest you go and read it with fresh and curious eyes. It's so bold that it's what got me into k. Also, see "Stages of denial encountering K".

fast.ai naming conventions: Jeremy Howard's insights on programming are, again, a bit controversial, but his tips have always been enlightening. If he says X, it doesn't mean X is the only truth, but it means you should look at X with fresh and curious eyes, and maybe you'll discover a treasure. He comes from an APL and Perl background, and writes such nice terse code that I can only agree with what he says over and over. Look at the fast.ai style conventions and the fast.ai abbreviations list. "One line of code should implement one complete idea, where possible"

Picolisp: I don't know how Alexander Burger chose all the names for picolisp's core functions, but there are lots of cool ideas both in the namings and in how the language api makes the code flow in a very natural way (for example, @, or the `do` flexibility). Picolisp code tends to be very dense. Not apl/j/k-dense, but very dense still.

k/q: Also, lots of names to pick from, and the information density is pretty high. From k I picked up the massive overloading of functions to do "the right thing" depending on the types. I've never seen such a compact core, and it's funny to think that the functions and their overloads seem quite intuitive after having learned them for some time. I never thought I'd say that, but here we are.

Forth and Factor: A bunch of words to pick from. If you're looking for a word, chances are there's a Forth or Factor dictionary with a related concept: dup, dip, swap, bi2... If you get familiar with those, there's a bunch of names and name compositions you can start using in your code, and things will just make sense.

Clojure: Clojure is of course very nice namewise, and has very carefully chosen words. Even the threading macros, which are not names, but allow writing code in a way that leans toward a very flow-y style. I never understood why `assert` doesn't return the first value when it doesn't throw, but apart from that, no complaints.

Red: The Red language is the one I know the least on this list (maybe along with forth/factor). The language has such impressive properties, not found anywhere else that I know of. In the link I pasted there's a function reference. Also, lots of interesting names to pick from.

Unconditional Code: Michael Feathers gave a very down-to-earth talk in 2018 about Unconditional Code. The concept is not the same as the "anti-if" movement from the smalltalk community. In this case, Michael proposes solving corner cases of our code by either changing the specs a little (often making the code more generic), assuming sane defaults (see Sussman's NYC Lisp talk where he talks about flight autopilots, or John Ousterhout's A Philosophy of Software Design), and sometimes looking for generality in the algorithm by exploiting data properties (an example of this is making functions always process lists of things, and allowing empty lists so the function does nothing).

Other notations

Friday, May 5, 2023

Where to stop in the extensibility ladder

Here's a snippet where I explored different places where to "stop" on the ladder of extensibility/abstraction/suggestivity. It's at a very small scale, but those things add up.

Code is in clojure, which makes it even easier to mix and match approaches, because symbols and sets are functions (like K, where list access and function application look the same).



Friday, April 14, 2023

SQL UNNEST to iterate over an array

I'm a very happy user of the supa_audit library. This very small sql library leverages some old postgres common knowledge about triggers and audits, in a very nicely packaged form, so you can track down changes to tables or to rows in tables.

You can easily track changes in table "foo" just by running 

SELECT audit.enable_tracking('foo'::regclass)

I started with a very naive way of tracking down objects, identifying them via `table_name` + `record->>'id'`. That sort of works, but it doesn't scale at all: it requires full scans for everything, and detoasts `record` when the record is big enough.

supa_audit provides this `to_record_id` function that generates the indexed record_id value for a given row. What happens if we want to run it for 2 ad-hoc values?

Well, there's `unnest`, which is very cool. It takes an array and converts it to a table-like relation, so you can use it in a `from`. That way, we can process N values in one sql go with the following:

select audit.to_record_id('tablename'::oid,'{id}', jsonb_build_object('id',id))::uuid as record_id from unnest(?)

Nice trick, right?

Wednesday, March 29, 2023

State machines

 I love state machines, and I use them in real world scenarios at least once a year.

The more I use them, in different languages and different situations, the more I arrive at the "it's just an if" conclusion.


In this HN post about xstate, there's a minimalistic implementation in swift which I find a very nice distillation of how I usually use state machines.

switch [current_state, event]

  case ['foo', 'bar'] -> ...


And that's 90% of it.
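That [state, event] dispatch can also live in a plain lookup table; a minimal sketch in SQL (the table, states, and events are made up):

```sql
create table fsm_transition (
  from_state text,
  event      text,
  to_state   text,
  primary key (from_state, event)
);

insert into fsm_transition values
  ('unpaid', 'payment_received', 'active'),
  ('unpaid', 'grace_expired',    'cancelled');

-- Dispatch: look up the next state; no row means the event is invalid here.
select to_state from fsm_transition
where from_state = 'unpaid' and event = 'payment_received';
```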

I've used ruby's statemachine, clojure's tilakone and clj-statecharts (there's also this new maestro lib which I haven't used yet), and home made lua ones. 

But the essence is having a clear distinction of where I am, what I get, and what I have to do. For the "what do I have to do" part, here's what I usually do:

- Updating the state of an object (usually the one bound to the state machine) is fine.

- I tend to run actions only for things that need to be immediate. Sending an alert in a UI has to be synchronous I guess.

- But if I can, I like to run some idempotent process that will run the actions. That process should notice that the state machine has stuff to do. That process is the one that has the knowledge of actions that should happen in XYZ situation. This process runs them, and logs them somewhere, so that next polling iteration doesn't redo them. These actions are things like "cancel a subscription after being unpaid for x days". The state machine will set the status to "unpaid", and set the cancel-at, but the polling process will check for "unpaid" subscriptions with cancel-at<now() that are still running, and kill them for good.


Further reading on xstate and fsm and one/two/three level fsms:

- http://www.findinglisp.com/blog/archives/2004_06_01_archive.html

- https://www.industriallogic.com/patterns/P22.pdf

- https://hillside.net/plop/plop2003/Papers/Adamczyk-State-Machine.pdf

Sunday, March 26, 2023

Finding Approximately Repeated Patterns in Time Series

I found this talk about time series, which seems super cool, about the smart things people can do with them:

https://www.youtube.com/watch?v=BYjOp2NoDdc

And here are some more details in time series discords

Sunday, March 12, 2023

Johnny.Decimal

From this HN thread I just discovered a file/info organization system called Johnny Decimal.

And it looks very very sane. And I think it can be easily combined with the person-action-object memory system I'm already using.

Going to try it and report back.

Thursday, February 16, 2023

SQL window functions

SQL's window functions are such a huge topic that some see them as a DSL inside SQL itself.

You can do amazing things in sql without window functions, but when you need sophisticated rankings, or sorting subgroups, or you feel you are struggling with `group by + having + order by`, window functions are usually the answer.
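A taste of the kind of thing that's painful with group by but natural with a window function: getting each customer's latest invoice (a sketch; the table and column names are made up):

```sql
-- Rank each customer's invoices newest-first, then keep rank 1.
select * from (
  select customer_id,
         invoice_id,
         created_at,
         row_number() over (partition by customer_id
                            order by created_at desc) as rn
  from invoice
) ranked
where rn = 1;
```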

 To me, Bruce Momjian's talk about window functions has been my default resource whenever I need a refresher.

But today I found a teaser of an upcoming book on window functions that looks very, very good. I'll keep an eye on that page. Also, the pic is from that site, so it deserves a link from my high-traffic blog (you're welcome).


Monday, February 13, 2023

Duckdb news and posts, early 2023

Motherduck has lately been increasing their activity on the net, and they recently published a couple of interesting blogposts:

One with the provocative-cliché title "Big Data is dead", which got quite a bit of traction on HN, and the other a guest post, much more practical and "fun": "solving aoc with duckdb".

If you follow what happens in duckdbland, you'll know that the duckcon happened during FOSDEM2023. Here are the videos

One of the talks (not in duckcon, but in FOSDEM) is about python+duckdb. It's very nice to be able to use pandas dataframes from duckdb with the zero-copy approach.

Another talk is about duckdb extensions. Very DWIM, and in general the practical approach the team has is quite refreshing. Unfortunately, I don't do C++, so writing extensions is out of my league. But I started thinking about what it would take to be able to write extensions in Lua(JIT?). Probably a low level wizard could make it happen quite fast. But performance wise... no idea.

Also, the biggest news: DuckDB 0.7.0 is out!


Tuesday, January 17, 2023

DuckDB's arg_max

In kdb+ there's a great way of joining data called `aj`, for asof join. It's a neat way to get the "latest record" in a join.

These kinds of operations are already quite painful to do in plain sql when grouping. Ever had to get the latest invoice of each customer? Probably you had to nest a query on top of another one, or use a fancy "distinct on".

I realized, though, that some analytical engines have something called `arg_max` that can alleviate the pain of nested group bys. And Duckdb already implements it! (discussion here)

First, let's create a table with a few values ready to group. Then we'll want to get the id of the max of every group. At that point, arg_max just does the right thing.
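A minimal sketch of that (the table is made up; DuckDB's arg_max(arg, val) returns arg for the row with the maximal val):

```sql
create table payment (id int, customer text, amount int);
insert into payment values
  (1, 'alice', 10), (2, 'alice', 99), (3, 'bob', 42);

-- For each customer, the id of their largest payment.
select customer,
       arg_max(id, amount) as id_of_max,
       max(amount)         as max_amount
from payment
group by all;  -- DuckDB infers the grouping columns
```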

 

Btw, look at that nice DuckDB `group by all`.