miércoles, 21 de diciembre de 2022

Trying things out with duckdb's dbgen

REPLs are great, but they are only a part of the "have fun" experience.

Sometimes, you need data to work on a repl and to be able to simulate the "real world".

In clojure, I've used specmonstah and I quite like it. It can build nested structures nicely and once you configured the db schema and the specs of the individual properties, you can use it as a building block from your code, to generate trees of data.

For sql, I got pretty wild with generate_series, and `create table XXX as select` and you can get quite far with those. 

But recently I discovered `dbgen` function in DuckDB and I'm pretty happy with it: this function (that comes bundled in any duckdb binary) can generate TPC-H sample databases, including (scale factor as a parameter!).

I find myself using it more and more in order to test some sql I have in mind that I'm not sure if it would work or not.  It's really good that it gets you from "I wonder what would happen if...." to "aha, I see" in a shorter time than any of my previous approaches to scaffolding sample data for sql tests.

The usage is pretty simple:

`call dbgen(sf=0.1)`, and off you go!

domingo, 18 de diciembre de 2022

Going meta until it hurts, with Brian Cantwell Smith

I realized today I haven't mentioned Brian Cantwell Smith in this blog before.  I'm not intimately familiar with his work, but I've read some parts of his papers, thesis, and talks.

And I can tell you, dear reader... if you're into meta/circular concepts, you'll have a great time reading his work.

Here are some links. The first one is a new talk by him, and later are works I've read and caused some impression on me. Look for more of his talks. All of them are pretty deep, and not " run on the mill" talk you find in youtube.

  • https://student.cs.uwaterloo.ca/~cs492/11public_html/p18-smith.pdf <- limits of correctness . I loved it. it's the most light of the links here
  • https://www.youtube.com/watch?v=0eE-CTX96v8
  • https://www.ics.uci.edu/~jajones/INF102-S18/readings/17_Smith84.pdf
  • https://dspace.mit.edu/handle/1721.1/15961

 I find these themes pretty dense, but it's enjoyable to have a peek from time to time. Similar to Dennet's, Spolski's, and Hofstadter's


Have fun


Update primary key and associated foreign keys in sql

If you have two tables with a foreign_key relationship, and you want to update the value of a primary key, postgresql will fail, telling you that there are associated tables with foreign keys to that value, so you can't break the consistency.


update subscription set cus_id=2 where cus_id=1;

ERROR:  23503: update or delete on table "subscription" violates foreign key constraint "xxxxxx_id_fkey" on table "xxxxx"
DETAIL:  Key (id)=(1) is still referenced from table "xxxxx".
LOCATION:  ri_ReportViolation, ri_triggers.c:2797

Oh, I know what I'll do (you think), I'll change both inside a transaction:

begin; update subscription set cus_id=2 where cus_id=1;...
ERROR:  23503: update or delete on table "subscription" violates foreign key constraint "xxxxxx_id_fkey" on table "xxxxx"
DETAIL:  Key (id)=(1) is still referenced from table "xxxxx".
LOCATION:  ri_ReportViolation, ri_triggers.c:2797

Oh, the same :( So even inside a transaction, we can't have inconsistent states.

So.... The trick to let postgres update both columns of the two tables at once is to use a cte. Common Table Expressions have many different uses, and one of them is to bundle multiple queries in a single one. When used with update/inserts it allows these kinds of tricks.

with c as (update customer set id=42 where id=1),
        s as (update subscription set cus_id=42 where cus_id=1)
select 1;

As a_horse_with_no_name explains here:

As (non-deferrable) foreign keys are evaluated on statement level and both the primary and foreign key are changed in the same statement, this works.

lunes, 12 de diciembre de 2022

Sum scan in sql

`scan` is an adverb in apl/j/k which is used to reduce an array over an operation, while keeping all the intermediate results. It's used in lots of very creative ways in the array programming languages. Here's some sample examples of "scan (\)".

:x:1+!9
1 2 3 4 5 6 7 8 9
+\x
1 3 6 10 15 21 28 36 45
*\x
1 2 6 24 120 720 5040 40320 362880

In most of those programming languages, scan acts as an adverb, which is a higher order function that modifies a verb. In clojure, you can get similar results with `reductions`.

Anyway, the topic of today is how to accumulate intermediate results like these, in sql.

Well, in the AoC 2022 (day 1), the first part was about summing numbers by paragraphs. 

Some people at duckdb discord are sharing their solutions, and we see it's kinda normal to use `row_number() over() as idx` idiom to add ascending numbers to any projection.

Another trick, to indicate the paragraph number in each line (for future group_by purposes), this is what Teggy writes:

 `select sum(i.value is null :: int) over (order by num)`

Here I show generate_series, row_number, and the sum scan based on "is null".


Doing mundane things with SQL and Duckdb

I've been learning more and more sql, both the language, the implementations, and as always, looking for the extreme cases of "you shouldn't be doing this in sql, but here we go...", and diving into them as they usually are the more fun and enlightening parts.

One db I'm very fond of is duckdb. The philosophy behind it just clicks with me:

In-process, OLAP, zero-deps (tinyverse), quite a lot of innovation, but also lots of practical stuff in a very dwim style. 

Installing it is a breeze. And it works. I couldn't make it "not work" even if I tried (I'm using nix, and I know plenty of ways of making things not work).

- `pip install duckdb` works. It gets the binary along the way, and it works. In nix! (good luck with this approach installing any ML library with binaries)

- Put it in your `deps.edn`, getting it from maven: `{org.duckdb/duckdb_jdbc {:mvn/version "0.6.0"}`.  

- nix-shell -p duckdb.  Works also.

Once there, duckdb has plenty of cool dwim-y features, like getting a parquet file from an s3 bucket. Like that. I mean.... When we're typically used to a super slow iteration speed to bring the data close to the db, and then, the iteration of create table, drop table, create table.... duckdb does it with just a function call. Today we'll be talking about `read_csv_auto`.

Even in the simplest version, being able to read a csv into a table is already great.

For comparison, recently I wanted to import the data from https://github.com/cathytanimura/sql_book/ , from a very good SQL book I'm reading nowadays. To figure out a way to import the data from all chapters in a postgres, I ended up writing a docker-compose , found some bugs in the repo (which I PR'd fixes for), and ended up having my data loaded in postgres.

Today, I wanted to try something on duckdb. The thought process went like this:

"Oh, I wonder if [some sql question] would work. I could try it with the data from that sql analytics book.  Oh, I think I wrote some readme on how to load the data... (15 minutes later)... where am I again?"

 Then, I thought...  Oh wait, maybe duckdb...

1 minute later:

 


 So yeah....

`select * from 'https://raw.githubusercontent.com/cathytanimura/sql_book/master/Chapter%203%3A%20Time%20Series%20Analysis/us_retail_sales.csv';` just worked.

Internally, this used the httpfs extension, and used read_csv_auto, to figure out everything.  Read more in the duckdb docs.

martes, 11 de octubre de 2022

Oral History of Dan Ingalls

Dan Ingalls and Brian Kernighan are the few people that I can listen for N hours straight, no x2, just x1, and just listen and be in awe.

 

There's this new interview to him, and I think it's amazing. Nothing to do with Alan's talks. The speed, the body language... many different things, and equally amazing.

https://www.youtube.com/watch?v=EMwHeqQZgFw

 

Btw, at 2:30 is when he talks about Steve Job's demo.

Also, 2:45 or so, squeak implementation and portability, blowing your mind.

jueves, 6 de octubre de 2022

jq assignment of multiple fields

Jq has become the de-facto json console tool. Apart from slicing and dicing your jsons, one can do things like assigning new fields to it, or modifying existing ones.

As I explain in my scripting field guide, changing epoch times to a readable timestamp can be done like:

echo '{"date":1643192480}' | jq '.date|=todateiso8601' #  {"date": "2022-01-26T10:22:48Z"}

 

To assign multiple fields in jq, you must interleave '|' inbetween, effectively piping the result to the next assignment.

And here's the bash snippet of the day:

join_by() { local IFS="$1"; shift; echo "$*"; }
pretty_dates() {
  jq "$(join_by "|" ${@/%/|=todateiso8601})"

`curl http://..... | pretty_dates .start_date .end_date` will build the proper assignments and use jq so that we get a nicely formatted json output.

Again, for small helpers like these, a bit of advanced bash can get you very very far.

 

The explanation of the weird line is: from the default (${@}) params as an array, append every element (/%/) with |=todateiso8601). That, joined by |. And that becomes the "filter" to jq.


martes, 13 de septiembre de 2022

Reinventing the Queue

Yesterday, this HN thread Write your own task queue got me thinking again on the NIH and "build vs buy". 

I'm sometimes an advocate for using own implementations of subsystems if you don't need the whole complexity of premade solutions. But...

I usually wouldn't write a task queue, as this is rarely my core business, and that I don't think my needs are anything custom, but I'll take whatever comes with the package.

That said, I've built one task queue in the past, the excuse being I was working on CommonLisp, and the platform already used all the AWS stack: (DynamoDB and Kinesis Streams) for similar things, and no ready-made solution was available. 

In that case, workers were really an active object with an active loop that kept polling the correct stream, and when a job came, an attribute in a dynamoDB table would be used to track the status. It was simple, and it allowed for reimplementations of the queues for local testing. So yeah, I've done it also...

But I can't see myself doing it when using Ruby/Python/... I feel it's very rare I'm gonna have a need that is not covered in your run off the mill job system.

Anyway, wakatime did the same thing, they implemented their own queue. And they all seem to complain about celery.

And more recent projects that approach queues:

- https://github.com/drpancake/chard. async, no threads/process

- https://github.com/thread/django-lightweight-queue. django tailored one.


sábado, 3 de septiembre de 2022

Ratpoison news in 2022?

I've been using the same obscure window manager since about 2003. I remember using it in 'Fedora Core 2'(released in 2004).

Lately I've thought about moving to something that behaves better with Zoom, but I really couldn't find an alternative.

 

There's https://github.com/jcs/sdorfehs , which comes from ratpoison, and it's maintained by jcs. It has the latest ratpoison patches that didn't make it to the last stable ratpoison release, and a few new features.  Well.... it turns out one of the new features breaks my workflow, so for me, this is a "nope".

I also thought about using evilwm, which is a floating wm with ratpoison shortcuts (more or less). But I do like tiling.  So maybe the combination of ratpoison and evilwm (via tmpwm) would be the sweetest spot. I remember using that config years ago, but now you have to make sure to have the latest ratpoison built from git, because tmpwm breaks on multiple screens.

Something I discovered just this week is this youtube channel from a user called root_sti. The videos are great(great music), showing neat configurations and a million scripts and tunings, using dzen2, fzf, evilwm, ratpoison, and voidlinux. And all configs are here. Lots of shellscript there :).

Maybe using sxhkd and emulating the shortcuts via wmctl... not sure

Also, a funny video just from a few minutes ago in hackernews. The first tiling window manager (1988)

viernes, 26 de agosto de 2022

Vim and Vscode to camelcase

Earlier today I saw a post on vscode subreddit where the author explores a plugin to transform parts of the code from/to camel/snake/kebab case.

Automatically I started thinking how I'd do it in emacs/vim. Given I'm lately not doing so much text munging as I used to, I feel a bit rusty wrt editing-fu. Let's see...

In the video, the guy uses something like multiple-cursors to select the area from the beginning of line till the colon. That was the most difficult thing to me, because neither vim nor emacs have multiple cursors without installing a plugin.

So, I though there had to be a "declarative" way (whatever it means) to do it, and thought of a regexp that would convert the _x to X , for all _ before a colon.

It turns out I could do it quite successfully with vim, just researching a bit how to do lookaheads.

I took 

foo_bar_f: foo_bar,
foo_bar: foo,
foo_bar_: foo  

as my test case, because it has underscores in both sides, it has missplaced underscores, and different lenghts (so we can't use visual block)

This is what I came up with. Visual select and

:s/_\(.\)\ze.*:/\U\1\L/g

I was a bit surprised that it took me a bit longer than expected, and that I had to use a few "advanced" regex features.  

I haven't tried to do it in emacs yet.  Any ideas? Any simplification to my solution?

viernes, 19 de agosto de 2022

java time

This is so helpful that I'm just putting it here for future reference.

https://stackoverflow.com/questions/32437550/whats-the-difference-between-instant-and-localdatetime

viernes, 24 de junio de 2022

A better watch: viddy

I just found this thread in HN, and I think I'm sold already on aliasing watch to this improved version of the watch command. https://news.ycombinator.com/item?id=31829343

 

Also, nice trick of aliases, where

A trailing space in value causes the next word
        to be checked for alias substitution when the alias is expanded

jueves, 19 de mayo de 2022

Hugo documentation

Some time ago I created a new homepage for https://raimonster.com, where I was supposed to migrate my blog to. The migration never happened really (that's why I'm writing this post here), but I learned hugo on the way, and that was already something. 

During the learning process, it took me quite a bit of time to understand how hugo would render the site, what were the templates, the posts, the index pages, the home page, and how they "yielded" from one to the other. I even wrote a post about it. It was nice to find I was not alone finding the docs confusing, and this link explains why it is so confusing. Reading that post was like reading my own mind, explaining the reasons of the confusion.

martes, 26 de abril de 2022

Parsing args with python (alt version)

Parsing cli arguments is a solved problem. Every language has a library to get the flags from your command line. Maybe not in the stdlib, but for sure there blessed libs for it.

In Python, argparse is the official way. It's very complete and everything, but I've found it's api quite hard to follow, and I've read code that feels like written in a copypaste style, and instead of giving you more intuition on the program as a developer, it hides the knowledge of which flags go where, and which defaults they have, and which options make sense to each subcommand.

I recently rewrote a tool that had exactly that problem. Unused flags that no one removed from the argparse parsing code, hidden defaults...

When I rewrote that code, I started with this yosefk post in mind (and some reminiscence of picolisp's cli handling). I'll just do the simplest thing. Even against the standard official way (so kids, don't do it at home, or yes, do it only at home. Or not).

Here's the first version


What's already something super good is that there's 0 overhead for getting your parameters.  Of course, the parameters are not processed at all, so you'll have to cover for defaults or wrong number of args. But the language already is going to help on some of those. It will just complain if you try to call testing without an argument.

Another nice thing is that you can predict what's going to happen without checking any docs or anything. For someone that doesn't create cli apps every day, this is a nice pro IMHO.

Second version:


What can we say about this? This one takes --flag=value, and you can pass them in the order you want because the nice splat operator will take care of it. It's very rewarding to use the language features to work alongside your goal.  **kwargs, does also a great job, providing for optional parameters, that every function will make sure to validate (It should be their job anyway, not the argparser's IMHO).


Of course, in the end, the overengineering takes over:


And we end up feeling envy from --help. In this case we also can use reflection to fill in the parameters and docs from the signatures and docstrings.

That was a nice exercise!

(In the end, I just went with the first solution, because KISS and YAGNI and Suckless.)





jueves, 17 de marzo de 2022

SQL subselect arbitrary field from arbitrary table

Imagine you used to have `select f1, f2 from t1`. But now f2 becomes a fk to a table t2, so now we have f2_id that points to t2 (with fields id, name) where t1.f2_id = t2.id .


In my case, I only had access to the select fields list (the 'from t1' was already fixed, and I just could modify the retrieved fields), so I couldn't do an actual join, BUT! I found you can do a similar thing to an sql join for individual fields, adding a subquery as a field in the field_list.


select t1.f1, (select name from t2 where  t2.id=t1.f2_id) as f2 from t1 


Really really nice to know that t1.f2_id can be gathered from the subquery. The subquery will be triggered once per t1 row, so it'll probably be slower than a join, but check your EXPLAIN ANALYZE, and see if the cost is bearable, or even if it has any impact at all for your use case.

Some more explanations in here, or here.

viernes, 11 de febrero de 2022

Read-only psql

This is how I wrote a short snippet to open read-only postgres connection using psql.  Notice the amount of non-trivial stuff in those lines (technically, only one line of code).

 

I tried to show many concepts in those short snippets I post around here. 

  • For example, notice how some lines need backslashes to continue the line, but some others not. 
  • Also notice how to use <() to create a temporary file, so we don't have to have a duplicate file (probably outdated). 
  • The command being a "cat originalfile && echo "customline") makes it a fresh file each time, that inherits from your originalfile. 
  • Finally, good old "$@" to proxy the parameters to another command.

in zsh, you can add "compdef ropsql=psql" so it also proxies the autocompletion

sábado, 22 de enero de 2022

Postgres AoC

Here's a link I don't want to lose, a thread about amazing people doing Advent of Code in Postgres. If you ask 'why', you're probably not the good audience for that anyway.

https://news.ycombinator.com/item?id=29467671

sábado, 15 de enero de 2022

Classless CSS

So, this is really a thing: Like, IDGAF about CSS, but want my site to look slightly more modern than danluu's page.

https://news.ycombinator.com/item?id=29929438

 

Nice, nice, nice