Wednesday, 22 September 2021

Caught by a regex

I'm a big fan of regexes. 

I have yet to find another DSL that has proved so useful in so many situations (all of them text handling, really).

But today I got caught by something I didn't know, and it totally escaped me. It's not even a case of "now you have 2 problems", because the problem at hand was perfectly suited to regular expressions: no capturing, no grouping, no back references, no (negative)?look(ahead|behind). Just a typo gone wild.

It's about `]`, and the fact that when it's paired with an earlier `[`, the regexp matches any one of the characters listed in between. That's known. What's also known is that escaping inside the brackets works a bit differently, and that `-` means "range" UNLESS it's in the first or last position.
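As a quick refresher on those bracket rules, here is a minimal sketch. It assumes Python's `re` module, which the post itself does not name; other engines may differ in the details.

```python
import re

# "-" between two characters is a range, so [a-d] means a, b, c or d.
assert re.findall(r"[a-d]", "a-d z") == ["a", "d"]

# "-" in the first or last position is a literal hyphen.
assert re.findall(r"[-ad]", "a-d z") == ["a", "-", "d"]
assert re.findall(r"[ad-]", "a-d z") == ["a", "-", "d"]

# Most metacharacters lose their special meaning inside the brackets.
assert re.findall(r"[.(|]", "a.b(c|d") == [".", "(", "|"]
```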

What caught me today is that an unescaped `]` works as a literal when there's no matching `[` before it.

  • /(/ -> invalid regexp
  • /)/ -> invalid regexp
  • /[/ -> invalid regexp
  • /]/ -> fine. O_O
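The four cases above are easy to reproduce. The sketch below assumes Python's `re` module (the post doesn't say which engine was involved), and `compiles` is a hypothetical helper written only for this illustration.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

for pattern in ["(", ")", "[", "]"]:
    verdict = "fine" if compiles(pattern) else "invalid regexp"
    print(f"/{pattern}/ -> {verdict}")

# A lone "]" is simply a literal closing bracket.
assert re.search("]", "a]b") is not None
```

Other engines may be stricter or more lenient about an unmatched bracket, so it's worth running the equivalent check in whatever engine you actually use.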

This is so foreign and arbitrary that I couldn't really believe it. I don't know if there's a deeper reason for it being this way, but I can't see why it couldn't behave the same way as grouping parentheses.

It all started with a regex like /[`~! @#$%\^&*()_+={\[}]|:;\"'<,>.?\/-]/.

It is intended to check that there's at least one symbol in a string.

If you look closely, there is no escaping for the ] in the middle. The regexp is still valid and compiles without complaint, but the bracketed part is shorter than you'd think.

To make things worse, right after the ']' there's a pipe symbol (because the two keys are close together on the keyboard). That pipe is no longer treated literally; it now acts as alternation.

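So what do we actually have here?

The regexp matches either a single character from the bracket expression [`~! @#$%\^&*()_+={\[}], or the whole remainder :;\"'<,>.?\/-] taken as a sequence (trailing ] included). This made it even more difficult to find, because some validations passed while others didn't: strings containing a symbol from the first half matched, strings whose only symbols came from the second half did not. If it weren't for the '|' coming right after the ']', all validations would have failed.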
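To see the split in practice, here is a small sketch, again assuming Python's `re` module. `BROKEN` is the regex from the post; `FIXED` is my assumption of the intended pattern, with the middle `]` escaped. `has_symbol` and the sample strings are made up for the illustration.

```python
import re

# Copied from the post: the "]" right after "}" is not escaped.
BROKEN = re.compile(r"""[`~! @#$%\^&*()_+={\[}]|:;\"'<,>.?\/-]""")
# Assumed intent: escape that "]" so the class runs to the final "]".
FIXED = re.compile(r"""[`~! @#$%\^&*()_+={\[}\]|:;\"'<,>.?\/-]""")

def has_symbol(regex, text):
    """Hypothetical validation helper: does text contain a symbol?"""
    return regex.search(text) is not None

BEFORE = "`~!@#$%^&*()_+={[}"   # symbols before the stray "]"
AFTER = "]|:;\"'<,>.?/-"        # the stray "]" and everything after it

for ch in BEFORE:
    # Both versions accept these.
    assert has_symbol(BROKEN, "abc" + ch + "def")
    assert has_symbol(FIXED, "abc" + ch + "def")

for ch in AFTER:
    # Only the fixed version accepts these.
    assert not has_symbol(BROKEN, "abc" + ch + "def")
    assert has_symbol(FIXED, "abc" + ch + "def")

# The second alternative only fires on the whole sequence. Note that
# outside the brackets ".?" is an optional wildcard, not two literals.
assert has_symbol(BROKEN, ":;\"'<,>/-]")
assert has_symbol(BROKEN, ":;\"'<,>X/-]")
print("broken and fixed patterns behave as described")
```

The sample strings deliberately avoid spaces, since a space is itself part of the bracket expression and would make every check pass.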

Well... a two-level fuckup from a single regexp, due to a missing backslash and the US keyboard layout.

How many problems do I have now?
