Sunday, October 4, 2009

Perl 5.10 regexes

Perl 5.10 has been around for almost two years (it was released in December 2007), but I have mostly used it as if it were 5.8.8: I didn't think that, at my level of Perl, two versions so close together would make any difference.

I've now looked at some of the improvements over 5.8.x, and I think some of them are of interest even to newbie/intermediate Perl programmers like me.

Say "eeeoo"

OK, this is probably not a feature that will make you migrate to 5.10, but it's there, and starting to use it is trivial. In short: for all x, say x is the same as print x, "\n"; it's shorter to write and you don't have to type the "\n". It's like writeln or println in other languages. The only catch is that you have to enable it first, with use feature 'say'; or use 5.010;. We're done with that.
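A minimal sketch to check the equivalence above. The captured() helper is something I made up for this example (it is not part of Perl); it just collects whatever a piece of code prints so that both outputs can be compared.

```perl
use strict;
use warnings;
use feature 'say';   # say is off by default; "use 5.010;" also enables it

# Hypothetical helper: run some code and return what it printed.
sub captured {
    my ($code) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer or die "cannot open in-memory handle: $!";
    my $old = select $fh;   # print and say now write into $buffer
    $code->();
    select $old;            # back to the previous output handle
    close $fh;
    return $buffer;
}

my $with_say   = captured(sub { say 'eeeoo' });
my $with_print = captured(sub { print 'eeeoo', "\n" });

say $with_say eq $with_print ? 'same output' : 'different output';
```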

Defined or

I love shortcuts, from ||= to 'open or die'. I think they are a very elegant way of avoiding an extra if.
As you know, $x ||= 'foo' is the same as $x = $x || 'foo', so if $x evaluates to false, $x becomes 'foo'. The problem is that in Perl 0, '' (the empty string) and undef all evaluate to false, so ||= also overwrites values such as 0 that may be perfectly valid. The new // operator (defined-or) and its assignment form //= test for definedness instead of truth: $x //= 'foo' assigns 'foo' only if $x is undef.
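To see the difference between ||= and //= side by side, here is a small sketch. The %config hash and its keys are made up for illustration.

```perl
use strict;
use warnings;
use 5.010;   # // and //= need Perl 5.10 or later

my %config = (retries => 0);   # hypothetical settings; 0 is a valid value here

# ||= treats 0 as "missing" and overwrites it.
my $retries_or = $config{retries};
$retries_or ||= 3;

# //= only fills in the default when the value is undef.
my $retries_dor = $config{retries};
$retries_dor //= 3;

# The timeout key does not exist, so the default is used.
my $timeout = $config{timeout} // 30;

print "||= gave $retries_or, //= gave $retries_dor, timeout is $timeout\n";
```

With ||= the legitimate 0 is lost; with //= it survives.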

given/when

Finally, Perl has a switch/case statement. Until now, to emulate the switch/case statement of other languages, we could use dispatch tables (which are much more flexible than Java's or C++'s switch). given/when is far more flexible and advanced than the usual cases (which compare with ==). It uses the new smart match operator, which can compare different types of things and does 'The Correct Thing' (tm). It has to be enabled with use feature 'switch'; or use 5.010;.
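A minimal sketch of given/when, to show cases that are not a plain == comparison. The describe() function and its categories are invented for this example. I switch warnings off because newer perls warn that given/when is experimental.

```perl
use strict;
use feature qw(say switch);
no warnings;   # newer perls warn that given/when is experimental

# describe() is a hypothetical helper, made up for this example.
sub describe {
    my ($thing) = @_;
    my $answer;
    given ($thing) {
        when ([qw(yes y true)]) { $answer = 'affirmative' }     # equal to any element of the list
        when (/^\d+$/)          { $answer = 'a whole number' }  # regex match against the topic
        default                 { $answer = 'something else' }
    }
    return $answer;
}

for my $word (qw(y 2009 perl)) {
    say "$word is ", describe($word);
}
```

Each when is tried in order, and the first one that matches wins, so there is no need for a break.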

Smart Match

OK, so what's the smart match operator? It's a binary operator, written ~~, that tries to compare its left operand with its right operand in a DWIM (tm) way.

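For example, $scalar ~~ %hash checks whether $hash{$scalar} exists, and @array ~~ /regex/ greps the array with the regex. (In 5.10.0 the operator is commutative, so %hash ~~ $scalar works too; from 5.10.1 on, the right operand decides what kind of match is done.) You get what I mean... For more info, look at perldoc perlsyn.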
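A few smart matches as a sketch. The data is made up, and I put the container or the regex on the right-hand side, which as far as I know works in 5.10.0 and in later versions as well. Warnings are off because newer perls complain that smart match is experimental.

```perl
use strict;
use feature 'say';
no warnings;   # newer perls warn that smart match is experimental

my %hash  = (foo => 1, bar => 2);        # example data, made up
my @array = qw(apple banana cherry);

my $has_key   = 'foo' ~~ %hash;          # is 'foo' a key of %hash?
my $no_key    = 'baz' ~~ %hash;          # 'baz' is not a key
my $any_match = @array ~~ qr/^ban/;      # does any element match the regex?
my $in_array  = 'cherry' ~~ @array;      # is 'cherry' one of the elements?

say 'foo is a key:        ', $has_key   ? 'yes' : 'no';
say 'baz is a key:        ', $no_key    ? 'yes' : 'no';
say 'something like ban*: ', $any_match ? 'yes' : 'no';
say 'cherry in the array: ', $in_array  ? 'yes' : 'no';
```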

Regexes

Now for what was meant to be the main topic of the post (and will end up being just another section).
As you can see in perldoc perldelta of 5.10.0, there have been many improvements to the regex engine in 5.10. I'll only cover a few of them, basically for two reasons:

1) I don't understand everything there
2) laziness

The most practical new feature is "possessive quantifiers". We know Perl regex quantifiers are greedy unless we append a '?'. A greedy quantifier grabs as much as it can and then, if the rest of the pattern fails, gives characters back until the whole pattern matches. That works, but it involves backtracking that you may not need.

Maybe you want really greedy matches, as in greedy algorithms (once a decision is made, it is never revisited). If you add a plus sign right after any quantifier, you make it non-backtracking: once it has 'eaten' a character, it will not give it back to the following tokens under any circumstances.

I've come up with a silly example. Say we want to match pairs of letters in a string and capture what is left at the end (the last pair, or the last single letter if the length is odd) separately:

$_='a'x6;
m/^(aa)+(a+)$/;
print $2,"\n"; # aa

$_='a'x7;
m/^(aa)+(a+)$/;
print $2,"\n"; # a


What happened in the first case was that the pair matching advanced until the end of the string and, when the engine noticed that the last (a+) had to match something, it backtracked, giving back the last pair, until both conditions were satisfied.

The second case was easier because, when it couldn't match another pair, a spare 'a' was there to fit in the (a+) slot.

Now let's try with ++.

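# a failed match leaves $2 untouched, so test the result of the match
$_='a'x6;
if (m/^(aa)++(a+)$/) { print $2,"\n" } else { print "no match\n" } # no match

$_='a'x7;
if (m/^(aa)++(a+)$/) { print $2,"\n" } else { print "no match\n" } # a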


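As (aa)++ cannot backtrack, there's no match in the first case: it eats all six letters and leaves nothing for (a+). So keep in mind that possessive quantifiers can really speed things up, but they won't match the same things as their greedy versions, so be careful.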
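As a side note, perldoc perlre describes X++ as another way of writing the atomic group (?>X+), which already existed before 5.10. The sketch below runs the three variants on the same strings. The tail_of() helper is made up for this example.

```perl
use strict;
use warnings;
use 5.010;   # possessive quantifiers and // need Perl 5.10 or later

# tail_of() is a hypothetical helper: it returns what the second group
# captured, or undef when the pattern does not match at all.
sub tail_of {
    my ($regex, $string) = @_;
    return $string =~ $regex ? $2 : undef;
}

my $greedy     = qr/^(aa)+(a+)$/;
my $possessive = qr/^(aa)++(a+)$/;
my $atomic     = qr/^(?>(aa)+)(a+)$/;   # the older way of writing the ++ version

for my $length (6, 7) {
    my $string = 'a' x $length;
    for my $case ([greedy => $greedy], [possessive => $possessive], [atomic => $atomic]) {
        my ($name, $regex) = @$case;
        my $tail = tail_of($regex, $string);
        printf "%-10s on %d letters: %s\n", $name, $length, $tail // 'no match';
    }
}
```

The possessive and atomic versions should agree on both strings, and only the greedy one finds a match on six letters.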

Another new regex feature is named captures. Now you can name captures with (?<foo>pattern). After doing the match, you can retrieve the captures through the %+ hash: $+{foo}.
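A short sketch of named captures, using the date of this post as input. The capture names year, month and day are my own choice for the example.

```perl
use strict;
use warnings;
use 5.010;   # named captures need Perl 5.10 or later

my $date = '2009-10-04';
my ($year, $month, $day);

if ($date =~ /^(?<year>\d{4})-(?<month>\d\d)-(?<day>\d\d)$/) {
    # %+ holds the named captures of the last successful match
    ($year, $month, $day) = ($+{year}, $+{month}, $+{day});
    print "year=$year month=$month day=$day\n";
}
else {
    print "not a date\n";
}
```

The names make the code easier to read than $1, $2 and $3, and adding a new group in the middle of the regex no longer shifts the numbers you depend on.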

There are other improvements, but I'll leave them for now.

I'll only add that 5.10.1 is already here, and it improves on all this, allowing more fun ways of using smart match and given/when. Perl 5.11 is out too (but it's a development release).

If you're hungry for new Perl things, have a look at Perl 6. If you want more hardcore regexes, take a look at Damian Conway's Regexp::Grammars.

That's all for now. Thanks for reading.