Monday, February 15, 2010

robots go shopping!


One way to go shopping is spending long afternoons walking among crowds. I imagine it can be less boring than I think, but anyway... I've written a little Perl program that finds cheap books for me and mails me if there's a good opportunity. This makes shopping more fun for sure. :)

I've used HTML::TreeBuilder to parse the HTML. You'll have to take a look at HTML::Element too, since every node in the tree is an HTML::Element.

The site I'm parsing has changed three times since I started writing this script (two weeks ago), so the script will get outdated very soon, but well...

HTML::TreeBuilder treats any given HTML as a tree (huh, really? ¬¬ ) and lets you navigate through it in several ways. A very intuitive way to traverse the tree looking for some tag is look_down. It accepts its criteria as pairs of arguments: the first of each pair is an attribute name, and the second is the value you want to match. The pseudo-attribute _tag matches the tag name itself.


For example, here's a silly snippet that extracts some tables from an HTML file.

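use feature 'say';
use HTML::TreeBuilder;

# parse_file reads from disk; parse() would treat 'file.html' as the HTML itself
my $tb = HTML::TreeBuilder->new();
$tb->parse_file('file.html');
# every <table class="result"> in the document
my @res = $tb->look_down('_tag', 'table', 'class', 'result');
say $_->as_text for @res;
$tb->delete;    # free the tree when done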
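
To make the look_down idea a bit more concrete, here's a rough sketch of the "find cheap books" step. It is not the real script: the markup, the class names (book, title, price) and the $max_price threshold are all made up for illustration, since the real site's structure isn't shown here.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

# Hypothetical markup standing in for the real site's result page.
my $html = <<'HTML';
<html><body>
<table class="result">
  <tr class="book"><td class="title">Learning Perl</td><td class="price">12.50</td></tr>
  <tr class="book"><td class="title">Programming Perl</td><td class="price">39.90</td></tr>
  <tr class="book"><td class="title">Perl Best Practices</td><td class="price">9.95</td></tr>
</table>
<table class="ads"><tr><td>Ignore me</td></tr></table>
</body></html>
HTML

my $max_price = 15;    # assumed threshold, pick your own

my $tb = HTML::TreeBuilder->new_from_content($html);

my @cheap;
for my $row ($tb->look_down(_tag => 'tr', class => 'book')) {
    # In scalar context look_down returns the first match (or undef).
    my $title = $row->look_down(_tag => 'td', class => 'title');
    my $price = $row->look_down(_tag => 'td', class => 'price');
    next unless $title && $price;
    push @cheap, [ $title->as_text, $price->as_text ]
        if $price->as_text <= $max_price;
}

say "$_->[0]: $_->[1]" for @cheap;
$tb->delete;
```

Calling look_down on each row (instead of on the whole tree) keeps the title and the price of the same book together, which is the part that usually breaks first when the site changes.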

The steps are really simple, and this library gives you the power to write web spiders fairly easily. If you're fonder of Python, you can use Beautiful Soup (which has a Smalltalk port too).
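
As for the spider part, a minimal sketch of one step could look like this: collecting the links of a page so they can be visited later. The page content and URLs below are invented, and fetching the pages is left out entirely; the only real piece is the extract_links method from HTML::Element.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

# Hypothetical page content, standing in for a page you already fetched.
my $html = <<'HTML';
<html><body>
<a href="/books?page=2">Next page</a>
<a href="/book/42">A book</a>
<img src="/logo.png">
</body></html>
HTML

my $tb = HTML::TreeBuilder->new_from_content($html);

# extract_links('a') returns an array ref of [ $url, $element, $attr, $tag ],
# restricted here to <a> elements so the image is skipped.
my @urls = map { $_->[0] } @{ $tb->extract_links('a') };

say for @urls;
$tb->delete;
```

The URLs come back exactly as written in the page, so relative ones still need to be resolved against the page's address before following them.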

As the code is pure unstable crap, I'm not going to upload it for the moment...

OFF-TOPIC PS for last Saturday hardparty-ers: I'm still alive, well and kicking. Thanks, sorry, and good luck!

3 comments:

brainstorm said...

Cool stuff, dunno why but I'm in love with web scraping... here is something related I posted a long while ago:

http://blogs.nopcode.org/brainstorm/2006/12/20/use-templateextract/

But you know what's even cooler nowadays?

http://nokogiri.org/
+
http://www.selectorgadget.com/

Go and try ;)

Cheers from a former esCERT guy ;)

raig said...

Nopcoders working at esCERT. That's cool :) .

Thanks for the info, brainstorm. I'll definitely look at those links. I had read about Template::Extract, but never messed with it. AudreyT's modules deserve a look.

Take care :)

David said...

It's only hype! We wanna see the code!