Sunday, 8 March 2009

Squeak + Soup + Regex engine = google fight!

About a year ago, I wrote a kind of Google Fight in Perl. At first it was meant to automate something I found myself doing more and more:

When I don't know how a word is spelled, I look up the candidate spellings on Google and pick the one with the most results. The Perl program has been included as a WWW::Mechanize example in a course at La Laguna University, which makes me really happy :) .

Now I wanted to try it in Squeak, as it seems a good toy problem: it involves some web scraping, some string matching with regexes and, of course, connecting to the internet and a little control logic.

Well, to run the following example you'll have to get Squeak Soup (a port of Python's Beautiful Soup): add it as a new Monticello HTTP repository (http://www.squeaksource.com/Soup ) and install the latest version of Soup. In my Pharo image it complains about using startsWith (deprecated) instead of beginsWith. You can either ignore those warnings or fix them (that's what I did).

Then install the regex engine from the SqueakMap Package Loader. I had to use the beta version because the latest one crashed, but I may have made a mistake installing it, so don't read this as a beta vs. stable comparison.

After that, we're ready to run the following code in a workspace:

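mySearcher := [:wordToSearch | | soup results tmpText re m |
    "Fetch and parse the Google results page for the word."
    soup := Soup fromUrl: ('http://www.google.co.uk/search?q=', wordToSearch).
    "Find the <div id='ssb'> element, which holds the result summary."
    results := soup findAllTags:
        [:e | e name = 'div'
            and: [(e attributeAt: 'id') = 'ssb']].

    tmpText := ((results at: 1) findChildTag: [:e | e name = 'p']) text.

    "Capture the hit count that follows the word 'about'."
    re := 'about ([\d,.]+)' asRe.
    m := re search: tmpText.
    "Drop the comma separators before converting to a number."
    ((m matches at: 1) copyWithoutAll: ',') asNumber
].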

googleFight := [:x :y |
    (mySearcher value: x) > (mySearcher value: y)
        ifTrue: [x]
        ifFalse: [y]
].


googleFight value: 'hello' value: 'helo'
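
If you can't get the regex engine installed, the number-extraction step can be tried on its own with plain String messages. This is only a sketch: extractCount is a name I made up for the illustration, and the sample summary line is invented to look like the text the regex above expects (the word 'about' followed by a comma-separated count). It is not real Google output.

```smalltalk
"Hypothetical helper: pull the hit count out of a summary line
 using only base-image String messages (no Regex package needed)."
extractCount := [:summary | | start afterAbout token |
    "Index of the first occurrence of 'about ' (0 if it is missing)."
    start := summary findString: 'about '.
    start = 0
        ifTrue: [0]
        ifFalse: [
            "Keep everything after 'about ' and take the first word."
            afterAbout := summary copyFrom: start + 6 to: summary size.
            token := afterAbout substrings first.
            "Remove the comma separators, then convert to a number."
            (token reject: [:c | c = $,]) asNumber]].

"Invented sample text, for illustration only."
extractCount value: 'Results 1 - 10 of about 1,230,000 for hello'
```

Select the last line and print it to check the helper against the sample. It answers 0 when the word 'about' is missing, which is a choice made for this sketch; the workspace code above would raise an error in that case.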
Need any further explanation? Post a comment.

--- EDIT ---
I contacted Zulq Alam (Soup's author) and he fixed the beginsWith vs startsWith thing, and provided some alternative snippets that take full advantage of Soup. As they are related to the #doesNotUnderstand selector, which I plan on writing a post about, I'll leave them for now.
Thanks Zulq :)

2 comments:

Anonymous said...

Hello from Russia!
Can I quote a post "No teme" in your blog with the link to you?

Raimon Grau said...

Of course :)