The basic problem
Because Google has quit giving out free keys for their search API, I'm having to resort to screen scraping to pull what I want out of a search result. I'm currently trying to do that with RegEx. The main problem is getting the right pattern to pull the information out.
A search result, in HTML, from Google is broken up into several parts but all I'm interested in is the description of the webpage it finds. I don't want the title, and I'm not interested in the repeat of the search term that's bolded. In short, the results I want look like:
<font size=-1><b>a title, usually the search term again</b> THE CONTENT I WANT GOES HERE <br>
And here's the current pattern I'm trying: @"<font\s*size=\-1><b>[\w\d\s]+<\/b>\ *(.*[^\s])\s*<br>"
No matter what pattern I've tried (and I think at this point I've tried hundreds) I can't seem to pull out the "CONTENT I WANT." So I'm asking for help.
My current code for this method and a few of the last previous patterns can be found here: http://pastebin.ca/874155
If you'd like to see an image of what I mean: http://img401.imageshack.us/img401/3479/regexxd8.jpg
(The 1 and 0 in the console area show how many matches it found and which result it pulled to display, in array notation. Since there's only 1 match, it only pulls the first.)
Explanation as to WHY I'm doing this and why I want only the website description
I'm making a chat bot that works on IRC. Right now I've got it set up so that it can read AIML files. That's a bit too bland and repetitive, however, so in an attempt to spice it up I added the ability for it to repeat back what someone else had said earlier (it sits in the channel, watches people, and records what they say to use again later). This, however, is still rather bland, and sometimes it repeats back what someone said just seconds before.
In an attempt to bring a new concept to the bot, I decided I wanted it to search Google and respond using the search results. The basic idea is that search results change over time, so hopefully the bot won't repeat things that often. As a fallback, if it can't find any results for what someone is saying to it, it'll just pull out something that someone said previously, as it has always done.
In any case, knowing how to pull out the result may help me in another project down the road.
I really appreciate any help that can be brought to this. I'm about ready to pull my hair out over regular expressions.
I wrote a very quick program to test patterns without having to edit the source and recompile. This is to the source with the exe in the \debug\ folder. If anyone out there can help me write the pattern, this should speed up testing it.
Thanks to that quick program I know WHY the pattern isn't working. It's apparently not stopping at the first <br> and treating each result as a separate match. Instead, it runs all the way to the final <br> in the HTML page and returns everything in between as one match. I can test this further by including the <span class=a> in the pattern, and it'll then stop at the final instance of <br><span class=a> within the HTML.
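A minimal sketch of that behaviour, written in Python purely for illustration rather than in the bot's own code. The two-result `html` string is a made-up stand-in for a real results page, and the pattern is the one from above, unchanged:

```python
import re

# Hypothetical stand-in for a results page: two results on a single line.
html = (
    "<font size=-1><b>first title</b> first description <br>"
    "<span class=a>example.com</span>"
    "<font size=-1><b>second title</b> second description <br>"
)

# The pattern from the question, unchanged. The ".*" is greedy: it grabs as
# much as it can and only backs off far enough to find the LAST "<br>".
greedy = re.compile(r"<font\s*size=\-1><b>[\w\d\s]+<\/b>\ *(.*[^\s])\s*<br>")

matches = greedy.findall(html)

print(len(matches))  # a single match instead of one per result
print(matches[0])    # the capture runs from the first description through the second
```

With this sample input the capture swallows the first `<br>`, the `<span class=a>` block, and the whole second result, which matches the single match the test program reports.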
So now the real question is: what do I have to add to the pattern in order to make it stop at the <br> I need it to?
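One common way to get that behaviour is a lazy quantifier: writing `.*?` instead of `.*` makes the engine take as little as possible, so it stops at the nearest <br> that lets the rest of the pattern match. The sketch below is in Python for illustration only (.NET's regex engine accepts the same `.*?` syntax), the `html` sample is made up, and the only change to the pattern is the added `?`:

```python
import re

# Hypothetical stand-in for a results page: two results on a single line.
html = (
    "<font size=-1><b>first title</b> first description <br>"
    "<span class=a>example.com</span>"
    "<font size=-1><b>second title</b> second description <br>"
)

# Same pattern as in the question, except ".*" has become ".*?" (lazy).
lazy = re.compile(r"<font\s*size=\-1><b>[\w\d\s]+<\/b>\ *(.*?[^\s])\s*<br>")

# Each result now yields its own match, and group 1 holds only the description.
descriptions = lazy.findall(html)

print(descriptions)  # ['first description', 'second description']
```

This is only a sketch against the simplified sample. Real result pages may put tags or line breaks inside the description, so the pattern would likely need further adjusting against actual HTML.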