Thread: RegEX pattern problems. Help!

  1. #1
    Registered User
    Join Date
    Mar 2006
    Posts
    17

    RegEX pattern problems. Help!

    The basic problem
    Because Google has quit giving out free keys for their search API, I'm having to resort to screen scraping to pull what I want out of a search result. I'm currently trying to do that with RegEx. The main problem is getting the right pattern to pull the information out.

    A search result, in HTML, from Google is broken up into several parts but all I'm interested in is the description of the webpage it finds. I don't want the title, and I'm not interested in the repeat of the search term that's bolded. In short, the results I want look like:

    <font size=-1><b>a title, usually the search term again</b> THE CONTENT I WANT GOES HERE <br>

    And here's the current pattern I'm trying: @"<font\s*size=\-1><b>[\w\d\s]+<\/b>\ *(.*[^\s])\s*<br>"


    No matter what pattern I've tried (and I think at this point I've tried hundreds) I can't seem to pull out the "CONTENT I WANT." So I'm asking for help.

    My current code for this method and a few of the last previous patterns can be found here: http://pastebin.ca/874155

    If you'd like to see an image of what I mean: http://img401.imageshack.us/img401/3479/regexxd8.jpg

    (The 1 and 0 in the console area show how many matches were found and the array index of the result being displayed. Since there's only 1 match, it only pulls the first.)


    Explanation as to WHY I'm doing this and why I want only the website desc
    I'm making a chat bot that works on IRC. Right now I've actually got it set up so that it can read AIML files. That's a bit too bland and repetitive, however, so in an attempt to spice it up I added the ability for it to repeat back what someone else had said earlier (it sits in the channel, watches people and records what they say to use again later). This, however, is also rather bland, and sometimes it repeats back what someone had said just seconds before.

    In an attempt to bring a new concept to the bot, I decided I wanted it to search Google and respond using the search results. The basic idea is that search results will change over time, so hopefully the bot won't repeat things that often. As a fallback, if it can't find any results for what someone is saying to the bot, it'll just pull out something that someone had said previously, as it has always done.

    In any case, knowing how to pull out the result may help me in another project down the road.


    I really appreciate any help that can be brought to this. I'm about ready to pull my hair out over regular expressions.



    ---Edit---
    I wrote a very quick program to test patterns without having to edit the source and recompile, which should speed up testing. The link below goes to the source, with the exe in the \debug\ folder. If anyone out there can help me write the pattern, this should make testing it faster.

    regExTesterGUI


    ---Edit 2----
    Thanks to that quick program I know WHY the pattern isn't working. It isn't stopping at the first <br> and treating each result as a separate match; instead, it runs on to the final <br> in the HTML page. I can confirm this by adding <span class=a> to the pattern, in which case it stops at the final instance of <br><span class=a> in the HTML.

    So now the real question is: what do I have to add to the pattern in order to make it stop at the <br> I need it to?
    Last edited by Iyouboushi; 01-27-2008 at 01:23 AM.

  2. #2
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    * is greedy: it finds the longest match possible.

    Between the point where you match the </b> and the <br>, do something like this (basically, forbid it from matching a < character):

    \s*([^<]+)\s*
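    To make the difference concrete, here is a minimal sketch in Python rather than the C# from the original post; greedy matching should behave the same way for this case. The sample HTML string and variable names are made up for illustration, and [^<]+ is just one way to write the "forbid a < character" idea.

    ```python
    import re

    # Hypothetical sample: two results bunched on one line, as in a scraped page.
    html = ('<font size=-1><b>Banana</b> A curved, yellow fruit <br>'
            '<font size=-1><b>Apple</b> A roundish red fruit <br>')

    # Greedy .* runs to the LAST <br> it can reach, swallowing both results.
    greedy = re.compile(r'<font\s*size=-1><b>[\w\s]+</b>\s*(.*[^\s])\s*<br>')

    # A class that cannot match '<' is forced to stop at the next tag.
    no_tags = re.compile(r'<font\s*size=-1><b>[\w\s]+</b>\s*([^<]+)\s*<br>')

    greedy_hits = greedy.findall(html)
    # [^<]+ also captures trailing spaces, so trim them afterwards.
    no_tag_hits = [text.strip() for text in no_tags.findall(html)]

    print(len(greedy_hits))  # one oversized match
    print(no_tag_hits)       # one entry per result
    ```

    Note that this approach gives up as soon as the description itself contains a tag (a bolded search term, for example), so it only suits descriptions that are plain text.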
    You ever try a pink golf ball, Wally? Why, the wind shear on a pink ball alone can take the head clean off a 90 pound midget at 300 yards.

  3. #3
    Dino (Jack of many languages)
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Try this:

    @"<font\s*size=\-1><b>.+</b> *(.*[^\s])\s*<br>"

    Todd

    Edit: Works in Ruby - didn't try it in C#, and I tested it against this input:
    Code:
    <html>
    <head>
    <title>my title</title> 
    </head>
    <body>
    <h1>term is "banana"</h1>
    <p>
    <font size=-1><b>Banana</b>  A curved, narrow, yellow fruit  <br>
    <font size=-1><b>Apple</b>A roundish red fruit<br>
    <font size=-1><b>Orange</b>A very round, dimply, orange fruit<br>
    </p>
    <br>
    </body>
    </html>

  4. #4
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    Quote: Originally posted by Todd Burch
    Try this:

    @"<font\s*size=\-1><b>.+</b> *(.*[^\s])\s*<br>"

    Todd

    Edit: Works in Ruby - didn't try it in C#, and I tested it with this syntax:
    I believe that only works if your regex is parsing line by line (that is, not permitting . to match a newline character).

    In general you can use .+? or .*? to make the regex non-greedy.
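    A small sketch of both points, again in Python as a stand-in for C#. The two-line sample page is invented for illustration. Python's re.DOTALL flag lets . match a newline; in .NET the corresponding option is RegexOptions.Singleline.

    ```python
    import re

    # Hypothetical page with one result per line.
    page = ('<font size=-1><b>Banana</b> A curved fruit <br>\n'
            '<font size=-1><b>Apple</b> A red fruit <br>\n')

    greedy = r'<b>.+</b>\s*(.*[^\s])\s*<br>'
    lazy = r'<b>.+?</b>\s*(.*?)\s*<br>'

    # By default . stops at a newline, so the greedy pattern is confined
    # to one line at a time and happens to work.
    per_line = re.findall(greedy, page)

    # Once . may cross newlines, the greedy quantifiers run to the last
    # </b> and the last <br>, and only one match comes back.
    greedy_dotall = re.findall(greedy, page, re.DOTALL)

    # The lazy versions stop at the first </b> and the first <br>.
    lazy_dotall = re.findall(lazy, page, re.DOTALL)

    print(per_line)
    print(greedy_dotall)
    print(lazy_dotall)
    ```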

  5. #5
    Dino (Jack of many languages)
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Good catch on the greedy matching. This works whether the results are on separate lines or all bunched on one line.

    @"<font\s*size=\-1><b>.+?</b>\s*(.*?)\s*<br>"

    Thanks. Todd
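    For anyone who wants to check the non-greedy pattern without a C# project, here is a rough sketch that runs it in Python over the three test lines from post #3, once with a newline between results and once with everything joined on a single line. The names PATTERN, lines, multi_line and single_line are only for this example.

    ```python
    import re

    # The non-greedy pattern from the thread, as a Python raw string.
    PATTERN = re.compile(r'<font\s*size=\-1><b>.+?</b>\s*(.*?)\s*<br>')

    # Test lines taken from the sample HTML in post #3.
    lines = [
        '<font size=-1><b>Banana</b>  A curved, narrow, yellow fruit  <br>',
        '<font size=-1><b>Apple</b>A roundish red fruit<br>',
        '<font size=-1><b>Orange</b>A very round, dimply, orange fruit<br>',
    ]

    multi_line = '\n'.join(lines)   # one result per line
    single_line = ''.join(lines)    # all results bunched together

    from_multi = PATTERN.findall(multi_line)
    from_single = PATTERN.findall(single_line)

    for description in from_single:
        print(description)
    ```

    Both layouts should give the same three descriptions, with the surrounding whitespace already trimmed by the \s* on either side of the capture group.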
