Regular Expression Feedback


  1. #1
    Super Moderator
    Join Date
    Sep 2001
    Posts
    4,913

    Regular Expression Feedback

    I need a regular expression that reliably catches any valid url. I'm using preg (Perl-compatible) on PHP. I've done some testing with it and it seems to work perfectly, but I wanted to open this up and see if anyone has any suggestions. I've tried to make it fairly modular (the break-down is below), but for all I know there are special characters I haven't remembered in the query string, etc...

    I know there are a ton online, but they're either not preg, or they just didn't work for me. I also wanted to write my own to be sure I thoroughly understood every part of it. Some of the examples had weird combinations, the reasoning for which I didn't understand.

    I'm aware that this won't work if my hostname is local - that's okay, I don't need it to. It's also okay if it matches the occasional non-url (within reasonable limits) - I just want to make sure it will match every real one.

    Optional protocol
    1 or more Subdomains ending with a '.'
    1 top-level domain (no '.')
    Optional port
    Optional file-path/query string

    Code:
    /([a-z]{3,10}:\/\/)?([a-z0-9\-]+\.)+([a-z]{2,6})(:\d+)?[a-z0-9?=$\/\.%#]*/
    Any thoughts or suggestions?

    edit: I will have the 'i' (case-insensitive) flag turned on
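For readers following along, here is a minimal sketch of how the pattern above might be applied with the 'i' flag to pull URLs out of a larger string. The sample text and variable names are made up for illustration and are not from the original post.

```php
<?php
// The pattern from the post above, with the 'i' (case-insensitive) flag appended.
$pattern = '/([a-z]{3,10}:\/\/)?([a-z0-9\-]+\.)+([a-z]{2,6})(:\d+)?[a-z0-9?=$\/\.%#]*/i';

// Hypothetical input text, invented for this example.
$text = 'Docs are at http://www.example.com/docs/index.html?page=2#top and example.org:8080/a';

// preg_match_all() stores every full match in $matches[0].
preg_match_all($pattern, $text, $matches);
$urls = $matches[0];

print_r($urls);
```

Because the pattern is not anchored, it also picks up bare host names such as example.org:8080/a, which fits the stated goal of preferring false positives over missed URLs.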

  2. #2
    Complete Beginner
    Join Date
    Feb 2009
    Posts
    312
    - your regexp catches a lot of invalid URLs, e.g. xyzxyzxyz://foo.com
    - a valid DNS name may end in ".", e.g. http://www.google.com./
    - URLs may contain username/password, e.g. http://<user>:<pass>@bar.com

    There may be some points that I'm missing now.

    Greets,
    Philip
    Last edited by Snafuist; 05-28-2009 at 12:38 PM.
    All things begin as source code.
    Source code begins with an empty file.
    -- Tao Te Chip

  3. #3
    Ethernal Noob
    Join Date
    Nov 2001
    Posts
    1,901
    There'd be a lot less room for error if you encapsulated a URI into an object or class and performed actions on its parts one at a time, as opposed to doing it all at once with an archaic (although at times useful) method of text parsing.

    I think in the modern world using a Regular expression for something so simple tends to be difficult to validate.
    Here to Deceive, Inveigle, Obfuscate Since 1945

  4. #4
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by indigo0086 View Post
    There'd be a lot less room for error if you encapsulated a URI into an object or class and performed actions on its parts one at a time, as opposed to doing it all at once with an archaic (although at times useful) method of text parsing.

    I think in the modern world using a Regular expression for something so simple tends to be difficult to validate.
    What??? The (substrings) are there, they can be validated.

    And how are you going to parse it into parts of an object?

    I would have thought there would be some existing library for this anyway.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  5. #5
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    21,394
    Quote Originally Posted by MK27
    And how are you going to parse it into parts of an object?
    Perhaps parse_url() could be used.

    Quote Originally Posted by MK27
    I would have thought there would be some existing library for this anyway.
    Probably, but none that is available by default or easily enabled as an extension.
    C + C++ Compiler: MinGW port of GCC
    Version Control System: Bazaar

    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  6. #6
    Complete Beginner
    Join Date
    Feb 2009
    Posts
    312
    Some more hints:

    Here are some more URIs that you won't recognize (stolen from the German Wikipedia)

    ldap://[2001:db8::7]/c=GB?objectClass?one
    file://C:\UserName.HostName\Projects\Wikipedia_Articles\URI.xml
    byond://BYOND.world.123456789


    Furthermore:

    - a URI may contain a fragment identifier: www.foo.com/bar.html#frag_ident
    - characters may be encoded: www.foo.com/bar%20baz
    - have a look at RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax and use an already existing URI parser... ;-)

    Greets,
    Philip

  7. #7
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by laserlight View Post
    Perhaps parse_url() could be used.
    I don't know PHP. I think that this would qualify as what I meant by "library" (maybe module is a better word), except I imagine that is actually a built in? (I am sure that there are a zillion perl modules with some kind of "parse_url()" method in them).

    Considering what PHP is supposed to be I'm surprised you would have to approach this task as if it were something new and unusual.

    Anyways, dollars to donuts it does come down to regexp's internally :P
    I have not even heard of another way of string parsing in widespread use.

  8. #8
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    21,394
    Quote Originally Posted by MK27
    I think that this would qualify as what I meant by "library" (maybe module is a better word), except I imagine that is actually a built in?
    But as the PHP manual noted, parse_url() is about parsing, not validation.
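To illustrate the distinction, here is a small sketch of what parse_url() returns. The first URL is a made-up example; the second is the bogus-scheme URL mentioned earlier in the thread.

```php
<?php
// Hypothetical URL chosen so that every component is present.
$parts = parse_url('http://user:pass@www.example.com:8080/path/page.html?x=1#frag');
print_r($parts); // associative array: scheme, host, port, user, pass, path, query, fragment

// parse_url() splits the string but does not judge whether the scheme is a real one.
$loose = parse_url('xyzxyzxyz://foo.com');
print_r($loose);
```

So parse_url() is handy once a candidate URL has been found, but some other check is still needed to decide whether the candidate is acceptable.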

  9. #9
    Complete Beginner
    Join Date
    Feb 2009
    Posts
    312
    Quote Originally Posted by MK27 View Post
    I have not even heard of another way of string parsing in widespread use.
    You should look up finite state automata reading character by character. ;-)
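As a rough idea of what that looks like, here is a deliberately simplified sketch of a state machine that reads a host name one character at a time. The function name and the three states are invented for this example, and it is nowhere near a full RFC 3986 parser.

```php
<?php
// Character-by-character recognizer for "label.label[.label...]" with an
// optional trailing dot. States: 'start', 'label', 'dot'.
function looksLikeHostname(string $s): bool
{
    $state = 'start';
    $dots = 0;
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $c = $s[$i];
        if (ctype_alnum($c) || $c === '-') {
            $state = 'label';            // any label character keeps us in a label
        } elseif ($c === '.' && $state === 'label') {
            $state = 'dot';              // a dot is only allowed right after a label
            $dots++;
        } else {
            return false;                // no transition defined: reject
        }
    }
    // Accept only if at least one dot was seen; ending on a dot is allowed
    // because a DNS name may end in ".".
    return $dots >= 1 && ($state === 'label' || $state === 'dot');
}

var_dump(looksLikeHostname('www.google.com.'));
```

Each state only has to know which characters may come next, which is why this style tends to be easier to extend piece by piece than one large pattern.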

    Greets,
    Philip

  10. #10
    Super Moderator
    Join Date
    Sep 2001
    Posts
    4,913
    your regexp catches a lot of invalid URLs, e.g. xyzxyzxyz://foo.com
    There's a pretty long list of possible protocols, which is why I opted for the method I did. In reality, I probably only need to be looking for http and https urls, so I may just hardcode those in too.

    a valid DNS name may end in ".", e.g. http://www.google.com./
    Didn't think of that - but the regex I have matches that example because I expected a . and a / in the file path anyway.

    URLs may contain username/password, e.g. http://<user>:<pass>@bar.com
    Huh - didn't know about that - I'll look into it. Thanks!

    There'd be a lot less room for error if you encapsulated a URI into an object or class
    This is for a tiny PHP script - so I'm more concerned with short fairly simple code than I am with a margin of error significantly smaller than what I have now. All I'm using the regex for is searching a potentially large string for a URL and extracting the url as a whole. I don't need to work on individual portions of the match - I just need to find the match, and regex's seem to be the best solution.

    I may need to break it down in the future - in which case I will probably redesign with a more object-oriented approach.

    a URI may contain a fragment identifier: www.foo.com/bar.html#frag_ident
    Ooh - I knew I should've added # - thanks!

    characters may be encoded: www.foo.com/bar%20baz
    and the %

    have a look at RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax and use an already existing URI parser... ;-)
    As mentioned - right now I just need to find them, not parse them. This is the simplest way (well, the regex itself isn't simple, but I just want a simple script that's valid PHP anyway)

    ldap://[2001:db8::7]/c=GB?objectClass?one
    file://C:\UserName.HostName\Projects\Wikipedia_Articles\URI.xml
    byond://BYOND.world.123456789
    Okay the middle one I don't need to match, but ldap is one I need to look into - thanks for pointing that out. Is that byond one even real? Wouldn't be too hard to catch that, though...


    edit: I added # and % to the possible characters in the file path. I might also add something like (<[a-z]+:[a-z]+>@)? after the protocol for the username and password
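A possible variation, sketched here on the assumption that the angle brackets in the http://<user>:<pass>@bar.com example were placeholder notation rather than literal characters. The character class chosen for the user and password parts is a guess for illustration and is not taken from RFC 3986.

```php
<?php
// Same pattern as the first post, plus an optional "user[:pass]@" group after
// the protocol. The userinfo character class is an assumption for illustration.
$pattern = '/([a-z]{3,10}:\/\/)?([a-z0-9._%\-]+(:[a-z0-9._%\-]*)?@)?([a-z0-9\-]+\.)+([a-z]{2,6})(:\d+)?[a-z0-9?=$\/\.%#]*/i';

// URL with credentials inside a longer, made-up sentence.
preg_match($pattern, 'see http://user:pass@bar.com/index for details', $withUser);

// URL without credentials still matches, because the new group is optional.
preg_match($pattern, 'www.foo.com/bar%20baz', $plain);

echo $withUser[0], "\n", $plain[0], "\n";
```

One side effect worth knowing about: with the protocol optional, this version will also match plain e-mail addresses such as bob@foo.com.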

  11. #11
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by sean View Post
    regex's seem to be the best solution.
    Still pretty sure that regex's are the ONLY solution, whether you do it yourself or use some function that will do it for you*. It is not as if you could drop the URI off a roof and hope it will land, properly segmented, into a magic object.

    * keeping that in mind and that regexp's are the same in perl and PHP, I would swing by perlmonks (you don't have to register, you can post anonymously) and put this question ("regexp for extracting url!") to them -- you will see more action than if you threw raw meat into a pool of circling sharks -- and assuming there is a module that does this, you could dig thru and have a look at the regex's there.

  12. #12
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    21,394
    Quote Originally Posted by MK27
    Still pretty sure that regex's are the ONLY solution, whether you do it yourself or use some function that will do it for you*. It is not as if you could drop the URI off a roof and hope it will land, properly segmented, into a magic object.
    Perhaps instead of saying "only solution" you should express your opinion by saying "best solution", "simplest solution", etc, since "only solution" is obviously factually incorrect.

  13. #13
    Ethernal Noob
    Join Date
    Nov 2001
    Posts
    1,901
    Quote Originally Posted by MK27 View Post
    Still pretty sure that regex's are the ONLY solution, whether you do it yourself or use some function that will do it for you*. It is not as if you could drop the URI off a roof and hope it will land, properly segmented, into a magic object.

    If that was directed at me that was hardly what I meant.

    If he had created a class, it could have taken in the full URL string. Within it he could have parsed individual sections of the url piece by piece: the domain, the protocol, the path, etc. Had any of those been malformed, an error could have been thrown.

    I'm not saying he shouldn't use Regex at all, just that doing it all at once is prone to error.

    I have seen a regex that properly validated dates, including valid/invalid leap days. I am sure whoever wrote that could have done it in a smarter way. Maybe they were pretty clever to create such a regex, but I'd like to see that person refactor it.

    And I would hardly call finding a "god-like" regex the simplest or best solution, let alone the only one.

  14. #14
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by indigo0086 View Post
    I'm not saying he shouldn't use Regex at all, just that doing it all at once is prone to error.
    Who cares what you "meant" to say...
    Quote Originally Posted by indigo0086 View Post
    I think in the modern world using a Regular expression for something so simple tends to be difficult to validate.
    With a static string there is no difference between doing it all at once or doing it in pieces. Sean's OP has (parenthesized substrings) (in it), so you do the whole regular expression grab at once, then you can validate substring #1, and if it passes validate substring #2, etc.
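A minimal sketch of that idea, using the pattern from the first post. The function name, the http/https whitelist (which follows the remark in post #10), and the port range check are illustrative choices, not something spelled out in the thread.

```php
<?php
// Pattern from the first post, with the 'i' flag.
$pattern = '/([a-z]{3,10}:\/\/)?([a-z0-9\-]+\.)+([a-z]{2,6})(:\d+)?[a-z0-9?=$\/\.%#]*/i';

// Grab everything in one match, then check the captured pieces one by one.
function checkUrl(string $candidate, string $pattern): bool
{
    if (!preg_match($pattern, $candidate, $m)) {
        return false;
    }
    // Group 1: protocol including "://", or an empty string when absent.
    $scheme = strtolower($m[1]);
    if ($scheme !== '' && !in_array($scheme, ['http://', 'https://'], true)) {
        return false;
    }
    // Group 4: port including the leading ':'; not set when absent.
    if (!empty($m[4])) {
        $port = (int) substr($m[4], 1);
        if ($port < 1 || $port > 65535) {
            return false;
        }
    }
    // Note: group 2 sits inside a repeated group, so it only holds the last
    // subdomain label that matched, not all of them.
    return true;
}

var_dump(checkUrl('xyzxyzxyz://foo.com', $pattern));
```

This keeps the single regex for finding the URL while moving the stricter rules into ordinary code that is easier to read and change.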
