C Board  

Old 05-28-2009, 10:49 AM   #1
Super Moderator
 
Join Date: Sep 2001
Posts: 4,746
Regular Expression Feedback

I need a regular expression that reliably catches any valid url. I'm using preg (Perl-compatible) on PHP. I've done some testing with it and it seems to work perfectly, but I wanted to open this up and see if anyone has any suggestions. I've tried to make it fairly modular (the break-down is below), but for all I know there are special characters I haven't remembered in the query string, etc...

I know there are a ton online, but they're either not preg, or they just didn't work for me. I also wanted to write my own to be sure I thoroughly understood every part of it. Some of the examples had weird combinations, the reasoning for which I didn't understand.

I'm aware that this won't work if my hostname is local - that's okay, I don't need it to. It's also okay if it matches the occasional non-url (within reasonable limits) - I just want to make sure it will match every real one.

Optional protocol
1 or more Subdomains ending with a '.'
1 top-level domain (no '.')
Optional port
Optional file-path/query string

Code:
/([a-z]{3,10}:\/\/)?([a-z0-9\-]+\.)+([a-z]{2,6})(:\d+)?[a-z0-9?=$\/\.%#]*/
Any thoughts or suggestions?

edit: I will have the 'i' (case-insensitive) flag turned on
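
For anyone who wants to try the pattern against sample text, here is a minimal sketch. It uses Python's re module rather than PHP's preg functions (an assumption made purely for illustration; the pattern is unchanged apart from dropping the PHP /.../ delimiters and the escaping of '/' and '.' that is no longer needed). The helper name find_urls and the sample text are made up.

```python
import re

# The pattern from the post above, without the PHP /.../ delimiters.
# re.IGNORECASE plays the role of the 'i' flag mentioned in the edit.
URL_RE = re.compile(
    r"([a-z]{3,10}://)?"      # optional protocol
    r"([a-z0-9\-]+\.)+"       # one or more subdomains, each ending in '.'
    r"([a-z]{2,6})"           # top-level domain
    r"(:\d+)?"                # optional port
    r"[a-z0-9?=$/.%#]*",      # optional file path / query string
    re.IGNORECASE,
)

def find_urls(text):
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in URL_RE.finditer(text)]

# Hypothetical sample input, not taken from the thread.
sample = "See http://www.example.com:8080/a/b.php?x=1 or example.org/page#top for details."
print(find_urls(sample))
```

One thing this makes easy to spot: the final character class has no '&', '-', '_' or '~', so a query string such as ?a=1&b=2 is cut off at the '&'.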
sean is offline   Reply With Quote
Old 05-28-2009, 12:11 PM   #2
Complete Beginner
 
Join Date: Feb 2009
Posts: 312
- your regexp catches a lot of invalid URLs, e.g. xyzxyzxyz://foo.com
- a valid DNS name may end in ".", e.g. http://www.google.com./
- URLs may contain username/password, e.g. http://<user>:<pass>@bar.com

There may be some points that I'm missing now.

Greets,
Philip
__________________
All things begin as source code.
Source code begins with an empty file.
-- Tao Te Chip

Last edited by Snafuist; 05-28-2009 at 12:38 PM.
Snafuist is offline   Reply With Quote
Old 05-28-2009, 12:18 PM   #3
Ethernal Noob
 
Join Date: Nov 2001
Posts: 1,891
There'd be a lot less room for error if you encapsulated a URI into an object or class and performed actions on its parts one at a time, as opposed to doing it all at once with an archaic (although at times useful) method of text parsing.

I think in the modern world using a Regular expression for something so simple tends to be difficult to validate.
__________________
Here to Deceive, Inveigle, Obfuscate Since 1945
indigo0086 is offline   Reply With Quote
Old 05-28-2009, 12:28 PM   #4
critical genius
 
Join Date: Jul 2008
Location: SE Queens
Posts: 5,176
Quote:
Originally Posted by indigo0086 View Post
There'd be a lot less room for error if you encapsulated a URI into an object or class and performed actions on its parts one at a time, as opposed to doing it all at once with an archaic (although at times useful) method of text parsing.

I think in the modern world using a Regular expression for something so simple tends to be difficult to validate.
What??? The (substrings) are there, they can be validated.

And how are you going to parse it into parts of an object?

I would have thought there would be some existing library for this anyway.
__________________

"A man can't just sit around." -- Larry Walters
MK27 is online now   Reply With Quote
Old 05-28-2009, 12:33 PM   #5
C++ Witch
 
Join Date: Oct 2003
Location: Singapore
Posts: 11,324
Quote:
Originally Posted by MK27
And how are you going to parse it into parts of an object?
Perhaps parse_url() could be used.

Quote:
Originally Posted by MK27
I would have thought there would be some existing library for this anyway.
Probably, but none that is available by default or easily enabled as an extension.
__________________
C + C++ Compiler: MinGW port of GCC
Build + Version Control System: SCons + Bazaar

Look up a C/C++ Reference and learn How To Ask Questions The Smart Way
laserlight is online now   Reply With Quote
Old 05-28-2009, 12:38 PM   #6
Complete Beginner
 
Join Date: Feb 2009
Posts: 312
Some more hints:

Here are some more URIs that you won't recognize (stolen from the German Wikipedia)

ldap://[2001:db8::7]/c=GB?objectClass?one
file://C:\UserName.HostName\Projects\Wikipedia_Articles\URI.xml
byond://BYOND.world.123456789


Furthermore:

- a URI may contain a fragment identifier: www.foo.com/bar.html#frag_ident
- characters may be encoded: www.foo.com/bar%20baz
- have a look at RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax and use an already existing URI parser... ;-)
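
As an illustration of the "use an existing parser" suggestion, here is a small sketch using Python's standard urllib.parse (picked only for illustration, since the thread itself is about PHP). It splits the LDAP example and the fragment example from above into their components. Keep in mind that urlsplit mostly splits and does very little validation.

```python
from urllib.parse import urlsplit

# The LDAP example from the list above.
parts = urlsplit("ldap://[2001:db8::7]/c=GB?objectClass?one")

print(parts.scheme)    # the part before "://"
print(parts.hostname)  # host, with the IPv6 brackets removed
print(parts.path)      # path component
print(parts.query)     # everything after the first "?"

# The fragment example from above (no scheme given, so there is no host part).
frag = urlsplit("www.foo.com/bar.html#frag_ident")
print(frag.fragment)   # the part after "#"
```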

Greets,
Philip
Snafuist is offline   Reply With Quote
Old 05-28-2009, 12:38 PM   #7
critical genius
 
Join Date: Jul 2008
Location: SE Queens
Posts: 5,176
Quote:
Originally Posted by laserlight View Post
Perhaps parse_url() could be used.
I don't know PHP. I think that this would qualify as what I meant by "library" (maybe module is a better word), except I imagine that is actually a built-in? (I am sure that there are a zillion Perl modules with some kind of "parse_url()" method in them).

Considering what PHP is supposed to be I'm surprised you would have to approach this task as if it were something new and unusual.

Anyways, dollars to donuts it does come down to regexp's internally :P
I have not even heard of another way of string parsing in widespread use.
MK27 is online now   Reply With Quote
Old 05-28-2009, 12:40 PM   #8
C++ Witch
 
Join Date: Oct 2003
Location: Singapore
Posts: 11,324
Quote:
Originally Posted by MK27
I think that this would qualify as what I meant by "library" (maybe module is a better word), except I imagine that is actually a built in?
But as the PHP manual noted, parse_url() is about parsing, not validation.
laserlight is online now   Reply With Quote
Old 05-28-2009, 12:44 PM   #9
Complete Beginner
 
Join Date: Feb 2009
Posts: 312
Quote:
Originally Posted by MK27 View Post
I have not even heard of another way of string parsing in widespread use.
You should look up finite state automata reading character by character. ;-)
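
A minimal sketch of that idea, in Python for illustration: a hand-written state machine that reads the input one character at a time and pulls out the scheme and host, with no regex involved. The function name, the state names and the simplified grammar (letters-only scheme, host made of letters, digits, '-' and '.') are assumptions for the example and are much narrower than RFC 3986.

```python
def split_scheme_host(text):
    """Scan text one character at a time and return (scheme, host),
    or None if text does not look like scheme://host[/...]."""
    state = "SCHEME"
    scheme, host = [], []
    for ch in text:
        if state == "SCHEME":
            if ch.isalpha():
                scheme.append(ch)
            elif ch == ":" and scheme:
                state = "SLASH1"
            else:
                return None
        elif state == "SLASH1":
            if ch != "/":
                return None
            state = "SLASH2"
        elif state == "SLASH2":
            if ch != "/":
                return None
            state = "HOST"
        elif state == "HOST":
            if ch.isalnum() or ch in "-.":
                host.append(ch)
            elif ch in "/:?#" and host:
                # First character after the host: stop scanning here.
                state = "DONE"
                break
            else:
                return None
    # Accept only if scanning ended in or just past a non-empty host.
    if state in ("HOST", "DONE") and host:
        return "".join(scheme), "".join(host)
    return None

print(split_scheme_host("http://www.google.com./"))
print(split_scheme_host("foo.com"))
```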

Greets,
Philip
Snafuist is offline   Reply With Quote
Old 05-28-2009, 01:26 PM   #10
Super Moderator
 
Join Date: Sep 2001
Posts: 4,746
Quote:
your regexp catches a lot of invalid URLs, e.g. xyzxyzxyz://foo.com
There's a pretty long list of possible protocols, which is why I opted for the method I did. In reality, I probably only need to be looking for http and https urls, so I may just hardcode those in too.

Quote:
a valid DNS name may end in ".", e.g. http://www.google.com./
Didn't think of that - but the regex I have matches that example because I expected a . and a / in the file path anyway

Quote:
URLs may contain username/password, e.g. http://<user>:<pass>@bar.com
Huh - didn't know about that - I'll look into it. Thanks!

Quote:
There'd be a lot less room for error if you encapsulated a URI into an object or class
This is for a tiny PHP script - so I'm more concerned with short fairly simple code than I am with a margin of error significantly smaller than what I have now. All I'm using the regex for is searching a potentially large string for a URL and extracting the url as a whole. I don't need to work on individual portions of the match - I just need to find the match, and regex's seem to be the best solution.

I may need to break it down in the future - in which case I will probably redesign with a more object-oriented approach.

Quote:
a URI may contain a fragment identifier: www.foo.com/bar.html#frag_ident
Ooh - I knew I should've added # - thanks!

Quote:
characters may be encoded: www.foo.com/bar%20baz
and the %

Quote:
have a look at RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax and use an already existing URI parser... ;-)
As mentioned - right now I just need to find them, not parse them. This is the simplest way (well, the regex itself isn't simple, but I just want a simple script that's valid PHP anyway)

Quote:
ldap://[2001:db8::7]/c=GB?objectClass?one
file://C:\UserName.HostName\Projects\Wikipedia_Articles\URI.xml
byond://BYOND.world.123456789
Okay the middle one I don't need to match, but ldap is one I need to look into - thanks for pointing that out. Is that byond one even real? Wouldn't be too hard to catch that, though...


edit: I added # and % to the possible characters in the file path. I might also add something like (<[a-z]+:[a-z]+>@)? after the protocol for the username and password
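
One caveat on that last idea: the angle brackets in the http://<user>:<pass>@bar.com example are most likely placeholder markers rather than literal characters, so a real URL would look like http://user:pass@bar.com. Here is a hedged sketch of the adjusted pattern, written for Python's re module for illustration (the syntax carries over to preg), with the protocol narrowed to http/https as mentioned above. The character classes chosen for the user name and password are an assumption, not something taken from the RFC.

```python
import re

URL_RE = re.compile(
    r"(https?://)?"                 # optional protocol, http or https only
    r"([^\s:@/]+(:[^\s@/]*)?@)?"    # optional user[:password]@
    r"([a-z0-9\-]+\.)+"             # one or more subdomains
    r"([a-z]{2,6})"                 # top-level domain
    r"(:\d+)?"                      # optional port
    r"[a-z0-9?=$/.%#]*",            # file path / query string / fragment
    re.IGNORECASE,
)

# Hypothetical sample text.
m = URL_RE.search("login at http://alice:secret@bar.com/index.php today")
print(m.group(0))
```

A side effect worth knowing about: with the optional user part in place, a plain e-mail address such as alice@bar.com matches in full as well.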
sean is offline   Reply With Quote
Old 05-28-2009, 01:38 PM   #11
critical genius
 
Join Date: Jul 2008
Location: SE Queens
Posts: 5,176
Quote:
Originally Posted by sean View Post
regex's seem to be the best solution.
Still pretty sure that regex's are the ONLY solution, whether you do it yourself or use some function that will do it for you*. It is not as if you could drop the URI off a roof and hope it will land, properly segmented, into a magic object.

* keeping that in mind and that regexp's are the same in Perl and PHP, I would swing by perlmonks (you don't have to register, you can post anonymously) and put this question ("regexp for extracting url!") to them -- you will see more action than if you threw raw meat into a pool of circling sharks -- and assuming there is a module that does this, you could dig thru and have a look at the regex's there.
MK27 is online now   Reply With Quote
Old 05-28-2009, 01:46 PM   #12
C++ Witch
 
Join Date: Oct 2003
Location: Singapore
Posts: 11,324
Quote:
Originally Posted by MK27
Still pretty sure that regex's are the ONLY solution, whether you do it yourself or use some function that will do it for you*. It is not as if you could drop the URI off a roof and hope it will land, properly segmented, into a magic object.
Perhaps instead of saying "only solution" you should express your opinion by saying "best solution", "simplest solution", etc, since "only solution" is obviously factually incorrect.
laserlight is online now   Reply With Quote
Old 05-28-2009, 02:13 PM   #13
Ethernal Noob
 
Join Date: Nov 2001
Posts: 1,891
Quote:
Originally Posted by MK27 View Post
Still pretty sure that regex's are the ONLY solution, whether you do it yourself or use some function that will do it for you*. It is not as if you could drop the URI off a roof and hope it will land, properly segmented, into a magic object.

If that was directed at me that was hardly what I meant.

If he had created a class, it could have taken in the full URL string. Within it he could have parsed individual sections of the url piece by piece: the domain, the protocol, the path, etc. Had any of those been malformed, an error could have been thrown.

I'm not saying he shouldn't use Regex at all, just that doing it all at once is prone to error.

I have seen a regex that properly validated dates, including valid/invalid leap days. I am sure whoever wrote that could have done it in a smarter way. Maybe they were pretty clever to create such a regex, but I'd like to see that person refactor it.

And I would hardly call finding a "god-like" regex the simplest or best solution, let alone the only one.
indigo0086 is offline   Reply With Quote
Old 05-28-2009, 02:26 PM   #14
critical genius
 
Join Date: Jul 2008
Location: SE Queens
Posts: 5,176
Quote:
Originally Posted by indigo0086 View Post
I'm not saying he shouldn't use Regex at all, just that doing it all at once is prone to error.
Who cares what you "meant" to say...
Quote:
Originally Posted by indigo0086 View Post
I think in the modern world using a Regular expression for something so simple tends to be difficult to validate.
With a static string there is no difference between doing it all at once or doing it in pieces. Sean's OP has (parenthesized substrings) (in it), so you do the whole regular expression grab at once, then you can validate substring #1, and if it passes validate substring #2, etc.
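
To make that concrete, here is a small sketch (Python's re module for illustration, with named groups added to the pattern from the first post; the group names, the scheme whitelist and the individual checks are all assumptions for the example). One match grabs everything, then each captured substring is checked on its own.

```python
import re

URL_RE = re.compile(
    r"(?:(?P<scheme>[a-z]{3,10})://)?"          # optional protocol
    r"(?P<host>(?:[a-z0-9\-]+\.)+[a-z]{2,6})"   # subdomains + top-level domain
    r"(?::(?P<port>\d+))?"                      # optional port
    r"(?P<rest>[a-z0-9?=$/.%#]*)",              # file path / query string
    re.IGNORECASE,
)

ALLOWED_SCHEMES = {"http", "https"}  # hypothetical whitelist

def check(text):
    """Match once, then validate the captured pieces one by one.
    Returns a list of problems; an empty list means every check passed."""
    m = URL_RE.fullmatch(text)
    if m is None:
        return ["no match"]
    problems = []
    scheme = m.group("scheme")
    if scheme is not None and scheme.lower() not in ALLOWED_SCHEMES:
        problems.append("unknown scheme: " + scheme)
    port = m.group("port")
    if port is not None and not 0 < int(port) <= 65535:
        problems.append("port out of range: " + port)
    return problems

print(check("xyzxyzxyz://foo.com"))
print(check("http://foo.com:8080/x"))
```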
MK27 is online now   Reply With Quote
All times are GMT -6. The time now is 09:25 AM.