| | #1 |
| Super Moderator Join Date: Sep 2001
Posts: 4,746
| Regular Expression Feedback
I know there are a ton online, but they're either not preg, or they just didn't work for me. I also wanted to write my own to be sure I thoroughly understood every part of it. Some of the examples had weird combinations, the reasoning for which I didn't understand.
I'm aware that this won't work if my hostname is local - that's okay, I don't need it to. It's also okay if it matches the occasional non-URL (within reasonable limits) - I just want to make sure it will match every real one.
- Optional protocol
- 1 or more subdomains, each ending with a '.'
- 1 top-level domain (no '.')
- Optional port
- Optional file-path/query string
Code:
/([a-z]{3,10}:\/\/)?([a-z0-9\-]+\.)+([a-z]{2,6})(:\d+)?[a-z0-9?=$\/\.%#]*/
edit: I will have the 'i' (case-insensitive) flag turned on |
| sean is offline | |
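For anyone who wants to try the pattern from post #1 outside PHP, here is a minimal sketch in Python rather than preg. The names `URL_RE` and `find_urls` and the sample sentence are made up for illustration, and it assumes Python's `re` module treats this particular pattern the same way PCRE does (it only uses basic constructs).

```python
import re

# The pattern from post #1, compiled with the case-insensitive flag
# mentioned in the edit note. Only the delimiting slashes are dropped.
URL_RE = re.compile(
    r"([a-z]{3,10}:\/\/)?([a-z0-9\-]+\.)+([a-z]{2,6})(:\d+)?[a-z0-9?=$\/\.%#]*",
    re.IGNORECASE,
)


def find_urls(text):
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in URL_RE.finditer(text)]


# Hypothetical sample text, not taken from the thread.
sample = "See http://www.Example.com:8080/a/b.php?x=1 or ftp.example.org/pub for details."
print(find_urls(sample))
# ['http://www.Example.com:8080/a/b.php?x=1', 'ftp.example.org/pub']
```

Note that "details." at the end of the sentence is not picked up, because a label followed by a dot still needs a top-level domain after it.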
| | #2 |
| Complete Beginner Join Date: Feb 2009
Posts: 312
| - your regexp catches a lot of invalid URLs, e.g. xyzxyzxyz://foo.com
- a valid DNS name may end in ".", e.g. http://www.google.com./
- URLs may contain username/password, e.g. http://<user>:<pass>@bar.com
There may be some points that I'm missing now.
Greets, Philip
__________________ All things begin as source code. Source code begins with an empty file. -- Tao Te Chip Last edited by Snafuist; 05-28-2009 at 12:38 PM. |
| Snafuist is offline | |
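The three points in post #2 can be checked mechanically. This is a rough sketch in Python, not PHP; `accepts` and `matched_part` are hypothetical helper names, and `user:pass` stands in for the `<user>:<pass>` placeholders.

```python
import re

# Same pattern as in post #1, case-insensitive.
URL_RE = re.compile(
    r"([a-z]{3,10}:\/\/)?([a-z0-9\-]+\.)+([a-z]{2,6})(:\d+)?[a-z0-9?=$\/\.%#]*",
    re.IGNORECASE,
)


def accepts(candidate):
    """True when the whole candidate string matches the pattern."""
    return URL_RE.fullmatch(candidate) is not None


def matched_part(candidate):
    """Return the first substring the pattern finds, or None."""
    m = URL_RE.search(candidate)
    return m.group(0) if m else None


# Any 3-10 letter word is taken as a protocol, so this is accepted.
print(accepts("xyzxyzxyz://foo.com"))
# The trailing dot gets through, but only because '.' is in the final
# character class, so it is consumed as part of the path, not the host.
print(accepts("http://www.google.com./"))
# Credentials are not covered: the full string is rejected, and a plain
# search only picks up the host at the end.
print(accepts("http://user:pass@bar.com"))
print(matched_part("http://user:pass@bar.com"))
```

The difference between `fullmatch` and `search` matters here: an unanchored pattern will quietly match a piece of a URL it cannot handle as a whole.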
| | #3 |
| Ethernal Noob Join Date: Nov 2001
Posts: 1,891
| There'd be a lot less room for error if you encapsulated a URI into an object or class and performed actions on its parts one at a time, as opposed to doing it all at once with an archaic (although at times useful) method of text parsing. I think in the modern world, a regular expression, even for something this simple, tends to be difficult to validate.
__________________ Here to Deceive, Inveigle, Obfuscate Since 1945 |
| indigo0086 is offline | |
| | #4 | |
| critical genius Join Date: Jul 2008 Location: SE Queens
Posts: 5,176
| Quote:
And how are you going to parse it into parts of an object? I would have thought there would be some existing library for this anyway. | |
| MK27 is online now | |
| | #5 | ||
| C++ Witch Join Date: Oct 2003 Location: Singapore
Posts: 11,324
| Quote:
Quote:
__________________ C + C++ Compiler: MinGW port of GCC Build + Version Control System: SCons + Bazaar Look up a C/C++ Reference and learn How To Ask Questions The Smart Way | ||
| laserlight is online now | |
| | #6 |
| Complete Beginner Join Date: Feb 2009
Posts: 312
| Some more hints: Here are some more URIs that you won't recognize (stolen from the German Wikipedia):
ldap://[2001:db8::7]/c=GB?objectClass?one
file://C:\UserName.HostName\Projects\Wikipedia_Articles\URI.xml
byond://BYOND.world.123456789
Furthermore:
- a URI may contain a fragment identifier: www.foo.com/bar.html#frag_ident
- characters may be encoded: www.foo.com/bar%20baz
- have a look at RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax and use an already existing URI parser... ;-)
Greets, Philip
| Snafuist is offline | |
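To illustrate the "use an already existing URI parser" advice from post #6, here is a small sketch using Python's standard `urllib.parse` module (the thread itself is about PHP, so this is a stand-in, not what anyone here used). `describe` and `decoded_path` are hypothetical helper names, and an `http://` prefix is assumed on the two `www.foo.com` examples, since without a scheme the parser treats the whole string as a path.

```python
from urllib.parse import urlsplit, unquote


def describe(uri):
    """Split a URI into the parts discussed in the thread."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


def decoded_path(uri):
    """Return the path with percent-encoded characters decoded."""
    return unquote(urlsplit(uri).path)


# IPv6 literal host and a query containing '?', from the post above.
print(describe("ldap://[2001:db8::7]/c=GB?objectClass?one"))
# Fragment identifier (scheme added as an assumption).
print(describe("http://www.foo.com/bar.html#frag_ident"))
# Percent-encoded space (scheme added as an assumption).
print(decoded_path("http://www.foo.com/bar%20baz"))
```

A parser like this only splits the string according to the generic syntax; deciding whether a given scheme or host is acceptable is still up to the caller.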
| | #7 | |
| critical genius Join Date: Jul 2008 Location: SE Queens
Posts: 5,176
| Quote:
Considering what PHP is supposed to be I'm surprised you would have to approach this task as if it were something new and unusual. Anyways, dollars to donuts it does come down to regexp's internally :P I have not even heard of another way of string parsing in widespread use. | |
| MK27 is online now | |
| | #8 | |
| C++ Witch Join Date: Oct 2003 Location: Singapore
Posts: 11,324
| Quote:
| laserlight is online now | |
| | #9 | |
| Complete Beginner Join Date: Feb 2009
Posts: 312
| Quote:
Greets, Philip
| Snafuist is offline | |
| | #10 | ||||||||
| Super Moderator Join Date: Sep 2001
Posts: 4,746
| Quote:
Quote:
Quote:
Quote:
I may need to break it down in the future - in which case I will probably redesign with a more object-oriented approach. Quote:
Quote:
Quote:
Quote:
edit: I added # and % to the possible characters in the file path. I might also add something like (<[a-z]+:[a-z]+>@)? after the protocol for the username and password | ||||||||
| sean is offline | |
| | #11 |
| critical genius Join Date: Jul 2008 Location: SE Queens
Posts: 5,176
| Still pretty sure that regexes are the ONLY solution, whether you do it yourself or use some function that will do it for you*. It is not as if you could drop the URI off a roof and hope it will land, properly segmented, into a magic object.
* Keeping that in mind, and that PHP's preg functions use Perl-compatible regexps, I would swing by perlmonks (you don't have to register, you can post anonymously) and put this question ("regexp for extracting url!") to them -- you will see more action than if you threw raw meat into a pool of circling sharks -- and assuming there is a module that does this, you could dig through it and have a look at the regexes there. |
| MK27 is online now | |
| | #12 | |
| C++ Witch Join Date: Oct 2003 Location: Singapore
Posts: 11,324
| Quote:
| laserlight is online now | |
| | #13 | |
| Ethernal Noob Join Date: Nov 2001
Posts: 1,891
| Quote:
If that was directed at me, that was hardly what I meant. If he had created a class, it could have taken in the full URL string. Within it he could have parsed the individual sections of the URL piece by piece: the domain, the protocol, the path, etc. Had any of those been malformed, an error could have been thrown. I'm not saying he shouldn't use regex at all, just that doing it all at once is prone to error. I have seen a regex that properly validated dates, including valid/invalid leap days. I am sure whoever wrote that could have done it in a smarter way. Maybe they were pretty clever to create such a regex, but I'd like to see that person refactor it. And I would hardly call finding a "god-like" regex the simplest or best solution, let alone the only one.
| indigo0086 is offline | |
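The class-based idea from post #13 might look roughly like this. It is a sketch in Python, not PHP, and everything in it is an assumption made for illustration: the names `Url` and `InvalidUrl`, the scheme whitelist, and the label rule. It leans on the standard `urlsplit` for the splitting and then checks each piece in turn, raising an error for the first malformed one.

```python
import re
from urllib.parse import urlsplit


class InvalidUrl(ValueError):
    """Raised when one part of the URL fails its check."""


class Url:
    # Hypothetical whitelist; the thread never says which schemes are wanted.
    ALLOWED_SCHEMES = {"http", "https", "ftp"}
    # One DNS label: letters, digits and inner hyphens (an assumed rule).
    LABEL_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?", re.IGNORECASE)

    def __init__(self, text):
        parts = urlsplit(text)  # split first, then check each piece in turn
        self.scheme = self._check_scheme(parts.scheme)
        self.host = self._check_host(parts.hostname)
        self.port = self._check_port(parts)
        self.path = parts.path

    def _check_scheme(self, scheme):
        if scheme not in self.ALLOWED_SCHEMES:
            raise InvalidUrl(f"unsupported scheme: {scheme!r}")
        return scheme

    def _check_host(self, host):
        if not host:
            raise InvalidUrl("missing host")
        # A single trailing dot is legal in a DNS name, so ignore it.
        name = host[:-1] if host.endswith(".") else host
        for label in name.split("."):
            if not self.LABEL_RE.fullmatch(label):
                raise InvalidUrl(f"bad host label: {label!r}")
        return host

    def _check_port(self, parts):
        try:
            return parts.port  # None when no port was given
        except ValueError as exc:  # non-numeric or out of range
            raise InvalidUrl(f"bad port: {exc}") from exc


url = Url("http://www.example.com:8080/index.php")
print(url.scheme, url.host, url.port, url.path)
# http www.example.com 8080 /index.php
```

With this layout, `Url("xyzxyzxyz://foo.com")` fails at the scheme check and `Url("http://foo.com:99999/")` at the port check, so the error says which part was malformed.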
| | #14 | |
| critical genius Join Date: Jul 2008 Location: SE Queens
Posts: 5,176
| Quote:
With a static string there is no difference between doing it all at once or doing it in pieces. Sean's OP has (parenthesized substrings) (in it), so you do the whole regular expression grab at once, then you can validate substring #1, and if it passes validate substring #2, etc. | |
| MK27 is online now | |
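The "grab once, then validate each substring" approach from post #14 could be sketched like this, again in Python rather than PHP. `GRAB_RE` is a variant of the pattern from post #1, not the original: the whole host is captured as one named group and the path gets its own group, because a repeated group only keeps its last repetition. The scheme whitelist and the return strings are likewise assumptions.

```python
import re

# Variant of the thread's pattern with named groups (an assumed rewrite).
GRAB_RE = re.compile(
    r"(?:(?P<scheme>[a-z]{3,10})://)?"
    r"(?P<host>(?:[a-z0-9\-]+\.)+[a-z]{2,6})"
    r"(?::(?P<port>\d+))?"
    r"(?P<path>[a-z0-9?=$/.%#]*)",
    re.IGNORECASE,
)

KNOWN_SCHEMES = {"http", "https", "ftp"}  # hypothetical whitelist


def validate(text):
    """Grab all parts in one match, then check them one after another."""
    m = GRAB_RE.fullmatch(text)
    if m is None:
        return "no match"
    scheme = m.group("scheme")
    if scheme is not None and scheme.lower() not in KNOWN_SCHEMES:
        return "bad scheme"
    port = m.group("port")
    if port is not None and not 0 < int(port) < 65536:
        return "bad port"
    return "ok"


print(validate("http://www.example.com:8080/a.php"))  # ok
print(validate("xyzxyzxyz://foo.com"))                # bad scheme
print(validate("http://foo.com:99999"))               # bad port
```

The single regex stays loose on purpose and only carves the string up; the stricter rules live in ordinary code after the match, which is easier to read and change than a tighter pattern.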