Help with regex matching URLs


journy101
04-16-2005, 08:54 PM
I'm completely new to regex and need help with matching a URL. I will have a file that contains URLs in possibly many formats. I am trying to write a regular expression that will allow me to find and store any URLs found.

First, I have started with this:

(http|https|ftp)://(\w{3}.)?\w+.\w{2,3}\s+

and this matches http://www.yahoo.com and http://www.yahoo.nl fine if there is a space afterward. But it includes the space in the result, and I don't want this space. If I remove the \s+ it will match http://www.yahoo.milk, everything up to the k, but obviously this is invalid and I don't want it to match at all if there is a k.

In the end, I hope to make a regexp that will match URLs in the formats

protocol://domain.xxx
ftp.domain.xxx
www.domain.xxx

so that if no protocol is supplied it will still match if it has a prefix that matches ftp, www, and possibly others I have not thought of.

I have looked for existing regular expressions on the net that match my criteria, but the only one I found did not work.

I need some assistance in solving the problem described above with my simple regex, where the space is necessary to exclude unwanted 4-character top-level domains but is not wanted in the result string, which will be stored in a hash.
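
One way to express "the domain must end here" without capturing the trailing space is a negative lookahead, which checks the next character but leaves it out of the match. The following is only a minimal sketch under several assumptions: it uses std::regex (ECMAScript syntax) in place of Boost.Regex so that it is self-contained, the helper name find_urls is hypothetical, the dots are escaped so they only match a literal ".", and the 2-3 letter suffix rule is taken from the post rather than from any standard.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every URL-like token found in `text`.
// Assumption: a URL is a protocol (http, https, ftp) or a bare "www."/"ftp."
// prefix, followed by dot-separated labels and a 2-3 letter suffix.
std::vector<std::string> find_urls(const std::string& text) {
    static const std::regex url_re(
        R"(\b(?:(?:https?|ftp)://|www\.|ftp\.))"   // protocol or bare prefix
        R"([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*)"     // one or more domain labels
        R"(\.[A-Za-z]{2,3})"                       // 2-3 letter suffix
        R"((?![A-Za-z0-9-]))");                    // next char must not extend the suffix

    std::vector<std::string> urls;
    for (std::sregex_iterator it(text.begin(), text.end(), url_re), end;
         it != end; ++it) {
        urls.push_back(it->str());  // the lookahead is not part of the match
    }
    return urls;
}
```

Called on a line such as "http://www.yahoo.com and http://www.yahoo.milk", this should return only the first URL, with no trailing space, because the lookahead rejects a suffix that is followed by another letter. It makes no attempt to handle paths, ports, or query strings.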

Jeff Mott
04-16-2005, 09:06 PM
A regular expression for URIs, when done correctly, will be very long and very complicated. Fortunately, you don't have to worry about writing one yourself. The module URI::Find (http://search.cpan.org/~rosch/URI-Find-0.15/lib/URI/Find.pm) will do it for you.

journy101
04-17-2005, 11:56 AM
URI::Find sounds great; however, I will be using C++ and a library, boost::regex++, to accomplish the regex. I should have mentioned this before. I posted here on a Perl forum only because it seems that regex is part of Perl and that I might find the answers here.

Sorry for this lack of information.

Jeff Mott
04-17-2005, 05:30 PM
Afraid I don't know of any pre-existing URI pattern libraries for C++, so you may in fact have to develop your own. The best I can do is point you to the URI syntax rules.

http://www.ietf.org/rfc/rfc2396.txt

About 2/3 of the way down are BNF rules that define a valid URI.
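
Besides the BNF, Appendix B of RFC 2396 gives a short regular expression that splits a URI reference into its components (it does not validate them). The sketch below wraps that expression with std::regex, used here in place of Boost.Regex only to keep the example self-contained; the UriParts struct and the split_uri function are hypothetical names for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical container for the five components named in RFC 2396.
struct UriParts {
    std::string scheme, authority, path, query, fragment;
};

// Splits `uri` using the expression from RFC 2396, Appendix B.
// Returns false only if the expression fails to match the whole string.
bool split_uri(const std::string& uri, UriParts& out) {
    static const std::regex re(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");
    std::smatch m;
    if (!std::regex_match(uri, m, re)) {
        return false;
    }
    out.scheme    = m[2].str();  // group 2: scheme without the ':'
    out.authority = m[4].str();  // group 4: host part without the '//'
    out.path      = m[5].str();  // group 5: path
    out.query     = m[7].str();  // group 7: query without the '?'
    out.fragment  = m[9].str();  // group 9: fragment without the '#'
    return true;
}
```

Note that this expression treats a string with no protocol, such as www.domain.xxx, as a bare path with an empty authority, so bare www. or ftp. prefixes would still need separate handling.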

journy101
04-18-2005, 04:38 PM
Thanks, I will keep you updated as I make progress. The program I am developing locates the fastest download mirror by pinging each one consecutively, then examining response times. I have all the component libraries I will need to construct this program; all that is left is for me to make it. The parsing will perhaps be my greatest learning experience.

Thanks for the link, it will prove useful.