journy101
04-16-2005, 08:54 PM
I'm completely new to regex and need help with matching a URL. I will have a file that contains URLs in possibly many formats. I am trying to write a regular expression that will allow me to find and store any URLs found.
I have started with this:
(http|https|ftp)://(\w{3}.)?\w+.\w{2,3}\s+
This matches http://www.yahoo.com and http://www.yahoo.nl fine if there is a space afterward, but it includes the space in the result, and I don't want that space. If I remove the \s+, it matches http://www.yahoo.milk up to (but not including) the k. That URL is obviously invalid, and I don't want it to match at all when the k is there.
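One possible way around the trailing-space problem is a lookahead, which checks what follows without including it in the match. The sketch below is in Python only as an assumption, since no language is fixed here; the name `find_urls` is made up for illustration. It also escapes the dots, because an unescaped `.` matches any character rather than a literal dot.

```python
import re

# Same pattern as above, with two changes:
#   * "\." instead of "." so only a literal dot matches
#   * "(?=\s|$)" instead of "\s+": the lookahead requires whitespace or
#     end of input after the ending, but does not consume it
URL_RE = re.compile(r"(?:http|https|ftp)://(?:\w{3}\.)?\w+\.\w{2,3}(?=\s|$)")


def find_urls(text):
    """Return every match as a string, with no trailing whitespace."""
    return URL_RE.findall(text)


if __name__ == "__main__":
    sample = "try http://www.yahoo.com or http://www.yahoo.nl not http://www.yahoo.milk ok"
    print(find_urls(sample))
```

With this version http://www.yahoo.milk is rejected entirely, because after "mil" (or "mi") the next character is neither whitespace nor the end of the text. A URL followed directly by punctuation, such as a full stop, would also be rejected, so the lookahead may need loosening for real files.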
In the end, I hope to make a regexp that will match URLs in these formats:
protocol://domain.xxx
ftp.domain.xxx
www.domain.xxx
so that if no protocol is supplied, it will still match as long as it has a prefix such as ftp or www, and possibly others I have not thought of.
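The three formats above could be covered with an alternation: either a protocol followed by ://, or one of the known prefixes. This is only a sketch under assumptions: Python is chosen for illustration, `collect_urls` is a hypothetical helper, and a dict keyed by URL stands in for the hash.

```python
import re

URL_RE = re.compile(
    r"(?<![\w.])"                             # do not start in the middle of a word
    r"(?:(?:https?|ftp)://|(?:www|ftp)\.)"    # protocol://, or a known prefix
    r"(?:\w+\.)+"                             # one or more dot-terminated labels
    r"\w{2,3}"                                # 2 or 3 character ending
    r"(?=\s|$)"                               # whitespace or end, not consumed
)


def collect_urls(text):
    """Map each matched URL to the number of times it appears."""
    found = {}
    for match in URL_RE.finditer(text):
        url = match.group(0)
        found[url] = found.get(url, 0) + 1
    return found


if __name__ == "__main__":
    sample = "see www.domain.com and ftp.domain.org and http://domain.net and www.domain.com"
    print(collect_urls(sample))
```

Adding another prefix means extending the `(?:www|ftp)` group. Note that `\w` does not match hyphens, so a name like my-site.com would be missed by this sketch.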
I have looked on the net for existing regular expressions that match my criteria, but the only one I found did not work.
I need some assistance in solving the problem described above with my simple regex: the space is necessary to exclude unwanted 4-character endings such as .milk, but it is not wanted in the result string, which will be stored in a hash.