Need a regex to check form submission url format


dancerman
11-13-2008, 10:26 AM
I want the simplest possible regex to check the format of my form submission URL string. I don't care whether it is an actual working URL, just that it is in proper URL format and, IF POSSIBLE, that valid URL characters are the only characters entered in the field:
so legit users would get an error message prompting them to enter their link in the proper format;
on the other hand, it should not allow spammers to enter tons of text in the url field.
I have used the following regex to check that characters are supplied in addition to the http:// value in the form field. It works with simple urls, but it does not work with urls such as
http://somesubdomain-somesitename.org/somepagename.htm
if ($FORM{'url'} eq 'http://' || $FORM{'url'} !~ /^(f|ht)tp:\/\/\w+\.\w+/) {
&no_url;
}
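
A quick sketch of why that check rejects the hyphenated address: \w matches only letters, digits, and underscore, so \w+ stops at the hyphen and the following \. has nothing to match. The helper name passes_original_check is made up for illustration; the pattern itself is copied from the post.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern from the post above.
sub passes_original_check {
    my ($url) = @_;
    return $url =~ /^(f|ht)tp:\/\/\w+\.\w+/ ? 1 : 0;
}

for my $url ('http://example.com/page.htm',
             'http://somesubdomain-somesitename.org/somepagename.htm') {
    # The first address passes; the second fails at the hyphen.
    printf "%-60s %s\n", $url, passes_original_check($url) ? 'ok' : 'rejected';
}
```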

Sixtease
11-14-2008, 02:20 AM
This task may be harder than would seem. Mainly because there's pretty much no limit on the length of URL and there is also no limitation on the characters used (as far as I know). A spammer can thus enter a valid URL:
http://foo.com? good pr0n: www.sex.cn
Furthermore, are you sure you don't want to allow URL's like this?
username.livejournal.com/13245?mode=reply
There's no protocol part and pretty much anything can follow the question mark or hash.

What I'd probably do is I'd either employ captcha or limit the size of URL to, say, 64 chars or employ a spam filter.
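
A minimal sketch of the size-limit idea, assuming the 64-character figure mentioned above. The whitespace test is an extra assumption (not part of the suggestion) that happens to catch the spam example, and the helper name url_field_is_sane is hypothetical.

```perl
use strict;
use warnings;

my $MAX_URL_LENGTH = 64;    # the figure suggested above; tune to taste

# Hypothetical helper: true if the field is short enough and has no whitespace.
sub url_field_is_sane {
    my ($url) = @_;
    return 0 unless defined $url;
    return 0 if length($url) > $MAX_URL_LENGTH;
    return 0 if $url =~ /\s/;    # assumption: a submitted link never contains spaces
    return 1;
}

print url_field_is_sane('http://foo.com? good pr0n: www.sex.cn') ? "ok\n" : "rejected\n";
```

This says nothing about whether the text is a well-formed URL; it only bounds how much a spammer can fit in the field.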

tobiaseichner
11-16-2008, 12:38 PM
Yes, this is a hard task. By chance I found the following regular expression at a discussion board recently. I haven't tested it thoroughly, but it looks good enough to give it a try:


^s?https?:\/\/[-_.!~*'()a-zA-Z0-9;\/?:\@&=+\$,%#]+$
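
A small sketch of how that expression could be tried out, with the pattern copied exactly as posted. Note that the leading s? as written permits an optional literal "s" before "http"; whether that was intended is not clear from the post, so it is left untouched here. The variable name $url_re is just for illustration.

```perl
use strict;
use warnings;

# Pattern copied as posted above, including the leading "s?".
my $url_re = qr/^s?https?:\/\/[-_.!~*'()a-zA-Z0-9;\/?:\@&=+\$,%#]+$/;

for my $url ('http://somesubdomain-somesitename.org/somepagename.htm',
             'http://foo.com? good pr0n: www.sex.cn') {
    # The hyphenated address is accepted; the one containing spaces is not.
    print "$url => ", ($url =~ $url_re ? 'accepted' : 'rejected'), "\n";
}
```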

dancerman
11-17-2008, 09:48 AM
I see that I should ask for help on checking a user's url format.
There will be unwanted urls, and there is nothing much to be done when the url is valid. Users of my various submission forms will be entering simple urls to their activity websites. Since they are not as a group all that proficient using computers or typing valid url format I am hoping to create a perl regex to check that the url field has characters in a valid url format.
Does anyone have a working perl url format checker regex?

Nedals
11-18-2008, 08:43 PM
..Since they are not as a group all that proficient using computers or typing valid url format

Given the above, you could assume that the url will be as simple as: www.domain.com
or you may get username.domain.com, sometimes preceded by http:// or https://
and you may get a country code: www.domain.com.uk

Here's a regex that will handle that lot and will eliminate any query string: '?query=string'

my $tested_url = ( $url=~/^((http:\/\/|https:\/\/)?\w+\.\w+(\.\w{2,4})?\.\w{2,4}).+$/ ) ? $1 : 'invalid';

(http:\/\/|https:\/\/)?   check for http:// or https:// (? makes it optional)
\w+\.\w+                  check www.domain -or- username.domain
(\.\w{2,4})?              check .com, .net, .biz, etc (2 to 4 chars only)
\.\w{2,4}                 check .uk or other country code (2 to 4 chars only)
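
To see what the pattern returns in practice, here is a small sketch that wraps it in a sub (the name tested_url is made up) and feeds it a few sample addresses. One thing it shows: the trailing .+ demands at least one character after the domain, so a bare www.domain.com comes back one letter short.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern from the post above.
sub tested_url {
    my ($url) = @_;
    return ( $url =~ /^((http:\/\/|https:\/\/)?\w+\.\w+(\.\w{2,4})?\.\w{2,4}).+$/ ) ? $1 : 'invalid';
}

print tested_url('http://www.domain.com/index.html?query=string'), "\n";  # http://www.domain.com
print tested_url('www.domain.co.uk/'), "\n";                              # www.domain.co.uk
print tested_url('www.domain.com'), "\n";                                 # www.domain.co
```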

dancerman
11-19-2008, 12:17 AM
THANKS!
Your code is right on for my set-up.
Let me say that in poking around the 'net and hacking at the regex myself, it is clear that most attempts at this are really feeble and/or way over the top for 99.99% of cases. In my case, it is not critical that the url actually be valid, as it is their problem if potential customers cannot connect to their site. So I'm interested only in helping with basic format checking, not in rewriting the url, which is to say that query strings, etc. should trigger an "invalid url, UR not posting that on my site" error msg.
I had a submission like this: http://www.somedomain-anevent.org/eventgflyer.html
so while most urls submitted are simple, there are more of the subdomain.domain.XXX/index.html variety.
That said, I do have a question. My forms require the http or https prefix for the link to be active on the form results page, but the www is optional, since it is not always required (this appears to be a little-known secret).

So, the \w+\.\w+ part seems to present a potential problem: when there is no www, the \. would seem better written as \.? (sometimes I have seen that as (\.)?), and then there is that pesky hyphen, which for sure is optional. So how do I allow for the hyphen and the dot after the optional www?
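
One possible way to handle both the hyphen and the optional www, as a rough sketch rather than a definitive answer: treat the host as a run of dot-terminated labels built from [\w-], so "www." is just one more label that may or may not be there. The pattern below is hypothetical; it assumes the protocol is required and that query strings should be refused, as described above.

```perl
use strict;
use warnings;

# Hypothetical pattern: protocol required, hyphens allowed in host labels,
# "www" is just another optional label, query strings are rejected.
my $url_re = qr{
    ^ https?://            # required protocol
    (?: [\w-]+ \. )+       # one or more labels such as "www." or "some-name."
    [a-z]{2,}              # final label: com, org, uk, ...
    (?: / [\w./~-]* )?     # optional path, with no question mark or hash
    \z
}xi;

for my $url ('http://www.somedomain-anevent.org/eventgflyer.html',
             'http://somedomain-anevent.org',
             'http://foo.com/page?x=1') {
    print "$url => ", ($url =~ $url_re ? 'ok' : 'invalid'), "\n";
}
```

Because [\w-] also admits underscores and leading hyphens, this is looser than real hostname rules; it only checks the general shape.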

Sixtease
11-19-2008, 01:36 AM
Nice concept, but it needs a bit of debugging. At the very least, the pattern should look for the end of the domain name. See what this does:
$url = "http://www.username.hosting.co.uk/"; # valid in my book
my $tested_url = ( $url=~/^((http:\/\/|https:\/\/)?\w+\.\w+(\.\w{2,4})?\.\w{2,4}).+$/ ) ? $1 : 'invalid';
print $tested_url;

# output:
http://www.username.host # ouch!
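
A sketch of one way to look for the end of the domain name: keep the same structure, but replace the trailing .+ with a lookahead that demands either a slash or the end of the string right after the last domain part. The sub name tested_url is made up, and this is only one possible repair.

```perl
use strict;
use warnings;

# Same structure as the pattern above, but the trailing ".+" is replaced by a
# lookahead requiring a slash or the end of the string after the domain.
sub tested_url {
    my ($url) = @_;
    return $url =~ m{^((?:https?://)?\w+\.\w+(?:\.\w{2,4})?\.\w{2,4})(?=/|\z)}
        ? $1 : 'invalid';
}

print tested_url('http://www.domain.co.uk/index.html'), "\n";   # http://www.domain.co.uk
print tested_url('www.domain.com'), "\n";                       # www.domain.com
print tested_url('http://www.username.hosting.co.uk/'), "\n";   # invalid
```

With this change the five-part host is reported as invalid instead of being silently cut short. Accepting it would need a repeated label group rather than a fixed number of parts.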

Nedals
11-19-2008, 03:50 PM
As you said earlier...
This task may be harder than would seem.:)

dancerman,
You need to decide what formats you wish to allow. The code and notes I provided should give you enough to construct the appropriate regex. If you get in trouble, post back and one of us will set you straight. :)

PolyGreat
12-08-2008, 08:13 AM
I'm just writing this piece right here in the message box, so this is raw and untested, but it might work for you.


if ($myURL =~ m{
    (
      ( (http|https)           # Accept, optionally, either form of HTTP address
        (:\/\/) )?             # along with the matching colon and double slash
      ((\w|-|\d)+\.)+          # One or more words, with hyphens or digits,
                               # each of which is followed by a period
      (\w+)                    # Some ending like com, net, org, tv, ru, tw, etc.
      (\~|\/|\w|\d|\.|-)*      # Some additional directory, file, or script
      (\?.*(?<!\p{IsSpace}))?  # Optionally, more non-white-space stuff
      # (This last line could be commented out to eliminate the script directives.)
    )
}xi ) {

    # Braces are used as the delimiter because a slash inside an x-mode
    # comment would otherwise end the pattern early.
    # If all that matched, then do something here like:
    $properURL = $1;

}


Blessings,

PolyGreat