How to scan the text from a coming web page?


Valzara
07-06-2005, 03:51 AM
Hi all,

I am implementing a web browser with a filtering function, which means certain preset web pages will be blocked when using my browser.

I have done the URL address filtering: if the URL in the address bar is the same as a preset URL, that web page will be blocked.

My next objective is keyword filtering: the browser will scan through the text on the incoming web page, and if too many keywords are found, the page will be considered harmful and will be blocked.

Since my URL filtering uses Java I/O, my idea is: can I use Java I/O for the keyword filtering as well? Scan through the text of the web page and compare it with the preset words in the text file; if the comparison succeeds (meaning a keyword is found) more than 5 times, the website is considered harmful and will be blocked.

But the main problem for me is that I do not know how to make my browser scan through the text of the incoming web page. Does anyone here know how to do this? Please teach me, and give me an example if there is one.
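
The counting rule described above can be sketched roughly as follows. This is only an illustration under assumptions: `KeywordFilter`, `count`, and `isHarmful` are hypothetical names, the keyword list is hard-coded instead of being read from the text file, and matching is done case-insensitively on substrings, which the original post does not specify.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the "more than 5 hits means blocked" rule.
final class KeywordFilter {

    // Counts non-overlapping, case-insensitive occurrences of word in text.
    static int count(String text, String word) {
        if (word.isEmpty()) {
            return 0; // an empty keyword would otherwise match everywhere
        }
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = word.toLowerCase(Locale.ROOT);
        int hits = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            hits++;
            from += needle.length();
        }
        return hits;
    }

    // True when the total number of keyword hits exceeds the threshold.
    static boolean isHarmful(String text, List<String> keywords, int threshold) {
        int total = 0;
        for (String keyword : keywords) {
            total += count(text, keyword);
        }
        return total > threshold;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("gun", "bomb"); // assumed keyword list
        String page = "gun gun gun gun gun gun";              // assumed page text
        System.out.println(isHarmful(page, keywords, 5));     // six hits, so true
    }
}
```

Note that the text passed in has to be the page body itself; how to obtain it is the open question in this post.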

THANKS :) :)

buntine
07-06-2005, 07:24 AM
I once wrote a code editor in Java (Oak may remember it) that had some keyword highlighting. I will dig it up and see how I managed to scan a body of text looking for keywords.

Regards.

buntine
07-06-2005, 08:18 AM
Ok, I wrote a class named FindAndReplace to, obviously, find certain tokens and replace them. A key method you will need to look at is called search and takes three parameters.

It is a large file, so I will attach it. A lot of it can be disregarded in your case.

Regards

Valzara
07-06-2005, 12:26 PM
buntine, you are my savior.

I really need to thank you for giving me the idea of using Java I/O for the URL filtering. Without you I really wouldn't have known how to do it.

Now I will try out the method you mentioned (however, it seems strange to me...)

*salute* !!! !!!

buntine
07-06-2005, 07:18 PM
No worries.

Valzara
07-07-2005, 02:27 AM
Hi buntine,

I went through your program. In the 'Search' class, which you use to find the keyword, the code that declares the text body is:

int docLength = text.length();

Is that right?

This will scan through the body of text, right? But I am confused: does that mean the target text document to be scanned is named 'text'? If I want to scan through the text in a web page, can I use:

(example)
int docLength = URL.length();

I am just unsure whether I can simply use this to capture and scan the text in an incoming web page. Please tell me, THANKS. *bow*!!!

Valzara
07-10-2005, 12:20 PM
Hi buntine,

I have been working on the text scanning function; the 'search' function in your 'FindAndReplace' is very helpful for me.

However, when I finished the function, it did not seem to scan the text in the incoming web page. The function does nothing, although my code 'seems' logical.

In my file, I put

for (int i=0; i<(docLength-wordLength)+1; i++)
{
String temp = url.substring(i, (i+wordLength)); ... ...

But it seems it doesn't catch the text from the url...

Can you teach me a solution? I have been stuck at this point for 2 days already (it is not even an error; I just don't know why it can't catch the text from the incoming web page and scan it). If you know what's wrong with this code, please tell me. THANK YOU !!! :) :) :) *bow* *bow* *bow*
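
For illustration, a loop of this shape can be completed as in the sketch below. The rest of the original loop body is not shown in the post, so the comparison and counter here are assumptions, and `WindowScan` and `countMatches` are hypothetical names. The sketch also assumes that `url` in the snippet above holds the address; a loop like this only ever sees the string it is handed.

```java
// Hypothetical completion of the sliding-window loop; names are illustrative.
final class WindowScan {

    // Slides a window of word.length() characters over text and counts exact matches.
    static int countMatches(String text, String word) {
        int docLength = text.length();
        int wordLength = word.length();
        if (wordLength == 0) {
            return 0; // nothing sensible to search for
        }
        int found = 0;
        for (int i = 0; i < (docLength - wordLength) + 1; i++) {
            String temp = text.substring(i, i + wordLength);
            if (temp.equals(word)) {
                found++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String address = "http://example.com/page.txt";  // assumed result of url.toString()
        String body = "He drew a gun. The gun jammed.";   // assumed page text
        System.out.println(countMatches(address, "gun")); // 0: the address has no "gun"
        System.out.println(countMatches(body, "gun"));    // 2
    }
}
```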

BTW, the place where I implement the recognizing and filtering function is in 'dcToolBar'. :)

buntine
07-10-2005, 10:54 PM
Have you done some debugging like printing the value of a few of the numeric variables?

Try some things like:

System.out.println(Integer.toString((docLength-wordLength)+1));

Regards.

Valzara
07-11-2005, 05:59 AM
Oh, you mean try to print out how many keywords are caught in the body text?
Let me try that first, because the main problem for me is that I can't catch the text from the web page itself.

Thanks !!! *salute*

Ah, I forgot to ask this: is my code for the text scanning OK, logically speaking?

Valzara
07-12-2005, 12:00 PM
Hi buntine,

Finally I have cleared out all the little errors, and I got the browser running again. But when I try out the keyword filtering part, the browser hangs and does nothing. It does not even print out how many keywords were caught.

I put the word 'gun ' in Keyword.txt, then went to this site:
http://db.gamefaqs.com/console/ps2/file/grand_theft_auto_sa_r.txt
(which contains the word 'gun' many times)

When I access this site, the browser simply hangs and does nothing. But when I try to access other websites, oh my god, it shows some strange code on the page and the page is not displayed properly... I tried to access Yahoo; Yahoo shows up but is not displayed properly, with lots of strange code on the site.

So does my code for the keyword scanning have a problem? Or is it not logically correct? Please tell me, I have no idea what's wrong with my code.

p/s: When you try to run the program, do normal websites like Yahoo display properly?

Sorry for taking up your time, but I am really pulling my hair out over this. Please tell me what's wrong with the keyword scanning function, THANKS !!!! *bow* *bow* *bow*

Valzara
07-14-2005, 01:17 AM
hello....anybody home?

Valzara
07-19-2005, 09:42 AM
Oh my god, I tried some other methods in my project, like moving the keyword scanning function into a new try & catch statement, but ended up with the same result: it will not scan and catch the text that appears on a web page.

I was wondering about this. Do this line

String url2 = url.toString();

and this line

int docLength = url2.length();

mean that .length only catches the address of the URL, not the entire body of the page at that URL??

Can somebody help me please, please teach me something on this problem...!!! buntine, my code refers to the 'search' function in your FindAndReplace, so how come the same solution does not scan and catch the text in a web page?

buntine
07-19-2005, 03:27 PM
How are you storing the text? As one variable? If so, I would double check exactly what is stored within that variable.

This is a very large problem and I simply cannot devote the time required to solve this problem for free. Time does not permit.

I had a look through your code and nothing came to mind. Maybe you should restart if you're not getting anywhere with this code.

The .length function will return the number of characters in any given string.
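
A small sketch of what that means for a URL object, assuming the `url` variable in the earlier snippet is a `java.net.URL` (the thread does not confirm its type). `LengthDemo` and `addressLength` are hypothetical names, and the example address is made up.

```java
import java.net.URI;
import java.net.URL;

// Hypothetical demo: toString() on a URL yields the address, nothing more.
final class LengthDemo {

    // Returns the number of characters in the address string,
    // not the size of the page the address points to.
    static int addressLength(URL url) {
        return url.toString().length();
    }

    public static void main(String[] args) throws Exception {
        URL url = URI.create("http://example.com/index.html").toURL();
        System.out.println(url.toString());     // prints the address itself
        System.out.println(addressLength(url)); // length of that address string
    }
}
```

No network connection is made here; the page body never enters the picture.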

Regards.

Valzara
07-19-2005, 11:44 PM
buntine,

Oh, anyway, I did not ask you to solve the problems for me; what you have done is already a salvation to me. :) *salute* *salute*

Hm, since you said the .length function will return the number of characters in any given string, will what I did in my code just return the length of the URL address only? Because it's only the length of the URL string.


However, would this be possible?
Store the HTML file (the web page itself) in a text file, and only then scan that text file for keywords.
In fact, I really doubt that the .length function automatically goes and catches the text of the entire web page?

buntine
07-20-2005, 12:16 AM
No. The length() function will only return a number representing the number of characters in a String (or another permitting object).

You should be able to pass the text itself (just the variable) to the search function.
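
One possible way to get "the text itself" into a variable is to read the page body from a stream opened on the URL. This is a minimal sketch, not the code from this thread: `PageFetcher`, `readAll`, and `fetch` are hypothetical names, and UTF-8 is assumed as the page encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the browser discussed in this thread.
final class PageFetcher {

    // Reads everything from a Reader into one String.
    static String readAll(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Downloads the body of the page at the given URL (assumes UTF-8 text).
    static String fetch(URL url) throws IOException {
        try (Reader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            return readAll(in);
        }
    }
}
```

The String returned by `fetch` is raw HTML, so it still contains tags and scripts; it could then be handed to a search routine in place of the address string.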

Regards.

Valzara
07-20-2005, 04:15 AM
Ooo, no wonder it never catches the body text of a web page...

So what if I do it locally? I mean, store the HTML in a text file, then use a function similar to search in FindAndReplace to scan through the body text?
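
The local approach could look roughly like the sketch below. It is an assumption-laden illustration: `LocalScan`, `save`, and `countInFile` are hypothetical names, the HTML is a made-up literal instead of a downloaded page, and a plain `indexOf` count stands in for the search function from FindAndReplace.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: store page text in a file, then scan the file.
final class LocalScan {

    // Writes the page text to a file (the "store it locally" step).
    static void save(Path file, String html) throws IOException {
        Files.write(file, html.getBytes(StandardCharsets.UTF_8));
    }

    // Reads the file back and counts non-overlapping occurrences of keyword.
    static int countInFile(Path file, String keyword) throws IOException {
        if (keyword.isEmpty()) {
            return 0;
        }
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        int hits = 0;
        int from = 0;
        while ((from = text.indexOf(keyword, from)) != -1) {
            hits++;
            from += keyword.length();
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("page", ".html");
        save(file, "<p>gun</p><p>gun</p>");           // assumed page content
        System.out.println(countInFile(file, "gun")); // 2
        Files.delete(file);
    }
}
```

Writing to disk is not strictly needed for the scan, since the same String could be searched in memory, but a saved copy makes it easy to inspect exactly what was scanned.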

THANKS.