I'd like to make an online index of existing web pages.
The website to index is not mine, but it doesn't have a search tool and won't have one anytime soon.
I can download all the pages to my local computer and turn them into WordPress pages (I'm good at WordPress, but not at SQL), but I think my missing link is how to correlate the content with the real online page. An existing tool or system for indexing pages would probably fill that gap, because I only need the content to create the index. After that, the content is useless.
So the search results should link to the original website, not to the site I'll put online, which will only be a search form.
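To make the idea concrete, here is a minimal sketch of the kind of index I mean, assuming the pages are already downloaded and that each local path mirrors the path on the original site. The names `build_index`, `search`, and `TextExtractor`, and the base URL `https://example.org/`, are made up for illustration; this is not how Sphider or any particular tool does it.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class TextExtractor(HTMLParser):
    """Collects the text of an HTML document, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def build_index(pages, base_url):
    """Map each word to the set of ORIGINAL URLs whose page contains it.

    `pages` is a dict of {relative path: HTML text}; the relative path is
    assumed to be the same as on the original site (hypothetical layout).
    """
    index = defaultdict(set)
    for rel_path, html in pages.items():
        parser = TextExtractor()
        parser.feed(html)
        parser.close()
        url = urljoin(base_url, rel_path)  # link back to the real page
        for word in re.findall(r"[a-z0-9]+", " ".join(parser.parts).lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted original URLs that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    return sorted(set.intersection(*(index.get(w, set()) for w in words)))
```

Once the index is built, the downloaded HTML can be thrown away: only the word-to-URL mapping has to live behind the search form.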
Ok, I've tried the script:
The problem is that I can't get it to identify itself with a browser's user agent. It keeps connecting as a "robot" and obeying the robots.txt file, so it fails to index the pages marked as disallowed... or at least so says the error message: "File checking forbidden by required/disallowed string rule".
I tried changing some if conditions so that it would NOT find the robots file, or would ignore it, but that didn't work. I also tried a mod I found online to "ignore robots", but it did the same, except there was no error; it just ended. Sphider-plus (1.6) did the same.
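For comparison, this is roughly the behaviour I'm after, sketched outside Sphider with Python's standard `urllib`. The user-agent string, `make_request`, and `fetch` are assumptions made up for illustration, not anything taken from the script. As far as I know, `urllib.request` never reads robots.txt on its own, so a fetch like this is neither blocked nor guided by it.

```python
import urllib.request

# Hypothetical browser-like user agent string; any current browser's would do.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def make_request(url, user_agent=BROWSER_UA):
    """Build a request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download one page and return its text (needs network access)."""
    with urllib.request.urlopen(make_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```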
If anyone knows how to hack it, I'd appreciate the tip.
Thanks.