
PHP web crawler: index pages based on meta tags only?

Folks,

As you know, I am trying to build my own web crawler.
I am not too bothered about the ranking algorithm; I have my own up my sleeve. Besides, I can worry about ranking once the indexing is finished, and that won’t happen until I complete the crawler.
I was originally planning to associate keywords/phrases with a link on your website (e.g. the site homepage) using three sources: the anchor texts of all links on your own website that point to that link, the anchor texts of all links on other sites (foreign domains) that point to it, and the keywords/phrases found in the page’s own meta tags.
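To make the plan above concrete, here is a minimal sketch using PHP's built-in DOM extension. The function names `extractMetaPhrases` and `extractAnchorTexts` and the URLs are hypothetical, made up for illustration. It assumes the HTML has already been fetched, that links are written as absolute URLs (relative hrefs are not resolved), and that the text is plain ASCII.

```php
<?php
// Hypothetical helper: load HTML into a DOMDocument, tolerating the
// malformed markup that real-world pages often contain.
function loadDom(string $html): ?DOMDocument
{
    if (trim($html) === '') {
        return null; // DOMDocument::loadHTML rejects empty input
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Hypothetical helper: phrases from the page's own meta keywords/description.
function extractMetaPhrases(string $html): array
{
    $doc = loadDom($html);
    if ($doc === null) {
        return [];
    }
    $phrases = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        $content = trim($meta->getAttribute('content'));
        if ($content === '') {
            continue;
        }
        if ($name === 'keywords') {
            // Meta keywords are conventionally comma-separated.
            foreach (explode(',', $content) as $keyword) {
                $keyword = strtolower(trim($keyword));
                if ($keyword !== '') {
                    $phrases[] = $keyword;
                }
            }
        } elseif ($name === 'description') {
            $phrases[] = strtolower($content);
        }
    }
    return $phrases;
}

// Hypothetical helper: anchor texts of links in $html that point at $targetUrl.
// Only a trailing slash is normalised; relative URLs are NOT resolved here.
function extractAnchorTexts(string $html, string $targetUrl): array
{
    $doc = loadDom($html);
    if ($doc === null) {
        return [];
    }
    $target = rtrim($targetUrl, '/');
    $texts = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if (rtrim($anchor->getAttribute('href'), '/') !== $target) {
            continue;
        }
        // Collapse runs of whitespace inside the anchor text.
        $text = trim((string) preg_replace('/\s+/', ' ', $anchor->textContent));
        if ($text !== '') {
            $texts[] = strtolower($text);
        }
    }
    return $texts;
}
```

The phrases for one link could then be merged with something like `array_unique(array_merge($metaPhrases, $internalAnchors, $foreignAnchors))`, keeping internal and foreign anchors in separate arrays if they should be weighted differently later.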
But what if a new page appears on your website and it has no meta tags and no foreign domains linking to it? In that case, the only keywords/phrases I can associate with the new link are the anchor texts of the links that point to it from your own website. If only a handful of other pages link to it, then only a handful of keywords/phrases get associated with it. That is no good.
I know Google and the like analyse the crawled page’s content using word synonyms and associate those synonym keywords with the page. That way, the new page has a better chance of being found under any of those keywords. But I am not getting into synonyms yet.
And so my main question is:
What other data should I associate with a crawled page, apart from the anchor texts of all internal pages linking to it and the page’s own meta tags?
