
cURL Web Crawler & XML Sitemap Questions

cURL PHP folks,

Do you know of a good cURL tutorial that shows how to build a web crawler that can crawl all kinds of link formats, such as the ones handled here?

[code]
foreach ($linklist as $link) {
    $l = $link->getAttribute("href");
    // Process all of the links we find. This is covered in part 2 and part 3 of the video series.
    if (substr($l, 0, 1) == "/" && substr($l, 0, 2) != "//") {
        // Root-relative link: prepend scheme and host.
        $l = parse_url($url)["scheme"]."://".parse_url($url)["host"].$l;
    } else if (substr($l, 0, 2) == "//") {
        // Protocol-relative link: prepend the scheme only.
        $l = parse_url($url)["scheme"].":".$l;
    } else if (substr($l, 0, 2) == "./") {
        // Link relative to the current directory.
        $l = parse_url($url)["scheme"]."://".parse_url($url)["host"].dirname(parse_url($url)["path"]).substr($l, 1);
    } else if (substr($l, 0, 1) == "#") {
        // Fragment on the current page.
        $l = parse_url($url)["scheme"]."://".parse_url($url)["host"].parse_url($url)["path"].$l;
    } else if (substr($l, 0, 3) == "../") {
        // Parent-directory link. Note: the "../" segment is not resolved here.
        $l = parse_url($url)["scheme"]."://".parse_url($url)["host"]."/".$l;
    } else if (substr($l, 0, 11) == "javascript:") {
        continue;
    } else if (substr($l, 0, 5) != "https" && substr($l, 0, 4) != "http") {
        // Anything else without a scheme is treated as relative to the site root.
        $l = parse_url($url)["scheme"]."://".parse_url($url)["host"]."/".$l;
    }
}
[/code]

I got the above code from a YouTube tutorial. Author: https://howcode.org.

Remember, the tutorial must show how to write a web crawler that can crawl links no matter what format they're in and no matter what directory the crawler is in on the website: absolute links, relative links, links with "../" and so on. You know what I mean. The above code tells it all.
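To make the requirement concrete, here is a minimal sketch of resolving any href against the URL of the page it was found on. It is not from the tutorial: `resolveUrl()` is a hypothetical helper name, and the approach (split off the query and fragment, then collapse "." and ".." segments) is only one way to do it. As a simplification, it also collapses repeated slashes in the path.

```php
<?php
// Hypothetical helper (not from the tutorial): turn an href into an absolute URL.
// Returns null for links a crawler cannot fetch (javascript:, mailto:, etc.).
function resolveUrl(string $base, string $href): ?string
{
    $href = trim($href);
    if ($href === '' || preg_match('/^(javascript|mailto|tel|data):/i', $href)) {
        return null;
    }
    // Already absolute (scheme://...): return unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;
    }
    $b = parse_url($base);
    if ($b === false || !isset($b['scheme'], $b['host'])) {
        return null;
    }
    $origin   = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    // Protocol-relative link: //host/path
    if (substr($href, 0, 2) === '//') {
        return $b['scheme'] . ':' . $href;
    }
    // Fragment-only link: same page, same query string.
    if ($href[0] === '#') {
        return $origin . $basePath . (isset($b['query']) ? '?' . $b['query'] : '') . $href;
    }
    // Query-only link: same path, new query string.
    if ($href[0] === '?') {
        return $origin . $basePath . $href;
    }
    // Split off ?query / #fragment so dot segments are resolved in the path only.
    $suffix = '';
    if (preg_match('/^([^?#]*)([?#].*)$/', $href, $m)) {
        $href   = $m[1];
        $suffix = $m[2];
    }
    if ($href[0] === '/') {
        $path = $href;                                   // root-relative
    } else {
        $dir  = substr($basePath, 0, strrpos($basePath, '/') + 1);
        $path = $dir . $href;                            // relative to the page's directory
    }
    // Collapse "." and ".." segments.
    $segments = explode('/', $path);
    $trailing = in_array(end($segments), ['', '.', '..'], true);
    $out = [];
    foreach ($segments as $seg) {
        if ($seg === '..') {
            array_pop($out);
        } elseif ($seg !== '' && $seg !== '.') {
            $out[] = $seg;
        }
    }
    $path = '/' . implode('/', $out);
    if ($trailing && $out) {
        $path .= '/';
    }
    return $origin . $path . $suffix;
}
```

For example, with the base `https://example.com/a/b/page.html`, the href `../c.html` resolves to `https://example.com/a/c.html`, and `javascript:void(0)` returns null so the caller can skip it.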

Q2. Percentage-wise, how many websites nowadays contain meta tags? I ask because if I write a web crawler that fetches pages and sniffs for meta tags and finds none, then under what keywords and key phrases should it index the crawled pages that have no meta tags (meta keywords tag, meta description tag)? It's no good just sniffing the HTML title tag, is it?
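To illustrate what a fallback could look like, here is a rough sketch that collects the title, the two meta tags when present, and the h1/h2 headings, so a page without meta tags still yields some text to index. `extractIndexFields()` is a hypothetical name, and using headings as the fallback is an assumption on my part, not an established rule.

```php
<?php
// Hypothetical helper: gather indexable fields from a page's HTML.
// Meta fields stay empty strings when the page has no such tags.
function extractIndexFields(string $html): array
{
    $fields = ['title' => '', 'description' => '', 'keywords' => '', 'headings' => []];
    if (trim($html) === '') {
        return $fields;
    }
    $doc  = new DOMDocument();
    $prev = libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($prev);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $fields['title'] = trim($titles->item(0)->textContent);
    }
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'description' || $name === 'keywords') {
            $fields[$name] = trim($meta->getAttribute('content'));
        }
    }
    // Fallback material for pages without meta tags: visible headings.
    foreach (['h1', 'h2'] as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $heading) {
            $fields['headings'][] = trim($heading->textContent);
        }
    }
    return $fields;
}
```

The indexer could then build keywords from `title` plus `headings` whenever `description` and `keywords` come back empty.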

Q3. I hear that nowadays, or for over a decade now, websites create an XML sitemap to feed their links to bots. Is that right? Anyway, I need to find a tutorial that teaches how to write a web crawler in cURL PHP that crawls the links found in an XML sitemap. Know of any good tutorials?
What keywords should I google for?
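To show the kind of thing I am after, here is a minimal sketch that downloads a sitemap with cURL and pulls out its `<loc>` entries with DOMDocument. The names `fetchUrl()` and `parseSitemap()` and the user agent string are placeholders of mine. It assumes the standard sitemaps.org layout, where a `<urlset>` lists pages and a `<sitemapindex>` lists further sitemaps.

```php
<?php
// Hypothetical helper: download a URL with cURL; returns the body or null on failure.
function fetchUrl(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_TIMEOUT        => 20,
        CURLOPT_USERAGENT      => 'ExampleCrawler/0.1', // placeholder name
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return ($body !== false && $status === 200) ? $body : null;
}

// Hypothetical helper: split a sitemap's <loc> values into page URLs and
// nested sitemap URLs, based on the parent element of each <loc>.
function parseSitemap(string $xml): array
{
    $result = ['pages' => [], 'sitemaps' => []];
    if (trim($xml) === '') {
        return $result;
    }
    $doc  = new DOMDocument();
    $prev = libxml_use_internal_errors(true);
    $ok   = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($prev);
    if (!$ok) {
        return $result;                   // not well-formed XML
    }
    foreach ($doc->getElementsByTagName('loc') as $loc) {
        $url    = trim($loc->textContent);
        $parent = $loc->parentNode->localName;
        if ($url === '') {
            continue;
        }
        if ($parent === 'url') {
            $result['pages'][] = $url;
        } elseif ($parent === 'sitemap') {
            $result['sitemaps'][] = $url;
        }
    }
    return $result;
}
```

A crawler would call `parseSitemap(fetchUrl($submittedUrl))`, queue everything in `pages`, and repeat the same step for each entry in `sitemaps`.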

Q4. Know of any PHP script or PHP tutorial that will show me how to auto-build an XML sitemap for my website so that all links are listed in the sitemap?

Q5. Any suggestions or tips?
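For the sitemap-building side, a minimal sketch using PHP's XMLWriter might look like the following. `buildSitemap()` is a hypothetical name, and the list of URLs is assumed to come from somewhere else (a database, a crawl of your own site, and so on). Only the required `<loc>` element is written.

```php
<?php
// Hypothetical helper: build a sitemaps.org-style XML document from a list of URLs.
function buildSitemap(array $urls): string
{
    $w = new XMLWriter();
    $w->openMemory();
    $w->setIndent(true);
    $w->startDocument('1.0', 'UTF-8');
    $w->startElement('urlset');
    $w->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    foreach (array_unique($urls) as $url) {   // drop duplicate URLs
        $w->startElement('url');
        $w->writeElement('loc', $url);        // writeElement escapes & and <
        $w->endElement();                     // </url>
    }
    $w->endElement();                         // </urlset>
    $w->endDocument();
    return $w->outputMemory();
}
```

The result can be saved with `file_put_contents('sitemap.xml', buildSitemap($urls))`. The sitemap protocol caps a single file at 50,000 URLs, so a larger site would need several files plus a sitemap index.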

Folks, I'm not new to building web crawlers. I have built many .exe ones using Ubot Studio, which is a bot builder. Now I need to build a PHP web crawler because I want to build a search engine.
I don't want to keep my home PC online all the time and use the .exe desktop web crawler to crawl the net. Hence, it's best I build a PHP one, or a web app, and get my website to crawl the net when people like you submit your XML sitemaps.
I'm not going to build a web crawler that wanders off from domain to domain.
It's like this: you submit your XML sitemap, and my PHP cURL web crawler will fetch all your domain's links and spider your website. It won't wander off to other domains. Understand?
That way, it won't drain my website's bandwidth. It will only crawl the links of people who submit them.
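The "stay on the submitted domain" rule could be sketched roughly like this. `isSameHost()` and `filterToHost()` are hypothetical names, and treating `www.example.com` and `example.com` as the same site is an assumption that may not suit every case.

```php
<?php
// Hypothetical helper: true when both URLs are on the same host.
// Assumption: a leading "www." is ignored and host names are compared case-insensitively.
function isSameHost(string $seedUrl, string $candidateUrl): bool
{
    $seed = parse_url($seedUrl, PHP_URL_HOST);
    $cand = parse_url($candidateUrl, PHP_URL_HOST);
    if (!is_string($seed) || !is_string($cand)) {
        return false;                     // relative or malformed URL: no host to compare
    }
    $normalize = fn(string $h): string => preg_replace('/^www\./', '', strtolower($h));
    return $normalize($seed) === $normalize($cand);
}

// Hypothetical helper: keep only the URLs that belong to the seed URL's host.
function filterToHost(string $seedUrl, array $urls): array
{
    return array_values(array_filter(
        $urls,
        fn($u) => is_string($u) && isSameHost($seedUrl, $u)
    ));
}
```

Every link found while spidering would pass through `filterToHost()` with the submitted sitemap URL as the seed before it is queued, so off-site links are dropped.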
Don't ask me why people would bother submitting their links to me, because I'm going to offer something they can't refuse.
And so, I need some guidance here on how to proceed with building my web crawler.
If you know of any good PHP cURL tutorial, then speak up, or at least suggest some good keywords for me to google, cos (British slang for "because" 😉) right now I can't find any good ones through (British spelling of the American spelling "thru") google.

PHP

2 Comments

@developer_web (author), May 10, 2020: No one came across such tutorials? 😁

@developer_web (author), May 12, 2020: @olegrozdobudko#1618344

What was all that? You forgot to add your code inside the code tags.