PHP Web Scraping
Are there any good resources for learning web scraping with PHP? I'm mostly looking for a good book, but good articles would be welcome too.
What legal issues are there with data scraping? I've been looking around, and it seems the scraping itself is fine; what matters is how you use the information you scrape.
If you are scraping via HTTP, the specification documents are a good starting point. If you are going to scrape over protocols other than HTTP (e.g. FTP, or raw TCP or UDP), or would like to get down and dirty with scraping, I would recommend using the sockets and stream extensions over the cURL extension. I've written wrapper classes for HTTP over sockets, so I'm sure I can answer some questions on that aspect.
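To give an idea of what HTTP over the stream extension looks like, here is a minimal sketch. It is not the wrapper classes mentioned above; the function names (buildGetRequest, parseResponse, fetchOverSocket) and the User-Agent string are made up for illustration. It assumes plain HTTP on port 80 (no TLS), does not follow redirects, and sends an HTTP/1.0 request so the server should not reply with chunked encoding.

```php
<?php
// Build a minimal HTTP/1.0 GET request by hand.
function buildGetRequest(string $host, string $path = '/'): string
{
    return "GET {$path} HTTP/1.0\r\n"
         . "Host: {$host}\r\n"
         . "User-Agent: ExampleScraper/0.1\r\n" // hypothetical agent name
         . "Connection: close\r\n"
         . "\r\n";
}

// Split a raw response into status code, headers and body.
function parseResponse(string $raw): array
{
    $parts = explode("\r\n\r\n", $raw, 2);
    $body  = $parts[1] ?? '';
    $lines = explode("\r\n", $parts[0]);

    // The first line looks like "HTTP/1.1 200 OK".
    $statusParts = explode(' ', (string) array_shift($lines), 3);
    $status = (int) ($statusParts[1] ?? 0);

    $headers = [];
    foreach ($lines as $line) {
        $pair = explode(':', $line, 2);
        if (count($pair) === 2) {
            // Header names are case-insensitive, so store them lowercased.
            $headers[strtolower(trim($pair[0]))] = trim($pair[1]);
        }
    }

    return ['status' => $status, 'headers' => $headers, 'body' => $body];
}

// Open a TCP connection, send the request, read until the server closes.
function fetchOverSocket(string $host, string $path = '/', int $port = 80, float $timeout = 10.0): array
{
    $socket = stream_socket_client("tcp://{$host}:{$port}", $errno, $errstr, $timeout);
    if ($socket === false) {
        throw new RuntimeException("Connection failed: {$errstr} ({$errno})");
    }

    fwrite($socket, buildGetRequest($host, $path));
    $raw = stream_get_contents($socket);
    fclose($socket); // free the connection as soon as we are done

    return parseResponse($raw === false ? '' : $raw);
}
```

Usage would be something like `$page = fetchOverSocket('example.com', '/');` and then reading `$page['status']` and `$page['body']`. The point of going this low is that you control every byte of the request, which is the kind of thing you give up some of with cURL.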
The legal issue is that nowadays there is generally a presumption of copyright. You may not reproduce whole articles and the like. Even partial extracts can breach copyright if they materially damage the owner's commercial interests, e.g. by giving away the ending of a book or film.
Thanks for the posts, guys. I've started fooling around with the curl library. What are the advantages of using sockets over curl?
What about copyright on images? Let's say I crawl social networks looking for images and information on specific people whose names were typed into a text box.
If you are using web scraping or web extraction to improve your own business, then it's fine.
Most people use web scraping or web extraction tools on eCommerce sites and other online stores to compare product prices.
I have a list of articles on web data extraction, but I cannot post links here. If a moderator allows me to post a link I will share it; otherwise, contact me.
Here's a great resource for loading and parsing the DOM: http://simplehtmldom.sourceforge.net/
It lets you pull out images, meta tags, etc., with a syntax very similar to jQuery selectors.
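For comparison, the same kind of extraction can be sketched with PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than the library linked above. This is a swapped-in technique, not that library's API: it uses XPath instead of jQuery-style selectors, and the function name extractImagesAndMeta and the sample HTML are made up for illustration.

```php
<?php
// Collect image URLs and named meta tags from an HTML string.
function extractImagesAndMeta(string $html): array
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid; keep libxml's warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    $images = [];
    foreach ($xpath->query('//img[@src]') as $img) {
        $images[] = $img->getAttribute('src');
    }

    $meta = [];
    foreach ($xpath->query('//meta[@name and @content]') as $tag) {
        $meta[$tag->getAttribute('name')] = $tag->getAttribute('content');
    }

    return ['images' => $images, 'meta' => $meta];
}

// Made-up sample page, so the sketch runs without touching the network.
$sample = '<html><head><meta name="description" content="Demo page"></head>'
        . '<body><img src="/a.png"><img src="/b.jpg" alt="b"></body></html>';

print_r(extractImagesAndMeta($sample));
```

The XPath queries are wordier than selectors, but nothing has to be installed as long as the DOM extension is enabled, which it usually is.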
Warning: Scraping is taxing on both your bandwidth and your CPU. Remember to free resources when possible.
Disclaimer: You should always have permission from web admins and owners to scrape their site. Copyrights and such can make life a nightmare if you cross the wrong person.
Treat any image as copyrighted unless you have the explicit permission of the owner, or the owner has explicitly put it in the public domain (e.g. Wikimedia). Stuff you just find on social networks is almost always going to be copyrighted.
Originally Posted by VNAsian
My Rules in Scraping:
• Don't publish images or text that you scrape that the owner doesn't want you to.
    - I would not use images scraped from a social media site (people like their privacy and you don't know the source of many images).
    - I may consider using product images, prices and such scraped from stores for a review site (they want ads).
• Don't scrape too fast, be patient and don't put stress on their servers.
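The "don't scrape too fast" rule can be sketched as a simple delay between requests. The function names (secondsToWait, politeFetchAll) and the two-second default are assumptions for illustration, not a recommended value for any particular site; the fetch step is passed in as a callable so the throttling stays separate from however you download pages.

```php
<?php
// Seconds still to wait so that two requests are at least $minInterval apart.
function secondsToWait(float $lastRequestAt, float $now, float $minInterval): float
{
    return max(0.0, $minInterval - ($now - $lastRequestAt));
}

// Fetch each URL in turn, pausing so requests never start closer
// together than $minInterval seconds.
function politeFetchAll(array $urls, callable $fetch, float $minInterval = 2.0): array
{
    $results = [];
    $last = null;

    foreach ($urls as $url) {
        if ($last !== null) {
            $wait = secondsToWait($last, microtime(true), $minInterval);
            if ($wait > 0) {
                usleep((int) round($wait * 1000000)); // usleep takes microseconds
            }
        }

        $last = microtime(true);
        $results[$url] = $fetch($url);
    }

    return $results;
}
```

You would call it as `politeFetchAll($urls, 'file_get_contents', 5.0)` or pass your own cURL or socket function as the second argument.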
Some people say NEVER to use regex in scraping and to only use DOM. Each of these has its advantages and disadvantages. Many sites aren't coded well and are difficult to scrape data from, especially those where the structure changes frequently. For many of these, and for throwaway projects, regex can be a good option.
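As an example of the regex approach for a throwaway job, here is a small sketch that pulls dollar prices out of a fragment without caring about the surrounding markup. The function name extractPrices and the sample HTML are made up, and the pattern assumes simple prices like $5 or $19.99 (no thousands separators, no other currencies).

```php
<?php
// Grab every dollar amount in the text, ignoring the tags around it.
function extractPrices(string $html): array
{
    // "$", optional whitespace, digits, then optionally ".dd"
    preg_match_all('/\$\s*(\d+(?:\.\d{2})?)/', $html, $matches);

    return array_map('floatval', $matches[1]);
}

// Made-up fragment with inconsistent markup around each price.
$html = '<div class="p">Widget <b>$19.99</b></div><span>Gadget: $5</span>';

print_r(extractPrices($html));
```

This keeps working if the site swaps the div for a table, which is where regex can win over DOM. It will also happily match a price inside a comment or a script block, which is where it loses.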
I do agree with Gravy.
But consider this (IMHO) before you start to learn:
The Good – There’s not much that’s good about web scraping. Unless you’re looking to use unsavory tactics, steal competitors’ content and pricing, or use other sites’ intellectual property, web scraping is just all around bad.
The Bad – The really bad news about web scraping is that it can lead to the theft of your content, which, if used on other sites, can significantly affect your SEO performance and rankings. It can also give competitors access to your proprietary pricing and product information, which ultimately gives them a leg up in the marketplace with the customers you’re actively seeking.
The Ugly – In a nutshell, web scraping can be a huge detriment to your brand. It can threaten your sales and conversions, lower your site’s SEO rankings or even get you blacklisted, negate the benefits of the content you’ve worked hard to produce, and can cause you to spend even more resources to make up for its damaging effects.