www.webdeveloper.com

Thread: PHP Web Scraping

  1. #1
    Join Date
    Jul 2013
    Posts
    6

    PHP Web Scraping

    Are there any good resources for learning web scraping with PHP? I'm mostly looking for a good book, but good articles would be welcome too.

    What legal issues are there around data scraping? From what I've read so far, the scraping itself seems to be fine; what matters is how you use the information you scrape.

  2. #2
    Join Date
    Dec 2002
    Location
    Seattle, WA
    Posts
    1,843
    If you are scraping via HTTP, the specification documents are a good starting point. If you are going to scrape over protocols other than HTTP (e.g. FTP, or anything running directly on TCP or UDP), or would like to get down and dirty with scraping, I would recommend using the sockets and streams extensions over the cURL extension. I've written wrapper classes for HTTP over sockets, so I can answer questions on that aspect.
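
    To make the "HTTP over sockets" idea concrete, here is a minimal sketch using PHP's stream functions. It is not the wrapper classes mentioned above; the function names, the `example-scraper/0.1` user agent, and the choice of HTTP/1.0 (to sidestep chunked transfer encoding) are all assumptions made for illustration. It skips TLS, redirects, and error handling beyond the connection itself.

```php
<?php
// Build a minimal GET request by hand (the part cURL normally hides).
// HTTP/1.0 is an assumption made here so the server should not reply with chunked encoding.
function build_get_request(string $host, string $path = '/'): string
{
    return "GET {$path} HTTP/1.0\r\n"
         . "Host: {$host}\r\n"
         . "User-Agent: example-scraper/0.1\r\n" // hypothetical agent string
         . "Connection: close\r\n"
         . "\r\n";
}

// Split a raw response into status code, lower-cased headers, and body.
function parse_response(string $raw): array
{
    [$head, $body] = array_pad(explode("\r\n\r\n", $raw, 2), 2, '');
    $lines = explode("\r\n", $head);
    $parts = explode(' ', (string) array_shift($lines), 3);
    $status = isset($parts[1]) ? (int) $parts[1] : 0;

    $headers = [];
    foreach ($lines as $line) {
        if (strpos($line, ':') === false) {
            continue; // ignore malformed header lines
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }

    return ['status' => $status, 'headers' => $headers, 'body' => $body];
}

// Fetch a page over a plain TCP stream (port 80, no TLS).
function fetch_over_socket(string $host, string $path = '/', float $timeout = 10.0): array
{
    $stream = stream_socket_client("tcp://{$host}:80", $errno, $errstr, $timeout);
    if ($stream === false) {
        throw new RuntimeException("Connection failed: {$errstr} ({$errno})");
    }
    fwrite($stream, build_get_request($host, $path));
    $raw = stream_get_contents($stream); // read until the server closes the connection
    fclose($stream);

    return parse_response($raw === false ? '' : $raw);
}
```

    Calling `fetch_over_socket('example.com')` would return the status, headers, and body as an array. The trade-off is that everything cURL does for you (TLS, redirects, compression, chunked bodies) becomes your own job.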

  3. #3
    Join Date
    Mar 2012
    Posts
    1,740
    The legal issue is that nowadays there is generally a presumption of copyright. You may not reproduce whole articles and the like. Even partial extracts can breach copyright if they materially damage the owner's commercial interests, e.g. by giving away the ending of a book or film.

  4. #4
    Join Date
    Jul 2013
    Posts
    6
    Thanks for the posts, guys. I've started fooling around with the cURL library. What are the advantages of using sockets over cURL?

    What about copyright on images? Let's say I crawl social networks looking for images and information on specific people whose names were typed into a text box.

  5. #5
    Join Date
    Jul 2014
    Location
    Ahmedabad
    Posts
    3
    If you are using web scraping or web extraction to improve your own business, then it's fine.

    Most people use web scraping or web extraction tools for eCommerce sites or online stores, to compare product prices.

    I have a list of articles on web data extraction, but I cannot post links here. If a moderator allows me to post a link I will share it; otherwise, contact me.

    Thanks!

  6. #6
    Join Date
    May 2014
    Posts
    9
    Hey,

    Here's a great resource for gathering and parsing the DOM: http://simplehtmldom.sourceforge.net/
    It handles images, meta tags, etc., and its selectors are very similar to jQuery's.

    Warning: Scraping is taxing on both your bandwidth and your CPU. Remember to free resources when possible.

    Disclaimer: You should always have permission from web admins and owners to scrape their site. Copyrights and such can make life a nightmare if you cross the wrong person.
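
    As an illustration of the DOM approach, here is a small sketch that pulls image URLs and meta tags out of a page. Note that it uses PHP's built-in `DOMDocument` and `DOMXPath` rather than the Simple HTML DOM library linked above, so the selectors are XPath, not jQuery-style. The function name and the sample markup are made up for the example.

```php
<?php
// Collect <img src> values and <meta name/content> pairs from an HTML string.
function extract_images_and_meta(string $html): array
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    $images = [];
    foreach ($xpath->query('//img[@src]') as $img) {
        $images[] = $img->getAttribute('src');
    }

    $meta = [];
    foreach ($xpath->query('//meta[@name and @content]') as $tag) {
        $meta[$tag->getAttribute('name')] = $tag->getAttribute('content');
    }

    // Drop the parsed tree as soon as the values are copied out,
    // which matters when looping over many pages.
    unset($xpath, $doc);

    return ['images' => $images, 'meta' => $meta];
}

// Hypothetical sample page.
$sample = '<html><head><meta name="description" content="Demo page"></head>'
        . '<body><img src="/a.png"><img src="/b.jpg" alt="b"><img alt="no src"></body></html>';

$result = extract_images_and_meta($sample);
print_r($result);
```

    Only plain strings are returned, so the parsed document can be released after each page instead of piling up in memory during a long crawl.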

  7. #7
    Join Date
    Aug 2006
    Posts
    1,934
    Quote (originally posted by VNAsian):
    What about copyright on images? Let's say I crawl social networks looking for images and information on specific people whose names were typed into a text box.
    Assume any image is copyrighted. You need the owner's explicit permission to use it, unless the owner has explicitly put it in the public domain (as with some images on Wikimedia). Stuff you just find on social networks is almost always going to be copyrighted.

  8. #8
    Join Date
    May 2014
    Posts
    77
    My rules for scraping:

    • Don't publish scraped images or text that the owner doesn't want published.
        - I would not use images scraped from a social media site (people like their privacy, and you don't know the source of many images).
        - I may consider using product images, prices and such scraped from stores for a review site (they want the advertising).
    • Don't scrape too fast; be patient and don't put stress on their servers.
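
    One way to follow the "don't scrape too fast" rule is to enforce a minimum delay between requests to the same host. The sketch below is only an illustration: the `Throttle` class, its method names, and the two-second delay are assumptions, not something from this thread or from any library.

```php
<?php
// Enforces a minimum pause between requests to the same host.
final class Throttle
{
    /** @var float minimum seconds between two requests to one host */
    private $minDelay;

    /** @var array<string, float> host => timestamp of the last request */
    private $lastRequestAt = [];

    public function __construct(float $minDelaySeconds)
    {
        $this->minDelay = $minDelaySeconds;
    }

    // Seconds that must still pass before $host may be contacted at time $now.
    public function secondsToWait(string $host, float $now): float
    {
        if (!isset($this->lastRequestAt[$host])) {
            return 0.0; // first request to this host
        }
        return max(0.0, $this->lastRequestAt[$host] + $this->minDelay - $now);
    }

    // Remember that a request to $host was made at time $now.
    public function record(string $host, float $now): void
    {
        $this->lastRequestAt[$host] = $now;
    }

    // Sleep if needed, then mark the host as just requested.
    public function wait(string $host): void
    {
        $pause = $this->secondsToWait($host, microtime(true));
        if ($pause > 0) {
            usleep((int) round($pause * 1000000));
        }
        $this->record($host, microtime(true));
    }
}

// Usage: call wait() right before every request.
$throttle = new Throttle(2.0); // assumed polite delay of two seconds per host
// $throttle->wait('example.com');
// ... fetch the page here ...
```

    Keeping the timestamp per host means a crawl across several sites is not slowed down more than necessary, while each individual server still sees at most one request per delay window.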

    Some people say NEVER to use regex in scraping and to only use the DOM. Each approach has its advantages and disadvantages. Many sites aren't coded well and are difficult to scrape data from, especially those whose structure changes frequently; for many of these, and for throwaway projects, regex can be a good option.
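
    To show the trade-off, here is a hedged side-by-side sketch that extracts prices from the same sloppy markup once with a regex and once with the DOM. The markup, the `price` class name, and both function names are invented for the example.

```php
<?php
// Hypothetical, deliberately sloppy markup: unquoted attribute,
// mixed-case tag, and an unclosed <div>.
$html = '<div class="item"><span class="price">$19.99</span></div>'
      . '<div class=item><SPAN class="price sale">$5.00</span>';

// Regex: ignores structure entirely and grabs anything that looks like a dollar amount.
function prices_by_regex(string $html): array
{
    preg_match_all('/\$(\d+(?:\.\d{2})?)/', $html, $matches);
    return $matches[1];
}

// DOM: parses the page, then selects <span> elements whose class list contains "price".
function prices_by_dom(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about bad markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(
        '//span[contains(concat(" ", normalize-space(@class), " "), " price ")]'
    );

    $prices = [];
    foreach ($nodes as $node) {
        $prices[] = ltrim(trim($node->textContent), '$');
    }
    return $prices;
}

print_r(prices_by_regex($html));
print_r(prices_by_dom($html));
```

    The regex version keeps working if the site renames its classes, but it will also pick up dollar amounts in scripts or unrelated text. The DOM version is precise about where a value comes from, but it breaks when the structure it selects on changes.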

  9. #9
    Join Date
    Aug 2014
    Posts
    1
    Thanks for this post. For scraping in PHP, read this: Programmer

  10. #10
    Join Date
    Aug 2014
    Posts
    14
    I do agree with Gravy.

    But consider this (IMHO) before you start to learn:

    The Good – There’s not much that’s good about web scraping. Unless you’re looking to use unsavory tactics, steal competitors’ content and pricing, or use other sites’ intellectual property, web scraping is just all around bad.
    The Bad – The really bad news about web scraping is that it can lead to the theft of your content, which, if used on other sites, can significantly affect your SEO performance and rankings. It can also give competitors access to your proprietary pricing and product information, which ultimately gives them a leg up in the marketplace with the customers you’re actively seeking.
    The Ugly – In a nutshell, web scraping can be a huge detriment to your brand. It can threaten your sales and conversions, lower your site’s SEO rankings or even get you blacklisted, negate the benefits of the content you’ve worked hard to produce, and can cause you to spend even more resources to make up for its damaging effects.
