Crawling World Wild Web at Scale


In this post, we discuss some of the existing technologies for scraping, parsing, and analyzing Web pages. We also talk about some of the challenges software engineers might face while scraping dynamic Web pages.

Scraping/Parsing/Mining Web Pages

In September, the iPhone 5 was released. We were interested in finding out people's reactions to the new iPhone, so we wanted to write a simple Python script to scrape and parse online reviews and run a sentiment analysis on the collected reviews. There are many applications for this kind of automated opinion mining, where companies are interested in finding out their customers' reactions to new products.

For scraping reviews, we used Python's urllib module. For parsing page contents and grabbing the required HTML elements, we used a Python library called BeautifulSoup. For sentiment analysis, we found an API built on top of the Python NLTK library for classification, hosted on the text-processing website. Finally, for sending HTTP requests to the text-processing website, we used another Python library called PycURL, which is basically a Python interface to libcurl. You can pull/view the opinion mining code from its GitHub repo: opinion-mining.
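As an illustration, here is a minimal sketch of sending one review to such a sentiment API with PycURL. The endpoint URL and the response field ('label') are assumptions based on the text-processing API's public documentation, not code from our repo; check the API docs for the exact details.

    import json
    import urllib
    import pycurl
    from StringIO import StringIO

    # Buffer to collect the HTTP response body.
    buf = StringIO()

    c = pycurl.Curl()
    c.setopt(pycurl.URL, 'http://text-processing.com/api/sentiment/')  # assumed endpoint
    c.setopt(pycurl.POSTFIELDS, urllib.urlencode({'text': 'I love my new iPhone 5!'}))
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.perform()
    c.close()

    # The API is assumed to return JSON such as {"label": "pos", ...}.
    result = json.loads(buf.getvalue())
    print result['label']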

Please note that there are other solutions for scraping the web. For instance, Scrapy is another open source framework for scraping websites and extracting the needed information.

Scraping Amazon Reviews

For another project, we were interested in scraping reviews of the George Foreman Grill from the Amazon website. To scrape this page, you can open it in your Chrome browser and use the Chrome inspector to inspect the review elements and figure out which HTML elements you need to grab when parsing the page. If you inspect one of the reviews, you'll see that the review is wrapped in a 'div' element with 'reviewText' as its style class.

The code snippet for scraping and parsing Amazon reviews is shown below:

    import urllib
    from bs4 import BeautifulSoup

    amazon_url = "..."  # add the link to the Amazon page here
    ur = urllib.urlopen(amazon_url)
    soup = BeautifulSoup(ur.read())
    posts = soup.select("div.reviewText")
    print posts[0].text  # this prints the first review

Here we grab the 'div' elements for the reviews by filtering on the style class; you can check the BeautifulSoup documentation for more details. With the above snippet, one can get all the reviews successfully.
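For example, a small follow-up to the snippet above collects the text of every matched review into a list (the variable names are illustrative):

    # Collect the text of all matched review elements.
    reviews = [post.text for post in posts]
    print "scraped %d reviews" % len(reviews)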

Scraping Macys Reviews

We also wanted to scrape the reviews for the same product, but from the Macys website. So, let's try the same approach as shown earlier and see what we get. The only difference is that if you inspect the Macys page, you'll see that a review is wrapped in a 'span' element with the style class 'BVRRReviewText'. So, we make the following change to our snippet:

    import urllib
    from bs4 import BeautifulSoup

    macys_url = "..."  # add the link to the Macys page here
    ur = urllib.urlopen(macys_url)
    soup = BeautifulSoup(ur.read())
    posts = soup.select("span.BVRRReviewText")
    print posts[0].text  # this should print a review, in theory!

If you try the above code, you won't get anything for the review content. More interestingly, if you print ur.read() after the second line and ignore the rest of the code, you'll get a None object. Why?

The issue is that the Macys reviews are populated by Ajax calls to their Web server. In other words, this isn't a statically loaded HTML page, so simply using urllib does not work here.

How to Scrape Dynamically Loaded Web Pages?

To resolve the above issue, you can figure out how Macys populates the reviews by making a POST call to a link on their web server, and then make the same POST request yourself to fetch the reviews. The other possible solution is to use a framework/library that simulates the operation of a browser. Here, we are going to use PhantomJS, which is a headless WebKit with a JavaScript API, to scrape the reviews from Macys.
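Before moving on to PhantomJS, here is a rough sketch of what the first approach (replaying the Ajax POST call directly) might look like. The endpoint URL and parameters below are entirely hypothetical; you would find the real ones by watching the Network tab of the Chrome inspector while the reviews load.

    import urllib
    import urllib2

    # Hypothetical Ajax endpoint and parameters; find the real ones by
    # inspecting the network traffic while the reviews load.
    ajax_url = 'http://www1.macys.com/reviews/endpoint'  # hypothetical
    params = urllib.urlencode({'productId': '797879', 'page': '1'})

    # Passing a data argument makes urllib2 send a POST request.
    response = urllib2.urlopen(urllib2.Request(ajax_url, params))
    print response.read()  # the review markup/JSON returned by the server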

You can download and build PhantomJS on your machine by following these instructions: how to build PhantomJS. The code below is our hack for getting the Macys reviews using PhantomJS:

    // Get Macys reviews
    var page = require('webpage').create(),
        url = 'http://www1.macys.com/shop/product/george-foreman-grp95r-grill-6-servings?id=797879';

    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            var results = page.evaluate(function () {
                var allSpans = document.getElementsByTagName('span');
                var reviews = [];
                for (var i = 0; i < allSpans.length; i++) {
                    if (allSpans[i].className === 'BVRRReviewText') {
                        reviews.push(allSpans[i].innerHTML);
                    }
                }
                return reviews;
            });
            console.log(results.join('\n'));
        }
        phantom.exit();
    });
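
To run the script, save it to a file (say getMacysReviews.js, a name we've made up here) and invoke it with the phantomjs binary: phantomjs getMacysReviews.js. The collected reviews are printed to standard output.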

You can see in the above code how we go about grabbing the reviews: we get the 'span' elements with the style class 'BVRRReviewText'. Another possible solution that we found is Ghost.py, but we didn't get a chance to test it.

To check out the simple crawler/sentiment analyzer that we have developed for crawling Amazon/Macys reviews, visit our GitHub repository here: opinion-mining.

Scraping the Web at Scale

One problem that we have often come across when scraping the web at scale is the time-consuming nature of the scraping. Most of the time we need to scrape billions of pages. This raises the need for a distributed solution to optimize the scraping time.

One simple solution we came up with for designing a distributed web crawler is to use a work queue to distribute time-consuming tasks among multiple workers (i.e., web crawlers). The basic idea is to use work queues to schedule tasks for scraping/parsing many pages by running multiple workers simultaneously, as sketched below.
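As a rough illustration, here is a minimal sketch of such a worker in the classic RabbitMQ work-queue style, using the pika client (pre-1.0 API, to match the era of this post). The queue name 'crawl_queue' and the parsing step are assumptions, not the actual code from our repository.

    import pika
    import urllib

    def handle_task(ch, method, properties, body):
        # Each message body is assumed to hold one URL scheduled for scraping.
        html = urllib.urlopen(body).read()
        # ... parse the reviews out of html here ...
        ch.basic_ack(delivery_tag=method.delivery_tag)  # mark the task as done

    conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = conn.channel()
    channel.queue_declare(queue='crawl_queue', durable=True)
    channel.basic_qos(prefetch_count=1)  # hand each worker one task at a time
    channel.basic_consume(handle_task, queue='crawl_queue')  # pika < 1.0 signature
    channel.start_consuming()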

You can view a simple example of a distributed web crawler here: distributed crawling and RMQ.

Last Words

So, this was a quick review of scraping web pages and some of the challenges you may encounter while crawling the World Wild Web.

Source: http://www.aioptify.com/crawling.php
