Easy Web Scraping with Python


[Translated back into English from a translation of the original article "Easy Web Scraping with Python"]

More than a year ago I wrote an article about web scraping using Node.js. Today I revisit that topic, but this time using Python, so that the techniques offered by the two languages can be compared.

The Problem

As you may know, I attended the PyCon conference in Montreal earlier this month. All of the talks and tutorials have been posted on YouTube, and they are indexed at pyvideo.org.

I thought it would be useful to know which videos from the conference are the most popular, so we are going to write a scraper script. The script will obtain the list of videos from pyvideo.org and then collect viewing statistics from each video's YouTube page. Sounds interesting? Let's get started.

The Tools

There are two basic tasks involved in scraping a website: loading a web page into a string, and parsing the HTML of that page to locate the information of interest.

Python provides two great tools for these tasks. I will use requests to load web pages and BeautifulSoup to do the parsing.

We can install both packages into a virtual environment:

$ mkdir pycon-scraper
$ virtualenv venv
$ source venv/bin/activate
(venv) $ pip install requests beautifulsoup4

If you are using Windows, note that the command to activate the virtual environment is different; you should use venv\Scripts\activate instead.
Basic Scraping Technique

The first thing to do when writing a scraper script is to manually inspect the page to scrape and determine how the data can be located.

First, we will take a look at the PyCon conference video list at http://pyvideo.org/category/50/pycon-us-2014. Inspecting the HTML source of this page, we find that the video list is structured more or less like this:

<div id="video-summary-content">
    <div class="video-summary">    <!-- first video -->
        <div class="thumbnail-data">...</div>
        <div class="video-summary-data">
            <div>
                <strong><a href="#link to video page#">#title#</a></strong>
            </div>
        </div>
    </div>
    <div class="video-summary">    <!-- second video -->
        ...
    </div>
    ...
</div>

The first task is to load this page and extract the links to the individual video pages, because the links to the YouTube videos are found on those pages.

Using requests to load a Web page is very simple:

import requests
response = requests.get('http://pyvideo.org/category/50/pycon-us-2014')

That's it. After this call returns, the HTML of the page is available in response.text.
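As an optional safeguard on top of the snippet above (not something the original script does), it is worth confirming that the request actually succeeded before parsing the body; requests exposes the HTTP status on the response object:

import requests

response = requests.get('http://pyvideo.org/category/50/pycon-us-2014')
# raise_for_status() raises an HTTPError for 4xx/5xx responses,
# so we only touch response.text when the page loaded correctly.
response.raise_for_status()
html = response.text
print(len(html))  # rough sanity check: a real page should contain plenty of characters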

The next task is to extract the links to the individual video pages. This can be done with BeautifulSoup using CSS selector syntax, which may be familiar to you if you do client-side development.

To get these links, we will use a selector that matches all the <a> elements inside each <div> with class video-summary-data. Since there are several <a> elements per video, we only keep those whose URL begins with /video; these are the links to the individual video pages. The CSS selector that implements this is div.video-summary-data a[href^=/video]. The following code snippet uses this selector with BeautifulSoup to obtain the <a> elements that point to the video pages:

import bs4
soup = bs4.BeautifulSoup(response.text)
links = soup.select('div.video-summary-data a[href^=/video]')

Because what we really care about is the link itself and not the <a> element that contains it, we can improve the code with a list comprehension.

links = [a.attrs.get('href') for a in soup.select('div.video-summary-data a[href^=/video]')]

Now we have a list of all the links that point to the individual video pages.

The following script puts together all the techniques we have discussed so far:

import requests
import bs4

root_url = 'http://pyvideo.org'
index_url = root_url + '/category/50/pycon-us-2014'

def get_video_page_urls():
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text)
    return [a.attrs.get('href') for a in soup.select('div.video-summary-data a[href^=/video]')]

print(get_video_page_urls())

If you run the script above you will get a list full of URLs. Now we need to process each one of them to obtain more information about each PyCon talk.

Scraping the Linked Pages

The next step is to load each of the pages in our list of URLs. If you want to see what these pages look like, here is an example: http://pyvideo.org/video/2668/writing-restful-web-services-with-flask. Yes, that's me, that is one of my talks!

From these pages we can grab the title of the talk, which appears at the top of the page. We can also get the speaker names and the YouTube link from the sidebar, which sits to the right, below the embedded video. The code that obtains these elements is shown below:

def get_video_data(video_page_url):
    video_data = {}
    response = requests.get(root_url + video_page_url)
    soup = bs4.BeautifulSoup(response.text)
    video_data['title'] = soup.select('div#videobox h3')[0].get_text()
    video_data['speakers'] = [a.get_text() for a in soup.select('div#sidebar a[href^=/speaker]')]
    video_data['youtube_url'] = soup.select('div#sidebar a[href^=http://www.youtube.com]')[0].get_text()

A few things to note about this function: the URLs scraped from the index page are relative, so root_url needs to be prepended; the title of the talk is obtained from the <h3> element inside the <div> with id videobox; and the speaker names and the YouTube link are obtained from the sidebar with selectors similar to the ones used on the index page.
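To see the function in action, here is a minimal usage sketch using the sample page mentioned above. It assumes the function ends with return video_data, as the extended version shown later does; the actual values naturally depend on the live page:

# Usage sketch (assumes get_video_data returns the video_data dictionary).
# The index page yields relative URLs, so only the path is passed in here;
# the function prepends root_url before downloading the page.
data = get_video_data('/video/2668/writing-restful-web-services-with-flask')
print(data['title'])        # talk title, taken from the <h3> inside div#videobox
print(data['speakers'])     # list of speaker names from the sidebar
print(data['youtube_url'])  # link to the talk's YouTube page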

Now it is time to grab the view count from the YouTube page of each video. Continuing the function above is very simple, and in the same way we can also grab the like and dislike counts.

import re

def get_video_data(video_page_url):
    # ... same code as above ...
    response = requests.get(video_data['youtube_url'])
    soup = bs4.BeautifulSoup(response.text)
    video_data['views'] = int(re.sub('[^0-9]', '',
                                     soup.select('.watch-view-count')[0].get_text().split()[0]))
    video_data['likes'] = int(re.sub('[^0-9]', '',
                                     soup.select('.likes-count')[0].get_text().split()[0]))
    video_data['dislikes'] = int(re.sub('[^0-9]', '',
                                        soup.select('.dislikes-count')[0].get_text().split()[0]))
    return video_data

The calls to soup.select() above use selectors with class names to grab the statistics for the video, but the text of each element needs some processing before it becomes a number. Take the view count as an example, which YouTube displays as "1,344 views". After separating the number from the text with split(), only the first part is useful. Because the number contains commas, a regular expression is used to filter out any character that is not a digit.
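As a quick illustration of that cleanup step, here is what the chain of split() and re.sub() does to a typical view-count string (the "1,344 views" example from above):

import re

raw = '1,344 views'                              # text as displayed on the YouTube page
first_part = raw.split()[0]                      # '1,344' -- drop the trailing word
digits_only = re.sub('[^0-9]', '', first_part)   # '1344'  -- strip the comma
views = int(digits_only)
print(views)  # 1344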

To complete the scraper, the following function calls all of the code shown above:

def show_video_stats():
    video_page_urls = get_video_page_urls()
    for video_page_url in video_page_urls:
        print(get_video_data(video_page_url))
Parallel Processing

The script as it stands works, but with more than a hundred videos it takes a while to run. In reality we are not doing much work ourselves; most of the time is spent waiting for pages to download, and the script is blocked during that time. The script could be much more efficient if it ran several downloads in parallel.

Going back to the scraping article I wrote using Node.js, the concurrency there came from the asynchronous nature of JavaScript. The same can be done with Python, but it has to be requested explicitly. For this example I am going to start a pool of eight worker processes. The code is surprisingly concise:

from multiprocessing import Pool

def show_video_stats(options):
    pool = Pool(8)
    video_page_urls = get_video_page_urls()
    results = pool.map(get_video_data, video_page_urls)

The multiprocessing.Pool class starts eight worker processes that wait to be handed jobs to run. Why eight? It is twice the number of cores on my computer. While experimenting with different pool sizes I found this to be the sweet spot: fewer than eight makes the script run slower, while more than eight does not make it any faster.

Calling pool.map() is similar to calling the regular map(): it invokes the function given as the first argument once for each element of the iterable given as the second argument. The big difference is that these calls are dispatched to the processes owned by the pool, so in this example eight tasks run in parallel.
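A tiny self-contained example, unrelated to the scraper itself, makes the parallel behavior of pool.map() easy to see. It also shows one way to derive the pool size from the machine's core count instead of hard-coding 8; the slow_double function and the sizing rule here are just illustrative assumptions:

from multiprocessing import Pool, cpu_count
import time

def slow_double(n):
    time.sleep(1)   # stand-in for a blocking download
    return n * 2

if __name__ == '__main__':
    numbers = list(range(8))
    start = time.time()
    pool = Pool(2 * cpu_count())   # same "twice the number of cores" rule of thumb
    results = pool.map(slow_double, numbers)
    print(results)                 # [0, 2, 4, 6, 8, 10, 12, 14], same as map() would give
    # elapsed time is far below the ~8 seconds a serial map() would need,
    # because the sleeping tasks run in parallel worker processes
    print('elapsed: %.1fs' % (time.time() - start))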

The time saved is considerable. On my computer, the first version of the script completed in 75 seconds, while the process pool version did the same job in just 16 seconds.

The Complete Scraper Script

The final version of my scraper script does a few more things with the data after collecting it.
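The text above only hints at what that final version does, but given the stated goal of finding the most popular videos, a plausible extra step is simply to rank the collected results by view count. The sketch below is an assumption, not the author's actual script; it only relies on the video_data dictionaries built earlier:

def show_top_videos(results, count=10):
    # Hypothetical post-processing step: sort the dictionaries produced by
    # get_video_data() by view count and print the most watched talks first.
    ranked = sorted(results, key=lambda video: video['views'], reverse=True)
    for video in ranked[:count]:
        print('%d views: %s' % (video['views'], video['title']))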
