Walkthrough of the Python crawler tool Beautiful Soup, with a video-scraping example

1. Installing Beautiful Soup 4
easy_install method (easy_install itself needs to be installed first):

easy_install beautifulsoup4

pip method (pip also needs to be installed first). Note that there is also a package called BeautifulSoup on PyPI, which is the release of the older Beautiful Soup 3; installing it is not recommended here.

pip install beautifulsoup4

Debian or Ubuntu installation:

apt-get install python-bs4

You can also install from source: download the bs4 source code and run

python setup.py install
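
Whichever method you use, you can verify the installation by importing the package and printing its version:

import bs4
print(bs4.__version__)  # any 4.x version means Beautiful Soup 4 is installed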

2. A first try

# coding=utf-8
"""Download Baidu Tieba images with BeautifulSoup"""
import urllib
from bs4 import BeautifulSoup

url = 'http://tieba.baidu.com/p/3537654215'

# Download the web page
html = urllib.urlopen(url)
content = html.read()
html.close()

# Use BeautifulSoup to match the pictures
html_soup = BeautifulSoup(content)
# The image markup was already analyzed in "Python crawler basics 1 -- urllib"
# (http://blog.xiaolud.com/2015/01/22/spider-1st/)
# Compared with regular-expression matching, BeautifulSoup provides a simpler and more flexible way
all_img_links = html_soup.find_all('img', class_='BDE_Image')

# The rest is the usual routine of downloading the images
img_counter = 1
for img_link in all_img_links:
    img_name = '%s.jpg' % img_counter
    urllib.urlretrieve(img_link['src'], img_name)
    img_counter += 1

Very simple, and the code comments already explain it clearly. BeautifulSoup provides a simpler and more flexible way to analyze the page source and get to the image links quickly.
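
The snippet above targets Python 2 (urllib.urlopen and urllib.urlretrieve). If you are on Python 3, where those functions moved to urllib.request, a roughly equivalent sketch might look like this; the thread URL and the BDE_Image class are taken from the example above and may no longer exist on the live site:

# coding=utf-8
"""Python 3 sketch of the same Tieba image downloader (urllib moved to urllib.request)"""
import urllib.request
from bs4 import BeautifulSoup

url = 'http://tieba.baidu.com/p/3537654215'

# Download the web page
with urllib.request.urlopen(url) as page:
    content = page.read()

# Parse the page and collect the image tags
html_soup = BeautifulSoup(content, 'html.parser')
all_img_links = html_soup.find_all('img', class_='BDE_Image')

# Save each image under a sequential file name
for img_counter, img_link in enumerate(all_img_links, start=1):
    urllib.request.urlretrieve(img_link['src'], '%s.jpg' % img_counter)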


3. Crawling instances
3.1 Basic scraping techniques
When writing a crawler script, the first thing to do is to inspect the page you want to crawl by hand, to determine how the data is positioned.

First, let's take a look at the PyCon conference video list on http://pyvideo.org/category/50/pycon-us-2014. Inspecting the HTML source of this page, we find that each entry in the video list looks more or less like this (simplified):

<div class="video-summary-data">
    ...
    <a href="/video/...">#title</a>
    ...
</div>

So the first task is to load this page and then extract a link to each individual page, because the links to YouTube videos are on these separate pages.

Using requests to load a Web page is very simple:

import requests

response = requests.get('http://pyvideo.org/category/50/pycon-us-2014')

That's it! After this call returns, the HTML of the page is available in response.text.
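
As a quick, optional sanity check, you can confirm the request succeeded and peek at the beginning of the returned HTML:

# 200 means the page was fetched successfully
print(response.status_code)
# response.text holds the raw HTML as a string; show the first 300 characters
print(response.text[:300])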

The next task is to extract the link to each individual video page. This can be done with CSS selector syntax via BeautifulSoup, which will feel familiar if you do client-side development.

To get these links, we will use a selector that captures the <a> elements inside every element with class video-summary-data. Since there are several <a> elements for each video, we keep only those whose URL begins with /video; these are the ones that point to the individual video pages. The CSS selector that implements these criteria is div.video-summary-data a[href^="/video"]. The following code snippet uses this selector with BeautifulSoup to get the <a> elements that point to the video pages:

import bs4

soup = bs4.BeautifulSoup(response.text)
links = soup.select('div.video-summary-data a[href^="/video"]')

Because what we really care about is the link itself rather than the element that contains it, we can improve the code with a list comprehension.

links = [a.attrs.get('href') for a in soup.select('div.video-summary-data a[href^="/video"]')]
Now we have a list of all the links that point to the individual video pages.

The following script puts together everything we have covered so far:

import requests
import bs4

root_url = 'http://pyvideo.org'
index_url = root_url + '/category/50/pycon-us-2014'

def get_video_page_urls():
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text)
    return [a.attrs.get('href') for a in soup.select('div.video-summary-data a[href^="/video"]')]

print(get_video_page_urls())

If you run the above script you will get a list full of URLs. Now we need to fetch and parse each one to get more information about each PyCon talk.

3.2 Fetching the linked pages
The next step is to load each page in our URL list. If you want to see what these pages look like, here is an example: http://pyvideo.org/video/2668/writing-restful-web-services-with-flask. Yes, that's me, that's one of my talks!

From these pages we can grab the talk title, which appears at the top of the page. We can also get the speakers' names and the YouTube link from the sidebar, which sits to the bottom right of the embedded video. The code that gets these elements is shown below:

def get_video_data(video_page_url):
    video_data = {}
    response = requests.get(root_url + video_page_url)
    soup = bs4.BeautifulSoup(response.text)
    video_data['title'] = soup.select('div#videobox h3')[0].get_text()
    video_data['speakers'] = [a.get_text() for a in soup.select('div#sidebar a[href^="/speaker"]')]
    video_data['youtube_url'] = soup.select('div#sidebar a[href^="http://www.youtube.com"]')[0].get_text()

Some things to note about this function:

The URLs scraped from the index page are relative paths, so root_url needs to be prepended to them.
The talk title is obtained from the <h3> element inside the <div> with id videobox. Note that the [0] is required because the call to select() returns a list, even if there is only one match (see the short sketch after this list).
The speaker names and the YouTube link are obtained in the same way the links were obtained on the index page.
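
Here is a tiny, self-contained sketch (the HTML string is made up purely for illustration) of the [0] behaviour mentioned above: select() always returns a list of matches, even when there is exactly one.

import bs4

snippet = '<div id="videobox"><h3>Example talk title</h3></div>'  # made-up markup
soup = bs4.BeautifulSoup(snippet, 'html.parser')

matches = soup.select('div#videobox h3')
print(len(matches))            # 1 -- still a list with a single element
print(matches[0].get_text())   # 'Example talk title'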
Now it's time to grab the view count from each video's YouTube page. Continuing the function above is actually very simple. In the same way, we can also grab the numbers of likes and dislikes.

def get_video_data(video_page_url):
    # ... (continued from above; also requires "import re" at the top of the script)
    response = requests.get(video_data['youtube_url'])
    soup = bs4.BeautifulSoup(response.text)
    video_data['views'] = int(re.sub('[^0-9]', '',
                                     soup.select('.watch-view-count')[0].get_text().split()[0]))
    video_data['likes'] = int(re.sub('[^0-9]', '',
                                     soup.select('.likes-count')[0].get_text().split()[0]))
    video_data['dislikes'] = int(re.sub('[^0-9]', '',
                                        soup.select('.dislikes-count')[0].get_text().split()[0]))
    return video_data

The calls to soup.select() above use selectors with class names to collect the video statistics. But the text of each element needs some processing before it becomes a number. Consider the view count, which YouTube shows as "1,344 views". After separating (split) the number from the text at the space, only the first part is useful. Since the number contains commas, a regular expression is used to filter out every character that is not a digit.
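
As a concrete illustration of that conversion (the sample string is made up in the style YouTube used at the time):

import re

raw = '1,344 views'                            # sample text as shown on the video page
first_part = raw.split()[0]                    # '1,344' -- keep only the number part
views = int(re.sub('[^0-9]', '', first_part))  # drop the comma and convert: 1344
print(views)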

To complete the crawler, the following function calls all of the code described above:

def show_video_stats():
    video_page_urls = get_video_page_urls()
    for video_page_url in video_page_urls:
        print(get_video_data(video_page_url))

3.3 Parallel processing
The script up to now works well, but with more than 100 videos it will run for a while. In fact, the script itself does hardly any work; most of the time is wasted waiting for pages to download, and during that time the script is blocked. If the script could run several download tasks at the same time, it would probably be more efficient, right?

Recall the earlier article about writing a crawler with Node.js: there, concurrency came from the asynchronous nature of JavaScript. Python can do the same, but concurrency has to be requested explicitly. In this example I am going to open a pool of 8 processes that can work in parallel. The code is surprisingly concise:

from multiprocessing import Pool

def show_video_stats(options):
    pool = Pool(8)
    video_page_urls = get_video_page_urls()
    results = pool.map(get_video_data, video_page_urls)

The multiprocessing.Pool class starts 8 worker processes that wait for tasks to be assigned to them. Why 8? That is twice the number of cores on my computer. While experimenting with different pool sizes I found this to be the sweet spot: fewer than 8 makes the script run too slowly, and more than 8 does not make it any faster.
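
If you prefer not to hard-code the 8, one possible (illustrative) variation is to derive the pool size from the machine's core count with multiprocessing.cpu_count():

import multiprocessing

# Twice the number of CPU cores, following the reasoning above; the best
# multiplier depends on how much time the workers spend waiting on downloads.
pool_size = 2 * multiprocessing.cpu_count()
pool = multiprocessing.Pool(pool_size)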

Calling pool.map() is similar to calling the regular map(): it invokes the function given as the first argument once for each element of the iterable given as the second argument. The biggest difference is that it sends these calls to the processes owned by the pool, so in this example eight tasks run in parallel.

The time saved is quite large. On my computer, the first version of the script finished in 75 seconds, while the process-pool version did the same job in 16 seconds!
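
If you want to measure the difference on your own machine, a simple (illustrative) approach is to wrap the call in a timer:

import time

start = time.perf_counter()
show_video_stats()   # or the Pool-based version of the function, with its options argument
print('Finished in %.1f seconds' % (time.perf_counter() - start))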

3.4 The complete crawler script
My final version of the crawler script does a few more things with the data.

I added a --sort command line option to specify a sorting criterion; you can specify views, likes or dislikes. The script sorts the resulting list in decreasing order of the given attribute. Another option, --max, gives the number of results to display, in case you only want to see the top-ranked few. Finally, I added a --csv option that prints the data in CSV format instead of the aligned table, so you can easily bring it into spreadsheet software.

The full script appears below:

import argparse
import re
from multiprocessing import Pool

import requests
import bs4

root_url = 'http://pyvideo.org'
index_url = root_url + '/category/50/pycon-us-2014'


def get_video_page_urls():
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text)
    return [a.attrs.get('href') for a in soup.select('div.video-summary-data a[href^="/video"]')]


def get_video_data(video_page_url):
    video_data = {}
    response = requests.get(root_url + video_page_url)
    soup = bs4.BeautifulSoup(response.text)
    video_data['title'] = soup.select('div#videobox h3')[0].get_text()
    video_data['speakers'] = [a.get_text() for a in soup.select('div#sidebar a[href^="/speaker"]')]
    video_data['youtube_url'] = soup.select('div#sidebar a[href^="http://www.youtube.com"]')[0].get_text()

    response = requests.get(video_data['youtube_url'])
    soup = bs4.BeautifulSoup(response.text)
    video_data['views'] = int(re.sub('[^0-9]', '',
                                     soup.select('.watch-view-count')[0].get_text().split()[0]))
    video_data['likes'] = int(re.sub('[^0-9]', '',
                                     soup.select('.likes-count')[0].get_text().split()[0]))
    video_data['dislikes'] = int(re.sub('[^0-9]', '',
                                        soup.select('.dislikes-count')[0].get_text().split()[0]))
    return video_data


def parse_args():
    parser = argparse.ArgumentParser(description='Show PyCon video statistics.')
    parser.add_argument('--sort', metavar='FIELD', choices=['views', 'likes', 'dislikes'],
                        default='views',
                        help='sort by the specified field. Options are views, likes and dislikes.')
    parser.add_argument('--max', metavar='MAX', type=int, help='show the top MAX entries only.')
    parser.add_argument('--csv', action='store_true', default=False,
                        help='output the data in CSV format.')
    parser.add_argument('--workers', type=int, default=8,
                        help='number of workers to use, 8 by default.')
    return parser.parse_args()


def show_video_stats(options):
    pool = Pool(options.workers)
    video_page_urls = get_video_page_urls()
    results = sorted(pool.map(get_video_data, video_page_urls),
                     key=lambda video: video[options.sort], reverse=True)
    max = options.max
    if max is None or max > len(results):
        max = len(results)
    if options.csv:
        print(u'"title","speakers","views","likes","dislikes"')
    else:
        print(u'Views  +1  -1 Title (Speakers)')
    for i in range(max):
        if options.csv:
            print(u'"{0}","{1}",{2},{3},{4}'.format(
                results[i]['title'], ', '.join(results[i]['speakers']),
                results[i]['views'], results[i]['likes'], results[i]['dislikes']))
        else:
            print(u'{0:5d} {1:3d} {2:3d} {3} ({4})'.format(
                results[i]['views'], results[i]['likes'], results[i]['dislikes'],
                results[i]['title'], ', '.join(results[i]['speakers'])))


if __name__ == '__main__':
    show_video_stats(parse_args())

The output below lists the 25 most viewed talks, with their view counts, at the time I finished writing the code:

(venv) $ python pycon-scraper.py --sort views --max 25 --workers 8
Views  Title (Speakers)
 3002  Keynote - Guido van Rossum (Guido van Rossum)
 2564  Computer science fundamentals for self-taught programmers (Justin Abrahms)
 2369  Ansible - Python-Powered Radically Simple IT Automation (Michael DeHaan)
 2165  Analyzing Rap Lyrics with Python (Julie Lavoie)
 2158  Exploring Machine Learning with Scikit-learn (Jake Vanderplas, Olivier Grisel)
 2065  Fast Python, Slow Python (Alex Gaynor)
 2024  Getting Started with Django, a crash course (Kenneth Love)
 1986  It's Dangerous to Go Alone: Battling the Invisible Monsters in Tech (Julie Pagano)
 1843  Discovering Python (David Beazley)
 1672  All Your Ducks in a Row: Data Structures in the Standard Library and Beyond (Brandon Rhodes)
 1558  Keynote - Fernando Pérez (Fernando Pérez)
 1449  Descriptors and Metaclasses - Understanding and Using Python's More Advanced Features (Mike Müller)
 1402  Flask by Example (Miguel Grinberg)
 1342  Python Epiphanies (Stuart Williams)
 1219  0 to 00111100 with web2py (G. Clifford Williams)
 1169  Cheap Helicopters in My Living Room (Ned Jackson Lovely)
 1146  IPython in Depth: High Productivity Interactive and Parallel Python (Fernando Perez)
 1127  2D/3D Graphics with Python on Mobile Platforms (Niko Skrypnik)
 1081  Generators: The Final Frontier (David Beazley)
 1067  Designing Poetic APIs (Erik Rose)
 1064  Keynote - John Perry Barlow (John Perry Barlow)
 1029  What Is Async, How Does It Work, and When Should I Use It? (A. Jesse Jiryu Davis)
  981  The Sorry State of SSL (Hynek Schlawack)
  961  Farewell and Welcome Home: Python in Two Genders (Naomi Ceder)
  958  Getting Started Testing (Ned Batchelder)