How to use the Python crawler library BeautifulSoup, with a video-crawling example

This article describes how to use BeautifulSoup, a concise and powerful Python package for extracting data from web pages, through a video-crawling example.

1. Install BeautifulSoup4
easy_install installation method

easy_install beautifulsoup4

pip installation method (pip itself must be installed in advance). Note that PyPI also hosts a package named BeautifulSoup, which is the release of Beautiful Soup 3; installing that one is not recommended.

pip install beautifulsoup4

Debian or Ubuntu installation method

apt-get install python-bs4

You can also install from source: download the BS4 source code, then run

python setup.py install
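
To confirm the installation worked (a quick check, not part of the original article), import the package and print its version:

import bs4
print(bs4.__version__)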

2. A quick first test

# coding=utf-8
'''
Download the images from a page
'''
import urllib
from bs4 import BeautifulSoup

url = 'http://'  # URL of the page to download (elided in the original)
html = urllib.urlopen(url)
content = html.read()
html.close()

# Use BeautifulSoup to match the images. The image markup was already analyzed in
# [Python crawler basics 1 -- urllib](http://blog.xiaolud.com/2015/01/22/spider-1st/);
# compared with regular-expression matching, BeautifulSoup is simpler and more flexible.
html_soup = BeautifulSoup(content)
all_img_links = html_soup.findAll('img', class_='bde_imag')

# Next, download the images
img_counter = 1
for img_link in all_img_links:
    img_name = '%s.jpg' % img_counter
    urllib.urlretrieve(img_link['src'], img_name)
    img_counter += 1

It is very simple, and the code comments explain it clearly: BeautifulSoup offers a simpler and more flexible way to analyze a site's source code and extract the image links quickly.
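The snippet above is Python 2 code (urllib.urlopen and urllib.urlretrieve live elsewhere in Python 3). As a rough sketch, the same downloader under Python 3 might look like this; the page URL and the bde_imag class are placeholders carried over from the example above:

from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

url = 'http://'  # page to download (placeholder, as above)
content = urlopen(url).read()

# Parse and select the same images; naming an explicit parser avoids a bs4 warning
html_soup = BeautifulSoup(content, 'html.parser')
all_img_links = html_soup.find_all('img', class_='bde_imag')

# Save each image as 1.jpg, 2.jpg, ...
for img_counter, img_link in enumerate(all_img_links, start=1):
    urlretrieve(img_link['src'], '%s.jpg' % img_counter)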


3. A crawling example
3.1 Basic crawling techniques
When writing a crawler script, the first step is to manually inspect the page to be crawled and work out how the data is located.

First, let's look at the PyCon conference video list at http://pyvideo.org/category/50/pycon-us-2014. Inspecting the HTML source of this page, we find that each entry in the video list looks roughly like this:

<p class="video-summary-data">
    ...
    <a href="/video/...">#title#</a>
    ...
</p>

The first task is to load this page and extract the links to each individual video page, because the YouTube links live on those individual pages.

Using requests to load a web page is very simple:

import requests

response = requests.get('http://pyvideo.org/category/50/pycon-us-2014')

That's it! Once this call returns, the page's HTML is available as response.text.
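As a quick sanity check (an addition, not part of the original article), you can verify that the request succeeded before parsing the body:

import requests

response = requests.get('http://pyvideo.org/category/50/pycon-us-2014')
response.raise_for_status()   # raises requests.HTTPError on a 4xx/5xx response
print(response.text[:200])    # first 200 characters of the page HTML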

The next task is to extract the links to each individual video page. With BeautifulSoup this can be done using CSS selector syntax, which may be familiar to you if you do front-end development.

To obtain these links, we need a selector that captures all the <a> elements in each video's summary. Since each video entry has several <a> elements, we keep only those whose URL begins with /video; these are the only links to the individual video pages. The CSS selector implementing this criterion is p.video-summary-data a[href^=/video]. The following snippet uses BeautifulSoup's selector support to obtain the elements pointing to the video pages:

import bs4

soup = bs4.BeautifulSoup(response.text)
links = soup.select('p.video-summary-data a[href^=/video]')

Because we care about the links themselves rather than the <a> elements that contain them, we can improve the code above with a list comprehension.

links = [a.attrs.get('href') for a in soup.select('p.video-summary-data a[href^=/video]')]
Now we have a list containing the links to all the individual video pages.
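
Note that these hrefs are relative paths (they begin with /video); before requesting them you would join each one with the site root. A minimal sketch using Python 3's urllib.parse.urljoin (an assumption; the original only shows the relative links):

from urllib.parse import urljoin

root_url = 'http://pyvideo.org'
relative_links = ['/video/2668/example-talk']  # hypothetical href for illustration
full_urls = [urljoin(root_url, href) for href in relative_links]
print(full_urls)  # ['http://pyvideo.org/video/2668/example-talk']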

The following script summarizes all the techniques we have covered so far:

import requests
import bs4

root_url = 'http://pyvideo.org'
index_url = root_url + '/category/50/pycon-us-2014'

def get_video_page_urls():
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text)
    return [a.attrs.get('href') for a in soup.select('p.video-summary-data a[href^=/video]')]

print(get_video_page_urls())

If you run the script above, you will get a list full of URLs. Now we need to fetch and parse each of these pages to extract more data about each video, including the link to its YouTube video.
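
The article breaks off here. As a rough sketch of that next step (an assumption, since the original does not show the video-page markup), the code below fetches one video page and looks for an <a> element whose href points at YouTube:

import requests
import bs4

def get_youtube_link(video_page_url):
    response = requests.get(video_page_url)
    soup = bs4.BeautifulSoup(response.text)
    # The selector below is an assumed pattern, not verified against the real page
    matches = soup.select('a[href^="http://www.youtube.com"]')
    return matches[0].attrs.get('href') if matches else None

print(get_youtube_link('http://pyvideo.org/video/2668/example-talk'))  # hypothetical URL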
