This article describes how to use the Python crawler library BeautifulSoup, using a video-crawling example. BeautifulSoup is a Python package for extracting data from HTML pages; it is concise and powerful. The sections below walk through installation and a crawling example.
1. Install BeautifulSoup4
easy_install method:
easy_install beautifulsoup4
pip method (pip must be installed in advance). Note that there is a package on PyPI named BeautifulSoup, which is the legacy Beautiful Soup 3 release; installing it is not recommended.
pip install beautifulsoup4
Debian or Ubuntu method:
apt-get install python-bs4
You can also install from source: download the BS4 source code and run
python setup.py install
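Whichever method you choose, a quick way to verify the installation is to import the package and print its version:
python -c "import bs4; print(bs4.__version__)"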
2. A Quick Test
# coding=utf-8
'''
Download the images from a web page
'''
import urllib
from bs4 import BeautifulSoup

url = 'http://...'

# download the web page
html = urllib.urlopen(url)
content = html.read()
html.close()

# use BeautifulSoup to match the images
html_soup = BeautifulSoup(content)
# the image markup was already analyzed in Python crawler basics 1 -- urllib
# (http://blog.xiaolud.com/2015/01/22/spider-1st/)
# compared with regular-expression matching, BeautifulSoup offers a simpler and more flexible way
all_img_links = html_soup.findAll('img', class_='bde_imag')

# the next step is to download the images
img_counter = 1
for img_link in all_img_links:
    img_name = '%s.jpg' % img_counter
    urllib.urlretrieve(img_link['src'], img_name)
    img_counter += 1
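The comments above claim that BeautifulSoup is simpler and more flexible than regular-expression matching. A rough side-by-side sketch of the same extraction (the HTML fragment here is invented purely for illustration):

import re
from bs4 import BeautifulSoup

html = '<div><img class="bde_imag" src="http://example.com/a.jpg"></div>'

# regular-expression approach: breaks if the attribute order or quoting changes
regex_srcs = re.findall(r'<img[^>]*class="bde_imag"[^>]*src="([^"]+)"', html)

# BeautifulSoup approach: independent of attribute order and markup details
soup = BeautifulSoup(html, 'html.parser')
soup_srcs = [img['src'] for img in soup.find_all('img', class_='bde_imag')]

print(regex_srcs)  # ['http://example.com/a.jpg']
print(soup_srcs)   # ['http://example.com/a.jpg']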
The crawling code itself is very simple, and the comments explain each step. BeautifulSoup provides a simpler and more flexible way to analyze the page source and extract the image links quickly.
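One caveat: the snippet above uses the Python 2 urllib API, which was split up in Python 3. A minimal Python 3 sketch of the same logic, assuming the same placeholder URL and the same bde_imag image class, might look like this:

import urllib.request
from bs4 import BeautifulSoup

url = 'http://...'  # placeholder: the page to crawl

# download the web page
with urllib.request.urlopen(url) as page:
    content = page.read()

# match the images by their class attribute
html_soup = BeautifulSoup(content, 'html.parser')
all_img_links = html_soup.find_all('img', class_='bde_imag')

# download the images one by one
for img_counter, img_link in enumerate(all_img_links, start=1):
    urllib.request.urlretrieve(img_link['src'], '%s.jpg' % img_counter)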
3. A Crawling Example
3.1 Basic Crawling Techniques
When writing a crawler script, the first step is to inspect the target page manually and work out how the data we want is located.
First, let's take a look at the PyCon conference video list at http://pyvideo.org/category/50/pycon-us-2014. Inspecting the HTML source of this page, we find that each entry in the video list looks roughly like this (abbreviated, with #title# standing in for a video title):

<p class="video-summary-data">
    ...
    <a href="/video/...">#title#</a>
    ...
</p>
The first task is to load this page and extract the links to each individual video page, because the links to the YouTube videos live on those individual pages.
Using requests to load a web page is very simple:
import requests

response = requests.get('http://pyvideo.org/category/50/pycon-us-2014')
That's it! Once this call returns, the HTML of the page is available in response.text.
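Before handing the HTML to a parser, it is sensible to confirm the request actually succeeded. A minimal sketch of such a check:

import requests

response = requests.get('http://pyvideo.org/category/50/pycon-us-2014')
response.raise_for_status()  # raises requests.exceptions.HTTPError for 4xx/5xx responses
print(len(response.text))    # size of the returned HTML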
The next task is to extract the links to each individual video page. With BeautifulSoup this can be done using CSS selector syntax, which may be familiar if you do client-side web development.
To obtain these links, we need a selector that captures all the a elements inside each video summary block. Since each video entry contains several a elements, we keep only those whose URL starts with /video; these are the only links that lead to an individual video page. The CSS selector implementing this rule is p.video-summary-data a[href^=/video]. The following code snippet uses BeautifulSoup's selector support to obtain the elements pointing to the video pages:
import bs4

soup = bs4.BeautifulSoup(response.text)
links = soup.select('p.video-summary-data a[href^=/video]')
Because we really care about the links themselves rather than the a elements containing them, we can improve the code above with a list comprehension.
links = [a.attrs.get('href') for a in soup.select('p.video-summary-data a[href^=/video]')]
Now we have a list containing all the links to the individual video pages.
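Note that these links are site-relative paths starting with /video. Before they can be requested, they must be joined with the site root; a small sketch using the standard library's urljoin (the example path is made up for illustration):

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

root_url = 'http://pyvideo.org'
links = ['/video/1234/example-talk']  # illustrative; in practice this is the list built above
absolute_links = [urljoin(root_url, link) for link in links]
print(absolute_links)  # ['http://pyvideo.org/video/1234/example-talk']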
The following script puts together all the techniques we have covered so far:
import requests
import bs4

root_url = 'http://pyvideo.org'
index_url = root_url + '/category/50/pycon-us-2014'

def get_video_page_urls():
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text)
    return [a.attrs.get('href') for a in soup.select('p.video-summary-data a[href^=/video]')]

print(get_video_page_urls())
If you run the above script, you will get a list full of URLs. Now we need to parse each URL to get