Capture webpage content using Python and Beautiful Soup

Source: Internet
Author: User

Python 3 provides the URL-opening module urllib.request and the HTML-parsing module html.parser. However, html.parser offers only simple, low-level functions and often falls short when parsing real webpage content. Beautiful Soup 4 is a powerful Python library for parsing HTML and XML files, and it provides very thorough documentation (http://www.crummy.com/software/BeautifulSoup/bs4/doc).
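To illustrate why html.parser is considered low-level, here is a minimal sketch (my own illustration, not from the original article) of collecting links with the standard library alone: you must subclass HTMLParser and track the parse events yourself.

```python
from html.parser import HTMLParser

# html.parser is event-driven: to collect all links, subclass HTMLParser
# and record the href attribute each time an <a> start tag is seen.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.append(dict(attrs).get('href'))

parser = LinkCollector()
parser.feed('<p><a href="/a">A</a> and <a href="/b">B</a></p>')
print(parser.links)  # ['/a', '/b']
```

With Beautiful Soup the same task is a single find_all call, which is the convenience the rest of this article relies on.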

 

Beautiful Soup 4 installation and related issues

The latest version of Beautiful Soup is 4.1.1, which can be obtained here (http://www.crummy.com/software/BeautifulSoup/bs4/download). I am using Mac OS X. To install Beautiful Soup on this platform, just unzip the installation package and run the setup.py file:

 
$ python3 setup.py install

If you get a SyntaxError ("invalid syntax") on the line root_tag_name = u'[document]', convert the code to Python 3 with 2to3:

$ 2to3-3.2 -w bs4

 

Chinese characters in the URL

Suppose the URL contains Chinese characters, as in the following example:

http://flight.qunar.com/site/oneway_list.htm?searchDepartureAirport=Beijing&searchArrivalAirport=Lijiang&searchDepartureTime=2012-08-09

If you pass this URL directly to urllib.request.urlopen, a TypeError occurs. The solution is to put the parameter names and values into a dict and encode them with the urllib.parse.urlencode method. The sample code is as follows:

    url = 'http://flight.qunar.com/site/oneway_list.htm'
    values = {'searchDepartureAirport': 'Beijing',
              'searchArrivalAirport': 'Lijiang',
              'searchDepartureTime': '2012-07-25'}
    encoded_param = urllib.parse.urlencode(values)
    full_url = url + '?' + encoded_param
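To see what urlencode actually produces, here is a small runnable sketch. The Chinese parameter values below are my own illustration, standing in for the place names in the original URL; urlencode percent-encodes them as UTF-8, which is what makes the resulting URL safe to open.

```python
import urllib.parse

# urlencode percent-encodes each name=value pair; non-ASCII values
# are encoded as UTF-8 bytes, then percent-escaped.
values = {'searchDepartureAirport': '北京', 'searchArrivalAirport': '丽江'}
encoded = urllib.parse.urlencode(values)
print(encoded)
# searchDepartureAirport=%E5%8C%97%E4%BA%AC&searchArrivalAirport=%E4%B8%BD%E6%B1%9F
```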

 

Webpage content capture: The following sample code shows how to capture the webpage content returned when searching Baidu for the keyword "tennis".

    import urllib.parse
    import urllib.request
    from bs4 import BeautifulSoup

    url = 'http://www.baidu.com'
    values = {'wd': 'tennis'}
    encoded_param = urllib.parse.urlencode(values)
    full_url = url + '?' + encoded_param
    response = urllib.request.urlopen(full_url)
    soup = BeautifulSoup(response)
    soup.find_all('a')
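To try find_all without making a network request, here is a small self-contained sketch; the HTML snippet is my own, hypothetical stand-in for a downloaded page, and the parser name is passed explicitly to avoid relying on Beautiful Soup's default guess.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a fetched page (hypothetical content).
html = ('<html><body>'
        '<a href="http://www.baidu.com">Baidu</a>'
        '<a href="#top">Top</a>'
        '</body></html>')

# find_all('a') returns every <a> tag; .get('href') reads an attribute.
soup = BeautifulSoup(html, 'html.parser')
hrefs = [a.get('href') for a in soup.find_all('a')]
print(hrefs)  # ['http://www.baidu.com', '#top']
```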