Python 3 provides the URL-opening module urllib.request and the HTML-parsing module html.parser. However, html.parser offers only low-level callbacks and is hard to use for extracting page content directly. Beautiful Soup 4 is a powerful Python library for parsing HTML and XML files, and it comes with very thorough documentation (http://www.crummy.com/software/BeautifulSoup/bs4/doc).
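To illustrate why html.parser is cumbersome, here is a minimal sketch of the standard-library approach: even a task as simple as collecting every link's href requires writing a parser subclass by hand (the class name LinkCollector is my own, for illustration):

```python
from html.parser import HTMLParser

# html.parser only fires low-level callbacks; collecting link
# targets means subclassing and tracking state ourselves.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<html><body><a href="/a">A</a><a href="/b">B</a></body></html>')
print(parser.links)  # ['/a', '/b']
```

With Beautiful Soup the same task is a one-line find_all('a'), as shown later in this article.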
Beautiful Soup 4 installation and related issues
The latest version of Beautiful Soup is 4.1.1, which can be downloaded here (http://www.crummy.com/software/BeautifulSoup/bs4/download). I am using Mac OS X. To install Beautiful Soup on this platform, just unpack the archive and run the setup.py file:
$ python3 setup.py install
If you get a SyntaxError ("invalid syntax") on the line root_tag_name = u'[document]', convert the code to Python 3 with 2to3:

$ 2to3-3.2 -w bs4
Chinese characters in URLs
Some URLs contain Chinese characters, for example:

http://flight.qunar.com/site/oneway_list.htm?searchDepartureAirport=北京&searchArrivalAirport=丽江&searchDepartureTime=2012-08-09
If you pass such a URL directly to urllib.request.urlopen, a TypeError occurs. The solution is to put the parameter names and values in a dict and encode them with the urllib.parse.urlencode method. The sample code is as follows:
url = 'http://flight.qunar.com/site/oneway_list.htm'
values = {'searchDepartureAirport': '北京',
          'searchArrivalAirport': '丽江',
          'searchDepartureTime': '2012-07-25'}
encoded_param = urllib.parse.urlencode(values)
full_url = url + '?' + encoded_param
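A quick self-contained sketch of what urlencode actually does to a Chinese value: by default it serializes the string as UTF-8 and percent-encodes each byte, which is why the result is safe to append to a URL.

```python
import urllib.parse

# '北京' is encoded as UTF-8 bytes E5 8C 97 E4 BA AC,
# each escaped as %XX in the query string.
values = {'searchDepartureAirport': '北京'}
encoded = urllib.parse.urlencode(values)
print(encoded)  # searchDepartureAirport=%E5%8C%97%E4%BA%AC
```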
Fetching webpage content: the following sample code shows how to fetch the result page of a Baidu search for the keyword "tennis".
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.baidu.com'
values = {'wd': 'tennis'}
encoded_param = urllib.parse.urlencode(values)
full_url = url + '?' + encoded_param
response = urllib.request.urlopen(full_url)
soup = BeautifulSoup(response)
soup.find_all('a')
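To show what find_all returns without depending on the network, here is a minimal sketch that parses a small static page standing in for the fetched search results (the HTML string is invented for illustration; the parser name 'html.parser' is passed explicitly to avoid relying on an optional third-party parser):

```python
from bs4 import BeautifulSoup

# A tiny static page stands in for the response from Baidu.
html = '<html><body><a href="/a">A</a><p>text</p><a href="/b">B</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# find_all('a') returns every <a> tag; attributes are read like dict keys.
links = [a['href'] for a in soup.find_all('a')]
print(links)  # ['/a', '/b']
```

Each element in the result is a Tag object, so you can also read a.get_text() for the link text.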