Python Crawler Summary (II): Common Data Types
In the previous article, we briefly introduced how to use Python to send HTTP/HTTPS requests to fetch online data. Data collected from the Web comes in several types, mainly:
- Data placed in HTML.
- Data placed directly in JavaScript.
- Data placed in JSON.
- Data placed in XML.
Note: Many of the concepts here come from web front-end development. Since most of the data we collect comes from web pages, some front-end knowledge is necessary.
Below I briefly introduce each data type and explain how to parse it, with examples.
Data placed in HTML
HTML (Hypertext Markup Language) is what most current web pages are written in, so naturally a lot of data is stored in HTML.
Data placed in JavaScript
JavaScript is the dominant front-end scripting language, and much key data is placed directly inside JavaScript code.
Data placed in JSON
JSON is JavaScript Object Notation. It is actually language-independent, and is often used for storage in JavaScript or for asynchronous transmission (which we'll discuss later in detail). Because JavaScript is used so heavily in web front-end development, JSON has become one of the main web data types.
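Once a JSON string is extracted, Python's standard-library json module can turn it into ordinary dicts and lists. A minimal sketch (Python 3 syntax; the data here is a made-up example, not real page content):

```python
import json

# A hypothetical JSON string, similar in shape to what a page might embed
raw = '{"movies": [{"title": "Example", "rate": "8.5"}]}'
data = json.loads(raw)  # parse the JSON text into Python dicts/lists
for movie in data["movies"]:
    print(movie["title"], movie["rate"])
```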
Data placed in XML
XML (Extensible Markup Language) is designed to transmit and store data. Many JavaScript programmers like to use XML to transfer data, and some pages are written directly in XML rather than HTML.
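XML can also be parsed with the standard library alone. A minimal sketch using xml.etree.ElementTree on a hypothetical XML document (the tag and attribute names are illustrative, not from a real page):

```python
import xml.etree.ElementTree as ET

# A hypothetical XML document carrying the same kind of movie data
xml_text = '<movies><movie title="Example" rate="8.5"/></movies>'
root = ET.fromstring(xml_text)       # parse the XML string into an element tree
for movie in root.findall("movie"):  # iterate over <movie> child elements
    print(movie.get("title"), movie.get("rate"))
```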
Example 1: Collecting Douban's "Now Showing" movie information
Visit Douban Movies to see the popular movies (Figure 1): https://movie.douban.com/
Figure 1
Right-click in the browser and view the source code (View Page Source):
Figure 2
The Douban movie page crawled in the previous article contains the HTML tags shown in Figure 2.
Now suppose we want to collect the movie name, release year, and rating for each "Now Showing" movie. The first step is to locate that information. The locating methods are as follows: the regular expression method, the XPath method, and the BeautifulSoup method.
Regular Expression method:
Readers who are not familiar with regular expressions can check here: Regular expressions. This locating method is the most versatile: it can locate not only "normal" data placed in HTML, like the Douban movie information, but also data placed in JavaScript, in JSON, in XML, in comments, and so on. The disadvantage is that writing a correct regular expression is difficult and often requires repeated debugging.
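As a quick self-contained illustration of the approach, here is a minimal Python 3 sketch that applies such a pattern to a made-up HTML fragment (the tag and attribute names mimic the Douban markup; the movie data is invented):

```python
import re

# A made-up fragment mimicking the Douban markup; the pattern pulls out
# the title, release year, and rating in one pass
html = ('<li class="ui-slide-item" data-title="Movie A" '
        'data-release="2016" data-rate="8.1">')
pattern = r'data-title="(.+?)" data-release="(\d+?)" data-rate="([0-9.]+?)"'
matches = re.findall(pattern, html)
print(matches)  # [('Movie A', '2016', '8.1')]
```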
Here's how I use a regular expression to locate the movie name, release year, and rating (Figure 3 shows the run result):
import urllib2, re

url = "https://movie.douban.com/"
page = urllib2.urlopen(url).read()
result = re.findall(r'<li class="ui-slide-item" data-title="(.+?)" data-release="(\d+?)" data-rate="([0-9.]+?)"', page)
for item in result:
    print 'Movie:', item[0], 'Date:', item[1], 'Score:', item[2]
Figure 3: Running results
XPath positioning method:
XPath is a language for finding information in an XML document, but it can also be used to look up information in HTML. Elements can be located conveniently with XPath's powerful syntax, and many browsers now have plugins that make it easy to get the XPath path of an element, such as Firefox's Firebug plugin.
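The article's code below uses lxml on the live page; as a dependency-free sketch of the same idea, the standard library's xml.etree.ElementTree supports a useful subset of XPath, including attribute predicates like the one used here (the fragment and data are made up):

```python
import xml.etree.ElementTree as ET

# A tiny well-formed fragment; real HTML pages would be parsed with lxml.etree.HTML
fragment = ('<ul>'
            '<li class="ui-slide-item" data-title="Movie A" data-rate="8.1"/>'
            '<li class="other" data-title="Movie B"/>'
            '</ul>')
root = ET.fromstring(fragment)
# the [@class='...'] predicate keeps only the matching <li> elements
items = root.findall(".//li[@class='ui-slide-item']")
for li in items:
    print(li.get("data-title"), li.get("data-rate"))  # Movie A 8.1
```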
Here is the code that is positioned with XPath:
import urllib2
from lxml import etree

url = "https://movie.douban.com/"
page = urllib2.urlopen(url).read()
temp = etree.HTML(page)
results = temp.xpath("//li[@class='ui-slide-item']")
for item in results:
    print 'Movie:', item.get('data-title'), 'Date:', item.get('data-release'), 'Score:', item.get('data-rate')
BeautifulSoup Positioning Method:
BeautifulSoup is an HTML/XML parser that handles non-canonical markup and generates a parse tree. It provides simple, common operations for navigating, searching, and modifying the parse tree, and can greatly reduce the time spent locating information. Because it is a third-party library, it needs to be installed separately. Installation instructions and documentation can be found at: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Here is the code that is positioned with BeautifulSoup:
import urllib2
from bs4 import BeautifulSoup

url = "https://movie.douban.com/"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")
results = soup.find_all("li", class_='ui-slide-item')
for item in results:
    print 'Movie:', item.get('data-title'), 'Date:', item.get('data-release'), 'Score:', item.get('data-rate')
The XPath and BeautifulSoup methods are quite handy for locating elements, but they cannot locate data placed directly in JavaScript, i.e., information that is not inside the HTML/XML tags. In practice, the methods above are often combined: first use a regular expression to extract the data from the JS code, then parse it with XPath or BeautifulSoup.
Note: BeautifulSoup output is Unicode. If you want UTF-8-encoded results, remember to transcode.
Example 2: Capturing the titles of the latest articles from the People's Daily WeChat public account
Link:
http://mp.weixin.qq.com/profile?src=3&timestamp=1469945371&ver=1&signature=bssqmk1ly77m4o22qti37cbhjhwnv7c9v4aor9hlhasvbdaqsoh7ebjeivjgvbrz7ic-eofx2fq4bsztco0syg==
The article information listed below is not in HTML tags, but in JSON data inside the JS code.
Figure 4
This is also very common. The solution is to extract the JSON data with a regular expression first, and then parse it with Python's json library. Here's the code:
import urllib2, json, re

url = "http://mp.weixin.qq.com/profile?src=3&timestamp=1469945371&ver=1&signature=bssqmk1ly77m4o22qti37cbhjhwnv7c9v4aor9hlhasvbdaqsoh7ebjeivjgvbrz7ic-eofx2fq4bsztco0syg=="
page = urllib2.urlopen(url).read()
result = re.findall(r"var msgList = '(\{\S+\})';", page)
temp = re.sub('&quot;', '"', result[0])
json_data = json.loads(temp, encoding='utf-8')
for item in json_data['list']:
    print item['app_msg_ext_info']['title']
Run result:
Figure 4
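The same extract-then-parse technique can be exercised offline. A minimal Python 3 sketch on a made-up page snippet (the variable name msgList and the key names mirror the example above but the content is invented):

```python
import json
import re

# A made-up page snippet with JSON embedded in a JS variable
page = 'var msgList = \'{"list": [{"app_msg_ext_info": {"title": "Hello"}}]}\';'
match = re.search(r"var msgList = '(\{.+\})';", page)
data = json.loads(match.group(1))  # parse the captured JSON text
titles = [item["app_msg_ext_info"]["title"] for item in data["list"]]
print(titles)  # ['Hello']
```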
Other notes
Data placed in XML can be handled with the same methods as data in HTML; refer to the operations in Example 1, which will not be repeated here.
The following is mainly a reminder about the encoding problems you must pay attention to during parsing.
Encoding issues
During parsing you should pay attention to encoding, because web pages may be UTF-8 encoded, GBK encoded, GB2312 encoded, and so on. If encoding is not handled well, it can easily cause problems such as input/output exceptions and regular expression matching errors. My solution is to stick to one central idea: no matter what the source encoding is, convert everything to UTF-8 before it reaches the parser. For example, for a GBK-encoded page, I do a transcoding operation before processing it:
utf8_page = gbk_page.decode("gbk").encode("utf-8")
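That one-liner is Python 2; in Python 3, str is already Unicode, so the same idea becomes: decode the raw bytes once, then re-encode only when UTF-8 bytes are explicitly needed. A sketch with simulated page bytes:

```python
# Simulated GBK-encoded page bytes (the two characters mean "Chinese")
gbk_page = u'\u4e2d\u6587'.encode('gbk')
text = gbk_page.decode('gbk')      # unified internal (Unicode) form
utf8_page = text.encode('utf-8')   # UTF-8 bytes for output/IO
print(utf8_page)  # b'\xe4\xb8\xad\xe6\x96\x87'
```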
At the same time, at the initialization point of the code (or at the very beginning), I usually add the following:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
The source file itself must also be saved with UTF-8 encoding.
This treatment keeps things clear and uniform: there will be no case of a UTF-8 regular expression trying to match a GBK string, and no IO errors caused by outputting a mix of UTF-8 and GBK encoded strings.
If you do not know in advance what encoding a page uses, I recommend Python's third-party package chardet (https://pypi.python.org/pypi/chardet/), which can automatically identify a web page's encoding. Usage:
import chardet
import urllib2

# choose different data as needed
testdata = urllib2.urlopen('http://www.baidu.com/').read()
print chardet.detect(testdata)
The run result:
It accurately determines that Baidu's encoding is utf-8.
Summary
Today we introduced the common data types and, with examples, the three locating methods: the regular expression method, the XPath method, and the BeautifulSoup method. Finally, I covered the encoding problems that often give crawler beginners headaches, and introduced the chardet tool. Next, I will introduce how to handle Ajax asynchronous loading, as well as common anti-crawler strategies and how to deal with them.