Python crawler: notes on the pits I fell into

Source: Internet
Author: User

Recently my boss gave me a task: export all the data from our organization's database to a local Excel spreadsheet. I thought this would be simple: give me the database account and password, and a few statements would take care of it.

The result was disappointing. The database can only be viewed through the back-end management system, and the platform has no bulk export function. As for direct database access, don't even think about it: the senior leadership would not approve it.

So I could only do it the clumsy way: scrape the data with a web crawler!

So I picked up Python again after leaving it untouched for six months, and started working out how to write a simple little crawler.

The idea behind writing a crawler in Python is actually simple. Briefly:

1) Simulate the login in Python. The main goal is to obtain the cookie.

2) Analyze the HTTP packets exchanged with the platform and identify the characteristics of the data they carry. The main things to look at are the requests and the responses.

The tricky part of this platform is that the data cannot be extracted from a single place. First you fetch a large list, which is paginated, and then you click each item in the list to see its details.

Analysis of the HTTP packets shows the following flow:

Simulate login -> send the list request (POST, JSON) -> receive the list data (JSON) -> send the details request (GET) -> receive the details page (HTML)
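The flow above can be sketched with the Python 3 standard library alone. This is only a rough outline under stated assumptions: the host, the paths, the JSON payload and the function names are all hypothetical, since the article does not give the platform's real endpoints.

```python
import json
import urllib.request
from http.cookiejar import CookieJar

BASE_URL = "http://platform.example"  # hypothetical host, not from the article


def make_opener():
    """Return an opener that keeps cookies from the login response and resends them."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def list_request(page):
    """Build the POST request (JSON body) for one page of the list."""
    body = json.dumps({"page": page}).encode("utf-8")  # hypothetical payload
    return urllib.request.Request(
        BASE_URL + "/list",  # hypothetical path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def detail_request(item_id):
    """Build the GET request for the HTML details page of one item."""
    # hypothetical path and query parameter
    return urllib.request.Request(BASE_URL + "/detail?id=%s" % item_id, method="GET")
```

After logging in through the opener, each request would be sent with opener.open(list_request(1)), so that the cookie obtained at login travels with every later request.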

The complete data set requires combining the list data with the detail-page data. For the former, parsing the JSON is enough; for the detail pages, the HTML has to be parsed to extract the required items.
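A minimal sketch of that combining step in Python 3 might look like this. The JSON key "rows", the "id" field and the data-field attribute on the table cells are assumptions made for illustration; the real platform's field names are not given in the article.

```python
import json
from html.parser import HTMLParser


class DetailParser(HTMLParser):
    """Collect the text of every <td> that carries a data-field attribute."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # name of the field being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._current = dict(attrs).get("data-field")

    def handle_data(self, data):
        if self._current is not None:
            previous = self.fields.get(self._current, "")
            self.fields[self._current] = previous + data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._current = None


def flatten(list_json, detail_pages):
    """Merge each list item with the fields parsed from its details page."""
    rows = []
    for item in json.loads(list_json)["rows"]:  # "rows" is a hypothetical key
        parser = DetailParser()
        parser.feed(detail_pages[item["id"]])
        parser.close()
        row = dict(item)
        row.update(parser.fields)
        rows.append(row)
    return rows
```

Each resulting dictionary corresponds to one row of the Excel table: the list columns first, then the detail columns.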

The process is not complicated, but there are too many details to write up in full. This article focuses on recording the pits I fell into. There were two big ones.

Pit NO1: Python's painful string encoding

This pit can be split into several smaller questions.

1) What is the relationship between Unicode and UTF-8?

One good explanation is this: UTF-8 is an encoding for the Unicode character set.

The Unicode character set itself is a mapping that links each real-world character to a number (a code point); this is a logical relationship. UTF-8 is an encoding: the algorithm that turns those Unicode code points into bytes.

Put simply: character -> Unicode -> UTF-8

For example: the Chinese "你好" (hello) -> \u4f60\u597d -> \xe4\xbd\xa0\xe5\xa5\xbd
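The two steps of that chain can be checked directly. The sketch below uses Python 3 (where a plain string literal is already Unicode) rather than the Python 2 discussed in this article, and the variable names are only illustrative.

```python
text = "\u4f60\u597d"  # the two characters of the Chinese greeting 你好

# Step 1: character -> Unicode code point (a number).
code_points = [hex(ord(ch)) for ch in text]

# Step 2: code points -> bytes, using the UTF-8 algorithm.
utf8_bytes = text.encode("utf-8")

print(code_points)  # ['0x4f60', '0x597d']
print(utf8_bytes)   # b'\xe4\xbd\xa0\xe5\xa5\xbd'
```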

2) What is the relationship between str and unicode?

str and unicode are Python 2.x concepts.

For example: s = u'你好'

The variable s is a Unicode string, that is, a unicode object (type(s) == unicode).

Here len(s) == 2.

The stored value is \u4f60\u597d.

As for str, it is Python's most primitive data stream. It can be understood as a byte stream, that is, binary data.

Python has two different datatypes. One is 'unicode' and the other is 'str'.

Type 'unicode' is meant for working with codepoints of characters.

Type 'str' is meant for working with encoded binary representation of characters.

In the "你好" example above, the Unicode value is \u4f60\u597d. This value can then be encoded with UTF-8 to become the stored byte stream, \xe4\xbd\xa0\xe5\xa5\xbd.

In Python 3, every str is Unicode, and bytes in 3.x takes the place of str in 2.x.

StackOverflow has a pretty good answer on this: http://stackoverflow.com/questions/18034272/python-str-vs-unicode-types

unicode, which is Python 3's str, is meant to handle text. Text is a sequence of code points which could be bigger than a single byte. Text can be encoded in a specific encoding to represent the text as raw bytes (e.g. utf-8, latin-1 ...). Note that unicode is not encoded! The internal representation used by Python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.

On the contrary, str is a plain sequence of bytes. It does not represent text! In fact, in Python 3 str is called bytes.

You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.

Note that using str you have a lower-level control on the bytes of a specific encoding representation, while using unicode you can only control at the code-point level.

So that makes it very clear.
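The same distinction can be observed in Python 3, where the text type is called str and the byte type is called bytes. This is a small illustration with made-up variable names, not code from the crawler itself.

```python
text = "\u4f60\u597d"       # 你好: str in Python 3, i.e. text (Python 2's unicode)
raw = text.encode("utf-8")  # bytes: the Python 3 counterpart of Python 2's str

text_type = type(text).__name__
raw_type = type(raw).__name__

text_length = len(text)  # counts code points
raw_length = len(raw)    # counts bytes

# Python 3 refuses to mix text and bytes implicitly.
try:
    text + raw
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(text_type, text_length)  # str 2
print(raw_type, raw_length)    # bytes 6
print(mixed_ok)                # False
```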

3) How to use encode and decode

With the two points above as a foundation, these methods are not hard to use.

encode goes from unicode to str: it turns meaningful text into a byte stream.

decode goes from str to unicode: it turns a byte stream back into meaningful text.

decode is called on a str; encode is called on a unicode object.

For example:

    u_a = u'你好'                 # a Unicode string
    u_a                           # output: u'\u4f60\u597d'
    s_a = u_a.encode('utf-8')     # encode u_a as UTF-8, producing a byte stream
    s_a                           # output: '\xe4\xbd\xa0\xe5\xa5\xbd'
    u_a_ = s_a.decode('utf-8')    # decode s_a from UTF-8, back to Unicode
    u_a_                          # output: u'\u4f60\u597d'

UTF-8 is one encoding; GBK is another common one.
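To see that UTF-8 and GBK really are different encodings of the same text, the following Python 3 sketch encodes the same string both ways and then decodes with a mismatched codec. It is only an illustration; the variable names are not from the original crawler.

```python
text = "\u4f60\u597d"  # 你好

utf8_bytes = text.encode("utf-8")  # b'\xe4\xbd\xa0\xe5\xa5\xbd'
gbk_bytes = text.encode("gbk")     # b'\xc4\xe3\xba\xc3'

# Decoding with the codec that produced the bytes restores the text.
round_trip_ok = (utf8_bytes.decode("utf-8") == text
                 and gbk_bytes.decode("gbk") == text)

# Decoding GBK bytes as UTF-8 fails, because they are not valid UTF-8.
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
```

This is why a crawler has to know which encoding a response uses before calling decode on it.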

4) What is the difference between #coding: utf-8 and setdefaultencoding?

#coding: utf-8 declares the encoding of the source file. Without this declaration, the source file cannot contain Chinese characters.

setdefaultencoding sets the default encoding used for Unicode data while Python code is running ("Set the current default string encoding used by the Unicode implementation."). This matters because Unicode text can be encoded in many ways, including UTF-8, UTF-16, UTF-32 and, for Chinese, GBK. When decode or encode is called without an explicit encoding argument, the default codec is used.

Note that in IDLE on Windows, a string literal written without the u prefix is GBK-encoded by default.

Let's look at an example:

    a = '你好'             # in IDLE on Windows, a is GBK-encoded
    a                      # output: '\xc4\xe3\xba\xc3', which is GBK
    b = a.decode('gbk')    # decode from GBK to Unicode
    b                      # output: u'\u4f60\u597d'
    print b                # output: 你好
    b = a.decode()         # no argument: the default ASCII codec is used, which raises
                           # UnicodeDecodeError: 'ascii' codec can't decode byte
    a = u'你好'
    b = a.encode()         # likewise, this raises
                           # UnicodeEncodeError: 'ascii' codec can't encode characters
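For comparison, here is how the same experiment behaves in Python 3, where the default codec is UTF-8 instead of ASCII. This is a sketch for contrast only: the helper try_ascii is a hypothetical name, and the ASCII codec is requested explicitly to reproduce the two errors shown above.

```python
import sys

text = "\u4f60\u597d"  # 你好

default_codec = sys.getdefaultencoding()  # 'utf-8' in Python 3
default_bytes = text.encode()             # no argument: UTF-8 is used


def try_ascii(value):
    """Return the name of the error raised by the ASCII codec, or None."""
    try:
        if isinstance(value, bytes):
            value.decode("ascii")
        else:
            value.encode("ascii")
    except UnicodeError as exc:
        return type(exc).__name__
    return None


encode_error = try_ascii(text)                # 'UnicodeEncodeError'
decode_error = try_ascii(text.encode("gbk"))  # 'UnicodeDecodeError'
```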

Pit NO2: Complex regular expressions (to be continued)
