This article introduces how to use Python to crawl data from the Internet and save specific attributes from it. It covers the encoding problems involved and how to match data with regular expressions; the details follow.
Encoding problems
Since Chinese characters are involved, encoding problems are unavoidable, so this is a good opportunity to sort them out completely.
The problem starts with the encoding of text. The original codes for English needed only the values 0 ~ 255, exactly 8 bits, one byte. To represent other languages the range naturally had to be extended; for Chinese there is the GB series. So what is the relationship between Unicode and UTF-8?
Unicode is an encoding scheme, also called the universal code. However, it is not what is actually stored on a computer; it acts as a middleman. You can encode a Unicode string into UTF-8 or GB bytes for storage on the computer, and decode UTF-8 or GB bytes back into Unicode.
In Python, Unicode is one type of object, written with a u prefix, for example u'中文', while string is another type of object: the byte sequence that actually exists on the computer in some specific encoding. For example, '中文' in UTF-8 encoding and in GBK encoding are different byte strings.
The code is as follows:
>>> s = u'中文'
>>> s1 = s.encode('utf8')
>>> s2 = s.encode('gbk')
>>> print repr(s)
u'\u4e2d\u6587'
>>> print repr(s1)
'\xe4\xb8\xad\xe6\x96\x87'
>>> print repr(s2)
'\xd6\xd0\xce\xc4'
As you can see, what the computer actually stores is such byte sequences, not "a Chinese character" as such; when printing, you must know which encoding was used at the time in order to print correctly. It is often said that in Python, Unicode is the real string, while string is a byte string.
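The reverse direction works the same way: a byte string plus its known encoding can be decoded back to Unicode, which is exactly the middleman role described above. A minimal round-trip check, continuing the session above:
The code is as follows:
>>> s1.decode('utf8') == s
True
>>> s2.decode('gbk') == s
True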
File encoding
Since there are different encodings, if a string is written directly in a code file, which encoding does it get? That is determined by the file encoding: a file is always saved in some encoding. A Python file can carry a coding declaration stating which encoding the file is saved in. If the declared encoding does not match the encoding the file is actually saved in, an exception occurs. The following example declares a file that is actually saved as UTF-8 to be gbk:
The code is as follows:
# coding: gbk
s = u'汉'
s1 = s.encode('utf8')
s2 = s.encode('gbk')
s3 = '汉'
print repr(s)
print repr(s1)
print repr(s2)
print repr(s3)
This raises the error: File "test.py", line 1 SyntaxError: Non-ASCII character '\xe6' in file test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details. Changing the declaration to match the actual encoding:
The code is as follows:
# coding: utf8
s = u'汉'
s1 = s.encode('utf8')
s2 = s.encode('gbk')
s3 = '汉'
print repr(s)
print repr(s1)
print repr(s2)
print repr(s3)
The output is normal: u'\u6c49' '\xe6\xb1\x89' '\xba\xba' '\xe6\xb1\x89'
Basic method
Crawling a web page with Python is actually very easy; a couple of lines suffice.
The code is as follows:
import urllib2
page = urllib2.urlopen('URL').read()
This gives you the page content; next, use regular expressions to pull out the parts you need.
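As a quick illustration (the address and the pattern here are placeholders, not from the original program), fetching a page and pulling out all the link targets could look like this:
The code is as follows:
import re
import urllib2

page = urllib2.urlopen('http://example.com').read()
# a rough pattern for href values; real pages usually need more care
links = re.findall(r'href="(.*?)"', page)
print links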
In practice, however, all sorts of details come up.
Login
The site I was crawling requires login authentication. This is not difficult; just import the cookielib and urllib libraries.
The code is as follows:
import urllib, urllib2, cookielib
cookiejar = cookielib.CookieJar()
urlOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))
This loads a cookie handler; after logging in once through urlOpener, the opener remembers the session information for subsequent requests.
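For a typical form login, the opener POSTs the credentials once and the cookie is kept for later requests. A minimal sketch; the login address and the field names username/password are assumptions, not taken from the original article:
The code is as follows:
# the URL and field names below are hypothetical; check the site's login form
postData = urllib.urlencode({'username': 'me', 'password': 'secret'})
urlOpener.open('http://example.com/login', postData)  # passing data makes this a POST
page = urlOpener.open('http://example.com/data').read()  # later requests reuse the cookie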
Reconnection
If we stop at the level above and do not wrap open, then any fluctuation in network conditions throws an exception and kills the whole program, which makes for a very fragile crawler. The fix is to catch the exception and retry a number of times.
The code is as follows:
def multi_open(opener, *arg):
    while True:
        retryTimes = 20
        while retryTimes > 0:
            try:
                return opener.open(*arg)
            except:
                print '.',
                retryTimes -= 1
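With the wrapper in place, every fetch goes through the retry loop; a hypothetical call would be:
The code is as follows:
page = multi_open(urlOpener, 'http://example.com').read()
Note that the outer while True means the function never gives up: once the inner counter runs out, it simply starts another round of 20 retries.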
Regular expression matching
Regular expression matching is actually not a great method, because its fault tolerance is poor: it requires the web pages to be completely uniform, and the slightest inconsistency makes it fail. I later learned that xpath-based selection is the better choice; something to try next time.
That said, there is real craft in writing regular expressions:
Non-greedy matching. For example, to pull hello out of a tag like <a>hello</a>, a pattern written with the greedy <.*> will not work, because * grabs as much as it can; use the non-greedy form <.*?> instead.
Cross-line matching. To match across lines you can use the DOTALL flag so that . also matches line feeds, but then the whole match becomes very slow. Matching line by line is at most O(nc²), where n is the number of lines and c the average number of columns; with DOTALL it can easily become O((nc)²). My approach is to match line feeds with an explicit \n, which also makes it possible to state clearly how many lines at most a match may span. For example, abc\s*\n\s*def says abc and def are on adjacent lines, and an optional (.*\n)? can be repeated to allow matching as few extra lines as possible.
One pitfall: some lines end in \r, that is, a line ends with \r\n. Not knowing this cost me a long regex-debugging session; now I simply use \s, which covers both trailing spaces and the \r.
Non-capturing groups. To avoid disturbing the numbering of captured groups, the (.*\n) above can be written as (?:.*\n), which is skipped when groups are collected.
Escaping parentheses. Since parentheses denote groups in a regular expression, they must be escaped to be matched literally. It is best to write patterns as raw strings with the r prefix; without it, the backslashes themselves also need escaping.
Writing patterns fast. After writing this many patterns, a routine emerges: take the passage surrounding the text you want, replace the parts to extract with (.*?), replace each newline \n with \s*\n\s*, and strip the trailing spaces from each line. The whole process can be done very quickly in vim. A small demo of these points follows.
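As promised above, a small demo (with made-up sample text) exercising the non-greedy, cross-line, and non-capturing points:
The code is as follows:
import re

sample = 'abc  \r\n   def <b>hello</b> <b>world</b>'

# greedy .* overshoots to the last </b>; non-greedy .*? stops at the first
print re.findall(r'<b>(.*)</b>', sample)   # ['hello</b> <b>world']
print re.findall(r'<b>(.*?)</b>', sample)  # ['hello', 'world']

# \s*\n\s* crosses the line break and also absorbs the trailing \r
print re.search(r'abc\s*\n\s*def', sample) is not None  # True

# (?:...) groups without capturing, so findall returns only (\w+)
print re.findall(r'(?:abc|xyz)\s*\n\s*(\w+)', sample)  # ['def']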
Excel operations
This time the data went into Excel. Only afterwards did I realize that a database would probably have saved me a lot of trouble, but the code was already half written and it was too late to turn back.
A search for Excel handling in Python turns up a few options. One is the xlrd/xlwt pair of libraries, which runs whether or not Excel is installed on the machine, but only produces the xls format. Another is to drive Excel directly through COM, which requires the software to be installed. I used the former.
Basic reading and writing pose no problem. But once the data volume grows, problems appear.
Running out of memory. As soon as the program ran, memory usage climbed and climbed. Looking into it, I learned that flush_row_data should be used; yet the error persisted. Memory usage was now stable, but a memory error still occurred. Maddening. I checked and re-ran repeatedly and got nowhere. What makes it worse is that the bug only shows up at high data volumes, and reaching a high data volume usually takes several hours of waiting; the cost of each debugging round was enormous. By chance I noticed that although memory usage was stable overall, it crept up in small, regular increments, and that regularity was tied to flush_row_data. I had been wondering where the data was being flushed to. It turns out xlwt does it the hard way: data is kept in memory, or flushed to a temp file, and then written out in one shot on save. The problem is exactly that the one-shot write makes memory usage soar. Then what is flush_row_data for? Why not flush straight to the destination from the start?
Row count. This is a limit of the xls format itself: at most 65536 rows. And with a large amount of data, the file is inconvenient to open anyway.
Combining these two points, I adopted the following policy: flush once every 1000 rows; above 65536 rows, start a new sheet; above 3 sheets, start a new file. For convenience, I wrapped xlwt.
The code is as follows:
# coding: utf8
import xlwt

class XLS:
    '''A class wrapping xlwt.'''
    MAX_ROW = 65536
    MAX_SHEET_NUM = 3

    def __init__(self, name, captionList, typeList, encoding='utf8', flushBound=1000):
        self.name = name
        self.captionList = captionList[:]
        self.typeList = typeList[:]
        self.workbookIndex = 1
        self.encoding = encoding
        self.wb = xlwt.Workbook(encoding=self.encoding)
        self.sheetIndex = 1
        self._addSheet()
        self.flushBound = flushBound

    def _addSheet(self):
        # save the current workbook before moving on to a new sheet
        if self.sheetIndex != 1:
            self.wb.save(self.name + str(self.workbookIndex) + '.xls')
        # too many sheets: start a new workbook
        if self.sheetIndex > XLS.MAX_SHEET_NUM:
            self.workbookIndex += 1
            self.wb = xlwt.Workbook(encoding=self.encoding)
            self.sheetIndex = 1
        self.sheet = self.wb.add_sheet(self.name.encode(self.encoding) + str(self.sheetIndex))
        # write the caption row
        for i in range(len(self.captionList)):
            self.sheet.write(0, i, self.captionList[i])
        self.row = 1

    def write(self, data):
        if self.row >= XLS.MAX_ROW:
            self.sheetIndex += 1
            self._addSheet()
        for i in range(len(data)):
            if self.typeList[i] == "num":
                try:
                    self.sheet.write(self.row, i, float(data[i]))
                except ValueError:
                    pass
            else:
                self.sheet.write(self.row, i, data[i])
        if self.row % self.flushBound == 0:
            self.sheet.flush_row_data()
        self.row += 1

    def save(self):
        self.wb.save(self.name + str(self.workbookIndex) + '.xls')
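A hypothetical use of the wrapper (the captions and data are made up):
The code is as follows:
xls = XLS(u'result', [u'name', u'score'], ['str', 'num'])
xls.write([u'Alice', u'90'])          # '90' is converted with float()
xls.write([u'Bob', u'not a number'])  # a bad "num" cell is silently skipped
xls.save()                            # writes result1.xls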
Converting special characters on a webpage
Web pages have their own escape characters (HTML entities), which makes regular matching troublesome. I found a substitution-dictionary solution in the official documentation, thought it was good, and extended it; some of the extra entries exist to keep the regular expressions correct.
The code is as follows:
html_escape_table = {
    # HTML entities
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
    u"·": "&middot;",
    u"°": "&deg;",
    # regular expression metacharacters
    ".": r"\.",
    "^": r"\^",
    "$": r"\$",
    "{": r"\{",
    "}": r"\}",
    "\\": r"\\",
    "|": r"\|",
    "(": r"\(",
    ")": r"\)",
    "+": r"\+",
    "*": r"\*",
    "?": r"\?",
}
def html_escape(text):
    '''Produce entities within text.'''
    tmp = "".join(html_escape_table.get(c, c) for c in text)
    return tmp.encode("utf8")
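Run over the text to be located, this turns a literal snippet into something that matches the page's escaped form and is safe to embed in a pattern; for example (the snippet is made up):
The code is as follows:
print html_escape(u'Tom & Jerry (1940)')
# prints: Tom &amp; Jerry \(1940\)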
End
That is about all the experience. The program as finally written, however, is unreadable; bad style. At first I just wanted to get something working and improve it later, and later never came.
The final program takes a long time to run, most of it spent on network communication. Could it be refactored with multiple threads? Something to think about.