Crawling Web site data with Python and saving it

Coding issues
Because Chinese text is involved, encoding problems are unavoidable, so I am taking this opportunity to get them completely straight.
The problem starts with the encoding of text. The original ASCII code covers only 0 to 127, fitting comfortably in one byte, so to represent the world's languages it naturally had to be extended; for Chinese there is the GB series of encodings. You may also have heard of Unicode and UTF-8, so what is the relationship between them?
Unicode is a coding scheme, sometimes called the universal code, broad enough to cover every language. But it is not what is concretely stored on the computer; it plays an intermediary role. You encode (encode) a Unicode string to UTF-8 or GBK for storage on the machine, and UTF-8 or GBK bytes can in turn be decoded (decode) back to Unicode.
In Python 2, unicode is one class of objects, written with a u prefix such as u'中文', while str is another class: the actual byte string as stored on the computer in some specific encoding. For example, '中文' under UTF-8 and '中文' under GBK are not the same bytes. Consider the following code:

>>> str = u'中文'
>>> str1 = str.encode('utf8')
>>> str2 = str.encode('gbk')
>>> print repr(str)
u'\u4e2d\u6587'
>>> print repr(str1)
'\xe4\xb8\xad\xe6\x96\x87'
>>> print repr(str2)
'\xd6\xd0\xce\xc4'

As you can see, what is stored on the computer is just the encoded bytes, not characters as such; when printing, you need to know which encoding was used in order to print correctly. A good way to put it: in Python 2, unicode is the real string, while str is a byte string.
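For instance, a minimal sketch of getting the characters back out of str2 above, assuming a terminal that can display Chinese:

# str2 holds GBK-encoded bytes; decode back to unicode before printing
print str2.decode('gbk')   # 中文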
File encoding
Since there are different encodings, if a str literal is written directly in a code file, which encoding is it in? That is determined by the encoding of the file itself: a file is always saved in some encoding. A Python source file can carry a coding declaration stating which encoding the file is saved in. If the declared encoding is inconsistent with the encoding the file was actually saved in, an exception occurs. Take a file saved as UTF-8 but declared as GBK:

#coding: GBK
str = u'汉'
str1 = str.encode('utf8')
str2 = str.encode('gbk')
str3 = '汉'
print repr(str)
print repr(str1)
print repr(str2)
print repr(str3)

An error is reported:

File "test.py", line 1
SyntaxError: Non-ASCII character '\xe6' in file test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Declaring the encoding the file was actually saved with works instead:

#coding: UTF8
str = u'汉'
str1 = str.encode('utf8')
str2 = str.encode('gbk')
str3 = '汉'
print repr(str)
print repr(str1)
print repr(str2)
print repr(str3)

This time the output is normal:

u'\u6c49'
'\xe6\xb1\x89'
'\xba\xba'
'\xe6\xb1\x89'

Basic methods
Actually, crawling a Web page with Python is simple; a couple of lines are enough:

import urllib2

page = urllib2.urlopen('url').read()

This gets you the content of the page. The next step is to extract the required parts with regular expressions, for instance like the sketch below. But it is the details that take the real work.
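A sketch of pulling out the page title, say (the pattern here is only illustrative):

import re

m = re.search(r'<title>(.*?)</title>', page)
if m:
    print m.group(1)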
Login
Mine is a website that requires login authentication. That is not too difficult; just import the cookielib and urllib2 libraries:

import urllib, urllib2, cookielib

cookiejar = cookielib.CookieJar()
urlopener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))

This loads cookie handling into an opener; open the login page with urlopener and it remembers the session information, as sketched below.
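A minimal sketch of the login step itself (the login URL and form field names below are hypothetical; they depend on the site):

import urllib

# hypothetical form fields -- substitute the ones your site's login form uses
postdata = urllib.urlencode({'username': 'me', 'password': 'secret'})
resp = urlopener.open('http://example.com/login', postdata)
# later requests through urlopener carry the session cookie automatically
page = urlopener.open('http://example.com/protected').read()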
Reconnecting after disconnection
If you stop at the level above without wrapping the open call, then the moment the network wobbles an exception is thrown and the whole program quits, which makes for a very fragile program. Just handle the exception and retry a few times:

def multi_open(opener, *arg):
    while True:
        retrytimes = 20
        while retrytimes > 0:
            try:
                return opener.open(*arg)
            except:
                # print a dot for each failed attempt and try again
                print '.',
                retrytimes -= 1
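Calls to urlopener.open can then simply go through the wrapper, for example:

page = multi_open(urlopener, 'url').read()

Note that as written it retries indefinitely, printing a dot per failed attempt; the inner counter only batches the attempts into groups of 20.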

Regular matching
In fact, regular matching is not a particularly good method, because its fault tolerance is poor: it requires the pages to be completely uniform, and the slightest inconsistency makes the match fail. I later saw there are selectors based on XPath; I may try that next time.
Writing regexes does take some skill:
Non-greedy matching. For example, with a tag like <a>hello</a>, to extract a single element, a pattern like <a>.*</a> will not do, because * matches greedily; use .*? instead: <a>.*?</a> (see the sketch after this list).
Matching across lines. One way to match across lines is the DOTALL flag, which makes . match newlines as well, but then the whole match becomes slow. The original match proceeds line by line, roughly O(nc^2), where n is the number of lines and c is the average line length; with DOTALL it can easily turn into O((nc)^2). My approach is to match the line break explicitly with \n, which states clearly how many lines the match spans. For example, abc\s*\n\s*def pins abc and def to adjacent lines, and (.*\n)*? makes the match span as few lines as possible.
There is actually one more thing to watch for: some lines end with \r, that is, lines are terminated with \r\n. I did not know this at first and spent ages debugging the regex. Now I just use \s*, which covers both the line ending and the \r.
Non-capturing groups. So as not to disturb the captured groups, the (.*\n) above can be written (?:.*\n), which keeps it out of the captured groups.
Escaping parentheses. Since parentheses denote groups in a regex, they must be escaped in order to match literal parentheses. Regex strings are best written with the r prefix; without it, the backslashes themselves have to be escaped.
Writing regexes quickly. After writing so many patterns I settled into a routine: take the page fragment surrounding the text you want; replace the parts to extract with (.*?); replace line breaks with \s*\n\s*; then strip the spaces at the ends of lines. The whole process goes quickly in vim.
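A small demonstration of the greedy and cross-line points above (the sample HTML is made up):

# coding: utf-8
import re

html = '<a>first</a> <a>second</a>\r\n  <b>next line</b>'

# greedy .* runs to the last </a> on the line
print re.findall(r'<a>.*</a>', html)     # ['<a>first</a> <a>second</a>']
# non-greedy .*? stops at the first closing tag
print re.findall(r'<a>.*?</a>', html)    # ['<a>first</a>', '<a>second</a>']

# \s*\n\s* spans the line break and also absorbs the trailing \r
print re.search(r'</a>\s*\n\s*<b>(.*?)</b>', html).group(1)   # next line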
Excel operations
This time the data went into Excel. Only afterwards did I realize that putting it in a database might have involved less of this trouble. But by then it was half written, and hard to turn back.
Searching around for Excel support turns up a few options. One is the xlrd/xlwt pair of libraries, which run whether or not Excel is installed on the machine, but only handle the xls format. Another wraps COM directly, which requires the software to be installed. I used the first.
Basic reading and writing were no problem, but the volume of data was.
Not enough memory. As soon as the program ran, memory usage crept up bit by bit. Reading further I learned about flush_row_data, but it still failed: watching memory consumption, everything looked smooth, yet in the end a memory error appeared anyway. This was baffling. I ran it again and again with no result, and fatally, the bug only showed up with large volumes of data, which often meant waiting several hours per run; the cost of debugging was too high. By accident I noticed that although memory consumption was smooth overall, there were small regular upticks, and their timing had nothing to do with flush_row_data. That explained what had puzzled me about where the data was being flushed to. It turns out xlwt takes a rather painful approach: the data sits in memory, or is flushed to a temp file, and on save everything is written out in one go, and that one-shot write is where memory soared. Then what do I need flush_row_data for? Why not flush to the final destination from the start?
Row count limit. This is determined by the xls format itself, which caps a sheet at 65536 rows; and with a lot of data, a huge file is inconvenient to open anyway.
Combining the two points above, I settled on a strategy: flush whenever the row count reaches a multiple of 1000; open a new sheet past 65536 rows; and after 3 sheets, create a new file. For convenience, I wrapped xlwt:

#coding: utf-8
import xlwt

class XLS:
    '''A class wrapping xlwt'''
    MAX_ROW = 65536
    MAX_SHEET_NUM = 3

    def __init__(self, name, captionlist, typelist, encoding='UTF8', flushbound=1000):
        self.name = name
        self.captionlist = captionlist[:]
        self.typelist = typelist[:]
        self.workbookindex = 1
        self.encoding = encoding
        self.wb = xlwt.Workbook(encoding=self.encoding)
        self.sheetindex = 1
        self.__addsheet()
        self.flushbound = flushbound

    def __addsheet(self):
        # save the current workbook before moving on to a new sheet
        if self.sheetindex != 1:
            self.wb.save(self.name + str(self.workbookindex) + '.xls')
        # too many sheets: start a fresh workbook
        if self.sheetindex > XLS.MAX_SHEET_NUM:
            self.workbookindex += 1
            self.wb = xlwt.Workbook(encoding=self.encoding)
            self.sheetindex = 1

        self.sheet = self.wb.add_sheet(self.name.encode(self.encoding) + str(self.sheetindex))
        # the first row holds the column captions
        for i in range(len(self.captionlist)):
            self.sheet.write(0, i, self.captionlist[i])

        self.row = 1

    def write(self, data):
        # past the xls row limit: move to the next sheet
        if self.row >= XLS.MAX_ROW:
            self.sheetindex += 1
            self.__addsheet()

        for i in range(len(data)):
            if self.typelist[i] == 'num':
                try:
                    self.sheet.write(self.row, i, float(data[i]))
                except ValueError:
                    pass
            else:
                self.sheet.write(self.row, i, data[i])

        if self.row % self.flushbound == 0:
            self.sheet.flush_row_data()
        self.row += 1

    def save(self):
        self.wb.save(self.name + str(self.workbookindex) + '.xls')
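Usage then looks something like this (the captions and types are made up for illustration):

xls = XLS(u'result', [u'title', u'price'], ['str', 'num'])
xls.write([u'something', u'12.5'])
xls.save()   # writes result1.xls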

Convert Web page special characters
Web pages also have their own escape sequences (HTML entities), which makes regular matching a bit of a hassle. I found a dictionary-replacement scheme in the official documentation that I thought was good, and extended it a little; some of the entries exist to keep the regexes correct:

html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
    # numeric entities for the extended characters (could also be &bull; / &deg;)
    u"•": "&#8226;",
    u"°": "&#176;",
    # regular expression metacharacters, escaped to keep patterns correct
    ".": r"\.",
    "^": r"\^",
    "$": r"\$",
    "{": r"\{",
    "}": r"\}",
    "\\": r"\\",
    "|": r"\|",
    "(": r"\(",
    ")": r"\)",
    "+": r"\+",
    "*": r"\*",
    "?": r"\?",
}

def html_escape(text):
    '''Produce entities within text.'''
    tmp = "".join(html_escape_table.get(c, c) for c in text)
    return tmp.encode("utf-8")
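For instance, a literal page fragment becomes a pattern-safe string (the fragment is made up):

print html_escape(u'Price (USD): 3.5')   # Price \(USD\): 3\.5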

Wrapping up
That is about all the experience gained. The program I ended up with, though, I can hardly bear to look at any more: the style is poor, because at the start I just rushed to get it written and then tried not to touch it.
The final program takes a long time to run, with network communication taking up most of the time. Would it be worth refactoring with multithreading? I thought about it; let it be.
