Crawling Web site data with Python and saving it

Coding issues
Because Chinese text is involved, encoding problems are unavoidable, so I am taking this opportunity to get them completely straight.
The problem starts with the encoding of text. The original ASCII code covers only 0 to 127, fitting comfortably in one byte, so to represent the world's languages it naturally had to be extended; for Chinese there is the GB series of encodings. You may also have heard of Unicode and UTF-8, so what is the relationship between them?
Unicode is a coding scheme, sometimes called the universal code, broad enough to cover every language. But it is not what is concretely stored on the computer; it plays an intermediary role. You encode (encode) a Unicode string to UTF-8 or GBK for storage on the machine, and UTF-8 or GBK bytes can in turn be decoded (decode) back to Unicode.
In Python 2, unicode is one class of objects, written with a u prefix such as u'中文', while str is another class: the actual byte string as stored on the computer in some specific encoding. For example, '中文' under UTF-8 and '中文' under GBK are not the same bytes. Consider the following code:

>>> str = u'中文'
>>> str1 = str.encode('utf8')
>>> str2 = str.encode('gbk')
>>> print repr(str)
u'\u4e2d\u6587'
>>> print repr(str1)
'\xe4\xb8\xad\xe6\x96\x87'
>>> print repr(str2)
'\xd6\xd0\xce\xc4'

As you can see, what is stored on the computer is just the encoded bytes, not characters as such; when printing, you need to know which encoding was used in order to print correctly. A good way to put it: in Python 2, unicode is the real string, while str is a byte string.
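For instance, a minimal sketch of getting the characters back out of str2 above, assuming a terminal that can display Chinese:

# str2 holds GBK-encoded bytes; decode back to unicode before printing
print str2.decode('gbk')   # 中文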
File encoding
Since there are different encodings, if a str literal is written directly in a code file, which encoding is it in? That is determined by the encoding of the file itself: a file is always saved in some encoding. A Python source file can carry a coding declaration stating which encoding the file is saved in. If the declared encoding is inconsistent with the encoding the file was actually saved in, an exception occurs. Take a file saved as UTF-8 but declared as GBK:

#coding: GBK
str = u'汉'
str1 = str.encode('utf8')
str2 = str.encode('gbk')
str3 = '汉'
print repr(str)
print repr(str1)
print repr(str2)
print repr(str3)

An error is reported:

File "test.py", line 1
SyntaxError: Non-ASCII character '\xe6' in file test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Declaring the encoding the file was actually saved with works instead:

#coding: UTF8
str = u'汉'
str1 = str.encode('utf8')
str2 = str.encode('gbk')
str3 = '汉'
print repr(str)
print repr(str1)
print repr(str2)
print repr(str3)

This time the output is normal:

u'\u6c49'
'\xe6\xb1\x89'
'\xba\xba'
'\xe6\xb1\x89'

Basic methods
Actually, crawling a Web page with Python is simple; a couple of lines are enough:

import urllib2

page = urllib2.urlopen('url').read()

This gets you the content of the page. The next step is to extract the required parts with regular expressions, for instance like the sketch below. But it is the details that take the real work.
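A sketch of pulling out the page title, say (the pattern here is only illustrative):

import re

m = re.search(r'<title>(.*?)</title>', page)
if m:
    print m.group(1)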
Login
Mine is a website that requires login authentication. That is not too difficult; just import the cookielib and urllib2 libraries:

import urllib, urllib2, cookielib

cookiejar = cookielib.CookieJar()
urlopener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))

This loads cookie handling into an opener; open the login page with urlopener and it remembers the session information, as sketched below.
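A minimal sketch of the login step itself (the login URL and form field names below are hypothetical; they depend on the site):

import urllib

# hypothetical form fields -- substitute the ones your site's login form uses
postdata = urllib.urlencode({'username': 'me', 'password': 'secret'})
resp = urlopener.open('http://example.com/login', postdata)
# later requests through urlopener carry the session cookie automatically
page = urlopener.open('http://example.com/protected').read()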
Reconnecting after disconnection
If you stop at the level above without wrapping the open call, then the moment the network wobbles an exception is thrown and the whole program quits, which makes for a very fragile program. Just handle the exception and retry a few times:

def multi_open(opener, *arg):
    while True:
        retrytimes = 20
        while retrytimes > 0:
            try:
                return opener.open(*arg)
            except:
                # print a dot for each failed attempt and try again
                print '.',
                retrytimes -= 1
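Calls to urlopener.open can then simply go through the wrapper, for example:

page = multi_open(urlopener, 'url').read()

Note that as written it retries indefinitely, printing a dot per failed attempt; the inner counter only batches the attempts into groups of 20.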

Regular matching
In fact, regular matching is not a particularly good method, because its fault tolerance is poor: it requires the pages to be completely uniform, and the slightest inconsistency makes the match fail. I later saw there are selectors based on XPath; I may try that next time.
Writing regexes does take some skill:
Non-greedy matching. For example, with a tag like <a>hello</a>, to extract a single element, a pattern like <a>.*</a> will not do, because * matches greedily; use .*? instead: <a>.*?</a> (see the sketch after this list).
Matching across lines. One way to match across lines is the DOTALL flag, which makes . match newlines as well, but then the whole match becomes slow. The original match proceeds line by line, roughly O(nc^2), where n is the number of lines and c is the average line length; with DOTALL it can easily turn into O((nc)^2). My approach is to match the line break explicitly with \n, which states clearly how many lines the match spans. For example, abc\s*\n\s*def pins abc and def to adjacent lines, and (.*\n)*? makes the match span as few lines as possible.
There is actually one more thing to watch for: some lines end with \r, that is, lines are terminated with \r\n. I did not know this at first and spent ages debugging the regex. Now I just use \s*, which covers both the line ending and the \r.
Non-capturing groups. So as not to disturb the captured groups, the (.*\n) above can be written (?:.*\n), which keeps it out of the captured groups.
Escaping parentheses. Since parentheses denote groups in a regex, they must be escaped in order to match literal parentheses. Regex strings are best written with the r prefix; without it, the backslashes themselves have to be escaped.
Writing regexes quickly. After writing so many patterns I settled into a routine: take the page fragment surrounding the text you want; replace the parts to extract with (.*?); replace line breaks with \s*\n\s*; then strip the spaces at the ends of lines. The whole process goes quickly in vim.
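A small demonstration of the greedy and cross-line points above (the sample HTML is made up):

# coding: utf-8
import re

html = '<a>first</a> <a>second</a>\r\n  <b>next line</b>'

# greedy .* runs to the last </a> on the line
print re.findall(r'<a>.*</a>', html)     # ['<a>first</a> <a>second</a>']
# non-greedy .*? stops at the first closing tag
print re.findall(r'<a>.*?</a>', html)    # ['<a>first</a>', '<a>second</a>']

# \s*\n\s* spans the line break and also absorbs the trailing \r
print re.search(r'</a>\s*\n\s*<b>(.*?)</b>', html).group(1)   # next line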
Excel operations
This time the data went into Excel. Only afterwards did I realize that putting it in a database might have involved less of this trouble. But by then it was half written, and hard to turn back.
Searching around for Excel support turns up a few options. One is the xlrd/xlwt pair of libraries, which run whether or not Excel is installed on the machine, but only handle the xls format. Another wraps COM directly, which requires the software to be installed. I used the first.
Basic reading and writing were no problem, but the volume of data was.
Not enough memory. As soon as the program ran, memory usage crept up bit by bit. Reading further I learned about flush_row_data, but it still failed: watching memory consumption, everything looked smooth, yet in the end a memory error appeared anyway. This was baffling. I ran it again and again with no result, and fatally, the bug only showed up with large volumes of data, which often meant waiting several hours per run; the cost of debugging was too high. By accident I noticed that although memory consumption was smooth overall, there were small regular upticks, and their timing had nothing to do with flush_row_data. That explained what had puzzled me about where the data was being flushed to. It turns out xlwt takes a rather painful approach: the data sits in memory, or is flushed to a temp file, and on save everything is written out in one go, and that one-shot write is where memory soared. Then what do I need flush_row_data for? Why not flush to the final destination from the start?
Row count limit. This is determined by the xls format itself, which caps a sheet at 65536 rows; and with a lot of data, a huge file is inconvenient to open anyway.
Combining the two points above, I settled on a strategy: flush whenever the row count reaches a multiple of 1000; open a new sheet past 65536 rows; and after 3 sheets, create a new file. For convenience, I wrapped xlwt:

#coding: utf-8
import xlwt

class XLS:
    '''A class wrapping xlwt'''
    MAX_ROW = 65536
    MAX_SHEET_NUM = 3

    def __init__(self, name, captionlist, typelist, encoding='UTF8', flushbound=1000):
        self.name = name
        self.captionlist = captionlist[:]
        self.typelist = typelist[:]
        self.workbookindex = 1
        self.encoding = encoding
        self.wb = xlwt.Workbook(encoding=self.encoding)
        self.sheetindex = 1
        self.__addsheet()
        self.flushbound = flushbound

    def __addsheet(self):
        # save the current workbook before moving on to a new sheet
        if self.sheetindex != 1:
            self.wb.save(self.name + str(self.workbookindex) + '.xls')
        # too many sheets: start a fresh workbook
        if self.sheetindex > XLS.MAX_SHEET_NUM:
            self.workbookindex += 1
            self.wb = xlwt.Workbook(encoding=self.encoding)
            self.sheetindex = 1

        self.sheet = self.wb.add_sheet(self.name.encode(self.encoding) + str(self.sheetindex))
        # the first row holds the column captions
        for i in range(len(self.captionlist)):
            self.sheet.write(0, i, self.captionlist[i])

        self.row = 1

    def write(self, data):
        # past the xls row limit: move to the next sheet
        if self.row >= XLS.MAX_ROW:
            self.sheetindex += 1
            self.__addsheet()

        for i in range(len(data)):
            if self.typelist[i] == 'num':
                try:
                    self.sheet.write(self.row, i, float(data[i]))
                except ValueError:
                    pass
            else:
                self.sheet.write(self.row, i, data[i])

        if self.row % self.flushbound == 0:
            self.sheet.flush_row_data()
        self.row += 1

    def save(self):
        self.wb.save(self.name + str(self.workbookindex) + '.xls')
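Usage then looks something like this (the captions and types are made up for illustration):

xls = XLS(u'result', [u'title', u'price'], ['str', 'num'])
xls.write([u'something', u'12.5'])
xls.save()   # writes result1.xls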

Convert Web page special characters
Web pages also have their own escape sequences (HTML entities), which makes regular matching a bit of a hassle. I found a dictionary-replacement scheme in the official documentation that I thought was good, and extended it a little; some of the entries exist to keep the regexes correct:

html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
    # numeric entities for the extended characters (could also be &bull; / &deg;)
    u"•": "&#8226;",
    u"°": "&#176;",
    # regular expression metacharacters, escaped to keep patterns correct
    ".": r"\.",
    "^": r"\^",
    "$": r"\$",
    "{": r"\{",
    "}": r"\}",
    "\\": r"\\",
    "|": r"\|",
    "(": r"\(",
    ")": r"\)",
    "+": r"\+",
    "*": r"\*",
    "?": r"\?",
}

def html_escape(text):
    '''Produce entities within text.'''
    tmp = "".join(html_escape_table.get(c, c) for c in text)
    return tmp.encode("utf-8")
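For instance, a literal page fragment becomes a pattern-safe string (the fragment is made up):

print html_escape(u'Price (USD): 3.5')   # Price \(USD\): 3\.5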

Wrapping up
That is about all the experience gained. The program I ended up with, though, I can hardly bear to look at any more: the style is poor, because at the start I just rushed to get it written and then tried not to touch it.
The final program takes a long time to run, with network communication taking up most of the time. Would it be worth refactoring with multithreading? I thought about it; let it be.
