How to Crawl Web Site Data and Save It with Python

Source: Internet
Author: User
Tags: python, flush

Coding problems
Because the data involves Chinese text, the problem of character encoding inevitably comes up, so this is a good opportunity to understand it thoroughly.
The problem starts with how text is encoded. The original encoding for English needs only the values 0~255, that is, 8 bits in a single byte. To express other languages it naturally had to be extended; for Chinese there is the GB series. You have probably also heard of Unicode and UTF-8, so what is the relationship between them?
Unicode is a coding scheme, also called the universal character set, and as the name suggests it covers a huge range of characters. But Unicode itself is not how text is stored on the computer; it acts as an intermediary. You encode (encode) Unicode into UTF-8 or GB and store those bytes on the computer; the UTF-8 or GB bytes can then be decoded (decode) and restored to Unicode.
In Python 2, unicode is the type of objects written with a u prefix, such as u'中文', while str is what is actually stored on the computer in a specific encoding. For example, '中文' under UTF-8 and '中文' under GBK are not the same bytes. Have a look at the following code:

The code is as follows:

>>> str = u'中文'
>>> str1 = str.encode('utf8')
>>> str2 = str.encode('gbk')
>>> print repr(str)
u'\u4e2d\u6587'
>>> print repr(str1)
'\xe4\xb8\xad\xe6\x96\x87'
>>> print repr(str2)
'\xd6\xd0\xce\xc4'

As you can see, what is actually stored in the computer is just such byte sequences, not "Chinese characters" as such; when printing, you have to know which encoding the bytes are in so that they can be printed correctly. There is a good saying that in Python 2 unicode is the real string, while str is a byte string.
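Going the other way, a byte string can be decoded back to Unicode. A minimal continuation of the session above (reusing the same variables):

>>> print repr(str1.decode('utf8'))
u'\u4e2d\u6587'
>>> print repr(str2.decode('gbk'))
u'\u4e2d\u6587'
>>> str1.decode('utf8') == str2.decode('gbk')
True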
File encoding
Since there are different encodings, if you write a string literal directly in a code file, which encoding does it end up in? That is determined by the encoding of the file: a file is always saved in some encoding. A Python file can carry a coding declaration stating which encoding the file is saved in. If the declared encoding does not match the encoding the file is actually saved in, an error occurs. Consider the following example: a file saved as UTF-8 but declared as GBK.

The code is as follows:

# coding: gbk
str = u'汉'
str1 = str.encode('utf8')
str2 = str.encode('gbk')
str3 = '汉'
print repr(str)
print repr(str1)
print repr(str2)
print repr(str3)

Running it produces an error:

File "test.py", line 1
SyntaxError: Non-ASCII character '\xe6' in file test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

If the declared encoding matches the encoding the file is actually saved in, everything works.

The code is as follows:

# coding: utf8
str = u'汉'
str1 = str.encode('utf8')
str2 = str.encode('gbk')
str3 = '汉'
print repr(str)
print repr(str1)
print repr(str2)
print repr(str3)

This outputs the expected result:

u'\u6c49'
'\xe6\xb1\x89'
'\xba\xba'
'\xe6\xb1\x89'

Note that str3, the plain literal '汉', prints as the UTF-8 bytes '\xe6\xb1\x89' because that is the encoding the file itself is saved in.

Basic methods
Actually, fetching a web page in Python is simple; it only takes a couple of lines.

The code is as follows:

import urllib2
page = urllib2.urlopen('url').read()

This gives you the content of the page, and you can then use a regular expression to match out the parts you want.
But when you actually do it, all sorts of details come up.
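As a minimal sketch of this basic approach (the URL and the pattern are made up for illustration):

# coding: utf-8
import re
import urllib2

page = urllib2.urlopen('http://example.com/list.html').read()
# pull out everything inside hypothetical title spans
titles = re.findall(r'<span class="title">(.*?)</span>', page)
for t in titles:
    print t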
Login
The site in question requires logging in for authentication. This is not hard either; just import the cookielib and urllib libraries.

The code is as follows:

import urllib, urllib2, cookielib

cookiejar = cookielib.CookieJar()
urlopener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))

This creates an opener that carries a cookie jar; after opening the login page with urlopener, it remembers the session information.
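Logging in then usually means sending the credentials through this opener; the login URL and the form field names below are assumptions that depend on the actual site:

import urllib

# hypothetical login URL and form fields -- adjust to the real site
postdata = urllib.urlencode({'username': 'me', 'password': 'secret'})
urlopener.open('http://example.com/login', postdata)

# later requests made through urlopener carry the session cookie
page = urlopener.open('http://example.com/protected').read()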
Dropped connections
If you stop at the level above without any extra wrapping, then as soon as the network gets a bit shaky an exception is thrown and the whole program aborts, which makes for a very fragile program. At that point, just handle the exception and retry a few times:

The code is as follows:

def multi_open(opener, *arg):
    while True:
        retrytimes = 20
        while retrytimes > 0:
            try:
                return opener.open(*arg)
            except:
                print '.',
                retrytimes -= 1
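It is then used in place of a bare open call, for example (URL made up):

page = multi_open(urlopener, 'http://example.com/list.html').read()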

Regular expression matching
In fact, regular expression matching is not a particularly good method, because its fault tolerance is poor: the pages have to be completely uniform, and if they differ even slightly the match fails. Later I saw that there is XPath-based selection, which is worth trying next time.
Writing the expressions does take a certain amount of skill (a combined sketch follows these points):
Non-greedy matching. For example, to pull the a out of a tag like <span class='a'>hello</span>, an expression written as <span class=(.*)>hello</span> will not do it, because * matches greedily. Use .*? instead: <span class=(.*?)>hello</span>.
Matching across lines. One way to match across lines is the DOTALL flag, which makes . match newlines as well, but then the whole matching process becomes slow. Ordinary matching works line by line, roughly O(nc^2) where n is the number of lines and c is the average number of columns; with DOTALL it can easily become O((nc)^2). My approach is to match the newline explicitly with \n, which states exactly how many lines the match spans. For example, abc\s*\n\s*def says the two parts sit on adjacent lines, and (.*\n)*? can be used to span as few extra lines as possible.
There is actually one more thing to notice here: some lines end with \r, that is, the line ending is \r\n. I did not know this at first and spent a long time debugging because of it. Now I simply use \s, which covers both trailing spaces and the \r at the end of a line.
Non-capturing groups. To avoid disturbing the groups you actually want to capture, the (.*\n) above can be written as (?:.*\n), so it is ignored when groups are captured.
Parentheses must be escaped. Since parentheses denote grouping in a regular expression, they have to be escaped to match a literal parenthesis. Patterns are best written as raw strings with the r prefix; otherwise the backslashes themselves need escaping.
Writing patterns quickly. After writing so many patterns I worked out a routine: first copy the chunk of page source surrounding what you want to match, replace the parts to be extracted with (.*?), replace newlines with \s*\n\s*, and strip the spaces at line ends. The whole process can be done very quickly in vim.
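Putting these points together, here is a small sketch; the HTML snippet and the pattern are invented purely for illustration:

# coding: utf-8
import re

# a made-up fragment of raw page source, with \r\n line endings
html = ('<div class="item">\r\n'
        '  <span class="title">hello</span>\r\n'
        '  <span class="price">42</span>\r\n'
        '</div>')

# non-greedy captures, an explicit \n for the line break, \s* to absorb spaces and \r
pattern = re.compile(r'<span class="title">(.*?)</span>\s*\n\s*'
                     r'<span class="price">(\d+)</span>')
m = pattern.search(html)
if m:
    print m.group(1), m.group(2)   # prints: hello 42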
Excel operations
This time the data went into Excel. In hindsight, putting it in a database would probably have caused less trouble, but by then I was halfway through and it was hard to turn back.
Searching around for Excel in Python turns up a few options. One is the xlrd/xlwt pair of libraries, which work whether or not Excel is installed on the machine but can only produce the XLS format. Another is a direct wrapper around COM, which requires the Office software to be installed. I used the former.
Basic reading and writing is fine. But when the volume of data gets large, problems appear.
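For reference, basic writing with xlwt looks roughly like this (the file and sheet names are just examples):

# coding: utf-8
import xlwt

wb = xlwt.Workbook(encoding='utf8')
sheet = wb.add_sheet('data1')
sheet.write(0, 0, u'标题')    # row, column, value
sheet.write(1, 0, u'第一行')
wb.save('output.xls')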
Not enough memory. As soon as the program ran, memory usage kept climbing. After looking into it I learned about flush_row_data, but the error still appeared. Watching memory usage again, it looked perfectly smooth, yet at the end a memory error would still occur, which was infuriating: checking and re-running again and again produced nothing. What made it worse is that the bug only shows up when the volume of data is large, and a run with that much data often takes several hours, so the cost of each round of debugging is very high. By chance I noticed that although memory usage was smooth overall, it showed small surges at regular intervals, and this regularity had nothing to do with flush_row_data. That raised the question of where the data was actually being flushed to. It turns out xlwt takes a rather painful approach: it keeps the data in memory, or flushes it to a temporary file, and when you save it writes everything out again in one pass, and it is this one-time write that makes memory soar. So what do I need flush_row_data for? Why not flush directly to the place the data will finally be written?
Row limit. The XLS format itself allows at most 65536 rows per sheet, and a file holding that much data is inconvenient to open anyway.
Combining the two points above, I finally adopted this strategy: flush whenever the row count reaches a multiple of 1000; open a new sheet when the row count exceeds 65536; and start a new file when there are more than 3 sheets. For convenience, I wrapped xlwt:

The code is as follows:

# coding: utf-8
import xlwt

class XLS:
    '''A class that wraps xlwt'''
    MAX_ROW = 65536
    MAX_SHEET_NUM = 3

    def __init__(self, name, captionlist, typelist, encoding='utf8', flushbound=1000):
        self.name = name
        self.captionlist = captionlist[:]
        self.typelist = typelist[:]
        self.workbookindex = 1
        self.encoding = encoding
        self.wb = xlwt.Workbook(encoding=self.encoding)
        self.sheetindex = 1
        self.__addsheet()
        self.flushbound = flushbound

    def __addsheet(self):
        # save the current workbook before moving on
        if self.sheetindex != 1:
            self.wb.save(self.name + str(self.workbookindex) + '.xls')
        # too many sheets: start a new workbook (file)
        if self.sheetindex > XLS.MAX_SHEET_NUM:
            self.workbookindex += 1
            self.wb = xlwt.Workbook(encoding=self.encoding)
            self.sheetindex = 1

        self.sheet = self.wb.add_sheet(self.name.encode(self.encoding) + str(self.sheetindex))
        # caption row
        for i in range(len(self.captionlist)):
            self.sheet.write(0, i, self.captionlist[i])

        self.row = 1

    def write(self, data):
        # sheet is full: move on to the next one
        if self.row >= XLS.MAX_ROW:
            self.sheetindex += 1
            self.__addsheet()

        for i in range(len(data)):
            if self.typelist[i] == 'num':
                try:
                    self.sheet.write(self.row, i, float(data[i]))
                except ValueError:
                    pass
            else:
                self.sheet.write(self.row, i, data[i])

        # flush periodically to keep memory usage down
        if self.row % self.flushbound == 0:
            self.sheet.flush_row_data()
        self.row += 1

    def save(self):
        self.wb.save(self.name + str(self.workbookindex) + '.xls')
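A minimal usage sketch of the wrapper (column captions, types, and data are invented):

xls = XLS(u'result', [u'name', u'price'], ['str', 'num'])
xls.write([u'item one', '12.5'])
xls.write([u'item two', '7'])
xls.save()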

Converting special characters in web pages
Web pages have their own escape characters (HTML entities), which is a bit of a hassle when doing regular expression matching. The official documentation has a dictionary-based replacement recipe, which I thought was good, so I extended it a little; some of the extra entries are there to keep the regular expressions correct.

The code is as follows:

html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
    u"•": "&middot;",
    u"°": "&deg;",
    # regular expression metacharacters
    ".": r"\.",
    "^": r"\^",
    "$": r"\$",
    "{": r"\{",
    "}": r"\}",
    "\\": r"\\",
    "|": r"\|",
    "(": r"\(",
    ")": r"\)",
    "+": r"\+",
    "*": r"\*",
    "?": r"\?",
}

def html_escape(text):
    """Produce entities within text."""
    tmp = "".join(html_escape_table.get(c, c) for c in text)
    return tmp.encode("utf-8")
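For example, a snippet copied from the rendered page can be turned into a pattern fragment that matches the raw HTML source (the snippet is made up):

# coding: utf-8
import re

snippet = u'Q&A (2014)'
fragment = html_escape(snippet)
# fragment is now 'Q&amp;A \(2014\)', safe to embed in a regular expression
pattern = re.compile(fragment)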

Wrapping up
That is about all the experience gained. The final program, though, is hard to look at: the style is bad, because I started by just trying things out and then kept patching instead of restructuring.
The final program also runs for a long time, with network traffic taking up most of that time. Would it be worth refactoring it with multiple threads? Having thought about it, I will just leave it as it is.
