Crawling Website Data with Python

Before the semester started, I took on a task to crawl data with specific attributes from the Internet. I had learned Python before, so this was a chance to practice it.

Encoding Problems

Because the task involves Chinese characters, encoding problems are unavoidable. This was a good opportunity to sort them out completely.

The problem starts with how text is encoded. The original English character set needs only codes 0~255, which is exactly 8 bits, one byte. To represent other languages it naturally had to be extended; for Chinese there is the GB series. So what is the relationship between Unicode and UTF-8?

Unicode is an encoding scheme, also known as the universal code. However, Unicode itself is not what gets stored on a computer; it acts as a middleman. You encode Unicode text into UTF-8 or GB bytes for storage, and UTF-8 or GB bytes can in turn be decoded back to Unicode.

In Python, unicode is one object type, written with a u prefix, for example u'中文'; str is another type, a byte string that actually exists on the computer in some specific encoding. For example, '中文' encoded in UTF-8 is different from '中文' encoded in GBK. See the following code:

>>> str = u'中文'
>>> str1 = str.encode('utf8')
>>> str2 = str.encode('gbk')
>>> print repr(str)
u'\u4e2d\u6587'
>>> print repr(str1)
'\xe4\xb8\xad\xe6\x96\x87'
>>> print repr(str2)
'\xd6\xd0\xce\xc4'

As you can see, what the computer actually stores is just these encoded bytes, not "a Chinese character" as such. When printing, you must know the encoding in use for the data to display correctly. It is aptly said that unicode in Python is the real string, while str is a byte string.
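Decoding goes the other direction, turning bytes back into the Unicode middleman. A quick check, using the byte strings from above:

>>> '\xe4\xb8\xad\xe6\x96\x87'.decode('utf8')
u'\u4e2d\u6587'
>>> '\xd6\xd0\xce\xc4'.decode('gbk')
u'\u4e2d\u6587'
>>> '\xe4\xb8\xad\xe6\x96\x87'.decode('utf8') == '\xd6\xd0\xce\xc4'.decode('gbk')
True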

File Encoding

Since there are different encodings, if you write a string literal directly in a source file, which encoding does it get? That is determined by the file encoding. Files are always saved in some encoding, and a Python file can carry a coding declaration stating which encoding the file is saved in. If the declared encoding is inconsistent with the encoding actually used to save the file, an exception occurs. See the following example: a file saved as UTF-8 but declared as GBK.

# coding: gbk
str = u'汉'
str1 = str.encode('utf8')
str2 = str.encode('gbk')
str3 = '汉'
print repr(str)
print repr(str1)
print repr(str2)
print repr(str3)

Running it produces an error:

File "test.py", line 1
SyntaxError: Non-ASCII character '\xe6' in file test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Changing the declaration to match:

# coding: utf8
str = u'汉'
str1 = str.encode('utf8')
str2 = str.encode('gbk')
str3 = '汉'
print repr(str)
print repr(str1)
print repr(str2)
print repr(str3)

The output is normal:

u'\u6c49'
'\xe6\xb1\x89'
'\xba\xba'
'\xe6\xb1\x89'

For more information, see this article http://www.cnblogs.com/huxi/archive/2010/12/05/1897271.html

Basic Method

In fact, crawling a web page with Python is very easy; it takes only a couple of lines:

import urllib2

page = urllib2.urlopen('url').read()

This gives you the page content. Next, use regular expression matching to pull out the required content.
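For example, a minimal sketch; the URL and the pattern here are placeholders, not the actual task:

import urllib2
import re

page = urllib2.urlopen('http://example.com/list').read()
# Illustrative pattern: pull out the text of each title span.
titles = re.findall(r'<span class="title">(.*?)</span>', page)
for t in titles:
    print t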

However, in practice there are all kinds of details to deal with.

Login

The site in question requires login authentication. That is not difficult either; just import the cookielib and urllib libraries.

import urllib, urllib2, cookielib

cookiejar = cookielib.CookieJar()
urlOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))

This attaches a cookie jar to the opener, so after logging in through urlOpener the session information is remembered for later requests.
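Logging in is then a matter of POSTing the form fields through urlOpener so the session cookie lands in the jar. A minimal sketch; the URL and field names are hypothetical and must match the site's actual login form:

import urllib

# Hypothetical form fields; inspect the real login page to find the right names.
postData = urllib.urlencode({'username': 'me', 'password': 'secret'})
urlOpener.open('http://example.com/login', postData)
# Subsequent requests through urlOpener carry the session cookie automatically.
page = urlOpener.open('http://example.com/protected').read()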

Reconnection

If we stop at the level above and do not wrap open, then as soon as the network fluctuates, an exception is thrown and the whole program exits. That makes for a very fragile program. The fix is to catch the exception and retry several times:

def multi_open(opener, *arg):
    while True:
        retryTimes = 20
        while retryTimes > 0:
            try:
                return opener.open(*arg)
            except:
                print '.',   # one dot per failed attempt
                retryTimes -= 1
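
Usage is the same as opener.open, for example:

page = multi_open(urlOpener, 'http://example.com/list').read()

Note that the outer while True restarts the batch of 20 retries indefinitely, so the function never gives up; it simply blocks, printing dots, until the network recovers.
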
Regular Expression matching

In fact, regular expression matching is not a very good method, because its fault tolerance is poor: it requires the web pages to be completely uniform, and the slightest inconsistency makes the match fail. I later saw that selecting nodes with XPath is the better choice; one to try next time.

Writing regular expressions does take some skill; a short sketch after this list pulls the tricks together:

  • Non-greedy matching. For example, to take hello out of the tag <span class='a'>hello</span>, the pattern <span class=.*>hello</span> will not work, because * is greedy. Use the non-greedy .*? instead: <span class=.*?>hello</span>.
  • Cross-line matching. To match across lines you can use the DOTALL flag, which lets . match the newline as well, but then the whole match becomes very slow. The original matching works line by line, costing at most O(nc^2), where n is the number of lines and c is the average number of columns; with DOTALL it can easily become O((nc)^2). My scheme is to match line breaks with an explicit \n, which states clearly how many lines at most the pattern may span. For example, abc\s*\n\s*def matches abc and def on adjacent lines, and (.*\n)*? can be used to span as few lines as possible.
  • Watch the line endings. Some lines end with \r, that is, the line actually ends with \r\n. Before I knew this, I spent a long time debugging a regular expression. Now I simply use \s to cover both trailing spaces and the \r.
  • Non-capturing groups. So as not to disturb the groups you actually want to capture, (.*\n) can be written as (?:.*\n), which is skipped when captured groups are collected.
  • Escape parentheses. Parentheses denote groups in a regular expression, so literal parentheses must be escaped to be matched. It is best to write patterns as raw strings with the r prefix; failing that, escape the backslashes as well.
  • Writing patterns quickly. After writing this many patterns, a routine emerges: take the text fragment surrounding the content to be matched, replace each part to extract with (.*?), replace each line break \n with \s*\n\s*, and strip the trailing spaces at the end of each line. The whole transformation can be done quickly in vim.
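
A short sketch that exercises these tricks on a made-up snippet (raw string, non-greedy .*?, \s*\n\s* across the line break, and escaped literal parentheses):

import re

html = '<span class="name">Alice</span>\r\n  <span class="score">(95)</span>'
# \s* before the \n absorbs the stray \r; (.*?) keeps the match non-greedy.
pattern = r'<span class="name">(.*?)</span>\s*\n\s*<span class="score">\((\d+)\)</span>'
m = re.search(pattern, html)
if m:
    print m.group(1), m.group(2)   # Alice 95
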
Excel operations

This time the data went into Excel. I realized later that putting it into a database might have saved a lot of trouble, but the program was already half written and it was hard to turn back.

Searching for Excel solutions turns up a few options. One is the xlrd/xlwt libraries, which work whether or not Excel is installed on the machine but only support the xls format. The other is to wrap COM directly, which requires the software to be installed. I used the former.

Basic reading and writing was no problem. However, once the data volume grew, problems appeared.

  • Insufficient memory. As soon as the program ran, memory usage crept up bit by bit. Looking into it, I learned that flush_row_data should be used. Yet errors still occurred: memory usage stayed stable overall, but a memory error appeared all the same. Baffling. Repeated checking and rerunning turned up nothing. Worse, the bug only showed up at large data volumes, and growing the data usually meant waiting several hours per run: far too expensive to debug. By chance I noticed that although memory usage was stable on the whole, there were small, regular increases, and the regularity tracked flush_row_data. I had been wondering where the data was being flushed to. It turns out xlwt takes a blunt approach: the data is kept in memory, or flushed to a temp file, and written out in one pass when you save. The problem is that this one-time write makes memory usage soar. Then what is flush_row_data for? Why not flush into the destination file from the start?
  • Row limit. This is fixed by the xls format, which allows at most 65536 rows per sheet. Besides, a huge file with that much data is inconvenient to open.

Based on these two points, I adopted the following policy: flush once every 1000 rows; when the row count exceeds 65536, start a new sheet; when more than three sheets accumulate, start a new file. For convenience, I wrapped xlwt:

# coding: utf-8
import xlwt

class XLS:
    '''a class wrap the xlwt'''
    MAX_ROW = 65536
    MAX_SHEET_NUM = 3

    def __init__(self, name, captionList, typeList, encoding='utf8', flushBound=1000):
        self.name = name
        self.captionList = captionList[:]
        self.typeList = typeList[:]
        self.workbookIndex = 1
        self.encoding = encoding
        self.wb = xlwt.Workbook(encoding=self.encoding)
        self.sheetIndex = 1
        self.__addSheet()
        self.flushBound = flushBound

    def __addSheet(self):
        # Save the current workbook before moving on; roll over to a new
        # workbook once MAX_SHEET_NUM sheets have been filled.
        if self.sheetIndex != 1:
            self.wb.save(self.name + str(self.workbookIndex) + '.xls')
        if self.sheetIndex > XLS.MAX_SHEET_NUM:
            self.workbookIndex += 1
            self.wb = xlwt.Workbook(encoding=self.encoding)
            self.sheetIndex = 1
        self.sheet = self.wb.add_sheet(self.name.encode(self.encoding) + str(self.sheetIndex))
        for i in range(len(self.captionList)):
            self.sheet.write(0, i, self.captionList[i])
        self.row = 1

    def write(self, data):
        if self.row >= XLS.MAX_ROW:
            self.sheetIndex += 1
            self.__addSheet()
        for i in range(len(data)):
            if self.typeList[i] == "num":
                try:
                    self.sheet.write(self.row, i, float(data[i]))
                except ValueError:
                    pass
            else:
                self.sheet.write(self.row, i, data[i])
        if self.row % self.flushBound == 0:
            self.sheet.flush_row_data()
        self.row += 1

    def save(self):
        self.wb.save(self.name + str(self.workbookIndex) + '.xls')
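
Usage might look like this; the captions and column types are made up for illustration:

xls = XLS(u'result', [u'name', u'score'], ['str', 'num'])
xls.write([u'Alice', '95'])
xls.write([u'Bob', 'N/A'])    # a non-numeric value in a 'num' column is skipped
xls.save()
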
Convert special characters on a webpage

Web pages also have their own escape characters (HTML entities), which makes regular matching troublesome. I found a dictionary-replacement solution in the official documentation, thought it was good, and made some extensions; part of the table exists to keep generated regular expressions valid.

html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
    u"·": "&#183;",
    u"°": "&#176;",
    # regular expression
    ".": r"\.",
    "^": r"\^",
    "$": r"\$",
    "{": r"\{",
    "}": r"\}",
    "\\": r"\\",
    "|": r"\|",
    "(": r"\(",
    ")": r"\)",
    "+": r"\+",
    "*": r"\*",
    "?": r"\?",
}

def html_escape(text):
    """Produce entities within text."""
    tmp = "".join(html_escape_table.get(c, c) for c in text)
    return tmp.encode("utf-8")
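
For example, on a made-up snippet:

>>> print html_escape(u'<span class="t">5·4</span>')
&lt;span class=&quot;t&quot;&gt;5&#183;4&lt;/span&gt;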
End

That is about all the experience worth recording. The program I ended up with, though, is barely readable: bad style. As usual, I figured I would get it written first and clean it up later.

The finished program takes a long time to run, and most of that time is network communication. Could it be refactored with multiple threads? Worth thinking about.
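If I do revisit it, a minimal sketch of the multithreaded version might use the standard threading and Queue modules; the URLs and worker count here are illustrative:

import threading
import Queue
import urllib2

taskQueue = Queue.Queue()
for i in range(100):
    taskQueue.put('http://example.com/page%d' % i)

def worker():
    while True:
        try:
            url = taskQueue.get_nowait()
        except Queue.Empty:
            return
        page = urllib2.urlopen(url).read()
        # ... parse page and hand results to a single writer thread,
        # keeping all the xls writing in one thread ...
        taskQueue.task_done()

threads = [threading.Thread(target=worker) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()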
