Python Crawler Practice --- Crawling Library Borrowing Information


For the original work, please see the source: Python Crawler Practice --- Crawling Library Borrowing Information

 

I borrowed quite a few books from the library a while ago. With so many books out at once, it is easy to forget each one's due date, and I was always worried about returning them late and affecting future borrowing. I was also too lazy to log in to the school library's borrowing system every time, so I decided to write a crawler that grabs my borrowing records, extracts the due date of each book, and writes everything to a txt file, so that whenever I forget a date I can simply open the file. Whenever the borrowing information changes, I only need to re-run the program: the old txt file is overwritten by the new one and its contents are brought up to date.

Technologies used:  

The crawler is written in Python 2.7 and uses the urllib2, cookielib, and re modules. urllib2 builds requests and captures page content, returning a file-like response object; cookielib stores cookie objects to implement the simulated login; the re module provides regular expressions for matching the captured page content and extracting the information we need.

Capture a page:  

Using urllib2 to capture a Web page is very simple:

import urllib2
response = urllib2.urlopen("http://www.baidu.com")
html = response.read()

The urlopen() method in urllib2, as the name suggests, opens a URL (uniform resource locator). The Baidu homepage address in the example above uses the HTTP protocol; besides HTTP, urlopen() can also open addresses that use the ftp and file protocols, for example:

1 response = urllib2.urlopen("ftp://example.com")

Besides the URL, the urlopen() method also accepts data and timeout parameters:

response = urllib2.urlopen(url, data, timeout)

data is the data that needs to be submitted when the page is opened; for example, a login page usually requires a user name, a password, and similar information, and the use of data is shown when logging in to the library system below. timeout sets the timeout: if the page has not responded after the given time, an error is raised. Neither data nor timeout is a required argument of urlopen() and both can be left out, but note that data must be supplied when the page requires it.
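For illustration only, here is a minimal sketch of passing both parameters (example.com and the form field names are placeholders, not the library system). Note that supplying data this way makes urlopen() send a POST request, whereas the library login below appends the data to the URL and uses GET:

import urllib
import urllib2

# hypothetical login form fields, only to show the data parameter
data = urllib.urlencode({'username': 'someone', 'password': 'secret'})
# give up if the server has not responded after 10 seconds
response = urllib2.urlopen("http://example.com/login", data, 10)
print response.read()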

As you can see, opening a web page often means passing in several parameters. Moreover, the HTTP protocol is based on requests and responses: the client sends a request and the server returns a response. So urllib2 usually constructs a Request object to carry the parameters; the Request object can hold the url, data, headers, and other information, and is then passed to urlopen():

import urllib2
request = urllib2.Request("http://www.baidu.com")
response = urllib2.urlopen(request)
html = response.read()

This code produces the same result as the code above, but the logic is clearer.

Cookie usage:  

When you access some websites, the site needs to store some data (encrypted) locally on the client and have it sent back with the next request; otherwise the server rejects the request. This data is kept in the local cookie. The school library system is an example: after you log in, the server stores encrypted data in a local cookie, and when the client later sends the request for the borrowing information, it sends the cookie data along with it. The server checks the cookie information and allows the access; otherwise the request is rejected.

The cookielib module provides the CookieJar class to capture and store HTTP cookie data. To create a cookie jar, you only need to create a CookieJar instance:

import cookielib
cookie = cookielib.CookieJar()

Is creating the cookie jar all we need? Not quite. What we have to do is send a login request, record the cookie, then send another request to read the borrowing information while returning the cookie data to the server; the plain urlopen() method is no longer up to the task. Fortunately, urllib2 also provides the OpenerDirector class, which accepts a cookie handler as a parameter to implement the functions above. The handler is obtained by instantiating the HTTPCookieProcessor class with a cookie object; that handler is then passed to build_opener(), which returns an OpenerDirector instance (an opener) able to capture cookie data. The code is as follows:

import urllib2
import cookielib

cookie = cookielib.CookieJar()
handler = urllib2.HTTPCookieProcessor(cookie)
opener = urllib2.build_opener(handler)
response = opener.open("http://www.baidu.com")

Log on to the library system:

So far we have everything we need to capture the library borrowing information. Let's take a look at the login page of the HIT library.

First, use the HttpFox plug-in in Firefox to listen to the network traffic and see what data this login page needs to send to the server.

Enter the account and password, open the HttpFox plug-in, click Start to begin listening, and then click Login.

HttpFox shows the page after login together with all the requests captured during the login process. Select the first captured request and open the Headers tab below to see the data this page sends. Some websites check the request headers and only serve requests whose headers meet their requirements; when logging in to the library system you can first try sending no extra headers, and if the page opens normally there is no header check. The data is sent by GET, that is, the login data only needs to be appended to the login URL. The Request-Line field on the Headers tab shows "GET /lib/opacAction.do"; with the server address added, the actual request URL is "http://202.118.250.131/lib/opacAction.do", and everything after the question mark is the data needed for login, including the account, the password, and other fields.

Click the QueryString tab to view the data transmitted by the GET method.


Five items of data need to be transmitted. They are stored as a dictionary and, after being encoded with the urlencode() method, can be appended directly to the login URL. The request sent to the server is therefore:

import urllib

loginURL = 'http://202.118.250.131/lib/opacAction.do'
queryString = urllib.urlencode({
    'method':'DoAjax',
    'dispatch':'login',
    'registerName':'',
    'rcardNo':'16S137028 0',
    'pwd':'******'
})
requestURL = loginURL + '?' + queryString

With the request URL ready, we can simulate logging in to the library system. The simulated login needs the cookie handling described above; without it the later requests cannot get through. In the code, a library class is defined to make the whole process object-oriented: you can instantiate several library objects as needed and operate on each instance separately. To start with, the library class needs an initialization method (__init__) and a method that fetches the page (getPage); opening a page should go through the opener instance described above, so that cookies are captured and stored automatically:

 

import urllib
import urllib2
import cookielib
import re

class library:
    def __init__(self):
        self.loginURL = 'http://202.118.250.131/lib/opacAction.do'
        self.queryString = urllib.urlencode({
            'method':'DoAjax',
            'dispatch':'login',
            'registerName':'',
            'rcardNo':'16S137028 0',
            'pwd':'114477yan'
        })
        self.requestURL = self.loginURL + '?' + self.queryString
        self.cookies = cookielib.CookieJar()
        self.handler = urllib2.HTTPCookieProcessor(self.cookies)
        self.opener = urllib2.build_opener(self.handler)
    def getPage(self):
        request1 = urllib2.Request(self.requestURL)
        request2 = urllib2.Request('http://202.118.250.131/lib/opacAction.do?method=init&seq=301')
        result = self.opener.open(request1)
        result = self.opener.open(request2)
        return result.read()

lib = library()
print lib.getPage()

 

In the code above, result = self.opener.open(request1) performs the login; since opening it raises no exception, no request headers need to be added. self.opener is then used to open the borrowing query page http://202.118.250.131/lib/opacAction.do?method=init&seq=301, so this code prints the HTML source of the borrowing query page.
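The original code does not verify the login explicitly, but a simple sketch of such a check is to iterate over the CookieJar after calling getPage() and print whatever session cookies the server set:

lib = library()
lib.getPage()                  # log in and fetch the borrowing query page
for c in lib.cookies:          # list the cookies stored during the session
    print c.name, '=', c.value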


Obtain borrow information:

After the page has been captured, the next step is to match and store the information we need. The matching is done with regular expressions, which Python supports through the re module. For details on how to use regular expressions, see the Python Regular Expression Guide.

When the re module is used for matching, the regular expression string is usually compiled into a Pattern instance first, and then re.findall(pattern, string) returns every match in the string as a list. If the pattern contains more than one group, the result is a list of tuples. The regular expression used here is:

<table.*?id="tb.*?width="50%"><font size=2>(.*?)</font>.*?<tr>.*?<tr>.*?<font size=2>(.*?)</font>.*?<font size=2>(.*?)</font>.*?</TABLE>

Each (.*?) in the expression is a group; since there are three groups, findall() returns a list of tuples, and every tuple holds three pieces of data.
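As a quick self-contained illustration of that list-of-tuples behaviour (the HTML snippet here is made up, not taken from the library page):

import re

html = '<td>Book A</td><td>2016-03-01</td><td>2016-04-01</td>'
pattern = re.compile(r'<td>(.*?)</td><td>(.*?)</td><td>(.*?)</td>', re.S)
print re.findall(pattern, html)   # prints [('Book A', '2016-03-01', '2016-04-01')]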

In the library class, define a getInformation method that extracts the required data by matching it against the regular expression:

def getInformation(self):
    page = self.getPage()
    pattern = re.compile('<table.*?id="tb.*?width="50%"><font size=2>(.*?)</font>.*?<tr>.*?<tr>.*?'+
                    '<font size=2>(.*?)</font>.*?<font size=2>(.*?)</font>.*?</TABLE>', re.S)
    items = re.findall(pattern, page)

Then write the data to a file item by item with the write() method. Before writing, however, the captured data needs a little processing: as mentioned above, findall() returns a list of tuples, i.e. something like [(a, b, c), (d, e, f), (g, h, i)]. The write() method cannot operate on tuples, so each tuple has to be joined into a string by hand; the strings are collected in a list and then written to the file one by one in a loop:

def getInformation(self):
    page = self.getPage()
    pattern = re.compile('<table.*?id="tb.*?width="50%"><font size=2>(.*?)</font>.*?<tr>.*?<tr>.*?'+
                    '<font size=2>(.*?)</font>.*?<font size=2>(.*?)</font>.*?</TABLE>', re.S)
    items = re.findall(pattern, page)

    contents = []
    for item in items:
        content = item[0]+'    from   '+item[1]+'   to   '+item[2]+'\n'
        contents.append(content)
    self.writeData(contents)

def writeData(self, contents):
    file = open('libraryBooks.txt', 'w+')
    for content in contents:
        file.write(content)
    file.close()

With that, the whole crawler is done; the complete code is attached below:

Complete code:

# -*- coding: utf-8 -*-
__author__ = 'Victor'
import urllib
import urllib2
import cookielib
import re

class library:
    def __init__(self):
        self.loginURL = 'http://202.118.250.131/lib/opacAction.do'
        self.queryString = urllib.urlencode({
            'method':'DoAjax',
            'dispatch':'login',
            'registerName':'',
            'rcardNo':'16S137028 0',
            'pwd':'114477yan'
        })
        self.requestURL = self.loginURL + '?' + self.queryString
        self.cookies = cookielib.CookieJar()
        self.handler = urllib2.HTTPCookieProcessor(self.cookies)
        self.opener = urllib2.build_opener(self.handler)
    def getPage(self):
        request1 = urllib2.Request(self.requestURL)
        request2 = urllib2.Request('http://202.118.250.131/lib/opacAction.do?method=init&seq=301')
        result = self.opener.open(request1)
        result = self.opener.open(request2)
        return result.read()
    def getInformation(self):
        page = self.getPage()
        pattern = re.compile('<table.*?id="tb.*?width="50%"><font size=2>(.*?)</font>.*?<tr>.*?<tr>.*?'+
                        '<font size=2>(.*?)</font>.*?<font size=2>(.*?)</font>.*?</TABLE>', re.S)
        items = re.findall(pattern, page)

        contents = []
        for item in items:
            content = item[0]+'    from   '+item[1]+'   to   '+item[2]+'\n'
            contents.append(content)
        self.writeData(contents)
    def writeData(self, contents):
        file = open('libraryBooks.txt', 'w+')
        for content in contents:
            file.write(content)
        file.close()

lib = library()
lib.getInformation()

Here is the borrowing information that was written to the file. I have to admit the result does not look great, but I only wanted to give it a try.

 

 

