Case: Crawler using regular expressions
Now that we have regular expressions, a powerful weapon, we can filter the source code of every web page we crawl.
Let's try crawling some content together. Target site: http://www.neihan8.com/article/list_5_1.html
After opening it you will see plenty of jokes (the site is 内涵段子, "Neihan Duanzi"). When you turn the page, pay attention to how the URL changes:
First page url:  http://www.neihan8.com/article/list_5_1.html
Second page url: http://www.neihan8.com/article/list_5_2.html
Third page url:  http://www.neihan8.com/article/list_5_3.html
Fourth page url: http://www.neihan8.com/article/list_5_4.html
So the URL pattern is clear: to crawl any page of jokes, we only need to change one number. Let's crawl all the jokes step by step.
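The pattern above can be captured in a tiny helper. A minimal sketch (plain Python; the URL scheme comes from the pages listed above, the function name page_url is my own):

```python
def page_url(page):
    """Build the list URL for a given page number (scheme taken from the site above)."""
    return "http://www.neihan8.com/article/list_5_" + str(page) + ".html"

# e.g. page_url(3) gives the third page's URL
```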
First step: Get the data
1. In our previous usage we wrote standalone functions to load a page. Here we instead define a class, and make the URL-requesting code a member method.
We create a file called duanzi_spider.py, then define a Spider class and add a member method that loads a page:
import urllib2

class Spider:
    """
    Neihan Duanzi spider class
    """
    def loadPage(self, page):
        """
        @brief request a web page by url
        @param page the page number to request
        @returns the html of the page
        """
        url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
        # User-Agent header
        user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(req)
        html = response.read()
        print html
        #return html
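A side note not in the original tutorial: urllib2 exists only in Python 2. Under Python 3 the same request would be built with urllib.request; a rough sketch (the request object is constructed but not sent here, and build_request is a name I made up):

```python
import urllib.request

def build_request(page):
    # same URL scheme and User-Agent header as the Python 2 loadPage above
    url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
    user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
    return urllib.request.Request(url, headers={'User-Agent': user_agent})

# actually fetching needs the network:
# html = urllib.request.urlopen(build_request(1)).read()
```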
The loadPage implementation above should be familiar. One thing to note is that a member method of a Python class takes an extra first parameter, self; so page in loadPage(self, page) is the page number we want to request. Finally we print the HTML to the screen.
2. Write a main function to test the loadPage method
if __name__ == '__main__':
    """
    ======================
       Neihan Duanzi mini spider
    ======================
    """
    print 'Press Enter to start'
    raw_input()

    # create a Spider object
    mySpider = Spider()
    mySpider.loadPage(1)
Next we do some simple processing of the page source we obtained:
def loadPage(self, page):
    """
    @brief request a web page by url
    @param page the page number to request
    @returns the html of the page
    """
    url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
    # User-Agent header
    user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(req)
    html = response.read()
    gbk_html = html.decode('gbk').encode('utf-8')
    # print gbk_html
    return gbk_html
Note: each website encodes Chinese text differently, so html.decode('gbk') is not a universal recipe; it depends on the encoding that particular site uses.
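Since the right codec varies from site to site, one defensive approach (my own sketch, not part of the tutorial) is to try a few common Chinese-web encodings in order:

```python
def decode_html(raw_bytes):
    """Try common encodings in order; fall back to replacing undecodable bytes."""
    for enc in ('utf-8', 'gbk'):
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw_bytes.decode('utf-8', errors='replace')
```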
Step Two: Filter the data
We now have the whole page of data, but much of it is content we do not care about, so we need to filter it. How? With the regular expressions introduced in the previous section.
import re
Then we run a regex match over the gbk_html we obtained.
We need a matching rule:
Open the web page, right-click and choose "View Source", and you will find that every joke we want is inside a <div> tag, and each of those divs has the attribute class="f18 mb20". So we just need to match everything on the page between <div class="f18 mb20"> and </div>.
From this we can deduce the regular expression:

<div.*?class="f18 mb20">(.*?)</div>

This expression matches the content of every div whose class is "f18 mb20" (see the regex introduction in the previous section).
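To see the pattern in action, here is a self-contained check against a tiny hand-written HTML snippet (the snippet is made up; only the pattern comes from the text above):

```python
import re

snippet = ('<div style="x" class="f18 mb20">\nfirst joke\n</div>\n'
           '<div class="f18 mb20">\nsecond joke\n</div>')

pattern = re.compile(r'<div.*?class="f18 mb20">(.*?)</div>', re.S)
items = pattern.findall(snippet)
# items now holds the inner text of both divs
```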
Applying this regex to our code, we get:
def loadPage(self, page):
    """
    @brief request a web page by url
    @param page the page number to request
    @returns a list of the jokes on the page
    """
    url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
    # User-Agent header
    user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(req)
    html = response.read()
    gbk_html = html.decode('gbk').encode('utf-8')

    # find every joke: <div class="f18 mb20">...</div>
    # without re.S, matching stops at each newline and restarts on the next line;
    # with re.S, the whole string is matched as one block
    pattern = re.compile(r'<div.*?class="f18 mb20">(.*?)</div>', re.S)
    item_list = pattern.findall(gbk_html)

    return item_list

def printOnePage(self, item_list, page):
    """
    @brief process the list of jokes we got
    @param item_list the list of jokes
    @param page which page is being processed
    """
    print "******* page %d crawled *******" % page
    for item in item_list:
        print "================"
        print item
Note the re.S parameter used in the match. Without re.S the pattern is matched line by line: '.' does not cross newlines, so if a match cannot complete on one line it fails and the next line is tried from scratch. With re.S the whole string is matched as one block. findall then packs all the matched groups into a list.
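The effect of re.S can be shown with a two-line experiment (the sample string is made up for the demo):

```python
import re

text = '<div class="f18 mb20">line1\nline2</div>'
pat = r'<div.*?class="f18 mb20">(.*?)</div>'

without_s = re.findall(pat, text)      # '.' stops at the newline: no match
with_s = re.findall(pat, text, re.S)   # '.' also matches '\n': one match
```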
Then we write a printOnePage() method to traverse item_list and print each item. OK, the program is written up to here; let's run it:
~$ python duanzi_spider.py
All the jokes on the first page, and nothing else, are printed out.
You will notice that many jokes contain <p> and </p>, which is unpleasant to read. These are HTML paragraph tags: invisible in a browser, but visible when we print the raw text. We just need to strip out the parts we don't want.
We can simply modify printOnePage() as follows:
def printOnePage(self, item_list, page):
    """
    @brief process the list of jokes we got
    @param item_list the list of jokes
    @param page which page is being processed
    """
    print "******* page %d crawled *******" % page
    for item in item_list:
        print "================"
        item = item.replace("<p>", "").replace("</p>", "").replace("<br />", "")
        print item
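Chaining replace() works for the few tags listed, but any new tag would slip through. A more general alternative (my own sketch, not from the tutorial) strips every tag with one regex:

```python
import re

def strip_tags(html_fragment):
    """Remove every <...> tag, keeping only the text between tags."""
    return re.sub(r'<[^>]+>', '', html_fragment)
```

This is a blunt instrument (it would also eat a literal '<' in the text), but for short joke snippets it is usually good enough.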
Step Three: Save the data
We can store all the jokes in a file: instead of printing each item we get, we write it into a file called duanzi.txt.
def writeToFile(self, text):
    '''
    @brief append data to the file
    @param text the content to write
    '''
    myFile = open("./duanzi.txt", 'a')  # open the file in append mode
    myFile.write(text)
    myFile.write("-----------------------------------------------------")
    myFile.close()
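In modern Python the same append-to-file step is usually written with a with block, which closes the file even if an error occurs; a Python 3 sketch (write_to_file is my own name, the separator line mirrors the one above):

```python
def write_to_file(text, path="./duanzi.txt"):
    # open in append mode; 'with' guarantees the file is closed
    with open(path, 'a', encoding='utf-8') as f:
        f.write(text)
        f.write("\n" + "-" * 53 + "\n")
```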
Then we replace the print statement with a call to writeToFile(), writing every joke on the current page into the local duanzi.txt file.
def printOnePage(self, item_list, page):
    '''
    @brief process the list of jokes we got
    @param item_list the list of jokes
    @param page which page is being processed
    '''
    print "******* page %d crawled *******" % page
    for item in item_list:
        # print "================"
        item = item.replace("<p>", "").replace("</p>", "").replace("<br />", "")
        # print item
        self.writeToFile(item)
Step Four: Display the data
def doWork(self):
    '''
    let the spider work
    '''
    while self.enable:
        try:
            item_list = self.loadPage(self.page)
        except urllib2.URLError, e:
            print e.reason
            continue

        # process the jokes in item_list
        self.printOnePage(item_list, self.page)
        self.page += 1  # this page is done, move on to the next
        print "press Enter to continue..."
        print "type quit to exit"
        command = raw_input()
        if command == "quit":
            self.enable = False
            break
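The control flow of doWork can be exercised without the network by injecting a stub page loader and a scripted list of user commands; a Python 3 sketch (all names here are mine, invented for the demo):

```python
class MiniSpider:
    """Reproduces the doWork loop with pluggable I/O so it can run offline."""

    def __init__(self, load_page):
        self.enable = True
        self.page = 1
        self.load_page = load_page   # stand-in for loadPage
        self.collected = []

    def do_work(self, commands):
        # 'commands' plays the role of raw_input(): '' continues, 'quit' stops
        it = iter(commands)
        while self.enable:
            item_list = self.load_page(self.page)
            self.collected.extend(item_list)   # stand-in for printOnePage
            self.page += 1
            if next(it, 'quit') == 'quit':
                self.enable = False
```

Usage: MiniSpider(lambda p: ['joke from page %d' % p]).do_work(['', 'quit']) crawls two stub pages and then stops, mirroring the Enter/quit loop above.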
Finally we run the code. When it finishes, look at the duanzi.txt file in the current directory: it now holds all the jokes we crawled.
The above is a very simple little crawler, and it is quite convenient to use: to crawl information from other sites, you only need to adjust a few parameters and details.