Unstructured data and structured data extraction -- Case: crawler using regular expressions

Source: Internet
Author: User

Case: Crawler using regular expressions

Now that we have regular expressions in our arsenal, we can filter the source code of the web pages we crawl.

Let's try crawling some content together. Website: http://www.neihan8.com/article/list_5_1.html

After opening it, it is easy to see that the page is full of jokes. When you turn the page, pay attention to how the URL changes:

    • First page URL: http://www.neihan8.com/article/list_5_1.html

    • Second page URL: http://www.neihan8.com/article/list_5_2.html

    • Third page URL: http://www.neihan8.com/article/list_5_3.html

    • Fourth page URL: http://www.neihan8.com/article/list_5_4.html

So we have found the URL pattern: to crawl all the jokes, we only need to change a single parameter. Let's crawl all the jokes step by step.
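For instance, here is a minimal sketch (an illustration, not part of the original tutorial) showing that the page number is the only part of the URL that changes:

# Minimal sketch: any page's URL can be built from a single template
# by substituting the page number.
base_url = "http://www.neihan8.com/article/list_5_%d.html"

for page in range(1, 5):
    print base_url % page
# prints list_5_1.html through list_5_4.html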

Step One: Get the data

1. As in our previous examples, we first need to write a method that loads a page.

Here we define a class and implement the URL request as a member method.

We create a file called duanzi_spider.py.

Then we define a Spider class and add a member method that loads a page:

import urllib2

class Spider:
    """
        Neihan Duanzi crawler class
    """
    def loadPage(self, page):
        """
            @brief define a method that requests a web page by URL
            @param page the page number to request
            @returns the HTML of the page
        """
        url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
        # User-Agent header
        user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(req)
        html = response.read()
        print html
        # return html

The implementation of loadPage above should be familiar by now. Note that a member method of a Python class must take an extra first parameter, self.

    • So the page in loadPage(self, page) is the page number we want to request.

    • Finally, the HTML is printed to the screen with print.

    • Then we write a main function to test the loadPage method.

2. Write a main function to test the loadPage method.

if __name__ == '__main__':
    """
        ==========================
          Neihan Duanzi mini crawler
        ==========================
    """
    print 'Press Enter to start'
    raw_input()

    # create a Spider object
    mySpider = Spider()
    mySpider.loadPage(1)
    • If the program runs correctly, all the HTML of the first page is printed to the screen. However, we find that the Chinese text in the HTML may be garbled.

So we need to do some simple processing on the page source we get:
def loadPage(self, page):
    """
        @brief define a method that requests a web page by URL
        @param page the page number to request
        @returns the HTML of the page
    """
    url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
    # User-Agent header
    user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(req)
    html = response.read()
    gbk_html = html.decode('gbk').encode('utf-8')
    # print gbk_html
    return gbk_html

Note: the encoding of Chinese text differs from site to site, so html.decode('gbk') is not a universal recipe; it depends on the encoding used by the particular site (see the sketch after the next bullet).

    • If we run duanzi_spider.py again, the Chinese text that was garbled before is now displayed correctly.
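As a rough illustration of that note, the sketch below shows one way to pick the decoding charset from the HTTP response instead of hard-coding 'gbk'. This is my own addition, not part of the original tutorial, and it assumes the server advertises a charset in the Content-Type header, which not every site does.

import urllib2

def fetch_decoded(url):
    # request the page as before
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    response = urllib2.urlopen(req)
    raw = response.read()
    # In Python 2, response.headers is a mimetools.Message, so
    # getparam('charset') reads the charset from the Content-Type header.
    # Fall back to 'gbk' because that is what this site uses.
    charset = response.headers.getparam('charset') or 'gbk'
    return raw.decode(charset).encode('utf-8')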

Step Two: Filter the data

Now we have the data for the entire page. However, it contains a lot of content we do not care about, so we need to filter it. How? With the regular expressions described in the previous section.

    • First of all:
import re
    • Then we run the match against the gbk_html we obtained.
We need a matching rule:

Open the web page, right-click and choose "View Source", and you will find that each joke we want is wrapped in a <div> tag, and each of these divs has the attribute class="f18 mb20".

So we only need to match the data on the page that sits between <div class="f18 mb20"> and </div>.

Based on regular expression syntax, we can work out the pattern: <div.*?class="f18 mb20">(.*?)</div>
    • This expression matches the content of every div with class="f18 mb20" (see the earlier introduction to regular expressions).

    • Then we apply this regular expression in our code, which gives the following:

def loadPage(self, page):
    """
        @brief define a method that requests a web page by URL
        @param page the page number to request
        @returns a list of the jokes on the page
    """
    url = "http://www.neihan8.com/article/list_5_" + str(page) + ".html"
    # User-Agent header
    user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(req)
    html = response.read()
    gbk_html = html.decode('gbk').encode('utf-8')

    # find every joke wrapped in <div class="f18 mb20"> ... </div>
    # Without re.S, '.' only matches within a single line: if a line has
    # no match, matching restarts on the next line.
    # With re.S, the whole string is matched as one block.
    pattern = re.compile(r'<div.*?class="f18 mb20">(.*?)</div>', re.S)
    item_list = pattern.findall(gbk_html)
    return item_list

def printOnePage(self, item_list, page):
    """
        @brief process the list of jokes we got
        @param item_list the list of jokes
        @param page which page is being processed
    """
    print "******* Page %d crawled *******" % page
    for item in item_list:
        print "================"
        print item
  • Note the re.S flag used in the regular expression match.

  • Without re.S, matching works line by line: '.' does not cross a newline, so if one line yields no match, matching simply starts over on the next line.

  • With re.S, the whole string is treated as a single block, so '.' matches newlines as well, and findall collects all the matches into a list. A short demonstration follows this list.
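To make the effect of re.S concrete, here is a small self-contained example (illustrative only, not from the original tutorial) of matching a joke that spans two lines:

import re

html = '<div class="f18 mb20">first line\nsecond line</div>'

without_s = re.findall(r'<div.*?class="f18 mb20">(.*?)</div>', html)
with_s = re.findall(r'<div.*?class="f18 mb20">(.*?)</div>', html, re.S)

print without_s  # [] -- without re.S, '.' cannot cross the newline
print with_s     # ['first line\nsecond line']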
    • Then we write a printOnePage() method to traverse item_list. OK, the program is complete; let's run it again.
$ python duanzi_spider.py
All the jokes on the first page, and nothing else, are printed out.
    • You will notice many <p> and </p> fragments in the jokes, which looks ugly; these are actually HTML paragraph tags.
    • They are not visible in a browser, but when we print the text the <p> tags show up, so we just need to strip out what we don't want.

    • We can simply modify printOnePage() as follows:

def printOnePage(self, item_list, page):
    """
        @brief process the list of jokes we got
        @param item_list the list of jokes
        @param page which page is being processed
    """
    print "******* Page %d crawled *******" % page
    for item in item_list:
        print "================"
        item = item.replace("<p>", "").replace("</p>", "").replace("<br />", "")
        print item
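As an aside (my own suggestion, not in the original), chained replace() calls only remove the specific tags listed. If other tags appear in the jokes, a regular expression can strip them all at once:

import re

def strip_tags(text):
    # remove every HTML tag, not just <p>, </p> and <br />
    return re.sub(r'<[^>]+>', '', text)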
Step Three: Save the data
    • We can store all the jokes in a file. For example, instead of printing each item we get, we can write it into a file called duanzi.txt.
def writeToFile(self, text):
    """
        @brief append the data to a file
        @param text the content to write
    """
    myFile = open("./duanzi.txt", 'a')  # open the file in append mode
    myFile.write(text)
    myFile.write("-----------------------------------------------------")
    myFile.close()
    • Then we replace the print statements with writeToFile(), writing all the jokes on the current page into the local duanzi.txt file.
def printOnePage(self, item_list, page):
    """
        @brief process the list of jokes we got
        @param item_list the list of jokes
        @param page which page is being processed
    """
    print "******* Page %d crawled *******" % page
    for item in item_list:
        # print "================"
        item = item.replace("<p>", "").replace("</p>", "").replace("<br />", "")
        # print item
        self.writeToFile(item)
Step Four: Display the data
    • Next we iterate over the page parameter so that we can traverse all the jokes on the site.

    • We just need to add some control logic in the outer layer.

def doWork(self):
    """
        Let the crawler work
    """
    while self.enable:
        try:
            item_list = self.loadPage(self.page)
        except urllib2.URLError, e:
            print e.reason
            continue

        # process the list of jokes we got
        self.printOnePage(item_list, self.page)
        self.page += 1  # this page is done, move on to the next one
        print "Press Enter to continue..."
        print "Type quit to exit"

        command = raw_input()
        if command == "quit":
            self.enable = False
            break
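Note that doWork relies on self.page and self.enable already being set on the Spider object. The excerpt above does not show where they are initialized, so the constructor below is only a minimal sketch of one reasonable way to do it, together with a main block that drives doWork instead of a single loadPage call:

class Spider:
    def __init__(self):
        self.page = 1        # start crawling from the first page
        self.enable = True   # loop switch checked by doWork

    # ... loadPage, printOnePage, writeToFile, doWork as defined above ...

if __name__ == '__main__':
    print 'Press Enter to start'
    raw_input()

    mySpider = Spider()
    mySpider.doWork()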
  • Finally, we run the code. When it finishes, look at the duanzi.txt file in the current directory; it now contains all the jokes we crawled.

The above is a very simple little crawler that is quite convenient to use. If you want to crawl information from other sites, you only need to change a few parameters and details.
