Web Spider in Action: A Simple Crawler (Crawling Douban's "Books Rated 9 or Above" List)

1. Introduction to Web Spider

A Web Spider, also known as a web crawler, is a robot that automatically fetches information from web pages on the Internet. Crawlers are widely used by Internet search engines and similar sites to obtain or update the content and indexes of those sites. They automatically collect every page they can access so that the search engine can process it further (index the downloaded pages), allowing users to retrieve the information they need quickly.

2. A Simple Web Crawler Case

While browsing the web, the author came across the page (first page) of the Douban book list, as follows:

The list contains 409 books spread over 17 pages. Browsing through all of them takes a long time, and saving the good ones by hand is tedious. Therefore, the idea is to use a crawler (Web Spider) to save the book titles. The rest of this article describes in detail how to use Python to crawl the list of books.

3. Single-Page Crawl and Analysis

3.1. Crawl

The first step is crawling a single page. Python's urllib2 library is used here to fetch the web page, as HTML, to the local machine. The code is as follows:

import urllib2

def spider(url, user_agent="wswp"):
    print "Downloading: ", url
    # set the User-agent header for the request
    headers = {"User-agent": user_agent}
    request = urllib2.Request(url, headers=headers)
    html = ""
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print "Download error: ", e.reason
        html = None
    return html

The Request class and the urlopen and read methods are used during the crawl. With this simple crawl, the web page is fetched to the local machine in HTML format.
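
The code in this article targets Python 2 (urllib2 and print statements). For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, a minimal equivalent sketch of the same download function would be:

import urllib.request
import urllib.error

def spider(url, user_agent="wswp"):
    print("Downloading:", url)
    # set the User-agent header for the request
    headers = {"User-agent": user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        # read() returns bytes in Python 3
        html = urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        print("Download error:", e.reason)
        html = None
    return html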

3.2. Analysis of the Crawled Page

The analysis module mainly uses regular expressions, via Python's re library, to extract the name of each book, for example:
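
Since the original screenshot is not reproduced here, the following is a minimal, self-contained sketch of the kind of markup the regular expressions in the next code block target; the sample string is an assumption, and the real Douban HTML may differ in attributes and whitespace:

import re

# hypothetical fragment shaped like the markup the regexes expect
sample = '<div class="title"><a href="https://book.douban.com/subject/1/">Some Book</a></div>'

titles = re.findall(r'<div class="title">(.*?)</div>', sample)
names = [re.findall(r'<a.*?>(.+)</a>', t.strip())[0].strip() for t in titles]
print(names)  # ['Some Book']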

The analysis code for the page is as follows:

import re

def parse_page(html):
    # remove line breaks and vertical tabs that would break the matching
    html = html.replace("\r", "")
    html = html.replace("\n", "")
    html = html.replace("\013", "")
    # extract the title blocks
    result = re.findall(r'<div class="title">(.*?)</div>', html)
    book_list = []
    for x in result:
        # extract the book title
        book_name = re.findall(r'<a.*?>(.+)</a>', x.strip())
        book_list.append(book_name[0].strip())
    return book_list

In the end, the names of the 25 books on this page are obtained, as follows:

3.3. Main process

The modules used throughout the process are the crawl module (spider) and the analysis module (parse_page).

The main process is:

if __name__ == "__main__":
    seed = "https://www.douban.com/doulist/1264675/?start=0&sort=seq&sub_type="
    html = spider(seed)
    book_list = parse_page(html)
    print len(book_list)
    for x in book_list:
        print x

4. Crawl the Complete Book List

4.1. Parsing

The above describes how to crawl a single page. To crawl the complete list, all of the page URLs must be parsed out and each of them crawled. These URLs can be found in the navigation at the bottom of the page:

In the HTML code, the format is:

Therefore, URL parsing needs to be added to the analysis module. The improved parse_page function is:

def parse_page(html, url_map):
    # 1. remove invalid characters
    html = html.replace("\r", "")
    html = html.replace("\n", "")
    html = html.replace("\013", "")
    # 2. parse out the book titles
    result_name = re.findall(r'<div class="title">(.*?)</div>', html)
    book_list = []
    for x in result_name:
        # extract the book title
        book_name = re.findall(r'<a.*?>(.+)</a>', x.strip())
        book_list.append(book_name[0].strip())
    # 3. parse out which other page URLs exist
    result_url = re.findall(r'<div class="paginator">(.*?)</div>', html)
    url_list = re.findall(r'[a-zA-Z]+://[^\s"]*', result_url[0])
    for x in url_list:
        x = x.strip()
        if x not in url_map:
            url_map[x] = 0
    return book_list, url_map

After parsing the book titles, the page URLs are parsed out as well.
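
The screenshot of the paginator HTML is likewise not reproduced here. The following minimal sketch shows how the URL extraction behaves on a hypothetical paginator fragment (again, the real Douban markup may differ):

import re

# hypothetical paginator fragment; the real markup may differ
sample = ('<div class="paginator"><span class="thispage">1</span>'
          '<a href="https://www.douban.com/doulist/1264675/?start=25&sort=seq&sub_type=">2</a>'
          '<a href="https://www.douban.com/doulist/1264675/?start=50&sort=seq&sub_type=">3</a></div>')

paginator = re.findall(r'<div class="paginator">(.*?)</div>', sample)
urls = re.findall(r'[a-zA-Z]+://[^\s"]*', paginator[0])
print(urls)
# ['https://www.douban.com/doulist/1264675/?start=25&sort=seq&sub_type=',
#  'https://www.douban.com/doulist/1264675/?start=50&sort=seq&sub_type=']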

4.2. Control

After a web page has been crawled, the parse_page function extracts the books on it and, at the same time, the URLs linking to the other pages. A control module is therefore needed to crawl, parse, and extract from these URLs in turn. The code for such a control module is as follows:

def control(seed):
    # a map that records which URLs still need to be crawled
    url_map = {}
    book_list = []
    # crawl the seed URL
    html = spider(seed)
    url_map[seed] = 1  # the seed URL has been crawled
    # parse the seed page
    book_tmp, url_map = parse_page(html, url_map)
    for x in book_tmp:
        book_list.append(x)
    # crawl the URLs in url_map one by one
    while True:
        for k, v in url_map.items():
            if v == 0:
                # crawl this URL
                html = spider(k)
                url_map[k] = 1
                book_tmp, url_map = parse_page(html, url_map)
                for x in book_tmp:
                    book_list.append(x)
                break
            else:
                continue
        if 0 not in url_map.values():
            break
    return book_list

A map (dictionary) stores all of the page URLs: the key is the URL, and the value records whether it has been crawled, with 0 meaning not yet crawled and 1 meaning already crawled. The map is processed in a loop until the pages for all keys have been crawled.
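
As a side note, the same bookkeeping can be expressed more compactly. The sketch below is a self-contained illustration of the 0/1 url_map idea only; fetch_page and extract_urls are hypothetical stand-ins for spider and for the URL-extraction part of parse_page:

def crawl_all(seed, fetch_page, extract_urls):
    url_map = {seed: 0}           # 0 = not yet crawled, 1 = already crawled
    while 0 in url_map.values():  # stop once every known URL has been crawled
        # pick any URL that has not been crawled yet
        url = next(u for u, v in url_map.items() if v == 0)
        html = fetch_page(url)
        url_map[url] = 1
        for new_url in extract_urls(html):
            url_map.setdefault(new_url, 0)  # record only URLs not seen before
    return url_map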

4.3. Main function

The main function is:

if __name__ == "__main__":
    seed = "https://www.douban.com/doulist/1264675/?start=0&sort=seq&sub_type="
    book_list = control(seed)
    print len(book_list)
    for x in book_list:
        print x
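
Since the original motivation was to save the list of good books, here is a minimal sketch, not part of the original article, that writes the crawled titles to a local text file (the file name is arbitrary):

def save_book_list(book_list, path="book_list.txt"):
    # write one book title per line so the list can be browsed offline
    with open(path, "w") as f:
        for name in book_list:
            f.write(name + "\n")

# usage, following the main function above:
# save_book_list(control(seed))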

The number of books eventually fetched was 408, although the first page shows a total of 409:

On closer inspection, it turned out that one of the books was no longer there:

Therefore, the crawl itself works correctly.

The final list of books is as follows:

The above implements a simple crawler. Of course, for crawling larger and more complex websites this crawler is not enough; in later articles we will gradually dig into more crawler techniques.
