Crawling Web Pages with a Python Spider: Tutorial

Source: Internet
Author: User
Tags: object model, readline

Target: use Python to crawl data from the Baidu Encyclopedia "Python" entry page

The structure and workflow of a running crawler:



URL Manager:

Manages the collection of URLs waiting to be crawled and the collection of URLs already crawled

Prevents repeated and circular crawling

Supported Features:

Add a new URL to the collection of URLs to be crawled

Determine whether a URL to be added is already in either collection

Get a URL to crawl from the pending collection

Determine whether there are still URLs waiting to be crawled

Move a URL from the pending collection to the crawled collection

Web Page Downloader:

A tool that downloads the web page corresponding to a URL from the Internet to the local machine

Web Page Parser:

Explained in detail in the example below

Example Explanation:

Divided into four sections:

url_manager, url_downloader, html_parser, html_outputer
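The source does not show the scheduler that ties these four parts together. A minimal sketch of that crawl loop, assuming the module names above and a hypothetical SpiderMain class (not from the original), could look like this:

spider_main:  # hypothetical scheduler, a sketch rather than the author's code

import url_manager, url_downloader, html_parser, html_outputer

class SpiderMain(object):
    def __init__(self):
        self.urls = url_manager.UrlManager()                 # URL manager
        self.downloader = url_downloader.HtmlDownloader()    # web page downloader
        self.parser = html_parser.HtmlParser()               # web page parser
        self.outputer = html_outputer.HtmlOutputer()         # result outputer

    def craw(self, root_url):
        self.urls.add_new_url(root_url)
        count = 1
        while self.urls.has_new_url():
            new_url = self.urls.get_new_url()                 # take one URL still to be crawled
            html_cont = self.downloader.download(new_url)     # download its content
            if html_cont is None:                             # skip pages that failed to download
                continue
            new_urls, new_data = self.parser.parse(new_url, html_cont)  # extract links and data
            self.urls.add_new_urls(new_urls)                  # queue the newly found links
            self.outputer.collect_data(new_data)              # keep the parsed data
            if count == 100:                                  # stop after an arbitrary number of pages
                break
            count += 1
        self.outputer.output_html()                           # write everything to output.html

if __name__ == '__main__':
    root_url = 'http://baike.baidu.com/view/21087.htm'        # example entry URL (assumed)
    SpiderMain().craw(root_url)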

 

url_manager:  # URL manager

class UrlManager(object):
    # Initialize the URL sets: one for URLs already crawled, one for URLs not yet crawled
    def __init__(self):
        self.old_urls = set()   # set of URLs that have already been crawled
        self.new_urls = set()   # set of URLs that have not been crawled yet

    # Add a single new URL to be crawled to the URL manager
    def add_new_url(self, url):
        if url is None:
            return
        # Only add the URL if it is in neither the crawled set nor the not-yet-crawled set
        if url not in self.old_urls and url not in self.new_urls:
            self.new_urls.add(url)   # add this URL to the set of URLs still to be crawled

    def add_new_urls(self, urls):   # add a batch of URLs to the URL manager
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):   # check whether the URL manager still has new URLs
        return len(self.new_urls) != 0

    def get_new_url(self):   # take one URL from the set of URLs to be crawled
        new_url = self.new_urls.pop()
        # set.pop() removes an arbitrary element from the set and returns it
        self.old_urls.add(new_url)
        return new_url
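A quick check of the behaviour described above (a usage sketch, not part of the original; the URL is just an example):

manager = UrlManager()
manager.add_new_url('http://baike.baidu.com/view/21087.htm')   # queue a URL
manager.add_new_url('http://baike.baidu.com/view/21087.htm')   # duplicate is silently ignored
print manager.has_new_url()    # True: one URL is waiting to be crawled
url = manager.get_new_url()    # the URL moves from new_urls to old_urls
manager.add_new_url(url)       # already in old_urls, so it is not queued again
print manager.has_new_url()    # False: repeated and circular crawling is prevented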

url_downloader:  # URL downloader

import urllib2

class HtmlDownloader(object):
    def download(self, url):   # return the downloaded content of the given URL
        if url is None:
            return None
        response = urllib2.urlopen(url)
        if response.getcode() != 200:   # anything other than HTTP 200 counts as a failed download
            return None
        return response.read()

# Reference: http://www.cnblogs.com/huangcong/archive/2011/08/31/2160633.html

'''
urllib2.urlopen(url): opens an HTML document through the urlopen(url[, data]) function; you must provide the URL of the document. urlopen can open not only a file on a remote web server but also a local file, and it returns a file-like object from which the data of the HTML document can be read. Once the document is open, it can be read like an ordinary file with read([nbytes]), readline(), and readlines(). To read the contents of the entire HTML document, use read(), which returns the file contents as a single string.

getcode(): returns the HTTP status code. For an HTTP request, 200 means the request completed successfully and 404 means the URL was not found.

read(): the file object provides three read methods: read(), readline(), and readlines(). Each can take an argument limiting how much data is read at a time, but they are usually called without one. read() reads the whole file at once and is typically used to put the file contents into a string variable. It gives the most direct string representation of the file's contents, but it is unnecessary for sequential line-oriented processing and not feasible when the file is larger than the available memory.
'''
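For instance, the downloader can be exercised like this (a sketch; the URL is only an example, and getcode()/read() behave as described above):

downloader = HtmlDownloader()
html_cont = downloader.download('http://baike.baidu.com/view/21087.htm')   # example URL
if html_cont is None:
    print 'download failed: empty URL or non-200 status code'
else:
    print len(html_cont)   # number of bytes returned by response.read()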

html_parser:  # HTML parser

from bs4 import BeautifulSoup
import re
import urlparse

class HtmlParser(object):
    def _get_new_urls(self, page_url, soup):   # collect the new URLs found in the soup of this page
        new_urls = set()   # the set of newly discovered URLs
        links = soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
        for link in links:
            new_url = link['href']   # the value of the href attribute is the URL of the link
            new_full_url = urlparse.urljoin(page_url, new_url)   # join it into a complete, absolute URL
            new_urls.add(new_full_url)
        return new_urls   # return the set of all new URLs found in the DOM tree

# Regex matching: find <a> tags in the soup whose href matches the pattern, e.g. <a target="_blank" href="/view/592974.htm">interpreter</a>
# re.compile: compiles a regular expression into a regular expression object. Regular expressions that are used often can be compiled into objects, which improves efficiency to some extent.
# urljoin: the function urljoin(base, url[, allow_fragments]) joins URLs. It takes the first argument as the base address and combines it with the relative address in the second argument to form an absolute URL. urljoin is especially handy when dealing with several files at the same location, by appending a new filename to the URL base address. Note that if the base address does not end with a slash (/), the rightmost portion of the base address is replaced by the relative path. For example, with a base address of http://www.testpage.com/pub and a relative URL of test.html, the two merge into http://www.testpage.com/test.html, not http://www.testpage.com/pub/test.html. If you want to keep the final directory in the path, make sure the base address ends with a slash.
# http://www.cnblogs.com/huangcong/archive/2011/08/31/2160633.html

The link above contains a concrete example of urljoin; you can run its code to deepen your understanding.
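# A small illustration of the behaviour just described (the example URLs are assumptions):
#   urlparse.urljoin('http://baike.baidu.com/view/21087.htm', '/view/592974.htm')
#       -> 'http://baike.baidu.com/view/592974.htm'      (an absolute path replaces the whole path)
#   urlparse.urljoin('http://www.testpage.com/pub', 'test.html')
#       -> 'http://www.testpage.com/test.html'           (no trailing slash: 'pub' is replaced)
#   urlparse.urljoin('http://www.testpage.com/pub/', 'test.html')
#       -> 'http://www.testpage.com/pub/test.html'       (trailing slash keeps the directory)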

    def _get_new_data(self, page_url, soup):   # parse the title and summary of the page
        res_data = {}   # the data dict holds the url, title, and summary values
        res_data['url'] = page_url
        title_node = soup.find('dd', class_='lemmaWgt-lemmaTitle-title').find('h1')
        res_data['title'] = title_node.get_text()
        summary_node = soup.find('div', class_='lemma-summary')
        res_data['summary'] = summary_node.get_text()
        return res_data

# HTML source of the title on the Python entry page:
# <dd class="lemmaWgt-lemmaTitle-title">
#     <h1>Python</h1>
#     <a href="javascript:;" class="edit-lemma cmn-btn-hover-blue cmn-btn-28 j-edit-link" style="display:inline-block;"><em class="cmn-icon wiki-lemma-icons wiki-lemma-icons_edit-lemma"></em>edit</a>
#     <a class="lock-lemma" target="_blank" href="/view/10812319.htm" title="lock"><em class="cmn-icon wiki-lemma-icons wiki-lemma-icons_lock-lemma"></em>lock</a>
# </dd>

# HTML source of the summary on the Python entry page:
# <div class="para" label-module="para">Python (English pronunciation: /ˈpaɪθən/, American pronunciation: /ˈpaɪθɑːn/) is an <a target="_blank" href="/view/125370.htm">object-oriented</a>, interpreted <a target="_blank" href="/view/2561555.htm">computer programming language</a>, invented by <a target="_blank" href="/view/2975166.htm">Guido van Rossum</a> in 1989; the first public release was issued in 1991.</div>

get_text(): returns the text of a tag; it works on every tag object in a document parsed by BeautifulSoup.
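A tiny illustration of get_text() on the summary markup (a sketch; the HTML string below is abbreviated from the source shown above):

from bs4 import BeautifulSoup
summary_html = '<div class="lemma-summary"><div class="para">Python is an object-oriented, interpreted computer programming language.</div></div>'
summary_node = BeautifulSoup(summary_html, 'html.parser').find('div', class_='lemma-summary')
print summary_node.get_text()   # -> Python is an object-oriented, interpreted computer programming language.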

    def parse(self, page_url, html_cont):   # return the links and data found in the content
        # the two arguments: the URL currently being crawled, and the downloaded content of that URL
        if page_url is None or html_cont is None:
            return
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')   # create a BeautifulSoup object
        # perform the two kinds of parsing: extract new URLs and the page data from the content
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data

# From the BeautifulSoup official documentation:

html_doc = """
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

Every web page has its own encoding; UTF-8 is the standard encoding for most current websites. When crawling these pages, the crawler must understand their encoding, otherwise characters that look correct in the browser are very likely to come out garbled in the crawled result. BeautifulSoup handles these encodings well. The encoding of a web page can generally be found in the page itself, in the charset attribute.

BeautifulSoup uses the UnicodeDammit library to automatically detect the encoding of a document and converts the content to Unicode when it creates the soup object. soup.original_encoding tells us what the original encoding of the document was. To detect the encoding, UnicodeDammit has to search the entire document, which wastes time, and the detection may be wrong. If you already know the document's encoding, you can use from_encoding to specify it when creating the BeautifulSoup object: soup = BeautifulSoup(html_markup, "lxml", from_encoding="utf-8")
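As a small illustration of the two options (a sketch; html_markup is just a stand-in for downloaded page content):

html_markup = '<html><head><meta charset="utf-8"></head><body><p>hello</p></body></html>'
soup = BeautifulSoup(html_markup, 'html.parser')                           # let BeautifulSoup detect the encoding itself
print soup.original_encoding                                               # the encoding it detected for the document
soup = BeautifulSoup(html_markup, 'html.parser', from_encoding='utf-8')    # skip detection when the encoding is known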

When a BeautifulSoup object is created, the entire downloaded page is turned into a DOM tree (Document Object Model), so the elements of the page can be traversed and accessed in tree form.
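Continuing with the soup built from html_doc above, traversing that tree might look like this (a sketch):

for link in soup.find_all('a'):           # visit every <a> node in the DOM tree
    print link['href'], link.get_text()   # attribute access and text extraction per node
print soup.p.b.get_text()                 # descend the tree: the first <p>, then its <b> child -> The Dormouse's story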

html_outputer:  # HTML outputer

class HtmlOutputer(object):
    def __init__(self):   # initialize the list that stores the collected data
        self.datas = []

    def collect_data(self, data):   # collect one parsed data record
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):   # output the collected data as an HTML file
        fout = open('output.html', 'w')   # write mode
        fout.write(   # (the rest of this method is truncated in the source)
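The remainder of output_html is cut off in the source. A minimal sketch of how the collected records are typically written out as an HTML table (this completion is an assumption, not the author's original code):

    def output_html(self):   # assumed completion: write the collected data as an HTML table
        fout = open('output.html', 'w')   # write mode
        fout.write('<html>')
        fout.write('<body>')
        fout.write('<table>')
        for data in self.datas:   # one table row per collected record
            fout.write('<tr>')
            fout.write('<td>%s</td>' % data['url'])
            fout.write('<td>%s</td>' % data['title'].encode('utf-8'))
            fout.write('<td>%s</td>' % data['summary'].encode('utf-8'))
            fout.write('</tr>')
        fout.write('</table>')
        fout.write('</body>')
        fout.write('</html>')
        fout.close()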
