Target: use Python to crawl the Baidu Encyclopedia (Baidu Baike) Python entry pages
Structure of the crawler:
URL Manager:
Manages the collection of URLs to be crawled and the collection of URLs already crawled
Prevents duplicate crawling and crawl cycles
Supported features:
Add a new URL to the to-crawl collection
Check whether a URL to be added is already in either collection
Get a URL to crawl from the collection
Check whether the collection still has URLs waiting to be crawled
Move a URL from the to-crawl collection to the crawled collection
Web Page Downloader:
Downloads the web page corresponding to a URL from the Internet into a local string
Web Page Parser:
Explained in detail in the example below
Example explanation:
Divided into four modules:
url_manager, url_downloader, html_parser, html_outputer:
url_manager: # URL manager

class UrlManager(object):
    def __init__(self):
        # initialize two sets: URLs already crawled, URLs not yet crawled
        self.old_urls = set()  # set of URLs that have been crawled
        self.new_urls = set()  # set of URLs not yet crawled

    def add_new_url(self, url):
        # add a single new URL to the manager
        if url is None:
            return
        # only add the URL if it is in neither the crawled set nor the to-crawl set
        if url not in self.old_urls and url not in self.new_urls:
            self.new_urls.add(url)  # add the URL to the to-crawl set

    def add_new_urls(self, urls):
        # add a batch of URLs to the manager
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        # does the manager still hold new URLs to crawl?
        return len(self.new_urls) != 0

    def get_new_url(self):
        # take one URL from the to-crawl set;
        # set.pop() removes and returns an arbitrary element of the set
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
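The dedup guarantee above rests entirely on Python's set semantics; stripped of the class wrapper, the logic is just this (a standalone sketch for illustration, not the author's code):

```python
# Minimal sketch of the URL manager's dedup logic using two plain sets.
new_urls, old_urls = set(), set()

def add_new_url(url):
    # add only if the URL has never been seen in either set
    if url is not None and url not in new_urls and url not in old_urls:
        new_urls.add(url)

add_new_url('/view/1.htm')
add_new_url('/view/1.htm')   # duplicate of a pending URL: ignored
add_new_url('/view/2.htm')

url = new_urls.pop()         # take an arbitrary pending URL...
old_urls.add(url)            # ...and mark it crawled

add_new_url(url)             # re-adding a crawled URL is also ignored
```

This is why the manager never crawls the same page twice and never loops: a URL can only ever sit in one of the two sets.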
url_downloader: # HTML downloader

import urllib2

class HtmlDownloader(object):
    def download(self, url):
        # return the downloaded content of the URL, or None on failure
        if url is None:
            return None
        response = urllib2.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read()
# Reference: http://www.cnblogs.com/huangcong/archive/2011/08/31/2160633.html
'''
urllib2.urlopen(url): opens a document via urlopen(url[, data]); you must provide the URL of the document, including the file name. urlopen can open not only files on a remote web server but also local files, and it returns a file-like object from which we can read the data of the HTML document. Once the document is open, it can be read like a regular file with the read([nbytes]), readline() and readlines() functions; to read the contents of the entire HTML document, use read(), which returns the contents of the file as a single string.
getcode(): returns the HTTP status code. For an HTTP request, 200 means the request completed successfully; 404 means the URL was not found.
read(): the file object provides three read methods: .read(), .readline() and .readlines(). Each method can accept an optional argument limiting how much data is read at a time, but it is usually omitted. .read() reads the entire file at once and is typically used to put the contents of the file into a string variable. It produces the most direct string representation of the file's contents, but it is unsuitable for sequential line-oriented processing and impossible if the file is larger than the available memory.
'''
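The downloader above is Python 2 (urllib2); under Python 3 the same logic uses urllib.request. A sketch of the equivalent, exercised against a throwaway local HTTP server so it runs without any external network access (the server and its contents are my own illustration, not part of the original):

```python
import http.server
import threading
import urllib.request

def download(url):
    # Python 3 equivalent of HtmlDownloader.download (urllib2 -> urllib.request)
    if url is None:
        return None
    response = urllib.request.urlopen(url)
    if response.getcode() != 200:
        return None
    return response.read()

# Exercise it against a tiny local server instead of a live website.
class _Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'<html>hello</html>'
        self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep request logging quiet

server = http.server.HTTPServer(('127.0.0.1', 0), _Handler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
content = download('http://127.0.0.1:%d/' % server.server_address[1])
server.shutdown()
```

`content` comes back as bytes in Python 3, where urllib2's read() returned a str; a real crawler would decode it before parsing.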
html_parser: # HTML parser

from bs4 import BeautifulSoup
import re
import urlparse

class HtmlParser(object):
    def _get_new_urls(self, page_url, soup):
        # collect the new URLs found in the soup of this page
        new_urls = set()  # set of newly found URLs
        # query for <a> tags whose href matches the entry-page pattern, e.g.
        # <a target="_blank" href="/view/592974.htm">解释器</a>
        links = soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
        for link in links:
            new_url = link['href']  # the href attribute holds the in-page link
            # join it with the page URL to form a complete, absolute URL
            new_full_url = urlparse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls  # all the new URLs found in the DOM tree

# re.compile: compiles a regular expression into a pattern object. Expressions that are used often can be compiled once into pattern objects, which improves efficiency.
# urljoin: the function urljoin(base, url[, allow_fragments]) concatenates URLs: it takes the first argument as the base address and combines it with the relative address in the second argument to form an absolute URL. urljoin is especially useful when handling several files at the same location, by appending a new file name to the base address. Note that if the base address does not end with '/', its rightmost path component is replaced by the relative path: with base http://www.testpage.com/pub and URL test.html, the two merge into http://www.testpage.com/test.html, not http://www.testpage.com/pub/test.html. To keep the trailing directory in the path, make sure the base address ends with '/'.
# http://www.cnblogs.com/huangcong/archive/2011/08/31/2160633.html
# The link above contains a concrete example of urljoin; you can run its code to deepen your understanding.
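The trailing-slash behaviour described above is easy to verify directly; in Python 3 the function lives in urllib.parse (in Python 2, as used in this example, it is urlparse.urljoin):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

# base without a trailing slash: the last path component is replaced
print(urljoin('http://www.testpage.com/pub', 'test.html'))
# -> http://www.testpage.com/test.html

# base with a trailing slash: the directory is kept
print(urljoin('http://www.testpage.com/pub/', 'test.html'))
# -> http://www.testpage.com/pub/test.html

# the crawler's own case: a root-relative /view/... link replaces the whole path
print(urljoin('http://baike.baidu.com/view/21087.htm', '/view/592974.htm'))
# -> http://baike.baidu.com/view/592974.htm
```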
    def _get_new_data(self, page_url, soup):
        # parse the title and the summary of the page
        res_data = {}  # dict holding the url, title and summary values
        res_data['url'] = page_url
        title_node = soup.find('dd', class_='lemmaWgt-lemmaTitle-title').find('h1')
        res_data['title'] = title_node.get_text()
        summary_node = soup.find('div', class_='lemma-summary')
        res_data['summary'] = summary_node.get_text()
        return res_data

# Source of the title on the Python page:
# <dd class="lemmaWgt-lemmaTitle-title">
# <a href="javascript:;" class="edit-lemma cmn-btn-hover-blue cmn-btn-28 j-edit-link" style="display:inline-block;"><em class="cmn-icon wiki-lemma-icons wiki-lemma-icons_edit-lemma"></em>edit</a>
# <a class="lock-lemma" target="_blank" href="/view/10812319.htm" title="lock"><em class="cmn-icon wiki-lemma-icons wiki-lemma-icons_lock-lemma"></em>lock</a>
# </dd>
# Source of the summary on the Python page:
# <div class="para" label-module="para">Python (English pronunciation: /ˈpaɪθən/, American pronunciation: /ˈpaɪθɑːn/) is an <a target="_blank" href="/view/125370.htm">object-oriented</a>, interpreted <a target="_blank" href="/view/2561555.htm">computer programming language</a>, invented by <a target="_blank" href="/view/2975166.htm">Guido van Rossum</a> in 1989; the first public release was issued in 1991.</div>
# get_text(): returns the text of a tag; it is valid for every tag of a BeautifulSoup-processed object.
    def parse(self, page_url, html_cont):
        # return the links and the data found in the content
        # the two parameters: the URL currently being crawled, and its content
        if page_url is None or html_cont is None:
            return
        # create a BeautifulSoup object
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        # two parsing passes: new URLs and data are parsed out of the content
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data
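The parser above depends on bs4; the same /view/\d+\.htm link filter can be sketched with only the standard library (html.parser plus re), which also shows roughly what find_all with a compiled pattern is doing under the hood (the sample HTML below is my own, modeled on the Baidu Baike snippets shown earlier):

```python
import re
from html.parser import HTMLParser  # stdlib; bs4 does this far more conveniently
from urllib.parse import urljoin

LINK_RE = re.compile(r'/view/\d+\.htm')

class LinkExtractor(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.new_urls = set()

    def handle_starttag(self, tag, attrs):
        # keep only <a> tags whose href matches the entry-page pattern
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value and LINK_RE.search(value):
                    self.new_urls.add(urljoin(self.page_url, value))

html = ('<a target="_blank" href="/view/592974.htm">interpreter</a>'
        '<a href="/help/about.html">about</a>')
extractor = LinkExtractor('http://baike.baidu.com/view/21087.htm')
extractor.feed(html)
print(extractor.new_urls)  # only the /view/... link survives the filter
```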
# From the BeautifulSoup documentation:
html_doc = """
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
Every web page has its own encoding, and UTF-8 is the standard encoding of today's websites. When crawling these pages, the crawler must understand their encoding; otherwise the characters that display correctly on the web page are very likely to come out garbled in the crawl result. BeautifulSoup handles these encodings well. The encoding is generally declared inside the web page itself, and you can see it in the charset attribute of the page.
BeautifulSoup uses the UnicodeDammit library to auto-detect the encoding of a document, and automatically converts the content to Unicode when the soup object is created. If you want to know the original encoding of the HTML document, soup.original_encoding will tell you. To detect the encoding, UnicodeDammit has to search the entire document, which wastes time, and the detection can be wrong. If you already know what the document encoding is, you can use from_encoding to specify it when creating the BeautifulSoup object: soup = BeautifulSoup(html_markup, "lxml", from_encoding="utf-8")
When a BeautifulSoup object is created, the entire downloaded page is turned into a DOM tree (Document Object Model), which makes it possible to traverse and access the elements in tree form.
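Why the encoding matters can be seen with plain bytes, without bs4 at all: decoding UTF-8 bytes with the wrong codec is exactly what produces the garbled characters described above (the sample string is my own):

```python
raw = '解释器'.encode('utf-8')  # UTF-8 bytes, as downloaded from a page

print(raw.decode('utf-8'))      # correct encoding: readable text
print(raw.decode('latin-1'))    # wrong guess: mojibake such as 'è§£é...'
```

Passing from_encoding='utf-8' to BeautifulSoup is simply telling it which of these two decodes to perform, instead of letting UnicodeDammit guess.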
html_outputer:

class HtmlOutputer(object):
    def __init__(self):
        # initialize the list that stores the collected data
        self.datas = []

    def collect_data(self, data):
        # collect one parsed data record
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        # output the collected data as an HTML file
        fout = open('output.html', 'w')  # write mode
        fout.write("