Python Static Web Crawler


This article is based on a video course and shows how to crawl information from 1,000 Baidu Encyclopedia (Baidu Baike) entry pages.

Programming environment: Python 3.5

The crawler is made up of the following parts: a URL manager, a downloader, a parser, and an outputer. The overall workflow is:

(1) Read the URL of the web page to crawl; call it root_url.

(2) Parse the content of the root_url page and save the other URLs it contains into the URL manager.

(3) Output an HTML file containing the url, title, and summary information for each crawled page.

The following code shows, component by component, how the page information is crawled.

Main function:

# -*- coding: utf-8 -*-
import url_manager     # import the URL manager
import html_download   # import the downloader
import html_parser     # import the parser
import html_outputer   # import the outputer


class SpiderMain(object):
    def __init__(self):                  # constructor: initialize the four components
        self.urls = url_manager.UrlManage()
        self.downloader = html_download.Downloader()
        self.parser = html_parser.Parser()
        self.outputer = html_outputer.Outputer()

    def crawl(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)              # add the root URL
        # Loop while the URL manager still holds URLs (in theory it always does,
        # because every crawled page's hyperlinks are stored back into the manager).
        while self.urls.has_new_url():
            try:                                     # a URL may turn out to be unreachable or unparsable
                new_url = self.urls.get_new_url()             # take a URL
                print('Crawl %d: %s' % (count, new_url))      # print the URL and the count
                html_content = self.downloader.download(new_url)   # download the page
                # Parse the page: collect all URLs it links to, plus its title and summary.
                urls, data = self.parser.parse(new_url, html_content)
                self.urls.add_new_urls(urls)         # feed the collected URLs back into the manager for the loop
                self.outputer.collect_data(data)     # collect the data for the later HTML export
                if count == 1000:                    # crawl 1000 URLs
                    break
                count += 1
            except:
                print("Crawl failed")
        self.outputer.output()                       # write the crawled content to an HTML file


if __name__ == "__main__":                           # entry point
    root_url = "http://baike.baidu.com/view/21087.htm"   # the root URL
    obj_spider = SpiderMain()
    obj_spider.crawl(root_url)                       # run the crawl
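For reference, the import statements above imply that each component lives in its own module file. A possible layout is sketched below; the entry module's file name, spider_main.py, is an assumption, as the original does not name it.

# Possible project layout, inferred from the imports above
# (spider_main.py is an assumed name for the entry module shown here):
#
#   spider_main.py    - SpiderMain and the __main__ entry point
#   url_manager.py    - UrlManage (the URL manager)
#   html_download.py  - Downloader (the downloader)
#   html_parser.py    - Parser (the parser)
#   html_outputer.py  - Outputer (the outputer)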
URL Manager:


class UrlManage(object):
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):       # add a single new URL
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            # the URL is neither in the to-crawl set nor in the already-crawled set
            self.new_urls.add(url)

    def has_new_url(self):            # check whether any URLs are left to crawl
        return len(self.new_urls) != 0

    def get_new_url(self):            # take a URL for parsing: remove it from new_urls and move it into old_urls
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

    def add_new_urls(self, urls):     # import all hyperlinks of a crawled page into the new_urls set
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)
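As a minimal illustration (not part of the original tutorial) of how the two sets keep the crawler from revisiting pages:

# Minimal usage sketch of UrlManage (illustration only).
manager = UrlManage()
manager.add_new_url('http://baike.baidu.com/view/21087.htm')
manager.add_new_url('http://baike.baidu.com/view/21087.htm')   # duplicate, silently ignored

print(manager.has_new_url())     # True: one URL is waiting to be crawled
url = manager.get_new_url()      # moves the URL from new_urls into old_urls
manager.add_new_url(url)         # already crawled, so it is not added again
print(manager.has_new_url())     # False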
        
Downloader:

from urllib import request


class Downloader():
    def download(self, url):
        if url is None:
            return
        response = request.urlopen(url)   # open the URL
        if response.getcode() != 200:     # a status code other than 200 means the fetch failed
            return None
        return response.read()            # read the page content (the entire page, as HTML)
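A quick usage sketch (not from the original course). Note that on Python 3, response.read() returns bytes; BeautifulSoup decodes them later in the parser step.

# Minimal usage sketch of Downloader (illustration only).
downloader = Downloader()
html_content = downloader.download('http://baike.baidu.com/view/21087.htm')
if html_content is not None:
    print(type(html_content))   # <class 'bytes'> on Python 3
    print(len(html_content))    # size of the downloaded page in bytes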

Parser:
from bs4 import BeautifulSoup   # use the BeautifulSoup4 parser on the downloaded page
import re
from urllib import parse


class Parser():
    def get_urls(self, page_url, soup):
        urls = set()
        # Entry links have hrefs of the form /view/123.htm; use a regular expression
        # to collect every link in that format.
        links = soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
        for link in links:
            # The href is relative, so urljoin is used to stitch it onto page_url
            # and obtain a complete URL.
            new_url = parse.urljoin(page_url, link['href'])
            urls.add(new_url)               # add the resolved URL to the set
        return urls

    def get_data(self, page_url, soup):     # extract the page's title and summary
        data = {}
        data['url'] = page_url
        title = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find("h1")
        data['title'] = title.get_text()
        summary = soup.find('div', class_="lemma-summary")
        data['summary'] = summary.get_text()
        return data

    def parse(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            return
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf8')
        urls = self.get_urls(page_url, soup)   # collect the page's links
        data = self.get_data(page_url, soup)   # collect the page's title and summary
        return urls, data
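To see what the regular expression in get_urls accepts and rejects, here is a standalone sketch; the example hrefs are made up for illustration, only the /view/123.htm pattern comes from the course code.

import re

# The same pattern used in Parser.get_urls: entry links look like /view/123.htm.
pattern = re.compile(r'/view/\d+\.htm')

print(bool(pattern.search('/view/53557.htm')))                         # True: a relative entry link
print(bool(pattern.search('http://baike.baidu.com/view/21087.htm')))   # True: also matches inside a full URL
print(bool(pattern.search('/item/Python')))                            # False: a different link format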

Outputer:

# -*- coding: utf-8 -*-

class Outputer():
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output(self):
        fout = open('output.html', 'w', encoding='utf-8')   # create the HTML file
        fout.write('<html>')
        fout.write('<head><meta charset="utf-8"></head>')   # declare UTF-8 so Chinese text displays correctly
        fout.write('<body>')
        fout.write('<table>')
        for data in self.datas:                              # one table row per crawled entry
            fout.write('<tr>')
            fout.write('<td>%s</td>' % data['url'])
            fout.write('<td>%s</td>' % data['title'])
            fout.write('<td>%s</td>' % data['summary'])
            fout.write('</tr>')
        fout.write('</table>')
        fout.write('</body>')
        fout.write('</html>')
        fout.close()
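A quick usage sketch of the outputer on its own; the dictionary below is hypothetical sample content in the same url/title/summary shape the parser produces.

# Minimal usage sketch of Outputer (the dictionary is made-up sample data).
outputer = Outputer()
outputer.collect_data({
    'url': 'http://baike.baidu.com/view/21087.htm',
    'title': 'Python',
    'summary': 'A sample summary ...',
})
outputer.output()    # writes output.html containing one table row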

Some additional notes on BeautifulSoup, the web page parser, follow.

A small BeautifulSoup demo (import re, import BeautifulSoup from bs4, then parse a short html_doc containing three example links) gives results like the following:

Get all links:
a  http://example.com/elsie   Elsie
a  http://example.com/lacie   Lacie
a  http://example.com/tillie  Tillie

Get the link for Tillie:
a  http://example.com/tillie  Tillie

Get a link whose href matches the regular expression "lsi":
a  http://example.com/elsie   Elsie

Get the paragraph text:
p  Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.
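The demo code itself is reproduced below as a sketch; it assumes the standard "three little sisters" html_doc from the BeautifulSoup documentation, so the exact document used in the course may differ slightly.

import re
from bs4 import BeautifulSoup

# The classic example document from the BeautifulSoup documentation (assumed here).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Get all links.
for link in soup.find_all('a'):
    print(link.name, link['href'], link.get_text())

# Get the link for Tillie (matched by its href).
link = soup.find('a', href='http://example.com/tillie')
print(link.name, link['href'], link.get_text())

# Get a link whose href matches the regular expression "lsi".
link = soup.find('a', href=re.compile(r'lsi'))
print(link.name, link['href'], link.get_text())

# Get the paragraph by its class and print its text.
p = soup.find('p', class_='story')
print(p.name, p.get_text())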

Reference from: Python crawler----Web parser and beautifulsoup third party module

Next, a brief description of the urljoin function, with an example:

# A simple example of urljoin from urllib.parse
from urllib import parse
print(parse.urljoin('http://baike.baidu.com/view/21087.htm', '/view/53557.htm'))
# Note the leading "/" in the second argument; try changing it to see how the result differs.
This prints:

http://baike.baidu.com/view/53557.htm
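To make the role of the leading "/" concrete, here is a small comparison sketch; both calls use urllib.parse.urljoin, and the second relative path is made up for illustration.

from urllib import parse

# With a leading "/", the path is resolved from the site root.
print(parse.urljoin('http://baike.baidu.com/view/21087.htm', '/view/53557.htm'))
# http://baike.baidu.com/view/53557.htm

# Without a leading "/", the path is resolved relative to the base URL's directory.
print(parse.urljoin('http://baike.baidu.com/view/21087.htm', 'view/53557.htm'))
# http://baike.baidu.com/view/view/53557.htm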
At first, when writing the code by following the video tutorial, I found that the Chinese text in the generated HTML file could not be displayed, or came out garbled. To fix this, the output file is opened with encoding='utf-8' and the following line is added to the outputer:

fout.write('<head><meta charset="utf-8"></head>')   # declare the encoding so Chinese text renders correctly
Reference: fixing garbled Chinese output in a crawler

Finally, the result of the crawl is an HTML file listing the url, title, and summary of each crawled entry.


The above describes how I followed the imooc video tutorial on developing a simple crawler in Python, along with my debugging process, the problems I ran into, and how I solved them.

The complete code is attached; just run the main module directly.


