Python crawler Learning (3)

Source: Internet
Author: User

Learn to build a simple web crawler that crawls related entry (term) information from Baidu Baike, the Baidu encyclopedia.

The program uses a third-party parsing package, BeautifulSoup4. On Windows it can be installed with: pip install beautifulsoup4
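To verify that the package is installed, here is a minimal BeautifulSoup sketch; the HTML snippet is invented purely for illustration:

from bs4 import BeautifulSoup

# parse a small, made-up HTML snippet with the built-in html.parser
soup = BeautifulSoup('<html><body><a href="/item/test">demo</a></body></html>', 'html.parser')
# find the first <a> tag and print its href attribute
link = soup.find('a')
print link['href']    # prints /item/test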

1. Create a new package

2. Create the related class files in the package (a possible layout is sketched after this list), which include:

index.py, the package entry class file;

url_manager.py, the URL manager class file, which manages the lists of URLs to be crawled and already-crawled URLs, avoiding duplicate or circular crawls;

html_downloader.py, the HTML downloader class file, which downloads the content of a given URL to the local machine;

html_parser.py, the HTML parser class file, which parses the downloaded content and extracts the required data;

html_outputer.py, the output class file, which collects the required data from each URL and outputs it.
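Assuming the package is named crawler, as the import in index.py below suggests, the resulting layout would look roughly like this (__init__.py is the usual empty file that marks the directory as a package):

crawler/
    __init__.py
    index.py
    url_manager.py
    html_downloader.py
    html_parser.py
    html_outputer.py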

3. The code (the comments are fairly detailed):

index.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# adduser: gao
# addtime: 2018-01-31 22:22
# Description: crawler entry point

from crawler import url_manager, html_downloader, html_parser, html_outputer


class Index(object):
    # constructor: initialize each manager
    def __init__(self):
        # URL manager
        self.urls = url_manager.UrlManager()
        # HTML downloader
        self.downloader = html_downloader.HtmlDownloader()
        # HTML parser
        self.parser = html_parser.HtmlParser()
        # data collector / outputer
        self.outputer = html_outputer.HtmlOutputer()

    # crawler
    def craw(self, url):
        count = 1
        # add the initial address to the URL manager
        self.urls.add_new_url(url)
        # keep crawling while the URL manager still has new URLs
        while self.urls.has_new_url():
            try:
                # get a new URL address
                new_url = self.urls.get_new_url()
                print 'craw %d: %s' % (count, new_url)
                # download the URL content
                html_cont = self.downloader.download(new_url)
                # parse the content into new URLs and the required data
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                # add the parsed URL list to the URL manager in bulk
                self.urls.add_new_urls(new_urls)
                # collect the data
                self.outputer.collect_data(new_data)
                if count == 100:
                    break
                count += 1
            # exception handling
            except:
                print 'craw failed'
        # output the collected data
        self.outputer.output_html()


if __name__ == '__main__':
    # initial URL address
    initial_url = 'https://baike.baidu.com/item/%E5%94%90%e8%af%97%e4%b8%89%e7%99%be%e9%a6%96/18677'
    # create the object
    obj = Index()
    # call the crawler
    obj.craw(initial_url)

url_manager.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# adduser: gao
# addtime: 2018-01-31 22:22
# Description: URL manager


class UrlManager(object):
    # constructor: initialize the URL lists
    def __init__(self):
        # URLs to be crawled
        self.new_urls = set()
        # URLs already crawled
        self.old_urls = set()

    # add a single URL
    def add_new_url(self, url):
        # skip an empty URL
        if url is None:
            return
        # skip URLs that are already queued or already crawled
        if url not in self.new_urls and url not in self.old_urls:
            # add to the set of URLs to crawl
            self.new_urls.add(url)

    # add URLs in bulk
    def add_new_urls(self, urls):
        # skip an empty collection
        if urls is None or len(urls) == 0:
            return
        # add each URL individually
        for url in urls:
            self.add_new_url(url)

    # check whether there is a new (not yet crawled) URL
    def has_new_url(self):
        return len(self.new_urls) != 0

    # get a new (not yet crawled) URL
    def get_new_url(self):
        # take a URL from the set of URLs to crawl
        new_url = self.new_urls.pop()
        # move it to the set of crawled URLs
        self.old_urls.add(new_url)
        # return the fetched URL
        return new_url
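A quick, hypothetical check of the deduplication behaviour from a Python 2 shell; the example URL is made up, and the import path assumes the package is named crawler:

from crawler import url_manager

manager = url_manager.UrlManager()
# adding the same URL twice only stores it once
manager.add_new_url('https://baike.baidu.com/item/example')
manager.add_new_url('https://baike.baidu.com/item/example')
print manager.has_new_url()    # True
print manager.get_new_url()    # https://baike.baidu.com/item/example
# the URL is now in old_urls, so re-adding it is ignored
manager.add_new_url('https://baike.baidu.com/item/example')
print manager.has_new_url()    # False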

html_downloader.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# adduser: gao
# addtime: 2018-01-31 22:22
# Description: HTML downloader

import urllib2


class HtmlDownloader(object):
    # download the content of a URL
    def download(self, url):
        # skip an empty URL
        if url is None:
            return None
        # request the URL
        response = urllib2.urlopen(url)
        # check the HTTP status code of the response
        if response.getcode() != 200:
            return None
        # return the downloaded content
        return response.read()
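Note that urllib2 exists only in Python 2. If you run the tutorial under Python 3, the same downloader logic would use urllib.request instead; a minimal sketch:

# Python 3 variant of the same downloader logic
import urllib.request


class HtmlDownloader(object):
    def download(self, url):
        # skip an empty URL
        if url is None:
            return None
        # request the URL
        response = urllib.request.urlopen(url)
        # only accept a successful HTTP 200 response
        if response.getcode() != 200:
            return None
        return response.read()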

html_parser.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# adduser: gao
# addtime: 2018-01-31 22:22
# Description: HTML parser

import re, urlparse

from bs4 import BeautifulSoup


class HtmlParser(object):
    # parse the content of a URL
    def parse(self, url, html_cont):
        # check for empty parameters
        if url is None or html_cont is None:
            return
        # create the parsed document object
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        # get the list of new URLs to crawl
        new_urls = self._get_new_urls(url, soup)
        # get the required data
        new_data = self._get_new_data(url, soup)
        # return the URL list and the data
        return new_urls, new_data

    # get the entry URLs contained in the page content
    def _get_new_urls(self, url, soup):
        # initialize the URL list
        new_urls = set()
        # get all <a> tags of entry links on the page
        links = soup.find_all('a', href=re.compile(r'^/item/'))
        for link in links:
            # get the href attribute of the <a> tag
            new_url = link['href']
            # join it with the page URL to build the full URL
            full_url = urlparse.urljoin(url, new_url)
            # add the full URL to the URL list
            new_urls.add(full_url)
        # return the URL list
        return new_urls

    # get the required data from the page content
    def _get_new_data(self, url, soup):
        # initialize the data dictionary
        data = {}
        # get the title tag and its text
        title_node = soup.find('dd', class_='lemmaWgt-lemmaTitle-title').find('h1')
        data['title'] = title_node.get_text()
        # get the entry summary tag and its text
        summary_node = soup.find('div', class_='lemma-summary')
        data['summary'] = summary_node.get_text()
        # the URL address of the entry itself
        data['url'] = url
        # return the data
        return data
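As a small illustration of what _get_new_urls does with the matched links, here is how urlparse.urljoin resolves a relative /item/ href against the page URL; the href value is a made-up example:

import urlparse

page_url = 'https://baike.baidu.com/item/%E5%94%90%e8%af%97%e4%b8%89%e7%99%be%e9%a6%96/18677'
# a relative href as it would appear in an <a> tag on the page
relative_href = '/item/example'
# urljoin keeps the scheme and host of the page URL and swaps in the new path
print urlparse.urljoin(page_url, relative_href)
# prints https://baike.baidu.com/item/example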

html_outputer.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# adduser: gao
# addtime: 2018-01-31 22:22
# Description: HTML outputer


class HtmlOutputer(object):
    # constructor: initialize the collected data list
    def __init__(self):
        self.data = []

    # collect data
    def collect_data(self, data):
        # skip empty data
        if data is None:
            return
        # add the data to the list
        self.data.append(data)

    # output the collected data
    def output_html(self):
        # open the output file
        out_file = open('output.html', 'w')
        out_file.write('
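The body of output_html is cut off in the source right after the first write call. A minimal sketch of how the method might be finished, assuming a plain HTML table of the collected url, title, and summary (the exact markup used by the original author is not preserved):

    # output the collected data (sketch of the truncated method body)
    def output_html(self):
        out_file = open('output.html', 'w')
        out_file.write('<html><body><table>')
        # one table row per collected entry
        for data in self.data:
            out_file.write('<tr>')
            out_file.write('<td>%s</td>' % data['url'])
            out_file.write('<td>%s</td>' % data['title'].encode('utf-8'))
            out_file.write('<td>%s</td>' % data['summary'].encode('utf-8'))
            out_file.write('</tr>')
        out_file.write('</table></body></html>')
        out_file.close()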

4. Run and Invoke

The code is finished; now let's see whether it actually works.

Inside the package, select the entry class file (index.py), right-click, and execute the Run command;

Outside the package, that is, when calling the crawler from other code, first import the package's entry class file (index.py), then instantiate the object, and finally call the craw method; of course, you still have to execute the Run command.
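A minimal sketch of this second case, calling the crawler from a separate script; the import path assumes the package is named crawler and sits on your PYTHONPATH:

# call the crawler from outside the package
from crawler.index import Index

# initial entry page to start crawling from
initial_url = 'https://baike.baidu.com/item/%E5%94%90%e8%af%97%e4%b8%89%e7%99%be%e9%a6%96/18677'
obj = Index()
obj.craw(initial_url)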

  

My code: https://github.com/HeyEasy/PythonCode

