Modules:
(1) URL manager: manages the URLs waiting to be crawled and the URLs already crawled (a rough sketch follows this list)
(2) Web downloader (urllib2): downloads the page at the URL to be crawled, as a string
(3) Web parser (BeautifulSoup): parses the downloaded page
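Only the downloader and the parser appear in code in this post; as a rough sketch (the class and method names below are illustrative assumptions, not from the article), the URL manager can be built from two sets, one for pending URLs and one for URLs already crawled:

# Minimal URL manager sketch (class/method names are illustrative)
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        # skip empty URLs, duplicates, and URLs already processed
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # hand out one pending URL and mark it as crawled
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url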
Ways to download a web page using urllib2:
Method 1:
import urllib2

url = 'http://www.baidu.com'    # define the URL
res1 = urllib2.urlopen(url)     # request the URL
print res1.getcode()            # get the HTTP status code
print len(res1.read())          # get the length of the page content
Method 2:
# disguise the request as a browser
import urllib2

url = 'http://www.baidu.com'
request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0')  # add an HTTP header to disguise the request as a browser
res2 = urllib2.urlopen(request)                  # send the request and get the result
print res2.getcode()
print len(res2.read())
Method 3:
# Add a handler for special scenarios, for example pages that require login cookies,
# a proxy, HTTPS, or automatic redirects between pages
# (a proxy sketch follows the cookie example below).

# Example: handling cookies
import urllib2
import cookielib

url = 'http://www.baidu.com'
cj = cookielib.CookieJar()                                      # create a cookie container
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))  # create an opener
urllib2.install_opener(opener)                                  # install the opener for urllib2
res3 = urllib2.urlopen(url)                                     # send the request and get the result
print res3.getcode()
print len(res3.read())
print cj
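The same build_opener / install_opener pattern handles the other special scenarios listed above. As a rough sketch (the proxy address is a placeholder, not taken from this article), a proxy can be plugged in with urllib2.ProxyHandler:

# Rough sketch: sending requests through an HTTP proxy (the proxy address is a placeholder)
import urllib2

proxy_handler = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8080'})  # placeholder proxy address
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)                  # install the opener, as in the cookie example
res4 = urllib2.urlopen('http://www.baidu.com')  # this request is now routed through the proxy
print res4.getcode()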
Types of web parser:
(1) Regular expressions (see the sketch after this list)
(2) html.parser
(3) BeautifulSoup: third-party package
(4) lxml
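For comparison, a bare regular expression can pull a link out of a page without any HTML parser, though it is far more brittle; a minimal sketch, using an anchor tag like the one in the example further below:

# Rough sketch: extracting a link with a regular expression instead of an HTML parser
import re

html = '<a href="123.html" class="article_link">hello,python!</a>'
for href, text in re.findall(r'<a\s+href="([^"]+)"[^>]*>([^<]*)</a>', html):
    print href, text    # prints: 123.html hello,python!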
BeautifulSoup Syntax:
(1) Create a BeautifulSoup object
(2) Search for nodes with find_all and find (find returns only the first match)
The search criteria can be a node name, a node attribute, or the node content (see the sketch after the output below)
Example: <a href='123.html' class='article_link'>hello,python!</a>
Node name: a
Node attributes: href='123.html' or class='article_link'
Node content: hello,python!
(3) Access the node's name, attributes, and content
from bs4 import BeautifulSoup
import re

# sample HTML with three links, matching the output shown below
html_doc = """
<!DOCTYPE html>
<html><body>
<a href="http://www.baidu.com" class="article_link">Baidu</a>
<a href="http://www.youku.com" class="article_link">Youku</a>
<a href="http://www.hao123.com" class="article_link">Hao123</a>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

print 'Get all links:'
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()

print 'Get only the link to Baidu:'
link_node = soup.find('a', href='http://www.baidu.com')
print link_node.name, link_node['href'], link_node.get_text()

print 'Regular expression match:'
link_node2 = soup.find('a', href=re.compile(r'baidu'))
print link_node2.name, link_node2['href'], link_node2.get_text()
Output Result:
Get all links:
a http://www.baidu.com Baidu
a http://www.youku.com Youku
a http://www.hao123.com Hao123
Get only the link to Baidu:
a http://www.baidu.com Baidu
Regular expression match:
a http://www.baidu.com Baidu
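find and find_all accept any combination of the three search criteria listed earlier; a short sketch, reusing the soup object built above:

# Rough sketch of the three kinds of search criteria (reuses the soup object from the example above)
import re

print soup.find_all('a')                          # search by node name
print soup.find_all('a', class_='article_link')   # search by node attribute ('class_' avoids the Python keyword)
print soup.find_all('a', text='Baidu')            # search by node content
print soup.find('a', href=re.compile(r'baidu'))   # attribute value matched with a regular expression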