Python Crawler Learning: Chapter Six

Source: Internet
Author: User

Hands-on: a hand-written Python image crawler

Idea:

1. Create a custom function that crawls the images on a single page. The process: first read the full page source with urllib.request.urlopen(url).read(); then run a first round of filtering with the first regular expression; then, on the result of that filtering, run a second round with the second regular expression to extract the links of all target images on the page and store them in a list; finally traverse the list and save each image locally with urllib.request.urlretrieve(imageurl, filename=imagename). To keep the program from crashing halfway through, wrap the download in exception handling; if an image cannot be crawled, the counter still advances via x += 1 and the crawl simply jumps to the next image.
2. Use a for loop to crawl every page under the category. The per-page link can be constructed as url = "http://list.jd.com/list.html?cat=23413143151&page=" + str(i); inside the for loop, i increases by 1 on each iteration, and every iteration calls the image-crawling function from step 1.

```
import re
import urllib.request
import urllib.error

def craw(url, page):
    html1 = urllib.request.urlopen(url).read()
    html1 = str(html1)
    # first regex: narrow the source down to the product-list block
    # (the exact pattern depends on the page's current markup)
    pat1 = '<div id="plist".+?<div class="page clearfix">'
    result1 = re.compile(pat1).findall(html1)
    result1 = result1[0]
    # second regex: extract the image links inside that block
    pat2 = r'<img width="220" height="220" data-img="1" data-lazy-img="//(.+?\.jpg)">'
    imagelist = re.compile(pat2).findall(result1)
    x = 1
    for imageurl in imagelist:
        imagename = "d:/shoujitupian/img/" + str(page) + str(x) + ".jpg"
        imageurl = "http://" + imageurl
        try:
            urllib.request.urlretrieve(imageurl, filename=imagename)
        except urllib.error.URLError as e:
            # if one image fails, report why; the counter below still
            # advances, so the crawl jumps to the next image
            if hasattr(e, "code"):
                print(e.code)
            if hasattr(e, "reason"):
                print(e.reason)
        x += 1

for i in range(1, 6):
    url = "https://list.jd.com/list.html?cat=9987,653,655&page=" + str(i)
    craw(url, i)
```

Link crawler
    1. Determine the entry page whose links you want to crawl.
    2. Build a regular expression for link extraction according to your needs.
    3. Simulate a browser and crawl the corresponding web page.
    4. Extract the links contained in the page using the regular expression from step 2.
    5. Filter out duplicate links.
    6. Perform any subsequent operations, such as printing the links to the screen.
      "'author = ' My '
import re#爬取所有页面链接
import urllib.request
def getlinks(url):
headers=(‘User-Agent‘,‘Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36‘) #模拟成浏览器
opener=urllib.request.buildopener() opener.addheaders=[headers]
urllib.request.install
opener(opener)#将opener安装为全局
file=urllib.request.urlopen(url)
data=str(file.read())
pat=‘(https?://[^\s)";]+.(w|/)*)‘#根据需求构建好链接表达式
link=re.compile(pat).findall(data)
link=list(set(link))#去除重复元素
return link

linklist=getlinks(url)#获取对应网页中包含的链接地址
for link in linklist:#通过for循环分别遍历输出获取到的链接地址到屏幕上
print(link[0])
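
A side note on why the loop prints link[0]: since the pattern contains two capture groups, re.findall returns a tuple per match, with the full URL in the first position. A minimal illustration on a made-up snippet of HTML:

```
import re

pat = r'(https?://[^\s)";]+\.(\w|/)*)'
sample = 'see <a href="https://example.com/page1">one</a> and <a href="https://example.com/img.png">two</a>'
for match in re.findall(pat, sample):
    # each match is a tuple: (full URL, last piece matched by the inner group)
    print(match[0])
```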
Hands-on: a Qiushibaike (Embarrassing Things Encyclopedia) crawler
  1. Analyze the URL pattern across pages, construct a URL variable, and crawl multiple pages of content through a for loop.
  2. Build a custom function whose job is to crawl the jokes on one page; each joke has two parts, the user and the content that user posted. The function works as follows: first access the page while pretending to be a browser, inspect the page source, and write one regular expression for the format of the user section and one for the content section. Then use those regular expressions to extract all users and all contents. Next, loop over the contents and assign each one to a variable whose name follows the pattern "content + sequence number" (content1, content2, ...). Finally, loop over the users and print each user together with the corresponding content.
  3. Obtain the URL of each page through a for loop, calling the getcontent(url, page) function once per page.
```
__author__ = 'My'
import re
import urllib.request

def getcontent(url, page):
    # pretend to be a browser
    headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    # install the opener globally
    urllib.request.install_opener(opener)
    file = urllib.request.urlopen(url)
    data = str(file.read().decode("utf-8"))
    # regex that extracts the users
    userpat = 'target="_blank" title="(.*?)">'
    # regex that extracts the contents
    contentpat = '<div class="content">(.*?)</div>'
    userlist = re.compile(userpat, re.S).findall(data)
    contentlist = re.compile(contentpat, re.S).findall(data)
    # an explicit namespace, so the names created by exec() below are
    # reliably visible to the later exec() call (exec cannot create
    # ordinary function locals in Python 3)
    ns = {}
    x = 1
    for content in contentlist:
        content = content.replace("\n", "")
        content = content.replace("<span>", "")
        content = content.replace("</span>", "")
        content = content.replace("<br/>", "")
        # store each item in a variable named content1, content2, ...
        name = "content" + str(x)
        exec(name + '=content', {'content': content}, ns)
        x += 1
    y = 1
    for user in userlist:
        name = "content" + str(y)
        print("user " + str(page) + str(y) + " is: " + user)
        print("content is:")
        exec("print(" + name + ")", ns)
        print("\n")
        y += 1

for i in range(1, 10):
    url = "https://www.qiushibaike.com/8hr/page/" + str(i)
    getcontent(url, i)
```
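
The exec-based "content + sequence number" variables follow the original write-up, but the same pairing of users and contents can be expressed more simply with zip. A minimal sketch of that alternative, using made-up stand-ins for userlist, contentlist and page:

```
# pair each user with the corresponding content, without dynamic variable names
userlist = ["user_a", "user_b"]              # illustrative data only
contentlist = ["first joke", "second joke"]  # illustrative data only
page = 1
for y, (user, content) in enumerate(zip(userlist, contentlist), start=1):
    print("user " + str(page) + str(y) + " is: " + user)
    print("content is:")
    print(content + "\n")
```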
WeChat article crawler implementation
    1. Create three custom functions: one that crawls a specified URL through a proxy server and returns the crawled data, one that obtains the article links from multiple result pages, and one that crawls the title and content from each article link and writes them to a file.
    2. Crawling the contents of a specified URL through a proxy server was covered in Chapter 4; to keep a single failure from interrupting the program, we wrap it in an exception-handling mechanism.
    3. To obtain the article links from multiple pages, encode the keyword with urllib.request.quote(key), build the URL of each article-list page, and crawl each page's article links inside a for loop by calling the proxy-server function from step 2 (a short sketch of the keyword encoding follows this list).
    4. To crawl the title and content from each article link and write them to a file, use a for loop over the (real) URLs collected in step 3, extract the parts we care about with regular expressions, and write them to the corresponding file.
    5. If an exception occurs, the code needs to back off, i.e. wait for a while before attempting the next operation. To implement the delay we import the time module and call time.sleep(); for example, time.sleep(7) pauses for 7 seconds.
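
Step 3 relies on urllib.request.quote to make the keyword safe for use inside a URL. A minimal sketch of how the list-page URL is assembled (the keyword and page number here are arbitrary examples; quoting "&page" together with the query mirrors the original program rather than standard URL building):

```
import urllib.request

key = "Internet of Things"
page = 2
keycode = urllib.request.quote(key)       # 'Internet%20of%20Things'
pagecode = urllib.request.quote("&page")  # '%26page'
url = "http://weixin.sogou.com/weixin?type=2&query=" + keycode + pagecode + str(page)
print(url)
```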
```
__author__ = 'My'
import re
import urllib.request
import time
import urllib.error

# pretend to be a browser
headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')
opener = urllib.request.build_opener()
opener.addheaders = [headers]
# install the opener globally
urllib.request.install_opener(opener)
# a list that stores the article page links
listurl = []

# crawl a url through a proxy server
def use_proxy(proxy_addr, url):
    # exception-handling mechanism
    try:
        import urllib.request
        proxy = urllib.request.ProxyHandler({'http': proxy_addr})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        data = urllib.request.urlopen(url).read().decode('utf-8')
        return data
    except urllib.error.URLError as e:
        if hasattr(e, 'code'):
            print(e.code)
        if hasattr(e, 'reason'):
            print(e.reason)
        time.sleep(10)
    except Exception as e:
        print('exception:' + str(e))
        time.sleep(1)

# search: collect the article links from the list pages
def getlisturl(key, pagestart, pageend, proxy):
    try:
        page = pagestart
        # encode the keyword key
        keycode = urllib.request.quote(key)
        # encode "&page"
        pagecode = urllib.request.quote('&page')
        # crawl the article links of each page in a loop
        for page in range(pagestart, pageend + 1):
            # build the url of each page, once per iteration
            url = "http://weixin.sogou.com/weixin?type=2&query=" + keycode + pagecode + str(page)
            # e.g. http://weixin.sogou.com/weixin?query=物联网&type=2&page=2
            # crawl through the proxy server to work around ip bans
            data1 = use_proxy(proxy, url)
            print(data1)
            # regex for the article links
            # (the exact pattern depends on the current page markup)
            listurlpat = '<div class="txt-box">.*?(http://.*?)"'
            # get all article links on this page and append them to listurl
            print(re.compile(listurlpat, re.S).findall(data1))
            listurl.append(re.compile(listurlpat, re.S).findall(data1))
        print("got " + str(len(listurl)) + " pages")
        return listurl
    except urllib.error.URLError as e:
        if hasattr(e, 'code'):
            print(e.code)
        if hasattr(e, 'reason'):
            print(e.reason)
        time.sleep(10)
    except Exception as e:
        print('exception:' + str(e))
        time.sleep(1)

# save: crawl each article and write it into an html file
def getcontent(listurl, proxy):
    # html header (reconstructed skeleton of the output file)
    html1 = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>'''
    fh = open("d:/111.html", "wb")
    fh.write(html1.encode('utf-8'))
    fh.close()
    # reopen the file in append mode to write the articles
    fh = open('d:/111.html', 'ab')
    print(listurl)
    for i in range(0, len(listurl)):
        for j in range(0, len(listurl[i])):
            try:
                url = listurl[i][j]
                url = url.replace('amp;', '')
                data = use_proxy(proxy, url)
                # regexes for title and content
                # (the exact patterns depend on the current page markup)
                titlepat = '<title>(.*?)</title>'
                contentpat = 'id="js_content">(.*?)id="js_sg_bar"'
                title = re.compile(titlepat).findall(data)
                content = re.compile(contentpat, re.S).findall(data)
                thistitle = "could not get the title this time"
                thiscontent = "could not get the content this time"
                if title != []:
                    thistitle = title[0]
                if content != []:
                    thiscontent = content[0]
                dataall = "<p>The title is: " + thistitle + "</p><p>The content is: " + thiscontent + "</p><br>"
                fh.write(dataall.encode('utf-8'))
                print("page " + str(i) + ", item " + str(j) + " processed")
            except urllib.error.URLError as e:
                if hasattr(e, 'code'):
                    print(e.code)
                if hasattr(e, 'reason'):
                    print(e.reason)
                time.sleep(10)
            except Exception as e:
                print('exception:' + str(e))
                time.sleep(1)
    fh.close()
    # html footer
    html2 = '''</body>
</html>'''
    fh = open("d:/111.html", 'ab')
    fh.write(html2.encode('utf-8'))
    fh.close()

key = 'Internet of Things'
proxy = "125.115.183.26:808"
proxy2 = ""
pagestart = 1
pageend = 2
listurl = getlisturl(key, pagestart, pageend, proxy)
getcontent(listurl, proxy)
```
Multi-threaded Crawler

Multithreaded Small Program

```
import threading

class A(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
    def run(self):
        for i in range(10):
            print('I am thread A')

class B(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
    def run(self):
        for i in range(10):
            print('I am thread B')

t1 = A()
t2 = B()
t1.start()
t2.start()
```
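
start() returns immediately and executes run() in the new thread; if the main program should wait for a thread to finish before moving on, it can call join(). A minimal, self-contained sketch of that pattern (when only run() is overridden, the explicit __init__ is not even needed):

```
import threading

class A(threading.Thread):
    def run(self):
        for i in range(3):
            print('I am thread A')

t1 = A()
t1.start()
# block the main thread until the worker thread has finished
t1.join()
print('thread A is done')
```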
Using a queue

```
import queue

a = queue.Queue()
a.put('Hello')
a.put('WJ')
a.put('like')
a.put('study')
a.task_done()     # normally called after get() to mark an item as processed
print(a.qsize())  # 4 items in the queue
print(a.get())
print(a.get())
print(a.get())
print(a.get())
print(a.qsize())  # 0 items left
# a further a.get() here would block forever, because the queue is now empty
```
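
Before turning to the full multithreaded crawler below, it helps to see the pattern it is built on in isolation: one thread puts URLs into a queue.Queue while another thread gets them and processes them. A minimal sketch of that producer/consumer pattern (the class names, the fake URL list and the None sentinel are illustrative only):

```
import threading
import queue

urlqueue = queue.Queue()

class Producer(threading.Thread):
    def __init__(self, urls, urlqueue):
        threading.Thread.__init__(self)
        self.urls = urls
        self.urlqueue = urlqueue
    def run(self):
        for url in self.urls:
            # hand each url over to the consumer thread
            self.urlqueue.put(url)

class Consumer(threading.Thread):
    def __init__(self, urlqueue):
        threading.Thread.__init__(self)
        self.urlqueue = urlqueue
    def run(self):
        while True:
            url = self.urlqueue.get()
            if url is None:  # sentinel value: nothing more to process
                break
            print('processing ' + url)

urls = ['http://example.com/1', 'http://example.com/2', None]
t1 = Producer(urls, urlqueue)
t2 = Consumer(urlqueue)
t1.start()
t2.start()
t1.join()
t2.join()
```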
Turning the small program into a multithreaded crawler to improve efficiency

```
# -*- coding: utf-8 -*-
"""
Created on Sat Apr 22 10:25:08 2017

@author: My
"""

import threading
import queue
import re
import urllib.request
import time
import urllib.error

urlqueue = queue.Queue()
headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)
listurl = []

# crawl a url through a proxy server
def use_proxy(proxy_addr, url):
    try:
        import urllib.request
        proxy = urllib.request.ProxyHandler({'http': proxy_addr})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        print(url)
        data = urllib.request.urlopen(url).read().decode('utf-8')
        return data
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        time.sleep(10)
    except Exception as e:
        print("exception:" + str(e))
        time.sleep(1)

# thread 1: collect the article links and put them into the queue
class Geturl(threading.Thread):
    def __init__(self, key, pagestart, pageend, proxy, urlqueue):
        threading.Thread.__init__(self)
        self.pagestart = pagestart
        self.pageend = pageend
        self.proxy = proxy
        self.urlqueue = urlqueue
        self.key = key
    def run(self):
        page = self.pagestart
        keycode = urllib.request.quote(self.key)
        pagecode = urllib.request.quote("&page")
        for page in range(self.pagestart, self.pageend + 1):
            url = "http://weixin.sogou.com/weixin?type=2&query=" + keycode + pagecode + str(page)
            print(url)
            data1 = use_proxy(self.proxy, url)
            # regex for the article links (depends on the current page markup)
            listurlpat = '<div class="txt-box">.*?(http://.*?)"'
            listurl.append(re.compile(listurlpat, re.S).findall(data1))
        print("got " + str(len(listurl)) + " pages")
        for i in range(0, len(listurl)):
            time.sleep(7)
            for j in range(0, len(listurl[i])):
                try:
                    url = listurl[i][j]
                    url = url.replace("amp;", "")
                    print("page " + str(i) + ", item " + str(j))
                    self.urlqueue.put(url)
                    self.urlqueue.task_done()
                except urllib.error.URLError as e:
                    if hasattr(e, "code"):
                        print(e.code)
                    if hasattr(e, "reason"):
                        print(e.reason)
                    time.sleep(10)
                except Exception as e:
                    print("exception:" + str(e))
                    time.sleep(1)

# thread 2: take links out of the queue, crawl each article and save it to a file
class Getcontent(threading.Thread):
    def __init__(self, urlqueue, proxy):
        threading.Thread.__init__(self)
        self.urlqueue = urlqueue
        self.proxy = proxy
    def run(self):
        # html header of the output file
        html1 = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>'''
        fh = open("D:/pythontest/7.html", 'wb')
        fh.write(html1.encode('utf-8'))
        fh.close()
        fh = open("D:/pythontest/7.html", 'ab')
        i = 1
        while True:
            try:
                url = self.urlqueue.get()
                data = use_proxy(self.proxy, url)
                # regexes for title and content (depend on the current page markup)
                titlepat = "<title>(.*?)</title>"
                contentpat = 'id="js_content">(.*?)id="js_sg_bar"'
                title = re.compile(titlepat).findall(data)
                content = re.compile(contentpat, re.S).findall(data)
                thistitle = "could not get the title this time"
                thiscontent = "could not get the content this time"
                if title != []:
                    thistitle = title[0]
                if content != []:
                    thiscontent = content[0]
                dataall = "<p>The title is: " + thistitle + "</p><p>The content is: " + thiscontent + "</p><br>"
                fh.write(dataall.encode("utf-8"))
                print("web page " + str(i) + " processed")
                i += 1
            except urllib.error.URLError as e:
                if hasattr(e, 'code'):
                    print(e.code)
                if hasattr(e, "reason"):
                    print(e.reason)
                time.sleep(10)
            except Exception as e:
                print("exception:" + str(e))
                time.sleep(1)
        # note: the while loop above never breaks on its own; the control
        # thread below ends the program once the queue stays empty
        fh.close()
        html2 = '''</body>
</html>'''
        fh = open("D:/pythontest/7.html", 'ab')
        fh.write(html2.encode('utf-8'))
        fh.close()

# thread 3: control thread, reports progress and ends the program once the queue is empty
class Conrl(threading.Thread):
    def __init__(self, urlqueue):
        threading.Thread.__init__(self)
        self.urlqueue = urlqueue
    def run(self):
        while True:
            print("program running")
            time.sleep(60)
            if self.urlqueue.empty():
                print("the program is finished!")
                exit()

key = "Internet of Things"
proxy = "59.61.92.205:8118"
proxy2 = ""
pagestart = 1
pageend = 2
t1 = Geturl(key, pagestart, pageend, proxy, urlqueue)
t1.start()
t2 = Getcontent(urlqueue, proxy)
t2.start()
t3 = Conrl(urlqueue)
t3.start()
```
