Python Crawler Learning, Chapter 6


Hands-on: writing a Python image crawler by hand

The idea:
1. Create a custom image-crawling function, craw(url, page), responsible for crawling all of the target images under one page. The crawl process is: first read the entire source code of the page with urllib.request.urlopen(url).read(); then apply the first regular expression for a first round of filtering; then, based on that result, apply the second regular expression to extract the links of all target images on the page and store them in a list; finally, traverse the list and save each link locally with urllib.request.urlretrieve(imageurl, filename=imagename). To avoid an abnormal crash partway through the program, add exception handling: if an image cannot be crawled, x += 1 moves on to the next one.
2. Use a for loop to crawl all the pages under the category; the link can be constructed as url="http://list.jd.com/list.html?cat=23413143151&page="+str(i). Inside the for loop, i increases by 1 on each iteration, and each iteration calls the function from step 1 to crawl that page's images.

```
import re
import urllib.request
import urllib.error

def craw(url, page):
    html1 = urllib.request.urlopen(url).read()
    html1 = str(html1)
    # NOTE: the two patterns below were stripped when this article was rendered;
    # they are plausible reconstructions and must be adapted to the page's actual HTML.
    pat1 = '<div id="plist".+?<div class="page clearfix">'   # narrow the source down to the product-list block
    pat2 = r'<img width="220" height="220" data-img="1" data-lazy-img="//(.+?\.jpg)">'  # image links inside that block
    result1 = re.compile(pat1).findall(html1)
    result1 = result1[0]
    imagelist = re.compile(pat2).findall(result1)
    # imagelist = re.search(pat2, result1)
    x = 1
    for imageurl in imagelist:
        imagename = "d:/shoujitupian/img/" + str(page) + str(x) + ".jpg"
        imageurl = "http://" + imageurl
        try:
            urllib.request.urlretrieve(imageurl, filename=imagename)
        except urllib.error.URLError as e:
            # if one image cannot be fetched, just move on to the next one
            if hasattr(e, "code"):
                x += 1
            if hasattr(e, "reason"):
                x += 1
        x += 1

for i in range(1, 6):
    url = "https://list.jd.com/list.html?cat=9987,653,655&page=" + str(i)
    craw(url, i)
```
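Note that urllib.request.urlretrieve() belongs to urllib's legacy interface. A minimal sketch of the same download step using urlopen() directly (the save_image helper name is mine, not part of the original program):

```
import urllib.request
import urllib.error

def save_image(imageurl, imagename):
    # Fetch the image bytes and write them to a local file;
    # returns True on success, False if the download failed.
    try:
        data = urllib.request.urlopen(imageurl, timeout=10).read()
        with open(imagename, "wb") as f:
            f.write(data)
        return True
    except urllib.error.URLError as e:
        print("skipped", imageurl, "-", e)
        return False
```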

Link crawler
    1. Determine which portal (entry) links you want to crawl.
    2. Build a regular expression for link extraction according to your needs.
    3. Simulate a browser and crawl the corresponding Web page (see the sketch after this list).
    4. Extract the links contained in the Web page using the regular expression from step 2.
    5. Filter out duplicate links.
    6. Perform any follow-up operations, such as printing the links to the screen.
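The program below simulates a browser by installing a global opener with a custom User-Agent. A minimal sketch of the same idea using a per-request urllib.request.Request object instead (an alternative, not the approach taken in the original code):

```
import urllib.request

def fetch(url):
    # Attach a browser-like User-Agent to this single request
    # instead of installing a global opener.
    req = urllib.request.Request(url, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36"
    })
    return urllib.request.urlopen(req).read().decode("utf-8", errors="ignore")
```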
      "'author = ' My '
import re#爬取所有页面链接
import urllib.request
def getlinks(url):
headers=(‘User-Agent‘,‘Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36‘) #模拟成浏览器
opener=urllib.request.buildopener() opener.addheaders=[headers]
urllib.request.install
opener(opener)#将opener安装为全局
file=urllib.request.urlopen(url)
data=str(file.read())
pat=‘(https?://[^\s)";]+.(w|/)*)‘#根据需求构建好链接表达式
link=re.compile(pat).findall(data)
link=list(set(link))#去除重复元素
return link

linklist=getlinks(url)#获取对应网页中包含的链接地址
for link in linklist:#通过for循环分别遍历输出获取到的链接地址到屏幕上
print(link[0])
Qiushibaike ("embarrassing stories encyclopedia") crawler in practice
  1. Analyze the URL pattern across pages, construct a URL variable, and crawl multiple pages of content through a for loop.
  2. Build a custom function that crawls the posts on one page, each consisting of two parts: the user and the content the user published. The function works as follows: first simulate a browser to visit the page and, by inspecting the page source, write the format of the user-information section and of the content section as regular expressions; then use those expressions to extract all users and all contents; then use a for loop to traverse the contents and assign each one to a variable whose name follows the pattern "content" + sequence number; finally use another for loop to traverse the users and print each user together with the corresponding content. (A more idiomatic variant that avoids exec() is sketched after this list.)
  3. Obtain the URL of each page through a for loop, calling the getcontent(url, page) function once per page.
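The program below stores each content item in a dynamically named variable via exec(). A minimal sketch of a more idiomatic alternative, pairing users and contents with zip() instead (the function and variable names are mine; the regular expressions follow the same idea as the program below):

```
import re

def print_posts(data, page):
    # data is the decoded page source; the patterns must match the site's current HTML
    userpat = 'target="_blank" title="(.*?)">'
    contentpat = '<div class="content">(.*?)</div>'
    users = re.compile(userpat, re.S).findall(data)
    contents = re.compile(contentpat, re.S).findall(data)
    for i, (user, content) in enumerate(zip(users, contents), start=1):
        content = content.replace("\n", "").replace("<span>", "") \
                         .replace("</span>", "").replace("<br/>", "")
        print("user " + str(page) + str(i) + " is: " + user)
        print("content is: " + content)
```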
```
__author__ = 'My'
import re
import urllib.request

def getcontent(url, page):
    # simulate a browser
    headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    # install the opener globally
    urllib.request.install_opener(opener)
    file = urllib.request.urlopen(url)
    data = str(file.read().decode("utf-8"))
    # regular expression for extracting the user
    userpat = 'target="_blank" title="(.*?)">'
    # regular expression for extracting the content
    contentpat = '<div class="content">(.*?)</div>'
    userlist = re.compile(userpat, re.S).findall(data)
    contentlist = re.compile(contentpat, re.S).findall(data)
    x = 1
    for content in contentlist:
        content = content.replace("\n", "")
        content = content.replace("<span>", "")
        content = content.replace("</span>", "")
        content = content.replace("<br/>", "")
        # store each content item in a variable named "content" + sequence number
        name = "content" + str(x)
        exec(name + '=content')
        x += 1
    y = 1
    for user in userlist:
        name = "content" + str(y)
        print("user " + str(page) + str(y) + " is: " + user)
        print("content is:")
        exec("print(" + name + ")")
        print("\n")
        y += 1

for i in range(1, 10):
    url = "https://www.qiushibaike.com/8hr/page/" + str(i)
    getcontent(url, i)
```
WeChat article crawler implementation
    1. Create three custom functions: one that crawls a specified URL through a proxy server and returns the crawled data, one that obtains the article links of multiple result pages, and one that crawls the title and content of an article from its link and writes them to a file.
    2. Crawling a specified URL through a proxy server was covered in Chapter 4; to keep a single failure from interrupting the whole program, wrap that step in an exception-handling mechanism.
    3. To obtain the article links of multiple pages, encode the search keyword with urllib.request.quote(key), build each article-list page URL from it, and crawl the article links of each page inside a for loop by calling the proxy-based function set up in step 2.
    4. To crawl the title and content from each article link and write them to a file, loop over the (real) article URLs obtained in step 3, extract the parts we care about with regular expressions, and write them to the file after each crawl.
    5. If an exception occurs, the code should back off, i.e. wait for a while before attempting the next operation (see the sketch after this list). To implement the delay, import the time module and use time.sleep(); for example, time.sleep(7) pauses for 7 seconds.
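As a minimal illustration of step 5, a hedged sketch of a fetch-with-retry helper (the helper name and retry count are mine; the full program below simply sleeps inside its exception handlers):

```
import time
import urllib.request
import urllib.error

def fetch_with_retry(url, retries=3, delay=7):
    # Try the request up to `retries` times, sleeping `delay` seconds
    # between attempts; return None if every attempt fails.
    for attempt in range(retries):
        try:
            return urllib.request.urlopen(url, timeout=10).read().decode("utf-8")
        except urllib.error.URLError as e:
            print("attempt", attempt + 1, "failed:", e)
            time.sleep(delay)
    return None
```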
```
__author__ = 'My'
import re
import urllib.request
import urllib.error
import time

# simulate a browser
headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')
opener = urllib.request.build_opener()
opener.addheaders = [headers]
# install the opener globally
urllib.request.install_opener(opener)
# set up a list listurl to store the article links
listurl = []

# crawl through a proxy IP
def use_proxy(proxy_addr, url):
    # exception-handling mechanism
    try:
        import urllib.request
        proxy = urllib.request.ProxyHandler({'http': proxy_addr})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        data = urllib.request.urlopen(url).read().decode('utf-8')
        return data
    except urllib.error.URLError as e:
        if hasattr(e, 'code'):
            print(e.code)
        if hasattr(e, 'reason'):
            print(e.reason)
        time.sleep(10)
    except Exception as e:
        print('exception:' + str(e))
        time.sleep(1)

# search the article-list pages
def getlisturl(key, pagestart, pageend, proxy):
    try:
        page = pagestart
        # encode the keyword key
        keycode = urllib.request.quote(key)
        # encode "&page"
        pagecode = urllib.request.quote('&page')
        # crawl the article links of each page in a loop
        for page in range(pagestart, pageend + 1):
            # build each page's url, one per loop iteration, e.g.
            # http://weixin.sogou.com/weixin?query=物联网&type=2&page=2
            url = "http://weixin.sogou.com/weixin?type=2&query=" + keycode + pagecode + str(page)
            # crawl through the proxy server to avoid having our own IP blocked
            data1 = use_proxy(proxy, url)
            print(data1)
            # regular expression for the article links; the HTML-tag part of the pattern
            # was stripped when this article was rendered, so this is a plausible
            # reconstruction that must match the site's actual HTML
            listurlpat = '<div class="txt-box">.*?(http://.*?)"'
            # collect every article link of the page into the list listurl
            print(re.compile(listurlpat, re.S).findall(data1))
            listurl.append(re.compile(listurlpat, re.S).findall(data1))
        print("got " + str(len(listurl)) + " pages")
        return listurl
    except urllib.error.URLError as e:
        if hasattr(e, 'code'):
            print(e.code)
        if hasattr(e, 'reason'):
            print(e.reason)
        time.sleep(10)
    except Exception as e:
        print('exception:' + str(e))
        time.sleep(1)

# save the pages
def getcontent(listurl, proxy):
    i = 0
    # html header; the original markup was stripped when this article was rendered,
    # so a minimal skeleton is reconstructed here
    html1 = '''<!DOCTYPE html>
<html>
<head><meta charset="utf-8"></head>
<body>'''
    fh = open("d:/111.html", "wb")
    fh.write(html1.encode('utf-8'))
    fh.close()
    # reopen the file in append mode to write the content of each article
    fh = open('d:/111.html', 'ab')
    # at this point listurl is a two-dimensional list: one list of links per page
    print(listurl)
    for i in range(0, len(listurl)):
        for j in range(0, len(listurl[i])):
            try:
                url = listurl[i][j]
                url = url.replace('amp;', '')
                data = use_proxy(proxy, url)
                # the two patterns below were stripped when this article was rendered;
                # plausible reconstructions: the title from <title>, the body between
                # the js_content and js_sg_bar anchors
                titlepat = '<title>(.*?)</title>'
                contentpat = 'id="js_content">(.*?)id="js_sg_bar"'
                title = re.compile(titlepat).findall(data)
                content = re.compile(contentpat, re.S).findall(data)
                thistitle = "not obtained this time"
                thiscontent = "not obtained this time"
                if title != []:
                    thistitle = title[0]
                if content != []:
                    thiscontent = content[0]
                dataall = "<p>The title is: " + thistitle + "</p><p>The content is: " + thiscontent + "</p><br>"
                fh.write(dataall.encode('utf-8'))
                print("page " + str(i) + ", item " + str(j) + " processed")
            except urllib.error.URLError as e:
                if hasattr(e, 'code'):
                    print(e.code)
                if hasattr(e, 'reason'):
                    print(e.reason)
                time.sleep(10)
            except Exception as e:
                print('exception:' + str(e))
                time.sleep(1)
    fh.close()
    # html footer
    html2 = '''</body>
</html>'''
    fh = open("d:/111.html", 'ab')
    fh.write(html2.encode('utf-8'))
    fh.close()

key = "物联网"  # i.e. "Internet of Things"
proxy = "125.115.183.26:808"
proxy2 = ""
pagestart = 1
pageend = 2
listurl = getlisturl(key, pagestart, pageend, proxy)
getcontent(listurl, proxy)
```
Multi-threaded Crawler

A small multithreading program

```
import threading

class A(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
    def run(self):
        for i in range(10):
            print('I am thread A')

class B(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
    def run(self):
        for i in range(10):
            print('I am thread B')

t1 = A()
t2 = B()
t1.start()
t2.start()
```
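As a design note, subclassing threading.Thread is not required for simple cases; a minimal sketch of the same two threads using the target argument instead (an alternative, not the approach used above):

```
import threading

def say(name):
    # each thread repeatedly prints its own name
    for _ in range(10):
        print('I am thread ' + name)

t1 = threading.Thread(target=say, args=('A',))
t2 = threading.Thread(target=say, args=('B',))
t1.start()
t2.start()
t1.join()
t2.join()
```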
Using queues

```
import queue

a = queue.Queue()
a.put('hello')
a.put('wj')
a.put('like')
a.put('study')
a.task_done()           # marks one task as done; normally paired with a completed get()
print(a.qsize())        # 4 items are in the queue
print(a.get())
print(a.get())
print(a.get())
print(a.get())
print(a.qsize())        # 0: the queue is now empty
print(a.get(), '----')  # get() on an empty queue blocks until another thread puts an item
```
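queue.Queue is thread-safe, and task_done() is normally called once per completed get() so that join() can wait for all items to be processed; in the demonstration above it is called only once, and the final get() would block on the empty queue. A minimal hedged producer/consumer sketch of the usual pairing (names are mine):

```
import queue
import threading

q = queue.Queue()

def worker():
    # consume items until the sentinel None arrives
    while True:
        item = q.get()
        if item is None:
            q.task_done()
            break
        print('processing', item)
        q.task_done()      # one task_done() per completed get()

t = threading.Thread(target=worker)
t.start()
for item in ['hello', 'wj', 'like', 'study']:
    q.put(item)
q.put(None)                # sentinel: tell the worker to stop
q.join()                   # blocks until every item has been marked done
t.join()
```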
Converting the small program to multiple threads to improve efficiency

```
# -*- coding: utf-8 -*-
"""
Created on Sat Apr 22 10:25:08 2017

@author: My
"""

import threading
import queue
import re
import urllib.request
import time
import urllib.error

urlqueue = queue.Queue()
# simulate a browser
headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)
listurl = []

def use_proxy(proxy_addr, url):
    try:
        import urllib.request
        proxy = urllib.request.ProxyHandler({'http': proxy_addr})
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        print(url)
        data = urllib.request.urlopen(url).read().decode('utf-8')
        return data
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        time.sleep(10)
    except Exception as e:
        print("exception:" + str(e))
        time.sleep(1)

# thread 1: collect the article links and put them into the queue
class Geturl(threading.Thread):
    def __init__(self, key, pagestart, pageend, proxy, urlqueue):
        threading.Thread.__init__(self)
        self.pagestart = pagestart
        self.pageend = pageend
        self.proxy = proxy
        self.urlqueue = urlqueue
        self.key = key
    def run(self):
        page = self.pagestart
        keycode = urllib.request.quote(self.key)
        pagecode = urllib.request.quote("&page")
        for page in range(self.pagestart, self.pageend + 1):
            url = "http://weixin.sogou.com/weixin?type=2&query=" + keycode + pagecode + str(page)
            print(url)
            data1 = use_proxy(self.proxy, url)
            # the HTML-tag part of this pattern was stripped when this article was
            # rendered; plausible reconstruction, adapt to the site's actual HTML
            listurlpat = '<div class="txt-box">.*?(http://.*?)"'
            listurl.append(re.compile(listurlpat, re.S).findall(data1))
        print("got: " + str(len(listurl)) + " pages")
        for i in range(0, len(listurl)):
            time.sleep(7)
            for j in range(0, len(listurl[i])):
                try:
                    url = listurl[i][j]
                    url = url.replace("amp;", "")
                    print("enqueuing i=" + str(i) + ", j=" + str(j))
                    self.urlqueue.put(url)
                    self.urlqueue.task_done()
                except urllib.error.URLError as e:
                    if hasattr(e, "code"):
                        print(e.code)
                    if hasattr(e, "reason"):
                        print(e.reason)
                    time.sleep(10)
                except Exception as e:
                    print("exception:" + str(e))
                    time.sleep(1)

# thread 2: take links from the queue, crawl each article and write it to a file
class GetContent(threading.Thread):
    def __init__(self, urlqueue, proxy):
        threading.Thread.__init__(self)
        self.urlqueue = urlqueue
        self.proxy = proxy
    def run(self):
        # html header; part of the original markup was stripped when this article
        # was rendered, so a minimal skeleton is reconstructed around the lines that survived
        html1 = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>'''
        fh = open("D:/pythontest/7.html", 'wb')
        fh.write(html1.encode('utf-8'))
        fh.close()
        fh = open("D:/pythontest/7.html", 'ab')
        i = 1
        while True:
            try:
                url = self.urlqueue.get()
                data = use_proxy(self.proxy, url)
                # patterns reconstructed as in the single-threaded version above
                titlepat = '<title>(.*?)</title>'
                contentpat = 'id="js_content">(.*?)id="js_sg_bar"'
                title = re.compile(titlepat).findall(data)
                content = re.compile(contentpat, re.S).findall(data)
                thistitle = "not obtained this time"
                thiscontent = "not obtained this time"
                if title != []:
                    thistitle = title[0]
                if content != []:
                    thiscontent = content[0]
                dataall = "<p>The title is: " + thistitle + "</p><p>The content is: " + thiscontent + "</p><br>"
                fh.write(dataall.encode("utf-8"))
                print("web page " + str(i) + " processed")
                i += 1
            except urllib.error.URLError as e:
                if hasattr(e, 'code'):
                    print(e.code)
                if hasattr(e, "reason"):
                    print(e.reason)
                time.sleep(10)
            except Exception as e:
                print("exception:" + str(e))
                time.sleep(1)
        # (the loop above never breaks, so in this design the footer below is not reached)
        fh.close()
        html2 = '''</body>
</html>'''
        fh = open("D:/pythontest/7.html", 'ab')
        fh.write(html2.encode('utf-8'))
        fh.close()

# control thread: end itself once the queue stays empty
class Conrl(threading.Thread):
    def __init__(self, urlqueue):
        threading.Thread.__init__(self)
        self.urlqueue = urlqueue
    def run(self):
        while True:
            print("program running")
            time.sleep(60)
            if self.urlqueue.empty():
                print("program finished!")
                exit()

key = "物联网"  # i.e. "Internet of Things"
proxy = "59.61.92.205:8118"
proxy2 = ""
pagestart = 1
pageend = 2
t1 = Geturl(key, pagestart, pageend, proxy, urlqueue)
t1.start()
t2 = GetContent(urlqueue, proxy)
t2.start()
t3 = Conrl(urlqueue)
t3.start()
```
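A closing design note: the control thread above decides when the crawl is over by polling the queue and then ends itself with exit(); another common way to have the main program wait for its workers is Thread.join(). A minimal self-contained sketch (names and timings are mine, not from the program above):

```
import threading
import time

def work(name, seconds):
    # simulate a worker that takes some time to finish
    time.sleep(seconds)
    print(name, "done")

t1 = threading.Thread(target=work, args=("collector", 1))
t2 = threading.Thread(target=work, args=("writer", 2))
t1.start()
t2.start()
# join() blocks the main thread until each worker has finished,
# so the final message is guaranteed to print last
t1.join()
t2.join()
print("all threads finished")
```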
