This article shows how to extract Baidu search results with Python. It is shared here for your reference. The implementation is as follows:
# coding=utf8
import urllib2
import string
import urllib
import re
import random

# Set several user agents so that Baidu does not block the IP
user_agents = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533+ (KHTML, like Gecko) Element Browser 5.0',
    'IBM WebExplorer /v0.94',
    'Galaxy/1.0 [en] (Mac OS X 10.5.6; U; en)',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
    'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
    'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; TheWorld)',
]

def baidu_search(keyword, pn):
    # Fetch one page of results; pn is the offset of the first result
    p = {'wd': keyword}
    # format() must be applied to the URL string, not to the response object
    res = urllib2.urlopen(("http://www.baidu.com/s?" + urllib.urlencode(p) + "&pn={0}&cl=3&rn=100").format(pn))
    html = res.read()
    return html

def getlist(regex, text):
    # Return all matches of regex in text as a list
    arr = []
    res = re.findall(regex, text)
    if res:
        for r in res:
            arr.append(r)
    return arr

def getmatch(regex, text):
    # Return the first match of regex in text, or an empty string
    res = re.findall(regex, text)
    if res:
        return res[0]
    return ''

def cleartag(text):
    # Strip HTML tags from text
    p = re.compile(u'<[^>]+>')
    retval = p.sub('', text)
    return retval

def geturl(keyword):
    for page in range(...):  # the page count is missing in the source listing
        pn = page * 100 + 1
        html = baidu_search(keyword, pn)
        content = unicode(html, 'utf-8', 'ignore')
        arrlist = getlist(u"<table.*?class=\"result\".*?>.*?</a>", content)
        for item in arrlist:
            regex = u"...  # the source listing is truncated here
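The listing above targets Python 2 (urllib2, unicode). As a rough illustration only, the same steps (building the query URL, choosing a random user agent, extracting result blocks with regular expressions, and stripping tags) might be sketched in Python 3 as follows. The names build_search_url and make_request, the shortened USER_AGENTS list, the sample HTML, and the link regex are assumptions made for this example and do not come from the original; Baidu's real markup may differ, and the sketch deliberately performs no network request.

```python
import random
import re
import urllib.request
from urllib.parse import urlencode

# Shortened, illustrative list; the original defines ten user agents.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0",
    "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14",
]


def build_search_url(keyword, pn, rn=100):
    """Build the search URL for `keyword`, starting at result offset `pn`."""
    query = urlencode({"wd": keyword})
    return "http://www.baidu.com/s?{0}&pn={1}&cl=3&rn={2}".format(query, pn, rn)


def make_request(url):
    """Prepare (but do not send) a request with a randomly chosen user agent."""
    return urllib.request.Request(
        url, headers={"User-Agent": random.choice(USER_AGENTS)}
    )


def get_list(regex, text):
    """Return every match of `regex` in `text` (empty list if none)."""
    return re.findall(regex, text)


def get_match(regex, text):
    """Return the first match of `regex` in `text`, or '' if there is none."""
    res = re.findall(regex, text)
    return res[0] if res else ""


def clear_tag(text):
    """Remove HTML tags from `text`."""
    return re.sub(r"<[^>]+>", "", text)


# Hypothetical result markup, used only to exercise the helpers offline.
sample_html = (
    '<table class="result"><tr><td><h3 class="t">'
    '<a href="http://example.com/">Example <em>title</em></a>'
    "</h3></td></tr></table>"
)

request = make_request(build_search_url("python", 1))

# Same block pattern as the listing above; the link pattern is an assumption.
blocks = get_list(r'<table.*?class="result".*?>.*?</a>', sample_html)
link = get_match(r'<a.*?href="(.*?)".*?>(.*?)</a>', blocks[0])
url = link[0]
title = clear_tag(link[1])
print(url, title)
```

Because the link pattern has two capture groups, re.findall returns tuples, so get_match yields a (url, title) pair here. Parsing HTML with regular expressions is fragile; a dedicated HTML parser would likely be more robust if the page layout changes.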
I hope this article helps you with your Python programming.