Python method to extract Baidu search results

Source: Internet
Author: User

This article shows how to extract Baidu search results with Python. It is shared here for reference. The implementation is as follows:


    # coding=utf8
    import urllib2
    import string
    import urllib
    import re
    import random

    # Several user agents, intended to keep Baidu from rate-limiting the IP
    user_agents = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533+ (KHTML, like Gecko) Element Browser 5.0',
        'IBM WebExplorer /v0.94',
        'Galaxy/1.0 [en] (Mac OS X 10.5.6; U; en)',
        'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
        'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
        'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; TheWorld)',
    ]

    def baidu_search(keyword, pn):
        # Fetch one result page; pn is the result offset, rn the page size
        p = {'wd': keyword}
        # Format the URL string before opening it, not the response object
        url = ("http://www.baidu.com/s?" + urllib.urlencode(p) + "&pn={0}&cl=3&rn=100").format(pn)
        res = urllib2.urlopen(url)
        html = res.read()
        return html

    def getlist(regex, text):
        # Return every match of regex in text as a list
        arr = []
        res = re.findall(regex, text)
        if res:
            for r in res:
                arr.append(r)
        return arr

    def getmatch(regex, text):
        # Return the first match of regex in text, or an empty string
        res = re.findall(regex, text)
        if res:
            return res[0]
        return ""

    def cleartag(text):
        # Strip HTML tags from text
        p = re.compile(u'<[^>]+>')
        retval = p.sub("", text)
        return retval

    def geturl(keyword):
        for page in range(...):  # the page count is missing from the source
            pn = page * 100 + 1
            html = baidu_search(keyword, pn)
            content = unicode(html, 'utf-8', 'ignore')
            arrlist = getlist(u"<table.*?class=\"result\".*?>.*?</a>", content)
            for item in arrlist:
                # The listing is cut off in the source at this point: regex = u"
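The listing defines user_agents but, in the portion that survives, never passes one to the request. The following is a minimal sketch of how a randomly chosen user agent might be attached, not the author's implementation. It assumes Python 3 (urllib.request and urllib.parse in place of urllib2 and urllib). The names USER_AGENTS, build_request and fetch are hypothetical. The query parameters wd, pn, cl and rn are taken from the listing; whether Baidu still honors them is not verified here.

```python
import random
import urllib.parse
import urllib.request

# Two of the user agent strings from the article's list; any subset works.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0",
    "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14",
]


def build_request(keyword, pn):
    """Build (but do not send) a search request with a random User-Agent."""
    # Same parameters as the article's URL: wd = keyword, pn = result offset,
    # rn = results per page, cl = 3.
    query = urllib.parse.urlencode({"wd": keyword, "pn": pn, "cl": 3, "rn": 100})
    url = "http://www.baidu.com/s?" + query
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)


def fetch(keyword, pn, timeout=10):
    """Send the request and return the page decoded as UTF-8 (hypothetical helper)."""
    with urllib.request.urlopen(build_request(keyword, pn), timeout=timeout) as res:
        return res.read().decode("utf-8", "ignore")
```

Calling build_request("python", 1) only constructs the request object, so the header choice can be inspected without touching the network; fetch performs the actual download.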

I hope this article will help you with your Python programming.
