[Study Notes] Scraping the real URLs behind Baidu search results with Python

Last updated: 2017-09-08. Source: Internet

Uploaded by: User


Tags: python

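Today I needed a pile of test URLs to run a script against. Hunting them down and copy-pasting them one by one is not how a programmer works, so I wrote a script instead.

Environment: Python 2.7

Editor: Sublime Text 3

1. Analysis

First, many thanks to Baidu for organising its result URLs so neatly: they all share a single CSS class.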

[Screenshot: QQ20170908143211.png]

That class is c-showurl, so it is enough to pick out the links by CSS class, which BeautifulSoup can do. The code:

        soup = BeautifulSoup(content, 'html.parser')
        urls = soup.find_all('a', class_='c-showurl')
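To see what this class-based selection does without BeautifulSoup or a network connection, here is a minimal sketch that uses the standard library's HTMLParser instead. It is a swapped-in stand-in, not the method used in this post, and it is written for Python 3. The sample HTML and the link values in it are made up for illustration.

```python
from html.parser import HTMLParser


class ShowUrlCollector(HTMLParser):
    """Collect the href of every <a> tag whose class list contains c-showurl."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr = dict(attrs)
        # A class attribute may hold several space-separated class names.
        classes = (attr.get('class') or '').split()
        if 'c-showurl' in classes and attr.get('href'):
            self.hrefs.append(attr['href'])


# Hypothetical page fragment; the link values are placeholders.
sample_html = '''
<div class="result">
  <a class="c-showurl" href="http://www.baidu.com/link?url=abc123">example.com/</a>
  <a class="title" href="http://www.baidu.com/link?url=zzz">Some title</a>
  <a class="c-showurl other" href="http://www.baidu.com/link?url=def456">example.org/</a>
</div>
'''

collector = ShowUrlCollector()
collector.feed(sample_html)
print(collector.hrefs)
```

Only the first and third links are collected, because the second one does not carry the c-showurl class.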

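The other problem is that Baidu encrypts the URLs. To get the real URL, my approach is to request the encrypted URL once and then read the URL of the page that the request ends up on; the URL obtained at that point is the real one.

The full code: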
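The idea of visiting a link and reading back the address of the page finally reached can be tried offline. The sketch below is written for Python 3 and uses only the standard library: a toy local server plays the role of the redirecting link, and urllib's geturl() plays the role that res.url plays with requests. The paths /link and /real-page, the handler and the resolve() helper are all hypothetical names chosen for the demo.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class RedirectHandler(BaseHTTPRequestHandler):
    """Toy server: /link... redirects to /real-page, like an obfuscated link."""

    def do_GET(self):
        if self.path.startswith('/link'):
            # Send the client on to the real page.
            self.send_response(302)
            self.send_header('Location', '/real-page')
            self.end_headers()
        else:
            body = b'real content'
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


# Ignore any system proxy so the local demo server is reached directly.
opener = build_opener(ProxyHandler({}))


def resolve(url, timeout=10):
    """Follow redirects and return the address of the page finally reached."""
    response = opener.open(url, timeout=timeout)
    try:
        return response.geturl()
    finally:
        response.close()


server = HTTPServer(('127.0.0.1', 0), RedirectHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_port

final_url = resolve(base + '/link?url=abc')
print(final_url)  # ends in /real-page, not /link?url=abc

server.shutdown()
server.server_close()
```

The client never has to decode the link itself; it simply lets the HTTP library follow the redirect and asks where it ended up.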

    # -*- coding: utf-8 -*-
    import time

    import requests
    from bs4 import BeautifulSoup

    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 QIHU 360SE',
    }

    page_start = raw_input('please input start page\n')
    page_end = raw_input('please input end page\n')
    word = raw_input('please input keyword\n')

    # The pn parameter is a result offset: page 1 -> 0, page 2 -> 10, ...
    page_start = (int(page_start) - 1) * 10
    # range() excludes its stop value, so stop one page past the end page
    page_end = int(page_end) * 10

    for i in range(page_start, page_end, 10):
        url = 'http://www.baidu.com/s?wd=' + word + '&pn=' + str(i)
        try:
            response = requests.get(url, headers=headers, timeout=10)
            print('downloading...' + url)
            content = response.content
            soup = BeautifulSoup(content, 'html.parser')
            urls = soup.find_all('a', class_='c-showurl')
            for href in urls:
                a = href['href']
                try:
                    # requests follows the redirect; res.url is the final address
                    res = requests.get(a, headers=headers, timeout=10)
                    with open('urls.txt', 'a') as f:
                        f.write(res.url)
                        f.write('\n')
                    time.sleep(1)  # pause between requests
                except Exception as e:
                    print(e)
        except Exception as e:
            print(e)
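The script builds each search URL by string concatenation, which can break when the keyword contains spaces or non-ASCII characters. A small sketch of the same page-to-offset arithmetic with the query string properly encoded follows. It is written for Python 3; search_urls() is a hypothetical helper, and treating the end page as inclusive is an assumption of this sketch.

```python
from urllib.parse import urlencode


def search_urls(word, first_page, last_page, per_page=10):
    """Return one search URL per result page, first_page..last_page inclusive."""
    urls = []
    for page in range(first_page, last_page + 1):
        # pn is the result offset: page 1 -> 0, page 2 -> 10, ...
        query = urlencode({'wd': word, 'pn': (page - 1) * per_page})
        urls.append('http://www.baidu.com/s?' + query)
    return urls


for u in search_urls('hello world', 1, 3):
    print(u)
```

urlencode() takes care of escaping, so the keyword can be passed in exactly as the user typed it.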

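Of course, this is only a simple version. If you need to crawl a large number of URLs, I recommend using threads; otherwise you will be waiting until the end of time. I crawled around a hundred URLs, and in my own testing it worked fine.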
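One possible shape for the threaded version is a thread pool that resolves many links at once. The sketch below is written for Python 3 (concurrent.futures is not in the Python 2.7 standard library). resolve_all() and fake_resolve() are hypothetical names; fake_resolve() only simulates a slow network call so the example runs offline, and in real use it would be replaced by a function that requests the link and returns the final URL.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def resolve_all(links, resolve, max_workers=8):
    """Run resolve() on every link in a thread pool; a failed link yields None."""

    def safe(link):
        try:
            return resolve(link)
        except Exception:
            return None  # one bad link should not abort the whole batch

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input links.
        return list(pool.map(safe, links))


def fake_resolve(link):
    """Stand-in for a network request: sleeps, then returns a made-up URL."""
    time.sleep(0.1)
    if link.startswith('bad'):
        raise ValueError('cannot resolve ' + link)
    return link.replace('encrypted', 'real')


results = resolve_all(['encrypted-1', 'bad-2', 'encrypted-3'], fake_resolve)
print(results)
```

Because the work is dominated by waiting on the network, threads help here even in Python; the pool size is a parameter so that the crawl rate can be kept modest.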

This article comes from the “踟躕” blog. Please keep this attribution: http://chichu.blog.51cto.com/11287515/1963693
