Simple Sina Weibo crawler-python version-(download part)---(ON)

Last Update:2016-05-30 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Write a blog for the first time, Mark.

I have not written technology to share the blog is also because there is no good to share, now feel that some experience and thinking process is worth documenting, a convenient for later review, two if can give others a little help, it is incomparably xingshen.

This is because the job needs to do some market sound analysis, need to crawl some forum content, micro-blog content to do data analysis, do not bother to find other profitable website to buy, I have studied the Python crawler, write a small crawler

Reptiles are divided into two main parts:

1, download the Web page function (that is, the Web content in HTML format in the TXT text)

2, from the saved text to parse out the desired content (mainly used to beautifulsoup, used to speak!) Novice Tip: Note case! ）

Other: In fact, the main function of the crawler is still very simple, because the main task is not to obtain data on the crawler learning is just a primer, here just want to share experience, code.

The main is easy to figure, code is ugly, do not joke

-------------------------------------------------------------------

Here is the code of the crawler's download page function:

Simply explain how it works: Get the URL you want to download (which I usually generate by rules)--Initiate a request (with cookies)--Save the returned data to the specified directory

#-*-coding;utf-8-*-import requests,re,time,datetimefrom BS4 Import beautifulsoupcook={"Cookie": "#your cookies#"}def Main_load (url,sleeptime): #主下载函数 time.sleep (sleeptime) try:html=requests.get (url,cookies=cook). Content #获得 Request result path= ' D:\\weibo\\movie\\cyx+dy ' #保存下载内容到你指定的文件目录 name=re.findall (R ' (? <=keyword=). * ', URL ) [0] #截取url的一部分作为文件名 filename=path+ ' \ \ ' +name+ '. txt ' r=open (filename, ' W ') r.write (HTML) r.clo SE () return True Except:return falsedef auto_get (url,main_load,shleeptime=3): #如果请求不成功就等待一段时间, and then Secondary initiation Request While not main_load (url,shleeptime): if shleeptime>20:break shleeptime + = 5 p Rint ' Enlarge sleep Time: ' +str (shleeptime) for I in Range (1,4): #在网页中搜索结果, get a link u RL, replace the following URL, let #url = ' ime=20150101&endtime=20160501&sort=time&page= ' +str (i) url1= ' http://weibo.cn/ Search/mblog?hidesearchframe=&keyword=%e9%aa%91%e8%a1%8c+%e7%88%b1%e6%83%85+%e5%85%ac%e8%b7%af&advancedfilter=1&starttime=20150101& Endtime=20160530&sort=time&page=1 ' Print auto_get (url,main_load,shleeptime=3)

A few notes:

1, I use the authentication method is a cookie, so to use this bot must obtain their own cookies, at first I wanted to simulate the landing, and later did not get up and give up, directly with the cookies convenient. Cookies can be found in the Temp folder of IE, a bit of a pit. I was directly from

2, Sina Weibo wap end of the search results page is a regular, basically is "fixed start +keyword+ start time + End time + page" format, originally this process can be passed through the parameters in the request to achieve, I am too lazy. Simply click on your own advanced search, choose the time keyword, search results Select the next page, in the browser address bar can see the general address format. In fact, as long as a change keyword on the line, the middle of the string "%e9%aa%91%e8" is actually search keywords.

This reptile is ugly, but it basically satisfies my own needs.

Follow-up: Sina Weibo wap only returns the first 100 pages of the search, probably only 1000 results, so if the search content too much behind the section can not be obtained. My solution is to shorten the search date, such as monthly search.

’

Simple Sina Weibo crawler-python version-(download part)---(ON)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Simple Sina Weibo crawler-python version-(download part)---(ON)

Contact Us

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support