This is my first blog post, so I am marking the occasion.
I had never written a technical blog before because I had nothing worth sharing. Now I feel that some of my experience and thought process is worth documenting: first, it makes later review easier for me; second, if it gives others a little help, I would be delighted.
At work I needed to analyze what the market is saying, which meant crawling forum and Weibo content for data analysis. I did not want to buy the data from a commercial site, so I studied Python crawlers and wrote a small one.
The crawler has two main parts:
1. Downloading web pages (that is, saving the page content, in HTML format, to a TXT file)
2. Parsing the desired content out of the saved text (mainly with BeautifulSoup; everyone who has used it says it is good! Novice tip: mind the capitalization!)
One more thing: the core functionality of this crawler is very simple. My real task was not data collection itself, so what I learned about crawlers is only introductory; here I just want to share my experience and code.
I mainly went for convenience, so the code is ugly; please do not laugh.
-------------------------------------------------------------------
Here is the code for the crawler's page download function.
Briefly, how it works: get the URL to download (I usually generate it by rule) -> send the request (with cookies) -> save the returned data to the specified directory.
# -*- coding: utf-8 -*-
import datetime
import re
import time

import requests
from bs4 import BeautifulSoup

cook = {"Cookie": "#your cookies#"}  # replace with your own cookies


def main_load(url, sleeptime):
    # Main download function: wait, request the page, save it to disk.
    time.sleep(sleeptime)
    try:
        html = requests.get(url, cookies=cook).content  # raw bytes of the response
        path = 'D:\\weibo\\movie\\cyx+dy'  # directory where the downloads are saved
        name = re.findall(r'(?<=keyword=).*', url)[0]  # use part of the URL as the file name
        filename = path + '\\' + name + '.txt'
        with open(filename, 'wb') as f:  # binary mode, because .content is bytes
            f.write(html)
        return True
    except Exception:
        return False


def auto_get(url, main_load, shleeptime=3):
    # If the request fails, wait a while longer and send it again;
    # give up once the wait has grown past 20 seconds.
    while not main_load(url, shleeptime):
        if shleeptime > 20:
            return False
        shleeptime += 5
        print('enlarge sleep time: ' + str(shleeptime))
    return True


for i in range(1, 4):
    # Run the search on the web page, copy the resulting link,
    # and use it to replace the URL below.
    # url = 'ime=20150101&endtime=20160501&sort=time&page=' + str(i)
    url = 'http://weibo.cn/search/mblog?hidesearchframe=&keyword=%e9%aa%91%e8%a1%8c+%e7%88%b1%e6%83%85+%e5%85%ac%e8%b7%af&advancedfilter=1&starttime=20150101&endtime=20160530&sort=time&page=' + str(i)
    print(auto_get(url, main_load, shleeptime=3))
A few notes:
1. I authenticate with cookies, so to use this crawler you must obtain your own cookies. At first I wanted to simulate the login, but I could not get it working and gave up; using cookies directly is more convenient. The cookies can be found in IE's Temp folder, which is a bit of a pain. I got mine directly from
2. The search result URLs on the WAP version of Sina Weibo follow a regular pattern, basically "fixed prefix + keyword + start time + end time + page number". This could have been done by passing parameters in the request, but I was too lazy. Simply run an advanced search yourself, choose the time range and keyword, go to the next page of results, and the general URL format shows up in the browser's address bar. In practice you only need to change the keyword: the string starting with "%e9%aa%91%e8" in the middle of the URL is the search keyword in percent-encoded form.
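The two notes above could be sketched roughly like this. It is only an illustration, not the code I actually ran: the parameter names are the ones visible in the example URL, the helper names cookie_string_to_dict and build_search_url are hypothetical, and the cookie names in the demo are made up.

```python
from urllib.parse import urlencode

# Assumption: base address taken from the example URL in the download code.
SEARCH_BASE = "http://weibo.cn/search/mblog"


def cookie_string_to_dict(raw):
    """Split a raw 'name=value; name2=value2' cookie string into a dict.

    requests accepts such a dict through its cookies= argument.
    """
    cookies = {}
    for part in raw.split(";"):
        part = part.strip()
        if "=" in part:
            # Split on the first '=' only, since values may contain '='.
            name, value = part.split("=", 1)
            cookies[name] = value
    return cookies


def build_search_url(keyword, starttime, endtime, page):
    """Assemble a search URL: fixed prefix + keyword + start/end time + page."""
    params = [
        ("hidesearchframe", ""),
        ("keyword", keyword),
        ("advancedfilter", 1),
        ("starttime", starttime),
        ("endtime", endtime),
        ("sort", "time"),
        ("page", page),
    ]
    # urlencode percent-encodes the keyword as UTF-8 and turns spaces into '+'.
    return SEARCH_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(cookie_string_to_dict("name1=abc; name2=x=y"))  # made-up cookie names
    for page in range(1, 4):
        print(build_search_url("骑行 爱情 公路", "20150101", "20160530", page))
```

The generated URLs use uppercase hex digits in the escapes (%E9 rather than %e9), which is equivalent. With a function like this, changing the keyword or the dates no longer means copying a new address out of the browser.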
This crawler is ugly, but it basically meets my needs.
Follow-up: the WAP version of Sina Weibo only returns the first 100 pages of search results, roughly 1,000 results in total, so when a search matches too much, the later results cannot be retrieved. My workaround is to shorten the search date range, for example searching month by month.
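Searching month by month could be sketched as below. This is a minimal illustration, assuming the starttime and endtime values are YYYYMMDD strings as in the example URL; the function name month_ranges is hypothetical.

```python
import calendar
from datetime import date


def month_ranges(start, end):
    """Split the period [start, end] into month-sized (starttime, endtime) pairs.

    Both values of each pair are YYYYMMDD strings.
    """
    ranges = []
    current = start
    while current <= end:
        # Last calendar day of the current month, capped at the overall end date.
        last_day = calendar.monthrange(current.year, current.month)[1]
        month_end = min(date(current.year, current.month, last_day), end)
        ranges.append((current.strftime("%Y%m%d"), month_end.strftime("%Y%m%d")))
        # Move to the first day of the next month.
        if current.month == 12:
            current = date(current.year + 1, 1, 1)
        else:
            current = date(current.year, current.month + 1, 1)
    return ranges


if __name__ == "__main__":
    for starttime, endtime in month_ranges(date(2015, 1, 1), date(2016, 5, 30)):
        print(starttime, endtime)
```

Each pair can then be placed into the starttime and endtime parts of the search URL, so every month gets its own set of up to 100 pages.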
Simple Sina Weibo crawler, Python version (download part) (Part 1)