Sina Weibo crawler design (python version)


I recently started a project involving Sina Weibo, and one part of it is to build a crawler for Sina Weibo. Although I had read through the Python learning manual and Core Python Programming, I was still at a loss when faced with an actual project, so I dug up a lot of material online. Broadly speaking there are two ways to get data from Sina Weibo: a pure crawler, or Sina's official API.

To use the API you need to apply for a Sina developer account. The process is a bit involved; the end goal is to obtain Sina's App_key and App_secret, guide the user through authorization to get an Access_token, and then call the API with it. Although the API Sina provides is convenient for developers, the restrictions are heavy. First, the user must authorize your application: since Sina upgraded its interfaces, many of them only work for authorized users; friends_timeline and user_timeline, which fetch a user's weibos, are examples, and many other interfaces carry the same restriction. Second, the Access_token has an expiry period: five years for the developer's own account, but for other users only one day at the test authorization level and seven days normally (see the official documentation for details). In other words, once the token expires the user has to re-authorize, which is very troublesome. On top of that, the call frequency of the interfaces is also rate-limited. For our project in particular, the API is not very convenient when the information we need is tied to geographic location. So although I spent a lot of time studying the API, I finally gave it up.
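For reference, here is a minimal sketch of the authorization flow described above, using Python 2's standard library. The endpoint URLs follow Sina's OAuth2 documentation as I recall it, and APP_KEY, APP_SECRET and REDIRECT_URI are placeholders you would get from the developer site; this project does not actually use this path.

    # -*- coding: utf-8 -*-
    # Rough sketch of the OAuth2 flow (not used by this crawler).
    import urllib, urllib2, json

    APP_KEY = 'your_app_key'            # placeholder
    APP_SECRET = 'your_app_secret'      # placeholder
    REDIRECT_URI = 'http://your.callback/url'

    # Step 1: send the user to the authorize page; Sina redirects back with ?code=...
    authorize_url = 'https://api.weibo.com/oauth2/authorize?' + urllib.urlencode({
        'client_id': APP_KEY, 'redirect_uri': REDIRECT_URI, 'response_type': 'code'})
    print 'Open this URL in a browser and authorize:', authorize_url
    code = raw_input('Paste the code parameter from the callback URL: ')

    # Step 2: exchange the code for an Access_token
    token = json.loads(urllib2.urlopen('https://api.weibo.com/oauth2/access_token',
        urllib.urlencode({'client_id': APP_KEY, 'client_secret': APP_SECRET,
                          'grant_type': 'authorization_code', 'code': code,
                          'redirect_uri': REDIRECT_URI})).read())

    # Step 3: call an API such as user_timeline with the token
    timeline = json.loads(urllib2.urlopen(
        'https://api.weibo.com/2/statuses/user_timeline.json?' +
        urllib.urlencode({'access_token': token['access_token']})).read())
    print 'fetched %d statuses' % len(timeline.get('statuses', []))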

In the end I decided to go with a crawler. Thanks to the following article for giving me the basic idea:

http://blog.csdn.net/codingmirai/article/details/17754645

The author of that article uses Java, but his approach gave me a good starting point. He did not use simulated login; he rotated proxy IPs instead. However, in his latest post he notes that after Sina's upgrade proxy IPs no longer work and simulated login is the only option. So I would also like to thank the following article:

http://www.jb51.net/article/44779.htm

It solved the simulated login problem for me. The main idea is to log in programmatically and save the cookies, which I will come back to later.
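The cookie part is just standard urllib2 plumbing. A minimal sketch, assuming the login itself (the POST to Sina's SSO endpoint) is handled elsewhere, e.g. by weibologin.py:

    # -*- coding: utf-8 -*-
    # Install a cookie-aware opener once; every later urllib2.urlopen() call
    # then reuses the session cookies obtained at login time.
    import urllib2, cookielib

    cookie_jar = cookielib.LWPCookieJar('weibo_cookies.txt')
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
    urllib2.install_opener(opener)      # urlopen() now carries the cookies

    # ... perform the simulated login POST here; cookie_jar then holds the
    # session cookies, so search pages are fetched as a logged-in user ...
    # cookie_jar.save(ignore_discard=True)   # optional: persist between runs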

Of course there were still plenty of problems in practice, for example parsing the HTML with regular expressions: in the page source of Sina Weibo the text appears in escaped Unicode form, i.e. as sequences like '\u4f60\u7684', which is painful to read and at one point nearly drove me mad. I will describe the concrete parsing method later in this article, together with the source code.
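Decoding those sequences turns out to be a one-liner in Python 2; a small, self-contained illustration (the sample string is the escaped form shown above plus two more characters):

    # -*- coding: utf-8 -*-
    # '\u4f60\u7684...' exactly as it appears literally in the page source
    raw = r'\u4f60\u7684\u5fae\u535a'
    print raw.decode('unicode_escape')   # -> 你的微博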


By the way, my environment is Linux with Python 2.7. All of the source code was tested in this environment; if you are on a different system or on Python 3.x, adapt it yourself...


This is the V1 version; it may be revised later, and a graphical interface may be added...

As a first attempt it inevitably has shortcomings; comments and suggestions are welcome.


There are not many files. main.py: the main program; matcher.py: parses the HTML; weibologin.py, weiboencode.py and weibosearch.py: handle the simulated login. There is also a userlists file holding usernames and passwords; it is a (still imperfect) workaround for Sina's anti-crawler measures, which I describe below.
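main.py simply reads userlists line by line and splits each line on whitespace, so the file looks something like this (made-up accounts for illustration):

    account_one@example.com  password1
    account_two@example.com  password2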


Main function (main.py):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from weibologin import WeiboLogin
import re
import urllib2
import matcher

def main():
    urlheader = 'http://s.weibo.com/weibo/'
    para = raw_input('Please enter search keyword:\n')
    page = 1
    userlists = open('userlists').readlines()
    reg1 = re.compile(r'\\u4f60\\u7684\\u884c\\u4e3a\\u6709\\u4e9b\\u5f02\\u5e38\\uff0c\\u8bf7\\u8f93\\u5165\\u9a8c\\u8bc1\\u7801\\uff1a')
    # "Your behaviour is a bit abnormal, please enter a captcha"
    reg2 = re.compile(r'\\u62b1\\u6b49\\uff0c\\u672a\\u627e\\u5230')
    # "Sorry, no results found"
    for userlist in userlists:
        username = userlist.split()[0]
        password = userlist.split()[1]
        weibologin = WeiboLogin(username, password)
        if weibologin.Login() == True:
            print 'login successful'
            user = True    # this account is usable
            while page <= 50 and user:
                url = urlheader + para + '&page=' + str(page)
                print 'Get page %d:' % page
                f = urllib2.urlopen(url)
                ### start matching the page content ###
                for line in f:
                    if re.search(r'pid":"pl_weibo_direct"', line):    # the match must be exact!
                        if reg2.search(line):
                            print 'Sorry, no results found...'
                            return
                        else:
                            matcher.matcher(line)
                            page += 1
                            break
                    if re.search(r'pid":"pl_common_sassfilter', line):
                        if reg1.search(line):
                            print 'This account is locked, switching to the next account'
                            user = False    # this account is no longer usable

if __name__ == '__main__':
    main()


First, the crawler works by keyword search: it fetches weibos containing a given keyword via http://s.weibo.com. reg1 above matches the message "Your behaviour is a bit abnormal, please enter a captcha". This is Sina Weibo's anti-crawler measure: when you search too many pages in one session (the limit is about 30 pages), this message appears and the account is then locked. I also tried proxy IPs, but I was told I had logged in too many times and could not log in at all. So my approach is to use multiple accounts: when one account is locked, switch to the next, and unlock the locked one manually later. This is not a good solution; I am still trying other methods, and if anyone has a better way please let me know.

reg2 matches the case where no weibos containing the specified keyword are found.

In the page source, the line containing "pid":"pl_weibo_direct" holds the search results; all of the matched weibos sit in that single line, so it is enough to parse just that line. If a line contains "pid":"pl_common_sassfilter", the account has been locked.
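As a standalone illustration of that check (the same patterns used in main.py, applied to an already-fetched search page; it assumes the login cookies are already installed):

    # -*- coding: utf-8 -*-
    # Classify a search page: the line with pl_weibo_direct carries the results,
    # the line with pl_common_sassfilter means the account has been locked.
    import re, urllib2

    def classify_page(url):
        for line in urllib2.urlopen(url):
            if re.search(r'pid":"pl_weibo_direct"', line):
                return 'results', line
            if re.search(r'pid":"pl_common_sassfilter', line):
                return 'locked', line
        return 'unknown', None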


Simulated login:

For this part you can read the link above, or my reposted blog entry:

http://liandesinian.blog.51cto.com/7737219/1549692 (corresponds to this project)


Parsing Web page content (matcher.py):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import codecs

def matcher(line):
    reg = r'<em>(.*?)<\\/em>.*?allowForward=1&url=(.*?)&'  # first match every weibo body, including its URL
    sub = r'color:red'  # substring marking the highlighted keyword
    reg = re.compile(reg)
    reg2 = re.compile('<.*?>')  # used to strip the <...> tags
    mats = reg.findall(line)
    if mats != []:
        for mat in mats:
            with codecs.open('result.txt', 'a', encoding='utf-8') as f:  # write to a utf-8 file
                if mat[0].find(sub) != -1:  # contains the highlighted keyword
                    t = reg2.sub('', mat[0])  # strip the <...> tags
                    f.write(t.decode('unicode_escape').replace('\\', '') + '\n')  # remove the backslashes
                    f.write(u'weibo info: ')
                    f.write(mat[1].replace('\\', '') + '\n')

The weibo text sits between <em> and <\/em>. All of the weibo content is matched out first; the results can also include retweets, and some retweets do not contain the specified keyword, so they have to be filtered out. Since the search page highlights the keyword with a color:red style in the weibos that actually contain it, matcher.py keeps only the matches whose text contains that substring.
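To make that concrete, here is a small hypothetical test with a hand-made fragment in the same escaped style as the real page source (the text, URL and structure are invented for illustration only):

    # -*- coding: utf-8 -*-
    # Invented sample line imitating the escaped search-result HTML, used only to
    # exercise matcher.matcher(); it appends the decoded text and URL to result.txt.
    import matcher

    sample = (r'"pid":"pl_weibo_direct" ... <em>\u6211\u559c\u6b22'
              r'<span style=\"color:red\">python<\/span><\/em>'
              r' ... allowForward=1&url=http:\/\/weibo.com\/123\/abc&mid=1 ...')
    matcher.matcher(sample)
    # result.txt then contains roughly:
    #   我喜欢python
    #   weibo info: http://weibo.com/123/abc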


That is basically it for now. There are of course still many shortcomings, especially in how to deal with Sina's anti-crawler measures, which needs further improvement...


This article is from the "Lotus's Thoughts" blog; please keep this source: http://liandesinian.blog.51cto.com/7737219/1549701
