Note: please do not reprint this blog post.
I will first introduce the design of the Sina Weibo crawler (simulated login plus data parsing); if you only want the code, skip to the Code section at the bottom.
The basic steps are: simulate a Sina Weibo login, crawl the page source of the specified user's pages, then parse the raw pages and extract the microblog text. The simulated login is the prerequisite; parsing the page source and extracting the text is the key.

1. Username encryption
Sina Weibo currently encodes the username with the Base64 algorithm. Base64 represents binary data using 64 printable characters. Since 2^6 = 64, every 6 bits form one unit, corresponding to one printable character. Three bytes contain 24 bits, corresponding to four Base64 units, so 3 bytes are represented by 4 printable characters. The printable characters include the letters A-Z and a-z and the digits 0-9 (62 characters in total), plus two symbols that differ between systems. The encoded data is therefore about 4/3 the length of the original.
Python's hashlib module provides several hashing algorithms, including MD5 and the SHA family, which are commonly used to verify file integrity; Base64 encoding itself is provided by the base64 module.
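For illustration, here is a minimal sketch of the username step described above: the address is URL-quoted and then Base64-encoded. The same logic appears later in GetUserName in weiboencode.py; the address used here is made up.

import urllib
import base64

def encode_username(username):
    # URL-quote the address, Base64-encode it, and strip the trailing newline
    return base64.encodestring(urllib.quote(username))[:-1]

print encode_username('myname@example.com')   # hypothetical account name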
2. Password encryption

Sina Weibo encrypts the login password with RSA2. An RSA public key has to be created from two values fixed by Sina Weibo: the first is the pubkey returned by the prelogin request in the first step, and the second is the '10001' found in the JavaScript encryption file. Both are hexadecimal and must be converted to decimal; '10001' in decimal is 65537. The servertime and nonce are then prepended to the password before it is encrypted.
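A minimal sketch of the password step, assuming the rsa package is installed and that servertime, nonce and the hexadecimal pubkey have already been obtained from the prelogin request (the full version is get_pwd in weiboencode.py below):

import rsa
import binascii

def encrypt_password(password, servertime, nonce, pubkey_hex):
    key = rsa.PublicKey(int(pubkey_hex, 16), 65537)                      # 0x10001 == 65537
    message = str(servertime) + '\t' + str(nonce) + '\n' + str(password)
    return binascii.b2a_hex(rsa.encrypt(message, key))                   # hex ciphertext, sent as the 'sp' field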
3. Requesting the Sina SSO login

The Sina Weibo login request URL is:
http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.18)
The POST data sent to this URL includes:
{'entry': 'weibo', 'gateway': '1', 'from': '', 'savestate': '7', 'userticket': '1', 'ssosimplelogin': '1', 'vsnf': '1', 'vsnval': '', 'su': encodedUserName, 'service': 'miniblog', 'servertime': servertime, 'nonce': nonce, 'pwencode': 'rsa2', 'sp': encodedPassword, 'encoding': 'UTF-8', 'prelt': '…', 'rsakv': rsakv, 'url': 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack', 'returntype': 'META'}
Package these parameters, send the POST request, and inspect what comes back. The response contains a redirect URL that tells us whether the login succeeded: if the URL ends with "retcode=101" the login failed, while "retcode=0" means success. After a successful login, the URL inside the location.replace() call in the response body is the one we need next. Send a GET request to that URL and save the cookies the server returns; these are the login cookies we need.
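The redirect check described above comes down to one regular expression; here is a sketch of the idea (the version actually used is sRedirectData in weibosearch.py below):

import re

def extract_redirect(text):
    # pull the URL out of location.replace('...') in the response body
    url = re.search(r'location\.replace\([\'"](.*?)[\'"]\)', text).group(1)
    if 'retcode=0' in url:
        print 'login succeeded:', url
    return url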
4. Getting the web page source code

Once the cookies are saved you can start crawling pages. This article uses Python's urllib2 module, a component for fetching URLs. It offers a very simple interface in the form of the urlopen function, and also provides more complex interfaces for handling common situations such as basic authentication, cookies and proxies; these are exposed through handler and opener objects.
HTTP is based on a request/response mechanism: the client makes a request and the server supplies a response. urllib2 maps the HTTP request you make to a Request object; calling urlopen with that object returns a response object, and calling read() on the response yields the page source.
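In code, fetching a page once the login cookies are installed looks roughly like this (a sketch only; the real requests are made in weibologin.py and getraw_html.py below):

import urllib2

request = urllib2.Request('http://weibo.com/')   # any page URL; the installed opener attaches the login cookie
response = urllib2.urlopen(request)
html = response.read()                           # the page source as a string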
5. Data parsing

The data downloaded from the Weibo platform may be HTML, a JSON value, or a mixture of both. This article extracts the Chinese text of the blogger's own microblog posts.
First, the page source is cleaned with regular expressions. The steps are: remove the JavaScript statements, remove the CSS style sheets, convert <br> tags to line breaks, collapse consecutive blank lines into one, remove HTML comments, remove consecutive spaces, remove leftover tags, and strip the characters after any @. After this preliminary processing the result is written to a file line by line.
The second step is to extract the microblog text. Observation shows that the blogger's posts sit inside <script>FM.view({"ns":"pl.content.homeFeed.index","domid":"Pl_Official... blocks, so we locate the text by the class and id of the surrounding elements, combining regular expressions with the BeautifulSoup library. BeautifulSoup is a high-quality HTML parser that loads an HTML document into a tree structure, letting the user reach any tag or piece of data in the tree; its advantage is precise positioning and lookup of tags, its disadvantage is that it is slower than regular expressions. Combining regular expressions with BeautifulSoup therefore meets the parsing needs of this article.
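As an illustration of that combination, here is a rough sketch: a regular expression pulls out the FM.view(...) blocks and BeautifulSoup walks the HTML fragment inside each one. The WB_text class name is an assumption about the Weibo markup; the parsing actually used in gettext_ch.py below relies mostly on regular expressions.

import re
import json
from bs4 import BeautifulSoup

def extract_posts(html):
    texts = []
    # each feed block is embedded as FM.view({...}) inside a <script> tag
    for block in re.findall(r'FM\.view\((\{.*?\})\)</script>', html, re.S):
        data = json.loads(block)                        # the argument is a JSON object with an "html" field
        soup = BeautifulSoup(data.get('html', ''), 'html.parser')
        for node in soup.find_all('div', class_='WB_text'):   # assumed class of the post body
            texts.append(node.get_text(strip=True))
    return texts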
The third step is further noise removal. Weibo pages embed advertisements such as "browser", "Baidu Music Prestige Edition", "Baidu Video", "Tencent Video" and "Xinhua"; a deleted post shows "Sorry, this microblog has been deleted"; and there are page messages such as "Loading, please wait". All of this noise is removed. Here is the Code section, starting with the program structure:
1. weibosearch.py, weiboencode.py and weibologin.py implement the simulated login; login.py calls them.
File One: weibosearch.py
# -*- coding: utf-8 -*-
import re
import json

# Extract servertime and nonce (plus pubkey and rsakv) from serverData;
# the parsing uses a regular expression and the json module
def sServerData(serverData):
    p = re.compile('\((.*)\)')                    # regular expression for the JSON payload
    jsonData = p.search(serverData).group(1)      # group 1 is the JSON string inside the parentheses
    data = json.loads(jsonData)
    serverTime = str(data['servertime'])          # the parsed JSON object is a dictionary
    nonce = data['nonce']
    pubkey = data['pubkey']
    rsakv = data['rsakv']
    return serverTime, nonce, pubkey, rsakv

# Parse the redirect returned by the login step
def sRedirectData(text):
    p = re.compile('location\.replace\([\'"](.*?)[\'"]\)')
    loginUrl = p.search(text).group(1)
    print 'loginUrl:', loginUrl                   # if the URL contains 'retcode=0' the login succeeded
    return loginUrl
File Two: weiboencode.py
# -*- coding: utf-8 -*-
import urllib
import base64
import rsa
import binascii

# Encrypt the username and password and pack them into the POST data
def PostEncode(userName, passWord, serverTime, nonce, pubkey, rsakv):
    encodedUserName = GetUserName(userName)                           # the username is Base64-encoded
    encodedPassWord = get_pwd(passWord, serverTime, nonce, pubkey)    # the password is RSA-encrypted
    postPara = {
        'entry': 'weibo',
        'gateway': '1',
        'from': '',
        'savestate': '7',
        'userticket': '1',
        'ssosimplelogin': '1',
        'vsnf': '1',
        'vsnval': '',
        'su': encodedUserName,
        'service': 'miniblog',
        'servertime': serverTime,
        'nonce': nonce,
        'pwencode': 'rsa2',
        'sp': encodedPassWord,
        'encoding': 'UTF-8',
        'prelt': '115',        # timing field; the original value was garbled, '115' is a placeholder
        'rsakv': rsakv,
        'url': 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack',
        'returntype': 'META'
    }
    postData = urllib.urlencode(postPara)         # pack the POST data and return it
    return postData

# Base64-encode the plaintext username
def GetUserName(userName):
    userNameTemp = urllib.quote(userName)
    userNameEncoded = base64.encodestring(userNameTemp)[:-1]
    return userNameEncoded

# Prepend servertime and nonce to the plaintext password and encrypt with RSA
def get_pwd(password, servertime, nonce, pubkey):
    rsaPublickey = int(pubkey, 16)
    key = rsa.PublicKey(rsaPublickey, 65537)                               # build the public key
    message = str(servertime) + '\t' + str(nonce) + '\n' + str(password)   # plaintext format required by the login JS
    passwd = rsa.encrypt(message, key)                                     # encrypt
    passwd = binascii.b2a_hex(passwd)                                      # convert the ciphertext to hexadecimal
    return passwd
File Three: weibologin.py
# -*- coding: utf-8 -*-
# Simulated login to Sina Weibo
import urllib2
import cookielib            # modules for network programming and cookie handling
import weiboencode
import weibosearch

class WeiboLogin:
    # called when an object of this class is initialized
    def __init__(self, user, pwd, enableProxy=False):
        # enableProxy controls whether a proxy server is used; off by default
        print "Initializing Sina Weibo login..."
        self.userName = user
        self.passWord = pwd
        self.enableProxy = enableProxy
        # Before the POST request, a GET request fetches two parameters,
        # "servertime" and "nonce"; nonce is random
        self.serverUrl = "http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=&rsakt=mod&client=ssologin.js(v1.4.18)&_=1407721000736"
        # loginUrl is used in the second step: the encrypted username and password are POSTed to it
        self.loginUrl = "http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.18)"
        #self.postHeader = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0'}
        self.postHeader = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}
        #self.postHeader = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)'}

    # Build a cookie jar; every later GET and POST carries the cookie obtained here,
    # because login verification on large sites relies on cookies
    def EnableCookie(self, enableProxy):
        cookiejar = cookielib.LWPCookieJar()                      # create the cookie jar
        cookie_support = urllib2.HTTPCookieProcessor(cookiejar)
        if enableProxy:
            proxy_support = urllib2.ProxyHandler({'http': 'http://122.96.59.107:843'})   # use a proxy
            opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)
            print "Proxy enabled"
        else:
            opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
        urllib2.install_opener(opener)

    # Get the servertime and nonce parameters needed to encode the password
    def GetServerTime(self):
        print "Getting server time and nonce..."
        serverData = urllib2.urlopen(self.serverUrl).read()       # fetch the prelogin response
        print 'serverData', serverData
        try:
            # extract the servertime, nonce, pubkey and rsakv fields from the JSON
            serverTime, nonce, pubkey, rsakv = weibosearch.sServerData(serverData)
            print "GetServerTime success"
            return serverTime, nonce, pubkey, rsakv
        except:
            print "Parse serverData error."
            return None

    def getData(self, url):
        request = urllib2.Request(url)
        response = urllib2.urlopen(request)
        content = response.read()
        return content

    # Login procedure
    def Login(self):
        self.EnableCookie(self.enableProxy)       # configure cookies and, optionally, the proxy
        serverTime, nonce, pubkey, rsakv = self.GetServerTime()   # step one: get the parameters above
        # prepare all the POST parameters
        postData = weiboencode.PostEncode(self.userName, self.passWord, serverTime, nonce, pubkey, rsakv)
        print "Getting postData success"
        # wrap the request and POST the login form to the fixed URL
        req = urllib2.Request(self.loginUrl, postData, self.postHeader)
        result = urllib2.urlopen(req)             # step two: send the username and password to self.loginUrl
        text = result.read()
        # The response contains a script with one more URL that also carries parameters
        # and validation data; opening that URL with GET completes the real login.
        try:
            loginUrl = weibosearch.sRedirectData(text)   # parse the redirect data to get the final URL
            urllib2.urlopen(loginUrl)             # opening it makes the server write the login cookie
            print loginUrl
        except:
            print "Login failed..."
            return False
        print "Login success"
        return True
File Four: login.py (note: replace username and pwd with the username and password of your own Sina Weibo account)
# -*- coding: utf-8 -*-
import weibologin
import urllib2
import re
import sys

reload(sys)
sys.setdefaultencoding("utf-8")

def login():
    username = 'username'       # your Sina Weibo username
    pwd = 'pwd'                 # your Sina Weibo password
    weiboLogin = weibologin.WeiboLogin(username, pwd)   # run the simulated login
    if weiboLogin.Login():
        print "Login successful..."   # reaching this line without an error means the login succeeded
To test, run the login() function from file four. The program prints the redirect URL returned by the Sina server; if it ends with retcode=0 the login succeeded, otherwise it failed.
2. getraw_html.py downloads the page source of the specified page (note that it contains an absolute path, which you should change to your own).
# -*- coding: utf-8 -*-
import weibologin
import login
import urllib2
import re
import sys

reload(sys)
sys.setdefaultencoding("utf-8")

# After the simulated login, fetch the page for the given URL and
# store the raw HTML in raw_html.txt
def get_rawhtml(url):
    #login.login()
    content = urllib2.urlopen(url).read()
    fp_raw = open("F://emotion/mysite/weibo_crawler/raw_html.txt", "w+")
    fp_raw.write(content)
    fp_raw.close()          # write the raw HTML to the file
    #print "Successfully crawled the page source and saved it to raw_html.txt"
    return content          # the return value is the raw HTML content

if __name__ == '__main__':
    login.login()           # log in first
    #url = 'http://weibo.com/yaochen?is_search=0&visible=0&is_tag=0&profile_ftype=1&page=1#feedtop'   # Yao Chen
    url = 'http://weibo.com/fbb0916?is_search=0&visible=0&is_tag=0&profile_ftype=1&page=1#feedtop'    # set the page you want to crawl
    get_rawhtml(url)
3. gettext_ch.py parses the page source crawled in the previous step. Note that it also contains an absolute path that you need to change.
# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup
from getraw_html import get_rawhtml
import login
# BeautifulSoup alone did not work well for extracting the Chinese text here,
# so regular expressions do most of the work
import sys
reload(sys)
sys.setdefaultencoding("utf-8")      # set the default encoding to avoid problems when writing files

def filter_tags(htmlstr):
    # filter CDATA first
    re_cdata = re.compile('//<!\[CDATA\[[^>]*//\]\]>', re.I)    # match CDATA; re.I ignores case
    re_script = re.compile('<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)   # script blocks
    re_style = re.compile('<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', re.I)      # style blocks
    re_br = re.compile('<br\s*?/?>')                             # line breaks
    re_h = re.compile('</?\w+[^>]*>')                            # HTML tags
    re_comment = re.compile('<!--[^>]*-->')                      # HTML comments
    s = re_cdata.sub('', htmlstr)    # remove CDATA
    s = re_script.sub('', s)         # remove script
    s = re_style.sub('', s)          # remove style
    s = re_br.sub('\n', s)           # turn <br> into newlines
    s = re_h.sub('', s)              # remove HTML tags
    s = re_comment.sub('', s)        # remove HTML comments
    s = re.sub(r'\\t', '', s)
    s = re.sub(r'\\n\\n', '', s)
    s = re.sub(r'\\r', '', s)
    s = re.sub(r'<\/?\w+[^>]*>', '', s)
    #s = re.sub(r'<\\\/div\>', '', s)        # remove <\/div>
    #s = re.sub(r'<\\\/a\>', '', s)          # remove <\/a>
    #s = re.sub(r'<\\\/span\>', '', s)       # remove <\/span>
    #s = re.sub(r'<\\\/i\>', '', s)          # remove <\/i>
    #s = re.sub(r'<\\\/li\>', '', s)         # remove <\/li>
    s = re.sub(r'<\\\/dd\>', '', s)          # remove <\/dd>
    s = re.sub(r'<\\\/dl\>', '', s)          # remove <\/dl>
    s = re.sub(r'<\\\/dt\>', '', s)          # remove <\/dt>
    #s = re.sub(r'<\\\/ul\>', '', s)         # remove <\/ul>
    #s = re.sub(r'<\\\/em\>', '', s)         # remove <\/em>
    #s = re.sub(r'<\\\/p\>', '', s)          # remove <\/p>
    s = re.sub(r'<\\\/label\>', '', s)       # remove <\/label>
    s = re.sub(r'<\\\/select\>', '', s)      # remove <\/select>
    s = re.sub(r'<\\\/option\>', '', s)      # remove <\/option>
    s = re.sub(r'<\\\/tr\>', '', s)          # remove <\/tr>
    s = re.sub(r'<\\\/td\>', '', s)          # remove <\/td>
    s = re.sub(r'@[^<]*', '', s)             # remove the characters after @
    s = re.sub(r'<a[^>]*>[^<]*', '', s)
    blank_line = re.compile(r'(\\n)+')       # remove extra blank lines
    s = blank_line.sub('\n', s)              # collapse consecutive line breaks into one
    s = s.replace(' ', '')
    return s

def handel(content, fp2):
    handel_text = []
    lines = content.splitlines(True)         # process the source line by line
    for line in lines:
        # before applying the regular expressions, match lines by how they start
        if re.match(r'(<script>FM\.view\(\{"ns":"pl\.content\.homeFeed\.index","domid":"Pl_Official)(.*)', line):
            temp_new = filter_tags(line)     # apply the tag-filtering function
            handel_text.append(temp_new)
    content_chinese = ""                     # the final Chinese string
    for text in handel_text:
        cha_text = unicode(text, 'utf-8')
        # keep Chinese characters, spaces, digits and common punctuation
        word = re.findall(ur"[\u4e00-\u9fa5]+|\uff0c|\u3002|\uff1a|\s|\u00a0|\u3000|#|\d|\u201c|\u201d|\u3001|\uff01|\uff1f|\u300a|\u300b|\uff1b|\uff08|\uff09", cha_text)
        if word:                             # the line contains some of the characters above
            for char in word:
                if char in (' ', u'\u00a0', u'\u3000'):   # drop the three kinds of space
                    continue
                elif char == '#':
                    content_chinese += '\n'
                else:
                    content_chinese += char
            content_chinese += '\n'
    # content_chinese is complete; write it to handel.txt
    fp3 = open("F://emotion/mysite/weibo_crawler/handel.txt", "w+")
    fp3.write(content_chinese)
    fp3.close()
    # reopen the file and keep only the lines that contain Chinese text
    fp1 = open("F://emotion/mysite/weibo_crawler/handel.txt", "r")
    read = fp1.readlines()
    pattern = re.compile(ur"[\u4e00-\u9fa5]+")    # Chinese text
    #fp2 = open("chinese_weibo.txt", "a")         # output file
    for readline in read:
        utf8_readline = unicode(readline, 'utf-8')
        if pattern.search(utf8_readline):         # process the line only if it contains Chinese
            split_readline = readline.split(' ')  # split the text on spaces; split_readline is a list
            for c in split_readline:
                # the noise patterns below were Chinese phrases in the original post; shown here in translation
                c = re.sub(r'Publisher:.*', '', c)           # remove "publisher: ..."
                c = re.sub(r'Baidu.*', '', c)                # remove "Baidu ..."
                c = re.sub(r'Loading.*', '', c)
                c = re.sub(r'Safe browsing.*', '', c)
                c = re.sub(r'Sorry.*', '', c)
                c = re.sub(r'.*high-speed browsing.*', '', c)
                if len(c) > 16:       # drop text that is too short; one Chinese character is 3 bytes in UTF-8
                    fp2.write(c)
    fp1.close()
    #fp2.close()
    #print "Successfully parsed the page, extracted the Weibo text and saved it to chinese_weibo.txt"
4. The following file calls the modules above and provides a convenient interface:
# -*- coding: utf-8 -*-
import threading
import login
import getraw_html
import gettext_ch
import time

def crawler(number, weibo_url):
    login.login()            # simulated login first
    fp4 = open("F://emotion/mysite/weibo_crawler/chinese_weibo.txt", "w+")   # output file
    for n in range(number):
        n = n + 1
        url = 'http://weibo.com/' + weibo_url + '?is_search=0&visible=0&is_tag=0&profile_ftype=1&page=' + str(n)
        print "crawler url", url
        content = getraw_html.get_rawhtml(url)       # fetch the page source
        print "page %d get success and write into raw_html.txt" % n
        gettext_ch.handel(content, fp4)               # parse the page
        print "page %d handel success and write into chinese_weibo.txt" % n
        #time.sleep(1)
    fp4.close()
    ########## the data crawl is complete
    # re-process the crawled microblogs
    fp = open('F://emotion/mysite/weibo_crawler/chinese_weibo.txt', 'r')
    contents = []