Solemn reminder: this blog post may not be reprinted.
I will first explain the design of the Sina Weibo crawler (including simulated login and data parsing) and the principles behind it. If you do not want to read this part, you can skip to the code section at the bottom.
The basic steps are: simulated login to Sina Weibo, crawling the source code of the specified user's pages, and parsing the raw pages to extract the microblog text. The simulated login is the prerequisite; parsing the source code to extract the text is the key.
1. Username Encryption
Sina Weibo currently encodes the user name with the Base64 algorithm. Base64 is a way of representing binary data using 64 printable characters. Because 2 to the 6th power equals 64, every 6 bits form one unit that corresponds to one printable character. Three bytes contain 24 bits and correspond to four Base64 units, so 3 bytes are represented by 4 printable characters. The printable characters in Base64 include the letters A-Z and a-z and the digits 0-9, 62 characters in total, plus two printable symbols that vary between systems. The encoded data is slightly longer than the original, about 4/3 of its size.
Python's hashlib module contains a variety of hashing algorithms, including MD5 and the SHA family of digest algorithms used to verify file integrity; the common Base64 algorithm is provided by the base64 module.
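For illustration, here is a minimal sketch (not one of the project files) of how the login "su" field is formed: the user name is first URL-quoted and then Base64-encoded, the same steps as the GetUserName function in weiboencode.py further below. The example address is a placeholder.

import urllib
import base64

# minimal sketch: Weibo's 'su' login field is the URL-quoted user name encoded with Base64
def encode_username(user_name):
    quoted = urllib.quote(user_name)            # e.g. 'name@example.com' -> 'name%40example.com'
    return base64.encodestring(quoted)[:-1]     # drop the trailing newline added by encodestring

print encode_username('name@example.com')       # the address here is only a placeholder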
2. Password Encryption
Sina Weibo encrypts the login password with the RSA2 algorithm. An RSA public key must be created first; Sina Weibo has fixed its two parameters. The first parameter is the pubkey returned in the first step of the login, and the second is the value '10001' found in the encryption JavaScript file. Both values must be converted from hexadecimal to decimal; '10001' in decimal is 65537. The servertime and nonce values are then concatenated with the password before encryption.
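A minimal sketch of this step using the third-party rsa package (the same approach as the get_pwd function in weiboencode.py below); pubkey_hex, servertime and nonce are assumed to be the values returned by the prelogin request.

import rsa
import binascii

def encode_password(password, servertime, nonce, pubkey_hex):
    rsa_n = int(pubkey_hex, 16)                          # convert the hexadecimal modulus to decimal
    key = rsa.PublicKey(rsa_n, 65537)                    # '10001' in hex equals 65537 in decimal
    message = str(servertime) + '\t' + str(nonce) + '\n' + str(password)   # concatenate servertime, nonce and the password
    return binascii.b2a_hex(rsa.encrypt(message, key))   # the ciphertext is sent as a hex string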
3. Requesting Sina Passport Login
The URL that Sina Weibo login requests are sent to is:
http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.18)
The POST data sent with the request to this URL includes:
{'entry': 'weibo', 'gateway': '1', 'from': '', 'savestate': '7', 'userticket': '1',
 'ssosimplelogin': '1', 'vsnf': '1', 'vsnval': '', 'su': encodedUserName,
 'service': 'miniblog', 'servertime': serverTime, 'nonce': nonce, 'pwencode': 'rsa2',
 'sp': encodedPassWord, 'encoding': 'UTF-8', 'prelt': '115', 'rsakv': rsakv,
 'url': 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack',
 'returntype': 'META'}
Organize the parameters, POST the request, and then inspect what comes back. The response contains a URL redirect address that tells us whether the login succeeded: if the address ends with "retcode=101" the login failed, while "retcode=0" indicates success. When the login succeeds, the URL inside the replace(...) call in the response body is the URL we need next. Send a GET request to that URL and save the cookie returned for that request; this is the login cookie we need.
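A minimal sketch of this check, assuming text holds the body returned by the login POST:

import re

def check_login(text):
    # the redirect URL appears inside a location.replace('...') call in the response body
    login_url = re.search(r'location\.replace\([\'"](.*?)[\'"]\)', text).group(1)
    if 'retcode=0' in login_url:
        print 'login success, now GET:', login_url     # requesting this URL saves the login cookie
    else:
        print 'login failed'                           # e.g. the URL ends with retcode=101
    return login_url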
4. Getting the Web Page Source Code
Once the cookie has been obtained you can start crawling web pages. This article uses the Python 2 standard-library module urllib2. urllib2 is the Python component for fetching URLs. It provides a very simple interface in the form of the urlopen function, and it also provides more complex interfaces for handling common situations such as basic authentication, cookies and proxies; these are provided through handler and opener objects.
HTTP is based on a request-and-response mechanism: the client sends a request and the server provides a response. urllib2 maps your HTTP request to a Request object; calling urlopen with the Request object returns a response object, and calling read() on it yields the page source, which can then be written to a file.
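A minimal sketch of this flow, with a cookie-aware opener installed first so that the saved login cookie is sent along with every request (the URL is a placeholder):

import urllib2
import cookielib

cookie_jar = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
urllib2.install_opener(opener)                 # all later urlopen calls carry the cookies

request = urllib2.Request('http://weibo.com/') # map the HTTP request to a Request object
response = urllib2.urlopen(request)            # send it and get the response object
page_source = response.read()                  # read the page source, ready to be written to a file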
5. Data Parsing
The data downloaded from the microblog platform to the local machine is the page source, which may be HTML, JSON values, or a mixture of both; this article extracts the blogger's Chinese microblog text from it.
First, the page source is cleaned with regular expressions. The steps are: remove the JavaScript control statements, remove the CSS style sheets, convert <br> tags into line breaks, collapse consecutive blank lines into one, remove HTML comments, remove contiguous spaces, remove unwanted tags, and strip the characters after "@". After this preliminary processing the result is written to a file line by line.
The second step is to extract the microblog information. By inspecting the source we find that the blogger's microblog text lies inside <script>FM.view({"ns":"pl.content.homeFeed.index","domid":"Pl_Official..., so we parse the microblog body according to the class and id of the element that contains it. We use a combination of regular expressions and the BeautifulSoup library. BeautifulSoup is a high-quality HTML parser: it loads the HTML document once into a tree structure, after which the user can easily reach any tag or piece of data within the tree; its advantage is that it locates tags accurately. Its disadvantage is that it is slower than regular expressions. Combining regular expressions with BeautifulSoup therefore satisfies the parsing needs of this article.
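A minimal sketch of that combination, assuming the FM.view(...) argument is a JSON object whose "html" field carries the feed markup (the field name is an assumption based on the observed page structure, not taken from the project code):

import re
import json
from bs4 import BeautifulSoup

def extract_feed_text(line):
    m = re.search(r'FM\.view\((\{.*\})\)', line)        # the regular expression locates the FM.view payload
    if not m:
        return ''
    payload = json.loads(m.group(1))                    # the payload is parsed as a JSON object
    soup = BeautifulSoup(payload.get('html', ''), 'html.parser')
    return soup.get_text()                              # BeautifulSoup strips the tags and keeps the text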
The third step is to remove the remaining noise. The microblog text embeds advertisements such as "high-speed browser", "Baidu Music privilege edition", "Baidu Video", "Tencent Video", "Xinhua" and so on; if a microblog has been deleted, the text "Sorry, this Weibo has been deleted" appears; and there are page prompts such as "Loading, please wait". All of this noise must be removed. Below is the code section; the program is organized as follows:
1. weibosearch.py, weibologin.py and weiboencode.py implement the simulated login; the login function in login.py invokes them.
File one: weibosearch.py
# -*- coding: utf-8 -*-
import re
import json

# Find servertime and nonce in serverData;
# the parsing uses regular expressions and the json module
def sServerData(serverData):
    p = re.compile('\((.*)\)')                    # define the regular expression
    jsonData = p.search(serverData).group(1)      # find and extract group 1
    data = json.loads(jsonData)
    serverTime = str(data['servertime'])          # the JSON object is a dictionary, read the fields from it
    nonce = data['nonce']
    pubkey = data['pubkey']
    rsakv = data['rsakv']
    return serverTime, nonce, pubkey, rsakv

# Function used in Login to parse the redirect result
def sRedirectData(text):
    p = re.compile('location\.replace\([\'"](.*?)[\'"]\)')
    loginUrl = p.search(text).group(1)
    print 'loginUrl:', loginUrl                   # if the returned URL contains 'retcode=0', the login succeeded
    return loginUrl
File two: weiboencode.py
# -*- coding: utf-8 -*-
import urllib
import base64
import rsa
import binascii

# Encrypt the user name and password and pack them into the POST data
def PostEncode(userName, passWord, serverTime, nonce, pubkey, rsakv):
    encodedUserName = GetUserName(userName)                            # the user name is Base64-encoded
    encodedPassWord = get_pwd(passWord, serverTime, nonce, pubkey)     # the password currently uses RSA encryption
    postPara = {
        'entry': 'weibo',
        'gateway': '1',
        'from': '',
        'savestate': '7',
        'userticket': '1',
        'ssosimplelogin': '1',
        'vsnf': '1',
        'vsnval': '',
        'su': encodedUserName,
        'service': 'miniblog',
        'servertime': serverTime,
        'nonce': nonce,
        'pwencode': 'rsa2',
        'sp': encodedPassWord,
        'encoding': 'UTF-8',
        'prelt': '115',
        'rsakv': rsakv,
        'url': 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack',
        'returntype': 'META'
    }
    postData = urllib.urlencode(postPara)          # pack the postData and return it
    return postData

# Derive the encrypted user name from the plain-text user name
def GetUserName(userName):
    userNameTemp = urllib.quote(userName)
    userNameEncoded = base64.encodestring(userNameTemp)[:-1]
    return userNameEncoded

# Combine the plain-text password with nonce and pubkey and encrypt it according to the RSA rules
def get_pwd(password, servertime, nonce, pubkey):
    rsaPublickey = int(pubkey, 16)
    key = rsa.PublicKey(rsaPublickey, 65537)       # create the public key
    message = str(servertime) + '\t' + str(nonce) + '\n' + str(password)   # concatenate the plain text as required by the encryption JS file
    passwd = rsa.encrypt(message, key)             # encrypt
    passwd = binascii.b2a_hex(passwd)              # convert the encrypted bytes to hexadecimal
    return passwd
File three: weibologin.py
# -*- coding: utf-8 -*-
# Simulated login to Sina Weibo
import urllib2
import cookielib               # load the important network-programming modules
import weiboencode
import weibosearch

class WeiboLogin:
    # Python magic method, called when an object of the class is initialized
    def __init__(self, user, pwd, enableProxy=False):
        # enableProxy indicates whether to use a proxy server, off by default
        print "Initializing Sina Weibo login..."
        self.userName = user
        self.passWord = pwd
        self.enableProxy = enableProxy     # initialize the class members
        # Two parameters must be fetched with GET before the POST request;
        # the returned data contains the random "servertime" and "nonce" values
        self.serverUrl = "http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=&rsakt=mod&client=ssologin.js(v1.4.18)&_=1407721000736"
        # loginUrl is used in the second step: the encrypted user name and password are POSTed to this URL
        self.loginUrl = "http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.18)"
        #self.postHeader = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0'}
        self.postHeader = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}
        #self.postHeader = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)'}

    # Generate the cookie; all later GET and POST requests carry the cookie that has been obtained,
    # because the login verification of larger web sites relies on cookies
    def EnableCookie(self, enableProxy):
        cookiejar = cookielib.LWPCookieJar()                      # create a cookie jar
        cookie_support = urllib2.HTTPCookieProcessor(cookiejar)
        if enableProxy:
            proxy_support = urllib2.ProxyHandler({'http': 'http://122.96.59.107:843'})   # use a proxy
            opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)
            print "Proxy enabled"
        else:
            opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
        urllib2.install_opener(opener)

    # Get the server time and nonce parameters used to encode the password
    def GetServerTime(self):
        print "Getting server time and nonce..."
        serverData = urllib2.urlopen(self.serverUrl).read()       # get the page content
        print 'serverData', serverData
        try:
            # extract the servertime, nonce, pubkey and rsakv fields from the JSON
            serverTime, nonce, pubkey, rsakv = weibosearch.sServerData(serverData)
            print "GetServerTime success"
            return serverTime, nonce, pubkey, rsakv
        except:
            print "Parse serverData error."
            return None

    def getData(self, url):
        request = urllib2.Request(url)
        response = urllib2.urlopen(request)
        content = response.read()
        return content

    def Login(self):                                              # the login procedure
        self.EnableCookie(self.enableProxy)                       # cookie or proxy configuration, done by the helper above
        serverTime, nonce, pubkey, rsakv = self.GetServerTime()   # step one of the login: get the parameters above
        # prepare all the POST parameters and return postData
        postData = weiboencode.PostEncode(self.userName, self.passWord, serverTime, nonce, pubkey, rsakv)
        print "Getting postData success"
        # pack the request and fetch the text of the pointed-to URL
        req = urllib2.Request(self.loginUrl, postData, self.postHeader)   # pack the request information
        result = urllib2.urlopen(req)                             # step two: send the user name and password to self.loginUrl
        text = result.read()                                      # read the content
        #print text
        # After this step Sina returns a script that defines a further login URL; requesting that URL
        # fetches the remaining parameters and finishes the verification, which is the real login,
        # so this URL must be requested again with GET.
        try:
            loginUrl = weibosearch.sRedirectData(text)            # parse the redirect information to get the final URL
            urllib2.urlopen(loginUrl)                             # once this URL is opened the server writes the login information into the cookie and the login succeeds
            print loginUrl
        except:
            print 'Login failed...'
            return False
        print 'Login success'
        return True
File four: login.py (note: replace username and pwd in the program with your own registered Sina Weibo user name and password)
# -*- coding: utf-8 -*-
import weibologin
import urllib2
import re
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

def login():
    username = 'username'
    pwd = 'pwd'                                         # my Sina Weibo user name and password
    weiboLogin = weibologin.WeiboLogin(username, pwd)   # call the simulated-login program
    if weiboLogin.Login():
        print "Login successful..."                     # reaching this point without errors means the login succeeded
Then write a simple test yourself: run the login() function in file four. The program will print the redirect address returned by the Sina server; if that address ends with retcode=0 the login succeeded, otherwise it failed.
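A minimal test sketch along those lines:

import login

if __name__ == '__main__':
    login.login()      # prints the redirect URL; retcode=0 at its end means the login succeeded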
2. getraw_html.py downloads the page source of the specified page (note that it contains absolute paths; change them to your own paths before use).
# -*- coding: utf-8 -*-
import weibologin
import login
import urllib2
import re
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

# Call the simulated-login program, fetch the page at the given URL and write the raw HTML into raw_html.txt
def get_rawhtml(url):
    #login.login()
    content = urllib2.urlopen(url).read()
    fp_raw = open("F://emotion/mysite/weibo_crawler/raw_html.txt", "w+")
    fp_raw.write(content)
    fp_raw.close()                 # write the raw HTML into the file
    #print "Successfully crawled the specified page source and stored it in raw_html.txt"
    return content                 # the return value is the raw HTML content

if __name__ == '__main__':
    login.login()                  # call the login function first
    #url = 'http://weibo.com/yaochen?is_search=0&visible=0&is_tag=0&profile_ftype=1&page=1#feedtop'   # Yao Chen
    url = 'http://weibo.com/fbb0916?is_search=0&visible=0&is_tag=0&profile_ftype=1&page=1#feedtop'    # set the page you want to crawl
    get_rawhtml(url)
3. gettext_ch.py parses the page source crawled in the previous step. Note that it also contains absolute paths that you need to change before use.
# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup
from getraw_html import get_rawhtml
import login
# BeautifulSoup alone does not handle this Chinese text well, so regular expressions do most of the work
import sys
reload(sys)
sys.setdefaultencoding("utf-8")    # set the default encoding to prevent encoding problems when writing files

def filter_tags(htmlstr):
    # first filter out CDATA
    re_cdata = re.compile('//<!\[CDATA\[[^>]*//\]\]>', re.I)                  # match CDATA; re.I means ignore case
    re_script = re.compile('<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)   # script
    re_style = re.compile('<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', re.I)      # style
    re_br = re.compile('<br\s*?/?>')          # handle line breaks
    re_h = re.compile('</?\w+[^>]*>')         # HTML tags
    re_comment = re.compile('<!--[^>]*-->')   # HTML comments
    s = re_cdata.sub('', htmlstr)             # remove CDATA
    s = re_script.sub('', s)                  # remove script
    s = re_style.sub('', s)                   # remove style
    s = re_br.sub('\n', s)                    # convert <br> to newline
    s = re_h.sub('', s)                       # remove HTML tags
    s = re_comment.sub('', s)                 # remove HTML comments
    s = re.sub(r'\\t', '', s)
    s = re.sub(r'\\n\\n', '', s)
    s = re.sub(r'\\n', '', s)
    s = re.sub(r'<\/?\w+[^>]*>', '', s)
    #s = re.sub(r'<\\\/div\>', '', s)         # remove <\/div>
    #s = re.sub(r'<\\\/a\>', '', s)           # remove <\/a>
    #s = re.sub(r'<\\\/span\>', '', s)        # remove <\/span>
    #s = re.sub(r'<\\\/i\>', '', s)           # remove <\/i>
    #s = re.sub(r'<\\\/li\>', '', s)          # remove <\/li>
    s = re.sub(r'<\\\/dd\>', '', s)           # remove <\/dd>
    s = re.sub(r'<\\\/dl\>', '', s)           # remove <\/dl>
    s = re.sub(r'<\\\/dt\>', '', s)           # remove <\/dt>
    #s = re.sub(r'<\\\/ul\>', '', s)          # remove <\/ul>
    #s = re.sub(r'<\\\/em\>', '', s)          # remove <\/em>
    #s = re.sub(r'<\\\/p\>', '', s)           # remove <\/p>
    s = re.sub(r'<\\\/label\>', '', s)        # remove <\/label>
    s = re.sub(r'<\\\/select\>', '', s)       # remove <\/select>
    s = re.sub(r'<\\\/option\>', '', s)       # remove <\/option>
    s = re.sub(r'<\\\/tr\>', '', s)           # remove <\/tr>
    s = re.sub(r'<\\\/td\>', '', s)           # remove <\/td>
    s = re.sub(r'@[^<]*', '', s)              # remove the characters after @
    s = re.sub(r'<a[^>]*>[^<]*', '', s)
    blank_line = re.compile(r'(\\n)+')        # remove extra blank lines
    s = blank_line.sub('\n', s)               # convert consecutive line breaks into one
    s = s.replace(' ', '')
    return s

def Handel(content, fp2):
    handel_text = []
    lines = content.splitlines(True)          # split into lines and handle each line separately
    for line in lines:
        # before applying the regular expressions, match against the beginning of the line
        if re.match(r'(\<script\>FM\.view\(\{"ns":"pl\.content\.homeFeed\.index","domid":"Pl_Official)(.*)', line):
            #print "line", line
            temp_new = filter_tags(line)      # call the regular-expression filter
            #print "temp_new", temp_new
            handel_text.append(temp_new)
    content_chinese = ""                      # initialize the final Chinese string
    for text in handel_text:
        #print "text", text
        cha_text = unicode(text, 'utf-8')     # decode
        # Chinese characters, spaces, punctuation, quotation marks, commas, exclamation marks, etc.
        word = re.findall(ur"[\u4e00-\u9fa5]+|,|。|:|\s|\u00a0|\u3000|#|\d|\u201c|\u201d|\u3001|\uff01|\uff1f|\u300a|\u300b|\uff1b|\uff08|\uff09", cha_text)
        if word:                              # if the sentence contains any of the characters above
            #print "word", word
            for char in word:
                if char == ' ':               # the three forms of spaces
                    content_chinese += ''
                elif char == u'\u00a0':       # non-breaking space
                    content_chinese += ''
                elif char == u'\u3000':       # full-width space
                    content_chinese += ''
                elif char == '#':
                    content_chinese += '\n'
                else:
                    content_chinese += char
            content_chinese += '\n'
    # the loop has finished and content_chinese has been generated; write it into handel.txt
    fp3 = open("F://emotion/mysite/weibo_crawler/handel.txt", "w+")
    fp3.write(content_chinese)
    fp3.close()
    # reopen the file
    fp1 = open("F://emotion/mysite/weibo_crawler/handel.txt", "r")
    read = fp1.readlines()
    pattern = re.compile(ur"[\u4e00-\u9fa5]+")   # Chinese text
    #fp2 = open("chinese_weibo.txt", "a")        # initialize the output file
    for readline in read:
        utf8_readline = unicode(readline, 'utf-8')
        if pattern.search(utf8_readline):        # process the line only if it contains Chinese text
            #print readline                      # test
            split_readline = readline.split(' ') # split the text by spaces; split_readline is a list
            for c in split_readline:
                c = re.sub(r'发布者:[.]*', '', c)   # remove "publisher"
                c = re.sub(r'百度[.]*', '', c)      # remove "Baidu"
                c = re.sub(r'正在加载[.]*', '', c)   # remove "loading"
                c = re.sub(r'安全浏览[.]*', '', c)   # remove "safe browsing"
                c = re.sub(r'抱歉[.]*', '', c)      # remove "sorry, this Weibo has been deleted"
                c = re.sub(r'.*?高速浏览.*', '', c)  # remove "high-speed browser" ads
                if len(c) > 16:                   # drop texts that are too short; one Chinese character is 3 bytes in UTF-8
                    fp2.write(c)
                    #print "c", c
    fp1.close()
    #fp2.close()                                  # close the files
    #print "Successfully parsed the page, extracted the microblogs and stored them in chinese_weibo.txt"
4. The following code calls the parts above and provides an interface so that the crawler is convenient to use.
# -*- coding: utf-8 -*-
import threading
import login
import getraw_html
import gettext_ch
import time

def crawler(number, weibo_url):
    login.login()                                # simulated login first
    fp4 = open("F://emotion/mysite/weibo_crawler/chinese_weibo.txt", "w+")   # initialize the output file
    for n in range(number):
        n = n + 1
        url = 'http://weibo.com/' + weibo_url + '?is_search=0&visible=0&is_tag=0&profile_ftype=1&page=' + str(n)
        print "crawler url", url
        content = getraw_html.get_rawhtml(url)   # call the function that downloads the page source
        print "page %d get success and write into raw_html.txt" % n
        gettext_ch.Handel(content, fp4)          # call the function that parses the page
        print "page %d Handel success and write into chinese_weibo.txt" % n
        time.sleep(1)
    fp4.close()
    ########## data crawling completed
    # remove duplicates from the crawled microblogs
    fp = open('F://emotion/mysite/weibo_crawler/chinese_weibo.txt', 'r')
    contents = []