2017.8.30 Update:
All of the project code has been uploaded to Baidu Pan; development of this script has stopped.
Project code:
Link: http://pan.baidu.com/s/1c1FWz76 Password: mu8k
————————————————————————————
Before I begin, a word about my choice of tools: Scrapy + BeautifulSoup + re + pymysql, crawling the mobile version of Weibo (fewer anti-crawling measures, easier to handle).
Scrapy: the crawler framework, not much needs to be said.
BeautifulSoup: an excellent parsing library; I use it with the lxml parser.
re: regular expressions.
pymysql: the MySQL client library for Python, used to store the results.
Skipping the tedious details of Scrapy's various modules, here is my general idea: every user has a UID, and by analysing the pages we can extract those UIDs, which determine where the crawl goes next. The URLs of each Weibo user's home page, fan page, and follow page are very regular, so once the UID is known it is easy to crawl the fan and follow lists. First log in normally to get the cookies, then let Scrapy use those cookies to impersonate the logged-in user.
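As a minimal sketch of the URL patterns this relies on (the helper name pages_for_uid is mine, not from the original code; the paths match the ones used by the spider below):

def pages_for_uid(uid):
    # weibo.cn exposes a user's profile, fan list, and follow list
    # at these regular paths once the numeric UID is known.
    base = 'https://weibo.cn'
    return {
        'profile': base + '/u/' + uid,
        'fans': base + '/' + uid + '/fans',
        'follow': base + '/' + uid + '/follow',
    }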
To extract information from the pages I use a mix of BeautifulSoup and re (I have never studied web pages systematically, so relying on BeautifulSoup alone is a bit difficult).
The acquisition and analysis of cookies
First we log in normally to get cookies.
For some reason the login button on the mobile site would not respond for me, so I log in on the normal (desktop) site and then jump to the mobile version.
The cookies captured (via packet capture) from the mobile version are as follows:
Judging by its name, SUB is the key that identifies the session. Experiments also showed that the server only cares about the three values SUB, SUBP, and SUHB; it ignores the others, but keeping them does no harm.
Next we need to replace Scrapy's default user-agent and stop it from obeying robots.txt.
First, construct the headers and the cookies (you can obtain the values yourself with a packet capture):
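A minimal sketch of what that construction might look like; the cookie values are placeholders from your own capture, and the names account_cookies and header are the ones the spider code below expects:

# Placeholder values: substitute the SUB/SUBP/SUHB strings from your own packet capture.
account_cookies = [
    {
        'SUB': '<value from packet capture>',
        'SUBP': '<value from packet capture>',
        'SUHB': '<value from packet capture>',
    },
]

header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
}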
Also, add the following list of user-agents to settings.py:
USER_AGENT = [
    'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14',
    'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2',
    'Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre',
    'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)',
    'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5',
    'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/527 (KHTML, like Gecko, Safari/419.3) Arora/0.6 (Change: )',
    'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/533.1 (KHTML, like Gecko) Maxthon/3.0.8.2 Safari/533.1',
]
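Ignoring robots.txt is a one-line setting, while rotating through that list needs a little extra code; one possible sketch is a small downloader middleware (the class and file names here are my own, not from the original project):

# settings.py: stop obeying robots.txt and enable the middleware
ROBOTSTXT_OBEY = False
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RandomUserAgentMiddleware': 400}

# middlewares.py (hypothetical)
import random

class RandomUserAgentMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        # Read the USER_AGENT list defined in settings.py.
        return cls(crawler.settings.getlist('USER_AGENT'))

    def __init__(self, user_agents):
        self.user_agents = user_agents

    def process_request(self, request, spider):
        # Pick a random user-agent for every outgoing request.
        request.headers['User-Agent'] = random.choice(self.user_agents)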
Then we set the first page it will crawl:
def start_requests(self):
    return [
        Request('https://weibo.cn/' + '1234567890' + '/fans',
                cookies=account_cookies[0], meta=header, callback=self.getfollowers),
        # Request('http://www.ipip.net/', callback=self.show_ip)
    ]
When a page has been downloaded, getfollowers(self, response) is called. The function is as follows:
def getfollowers(self, response):
    pipe_item = self.get_pipeitem(response, 0)
    self.task_in_queue = self.task_in_queue - 1
    soup = bs4.BeautifulSoup(response.body.decode(response.encoding), 'lxml')
    tag = soup.table
    while tag != None:
        try:
            if str(tag.a) != None:
                uid = re.search('/u/[0-9]{10}', str(tag.a))  # extract the UID with a regex
                if uid != None:
                    pipe_item['Datagram'].append(str(uid.group())[3:])
            tag = tag.next_sibling
        except Exception as e:  # not a Tag but a NavigableString
            tag = tag.next_sibling
            continue
    for ruid in pipe_item['Datagram']:
        if ruid in self.completed_uid:
            continue
        else:
            if self.task_in_queue > MAX_WAITED_LENGTH:
                break
            else:
                self.completed_uid.append(ruid)
                self.task_in_queue = self.task_in_queue + 1
                yield Request('https://weibo.cn/u/' + ruid,
                              cookies=account_cookies[0], meta=header, callback=self.getusrinfo)
                yield Request('https://weibo.cn/' + ruid + '/fans',
                              cookies=account_cookies[0], meta=header, callback=self.getfans)
                yield Request('https://weibo.cn/' + ruid + '/follow',
                              cookies=account_cookies[0], meta=header, callback=self.getfollowers)
    yield pipe_item
This function only gets the first page of followers; to fetch them all you would need to add a little more code to follow the next-page links (a sketch follows).
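A minimal sketch of that extra code, assuming the weibo.cn list pages link to the next page with the anchor text '下页' (next page); this is my own addition, not part of the original script:

# Inside getfollowers, after parsing the current page:
next_link = soup.find('a', string='下页')  # the "next page" link on weibo.cn list pages
if next_link is not None:
    yield Request('https://weibo.cn' + next_link['href'],
                  cookies=account_cookies[0], meta=header,
                  callback=self.getfollowers)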
The three yield Request calls in the final loop submit crawl requests for each newly discovered user at once: the home page, the fan list, and the follow list.
The last yield hands the item to the pipeline. Because the crawl is asynchronous and fetches many pages in parallel, every item submitted to the pipeline must carry a fragment ordinal so that the pipeline can splice the fragments back together.
The code itself is simple enough that there is not much more to say.
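The pipeline itself is not shown here; as a minimal sketch, items could be written to MySQL with pymysql roughly as follows (the table layout and the class name WeiboPipeline are my assumptions, not the original code):

# pipelines.py: hypothetical sketch of storing items with pymysql
import pymysql

class WeiboPipeline(object):
    def open_spider(self, spider):
        # Connection parameters are placeholders; adjust to your own database.
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='password', db='weibo',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Store one fragment per row; splicing by usr_id/item_type happens later.
        self.cursor.execute(
            'INSERT INTO fragments (usr_id, item_type, datagram) VALUES (%s, %s, %s)',
            (item['usr_id'], item['item_type'], ','.join(item['Datagram'])))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()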
Here is the function to get the item:
def get_pipeitem(self, response, item_type):
    pipe_item = PipeItem()
    pipe_item['item_type'] = item_type  # e.g. 1 = fan list
    pipe_item['Datagram'] = []
    pipe_item['usr_id'] = str(re.search('[0-9]{10}', str(response.url)).group())
    return pipe_item
It takes the fetched page, extracts the UID of the user currently being crawled from the URL, and initialises the item's other fields. Also very simple.
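For completeness, the item class used above would be declared in items.py roughly like this; the original definition is not shown in the post, so this is a sketch based only on the fields the spider actually uses:

import scrapy

class PipeItem(scrapy.Item):
    item_type = scrapy.Field()  # 0 = follow list, 1 = fan list, 2 = user info
    usr_id = scrapy.Field()     # UID of the user the page belongs to
    Datagram = scrapy.Field()   # list of extracted UIDs or profile fields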
Here's the function to get the fans:
def getfans(self, response):
    pipe_item = self.get_pipeitem(response, 1)
    self.task_in_queue = self.task_in_queue - 1
    soup = bs4.BeautifulSoup(response.body.decode(response.encoding), 'lxml')
    tag = soup.table
    while tag != None:
        try:
            if str(tag.a) != None:
                uid = re.search('/u/[0-9]{10}', str(tag.a))
                if uid != None:
                    pipe_item['Datagram'].append(str(uid.group())[3:])
            tag = tag.next_sibling
        except Exception as e:  # not a Tag but a NavigableString
            tag = tag.next_sibling
            continue
    yield pipe_item
The logic is the same as for the follow list, except that it does not submit any new crawl requests.
Here is the function to get the user information:
def getusrinfo(self, response):
    pipe_item = self.get_pipeitem(response, 2)
    self.task_in_queue = self.task_in_queue - 1
    soup = bs4.BeautifulSoup(response.body.decode(response.encoding), 'lxml')
    info = soup.find('span', class_='ctt')  # the profile summary line
    pipe_item['Datagram'].append(str(info.text).replace(u'\xa0', u'')[:-12])
    pipe_item['Datagram'].append(re.search(u'关注\[\d+\]', str(soup.text)).group()[3:-1])  # follow count
    pipe_item['Datagram'].append(re.search(u'粉丝\[\d+\]', str(soup.text)).group()[3:-1])  # fan count
    pipe_item['Datagram'].append(re.search(u'微博\[\d+\]', str(soup.text)).group()[3:-1])  # post count
    yield pipe_item
One thing worth mentioning is str(info.text).replace(u'\xa0', u''): it strips the non-breaking spaces, otherwise printing the text during testing raises an error.
Other than that there is nothing special to say; it is all quite simple.