This article introduces how to write a simple Weibo crawler in Python; if you are interested, read on. At first I wanted to use the Sina Weibo API to fetch posts, but I found the API far too restrictive:
it can only return posts of the authorized user (that is, yourself), and only the latest 5 records at that. WTF!
So I gave up on that route and switched to a plain "raw" crawler. The PC version of Weibo loads content dynamically via Ajax, which makes it hard to crawl, so I crawl the mobile version instead: the mobile pages expose all of a user's posts page by page, which greatly simplifies the work.
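The gist of the mobile-end approach, as a minimal sketch (the user_id is hypothetical, and it assumes the mobile page exposes its total page count in a hidden input named "mp", which the full script later in this article also relies on):

# -*- coding: utf-8 -*-
# Minimal sketch: fetch the first page of a user's posts on the mobile site
# and read the total page count. Requires a valid logged-in cookie; the
# user_id below is hypothetical.
import requests
from lxml import etree

cookie = {"Cookie": "# your cookie"}
user_id = 1234567890
url = 'http://weibo.cn/u/%d?filter=1&page=1' % user_id

html = requests.get(url, cookies=cookie).content
selector = etree.HTML(html)
# the hidden "mp" input on the mobile page holds the total number of pages
page_num = int(selector.xpath('//input[@name="mp"]')[0].attrib['value'])
print 'total pages:', page_num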
The final functions are as follows:
1. Enter the user_id of the Weibo user you want to crawl, and all of that user's posts are fetched.
2. The text content is saved to a text file named after the user_id, and all of the original (HD) images are saved in the weibo_image folder.
Specific steps:
First, we need to obtain our own cookie. Here I only cover the Chrome method.
1. Open the Sina Weibo mobile site in Chrome.
2. Press Option + Command + I to open the Developer Tools.
3. Open the Network tab and check the Preserve log option.
4. Enter your account and password and log in to Sina Weibo.
5. Find the m.weibo.cn request -> Headers -> Cookie, and copy the cookie into the "# your cookie" placeholder in the code (a quick check is sketched below).
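Once copied, the cookie is just an extra header sent with every request. A quick sketch to verify that it works before crawling (the "# your cookie" value and the user_id are placeholders):

# -*- coding: utf-8 -*-
# Sketch: sanity-check the copied cookie before crawling.
import requests

cookie = {"Cookie": "# your cookie"}  # paste the value copied from Chrome here
resp = requests.get('http://weibo.cn/u/1234567890?filter=1&page=1', cookies=cookie)
# a logged-in response contains the hidden "mp" pagination field;
# if it is missing, the cookie is probably invalid or expired
print 'cookie works:', ('name="mp"' in resp.content)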
Next, get the user_id of the user you want to crawl. This hardly needs explaining: open the user's homepage, and the number in the address bar is the user_id.
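If you would rather not copy the number by hand, a trivial sketch that pulls it out of the profile URL (the URL shown is hypothetical):

# -*- coding: utf-8 -*-
# Sketch: extract the numeric user_id from a mobile profile URL.
import re

profile_url = 'http://weibo.cn/u/1234567890'  # hypothetical profile URL
match = re.search(r'/u/(\d+)', profile_url)
if match:
    print 'user_id:', int(match.group(1))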
Save the Python code to a file named weibo_spider.py.
Change to that directory and run python weibo_spider.py user_id on the command line.
Of course, if you forget to append the user_id, the command line will prompt you for it.
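This is just the usual argv fallback; the full script below does essentially the following:

# -*- coding: utf-8 -*-
# Sketch of the argument handling: take user_id from the command line if
# given, otherwise prompt for it (Python 2).
import sys

if len(sys.argv) >= 2:
    user_id = int(sys.argv[1])
else:
    user_id = int(raw_input(u'Enter user_id: '))
print 'crawling user_id:', user_id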
When execution finishes, the output files described above are in place.
Minor issue: in my tests, image downloads sometimes fail. I am not sure of the exact cause yet; it may be a network-speed problem, since the network in my dormitory is quite unstable, or it may be something else entirely. For that reason the program also writes a text file named userid_imageurls in its root directory, which stores the download links of every crawled image. If a full-size image fails to download, you can import those links into Thunder or another download tool and fetch them there.
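If you prefer to retry the failed images from Python instead of an external downloader, here is a minimal sketch that replays the saved link file (the file name and the weibo_image folder follow the conventions of the script below; the paths are assumptions you may need to adjust):

# -*- coding: utf-8 -*-
# Sketch: retry image downloads from the saved "<user_id>_imageurls" file.
import os
import urllib

user_id = 1234567890  # hypothetical
image_path = os.getcwd() + '/weibo_image'
if not os.path.exists(image_path):
    os.mkdir(image_path)

with open('%s_imageurls' % user_id) as f:
    for i, line in enumerate(f, 1):
        imgurl = line.strip()
        if not imgurl:
            continue
        try:
            urllib.urlretrieve(imgurl, '%s/%d.jpg' % (image_path, i))
        except IOError:
            print u'still failed: %s' % imgurl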
In addition, my system is OS X El Capitan 10.11.2 and my Python version is 2.7. The dependencies (requests, lxml, beautifulsoup4, as imported in the code) can be installed with sudo pip install <package>. For specific configuration problems, Stack Overflow is a good resource.
The following is the implementation code.
# -*- coding: utf-8 -*-
import re
import string
import sys
import os
import urllib
import urllib2
from bs4 import BeautifulSoup
import requests
from lxml import etree

reload(sys)
sys.setdefaultencoding('utf-8')

if len(sys.argv) >= 2:
    user_id = int(sys.argv[1])
else:
    user_id = int(raw_input(u"Enter user_id: "))

cookie = {"Cookie": "# your cookie"}
url = 'http://weibo.cn/u/%d?filter=1&page=1' % user_id
html = requests.get(url, cookies=cookie).content
selector = etree.HTML(html)
# the hidden "mp" input on the mobile page holds the total page count
pageNum = int(selector.xpath('//input[@name="mp"]')[0].attrib['value'])

result = ""
urllist_set = set()
word_count = 1
image_count = 1

print u'crawler ready...'
for page in range(1, pageNum + 1):
    # fetch the page
    url = 'http://weibo.cn/u/%d?filter=1&page=%d' % (user_id, page)
    lxml = requests.get(url, cookies=cookie).content

    # text crawling
    selector = etree.HTML(lxml)
    content = selector.xpath('//span[@class="ctt"]')
    for each in content:
        text = each.xpath('string(.)')
        if word_count >= 4:
            text = "%d: " % (word_count - 3) + text + "\n"
        else:
            text = text + "\n"
        result = result + text
        word_count += 1

    # image crawling: collect the original-picture links on this page
    soup = BeautifulSoup(lxml, "lxml")
    urllist = soup.find_all('a', href=re.compile(r'^http://weibo.cn/mblog/oripic', re.I))
    for imgurl in urllist:
        urllist_set.add(requests.get(imgurl['href'], cookies=cookie).url)
        image_count += 1

# save the text (note: the path is hard-coded; adjust it for your machine)
fo = open("/Users/Personals/%s" % user_id, "wb")
fo.write(result)
fo.close()
word_path = os.getcwd() + '/%d' % user_id
print u'text weibo crawling completed'

# save all image links to <user_id>_imageurls
link = ""
fo2 = open("/Users/Personals/%s_imageurls" % user_id, "wb")
for eachlink in urllist_set:
    link = link + eachlink + "\n"
fo2.write(link)
fo2.close()
print u'image link crawling completed'

image_path = os.getcwd() + '/weibo_image'
if not urllist_set:
    print u'this page contains no images'
else:
    # download the images into the weibo_image folder under the current directory
    if os.path.exists(image_path) is False:
        os.mkdir(image_path)
    x = 1
    for imgurl in urllist_set:
        temp = image_path + '/%s.jpg' % x
        print u'downloading image %s' % x
        try:
            urllib.urlretrieve(urllib2.urlopen(imgurl).geturl(), temp)
        except:
            print u"failed to download image: %s" % imgurl
        x += 1

print u'original weibo posts crawled, %d in total, save path %s' % (word_count - 4, word_path)
print u'weibo images crawled, %d in total, save path %s' % (image_count - 1, image_path)
With that, a simple Weibo crawler is complete; I hope it helps you in your own learning.