Scrapy Crawling Sina Weibo

1. What It Does
The crawler fetches Sina Weibo users' basic information, such as nickname, avatar, follow list, and fan list, as well as the weibos they have posted, and saves everything to MongoDB.
2. How It Works
Starting from a few Weibo "big V" (high-profile verified) accounts, the spider crawls their fan and follow lists, then the fan and follow lists of those users, and so on, recursively. As long as a user is connected to any other user in the social network, the crawler will eventually reach them, so in principle every user can be covered. Each crawled profile yields the user's unique ID, which is then used to fetch all the weibos that user has posted, as sketched below.
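A minimal sketch of this recursive strategy. The containerid formats, JSON field names, and seed IDs below are assumptions based on m.weibo.cn's public JSON API and should be verified against the live site:

import json

import scrapy


class WeiboSpider(scrapy.Spider):
    name = 'weibo'
    # Seed "big V" accounts; these IDs are placeholders.
    start_users = ['1669879400', '1223178222']
    # URL templates assumed from m.weibo.cn's JSON API.
    user_url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value={uid}&containerid=100505{uid}'
    follows_url = 'https://m.weibo.cn/api/container/getIndex?containerid=231051_-_followers_-_{uid}&page={page}'
    weibos_url = 'https://m.weibo.cn/api/container/getIndex?containerid=107603{uid}&page={page}'

    def start_requests(self):
        for uid in self.start_users:
            yield scrapy.Request(self.user_url.format(uid=uid), callback=self.parse_user)

    def parse_user(self, response):
        user = json.loads(response.text).get('data', {}).get('userInfo', {})
        if not user:
            return
        uid = user['id']
        yield {'id': uid, 'name': user.get('screen_name'), 'avatar': user.get('profile_image_url')}
        # Recurse into the follow list to discover new users,
        # and fetch this user's posted weibos by ID.
        yield scrapy.Request(self.follows_url.format(uid=uid, page=1), callback=self.parse_follows)
        yield scrapy.Request(self.weibos_url.format(uid=uid, page=1), callback=self.parse_weibos)

    def parse_follows(self, response):
        data = json.loads(response.text).get('data', {})
        for card in data.get('cards', []):
            for item in card.get('card_group', []):
                if 'user' in item:
                    uid = item['user']['id']
                    yield scrapy.Request(self.user_url.format(uid=uid), callback=self.parse_user)

    def parse_weibos(self, response):
        data = json.loads(response.text).get('data', {})
        for card in data.get('cards', []):
            if card.get('card_type') == 9:  # type-9 cards carry a weibo post
                yield {'weibo': card['mblog'].get('text'), 'user': card['mblog']['user']['id']}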
3. Analysis
The site to crawl is https://m.weibo.cn, Weibo's mobile site. Opening the home page redirects to the sign-in page, because the home page requires login. However, a user's detail page can be opened directly, without logging in.
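A quick way to confirm that a user detail page is reachable without a session (the user ID below is a placeholder):

import requests

# Fetch a user detail page on the mobile site with no login cookies.
# The uid in the URL is a placeholder; substitute a real user's ID.
resp = requests.get(
    'https://m.weibo.cn/u/1669879400',
    headers={'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_0 like Mac OS X)'},
)
print(resp.status_code)  # 200 expected when the page is publicly reachable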
Sina Weibo's anti-crawling measures are strong: requesting the Weibo API directly without logging in very easily leads to a 403 status code. So here we implement a downloader middleware that attaches a random Cookie to each request.
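A sketch of such a downloader middleware, assuming a separately running cookie pool that returns one random logged-in Cookie as JSON (the pool URL is a placeholder for whatever service you run yourself):

import json
import logging

import requests


class RandomCookiesMiddleware:
    # Address of a cookie-pool service; this URL is a placeholder.
    cookies_url = 'http://localhost:5000/weibo/random'

    def _random_cookies(self):
        try:
            resp = requests.get(self.cookies_url, timeout=5)
            if resp.status_code == 200:
                return json.loads(resp.text)
        except requests.ConnectionError:
            logging.warning('cookie pool unreachable: %s', self.cookies_url)
        return None

    def process_request(self, request, spider):
        # Attach a fresh random Cookie to every outgoing request.
        cookies = self._random_cookies()
        if cookies:
            request.cookies = cookies

Enable it in settings.py under DOWNLOADER_MIDDLEWARES so Scrapy applies it to every request.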
Weibo has another anti-crawling measure: when it detects too many requests from the same IP, it responds with a 414 status code. If you run into this, switch to a proxy.
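One way to handle this is a downloader middleware that retries 414 responses through a fresh proxy; the proxy-pool URL below is a placeholder assumption:

import requests


class ProxyMiddleware:
    # Address of a proxy-pool service returning one proxy per request;
    # this URL is a placeholder for your own pool.
    proxy_url = 'http://localhost:5555/random'

    def _random_proxy(self):
        try:
            resp = requests.get(self.proxy_url, timeout=5)
            if resp.status_code == 200:
                return resp.text.strip()
        except requests.ConnectionError:
            pass
        return None

    def process_response(self, request, response, spider):
        # 414 means the current IP has been rate-limited; retry the same
        # request through a fresh proxy instead of passing the error on.
        if response.status == 414:
            proxy = self._random_proxy()
            if proxy:
                retry_req = request.copy()
                retry_req.dont_filter = True  # bypass the duplicate filter
                retry_req.meta['proxy'] = 'http://{}'.format(proxy)
                return retry_req
        return response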