Using Python Scrapy to crawl Weibo content (Part 1)

Source: Internet
Author: User
Tags: python, scrapy

2017.8.30 Update:
All project code has been uploaded to Baidu Pan. Development of this script has now stopped.

Project code:

Link: http://pan.baidu.com/s/1c1FWz76 Password: mu8k

————————————————————————————
Before I begin, I'll explain my choice of tools: Scrapy + BeautifulSoup + re + pymysql, crawling the mobile version of Weibo (fewer anti-crawling measures, easier to parse).
Scrapy: the crawler framework, not much to say.
BeautifulSoup: an excellent parsing library; I use it with the lxml parser.
re: regular expressions.
pymysql: a pure-Python MySQL client.

Skipping the tedious details of Scrapy's various modules, here is my general idea: every user has a UID, and by analyzing the pages we can extract UIDs, which determine where the crawl goes next. Each user's homepage, fans page and follow page have very regular URL patterns, so once we know a UID it is easy to crawl that user's fans and follow lists. We first log in normally to obtain cookies, then let Scrapy use those cookies to impersonate the logged-in user.
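Concretely, the three page types for a given UID follow fixed URL patterns on the weibo.cn mobile site (the UID below is just a placeholder); these are the same URLs the spider code later in this post builds:

# URL patterns used throughout this crawler (weibo.cn mobile site); placeholder UID.
uid = '1234567890'

profile_url = 'https://weibo.cn/u/' + uid            # user's homepage
fans_url    = 'https://weibo.cn/' + uid + '/fans'    # the user's fans list
follow_url  = 'https://weibo.cn/' + uid + '/follow'  # accounts the user follows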

To parse the page content I use BeautifulSoup and re together (I have never studied web pages systematically, and BeautifulSoup on its own felt a bit awkward).

Acquisition and analysis of cookies

First we log in normally to get cookies.
For some reason the login button on the mobile site would not work for me, so I log in on the desktop site and then jump to the mobile version.
The cookies captured from the mobile version look like this:

Judging by its name, SUB is the key that identifies the user. Experiments also showed that the server only checks the three values SUB, SUBP and SUHB; it ignores the other cookies, but keeping them just in case does no harm.

Next we need to replace Scrapy's default User-Agent and stop it from obeying robots.txt.

First, construct the header and the cookies (you can get these from your own packet capture):
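The original post shows the result as a screenshot. As a rough sketch of the shape these two variables take (the values are placeholders to copy from your own capture, and only SUB, SUBP and SUHB actually matter, as noted above; account_cookies and header are the names the spider code below uses):

# Placeholder cookie values - copy SUB / SUBP / SUHB from your own packet capture.
account_cookies = [
    {
        'SUB':  '<value of SUB from your capture>',
        'SUBP': '<value of SUBP from your capture>',
        'SUHB': '<value of SUHB from your capture>',
    },
]

# Request header fields; the spider passes this dict via each request's meta.
header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
}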

Also, add the following list of User-Agent strings to settings.py:

USER_AGENTS = [
    'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14',
    'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2',
    'Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre',
    'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)',
    'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5',
    'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6 (Change: )',
    'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/533.1 (KHTML, like Gecko) Maxthon/3.0.8.2 Safari/533.1'
]
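To actually rotate these strings and to stop Scrapy from honouring robots.txt, one common setup is a small downloader middleware plus two settings. This is only a sketch, not necessarily what the original project did; it assumes the list above is named USER_AGENTS and that your project package is called myproject (adjust the path to your own layout):

# middlewares.py (sketch) - pick a random User-Agent for every outgoing request
import random

class RandomUserAgentMiddleware(object):
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the USER_AGENTS list defined in settings.py
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)

And in settings.py:

# settings.py (sketch)
ROBOTSTXT_OBEY = False  # do not obey robots.txt

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the default UA
    'myproject.middlewares.RandomUserAgentMiddleware': 400,              # adjust to your project path
}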

Then we set the first page for the spider to crawl:

def start_requests(self):
    return [
        Request('https://weibo.cn/' + '1234567890' + '/fans',
                cookies=account_cookies[0], meta=header,
                callback=self.getfollowers)
        # Request('http://www.ipip.net/', callback=self.show_ip)
    ]

When that page has been downloaded, Scrapy calls getfollowers(self, response).
The function is as follows:

def getfollowers(self, response):
    pipe_item = self.get_pipeitem(response, 0)
    self.task_in_queue = self.task_in_queue - 1
    soup = bs4.BeautifulSoup(response.body.decode(response.encoding), 'lxml')
    tag = soup.table
    while tag is not None:
        try:
            if tag.a is not None:
                uid = re.search('/u/[0-9]{10}', str(tag.a))  # extract the UID with a regex
                if uid is not None:
                    pipe_item['Datagram'].append(str(uid.group())[3:])
            tag = tag.next_sibling
        except Exception as e:  # not a Tag but a NavigableString
            tag = tag.next_sibling
            continue
    for ruid in pipe_item['Datagram']:
        if ruid in self.completed_uid:
            continue
        else:
            if self.task_in_queue > MAX_WAITED_LENGTH:
                break
            else:
                self.completed_uid.append(ruid)
                self.task_in_queue = self.task_in_queue + 1
                yield Request('https://weibo.cn/u/' + ruid,
                              cookies=account_cookies[0], meta=header,
                              callback=self.getusrinfo)
                yield Request('https://weibo.cn/' + ruid + '/fans',
                              cookies=account_cookies[0], meta=header,
                              callback=self.getfans)
                yield Request('https://weibo.cn/' + ruid + '/follow',
                              cookies=account_cookies[0], meta=header,
                              callback=self.getfollowers)
    yield pipe_item

This function only fetches the first page of the follow list; to crawl every page you would need to add a little more code (see the sketch after this paragraph).
The three yields of Request in the final loop submit all the pages to crawl next: the user's homepage, fans list and follow list.
The last yield hands the item to the pipeline. Because the crawl is asynchronous and many requests run in parallel, every item given to the pipeline must carry an identifier for the fragment it represents, so that the pipeline can splice the fragments back together.
The code itself is simple enough that there is not much more to say.
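One possible way to extend it to further pages, as a sketch only: assuming the weibo.cn mobile pages expose a plain "next page" (下页) link, a small helper on the spider could turn it into another request with the same callback. The helper name and the link text are my assumptions to verify against the real markup, not part of the original project:

def get_next_page_request(self, soup, callback):
    # Look for the mobile site's "next page" (下页) link and, if present,
    # turn it into a Request with the same cookies/header and callback.
    next_link = soup.find('a', string=u'下页')
    if next_link is not None:
        return Request('https://weibo.cn' + next_link.get('href'),
                       cookies=account_cookies[0], meta=header,
                       callback=callback)
    return None

At the end of getfollowers you would then do something like:

next_request = self.get_next_page_request(soup, self.getfollowers)
if next_request is not None:
    self.task_in_queue = self.task_in_queue + 1
    yield next_request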
Here is the function that builds the item:

def get_pipeitem(self, response, item_type):
    pipe_item = PipeItem()
    pipe_item['item_type'] = item_type  # 0 from getfollowers, 1 from getfans, 2 from getusrinfo
    pipe_item['Datagram'] = []
    pipe_item['usr_id'] = str(re.search('[0-9]{10}', str(response.url)).group())
    return pipe_item

The fetched page is passed in, the UID of the user currently being crawled is extracted from the URL, and the item's other fields are initialized. It is also very simple.
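The item class itself is not shown in this part. A minimal sketch of a matching items.py, with the field names taken from the code above and everything else assumed:

# items.py (sketch) - fields match those used in get_pipeitem
import scrapy

class PipeItem(scrapy.Item):
    item_type = scrapy.Field()  # which page produced this fragment
    Datagram = scrapy.Field()   # list of UIDs, or the user-info strings
    usr_id = scrapy.Field()     # UID of the user the fragment belongs to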

Here's the function to get the fans:

def getfans(self, response):
    pipe_item = self.get_pipeitem(response, 1)
    self.task_in_queue = self.task_in_queue - 1
    soup = bs4.BeautifulSoup(response.body.decode(response.encoding), 'lxml')
    tag = soup.table
    while tag is not None:
        try:
            if tag.a is not None:
                uid = re.search('/u/[0-9]{10}', str(tag.a))
                if uid is not None:
                    pipe_item['Datagram'].append(str(uid.group())[3:])
            tag = tag.next_sibling
        except Exception as e:  # not a Tag but a NavigableString
            tag = tag.next_sibling
            continue
    yield pipe_item

The logic is the same as for the follow list, except that it does not submit any new crawl requests.

Here is the function to get the user information:

def getusrinfo(self, response):
    pipe_item = self.get_pipeitem(response, 2)
    self.task_in_queue = self.task_in_queue - 1
    soup = bs4.BeautifulSoup(response.body.decode(response.encoding), 'lxml')
    info = soup.find('span', class_='ctt')
    pipe_item['Datagram'].append(str(info.text).replace(u'\xa0', u'')[:-12])
    pipe_item['Datagram'].append(re.search(u'关注\[\d+\]', str(soup.text)).group()[3:-1])  # following count
    pipe_item['Datagram'].append(re.search(u'粉丝\[\d+\]', str(soup.text)).group()[3:-1])  # fans count
    pipe_item['Datagram'].append(re.search(u'微博\[\d+\]', str(soup.text)).group()[3:-1])  # weibo (post) count
    yield pipe_item

The one thing worth mentioning is str(info.text).replace(u'\xa0', u''): it strips the non-breaking spaces, otherwise printing the text while testing throws an error.
Beyond that there is nothing special to say.
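For example (illustrative values only, not real data), the count extraction works like this:

import re

# Made-up profile text; on the real page soup.text contains markers like these.
sample = u'昵称 关注[308] 粉丝[1024] 微博[56]'

m = re.search(u'关注\[\d+\]', sample)
print(m.group())        # 关注[308]
print(m.group()[3:-1])  # 308 - slice off the two-character word, '[' and the trailing ']'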

It's very simple.
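The pipeline and the MySQL side are not covered in this part. Purely as a sketch of where pymysql fits in (the table name, columns and connection details below are placeholders of my own, not from the original project), a pipeline that receives these items might look like this:

# pipelines.py (sketch) - assumes a local MySQL database 'weibo' with a table
# relations(usr_id, item_type, datagram); all names are placeholders.
import json
import pymysql

class WeiboPipeline(object):
    def open_spider(self, spider):
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='password', db='weibo',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Store every fragment keyed by usr_id so the fragments belonging to
        # the same user can be spliced together later.
        self.cursor.execute(
            'INSERT INTO relations (usr_id, item_type, datagram) VALUES (%s, %s, %s)',
            (item['usr_id'], item['item_type'], json.dumps(item['Datagram'])))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()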
