Python crawler: Douyu TV

Source: Internet
Author: User
Douyu Danmu Assistant

0. Foreword

A few days ago (before the winter break) I was idle and bored, and I saw my roommates watching Douyu TV. I am not very interested in those online games, but it suddenly occurred to me that it might be fun to capture the danmu (bullet-screen comments) scrolling across the stream.

1. Analysis Phase

To capture the danmu on the page, there are just two options:

1. Use the browser, either clicking by hand or automating it with a JS script, to get at the content I want.
2. Write an HTTP client (Douyu does not use HTTPS).

The first method can do anything in principle, but it is clearly not the right choice, for the following reasons:

1. Saving by hand is out of the question; programmers don't do that.
2. A browser is limited to local interaction. In other words, even if I capture the danmu there, I have no way to persist it. Chrome or Firefox could persist the data, but that requires writing an extension, and I have never written a Chrome extension and am not very interested in doing so.

The second approach is clearly the normal programmer's approach.

Language of choice: Ruby

The goal is to write a client, that is, a small crawler. The usage scenario:

The user runs these commands in a terminal:

    gem install danmu
    danmu douyu [room_id/url]
    # for example
    danmu douyu qiuri
    danmu douyu

Then the danmu can be enjoyed in the terminal.

Think about how to crawl a website.

Four steps: request the web page (raw data), extract the data (clean it), save the data, analyze the data.

Obviously, once requesting the page is solved, the rest is nothing more than parsing and SQL statements.

1.1. Deciding how to capture Douyu TV danmu

If it were as simple as described above, there would be no need to write another article. After all, an ordinary web crawler has little technical content; only distributed crawlers do.

A typical web crawler only has to solve the following problems:

Requests: if the other side has an anti-crawler strategy, you need to work around it. For example, send headers with Host, Referer, and whatever else is verified. If a login is required, register a user name and password and log in. If the login step is protected by a token, first fetch the login page, parse out the token, and then log in with it.

Logging: add logging to the program and save the logs locally, so that when some anti-crawler mechanism bans the program, you have what you need to plan the next countermeasure.

Moreover, thanks to the request-response mechanism, each request normally corresponds to one response, whether that is an error, a timeout, or a status code, so an ordinary web crawler is relatively easy.

So, is Douyu TV's site just as easy to crawl?

You guessed it: the answer is "no".

Because danmu is real-time, Douyu TV cannot deliver it as XML or JSON data covering a whole fixed time span (on Bilibili, for example, the danmu for a video lives in a single piece of XML). Otherwise, when the streamer pulls off a great play, the audience's danmu could not show up in real time.

Well then, it must be WebSocket. So, as always, I pressed F12 and looked at the network traffic.

As it turned out, there was no danmu traffic, and no sign of a WebSocket either.

The messages must be there, but they are not carried over HTTP or WebSocket. So how are they transmitted?

I went through the front-end code looking for the JS that fetches the danmu, but there was too much code, and after a long search I found nothing. That suggests the logic may live inside Flash.

So I brought out the big gun, Wireshark, and captured the traffic. At last the danmu showed up.

Sure enough, there it was.
(Figure: Douyutveachmsg.png)

It turns out that Flash's socket feature is being used.

So all we need to do is reproduce each message sent over the socket.

I analyzed several more sets of captured data, but I was still unsure about the content of the messages to send, especially for user authentication and for receiving danmu. After some searching I found a post that cleared up my confusion.

The address is: https://www.zhihu.com/question/29027665

Building on that post, the analysis of several messages is omitted here.

To summarize, this is how the servers of the Douyu TV website are laid out.
(Figure: Douyutvinfo.jpg)

1.2. Getting room information and the danmu authentication servers

First, take one streamer's room as an example, say qiuri.

The room link comes in two forms:

http://www.douyutv.com/qiuri
http://www.douyutv.com/[room ID]

When you request the streamer's room page, the useful information is not rendered into the HTML itself; instead it sits in JS scripts embedded in the HTML, which takes the HTML rendering load off the server. (I don't fully understand the details.) In short, the page is first loaded without concrete data, and then JS fills the data in.

There are two embedded JS scripts, each holding one variable that is easy to convert to JSON. That gives two pieces of JSON data: one with the streamer's personal information, the other with the list of danmu authentication servers. Any server in this list can be used for authentication, but every request for a streamer's page returns a different list.
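
As a rough sketch of this step, the JSON assigned to a script variable can be cut out of the page with a regular expression and then parsed. The variable names room_info and auth_servers, the field names, the sample page, and the helper extract_js_var are all made up for illustration; the names used by the real page have to be read from its source.

```ruby
require 'json'

# Hypothetical page fragment: variable names, fields, and addresses are
# assumptions for illustration, not the real Douyu page.
html = <<~HTML
  <script>
  var room_info = {"room_id": 156277, "owner": "qiuri"};
  var auth_servers = [{"ip": "192.0.2.10", "port": "8001"}, {"ip": "192.0.2.11", "port": "8002"}];
  </script>
HTML

# Cut the literal assigned to a JS variable out of the page and parse it as JSON.
# Returns nil when the variable is not found.
def extract_js_var(html, name)
  literal = html[/var\s+#{Regexp.escape(name)}\s*=\s*(.+?);\s*$/, 1]
  literal && JSON.parse(literal)
end

room    = extract_js_var(html, 'room_info')
servers = extract_js_var(html, 'auth_servers')
server  = servers.sample # any server in the list can be used

puts room['room_id']
puts "#{server['ip']}:#{server['port']}"
```

This only works when the assigned value is strict JSON and ends on one line; a page that uses single quotes or unquoted keys would need extra cleanup first.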

With this step we have the streamer's information and the address and port of a danmu authentication server.

1.3. Overview of the socket message flow

From the packet capture and an analysis of that large batch of data, we can determine that the following process obtains the danmu messages. (The analysis itself is fairly tedious.)

First, create two sockets: one for authentication (@danmu_auth_socket) and one for receiving the danmu (@danmu_socket).

Step 1: @danmu_auth_socket sends a login message. Parse the anonymous user's username out of the first reply, then parse the gid out of the second reply.
Step 2: @danmu_auth_socket sends a qrl message and receives two replies that are of no use.
Step 3: @danmu_auth_socket sends a keeplive message.
Step 4: @danmu_socket sends a pseudo login message (an anonymous user only needs the username from step 1, because authentication was already done above).
Step 5: @danmu_socket sends a joingroup message, which needs the gid from step 1.
Step 6: @danmu_socket keeps calling recv to receive the danmu messages.

Each step is explained in detail below.

2.1. Socket message format and sending a message

Since we are sending messages, each message must follow some format.

Douyu's message format is roughly as follows:
(Figure: Douyutveachmsg.png)

Each message follows this format:

1. Message length: the total length of the four parts that follow, stored in four bytes
2. The second part is identical to the first
3. Request code: 0xb1,0x02 for messages sent to Douyu; Douyu's replies use 0xb2,0x02
4. Content
5. End byte

# -*- encoding: utf-8 -*-
class Message # A message sent to Douyu
  # 1. Message length: the total length of the four parts that follow, four bytes
  # 2. The second part is identical to the first
  # 3. Request code: 0xb1,0x02 when sending to Douyu; Douyu replies with 0xb2,0x02
  # 4. Content
  # 5. End byte
  # pack turns an array of integers into a binary string: 'V' writes one
  # 32-bit little-endian integer, 'C*' writes each value as a single byte.
  def initialize(content)
    @content = content.b # raw bytes, so non-ASCII text is measured in bytes
    # 'V' yields the same four bytes as [size + 9, 0, 0, 0] for short
    # messages and stays correct once the length exceeds 255.
    @length = [@content.bytesize + 9].pack('V')
    @code = @length.dup
    @magic = [0xb1, 0x02, 0x00, 0x00].pack('C*')
    @end = [0x00].pack('C*')
  end

  def to_s
    @length + @code + @magic + @content + @end
  end
end
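
Going the other way, a received frame can be taken apart with the same layout. The following is a minimal sketch, assuming replies use exactly the five-part format described above; parse_frame is a hypothetical helper name and the sample body is invented for the example.

```ruby
# Split one frame into its parts, assuming the layout described above:
# length (4 bytes, little-endian), the same length again, a 4-byte code,
# the content, and a single end byte.
def parse_frame(raw)
  length, repeat = raw.unpack('VV')
  raise ArgumentError, 'length fields differ' unless length == repeat

  code    = raw.byteslice(8, 4).bytes
  content = raw.byteslice(12, length - 9).force_encoding('UTF-8')
  { length: length, code: code, content: content }
end

# Build a sample reply by hand (code 0xb2,0x02); the body is made up.
body  = 'username@=visitor_1/nickname@=visitor_1/'
frame = [body.bytesize + 9].pack('V') * 2 +
        [0xb2, 0x02, 0x00, 0x00].pack('C*') + body.b + "\x00".b

parsed = parse_frame(frame)
puts parsed[:content]
```

Note that a single recv call on a TCP socket may return part of a frame or several frames at once, so a real client would likely need to buffer the bytes and cut frames using the length field.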

With this wrapper in place, we only need to care about the readable string, that is, the content part.
The content part, the actual text of each message, is described in detail below.

Open two sockets: one for user authentication, the other for receiving the danmu.

The authentication socket connects to one entry from the list of authentication servers mentioned in 1.2. Pick any IP and port pair:

    @danmu_auth_socket = TCPSocket.new @auth_dst_ip, @auth_dst_port

The socket that receives the danmu can use any of

    danmu.douyutv.com:8601
    danmu.douyutv.com:8602
    danmu.douyutv.com:12601
    danmu.douyutv.com:12602

Any of these four host:port pairs can serve as danmu_server and danmu_port below:

    @danmu_socket = TCPSocket.new danmu_server, danmu_port

Sending a message is then just:

    data = "type@=loginreq/username@=" + @username + "/password@=1234567890123456/roomid@=" + @room_id.to_s + "/"
    all_data = Message.new(data).to_s
    @danmu_socket.write all_data

All that is needed is to put in the string to be transmitted.

Next, we work through the six steps listed above.

2.2. Sending messages in detail: step one

The message content is:

type@=loginreq/username@=/ct@=0/password@=/roomid@=156277/devid@=df9e4515e0ee766b39f8d8a2e928bb7c/rt@=1453795822/vk@=4fc6e613fc650a058757331ed6c8a619/ver@=20150929/

The points to note are:

type: the message type; loginreq for a login message
username: not required; after the login request the system returns a visitor account automatically
ct: meaning unclear; the default of 0 works
password: not required
roomid: the room ID
devid: the device identifier; a randomly generated UUID is used here
rt: presumably "runtime"; a timestamp
vk: the MD5 of the string timestamp + "7OE9NPEG9XXV69PHU31FYCLUAGKEYTSF" + devid (this comes from the post referenced above; I don't really know how it was worked out)
ver: leave at the default
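
A minimal sketch of assembling this login content, assuming the field rules listed above. The helper name login_request and the constant VK_SECRET are hypothetical; the secret string is copied from the text as it appears here and may not match the original byte for byte, so the computed vk is not guaranteed to be accepted by the server.

```ruby
require 'digest/md5'
require 'securerandom'

# Secret string as it appears in the text above (possibly altered in transcription).
VK_SECRET = '7OE9NPEG9XXV69PHU31FYCLUAGKEYTSF'

# Build the content of the first login message.
# rt is a Unix timestamp, devid a random UUID without hyphens,
# vk the MD5 hex digest of rt + secret + devid.
def login_request(room_id, rt: Time.now.to_i, devid: SecureRandom.uuid.delete('-'))
  vk = Digest::MD5.hexdigest("#{rt}#{VK_SECRET}#{devid}")
  "type@=loginreq/username@=/ct@=0/password@=/roomid@=#{room_id}" \
    "/devid@=#{devid}/rt@=#{rt}/vk@=#{vk}/ver@=20150929/"
end

puts login_request(156277)
```

The keyword arguments exist only so that the timestamp and device ID can be fixed when testing; in normal use both defaults are generated fresh for each call.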

In this step we receive two messages and use regular expressions to pull the username and the gid out of them:

    str = @danmu_auth_socket.recv(4000)
    @username = str[/\/username@=(.+)\/nickname/, 1]
    str = @danmu_auth_socket.recv(4000)
    @gid = str[/\/gid@=(\d+)\//, 1]
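
As an alternative to one regular expression per field, the key@=value/ pairs can be split into a hash. This is only a sketch: parse_fields is a hypothetical helper, the sample string is invented, and values that themselves contain / or @= are not handled.

```ruby
# Turn "key@=value/key@=value/" content into a Hash of strings.
def parse_fields(content)
  content.split('/').each_with_object({}) do |pair, fields|
    key, value = pair.split('@=', 2)
    fields[key] = value unless value.nil? # skip pieces without "@="
  end
end

# Made-up sample content in the same shape as the messages above.
sample = 'type@=example/username@=visitor_1/nickname@=visitor_1/gid@=42/'
fields = parse_fields(sample)

puts fields['username']
puts fields['gid']
```

With a hash in hand, the username and gid lookups no longer depend on the order in which the server happens to send the fields.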

2.3. Sending messages in detail: step two

The message sent is:

"type@=qrl/rid@=" + @room_id.to_s + "/"

Little needs saying: type is qrl and rid is the room ID. Send the message as is. The two messages that come back are of little value.

    data = "type@=qrl/rid@=" + @room_id.to_s + "/"
    msg = Message.new(data).to_s
    @danmu_auth_socket.write msg
    str = @danmu_auth_socket.recv(4000)
    str = @danmu_auth_socket.recv(4000)

2.4. Sending messages in detail: step three

The message sent is:

"type@=keeplive/tick@=" + timestamp + "/vbw@=0/k@=19beba41da8ac2b4c7895a66cab81e23/"

Send it as is; there is not much to it.

    data = "type@=keeplive/tick@=" + timestamp + "/vbw@=0/k@=19beba41da8ac2b4c7895a66cab81e23/"
    msg = Message.new(data).to_s
    @danmu_auth_socket.write msg
    str = @danmu_auth_socket.recv(4000)

The first three steps (2.2, 2.3, and 2.4) use @danmu_auth_socket to obtain the username and the gid. Once these two fields are in hand, its job is done.

The next step is to use @danmu_socket to fetch the danmu.

2.5. Sending messages in detail: step four

The message content is: "type@=loginreq/username@=" + @username + "/password@=1234567890123456/roomid@=" + @room_id.to_s + "/"

This differs slightly from the message in 2.2. Note that:

  username is set to the username obtained in 2.2, and password is set to the fixed string 1234567890123456

    data = "type@=loginreq/username@=" + @username + "/password@=1234567890123456/roomid@=" + @room_id.to_s + "/"
    all_data = Message.new(data).to_s
    @danmu_socket.write all_data
    str = @danmu_socket.recv(4000)

2.6. Sending messages in detail: step five

The next step completes the final authentication. The joingroup message content is:

"type@=joingroup/rid@=" + @room_id.to_s + "/gid@=" + @gid + "/"

The gid is the one obtained in 2.2.
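
Following the pattern of the earlier steps, sending this message might look like the sketch below. The helpers frame and join_group are hypothetical names; frame restates the message format from 2.1 so that the example is self-contained, and a StringIO stands in for @danmu_socket so that it runs without a network connection.

```ruby
require 'stringio'

# Wrap content in the format from 2.1: length twice (32-bit little-endian),
# the request code 0xb1,0x02,0x00,0x00, the content, and a 0x00 end byte.
def frame(content)
  body = content.b
  len  = [body.bytesize + 9].pack('V')
  len + len + [0xb1, 0x02, 0x00, 0x00].pack('C*') + body + "\x00".b
end

# Send the joingroup message over any object that responds to write.
def join_group(io, room_id, gid)
  io.write frame("type@=joingroup/rid@=#{room_id}/gid@=#{gid}/")
end

io = StringIO.new(''.b) # stand-in for @danmu_socket; the gid value is made up
join_group(io, 156277, '42')
puts io.string.bytesize
```

With a real connection, io would be the TCPSocket opened for receiving the danmu, and the recv loop of step six would follow this call.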
