BT website Osho Magnetic: a crawler developed in Python instead of .NET

Source: Internet
Author: User
Tags: compact, sha1, unique id

The Osho Magnetic BT website now uses a crawler written in Python instead of .NET. This article mainly demonstrates access speed and indexing efficiency on roughly 10 million hash records.

Osho Magnetic Download (http://www.oshoh.com) now runs on Python and CentOS 7.

Osho Magnetic Download (www.oshoh.com) has gone through several technical changes. The open-source version was rewritten with the Django web framework; before that it used Flask, and earlier still Tornado. Movie FM also used Tornado, and it later turned out that Tornado does not suit every scenario. Django is a good fit for content-based websites, although it takes longer to learn than the other frameworks. The early database was MongoDB, because reading and writing data from Python is convenient and there is no need to define a data structure up front. Search was implemented as keyword search, but as the number of resources grew, performance clearly could not keep up. This year MongoDB switched to the WiredTiger engine, but the full-text search it brings is still not good enough. Amazon CloudSearch is a trap for the budget: its performance really is very good, but it is expensive, so only those with deep pockets should consider it. In the end the site settled on Sphinx search, and the database was replaced with MySQL (MyISAM engine); the two work together conveniently. Sphinx builds full-text indexes very quickly, and the project's own benchmarks rate it highly. In my own test, indexing 10 million resources (about 3 GB) completed in about 1 minute. If you do not believe it, test it yourself.

The principle is described by the BitTorrent DHT protocol. BitTorrent uses a distributed sloppy hash table (DHT) to store peer contact information for trackerless torrents. In effect, each peer becomes a tracker. The protocol is based on Kademlia and is implemented over UDP.

Please note the terminology used in this article to avoid confusion. A peer is a client/server that listens on a TCP port and implements the BitTorrent protocol. A node is a client/server that listens on a UDP port and implements the DHT protocol. The DHT is composed of nodes and stores the locations of peers. A BitTorrent client includes a DHT node, which it uses to contact other nodes in the DHT and obtain the locations of peers to download from using the BitTorrent protocol.

Overview

Each node has a globally unique identifier, called the node ID. Node IDs are chosen at random from the same 160-bit space as BitTorrent infohashes. A distance metric is used to compare two node IDs, or a node ID and an infohash, for closeness. Each node must maintain a routing table containing the contact information for a small number of other nodes. The routing table becomes more detailed as IDs get closer to the node's own ID: a node knows many nodes whose IDs are close to its own and only a few whose IDs are far away.

In Kademlia, the distance metric is XOR, and the result is interpreted as an unsigned integer:
distance(A, B) = |A xor B|
The smaller the value, the closer the distance.
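The distance metric can be sketched in a few lines of Python. This is a minimal illustration, assuming IDs are handled as 20-byte strings in network byte order; the function name and the sample IDs are made up for the example.

```python
def xor_distance(a: bytes, b: bytes) -> int:
    """Kademlia distance between two 160-bit IDs, as an unsigned integer."""
    if len(a) != 20 or len(b) != 20:
        raise ValueError("IDs must be exactly 20 bytes (160 bits)")
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")


# Made-up IDs: node_b differs from node_a in the lowest bit, node_c in the highest.
node_a = bytes(20)
node_b = bytes(19) + b"\x01"
node_c = b"\x80" + bytes(19)

print(xor_distance(node_a, node_b))  # 1
print(xor_distance(node_a, node_b) < xor_distance(node_a, node_c))  # True
```

Because the result is a plain integer, sorting contacts by closeness to an infohash is an ordinary sort on this value.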

When a node wants to find peers for a torrent, it uses the distance metric to compare the torrent's infohash with the IDs of the nodes in its own routing table. It then asks the nodes whose IDs are closest to the infohash for the contact information of peers currently downloading the torrent. If a contacted node knows about peers for the torrent, it returns their contact information in the response. Otherwise, it must respond with the contact information of the nodes in its routing table that are closest to the infohash. The original node repeatedly queries nodes that are closer to the infohash until it cannot find any closer nodes. When the search is complete, the client registers its own peer contact information with the responding nodes whose IDs are closest to the infohash.

The return value of a query for peers includes an opaque value known as the token. For a node to announce that its controlling peer is downloading a torrent, it must present the token it received from the same queried node in a recent query for peers. When a node attempts to announce a torrent, the queried node checks the token against the querying node's IP address. This prevents malicious hosts from signing up other hosts for torrents. Because the token is simply returned by the querying node to the same node it received the token from, its implementation is not specified. Tokens must be accepted for a reasonable amount of time after they have been distributed. The BitTorrent implementation uses the SHA1 hash of the IP address concatenated with a secret that changes every five minutes, and tokens up to ten minutes old are accepted.
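The token scheme described above could look roughly like the following sketch. It assumes a 16-byte random secret and keeps the previous secret next to the current one so that older tokens are still accepted; the class and method names are hypothetical, and the protocol itself leaves the implementation open.

```python
import hashlib
import hmac
import os
import time

ROTATE_SECONDS = 5 * 60  # the secret changes every five minutes


class TokenIssuer:
    """Issues tokens as SHA1(ip + secret) and accepts current and previous secrets."""

    def __init__(self, now=None):
        self._current = os.urandom(16)
        self._previous = self._current
        self._rotated_at = time.time() if now is None else now

    def _rotate(self, now):
        steps = int((now - self._rotated_at) // ROTATE_SECONDS)
        if steps >= 1:
            # After two or more periods even the old "current" secret has expired.
            self._previous = self._current if steps == 1 else os.urandom(16)
            self._current = os.urandom(16)
            self._rotated_at += steps * ROTATE_SECONDS

    @staticmethod
    def _make(ip, secret):
        return hashlib.sha1(ip.encode("ascii") + secret).digest()

    def issue(self, ip, now=None):
        self._rotate(time.time() if now is None else now)
        return self._make(ip, self._current)

    def is_valid(self, ip, token, now=None):
        self._rotate(time.time() if now is None else now)
        candidates = (self._make(ip, self._current), self._make(ip, self._previous))
        return any(hmac.compare_digest(token, c) for c in candidates)


# Explicit timestamps (in seconds) keep the example deterministic.
issuer = TokenIssuer(now=0)
token = issuer.issue("10.0.0.1", now=0)
print(issuer.is_valid("10.0.0.1", token, now=301))   # True: previous secret still accepted
print(issuer.is_valid("10.0.0.2", token, now=301))   # False: different IP address
print(issuer.is_valid("10.0.0.1", token, now=1000))  # False: both secrets have rotated out
```

With this layout the queried node stores no per-token state: validity follows from the sender's IP address and the two most recent secrets.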

Routing Table

Each node maintains a routing table of known good nodes. The nodes in the routing table are used as starting points for queries in the DHT, and they are returned in response to queries from other nodes.

Not all nodes that we learn about are equal. Some are good and some are not. Many nodes using the DHT are able to send queries and receive responses, but are not able to respond to queries from other nodes. It is important that each node's routing table contain only known good nodes. A good node is one that has responded to one of our queries within the last 15 minutes. A node is also good if it has ever responded to one of our queries and has sent us a query within the last 15 minutes. After 15 minutes of inactivity, a node becomes questionable. A node becomes bad when it fails to respond to several queries in a row. Nodes that we know are good are given priority over nodes with unknown status.
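The good/questionable/bad rules can be expressed as a small classification function. The sketch below is only an illustration: the `Contact` fields are hypothetical, and the failure threshold of 2 is an assumption, since the text only says "several".

```python
from dataclasses import dataclass
from typing import Optional

GOOD_WINDOW = 15 * 60  # seconds
MAX_FAILURES = 2       # assumption: the text does not give an exact number


@dataclass
class Contact:
    last_response: Optional[float] = None  # when it last answered one of our queries
    last_query: Optional[float] = None     # when it last sent us a query
    failed_queries: int = 0                # consecutive unanswered queries


def status(contact: Contact, now: float) -> str:
    if contact.failed_queries >= MAX_FAILURES:
        return "bad"
    if contact.last_response is not None:
        if now - contact.last_response <= GOOD_WINDOW:
            return "good"
        # It has responded at some point and has queried us recently.
        if contact.last_query is not None and now - contact.last_query <= GOOD_WINDOW:
            return "good"
    return "questionable"


print(status(Contact(last_response=100.0), now=200.0))                      # good
print(status(Contact(last_response=0.0, last_query=1000.0), now=1500.0))    # good
print(status(Contact(last_query=1000.0), now=1100.0))                       # questionable
print(status(Contact(last_response=100.0, failed_queries=2), now=200.0))    # bad
```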

The routing table covers the entire node ID space from 0 to 2^160. The routing table is subdivided into buckets that each cover a portion of the space. An empty table has one bucket with an ID range of min=0, max=2^160. When a node with ID N is inserted into the table, it is placed in the bucket that has min <= N < max. An empty table has only one bucket, so any node must fit in it. Each bucket can hold at most K nodes, currently eight, before becoming full. When a bucket is full of known good nodes, no more nodes may be added unless our own node ID falls within the range of the bucket. In that case, the bucket is replaced by two new buckets, each covering half the range of the old bucket, and the nodes from the old bucket are distributed between the two new ones. For a new table with only one bucket, the full bucket is always split into two new buckets covering the ranges 0..2^159 and 2^159..2^160.
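The bucket-splitting rule is easier to follow in code. The following is a simplified sketch that treats every stored node as good and keeps IDs as plain integers; the class names are hypothetical, and replacement of bad or questionable nodes is left out.

```python
K = 8
ID_SPACE = 2 ** 160


class Bucket:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi  # covers lo <= id < hi
        self.nodes = []            # node IDs as integers

    def covers(self, node_id):
        return self.lo <= node_id < self.hi


class RoutingTable:
    def __init__(self, own_id):
        self.own_id = own_id
        self.buckets = [Bucket(0, ID_SPACE)]

    def _bucket_for(self, node_id):
        return next(b for b in self.buckets if b.covers(node_id))

    def _split(self, bucket):
        mid = (bucket.lo + bucket.hi) // 2
        low, high = Bucket(bucket.lo, mid), Bucket(mid, bucket.hi)
        for n in bucket.nodes:
            (low if low.covers(n) else high).nodes.append(n)
        i = self.buckets.index(bucket)
        self.buckets[i:i + 1] = [low, high]

    def insert(self, node_id):
        """Return True if the node is stored, False if it is discarded."""
        bucket = self._bucket_for(node_id)
        if node_id in bucket.nodes:
            return True
        while len(bucket.nodes) >= K:
            # A full bucket may only be split if it contains our own ID.
            if not bucket.covers(self.own_id):
                return False
            self._split(bucket)
            bucket = self._bucket_for(node_id)
        bucket.nodes.append(node_id)
        return True


table = RoutingTable(own_id=0)
base = 2 ** 159
first_eight = [table.insert(base + i) for i in range(8)]  # fills the single bucket
ninth = table.insert(base + 8)   # forces a split, then lands in a full far bucket
near = table.insert(5)           # fits in the new bucket that contains own_id
print(first_eight, ninth, near, len(table.buckets))
```

In this run the first split produces exactly the two ranges named in the text, 0..2^159 and 2^159..2^160, and the ninth distant node is discarded because its bucket does not contain the table's own ID.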

When a bucket is full of good nodes, a new node is simply discarded. If any node in the bucket is known to have become bad, it is replaced by the new node. If the bucket contains questionable nodes that have not been seen in the last 15 minutes, the least recently seen node is pinged. If the pinged node responds, the next least recently seen questionable node is pinged, and so on until one fails to respond or all of the nodes in the bucket are known to be good. If a node in the bucket fails to respond to a ping, it is suggested to try once more before discarding the node and replacing it with a new good node. In this way, the table fills with stable, long-running nodes.

Each bucket should maintain a "last changed" property to indicate how fresh its contents are. When a node in a bucket is pinged and responds, or a node is added to a bucket, or a node in a bucket is replaced by another node, the bucket's last changed property should be updated. A bucket that has not changed in 15 minutes should be refreshed by picking a random ID in the range it covers and performing a find_node search on it. Nodes that are able to receive queries from other nodes usually do not need to refresh buckets often. Nodes that cannot receive queries from other nodes usually need to refresh all buckets periodically to ensure there are good nodes in the table when the DHT is needed.
After inserting the first node into its routing table, and whenever it starts up thereafter, a node should attempt to find the nodes in the DHT that are closest to itself. It does this by issuing find_node messages to closer and closer nodes until it cannot find any closer. The routing table should be saved between invocations of the client software.
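The bucket refresh rule above comes down to two small helpers: one that decides whether a bucket is stale, and one that picks a random target ID inside the bucket's range. This is a sketch with hypothetical function names; sending the actual find_node queries is not shown.

```python
import random

REFRESH_AFTER = 15 * 60  # seconds without change before a bucket is refreshed


def needs_refresh(last_changed: float, now: float) -> bool:
    return now - last_changed >= REFRESH_AFTER


def random_id_in_range(lo: int, hi: int) -> bytes:
    """Pick a random 20-byte ID with lo <= id < hi, to use as a find_node target."""
    return random.randrange(lo, hi).to_bytes(20, "big")


# Example: a bucket covering the upper half of the ID space, unchanged for 20 minutes.
target = random_id_in_range(2 ** 159, 2 ** 160)
print(needs_refresh(last_changed=0.0, now=20 * 60), target.hex())
```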

BitTorrent Protocol Extensions

The BitTorrent protocol has been extended so that peers introduced by a tracker can exchange their nodes' UDP port numbers. In this way, a client's routing table is seeded automatically while it downloads regular torrents. A newly installed client that attempts to download a trackerless torrent on its first try has no nodes in its routing table and needs the contacts included in the torrent file.

Peers that support the DHT set the last bit of the 8-byte reserved flags exchanged in the BitTorrent protocol handshake. A peer that receives a handshake indicating that the remote peer supports the DHT should send a PORT message. The message begins with the byte 0x09 and has a two-byte payload containing the UDP port of the DHT node in network byte order. A peer that receives this message should attempt to ping the node at the received port and the IP address of the remote peer. If a response to the ping is received, the node should attempt to insert the new contact information into its routing table according to the usual rules.
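A minimal sketch of the handshake flag and the PORT message follows. The 4-byte length prefix in front of the message ID comes from the standard BitTorrent peer-wire framing and is an assumption here, since the text above only describes the 0x09 byte and the two-byte port; the function names are hypothetical.

```python
import struct


def dht_reserved_bytes() -> bytes:
    reserved = bytearray(8)
    reserved[7] |= 0x01  # last bit of the 8 reserved bytes signals DHT support
    return bytes(reserved)


def supports_dht(reserved: bytes) -> bool:
    return len(reserved) == 8 and bool(reserved[7] & 0x01)


def build_port_message(udp_port: int) -> bytes:
    # Assumed framing: 4-byte length (3), 1-byte message ID (0x09), 2-byte port.
    return struct.pack(">IBH", 3, 0x09, udp_port)


def parse_port_message(message: bytes) -> int:
    length, message_id, port = struct.unpack(">IBH", message)
    if length != 3 or message_id != 0x09:
        raise ValueError("not a PORT message")
    return port


message = build_port_message(6881)
print(supports_dht(dht_reserved_bytes()), message.hex(), parse_port_message(message))
```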

Torrent file extension

A trackerless torrent dictionary does not have an "announce" key. Instead, it has a "nodes" key. This key should be set to the K closest nodes in the routing table of the client that generated the torrent. Alternatively, the key can be set to a known good node, such as one operated by the person who generated the torrent. Please do not automatically add "router.bittorrent.com" to torrent files, and do not automatically add this node to clients' routing tables.

    nodes = [["...

KRPC protocol

The KRPC protocol is a simple RPC mechanism consisting of bencoded dictionaries sent over UDP. A single query packet is sent out and a single packet is sent in response; there is no retry. There are three message types: query, response, and error. For the DHT protocol, there are four queries: ping, find_node, get_peers, and announce_peer.

A KRPC message is a single dictionary with two keys common to every message and additional keys depending on the message type. Every message has a key "t" with a string value representing a transaction ID. The transaction ID is generated by the querying node and echoed in the response, so that responses can be correlated with multiple queries to the same node. The other key contained in every KRPC message is "y", whose value is a single-character string describing the type of message. The value of "y" is one of "q" for a query, "r" for a response, or "e" for an error.

Contact Encoding

Contact information for peers is encoded as a 6-byte string, also known as "compact IP-address/port info": the 4-byte IP address followed by the 2-byte port number, both in network byte order.

Contact information for nodes is encoded as a 26-byte string, also known as "compact node info": the 20-byte node ID followed by the compact IP-address/port info, again in network byte order.
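Both compact encodings map directly onto Python's `socket` and `struct` modules. The helper names below are hypothetical and the example assumes IPv4 addresses only.

```python
import socket
import struct


def pack_peer(ip: str, port: int) -> bytes:
    """6 bytes: 4-byte IPv4 address followed by a 2-byte port, network byte order."""
    return socket.inet_aton(ip) + struct.pack(">H", port)


def unpack_peer(data: bytes):
    if len(data) != 6:
        raise ValueError("compact peer info must be 6 bytes")
    return socket.inet_ntoa(data[:4]), struct.unpack(">H", data[4:])[0]


def pack_node(node_id: bytes, ip: str, port: int) -> bytes:
    if len(node_id) != 20:
        raise ValueError("node ID must be 20 bytes")
    return node_id + pack_peer(ip, port)


def unpack_nodes(data: bytes):
    """Split a concatenated "nodes" string into (node_id, ip, port) tuples."""
    if len(data) % 26 != 0:
        raise ValueError("compact node info must be a multiple of 26 bytes")
    return [
        (data[i:i + 20],) + unpack_peer(data[i + 20:i + 26])
        for i in range(0, len(data), 26)
    ]


blob = pack_node(b"abcdefghij0123456789", "127.0.0.1", 6881)
print(pack_peer("127.0.0.1", 6881).hex())  # 7f0000011ae1
print(unpack_nodes(blob))
```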

Request

Queries, that is, KRPC messages with a "y" value of "q", contain two additional keys: "q" and "a". The value of "q" is a string containing the method name of the query. The value of "a" is a dictionary containing the named arguments to the query.

Response

Responses, that is, KRPC messages with a "y" value of "r", contain one additional key, "r". The value of "r" is a dictionary containing the named return values. A response message is sent upon successful completion of a query.

Error

Errors, that is, KRPC messages with a "y" value of "e", contain one additional key, "e". The value of "e" is a list. The first element is an integer representing the error code. The second element is a string containing the error message. An error is sent when a query cannot be fulfilled. The following table lists the errors that may occur:

201 Generic error
202 Server error
203 Protocol error, such as a malformed packet, invalid arguments, or a bad token
204 Method unknown

Example error packet

    generic error = {'t': 0, 'y': 'e', 'e': [201, 'A Generic Error Ocurred']}
    bencoded = d1:eli201e23:A Generic Error Ocurrede1:ti0e1:y1:ee
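To read an error packet like the one above, a receiver first has to decode the bencoded bytes. The decoder below is a minimal sketch without strict validation; the function names are hypothetical. Byte strings are returned as `bytes`, so dictionary keys are `b"e"`, `b"t"`, and `b"y"`.

```python
def bdecode(data: bytes):
    value, end = _decode(data, 0)
    if end != len(data):
        raise ValueError("trailing data after bencoded value")
    return value


def _decode(data: bytes, i: int):
    lead = data[i:i + 1]
    if lead == b"i":                      # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if lead == b"l":                      # list: l<items>e
        items, i = [], i + 1
        while data[i:i + 1] != b"e":
            item, i = _decode(data, i)
            items.append(item)
        return items, i + 1
    if lead == b"d":                      # dictionary: d<key><value>...e
        result, i = {}, i + 1
        while data[i:i + 1] != b"e":
            key, i = _decode(data, i)
            value, i = _decode(data, i)
            result[key] = value
        return result, i + 1
    if lead.isdigit():                    # byte string: <length>:<bytes>
        colon = data.index(b":", i)
        start = colon + 1
        length = int(data[i:colon])
        return data[start:start + length], start + length
    raise ValueError("invalid bencoded data at offset %d" % i)


packet = b"d1:eli201e23:A Generic Error Ocurrede1:ti0e1:y1:ee"
message = bdecode(packet)
if message[b"y"] == b"e":
    code, text = message[b"e"]
    print(code, text.decode("ascii"))
```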

DHT Queries

All queries have an "id" key whose value is the node ID of the querying node. All responses have an "id" key whose value is the node ID of the responding node.

ping

The most basic query is ping: "q" = "ping". A ping query has a single argument, "id", whose value is a 20-byte string containing the sender's node ID in network byte order. The appropriate response has a single key, "id", containing the node ID of the responding node.

    arguments: {"id": "<querying node's id>"}
    response: {"id": "<queried node's id>"}

Example packets

    ping query = {"t": "0", "y": "q", "q": "ping", "a": {"id": "abcdefghij0123456789"}}
    bencoded = d1:ad2:id20:abcdefghij0123456789e1:q4:ping1:t1:01:y1:qe

    response = {"t": "0", "y": "r", "r": {"id": "mnopqrstuvwxyz123456"}}
    bencoded = d1:rd2:id20:mnopqrstuvwxyz123456e1:t1:01:y1:re
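The bencoded form of the ping query above can be reproduced with a small encoder. This is a sketch that assumes all strings and dictionary keys are passed as `bytes`; the function name is hypothetical. Bencoded dictionaries store their keys in sorted order, which is why "a" comes first in the output.

```python
def bencode(value) -> bytes:
    if isinstance(value, bool):
        raise TypeError("bencode has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        if not all(isinstance(key, bytes) for key in value):
            raise TypeError("dictionary keys must be bytes")
        # Keys are unique, so sorting the (key, value) pairs sorts by key.
        parts = [bencode(k) + bencode(v) for k, v in sorted(value.items())]
        return b"d" + b"".join(parts) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


ping_query = {
    b"t": b"0",
    b"y": b"q",
    b"q": b"ping",
    b"a": {b"id": b"abcdefghij0123456789"},
}
print(bencode(ping_query))
# b'd1:ad2:id20:abcdefghij0123456789e1:q4:ping1:t1:01:y1:qe'
```

The resulting bytes are what would be sent in a single UDP datagram to the node being pinged.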

find_node

find_node is used to find the contact information for a node given its ID: "q" = "find_node". A find_node query has two arguments: "id", containing the node ID of the querying node, and "target", containing the ID of the node sought by the querier. When a node receives a find_node query, it should respond with a "nodes" key whose value is a string containing the compact node info for the target node or for the K (8) closest good nodes in its own routing table.

    arguments: {"id": "<querying node's id>", "target": "<id of target node>"}
    response: {"id": "<queried node's id>", "nodes": "<compact node info>"}

Example packets

    find_node query = {'t': 0, 'y': 'q', 'q': 'find_node', 'a': {'id': 'abcdefghij0123456789', 'target': 'mnopqrstuvwxyz123456'}}
    bencoded = d1:ad2:id20:abcdefghij01234567896:target20:mnopqrstuvwxyz123456e1:q9:find_node1:ti0e1:y1:qe

    response = {'t': 0, 'y': 'r', 'r': {'id': '0123456789abcdefghij', 'nodes': 'def456...'}}
    bencoded = d1:rd2:id20:0123456789abcdefghij5:nodes9:def456...e1:ti0e1:y1:re
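On the responding side, answering find_node means picking the K contacts closest to the target by XOR distance. The sketch below assumes contacts are kept as `(node_id, ip, port)` tuples in a flat list rather than in buckets; the function name and the sample contacts are made up.

```python
import heapq

K = 8


def closest_nodes(known, target: bytes, k: int = K):
    """Return up to k (node_id, ip, port) contacts closest to target by XOR distance."""
    t = int.from_bytes(target, "big")
    return heapq.nsmallest(k, known, key=lambda c: int.from_bytes(c[0], "big") ^ t)


# Twenty made-up contacts whose IDs are a single byte value repeated 20 times.
known = [(bytes([i]) * 20, "10.0.0.%d" % i, 6881) for i in range(1, 21)]
target = bytes([1]) * 20

result = closest_nodes(known, target)
print([contact[0][0] for contact in result])  # [1, 3, 2, 5, 4, 7, 6, 9]
```

The order shows a property of the XOR metric: ID byte 3 is closer to 1 than 2 is, because 3 xor 1 = 2 while 2 xor 1 = 3.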

get_peers

get_peers retrieves peers associated with a torrent infohash: "q" = "get_peers". A get_peers query has two arguments: "id", containing the node ID of the querying node, and "info_hash", containing the infohash of the torrent. If the queried node has peers for the infohash, it returns them under the key "values" as a list of strings, each containing the compact IP-address/port info for a single peer. If the queried node has no peers for the infohash, it returns under the key "nodes" the K nodes in its routing table that are closest to the infohash. In either case, a "token" key is also included in the return value. The token value is required for a future announce_peer query.

    arguments: {"id": "<querying node's id>", "info_hash": "<20-byte infohash of target torrent>"}
    response: {"id": "<queried node's id>", "values": ["<compact peer info string>"]}
    or: {"id": "<queried node's id>", "nodes": "<compact node info>"}

Example packets

    get_peers query = {'t': 0, 'y': 'q', 'q': 'get_peers', 'a': {'id': 'abcdefghij0123456789', 'info_hash': 'mnopqrstuvwxyz123456'}}
    bencoded = d1:ad2:id20:abcdefghij01234567899:info_hash20:mnopqrstuvwxyz123456e1:q9:get_peers1:ti0e1:y1:qe

    response with peers = {'t': 0, 'y': 'r', 'r': {'id': 'abcdefghij0123456789', 'token': 'aoeusnth', 'values': ['axje.uidhtnmbrl']}}
    bencoded = d1:rd2:id20:abcdefghij01234567895:token8:aoeusnth6:valuesl15:axje.uidhtnmbrlee1:ti0e1:y1:re

    response with closest nodes = {'t': 0, 'y': 'r', 'r': {'id': 'abcdefghij0123456789', 'token': 'aoeusnth', 'nodes': 'def456...'}}
    bencoded = d1:rd2:id20:abcdefghij01234567895:nodes9:def456...5:token8:aoeusnthe1:ti0e1:y1:re
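The choice between "values" and "nodes" in a get_peers response can be sketched as follows. The function and parameter names are hypothetical: `peer_store` is assumed to map an infohash to a list of 6-byte compact peer strings, and `closest_nodes` is assumed to be an already encoded compact node info string.

```python
def get_peers_reply(own_id, info_hash, peer_store, closest_nodes, token):
    """Build the "r" dictionary of a get_peers response."""
    reply = {b"id": own_id, b"token": token}  # the token is included in either case
    peers = peer_store.get(info_hash)
    if peers:
        reply[b"values"] = list(peers)
    else:
        reply[b"nodes"] = closest_nodes
    return reply


own_id = b"abcdefghij0123456789"
store = {b"mnopqrstuvwxyz123456": [b"axje.u", b"idhtnm"]}

known = get_peers_reply(own_id, b"mnopqrstuvwxyz123456", store, b"", b"aoeusnth")
# b"def456..." is the placeholder from the samples above, not real compact node info.
unknown = get_peers_reply(own_id, b"x" * 20, store, b"def456...", b"aoeusnth")

print(sorted(known))
print(sorted(unknown))
```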

announce_peer

announce_peer announces that the peer controlling the querying node is downloading a torrent on a port. It has four arguments: "id", containing the node ID of the querying node; "info_hash", containing the infohash of the torrent; "port", containing the port as an integer; and "token", the token received in the response to a previous get_peers query. The queried node must verify that the token was previously sent to the same IP address as the querying node. The queried node should then store the IP address of the querying node and the supplied port number under the infohash in its store of peer contact information.

    arguments: {"id": "<querying node's id>", "info_hash": "<20-byte infohash of target torrent>", "port": <port number>, "token": "<opaque token>"}
    response: {"id": "<queried node's id>"}

Example packets

    announce_peer query = {'t': 0, 'y': 'q', 'q': 'announce_peer', 'a': {'id': 'abcdefghij0123456789', 'info_hash': 'mnopqrstuvwxyz123456', 'port': 6881, 'token': 'aoeusnth'}}
    bencoded = d1:ad2:id20:abcdefghij01234567899:info_hash20:mnopqrstuvwxyz1234564:porti6881e5:token8:aoeusnthe1:q13:announce_peer1:ti0e1:y1:qe

    response = {"t": "0", "y": "r", "r": {"id": "mnopqrstuvwxyz123456"}}
    bencoded = d1:rd2:id20:mnopqrstuvwxyz123456e1:t1:01:y1:re

Footnotes

    1. "Kademlia: A Peer-to-peer Information System Based on the XOR Metric",
      Petar Maymounkov and David Mazières.
    2. Use SHA1 and plenty of entropy to ensure a unique ID.

Original address: http://www.bittorrent.org/Draft_DHT_protocol.html

Reference: http://www.protocol.com.cn/archiver/tid-7852.html. DHT study group: 375737269
