Use libtorrent's Python bindings to implement a DHT crawler that collects magnet links from the DHT network.
Introduction to the DHT network
Peer network
When you download a resource described by a torrent file, the computers in the peer-to-peer network that hold the resource are called peers. In a traditional P2P network, a tracker server keeps track of which peers hold which resources, so before you can download a resource you first have to ask the tracker for its list of peers.
DHT Network
Tracker servers run into copyright and legal problems, which is why DHT appeared: it spreads the peer information that used to sit on the tracker across the whole network. A DHT network is made up of distributed nodes, where a node is a peer client that implements the DHT protocol; a peer client program is therefore both a peer and a node. Several algorithms can be used to build a DHT network, the most common being Kademlia.
DHT Network Download
When a peer client downloads a resource from a torrent file that lists no tracker server, it queries the DHT network for the resource's peer list and then downloads the resource from those peers.
Magnet links
In a DHT network a resource is identified by its infohash, a 20-byte string produced by the SHA-1 algorithm. The infohash is computed from the file description information (the info dictionary) of the torrent file, and a magnet link is simply the infohash encoded as a hexadecimal string. A peer client uses the magnet link to fetch the resource's torrent metadata and then downloads the resource based on that metadata.
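As a rough illustration (not from the original article), the following sketch shows how an infohash and a magnet link relate, assuming an existing torrent file named example.torrent and the bdecode/bencode helpers exposed by the older libtorrent Python bindings:

import hashlib
import libtorrent as lt

# decode the torrent file into a dictionary
with open('example.torrent', 'rb') as f:
    torrent = lt.bdecode(f.read())

# the infohash is the SHA-1 hash of the bencoded "info" dictionary;
# a magnet link simply embeds its hex form
info_hash = hashlib.sha1(lt.bencode(torrent['info'])).hexdigest()
print('magnet:?xt=urn:btih:%s' % info_hash)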
Kademlia algorithm
Kademlia is the algorithm the DHT network is built on; for the details, see the DHT protocol specification.
KRPC protocol
KRPC is the protocol nodes use to talk to each other; its messages are carried over UDP.
It defines four kinds of queries: ping, find_node, get_peers, and announce_peer. Of these, get_peers and announce_peer are the main messages nodes use to look up and announce resources.
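For reference, a get_peers query is just a small bencoded dictionary. The sketch below follows the DHT protocol specification rather than the article's code; node_id and info_hash are placeholder 20-byte values:

import libtorrent as lt

node_id = 'a' * 20      # placeholder: the querying node's 20-byte id
info_hash = 'b' * 20    # placeholder: the 20-byte infohash being looked up

# KRPC get_peers query: "t" is a transaction id, "y" marks a query,
# "q" names the method and "a" carries its arguments
query = {
    't': 'aa',
    'y': 'q',
    'q': 'get_peers',
    'a': {'id': node_id, 'info_hash': info_hash},
}
packet = lt.bencode(query)  # this byte string is what travels over UDP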
DHT crawler principle
The main idea is to masquerade as an ordinary peer client and join the DHT network, then collect the get_peers and announce_peer messages that other nodes send to this fake client over UDP; the infohashes carried in those messages are the magnet links we are after.
Implementation of the DHT crawler
Crawler operating environment
Linux system
Python 2.7
Python bindings for the libtorrent library
Twisted networking library
Firewall with the crawler's fixed UDP and TCP ports open
Introduction to the Libtorrent Library
libtorrent is a BitTorrent client library with a rich interface that can be used to build download clients. It ships with Python bindings, and this crawler is built on them.
A few libtorrent concepts need explaining first. A session is equivalent to one peer client: it opens one TCP port and one UDP port and uses them to exchange data with other peer clients. You can create multiple sessions inside a single process, which amounts to running several peer clients at once and speeds up collection.
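Purely as an illustration of the idea (not the article's code), sessions can be created and joined to the DHT network roughly like this; the bootstrap router, port range, and session count are example values chosen to match the statistics shown later:

import libtorrent as lt

sessions = []
start_port = 32900
for i in range(20):                     # 20 sessions = 20 fake peer clients
    ses = lt.session()
    # each session listens on its own TCP/UDP port pair
    ses.listen_on(start_port + i, start_port + i)
    # bootstrap into the DHT network through a well-known router
    ses.add_dht_router('router.bittorrent.com', 6881)
    ses.start_dht()
    sessions.append(ses)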
An alert is libtorrent's mechanism for reporting events, and each session has its own alert queue. The KRPC get_peers and announce_peer messages also show up in this queue, as dht_get_peers_alert and dht_announce_alert, and it is from these two alert types that the crawler collects magnet links.
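Continuing that sketch, the alert queues could be drained along these lines; pop_alerts and the two alert classes are the same ones the main code below relies on:

import time

# make sure DHT-related alerts are delivered to each session's queue
for ses in sessions:
    ses.set_alert_mask(lt.alert.category_t.all_categories)

while True:
    for ses in sessions:
        for alert in ses.pop_alerts():
            if isinstance(alert, (lt.dht_get_peers_alert,
                                  lt.dht_announce_alert)):
                # each of these alerts carries the infohash a node asked about
                print(alert.info_hash.to_string().encode('hex'))
    time.sleep(1)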
Main implementation code
The crawler's main code is fairly simple:
# event notification handler
def _handle_alerts(self, session, alerts):
    while len(alerts):
        alert = alerts.pop()
        # collect magnet links from the dht_announce_alert and
        # dht_get_peers_alert messages
        if isinstance(alert, lt.add_torrent_alert):
            alert.handle.set_upload_limit(self._torrent_upload_limit)
            alert.handle.set_download_limit(self._torrent_download_limit)
        elif isinstance(alert, lt.dht_announce_alert):
            info_hash = alert.info_hash.to_string().encode('hex')
            if info_hash in self._meta_list:
                self._meta_list[info_hash] += 1
            else:
                self._meta_list[info_hash] = 1
                self._current_meta_count += 1
        elif isinstance(alert, lt.dht_get_peers_alert):
            info_hash = alert.info_hash.to_string().encode('hex')
            if info_hash in self._meta_list:
                self._meta_list[info_hash] += 1
            else:
                self._infohash_queue_from_getpeers.append(info_hash)
                self._meta_list[info_hash] = 1
                self._current_meta_count += 1

def start_work(self):
    '''Main work loop: check messages and display status'''
    # clear screen
    begin_time = time.time()
    show_interval = self._delay_interval
    while True:
        for session in self._sessions:
            session.post_torrent_updates()
            # fetch pending messages from the alert queue
            self._handle_alerts(session, session.pop_alerts())
        time.sleep(self._sleep_time)
        if show_interval > 0:
            show_interval -= 1
            continue
        show_interval = self._delay_interval

        # display statistics
        show_content = ['torrents:']
        interval = time.time() - begin_time
        show_content.append('pid: %s' % os.getpid())
        show_content.append('time: %s' %
                            time.strftime('%Y-%m-%d %H:%M:%S'))
        show_content.append('run time: %s' % self._get_runtime(interval))
        show_content.append('start port: %d' % self._start_port)
        show_content.append('collect session num: %d' %
                            len(self._sessions))
        show_content.append('info hash nums from get peers: %d' %
                            len(self._infohash_queue_from_getpeers))
        show_content.append('torrent collection rate: %f /minute' %
                            (self._current_meta_count * 60 / interval))
        show_content.append('current torrent count: %d' %
                            self._current_meta_count)
        show_content.append('total torrent count: %d' %
                            len(self._meta_list))
        show_content.append('\n')

        # store the run state to a file
        try:
            with open(self._stat_file, 'wb') as f:
                f.write('\n'.join(show_content))
            with open(self._result_file, 'wb') as f:
                json.dump(self._meta_list, f)
        except Exception as err:
            pass

        # check whether the exit time has been reached
        if interval >= self._exit_time:
            # stop
            break

    # back up the result file at the end of the day's run
    self._backup_result()

    # tear down the peer clients
    for session in self._sessions:
        torrents = session.get_torrents()
        for torrent in torrents:
            session.remove_torrent(torrent)
Operational efficiency
On one of my machines with 512 MB of memory and a single CPU, the crawler is a little slow right after it starts; after a few minutes the collection rate stabilizes at about 180 infohashes per minute, or roughly 10,000 per hour.
Running state
run times: 12
torrents:
pid: 11480
time: 2014-08-18 22:45:01
run time: day: 0, hour: 0, minute: 12, second: 25
start port: 32900
collect session num: 20
info hash nums from get peers: 2222
torrent collection rate: 179.098480 /minute
current torrent count: 2224
total torrent count: 58037
Complete crawler code
The complete code is available at: https://github.com/blueskyz/DHTCrawler
The repository also includes a Twisted-based monitoring process that reports the crawler's status and restarts the crawler process when it exits.
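That monitor is not reproduced here; purely as a sketch of the idea, a Twisted process monitor might look like the following, where dhtcrawler.py is a placeholder name rather than the actual script in the repository:

from twisted.internet import reactor, protocol

class CrawlerMonitor(protocol.ProcessProtocol):
    '''Restart the crawler whenever its process exits.'''
    def processEnded(self, reason):
        print('crawler exited (%s), restarting' % reason.value)
        spawn_crawler()

def spawn_crawler():
    # 'dhtcrawler.py' is a placeholder, not the repository's real entry point
    reactor.spawnProcess(CrawlerMonitor(), 'python',
                         ['python', 'dhtcrawler.py'])

spawn_crawler()
reactor.run()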