A DHT Web Crawler Developed in Python

Source: Internet
Author: User
Tags: message queue

This article uses libtorrent's Python bindings to implement a DHT crawler that captures magnet links in the DHT network.


Introduction to the DHT Network

Peer Network

When you download a resource via a torrent file, the computers in the peer-to-peer network that hold the resource are called peers. In a traditional peer-to-peer network, a tracker server keeps track of which peers hold which resources, so to download a resource you must first obtain its peers from the tracker.


DHT Network

Tracker servers face copyright and legal issues, and DHT emerged in response: it disperses the resource/peer information held on the tracker across the whole network. The DHT network is composed of distributed nodes, where a node is a peer client that implements the DHT protocol; a peer client program is therefore both a peer and a node. Several algorithms exist for building a DHT network, with Kademlia being the most common.


DHT Network Download

When a peer client downloads a resource from a torrent file that has no tracker server, it queries the DHT network for the resource's peer list and then downloads the resource from those peers.


Magnet Links

In a DHT network, a resource is identified by its infohash, a 20-byte string obtained with the SHA1 algorithm. The infohash is calculated from the file-description ("info") section of the torrent file. A magnet link is obtained by encoding the infohash as a hexadecimal string. A peer client uses the magnet link to download the resource's torrent file, and then downloads the resource based on that torrent file.
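To make the infohash/magnet relationship concrete, here is a minimal sketch (the tiny bencoder and the sample info dictionary are illustrative assumptions, not part of the crawler): the infohash is the SHA-1 digest of the torrent's bencoded info dictionary, hex-encoded into the magnet link.

```python
import hashlib

def bencode(obj):
    # Minimal bencoding (BEP 3): byte strings, ints, lists, dicts.
    if isinstance(obj, bytes):
        return str(len(obj)).encode() + b":" + obj
    if isinstance(obj, str):
        return bencode(obj.encode("utf-8"))
    if isinstance(obj, int):
        return b"i" + str(obj).encode() + b"e"
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(v) for v in obj) + b"e"
    if isinstance(obj, dict):
        # Dictionary keys must appear in sorted order on the wire.
        return b"d" + b"".join(bencode(k) + bencode(v)
                               for k, v in sorted(obj.items())) + b"e"
    raise TypeError("cannot bencode %r" % type(obj))

def magnet_from_info(info):
    # The infohash is the SHA-1 digest of the bencoded "info" dict;
    # the magnet link carries it as a 40-character hex string.
    info_hash = hashlib.sha1(bencode(info)).hexdigest()
    return "magnet:?xt=urn:btih:" + info_hash

# Hypothetical single-file "info" dictionary (not a real torrent):
info = {"name": "example.iso", "length": 1048576,
        "piece length": 262144, "pieces": b""}
print(magnet_from_info(info))
```

A real magnet link often also carries a display name (`dn=`) and tracker (`tr=`) parameters, but the `btih` infohash is the only required part.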


Kademlia algorithm

Kademlia is the algorithm behind this implementation of the DHT network; for the details, see the DHT protocol specification.
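The core idea of Kademlia can be sketched in a few lines: node IDs and infohashes share one 160-bit space, distance between two IDs is their XOR compared as an integer, and each node stores peer lists for infohashes close to its own ID. A sketch of the metric only, not the full routing-table logic:

```python
def xor_distance(a: bytes, b: bytes) -> int:
    # Kademlia metric: XOR the two 160-bit IDs, compare as integers.
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")

def closest(node_ids, target):
    # Pick the node whose ID is nearest the target infohash.
    return min(node_ids, key=lambda n: xor_distance(n, target))
```

A lookup repeatedly asks the closest known nodes for nodes even closer to the target, converging in O(log n) steps.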


KRPC protocol

KRPC is the protocol nodes use to communicate with one another; it is transmitted over UDP.

It includes four kinds of requests: ping, find_node, get_peers, and announce_peer. Among these, get_peers and announce_peer are the main messages nodes exchange when querying for resources.
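For illustration, every KRPC query is a bencoded dictionary sent in a single UDP datagram. The byte layout below reproduces the ping example from the DHT protocol specification (BEP 5); the node ID is the specification's sample value, not a real one:

```python
def krpc_ping(node_id: bytes, tid: bytes = b"aa") -> bytes:
    # A KRPC query is a bencoded dict: {"t": tid, "y": "q", "q": "ping",
    # "a": {"id": node_id}} -- keys appear in sorted order on the wire.
    assert len(node_id) == 20 and len(tid) == 2
    return (b"d1:ad2:id20:" + node_id + b"e" +
            b"1:q4:ping1:t2:" + tid + b"1:y1:qe")

# Sample node ID from the BEP 5 ping example:
packet = krpc_ping(b"abcdefghij0123456789")
print(packet)
```

The receiving node replies with a bencoded dictionary of its own, echoing the transaction ID `t` so responses can be matched to queries.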


DHT crawler principle

The main idea is to disguise the crawler as a peer client and join the DHT network, then collect the get_peers and announce_peer messages circulating in the network; these are the UDP messages that other nodes send to the disguised peer client, and they carry the infohashes being looked up or announced.
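The crawler below relies on libtorrent to do this parsing, but to make the principle concrete, here is a minimal, illustrative sketch of decoding one such UDP packet by hand and pulling out the infohash. The packet is hand-built with made-up IDs, and a real decoder would need error handling:

```python
def bdecode(data: bytes, i: int = 0):
    """Minimal bencoding decoder; returns (value, next_index)."""
    c = data[i:i + 1]
    if c == b"i":                      # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c.isdigit():                    # byte string: <len>:<bytes>
        colon = data.index(b":", i)
        n = int(data[i:colon])
        start = colon + 1
        return data[start:start + n], start + n
    if c == b"l":                      # list: l...e
        items, i = [], i + 1
        while data[i:i + 1] != b"e":
            v, i = bdecode(data, i)
            items.append(v)
        return items, i + 1
    if c == b"d":                      # dict: d...e
        d, i = {}, i + 1
        while data[i:i + 1] != b"e":
            k, i = bdecode(data, i)
            v, i = bdecode(data, i)
            d[k] = v
        return d, i + 1
    raise ValueError("invalid bencoding at offset %d" % i)

# Hand-built announce_peer query (hypothetical node ID and infohash):
packet = (b"d1:ad2:id20:" + b"a" * 20 + b"9:info_hash20:" + b"m" * 20 +
          b"4:porti6881e5:token8:aoeusnthe" +
          b"1:q13:announce_peer1:t2:aa1:y1:qe")
msg, _ = bdecode(packet)
if msg.get(b"q") == b"announce_peer":
    info_hash = msg[b"a"][b"info_hash"].hex()  # the magnet payload
```

Every announce_peer received this way means some peer claims to be downloading or seeding that infohash, which is exactly what the crawler records.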


Implementation of the DHT Crawler

Crawler Operating Environment
    1. Linux system
    2. Python 2.7
    3. Python bindings for the libtorrent library
    4. Twisted network library
    5. Firewall with the fixed UDP and TCP ports open


Introduction to the Libtorrent Library

libtorrent is a BitTorrent client library with a rich interface for downloading resources on the network. It has Python bindings, and this crawler is developed with its Python library.

A few libtorrent concepts need explanation. A session is equivalent to one peer client; each session opens one TCP and one UDP port to exchange data with other peer clients. Multiple sessions can be created within a single process, which amounts to running multiple peer clients and speeds up collection.

An alert is the queue libtorrent uses to deliver its various messages, and each session has its own alert queue. The get_peers and announce_peer messages of the KRPC protocol are obtained from this queue as well; it is from these two message types that the magnet links are collected.


Main implementation Code

The main crawler code is relatively simple:

    # Event notification handler
    def _handle_alerts(self, session, alerts):
        while len(alerts):
            alert = alerts.pop()
            # Get dht_announce_alert and dht_get_peers_alert messages;
            # magnet links are collected from these two message types
            if isinstance(alert, lt.add_torrent_alert):
                alert.handle.set_upload_limit(self._torrent_upload_limit)
                alert.handle.set_download_limit(self._torrent_download_limit)
            elif isinstance(alert, lt.dht_announce_alert):
                info_hash = alert.info_hash.to_string().encode('hex')
                if info_hash in self._meta_list:
                    self._meta_list[info_hash] += 1
                else:
                    self._meta_list[info_hash] = 1
                    self._current_meta_count += 1
            elif isinstance(alert, lt.dht_get_peers_alert):
                info_hash = alert.info_hash.to_string().encode('hex')
                if info_hash in self._meta_list:
                    self._meta_list[info_hash] += 1
                else:
                    self._infohash_queue_from_getpeers.append(info_hash)
                    self._meta_list[info_hash] = 1
                    self._current_meta_count += 1

    def start_work(self):
        '''Main work loop: check messages, display status'''
        begin_time = time.time()
        show_interval = self._delay_interval
        while True:
            for session in self._sessions:
                session.post_torrent_updates()
                # Fetch messages from the session's alert queue
                self._handle_alerts(session, session.pop_alerts())
            time.sleep(self._sleep_time)
            if show_interval > 0:
                show_interval -= 1
                continue
            show_interval = self._delay_interval

            # Display statistics
            show_content = ['\ntorrents:']
            interval = time.time() - begin_time
            show_content.append('  pid: %s' % os.getpid())
            show_content.append('  time: %s' %
                                time.strftime('%Y-%m-%d %H:%M:%S'))
            show_content.append('  run time: %s' % self._get_runtime(interval))
            show_content.append('  start port: %d' % self._start_port)
            show_content.append('  collect session num: %d' %
                                len(self._sessions))
            show_content.append('  info hash nums from get peers: %d' %
                                len(self._infohash_queue_from_getpeers))
            show_content.append('  torrent collection rate: %f /minute' %
                                (self._current_meta_count * 60 / interval))
            show_content.append('  current torrent count: %d' %
                                self._current_meta_count)
            show_content.append('  total torrent count: %d' %
                                len(self._meta_list))
            show_content.append('\n')

            # Store run state to file
            try:
                with open(self._stat_file, 'wb') as f:
                    f.write('\n'.join(show_content))
                with open(self._result_file, 'wb') as f:
                    json.dump(self._meta_list, f)
            except Exception as err:
                pass

            # Check whether the exit time has been reached
            if interval >= self._exit_time:
                # stop
                break

            # Back up the results file at the end of each day
            self._backup_result()

        # Destroy the peer clients
        for session in self._sessions:
            torrents = session.get_torrents()
            for torrent in torrents:
                session.remove_torrent(torrent)


Operational efficiency

On a machine with 512 MB of memory and a single CPU, the crawler starts off slightly slow; after running for a few minutes, the collection speed stabilizes at about 180 per minute, roughly 10,000 per hour.

Running State

    torrents:
      pid: 11480
      time: 2014-08-18 22:45:01
      run time: day: 0, hour: 0, minute: 12, second: 25
      start port: 32900
      collect session num: 20
      info hash nums from get peers: 2222
      torrent collection rate: 179.098480 /minute
      current torrent count: 2224
      total torrent count: 58037


Crawler complete code

For the complete code, see: https://github.com/blueskyz/DHTCrawler

A Twisted-based monitoring process is also included; it shows the crawler's status and restarts the crawler process after it exits.


Source link: Python-developed DHT crawler


