A DHT Web Crawler Developed in Python

Source: Internet
Author: User
Tags: message queue

This article uses libtorrent's Python bindings to implement a DHT crawler that captures magnet links in the DHT network.


Introduction to the DHT Network

Peer Network

When you download a resource via a torrent file, the computers in the peer-to-peer network that hold the resource are called peers. In a traditional peer-to-peer network, a tracker server keeps track of which peers hold which resources, so to download a resource you must first obtain its peers from the tracker.


DHT Network

Tracker servers face copyright and legal issues, and DHT emerged in response: it disperses the resource/peer information held on the tracker across the whole network. The DHT network is composed of distributed nodes, where a node is a peer client that implements the DHT protocol; a peer client program is therefore both a peer and a node. Several algorithms exist for building a DHT network, with Kademlia being the most common.


DHT Network Download

When a peer client downloads a resource from a torrent file that has no tracker server, it queries the DHT network for the resource's peer list and then downloads the resource from those peers.


Magnet Links

In a DHT network, a resource is identified by its infohash, a 20-byte string obtained with the SHA1 algorithm. The infohash is calculated from the file-description ("info") section of the torrent file. A magnet link is obtained by encoding the infohash as a hexadecimal string. A peer client uses the magnet link to download the resource's torrent file, and then downloads the resource based on that torrent file.
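To make the infohash/magnet relationship concrete, here is a minimal sketch (the tiny bencoder and the sample info dictionary are illustrative assumptions, not part of the crawler): the infohash is the SHA-1 digest of the torrent's bencoded info dictionary, hex-encoded into the magnet link.

```python
import hashlib

def bencode(obj):
    # Minimal bencoding (BEP 3): byte strings, ints, lists, dicts.
    if isinstance(obj, bytes):
        return str(len(obj)).encode() + b":" + obj
    if isinstance(obj, str):
        return bencode(obj.encode("utf-8"))
    if isinstance(obj, int):
        return b"i" + str(obj).encode() + b"e"
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(v) for v in obj) + b"e"
    if isinstance(obj, dict):
        # Dictionary keys must appear in sorted order on the wire.
        return b"d" + b"".join(bencode(k) + bencode(v)
                               for k, v in sorted(obj.items())) + b"e"
    raise TypeError("cannot bencode %r" % type(obj))

def magnet_from_info(info):
    # The infohash is the SHA-1 digest of the bencoded "info" dict;
    # the magnet link carries it as a 40-character hex string.
    info_hash = hashlib.sha1(bencode(info)).hexdigest()
    return "magnet:?xt=urn:btih:" + info_hash

# Hypothetical single-file "info" dictionary (not a real torrent):
info = {"name": "example.iso", "length": 1048576,
        "piece length": 262144, "pieces": b""}
print(magnet_from_info(info))
```

A real magnet link often also carries a display name (`dn=`) and tracker (`tr=`) parameters, but the `btih` infohash is the only required part.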


Kademlia algorithm

Kademlia is the algorithm behind this implementation of the DHT network; for the details, see the DHT protocol specification.
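The core idea of Kademlia can be sketched in a few lines: node IDs and infohashes share one 160-bit space, distance between two IDs is their XOR compared as an integer, and each node stores peer lists for infohashes close to its own ID. A sketch of the metric only, not the full routing-table logic:

```python
def xor_distance(a: bytes, b: bytes) -> int:
    # Kademlia metric: XOR the two 160-bit IDs, compare as integers.
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")

def closest(node_ids, target):
    # Pick the node whose ID is nearest the target infohash.
    return min(node_ids, key=lambda n: xor_distance(n, target))
```

A lookup repeatedly asks the closest known nodes for nodes even closer to the target, converging in O(log n) steps.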


KRPC protocol

KRPC is the protocol nodes use to communicate with one another; it is transmitted over UDP.

It includes four kinds of requests: ping, find_node, get_peers, and announce_peer. Among these, get_peers and announce_peer are the main messages nodes exchange when querying for resources.
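For illustration, every KRPC query is a bencoded dictionary sent in a single UDP datagram. The byte layout below reproduces the ping example from the DHT protocol specification (BEP 5); the node ID is the specification's sample value, not a real one:

```python
def krpc_ping(node_id: bytes, tid: bytes = b"aa") -> bytes:
    # A KRPC query is a bencoded dict: {"t": tid, "y": "q", "q": "ping",
    # "a": {"id": node_id}} -- keys appear in sorted order on the wire.
    assert len(node_id) == 20 and len(tid) == 2
    return (b"d1:ad2:id20:" + node_id + b"e" +
            b"1:q4:ping1:t2:" + tid + b"1:y1:qe")

# Sample node ID from the BEP 5 ping example:
packet = krpc_ping(b"abcdefghij0123456789")
print(packet)
```

The receiving node replies with a bencoded dictionary of its own, echoing the transaction ID `t` so responses can be matched to queries.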


DHT crawler principle

The main idea is to disguise the crawler as a peer client and join the DHT network, then collect the get_peers and announce_peer messages circulating in the network; these are the UDP messages that other nodes send to the disguised peer client, and they carry the infohashes being looked up or announced.
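The crawler below relies on libtorrent to do this parsing, but to make the principle concrete, here is a minimal, illustrative sketch of decoding one such UDP packet by hand and pulling out the infohash. The packet is hand-built with made-up IDs, and a real decoder would need error handling:

```python
def bdecode(data: bytes, i: int = 0):
    """Minimal bencoding decoder; returns (value, next_index)."""
    c = data[i:i + 1]
    if c == b"i":                      # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c.isdigit():                    # byte string: <len>:<bytes>
        colon = data.index(b":", i)
        n = int(data[i:colon])
        start = colon + 1
        return data[start:start + n], start + n
    if c == b"l":                      # list: l...e
        items, i = [], i + 1
        while data[i:i + 1] != b"e":
            v, i = bdecode(data, i)
            items.append(v)
        return items, i + 1
    if c == b"d":                      # dict: d...e
        d, i = {}, i + 1
        while data[i:i + 1] != b"e":
            k, i = bdecode(data, i)
            v, i = bdecode(data, i)
            d[k] = v
        return d, i + 1
    raise ValueError("invalid bencoding at offset %d" % i)

# Hand-built announce_peer query (hypothetical node ID and infohash):
packet = (b"d1:ad2:id20:" + b"a" * 20 + b"9:info_hash20:" + b"m" * 20 +
          b"4:porti6881e5:token8:aoeusnthe" +
          b"1:q13:announce_peer1:t2:aa1:y1:qe")
msg, _ = bdecode(packet)
if msg.get(b"q") == b"announce_peer":
    info_hash = msg[b"a"][b"info_hash"].hex()  # the magnet payload
```

Every announce_peer received this way means some peer claims to be downloading or seeding that infohash, which is exactly what the crawler records.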


Implementation of the DHT Crawler

Crawler Operating Environment
    1. Linux system
    2. Python 2.7
    3. Python bindings for the libtorrent library
    4. Twisted network library
    5. Firewall with the fixed UDP and TCP ports open


Introduction to the Libtorrent Library

libtorrent is a BitTorrent client library with a rich interface for downloading resources on the network. It has Python bindings, and this crawler is developed with its Python library.

A few libtorrent concepts need explanation. A session is equivalent to one peer client; each session opens one TCP and one UDP port to exchange data with other peer clients. Multiple sessions can be created within a single process, which amounts to running multiple peer clients and speeds up collection.

An alert is the queue libtorrent uses to deliver its various messages, and each session has its own alert queue. The get_peers and announce_peer messages of the KRPC protocol are obtained from this queue as well; it is from these two message types that the magnet links are collected.


Main implementation Code

The main crawler code is relatively simple:

    # Event notification handler
    def _handle_alerts(self, session, alerts):
        while len(alerts):
            alert = alerts.pop()
            # Get dht_announce_alert and dht_get_peers_alert messages;
            # magnet links are collected from these two message types
            if isinstance(alert, lt.add_torrent_alert):
                alert.handle.set_upload_limit(self._torrent_upload_limit)
                alert.handle.set_download_limit(self._torrent_download_limit)
            elif isinstance(alert, lt.dht_announce_alert):
                info_hash = alert.info_hash.to_string().encode('hex')
                if info_hash in self._meta_list:
                    self._meta_list[info_hash] += 1
                else:
                    self._meta_list[info_hash] = 1
                    self._current_meta_count += 1
            elif isinstance(alert, lt.dht_get_peers_alert):
                info_hash = alert.info_hash.to_string().encode('hex')
                if info_hash in self._meta_list:
                    self._meta_list[info_hash] += 1
                else:
                    self._infohash_queue_from_getpeers.append(info_hash)
                    self._meta_list[info_hash] = 1
                    self._current_meta_count += 1

    def start_work(self):
        '''Main work loop: check messages, display status'''
        begin_time = time.time()
        show_interval = self._delay_interval
        while True:
            for session in self._sessions:
                session.post_torrent_updates()
                # Fetch messages from the session's alert queue
                self._handle_alerts(session, session.pop_alerts())
            time.sleep(self._sleep_time)
            if show_interval > 0:
                show_interval -= 1
                continue
            show_interval = self._delay_interval

            # Display statistics
            show_content = ['\ntorrents:']
            interval = time.time() - begin_time
            show_content.append('  pid: %s' % os.getpid())
            show_content.append('  time: %s' %
                                time.strftime('%Y-%m-%d %H:%M:%S'))
            show_content.append('  run time: %s' % self._get_runtime(interval))
            show_content.append('  start port: %d' % self._start_port)
            show_content.append('  collect session num: %d' %
                                len(self._sessions))
            show_content.append('  info hash nums from get peers: %d' %
                                len(self._infohash_queue_from_getpeers))
            show_content.append('  torrent collection rate: %f /minute' %
                                (self._current_meta_count * 60 / interval))
            show_content.append('  current torrent count: %d' %
                                self._current_meta_count)
            show_content.append('  total torrent count: %d' %
                                len(self._meta_list))
            show_content.append('\n')

            # Store run state to file
            try:
                with open(self._stat_file, 'wb') as f:
                    f.write('\n'.join(show_content))
                with open(self._result_file, 'wb') as f:
                    json.dump(self._meta_list, f)
            except Exception as err:
                pass

            # Check whether the exit time has been reached
            if interval >= self._exit_time:
                # stop
                break

            # Back up the results file at the end of each day
            self._backup_result()

        # Destroy the peer clients
        for session in self._sessions:
            torrents = session.get_torrents()
            for torrent in torrents:
                session.remove_torrent(torrent)


Operational efficiency

On a machine with 512 MB of memory and a single CPU, the crawler starts off slightly slow; after running for a few minutes, the collection speed stabilizes at about 180 per minute, roughly 10,000 per hour.

Running State

    torrents:
      pid: 11480
      time: 2014-08-18 22:45:01
      run time: day: 0, hour: 0, minute: 12, second: 25
      start port: 32900
      collect session num: 20
      info hash nums from get peers: 2222
      torrent collection rate: 179.098480 /minute
      current torrent count: 2224
      total torrent count: 58037


Crawler complete code

For the complete code, see: https://github.com/blueskyz/DHTCrawler

A Twisted-based monitoring process is also included; it shows the crawler's status and restarts the crawler process after it exits.


Source link: Python-developed DHT crawler


