How to write a thief-type magnet search engine

Source: Internet
Author: User

I am a sophomore, and in my spare time I developed this magnet search engine: btgoogle.com. There are plenty of people in DHT-crawler groups studying so-called magnet search, and in my spare time I built a DHT-protocol magnet search together with someone from one of those groups, though I was only responsible for the front-end. This collaboration taught me a lot about Internet technology: the non-crawling kind of magnet search, using Sphinx for search, and Redis as the database. Redis is a key-value store, well suited to hash-type data, and DHT of course borrows its distributed design. I then spent some time getting to know Sphinx, and over the New Year holiday I want to dig deeper into Linux and some of the web protocols. Because of all this reading, it was well worth buying a Kindle; for example, I am currently planning to read the Kindle edition of "HTTP: The Definitive Guide". Studying works best when driven by interest, and better still with someone to guide you, which makes you far more efficient; I am one of those people who lacks a mentor ╮(╯▽╰)╭. Okay, let's get into the subject.

The crawler part is indeed the more complicated piece; for details, see GitHub: https://github.com/laomayi/simdht

Out of personal interest, I built a thief-type magnet search engine that draws data from 4 different sites, that is, 4 site nodes, so you can look up different resources as needed. After finishing it, I made it static and added more nodes and resources. Then I found that people actually wanted this thing, and some even offered to buy it from me, but the money on offer was not worth it, so I decided to open-source it instead.

Below is the open-source DHT crawler I built together with the technician from the group. It works by fetching seeds from a third-party torrent library:

Github:https://github.com/laomayi/simdht

Later development switched to obtaining them directly via the DHT protocol.

So what is magnet search? I won't describe it in detail here; instead, here are a few links that explain it.

*********************************************************

Introduction to Magnetic Search: http://xiaoxia.org/2013/05/11/magnet-search-engine/

A plain-English explanation of the principle and history of DHT: http://www.tuicool.com/articles/jmQfyyN

How to build a killer search tool: http://www.tuicool.com/articles/Fj2mai

Obtaining a BT seed from a magnet link: http://www.cnblogs.com/cpper-kaixuan/p/3532694.html

Simple-html-dom API Reference: http://btgoogle.com/apis/

********************************************************

The source code of the btgoogle website is the thief type; that is the code I am going to open-source.

Principle:

Start from an existing magnet search site. When a user searches, the server fetches the HTML from other magnet search sites, parses out the main part of the data, and inserts it into our own template.

The principle is simple, and plenty of people implement it this way. Some use Python, but then the interface is ugly, pages cannot be switched, Chinese is not supported, and so on. I am using PHP here.

I use simple-html-dom to parse the HTML, with curl assisting the crawl, so if the other site is found to be blocking us, the IP can be changed at any time. Very convenient.
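As a minimal sketch of that extraction step: simple-html-dom is a third-party library, so this sketch uses PHP's built-in DOM extension instead. The markup and the `extract_search_items()` helper are made up here for illustration; they are not the site's actual code.

```php
<?php
// Sketch: pull every "search-item" div out of fetched HTML and rewrite its
// link to point at our own /details page, using PHP's built-in DOM extension
// in place of the third-party simple_html_dom library.
function extract_search_items($htmls) {
    $doc = new DOMDocument();
    @$doc->loadHTML($htmls);              // @ silences warnings on sloppy HTML
    $xpath = new DOMXPath($doc);
    $out = array();
    foreach ($xpath->query('//div[@id="wall"]//div[@class="search-item"]') as $item) {
        $a = $item->getElementsByTagName('a')->item(0);
        if ($a) {
            // Rewrite the remote link so it goes through our own details page.
            $a->setAttribute('href', '/details' . $a->getAttribute('href'));
        }
        $out[] = $doc->saveHTML($item);
    }
    return $out;
}

// Made-up stand-in for a remote site's response.
$htmls = '<div id="wall">'
       . '<div class="search-item"><a href="/torrent/1">Result one</a></div>'
       . '<div class="search-item"><a href="/torrent/2">Result two</a></div>'
       . '</div>';

foreach (extract_search_items($htmls) as $div) {
    echo $div, "\n";
}
```

In real use, `$htmls` would be the string returned by `curl_exec()`.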

Of course, some people prefer the regular-expression way: first fetch the HTML with file_get_contents(), then extract the data with a regex match. The result is very inefficient, and the website becomes very slow.
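For comparison, that regex approach looks roughly like this. The inline HTML string is a made-up stand-in for what file_get_contents() would fetch; note that any markup change on the target site breaks the pattern, which is part of why this approach is fragile.

```php
<?php
// Sketch of the regex approach: fetch HTML, then pull out links and titles
// with preg_match_all. In real use $htmls would come from
// file_get_contents($url); an inline string stands in for it here.
$htmls = '<div class="search-item"><a href="/t/1">First torrent</a></div>'
       . '<div class="search-item"><a href="/t/2">Second torrent</a></div>';

// Capture group 1 = href, group 2 = link text.
preg_match_all('#<a href="([^"]+)">([^<]+)</a>#', $htmls, $m);

foreach ($m[1] as $i => $href) {
    echo $m[2][$i], ' => ', $href, "\n";
}
```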

For example, the following code scrapes demo.btspider.net in real time.

Demo:http://btgoogle.com/search4/%e5%bf%83%e8%8a%b1%e8%b7%af%e6%94%be

The role curl plays inside my program is that of an HTML-fetching proxy; simple-html-dom is what parses the HTML.
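That fetch-through-a-proxy step can be isolated into a small helper. This is a sketch of mine, not the site's actual code: the `curl_with_random_proxy()` name and the idea of rotating randomly through the pool on every request are my assumptions.

```php
<?php
// Sketch: pick a random "ip:port" proxy from a pool and configure a curl
// handle to fetch through it. The pool contents are supplied by the caller;
// random rotation makes switching IPs trivial when the target starts blocking.
function curl_with_random_proxy($url, array $proxies, $timeout = 5) {
    $proxy = $proxies[array_rand($proxies)];   // random rotation per request
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);
    return array($ch, $proxy);                 // caller runs curl_exec($ch)
}
```

The caller then does `list($ch, $proxy) = curl_with_random_proxy($url, $ip);` followed by `curl_exec($ch)`, retrying with a fresh call (and hence a fresh proxy) on failure.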

For the specifics, refer to the program. Because that site's layout is similar to my front-end, I can basically grab a div block and use it directly, with no need for more complex operations. The one thing to watch out for in the program is paging while scraping.

```php
<?php require('header.php'); ?>
<style type="text/css"> #wall { max-width: 100%; } </style>
<!-- Search content resource description: start -->
<?php
// Proxy IPs collected from a proxy site; one can be picked from the array at
// random so each fetch goes out through a random proxy.
$ip = array(
    "183.140.164.204:80", "58.68.129.68:8888", "111.13.109.26:82",
    "115.29.247.115:8888", "111.13.87.173:8081", "112.124.101.115:8080",
    "122.96.59.104:80", "122.227.8.190:80", "122.96.59.106:80",
    "121.12.120.246:2008", "111.11.228.9:81", "120.203.214.144:80",
    "122.96.59.107:843", "115.236.59.194:3128", "113.140.25.4:81",
    "183.221.47.19:8123", "111.13.109.52:80", "119.147.115.30:8088",
    "115.227.194.241:80");

$ch = curl_init();
$timeout = 5;
$title = $_GET['s'];
if (!empty($title)) {
    if ($_GET["page"] == "") {
        $url = "http://demo.btspider.net/" . $title . "-first-asc-1";
        $page = 1;
    } else {
        $page = $_GET["page"];
        $url = "http://demo.btspider.net/" . $title . "-first-asc-" . $page;
    }
    $next = $page + 1;
    // Some URLs arrive decoded, so the search segment must be re-encoded
    // (legacy /e modifier, as in the original code).
    $url = preg_replace("/(search\/)(.*)/e", "'search/'.urlencode('$2')", $url);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_PROXYAUTH, CURLAUTH_BASIC);   // proxy authentication mode
    curl_setopt($ch, CURLOPT_PROXY, '111.13.109.26:80');
    curl_setopt($ch, CURLOPT_PROXYPORT, 80);               // proxy server port
    curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);   // use an HTTP proxy
    curl_setopt($ch, CURLOPT_USERAGENT,
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');
    // Alternative UA: 'Mozilla/5.0 (Windows NT 5.1; rv:9.0.1) Gecko/20100101 Firefox/9.0.1'
    $htmls = curl_exec($ch);
    $html = str_get_html($htmls);
    foreach ($html->find('#wall') as $element) {
        $fata = $element->find('.search-item');
        foreach ($fata as $link) {
            $hrefs = $link->find("a", 0);
            $hrefin = $hrefs->href;
            $hrefs->href = '/details' . $hrefin;   // rewrite link to our own details page
            echo '<div class="search-item">' . $link->innertext . '</div>';
        }
    }
    if (!$link) {
        // No data found on the current page: return an error message.
        echo "Sorry, the keyword is sensitive or there is no resource information; please refresh or switch to another node!";
        $j = 0;
    } else {
        $j = 5;
    }
    curl_close($ch);
?>
<!-- Search content resource description: end -->
<?php
    echo '<div class="bottom-pager">';
    // Paging: if the current page has no data there is no need to emit further
    // pages; if it does, pages after the current one are added as well.
    for ($i = $page - 5; $i <= $page + $j; $i++) {
        if ($i > 0) {
            echo '<a href="/search4/' . $title . '/' . $i . '">' . $i . '</a>';
        }
    }
    echo '<a href="/search4/' . $title . '/' . $next . '">Next</a>';
    echo '</div>';
}
require('footer.php');
?>
```
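The pager loop at the end is the fiddly part, so its window logic can be isolated into a pure function. This is a sketch under my own naming (`pager_window()` is hypothetical): when the current page has results it offers up to 5 pages on either side of it; when it has none, it stops at the current page.

```php
<?php
// Sketch: compute which page numbers the bottom pager should show.
// $page is the current page; $has_results decides whether to offer
// pages after the current one ($j = 5) or stop at it ($j = 0).
function pager_window($page, $has_results) {
    $j = $has_results ? 5 : 0;
    $pages = array();
    for ($i = $page - 5; $i <= $page + $j; $i++) {
        if ($i > 0) {                 // never emit page 0 or negative pages
            $pages[] = $i;
        }
    }
    return $pages;
}

// Example: on page 3 with results, the pager shows pages 1 through 8.
echo implode(' ', pager_window(3, true)), "\n";
```

Keeping the window computation separate from the `echo '<a href=...>'` markup makes the boundary cases (page 1, empty pages) easy to check on their own.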

In later posts I will share and analyze the rest of the site's code piece by piece. The idea itself is clear; it is the details that take the effort.

Demo: www.btgoogle.com    Open source agreement: www.btgoogle.com/about
