Using Elasticsearch to build a crawler system

Source: Internet
Author: User

(i) Why use a retrieval system

A crawler system is generally divided into multi-threaded downloading, a link pool, data storage, a retrieval system, and so on. The retrieval system consolidates the information we crawl and speeds up searching. It is not only useful for crawlers: I think any system that needs to index its results and serve queries can use one, for example a personal social-engineering database or a large-scale vulnerability scanning system (which could also serve as a botnet). There are many retrieval systems, but I find Elasticsearch more convenient, and it provides APIs for many languages, such as Java, PHP, Perl, and Python. Today I'm going to record how I got the system working.

(ii) How to use it

Download both the server and a client library. Starting with the server: it can be used directly after downloading.

Go into the bin directory and run the .bat file; the default port is 9200.

The client is a bit more cumbersome. I'm using the PHP API, elasticsearch-php, which has the following three requirements:

1. PHP 5.3.9 or later; I'm using PHP 5.3.23.

2. Composer to manage packages in the project; download it from https://getcomposer.org/

3. The curl and openssl extensions enabled in php.ini.

After installing Composer, proceed as follows:

(1) Create a new directory for the project; mine is named Phpcrawler.

(2) Add a composer.json file to it with the following content:

	{
	    "require": {
	        "elasticsearch/elasticsearch": "~1.2"
	    }
	}

Then cd into this folder and run composer install --no-dev

After a moment a vendor folder appears in the project directory. To use the library elsewhere, copy this folder to wherever it is needed.

We can then include vendor/autoload.php directly in PHP.

If my explanation is not detailed enough, you can also refer to the following blog post:

http://blog.csdn.net/rongyongfeikai2/article/details/37911871

(iii) Building the crawler system

With PHPCrawler, building a crawler system in PHP becomes much easier: you only need to give it an entry link and PHPCrawler crawls for you. You still have to write the necessary processing code yourself, but it spares you page fetching, link pool management, link scheduling, and similar chores. How to use PHPCrawler is not the focus of this article; you can refer to:

http://www.cuab.de/classreferences/PHPCrawler/overview.html

The data I'm crawling is song data: one songlist corresponds to multiple songs, and the song information includes creation time, creator, song name, and so on. It is stored in MySQL in two tables, song and songlist, associated through songlist_id as a foreign key.

But the data is messy, hard to search with SQL, and slow to query. Building an index with Elasticsearch can change that.

(iv) Building the index

In PHP, instantiating the Elasticsearch client class returns a client. You can pass an array directly to the constructor to specify one or more server IPs and ports.

The process is as follows:

(1) Create a client

(2) Set the name of the index

(3) Set the mapping of the index

Set the mapping for the index; here it is named songlist_type.
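As a rough sketch of the constructor array described above: the 1.x client reads its server list from a 'hosts' key. The helper name buildClientParams and the address 127.0.0.1:9200 are assumptions for illustration (a single local node on the default port), not taken from my project.

```php
<?php
// Sketch only: build the constructor parameters for the 1.x PHP client.
// buildClientParams() is a hypothetical helper introduced for this example.
function buildClientParams(array $hosts)
{
    // Each entry is an "ip:port" string; several servers may be listed.
    return array('hosts' => $hosts);
}

// Assumed setup: one local server on the default port 9200.
$clientParams = buildClientParams(array('127.0.0.1:9200'));

// With the Composer autoloader available, the client would be created like this:
// require 'vendor/autoload.php';
// $client = new Elasticsearch\Client($clientParams);
```

Keeping the parameters in a plain array makes it easy to add more servers later without touching the rest of the code.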
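Steps (2) and (3) could look roughly like the sketch below. The index name songlist_index and the type name songlist_type come from this article; the field types (string, integer) and the helper names are assumptions, and the mapping layout follows the Elasticsearch 1.x style that matches the ~1.2 client.

```php
<?php
// Sketch only: parameters for creating the index together with its mapping.
// The field types below are assumed; adjust them to the real data.
function buildCreateIndexParams()
{
    return array(
        'index' => 'songlist_index',
        'body'  => array(
            'mappings' => array(
                'songlist_type' => array(
                    'properties' => array(
                        'title'       => array('type' => 'string'),
                        'songlist_id' => array('type' => 'integer'),
                    ),
                ),
            ),
        ),
    );
}

// $client is expected to be an Elasticsearch\Client instance.
// The index is only created on the server when this function is called.
function createSonglistIndex($client)
{
    return $client->indices()->create(buildCreateIndexParams());
}
```

Newer Elasticsearch versions changed the field types and dropped mapping types, so this layout should be treated as specific to the 1.x line.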


(4) Add index entries

Index entries are added with the $client object's index method, which takes a $params array. The body is where the fields of the entry are set; here the title of the songlist and its songlist_id are indexed. Submitting a title keyword to the server then returns the corresponding ID, and with this ID you can fetch the songs in the list.

Once all entries have been added, the index is complete. Remember to actually execute the call so that each entry is submitted to the server.
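A minimal sketch of step (4), assuming the rows have already been read from the songlist table into arrays with songlist_id and title keys (the row format and the helper names are assumptions). The entry ID is set to songlist_id, which matches the _id shown in the query result later in this article.

```php
<?php
// Sketch only: build the $params array for one index entry.
function buildIndexEntryParams($songlistId, $title)
{
    return array(
        'index' => 'songlist_index',
        'type'  => 'songlist_type',
        'id'    => $songlistId,
        'body'  => array(
            'title'       => $title,
            'songlist_id' => $songlistId,
        ),
    );
}

// $client is expected to be an Elasticsearch\Client instance;
// $rows is a hypothetical list of rows from the songlist table.
function indexSonglists($client, array $rows)
{
    $count = 0;
    foreach ($rows as $row) {
        // Nothing reaches the server until index() is actually called.
        $client->index(buildIndexEntryParams($row['songlist_id'], $row['title']));
        $count++;
    }
    return $count; // number of entries submitted
}
```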

(v) Querying data with the index

Once the index is built, we can put it to use. A query also needs a client: call the client's search method with the right parameters to get a result set. The field searched here is title (each index entry holds songlist_id and title), and its value is set to $query, which is taken from the GET request.

The specific process is as follows:

When you get the result set (an array), the results of the query are in its hits item.

Submitting $query = 'annoyance' and printing the result array returned by search gives:
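The query step and the handling of the hits item might look like the sketch below. It assumes a match query on title, which may not be the exact query type I used; the helper names and the GET parameter name query are also assumptions.

```php
<?php
// Sketch only: search parameters for a keyword lookup on the title field.
function buildSearchParams($query)
{
    return array(
        'index' => 'songlist_index',
        'type'  => 'songlist_type',
        'body'  => array(
            'query' => array(
                'match' => array('title' => $query),
            ),
        ),
    );
}

// Collect the stored fields (_source) of every matching entry.
// Returns an empty array when nothing matched.
function extractSources(array $response)
{
    $sources = array();
    if (isset($response['hits']['hits'])) {
        foreach ($response['hits']['hits'] as $hit) {
            $sources[] = $hit['_source'];
        }
    }
    return $sources;
}

// Usage, assuming $client is an Elasticsearch\Client instance:
// $query   = isset($_GET['query']) ? $_GET['query'] : '';
// $results = extractSources($client->search(buildSearchParams($query)));
```

Each element of $results then carries the title and the songlist_id, and the songlist_id can be used to look up the songs in MySQL.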

	Array (
	    [took] => 2
	    [timed_out] =>
	    [_shards] => Array (
	        [total] => 5
	        [successful] => 5
	        [failed] => 0
	    )
	    [hits] => Array (    // details of the matched index entries; empty if nothing matches
	        [total] => 1    // number of matching index entries
	        [max_score] => 2.2478988
	        [hits] => Array (
	            [0] => Array (
	                [_index] => songlist_index    // index name
	                [_type] => songlist_type    // mapping of the index
	                [_id] => 209    // corresponding ID
	                [_score] => 2.2478988
	                [_source] => Array (
	                    [title] => what's bothering you.    // the matched title
	                    [songlist_id] => 209    // the indexed entry
	                )
	            )
	        )
	    )
	)


Once you have the results, you can print them to the page.

(vi) Result

I borrowed the Baidu logo for the page, but the point is the search results, which are as follows:

(vii) Summary

This was a simple use of Elasticsearch to build an index. If you have a similar need later, remember that a retrieval system can help.

