ES-based Full-text Search

Source: Internet
Author: User
Background:

A given set of text files needs to be searched by a given keyword.

Project design:

ES + Python
ES is used to build a full-text index over the texts, and searches for a given keyword are sent directly to ES.

Setting up the ES service:

Download Elasticsearch.
Extract the archive and run ./bin/elasticsearch -d from the installation directory to start the service in the background.
If ES reports that the Java version is too old, Java must be updated first:
Elasticsearch requires at least Java 8 but your Java version from /usr/bin/java does not meet this requirement
To update the Java version:

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

Java version after installation:

Then start ES directly from the bin directory with ./elasticsearch

Result of accessing port 9200:

Basic Concepts:

Index:
An index in ES is the equivalent of a database.
As a verb, it refers to the process of saving a document into ES; once a document has been indexed, ES can search for it.
As a noun, it refers to the place where documents are stored, which corresponds to a "database" in relational terms.

Type:
Equivalent to a table in a database.

ID:
Unique, equivalent to a primary key, and used to identify a document. The index/type/id combination of a document must be unique. If no ID is specified, one is generated automatically.
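
To make the index/type/ID addressing concrete, here is a minimal sketch that builds the REST URL of a single document. The helper name document_url and the base URL constant are assumptions for illustration; the lyrics/fulltext/7 values are the ones this article uses in its own examples.

```python
from urllib.parse import quote

ES_BASE = "http://127.0.0.1:9200"  # assumed address of a local ES node


def document_url(index, doc_type, doc_id, base=ES_BASE):
    """Return the REST URL that addresses one document.

    The index/type/id triple plays the role of database/table/primary key.
    """
    # Percent-encode each path segment so unusual IDs cannot break the URL.
    parts = [quote(str(p), safe="") for p in (index, doc_type, doc_id)]
    return base + "/" + "/".join(parts)


print(document_url("lyrics", "fulltext", 7))
```

Two documents can share an ID only if they differ in index or type, which is why the whole triple appears in the path.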

Document:
A document is a JSON text stored in ES and can be thought of as a row in a relational database table. Each document is stored in an index and has a type and an ID. A document is a JSON object (a hash, hashmap, or associative array in some languages) containing zero or more fields (key-value pairs). After indexing, the original JSON text is kept in the _source field, which is included by default in search results.
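
As a rough illustration of the document/_source relationship, the sketch below serializes a document and then reads it back out of a response-shaped structure. The field values and the response_text variable are hand-written stand-ins, not output captured from a real ES server, and source_of is a hypothetical helper name.

```python
import json

# A document is a JSON object: zero or more key-value fields.
doc = {"krcid": "7", "content": "some lyric text"}  # illustrative values

body = json.dumps(doc)  # the JSON text that would be sent to ES for indexing

# Hand-written stand-in for the reply ES gives when the document is fetched;
# the original JSON comes back under the "_source" key.
response_text = json.dumps({
    "_index": "lyrics",
    "_type": "fulltext",
    "_id": "7",
    "found": True,
    "_source": doc,
})


def source_of(response_json):
    """Return the original document stored under _source, or None."""
    return json.loads(response_json).get("_source")


print(source_of(response_text) == doc)
```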

Node:
A node is one ES instance. A machine can run several instances, but instances on the same machine must be configured with different HTTP and TCP ports.

Cluster:
A cluster consists of multiple nodes, one of which is elected as the master node. The master/slave distinction only matters inside the cluster.

Shards:
A shard is a slice of an index. ES can split a complete index into multiple shards, so that a large index can be distributed across different nodes to form a distributed search. When ES stores a document, it selects the appropriate primary shard and saves the document there (a shard can be thought of as a physical storage area). The number of primary shards is fixed: it is set when the index is created (the default is 5) and cannot be changed afterwards.
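
Because the primary shard count has to be chosen up front, it is usually passed in the settings when the index is created. The following is a minimal sketch in Python 3 (the article's own code is Python 2) that only builds such a request; the function name create_index_request and the local base URL are assumptions.

```python
import json
import urllib.request


def create_index_request(index, shards=5, base="http://127.0.0.1:9200"):
    """Build (but do not send) a PUT request that creates an index."""
    settings = {
        "settings": {
            "number_of_shards": shards,  # cannot be changed once the index exists
        }
    }
    return urllib.request.Request(
        base + "/" + index,
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = create_index_request("lyrics", shards=5)
print(req.get_method(), req.full_url)
# To actually create the index: urllib.request.urlopen(req, timeout=10)
```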
Where there are primary shards there are also secondary ones, which ES calls replica shards. ES lets you configure multiple replicas of an index. Replica shards serve two main purposes:
1) High availability: if the node holding a shard goes down, requests can be served by nodes holding its replicas, and after the node recovers, its shard data can be restored from the other nodes. This improves the fault tolerance of the system.
2) Load balancing: ES automatically routes searches according to load, and replica shards share the load.

Building the full-text index with Python and ES:

On a freshly started ES service, you can first query the settings of all indexes. Since no index exists yet, the result is empty.
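
One way to run that check from Python is sketched below, assuming the settings are read from the /_settings endpoint of a local node. The function names and the opener parameter (which exists only so the function can be exercised without a running server) are hypothetical. The sketch uses Python 3's urllib.request rather than the urllib2 module used in the article's code.

```python
import json
import urllib.request


def fetch_all_settings(base="http://127.0.0.1:9200",
                       opener=urllib.request.urlopen):
    """GET /_settings and return the parsed JSON (one key per index)."""
    with opener(base + "/_settings", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def index_names(settings):
    """Names of the indexes present in a /_settings response."""
    return sorted(settings.keys())


# Usage against a running node (not executed here):
# print(index_names(fetch_all_settings()))
```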

To index a document by calling the ES interface:

# -*- coding: utf-8 -*-
# Python 2 (uses urllib2)
__author__ = 'Jasonliu'

import json
import socket
import urllib2

id = "7"
# Placeholder: fill in the IP or domain name of the service that provides the text
url = "http://<IP or domain name to obtain data>:8071/getext?nohead=1&id=" + str(id)
try:
    # Fetch the text and strip line breaks
    s = urllib2.urlopen(url, timeout=10).read().replace("\r\n", "")
    if len(s) > 0:
        d = json.dumps({"krcid": id, "content": s})
        # Create a document; the request fails if the document already exists
        url = "http://127.0.0.1:9200/lyrics/fulltext/" + str(id) + "/_create"
        urllib2.urlopen(url, d, timeout=10)
        print("Add success:" + str(id))
except socket.timeout as e:
    print("timeout:" + str(id))
    raise e
except:
    print("Add error:" + str(id))

The code above adds a JSON document containing lyric information to index=lyrics, type=fulltext, with id=krcid. To add all the texts, simply wrap this code in an outer loop.
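
The outer loop could look roughly like the sketch below. It is written for Python 3 (urllib.request) rather than the Python 2 urllib2 used above, and the names create_request and index_all, the send parameter, and the choice to collect failed IDs instead of re-raising are all assumptions made for illustration.

```python
import json
import urllib.request

ES_BASE = "http://127.0.0.1:9200"  # assumed address of a local ES node


def create_request(doc_id, content):
    """Build the _create request for one lyric document."""
    body = json.dumps({"krcid": doc_id, "content": content}).encode("utf-8")
    url = "%s/lyrics/fulltext/%s/_create" % (ES_BASE, doc_id)
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})


def index_all(texts, send=urllib.request.urlopen):
    """Index every (doc_id, content) pair; return the IDs that failed."""
    failed = []
    for doc_id, content in texts:
        if not content:
            continue  # skip empty texts, as the single-document code does
        try:
            send(create_request(doc_id, content), timeout=10)
        except OSError:  # covers URLError, HTTPError and socket timeouts
            failed.append(doc_id)
    return failed


# Usage against a running node (not executed here):
# failed = index_all([("7", "first lyric"), ("8", "second lyric")])
```

Returning the failed IDs lets a caller retry them later without stopping the whole run on the first error.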
You can look up the document by its ID like this:
http://127.0.0.1:9200/lyrics/fulltext/7
Query results:

Keyword query:

Full-text search for a specified keyword:

# -*- coding: utf-8 -*-
# Python 2 (uses urllib2)
__author__ = 'jasonliu'

import json
import socket
import urllib2

id = "7"
try:
    # Search endpoint for index=lyrics, type=fulltext
    url = "http://127.0.0.1:9200/lyrics/fulltext/_search?"
    queryparams = "pretty&size=1000"
    url = url + queryparams
    query_template = ('{"_source": ["krcid"], '
                      '"query": {"match": {"content": {"query": "I love es", "type": "phrase", "slop": 20}}}, '
                      '"highlight": {"pre_tags": ["<kw>"], "post_tags": ["</kw>"], "fields": {"content": {}}}}')
    # Replace the keyword in the template
    tempjason = json.loads(query_template)
    tempjason["query"]["match"]["content"]["query"] = "sky The fog came carelessly"
    data = json.dumps(tempjason)
    print("data=" + data)
    cnx = urllib2.urlopen(url, data, timeout=60)
    tempcnx = cnx.read()
    print(tempcnx)
    # kg_song_info = json.loads(tempcnx)
except socket.timeout as e:
    print("timeout:" + str(id))
    raise e
except:
    print("Search error:" + str(id))
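
The same request body can be built as a Python dict instead of a string template, and the hits can be pulled out of the parsed response. The sketch below shows both steps under stated assumptions: phrase_query and extract_hits are hypothetical helper names, and sample is a hand-written stand-in for a search response, not real ES output. The "type": "phrase" form follows this article's query; newer ES versions express a phrase search with match_phrase instead.

```python
import json


def phrase_query(keyword, slop=20):
    """Build the same search body as the string template, as a dict."""
    return {
        "_source": ["krcid"],  # return only the krcid field of each hit
        "query": {"match": {"content": {
            "query": keyword, "type": "phrase", "slop": slop}}},
        "highlight": {
            "pre_tags": ["<kw>"],
            "post_tags": ["</kw>"],
            "fields": {"content": {}},
        },
    }


def extract_hits(response):
    """Return (krcid, highlighted fragments) for each hit in a response."""
    results = []
    for hit in response.get("hits", {}).get("hits", []):
        krcid = hit.get("_source", {}).get("krcid")
        fragments = hit.get("highlight", {}).get("content", [])
        results.append((krcid, fragments))
    return results


# Hand-written stand-in for a search response, for illustration only.
sample = {"hits": {"hits": [
    {"_source": {"krcid": "7"},
     "highlight": {"content": ["I <kw>love</kw> <kw>es</kw>"]}},
]}}

print(json.dumps(phrase_query("I love es")))
print(extract_hits(sample))
```

Building the body as a dict avoids the escaped quotes of the template and makes it harder to produce invalid JSON when the keyword changes.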


At this point, you have a simple implementation of text indexing and full-text search.
