Background:
A given set of text files needs to be searched by a given keyword.
Project design:
ES + Python
Use ES to build a full-text index over the texts, then query ES directly with the given search keyword.
Setting up the ES service:
Download Elasticsearch.
Extract the archive and run ./bin/elasticsearch -d from the extracted directory to start the service in the background.
If startup reports that the Java version is too old, Java needs to be updated:
Elasticsearch requires at least Java 8 but your Java version from /usr/bin/java does not meet this requirement
To update the Java version:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
Java version after installation:
Start the service directly from the bin directory with ./elasticsearch
Result of accessing port 9200:
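The same check can be done from Python. The sketch below is only an illustration: it uses Python 3's urllib.request rather than the urllib2 used in the scripts later in this post, the helper names summarize_root and check_service are made up, and the sample response is abbreviated with placeholder values.

```python
import json
import urllib.request


def summarize_root(body):
    """Pick the node name, cluster name and version out of the root response."""
    info = json.loads(body)
    return {
        "name": info.get("name"),
        "cluster_name": info.get("cluster_name"),
        "version": info.get("version", {}).get("number"),
    }


def check_service(base_url="http://127.0.0.1:9200", timeout=10):
    """Request the root endpoint of a running ES node (needs a live service)."""
    with urllib.request.urlopen(base_url, timeout=timeout) as resp:
        return summarize_root(resp.read().decode("utf-8"))


# Abbreviated, made-up sample of the JSON served on port 9200.
sample = json.dumps({
    "name": "node-1",
    "cluster_name": "elasticsearch",
    "version": {"number": "5.0.0"},
    "tagline": "You Know, for Search",
})
summary = summarize_root(sample)
print(summary)
```

With the service running, check_service() would return the same kind of summary for the real node.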
Basic Concepts:
Index:
The index in ES is the equivalent of a database.
As a verb, it refers to the process of "saving" a document into ES, and after indexing a document, we can use ES to search for the document.
As a noun, it refers to the place where the document is saved, which is equivalent to a "library" in the concept of a database.
Type:
Equivalent to a table in the database.
ID:
Equivalent to a primary key; it identifies a document. The combination index/type/id of a document must be unique. If no ID is specified, one is generated automatically.
Document:
A document is a JSON text stored in ES and can be thought of as a row in a relational database table. Each document is stored in an index and has a type and an ID. A document is a JSON object (a hash/hashmap/associative array in some languages) that contains zero or more fields (key-value pairs). The original JSON text is kept in the _source field after indexing and is included by default in the results returned by a search.
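To make the index/type/id addressing concrete, here is a small sketch in Python 3. The helper document_url is hypothetical, the index and type names (lyrics, fulltext) are the ones used later in this post, and the field values are placeholders.

```python
import json


def document_url(index, doc_type, doc_id, base_url="http://127.0.0.1:9200"):
    """Build the index/type/id address that identifies a single document."""
    return "{}/{}/{}/{}".format(base_url, index, doc_type, doc_id)


# A document is a plain JSON object made of fields (key-value pairs).
doc = {"krcid": "7", "content": "some lyric text"}
body = json.dumps(doc)

url = document_url("lyrics", "fulltext", doc["krcid"])
print(url)
print(body)
```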
Node:
A node is one ES instance. A machine can run multiple instances, but instances on the same machine must be given different HTTP and TCP ports in their configuration files.
Cluster:
A cluster consists of multiple nodes, one of which is elected as the master node. The master/slave distinction only matters inside the cluster.
Shards:
A shard is a slice of an index. ES can split a complete index into multiple shards, so that a large index can be distributed across different nodes to form a distributed search. When a document is indexed, ES selects the appropriate primary shard and stores the document there (a shard can be thought of as a physical storage area). The number of primary shards is fixed: it must be decided before the index is created (the default is 5) and cannot be changed afterwards.
Where there are primary shards there are also secondary ones, which ES calls replica shards. ES can keep multiple copies of an index. Replica shards serve two main purposes:
1) High availability: if the node holding a shard goes down, requests can be served by replica shards on other nodes. After the node recovers, its shard data can be restored from the other nodes, which improves the fault tolerance of the system.
2) Load balancing: ES automatically routes searches according to load, and the replica shards share that load.
Building the full-text index with Python & ES
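As a sketch of how the shard and replica counts described above could be set explicitly, the following Python 3 snippet builds a settings body with number_of_shards and number_of_replicas and sends it with a PUT to the index name. The helper names index_settings and create_index are made up for illustration, and the values 5 and 1 are only example choices.

```python
import json
import urllib.request


def index_settings(shards=5, replicas=1):
    """Settings body: primary shard count and replica count for a new index."""
    return {"settings": {"number_of_shards": shards,
                         "number_of_replicas": replicas}}


def create_index(name, settings, base_url="http://127.0.0.1:9200", timeout=10):
    """PUT the settings to create the index (needs a live ES service)."""
    req = urllib.request.Request(
        "{}/{}".format(base_url, name),
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


body = index_settings(shards=5, replicas=1)
print(json.dumps(body))
# With ES running: create_index("lyrics", body)
```

The primary shard count has to be chosen here because it is fixed once the index exists, while the replica count can still be adjusted later.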
For a freshly started ES service, you can first query the settings of all indexes. Since no index exists yet, the result is empty:
Create an index by calling the ES interface:
# -*- coding: utf-8 -*-
__author__ = 'jasonliu'
import datetime
import urllib2
import json
import time
import httplib
import socket

id = "7"
# Data source that returns the text for the given id (fill in the real host)
url = "http://IP or domain name to obtain data:8071/getext?nohead=1&id=" + str(id)
try:
    # Fetch the text and strip Windows line breaks
    s = urllib2.urlopen(url, timeout=10).read().replace("\r\n", "")
    if len(s) > 0:
        d = json.dumps({"krcid": id, "content": s})
        # Create a document; if it already exists, the request fails
        url = "http://127.0.0.1:9200/lyrics/fulltext/" + str(id) + "/_create"
        urllib2.urlopen(url, d, timeout=10)
        print("Add success:" + str(id))
except socket.timeout, e:
    print("timeout:" + str(id))
    raise e
except:
    print("Add error:" + str(id))
The above code adds one document, a JSON text containing the lyric information, to index=lyrics, type=fulltext, with id=krcid. To index the full set of texts, simply wrap this in an outer loop.
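A minimal sketch of such an outer loop is shown below. It is written for Python 3 (urllib.request instead of urllib2), which is an assumption and not the original script. The names fetch_text, post_document and index_all are hypothetical, and fetch_text is only a stand-in for the request to the data source.

```python
import json
import urllib.request


def fetch_text(doc_id):
    """Hypothetical stand-in for the request to the data source."""
    return "lyric text for " + str(doc_id)


def post_document(doc_id, body, base_url="http://127.0.0.1:9200", timeout=10):
    """Create one document through _create (needs a live ES service)."""
    url = "{}/lyrics/fulltext/{}/_create".format(base_url, doc_id)
    req = urllib.request.Request(url, data=body.encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=timeout).close()


def index_all(ids, fetch=fetch_text, post=post_document):
    """Index every id in turn and report which ones were added or failed."""
    added, failed = [], []
    for doc_id in ids:
        try:
            text = fetch(doc_id).replace("\r\n", "")
            if len(text) > 0:
                post(doc_id, json.dumps({"krcid": doc_id, "content": text}))
                added.append(doc_id)
        except Exception:
            failed.append(doc_id)
    return added, failed


# Dry run with a fake sender, so the loop can be tried without ES running.
sent = []
added, failed = index_all(["7", "8", "9"], post=lambda i, b: sent.append((i, b)))
print(added, failed)
```

Passing the fetch and post functions as parameters keeps the loop itself free of network code, so the real senders can be swapped in once the service is up.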
You can look up this document by its ID as follows:
http://127.0.0.1:9200/lyrics/fulltext/7
Query results:
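The response to a lookup by ID is a JSON object whose _source field holds the stored document. The Python 3 sketch below shows one way to read it. The helper names are hypothetical and the sample response is a made-up, abbreviated example rather than output captured from the service.

```python
import json
import urllib.request


def parse_get_response(body):
    """Return the stored _source of a get-by-ID response, or None if absent."""
    result = json.loads(body)
    if not result.get("found"):
        return None
    return result.get("_source")


def get_document(doc_id, base_url="http://127.0.0.1:9200", timeout=10):
    """Fetch one document by ID (needs a live ES service).

    For a missing document ES answers with HTTP 404, which urlopen
    raises as urllib.error.HTTPError.
    """
    url = "{}/lyrics/fulltext/{}".format(base_url, doc_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_get_response(resp.read().decode("utf-8"))


# Made-up sample response for the document with ID 7.
sample = json.dumps({
    "_index": "lyrics", "_type": "fulltext", "_id": "7", "_version": 1,
    "found": True,
    "_source": {"krcid": "7", "content": "some lyric text"},
})
source = parse_get_response(sample)
print(source)
```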
Keyword query
Full-text search for the specified keyword:
# -*- coding: utf-8 -*-
__author__ = 'jasonliu'
import datetime
import urllib2
import json
import time
import httplib
import socket

id = "7"
try:
    # Search endpoint of the lyrics index
    url = "http://127.0.0.1:9200/lyrics/fulltext/_search?"
    queryparams = "pretty&size=1000"
    url = url + queryparams
    # Phrase match on "content", returning only krcid and highlighting hits with <kw> tags
    query_template = '{"_source": ["krcid"], "query": {"match": {"content": {"query": "I love es", "type": "phrase", "slop": 20}}}, "highlight": {"pre_tags": ["<kw>"], "post_tags": ["</kw>"], "fields": {"content": {}}}}'
    # Replace the keyword in the template
    tempjason = json.loads(query_template)
    tempjason["query"]["match"]["content"]["query"] = "sky The fog came carelessly"
    data = json.dumps(tempjason)
    print 'data=', data
    cnx = urllib2.urlopen(url, data, timeout=60)
    tempcnx = cnx.read()
    print tempcnx
    # kg_song_info = json.loads(tempcnx)
    # print("Add success:" + str(id))
except socket.timeout, e:
    print("Timeout:" + str(id))
    raise e
except:
    print("Add error:" + str(id))
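As a follow-up sketch in Python 3, the query body can also be built as a dictionary instead of editing a JSON string, and the matching IDs and highlighted fragments can be pulled out of the search response. The function names build_phrase_query and extract_hits are hypothetical, the query mirrors the structure used in the script above, and the sample response is a made-up, abbreviated example.

```python
import json


def build_phrase_query(keyword, slop=20):
    """Same query shape as the script above, with the keyword as a parameter."""
    return {
        "_source": ["krcid"],
        "query": {"match": {"content": {"query": keyword,
                                        "type": "phrase",
                                        "slop": slop}}},
        "highlight": {"pre_tags": ["<kw>"], "post_tags": ["</kw>"],
                      "fields": {"content": {}}},
    }


def extract_hits(body):
    """Collect krcid, score and highlighted fragments for every hit."""
    result = json.loads(body)
    rows = []
    for hit in result.get("hits", {}).get("hits", []):
        rows.append({
            "krcid": hit.get("_source", {}).get("krcid"),
            "score": hit.get("_score"),
            "fragments": hit.get("highlight", {}).get("content", []),
        })
    return rows


query = build_phrase_query("I love es")
print(json.dumps(query))

# Made-up, abbreviated search response containing a single hit.
sample = json.dumps({"hits": {"total": 1, "hits": [{
    "_index": "lyrics", "_type": "fulltext", "_id": "7", "_score": 1.5,
    "_source": {"krcid": "7"},
    "highlight": {"content": ["<kw>I</kw> <kw>love</kw> <kw>es</kw>"]},
}]}})
rows = extract_hits(sample)
print(rows)
```

Building the body as a dictionary avoids the escaped quotes of the string template and makes it harder to produce invalid JSON when the keyword changes.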
At this point, a simple way to index the texts and search them is in place.