44. Python distributed crawler builds a search engine: Scrapy explained - elasticsearch (search engine) basic queries

Source: Internet
Author: User
Tags: create index, python, scrapy
1. Querying with elasticsearch (search engine)

Elasticsearch is a very powerful search engine; the point of using it is to query the data we need quickly.

Query categories:
Basic query: query using elasticsearch's built-in query conditions
Combined query: combine multiple query conditions into a compound query
Filter: filter data by filter conditions while querying, without affecting the relevance score

 

 

2. Creating data in elasticsearch (search engine)

First we create the index, the table (type), the field attributes and field types, and then add the data.

Note: For Chinese text we generally use the ik_max_word Chinese analyzer. Every field that needs to be analyzed into an inverted index must explicitly specify the ik_max_word analyzer, because the system default analyzer is not ik_max_word.

The ik_max_word Chinese analyzer is an elasticsearch (search engine) plugin, located in the plugins/analysis-ik folder of the elasticsearch installation directory; the version used here is 5.1.1.

More instructions: https://github.com/medcl/elasticsearch-analysis-ik

Description:

#Create the index (and set field types)
#Note: for Chinese text we generally use the ik_max_word Chinese analyzer; every field that needs to be analyzed into an inverted index must specify the ik_max_word analyzer explicitly
#The system default analyzer is not ik_max_word
PUT jobbole                        #create the index and set its name
{
  "mappings": {                    #mappings sets the field types
    "job": {                       #type (table) name
      "properties": {              #field definitions
        "title": {                 #field name
          "store": true,           #store: true means the original field value is stored
          "type": "text",          #text type: the value is analyzed and an inverted index is built
          "analyzer": "ik_max_word"  #set the analyzer; ik_max_word is the Chinese analyzer plugin
        },
        "company_name": {          #field name
          "store": true,           #store: true means the original field value is stored
          "type": "keyword"        #keyword: plain string type, not analyzed
        },
        "desc": {                  #field name
          "type": "text"           #text type, analyzed, but no analyzer is set, so the system default is used
        },
        "comments": {              #field name
          "type": "integer"        #integer numeric type
        },
        "add_time": {              #field name
          "type": "date",          #date type
          "format": "yyyy-MM-dd"   #yyyy-MM-dd date format
        }
      }
    }
  }
}
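The same mapping can also be created from Python with the official elasticsearch client. A minimal sketch, assuming a local node at 127.0.0.1:9200 and a 5.x-era client API to match the elasticsearch 5.1.1 used in this series:

# Minimal sketch: create the jobbole index and job mapping from Python
# (assumes a local node at 127.0.0.1:9200 and a 5.x-era elasticsearch client)
from elasticsearch import Elasticsearch

es = Elasticsearch(["127.0.0.1:9200"])

mapping = {
    "mappings": {
        "job": {
            "properties": {
                "title": {"store": True, "type": "text", "analyzer": "ik_max_word"},
                "company_name": {"store": True, "type": "keyword"},
                "desc": {"type": "text"},
                "comments": {"type": "integer"},
                "add_time": {"type": "date", "format": "yyyy-MM-dd"},
            }
        }
    }
}

# ignore=400 skips the "index already exists" error on repeated runs
es.indices.create(index="jobbole", body=mapping, ignore=400)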



#Save a document (equivalent to writing a row of data into a database table)
POST jobbole/job
{
  "title": "python django development engineer",          #field name: value
  "company_name": "Meituan Technology Co., Ltd.",
  "desc": "Familiar with the concept of django, familiar with python basics",
  "comments": 20,
  "add_time": "2017-04-01"
}

POST jobbole/job
{
  "title": "Python scrapy redis distributed crawler foundation",
  "company_name": "Yuxiu Technology Co., Ltd.",
  "desc": "Familiar with the concept of scrapy, familiar with basic knowledge of redis",
  "comments": 5,
  "add_time": "2017-04-02"
}

POST jobbole/job
{
  "title": "elasticsearch builds search engine",
  "company_name": "Communication Technology Co., Ltd.",
  "desc": "Familiar with the concept of elasticsearch",
  "comments": 10,
  "add_time": "2017-04-03"
}

POST jobbole/job
{
  "title": "python builds a recommendation engine system",
  "company_name": "Intelligent Technology Co., Ltd.",
  "desc": "Familiar with recommendation engine system algorithms",
  "comments": 60,
  "add_time": "2017-04-04"
}
From the above we can see that we have created the index, set the field attributes, types, and analyzer, and added 4 documents.
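Documents can also be written from Python; a rough sketch, assuming the same local node and 5.x-era client as in the earlier sketch:

# Rough sketch: index one of the documents above from Python
from elasticsearch import Elasticsearch

es = Elasticsearch(["127.0.0.1:9200"])  # assumed local node

doc = {
    "title": "python django development engineer",
    "company_name": "Meituan Technology Co., Ltd.",
    "desc": "Familiar with the concept of django, familiar with python basics",
    "comments": 20,
    "add_time": "2017-04-01",
}
# doc_type is required by the 5.x client; elasticsearch assigns the _id automatically
es.index(index="jobbole", doc_type="job", body=doc)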


 

3. Elasticsearch (search engine) basic queries

 

 

match query [most used]
The search term is analyzed with the analyzer configured on the target field and then matched against that field; the higher the match score, the higher the ranking. Uppercase letters in the search term are automatically converted to lowercase.

#match query
#The search term is analyzed and matched against the specified field; the higher the match score, the higher the ranking
GET jobbole/job/_search
{
  "query": {
    "match": {
      "title": "Search Engine"
    }
  }
}
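The same match query can be run from Python by passing the query body as a dict; a minimal sketch, assuming the local client set up in section 2:

# Minimal sketch: run the match query from Python and print the matching titles
from elasticsearch import Elasticsearch

es = Elasticsearch(["127.0.0.1:9200"])  # assumed local node

body = {"query": {"match": {"title": "search engine"}}}
res = es.search(index="jobbole", doc_type="job", body=body)
for hit in res["hits"]["hits"]:
    # _score is the relevance score; higher scores rank first
    print(hit["_score"], hit["_source"]["title"])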
 

 

term query
The search term is not analyzed; it must match the stored term exactly

#term query
#The search term is not analyzed; it must match the stored term exactly
GET jobbole/job/_search
{
  "query": {
    "term": {
      "title": "Search Engine"
    }
  }
}
 

 

terms query
Pass an array of terms; a document matches if the field contains any of the terms in the array

#terms query
#Pass an array of terms; a document matches if the field contains any of them
GET jobbole/job/_search
{
  "query": {
    "terms": {
      "title": ["Engineer", "django", "System"]
    }
  }
}
 

 

Controlling the number of results returned
from: the offset of the first result to return
size: the number of results to return

#Control the number of results returned
#from: offset of the first result
#size: number of results to return
GET jobbole/job/_search
{
  "query": {
    "match": {
      "title": "Search Engine"
    }
  },
  "from": 0,
  "size": 3
}
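from and size map directly onto result paging. A small sketch that walks through all pages of a match query, with a page size of 2 chosen only for illustration and the same assumed local client:

# Small sketch: page through the results of a match query with from/size
from elasticsearch import Elasticsearch

es = Elasticsearch(["127.0.0.1:9200"])  # assumed local node

page_size = 2
page = 0
while True:
    body = {
        "query": {"match": {"title": "search engine"}},
        "from": page * page_size,   # offset of the first result on this page
        "size": page_size,          # number of results on this page
    }
    hits = es.search(index="jobbole", doc_type="job", body=body)["hits"]["hits"]
    if not hits:
        break
    for hit in hits:
        print(page, hit["_source"]["title"])
    page += 1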
 

 

match_all query: returns all documents

#match_all query: returns all documents
GET jobbole/job/_search
{
  "query": {
    "match_all": {}
  }
}
 

 

match_phrase query
Phrase query
A phrase query analyzes the search phrase into a list of terms, such as [python, development]
The searched field must contain all of the terms in the list to match
slop sets how many term positions are allowed between the analyzed terms, e.g. between python and development
If the gap is within the slop value it matches; if it is larger than the slop value it does not

#match_phrase query
#Phrase query
#The search phrase is analyzed into a list of terms, such as [python, development]
#The searched field must contain all the terms in the list to match
#slop sets how many term positions are allowed between the analyzed terms
#A gap within the slop value matches; a gap larger than the slop value does not
GET jobbole/job/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "elasticsearch engine",
        "slop": 3
      }
    }
  }
}
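The same phrase query from Python makes it easy to experiment with different slop values; a sketch with the assumed local client:

# Sketch: match_phrase with slop; try different slop values to see when
# "elasticsearch" and "engine" are still close enough to count as a phrase match
from elasticsearch import Elasticsearch

es = Elasticsearch(["127.0.0.1:9200"])  # assumed local node

body = {
    "query": {
        "match_phrase": {
            "title": {"query": "elasticsearch engine", "slop": 3}
        }
    }
}
for hit in es.search(index="jobbole", doc_type="job", body=body)["hits"]["hits"]:
    print(hit["_source"]["title"])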
 

 

multi_match query
Lets you query several fields at once
For example, query for data whose title field or desc field contains the python keyword
query sets the search term
fields lists the fields to search
title^3 sets the field weight: a keyword match in title counts 3 times as much as a match in the other fields

#multi_match query
#Lets you query several fields at once
#For example, query for data whose title field or desc field contains the python keyword
#query sets the search term
#fields lists the fields to search
#title^3 sets the field weight: a match in title counts 3 times as much as a match in the other fields
GET jobbole/job/_search
{
  "query": {
    "multi_match": {
      "query": "Search Engine",
      "fields": ["title^3", "desc"]
    }
  }
}
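In Python the field boost is written the same way inside the fields list; a short sketch with the assumed local client:

# Sketch: multi_match across title and desc, with matches in title weighted 3x
from elasticsearch import Elasticsearch

es = Elasticsearch(["127.0.0.1:9200"])  # assumed local node

body = {
    "query": {
        "multi_match": {
            "query": "search engine",
            "fields": ["title^3", "desc"],   # ^3 boosts matches in title
        }
    }
}
res = es.search(index="jobbole", doc_type="job", body=body)
for hit in res["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])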
 

 

stored_fields sets which fields are shown in the search results

Note: a field listed in stored_fields must have its store attribute set to true. If a field does not set the store attribute it defaults to false, and the field will not be returned in the results.

#stored_fields sets which fields are shown in the search results
GET jobbole/job/_search
{
  "stored_fields": ["title", "company_name"],
  "query": {
    "multi_match": {
      "query": "Search Engine",
      "fields": ["title^3", "desc"]
    }
  }
}
 

 

Sorting search results with sort
Note: the field sorted on must be a number or a date
desc: descending
asc: ascending

#Sort search results with sort
#Note: the field sorted on must be a number or a date
#desc: descending
#asc: ascending
GET jobbole/job/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [{
      "comments": {
        "order": "asc"
      }
    }]
}
 

 

range query on a field's value
Queries for values of a field within a range
Note: the field value must be a number or a date
gte: greater than or equal to
gt: greater than
lte: less than or equal to
lt: less than
boost is the weight; you can set a weight for the specified field

#range query on a field's value
#Queries for values of a field within a range
#Note: the field value must be a number or a date
#gte: greater than or equal to
#gt: greater than
#lte: less than or equal to
#lt: less than
#boost is the weight; you can set a weight for the specified field
GET jobbole/job/_search
{
  "query": {
    "range": {
      "comments": {
        "gte": 10,
        "lte": 20,
        "boost": 2.0
      }
    }
  }
}
 

 

range query where the field value is a date

#range query where the field value is a date
#Queries for date values of a field within a range
#Note: the field value must be a date
#gte: greater than or equal to
#gt: greater than
#lte: less than or equal to
#lt: less than
#now is the current time
GET jobbole/job/_search
{
  "query": {
    "range": {
      "add_time": {
        "gte": "2017-04-01",
        "lte": "now"
      }
    }
  }
}
 

 

wildcard query (wildcard matching)
* represents zero or more arbitrary characters

#wildcard query (wildcard matching)
#* represents zero or more arbitrary characters
GET jobbole/job/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "py*n",
        "boost": 2
      }
    }
  }
}
 

 

fuzzy query (fuzzy matching)

#fuzzy query (fuzzy matching)
#Searches for content containing terms similar to the search term
GET lagou/biao/_search
{
  "query": {
    "fuzzy": {"title": "Advertising"}
  },
  "_source": ["title"]
}


#fuzziness sets the edit distance: how many single-character edits (insert, delete, replace) are needed to turn an indexed term into the search term
#prefix_length is the number of leading characters that must match exactly and do not take part in the fuzzy matching
GET lagou/biao/_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "Advertising Recording",
        "fuzziness": 2,
        "prefix_length": 2
      }
    }
  },
  "_source": ["title"]
}
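A quick way to see fuzziness and prefix_length in action is to misspell a term on purpose. The sketch below runs against the jobbole index from section 2 rather than the lagou index above; the misspelt value "pythn" and the local client are illustrative assumptions only:

# Sketch: fuzzy query; "pythn" is within edit distance 1 of the indexed term "python",
# and prefix_length=1 requires the first character to match exactly
from elasticsearch import Elasticsearch

es = Elasticsearch(["127.0.0.1:9200"])  # assumed local node

body = {
    "query": {
        "fuzzy": {
            "title": {"value": "pythn", "fuzziness": 2, "prefix_length": 1}
        }
    },
    "_source": ["title"],
}
for hit in es.search(index="jobbole", doc_type="job", body=body)["hits"]["hits"]:
    print(hit["_source"]["title"])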
 


