Tech Prophet Issue 5: ElasticSearch Full Text Search Technology Sharing


Technology needs to be communicated to accelerate progress, and knowledge needs to be shared to spread further.

Whether you ask others for advice or find knowledge online, the growth of your skills comes from the selfless sharing that others do for you.

-- Progress begins with communication and gains come from sharing!

After the throwback, the heavy hitters come in.

On November 30, 2018, the [Carlson College] Tech Prophet's 5th sharing session took place. Mr. Ah Tu gave the R&D partners an excellent technical talk on ElasticSearch (abbreviated as "ES"). Drawing on his work experience and related cases, Mr. Ah Tu explained step by step: an introduction to ES, ES data storage, ES data analysis, and how to achieve accurate queries.

Introduction to ElasticSearch

Part 1

1

What is ES

ElasticSearch (ES) is an open-source, distributed, RESTful full-text search engine built on Lucene.

ES can scale horizontally to hundreds of servers and store and process petabytes of data. Large amounts of data can be stored, searched, and analyzed in a very short time. Even with billions of records or more, because the underlying layer uses an inverted index, real-time query performance should in theory scale linearly as the data volume and the number of indexes grow, as long as server resources are sufficient.

Lucene-based search engines

LIUS, Solr, Elasticsearch, Katta, Compass, and others are all built on top of Lucene.

Lucene is currently the most popular open-source full-text search toolkit. It is not a full-featured search application, but rather a toolkit focused on text indexing and searching that lets you add indexing and search capabilities to your own application.

Lucene's indexing data structure has become a de facto standard, adopted by many search engines.

2

A few core concepts of ES

cluster

A cluster is a collection of one or more nodes (servers). It is identified by a unique cluster ID and assigned a cluster name, which matters because nodes join the cluster by that name.

node

An ES cluster consists of one or more nodes, each of which takes part in indexing data and serving queries. A node's name is set with node.name in elasticsearch.yml; if it is not configured, a random UUID is used to identify the node.

index

An index is a collection of documents with similar characteristics. Index names must be lowercase.

type

Within an index, you can define one or more types. A type is a logical category or partition of an index.

document

A document is the basic unit of information that can be indexed, analogous to a row of data in a database.

shard

The smallest unit ES works with is the shard, which is the container for data. Documents are stored in shards, and shards are in turn distributed across the nodes in the cluster. When the cluster scales up or down, ES automatically migrates shards between nodes so that data stays evenly distributed across the cluster.

replicas

A replica is a copy of a shard. It provides high availability when a node fails, and it also increases search capacity and throughput, because searches can be executed in parallel on all replicas.

ES Data Storage

Part 2

1

Logical and physical design of ES storage

Logical design.

The basic unit of indexing and searching is the document, which can be thought of as a row in a relational database. Documents are grouped by type; a type contains a number of documents, much as a table contains a number of rows. Finally, one or more types live in the same index. An index is a larger container, similar to a database in the SQL world.

Physical design.

Behind the scenes, ES splits each index into shards, and each shard can be migrated between nodes (servers) in the cluster. How the physical design is configured determines the performance, scalability, and availability of the cluster.

2

Inverted index

The ES engine writes document data into an inverted index, a data structure that maps terms to documents. The data is term-oriented rather than document-oriented.

(Figure: a simplified inverted index)

After field values are analyzed, they are stored in the inverted index, which records the relationship between terms (Term) and documents (Doc). As the figure shows, the inverted index holds a list of terms. Each term appears in the list only once, together with the number of times it occurs and the documents that contain it. In reality, the inverted index that the ElasticSearch engine builds is much more complex than this.
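The term list described above can be sketched in a few lines of plain Java. This is only a toy model under stated assumptions: the class name InvertedIndexSketch is hypothetical, tokens are produced by lowercasing and splitting on whitespace, and a document's ID is simply its position in the input list. The real index kept by Lucene stores far more than this.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: term -> (total occurrences, IDs of documents containing it).
// Whitespace tokenization and lowercasing are simplifying assumptions.
final class InvertedIndexSketch {

    // One row of the term list: how often the term occurs and in which documents.
    static final class Posting {
        int count = 0;
        final TreeSet<Integer> docIds = new TreeSet<>();
    }

    // Builds the index; a document's ID is its position in the input list.
    static Map<String, Posting> build(List<String> docs) {
        Map<String, Posting> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String text = docs.get(docId).toLowerCase(Locale.ROOT);
            for (String token : text.split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // leading whitespace yields one empty piece
                }
                Posting posting = index.computeIfAbsent(token, t -> new Posting());
                posting.count++;
                posting.docIds.add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Posting> index = build(Arrays.asList(
                "search engines index text",
                "full text search",
                "distributed search"));
        index.forEach((term, p) ->
                System.out.println(term + " -> count=" + p.count + ", docs=" + p.docIds));
    }
}
```

Looking up a term is then a single map access, which is why the structure is described as term-oriented rather than document-oriented.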

3

columnar storage

As the storage layout in the figure shows, row storage keeps the data of a table together row by row, whereas columnar storage keeps each column separately. Only the columns involved in a query are read, which makes columnar storage well suited to sorting and aggregation.

Disadvantages: after selection, the selected columns must be reassembled into rows, and insert/update operations are more cumbersome.
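The trade-off can be illustrated with a small Java sketch. It is a toy model, not how ES lays out data on disk: the class name ColumnStoreSketch and the three numeric fields (id, age, score) are hypothetical, and plain arrays stand in for the storage.

```java
import java.util.Arrays;

// Toy comparison of row and column layouts for a table with three numeric
// fields. The field names (id, age, score) are hypothetical.
final class ColumnStoreSketch {
    static final int ID = 0;
    static final int AGE = 1;
    static final int SCORE = 2;

    // Row layout: rows[r] holds every field of record r.
    // Column layout: columns[f] holds field f of every record.
    static long[][] toColumns(long[][] rows, int fieldCount) {
        long[][] columns = new long[fieldCount][rows.length];
        for (int r = 0; r < rows.length; r++) {
            for (int f = 0; f < fieldCount; f++) {
                columns[f][r] = rows[r][f];
            }
        }
        return columns;
    }

    // Aggregating one field reads a single column and nothing else.
    static long sum(long[][] columns, int field) {
        long total = 0;
        for (long value : columns[field]) {
            total += value;
        }
        return total;
    }

    // The drawback: returning a whole record means reassembling it from every column.
    static long[] rowAt(long[][] columns, int r) {
        long[] row = new long[columns.length];
        for (int f = 0; f < columns.length; f++) {
            row[f] = columns[f][r];
        }
        return row;
    }

    public static void main(String[] args) {
        long[][] rows = { {1, 30, 88}, {2, 25, 92}, {3, 41, 75} };
        long[][] columns = toColumns(rows, 3);
        System.out.println("sum(age) = " + sum(columns, AGE));
        System.out.println("row 1 = " + Arrays.toString(rowAt(columns, 1)));
    }
}
```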

4

Data Security Policy

ES Data Analysis

Part 3

Analysis is the series of processing steps that ES performs on each analyzed field before a document is added to the inverted index.

(1) Character filtering:

Transform characters using character filters.

(2) Tokenization:

Split the text into one or more tokens.

(3) Token filtering:

Transform each token using token filters.

(4) Token indexing:

After the processing above, the tokens are stored in the index.

Together, these tokens make up the inverted index described earlier.
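The first three steps can be sketched as a chain of plain Java functions. The concrete rules here are illustrative assumptions and do not reproduce any built-in ES analyzer: the character filter strips HTML-like tags, the tokenizer splits on anything that is not a letter or digit, and the token filter lowercases tokens and drops a small hypothetical stop-word list. The class name AnalysisSketch is also hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain: character filter -> tokenizer -> token filter.
// All concrete rules below are illustrative assumptions.
final class AnalysisSketch {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    // (1) Character filter: rewrite raw characters before tokenizing.
    static String charFilter(String text) {
        return text.replaceAll("<[^>]*>", " ");
    }

    // (2) Tokenizer: cut the text into tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // (3) Token filter: transform or drop individual tokens.
    static List<String> tokenFilter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String lowered = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lowered)) {
                out.add(lowered);
            }
        }
        return out;
    }

    // The tokens returned here are what step (4) would store in the index.
    static List<String> analyze(String text) {
        return tokenFilter(tokenize(charFilter(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("<p>The Power of Full-Text Search</p>"));
    }
}
```

Because the same chain is normally applied at query time as well, any change to one step affects both what is stored and what can be found.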

1

analyzer

2

Tokenizer

The IK Chinese tokenizer

When using the IK tokenizer, we can customize its dictionary to make search results more accurate, for example by adding specialized vocabulary from laws and regulations or terminology from case handling.

The IK tokenizer offers two strategies: ik_max_word and ik_smart.

ik_max_word: splits the text at the finest granularity. For example, it splits "中华人民共和国国歌" ("National Anthem of the People's Republic of China") into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌", exhausting every possible combination.

ik_smart: does the coarsest-grained split. For example, it splits "中华人民共和国国歌" into "中华人民共和国, 国歌".

3

Token filter

In practice, we can customize the analyzer, the tokenizer, the token filters, and so on to suit our needs, but there are some points to note:

(1) Whatever you customize, you must test and refine it repeatedly, because the analyzer is the foundation of both storage and querying. If it is wrong, a tiny error there can throw the query results far off.

(2) If the analyzer or tokenizer changes, the ES data must be regenerated. Otherwise both storage and search will produce errors.

How to search precisely

Part 4

1

Queries and filters

The main difference between a filter and a query is that a filter does not compute relevance: it does not score the matching documents. A filter is therefore faster than a query, because it does less work. In practice, combining queries with filters can improve search performance.

List of search methods:

match_all

match

query_string

simple_query_string

term

terms

match_phrase

prefix

match_phrase_prefix

multi_match

bool

range

wildcard

nested

2

Minimum Match Query

QueryBuilders.multiMatchQuery(Object text, String... fieldNames).minimumShouldMatch("90%");

text is the query text and fieldNames is the array of fields to search, so the query runs against multiple fields of the document. The percentage is the minimum proportion of tokens that must match.

For example, if the query text yields three tokens after analysis, a recalled doc must match at least 3*0.9=2.7 of them, rounded down to 2.
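The arithmetic in this example can be checked with a short, self-contained Java sketch. It does not call the ES client. The class and method names are hypothetical, and it assumes only the simple positive-percentage form shown above, with the product rounded down.

```java
// Converts a percentage such as "90%" into the number of tokens that must match.
// Assumption: the product is rounded down, as in the worked example;
// negative percentages and combined expressions are not handled.
final class MinimumShouldMatchSketch {

    static int requiredMatches(int tokenCount, String percentage) {
        int percent = Integer.parseInt(percentage.trim().replace("%", ""));
        // Integer division truncates, which rounds down for non-negative inputs.
        return tokenCount * percent / 100;
    }

    public static void main(String[] args) {
        for (int tokens = 1; tokens <= 5; tokens++) {
            System.out.println(tokens + " tokens at 90% -> "
                    + requiredMatches(tokens, "90%") + " must match");
        }
    }
}
```

Because of the rounding, a 90% setting only starts to require every token except one when the query yields ten or more tokens. For short queries it is more lenient than the number suggests.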

3

Phrase Search

QueryBuilders.matchPhraseQuery(String field, Object value).slop(2);

This means that, after value is split into tokens, up to two other tokens may sit between them in a matching document.
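A simplified view of what slop permits can be sketched in plain Java over text that has already been tokenized. This is an assumption-laden model, not Lucene's implementation: it only considers phrase tokens that appear in their original order and counts the extra tokens sitting between them, whereas the real slop calculation also tolerates reordered tokens at a higher cost. The class name SlopSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Simplified in-order phrase matching with slop over already-tokenized text.
// Assumption: reordered phrase tokens are never matched here.
final class SlopSketch {

    static boolean matches(List<String> docTokens, List<String> phrase, int slop) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start < docTokens.size(); start++) {
            if (!docTokens.get(start).equals(phrase.get(0))) {
                continue;
            }
            // Take the earliest later occurrence of each following phrase token.
            int pos = start;
            boolean found = true;
            for (int i = 1; i < phrase.size() && found; i++) {
                pos = indexOfFrom(docTokens, phrase.get(i), pos + 1);
                found = pos >= 0;
            }
            // Extra tokens between the first and last phrase token.
            if (found && (pos - start) - (phrase.size() - 1) <= slop) {
                return true;
            }
        }
        return false;
    }

    private static int indexOfFrom(List<String> tokens, String token, int from) {
        for (int i = from; i < tokens.size(); i++) {
            if (tokens.get(i).equals(token)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("quick", "brown", "red", "fox");
        List<String> phrase = Arrays.asList("quick", "fox");
        System.out.println("slop 2: " + matches(doc, phrase, 2));
        System.out.println("slop 1: " + matches(doc, phrase, 1));
    }
}
```

In the sample document two tokens separate "quick" from "fox", so the phrase is accepted with a slop of 2 and rejected with a slop of 1.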

Several ways to control the precision of ES queries:

(1) Use the query methods provided by the ES open-source code and tune their parameters. If necessary, the open-source code can be modified to meet actual requirements.

(2) Customize the dictionary of the existing IK Chinese tokenizer by adding technical terms and common vocabulary.

(3) Customize character filters and token filters for the actual application to exclude meaningless tokens.

And so on ......

Before we knew it, this sharing session had come to an end. Many thanks to Mr. Ah Tu for bringing a professional ElasticSearch technical talk to the partners!

There is always some truth in sayings that have been passed down for thousands of years, such as "a stone from another mountain can be used to polish jade" and "when three people walk together, one of them can be my teacher".

When you have time, think now and then about where what you have came from, and then pass that idea on when it is within your power and appropriate ......

-- This is both sharing and gratitude!

Carlson College looks forward to hearing from you!

END

Editor: Li Yue

Reviewed by Xu Bingzi

Copyright information: Carlson College.
