Tech Prophet Issue 5: ElasticSearch Full Text Search Technology Sharing
Technology needs to be communicated to accelerate progress, and knowledge needs to be shared to spread further.
Whether you ask others for advice or find knowledge online, the growth of your skills comes from the selfless sharing that others do for you.
-- Progress begins with communication and gains come from sharing!
After the look back, here comes the main event.
On November 30, 2018, the [Carlson College] Tech Prophet sharing session returned for its fifth issue. Mr. Ah Tu gave the R&D partners an excellent technical talk on ElasticSearch (abbreviated as "ES"). Drawing on his work experience and related cases, Mr. Ah Tu explained the topic step by step from several angles: an introduction to ES, ES data storage, ES data analysis, and how to achieve precise queries.
Introduction to ElasticSearch
Part 1
1
What is ES
ElasticSearch (ES) is an open-source, distributed, RESTful full-text search engine built on Lucene.
ES can scale horizontally to hundreds of servers and store and process petabytes of data. Large amounts of data can be stored, searched, and analyzed in a very short time. Even for data volumes in the billions or more, because the underlying layer uses an inverted index, real-time query efficiency in theory scales linearly as the data volume and the index grow, as long as server resources are sufficient.
Lucene-based engines
LIUS, Solr, Elasticsearch, Katta, Compass, and others are all built on top of Lucene.
Lucene is currently the most popular open-source full-text search toolkit. It is not a full-featured search application, but a toolkit focused on text indexing and searching that can add indexing and search capabilities to an application.
Lucene's indexing data structure has become a de facto standard, adopted by many search engines.
2
A few core concepts of ES
cluster
A collection of one or more nodes (servers). A cluster is identified by a unique cluster ID and assigned a cluster name, which is important because nodes can join the cluster by this cluster name.
node
An ES cluster consists of one or more nodes, each of which takes part in indexing data and serving queries. The node name is configured via node.name in elasticsearch.yml; if it is not configured, a random UUID is used to identify the node.
index
An index is a collection of documents with similar characteristics. Index names must be lowercase.
type
Within an index, one or more types can be defined. A type is a logical category or partition of an index.
document
A document is the basic unit of information that can be indexed, analogous to a row of data in a database.
shard
The smallest unit ES works with is the shard. A shard is a container for data: documents are stored in shards, and shards are in turn distributed across the nodes in the cluster. When the cluster scales up or down, ES automatically migrates shards between nodes so that the data stays evenly distributed across the cluster.
replicas
A replica is a copy of a shard. It provides high availability when a node fails, and it also scales search volume and throughput, because searches can be executed in parallel on all replicas.
ES Data Storage
Part 2
1
ES logical and physical storage design
Logical design.
The basic unit of indexing and searching is the document, which can be thought of as a row in a relational database. Documents are grouped by type: a type contains a number of documents, much as a table contains a number of rows. Finally, one or more types live in the same index. The index is the larger container, comparable to a database in the SQL world.
Physical design.
In the backend, ES divides each index into shards, and each shard can be migrated between different nodes (servers) in the cluster. The way the physical design is configured determines the performance, scalability, and availability of the cluster.
2
Inverted index
The ES engine writes document data into an inverted index, a data structure that maps terms to documents. The data is organized by term rather than by document.
(A simplified inverted index is shown in the figure.)
After field values have been analyzed, they are stored in the inverted index. The inverted index stores the relationship between terms (Term) and documents (Doc). As the figure shows, the inverted index holds a list of terms in which each term is unique, together with the number of times the term occurs and the documents that contain it. In reality, the inverted index that the ElasticSearch engine builds is far more complex than this.
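The term-to-document mapping described above can be sketched in a few lines of Java. This is only a toy under stated assumptions: the class name InvertedIndexSketch, its methods, and the naive lowercase-and-split analysis are all hypothetical, and the layout has nothing to do with how Lucene actually stores its index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index: term -> (docId -> occurrences of the term in that doc).
// Illustration only; this is not Lucene's real index format.
class InvertedIndexSketch {

    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();

    // Naive analysis (an assumption of this sketch): lowercase the text
    // and split it on anything that is not an ASCII letter or digit.
    public void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, t -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Ids of the documents that contain the term, in ascending order.
    public List<Integer> docsFor(String term) {
        Map<Integer, Integer> p = postings.get(term.toLowerCase(Locale.ROOT));
        return p == null ? new ArrayList<>() : new ArrayList<>(p.keySet());
    }

    // Total number of occurrences of the term across all documents.
    public int totalFrequency(String term) {
        Map<Integer, Integer> p = postings.get(term.toLowerCase(Locale.ROOT));
        int sum = 0;
        if (p != null) {
            for (int count : p.values()) {
                sum += count;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Elasticsearch is built on Lucene");
        index.add(2, "Lucene is a search toolkit, Lucene is fast");
        System.out.println(index.docsFor("lucene"));        // [1, 2]
        System.out.println(index.totalFrequency("lucene")); // 3
        System.out.println(index.docsFor("toolkit"));       // [2]
    }
}
```

A lookup goes straight from the term to its documents, which is why the data is said to be term-oriented rather than document-oriented.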
3
Columnar storage
As the storage layout in the figure shows, row storage keeps all the data of a table row together, whereas columnar storage keeps each column separately. Only the columns involved in a query are read, which makes columnar storage well suited to sorting and aggregation operations.
Disadvantages: after selection, the selected columns have to be reassembled into rows, and insert/update operations are more cumbersome.
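The trade-off above can be illustrated with a small Java sketch. It is a toy, not ES's storage code: the class ColumnStoreSketch and the fields id, city, and amount are hypothetical names chosen for the example.

```java
import java.util.Arrays;
import java.util.List;

// Toy contrast between a row layout and a column layout.
// The fields (id, city, amount) are hypothetical.
class ColumnStoreSketch {

    // Row-oriented record: all fields of one record sit together.
    static final class Row {
        final int id;
        final String city;
        final int amount;

        Row(int id, String city, int amount) {
            this.id = id;
            this.city = city;
            this.amount = amount;
        }
    }

    // Column-oriented layout: one array per field; index i is record i.
    private final int[] ids;
    private final String[] cities;
    private final int[] amounts;

    ColumnStoreSketch(List<Row> rows) {
        int n = rows.size();
        ids = new int[n];
        cities = new String[n];
        amounts = new int[n];
        for (int i = 0; i < n; i++) {
            Row r = rows.get(i);
            ids[i] = r.id;
            cities[i] = r.city;
            amounts[i] = r.amount;
        }
    }

    // Aggregation reads only the one column it needs.
    public long sumAmounts() {
        long sum = 0;
        for (int a : amounts) {
            sum += a;
        }
        return sum;
    }

    // Returning a whole record means reassembling it from every column,
    // which is the extra cost listed under the disadvantages.
    public Row rowAt(int i) {
        return new Row(ids[i], cities[i], amounts[i]);
    }

    public static void main(String[] args) {
        ColumnStoreSketch store = new ColumnStoreSketch(Arrays.asList(
                new Row(1, "Beijing", 30),
                new Row(2, "Shanghai", 12),
                new Row(3, "Shenzhen", 8)));
        System.out.println(store.sumAmounts());  // 50
        System.out.println(store.rowAt(1).city); // Shanghai
    }
}
```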
4
Data Security Policy
ES Data Analysis
Part 3
Analysis is the series of processing steps that ES performs on each analyzed field before a document is sent off and added to the inverted index.
(1) Character filtering.
Transform characters using character filters.
(2) Tokenization.
Split the text into one or more tokens.
(3) Token filtering.
Transform each token using token filters.
(4) Token indexing.
After the tokens have gone through the steps above, they are stored in the index.
Together they make up the inverted index described earlier.
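Steps (1) to (3) can be sketched as three small functions chained together. This is a minimal illustration, not ES's analyzer implementation: the class AnalysisPipelineSketch, the "&" replacement rule, and the stop-word list are assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain: character filter -> tokenizer -> token filters.
// The concrete rules below are hypothetical examples.
class AnalysisPipelineSketch {

    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "and"));

    // (1) Character filter: rewrite characters before tokenization.
    public static String charFilter(String text) {
        return text.replace("&", " and ");
    }

    // (2) Tokenizer: split the text into tokens on non-letter/digit runs.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // (3) Token filters: lowercase each token, then drop stop words.
    public static List<String> tokenFilter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            String lower = t.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                out.add(lower);
            }
        }
        return out;
    }

    // The full chain; its output is what step (4) would write to the index.
    public static List<String> analyze(String text) {
        return tokenFilter(tokenize(charFilter(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick & Brown Fox")); // [quick, brown, fox]
    }
}
```

Because the same chain runs at index time and at query time, changing any stage changes which terms end up in the index.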
1
Analyzer
2
Tokenizer
Chinese IK tokenizer
When using the IK tokenizer, we can add custom dictionary entries to make search results more accurate, for example specialized vocabulary from laws and regulations or terminology from case handling.
The IK tokenizer has two strategies: ik_max_word and ik_smart
ik_max_word: splits the text at the finest granularity. For example, it splits "中华人民共和国国歌" ("National Anthem of the People's Republic of China") into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌", exhausting every possible combination;
ik_smart: does the coarsest-grained split. For example, it splits "中华人民共和国国歌" into "中华人民共和国, 国歌".
3
Token filter
In practice we can customize analyzers, tokenizers, token filters, and so on to suit our needs, but note the following points.
(1) Whatever you customize, test and refine it repeatedly, because the analyzer is the foundation of both storage and querying; a small problem here can throw query results far off the mark.
(2) If the analyzer or tokenizer changes, the ES data must be regenerated; otherwise what is stored and what is searched will no longer agree, and searches will return wrong results.
How to search precisely
Part 4
1
Query and filter
The main difference between a filter and a query is that a filter does not compute relevance: it does not score the matching documents. A filter therefore does less work and is faster than a query. In practice, combining queries with filters can improve search performance.
List of search methods:
match_all
match
query_string
simple_query_string
term
terms
match_phrase
prefix
match_phrase_prefix
multi_match
bool
range
wildcard
nested
2
Minimum Match Query
QueryBuilders.multiMatchQuery(Object text, String... fieldNames).minimumShouldMatch("90%");
text is the query text and fieldNames is an array of field names, meaning that the query runs against multiple fields of the document. The percentage is the minimum proportion of terms that must match.
For example, if the query text yields three terms after tokenization, a recalled document must match at least 3 × 0.9 = 2.7 of them, rounded down to 2.
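The rounding in this example can be checked with a few lines of plain Java. This is a sketch of the arithmetic only, for positive percentages; the class MinimumShouldMatchSketch and its methods are hypothetical and do not call the Elasticsearch API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy version of the percentage rule: required = floor(terms * percent / 100).
// Hypothetical helper; it does not use any Elasticsearch class.
class MinimumShouldMatchSketch {

    // Integer division rounds down for non-negative inputs.
    public static int requiredMatches(int termCount, int percent) {
        return termCount * percent / 100;
    }

    // True if the document contains enough of the query terms.
    // Simplification of this sketch: at least one term must always match.
    public static boolean matches(Set<String> queryTerms, Set<String> docTerms, int percent) {
        int matched = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                matched++;
            }
        }
        int required = Math.max(1, requiredMatches(queryTerms.size(), percent));
        return matched >= required;
    }

    public static void main(String[] args) {
        Set<String> query = new HashSet<>(Arrays.asList("elastic", "search", "engine"));
        Set<String> docA = new HashSet<>(Arrays.asList("elastic", "search", "cluster"));
        Set<String> docB = new HashSet<>(Arrays.asList("elastic", "cluster"));
        System.out.println(requiredMatches(3, 90));   // 2
        System.out.println(matches(query, docA, 90)); // true  (2 of 3 terms match)
        System.out.println(matches(query, docB, 90)); // false (1 of 3 terms match)
    }
}
```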
3
Phrase Search
QueryBuilders.matchPhraseQuery(String field, Object value).slop(2);
slop(2) means that, after value is tokenized, up to two other terms are allowed between the terms of the phrase.
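The effect of slop can be approximated with a small in-order matcher. This is a simplified sketch, not Lucene's sloppy phrase scorer: it only handles phrase terms that appear in their original order and counts the extra tokens between the first and last matched term, whereas the real implementation also allows reordering at a cost. The class PhraseSlopSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;

// Toy phrase matcher: the phrase terms must occur in order, with at most
// `slop` extra tokens between the first and last matched term.
// Simplification: reordered terms are not considered.
class PhraseSlopSketch {

    public static boolean matchesPhrase(List<String> docTokens, List<String> phrase, int slop) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start < docTokens.size(); start++) {
            if (!docTokens.get(start).equals(phrase.get(0))) {
                continue;
            }
            // Greedily take the earliest later position of each next term.
            int pos = start;
            boolean found = true;
            for (int k = 1; k < phrase.size() && found; k++) {
                int next = -1;
                for (int j = pos + 1; j < docTokens.size(); j++) {
                    if (docTokens.get(j).equals(phrase.get(k))) {
                        next = j;
                        break;
                    }
                }
                if (next < 0) {
                    found = false;
                } else {
                    pos = next;
                }
            }
            // Extra tokens = span length minus the tokens of the phrase itself.
            if (found && (pos - start) - (phrase.size() - 1) <= slop) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("quick", "brown", "lazy", "fox");
        List<String> phrase = Arrays.asList("quick", "fox");
        System.out.println(matchesPhrase(doc, phrase, 2)); // true: two tokens in between
        System.out.println(matchesPhrase(doc, phrase, 1)); // false: slop too small
    }
}
```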
Several ways to control precision in ES queries:
(1) Use the query methods that ES provides and tune their parameters. If absolutely necessary, the open-source code can be modified to meet the actual requirements.
(2) Add a custom dictionary to the existing Chinese IK tokenizer, with technical terms and common vocabulary.
(3) Customize character filters and token filters for the actual application to exclude meaningless tokens.
And so on ......
Before we knew it, this sharing session had come to an end. Many thanks to Mr. Ah Tu for bringing a professional ElasticSearch talk to the partners!
There is always some truth in sayings that have been passed down for thousands of years, such as "a stone from another mountain can be used to polish jade" and "when three people walk together, one of them can be my teacher".
Take time now and then to think about where what you have came from, and then pass that spirit on when it is within your power and appropriate ......
-- This is both sharing and gratitude!
Carlson College looks forward to hearing from you!
END
Editor: Li Yue
Reviewed by Xu Bingzi
Copyright information: Carlson College.