The Index mentioned above is a logical concept: it partitions the data stored in ES along the namespace dimension (analogous to a sub-table), so that data of the same kind is placed in the same Index.
When physically stored, an Index's data is distributed across multiple shards, each holding a portion of the Index's data. Each shard in turn can have multiple copies: one is the primary copy, called the primary, and the others are replicas.
The figure shows two Indices, Index1 and Index2. Index1 has two primary shards, P1 and P2, and each primary shard has two replicas; for example, P1 has two R1 replicas.
This illustrates the relationship between Index and shard, and the concepts of primary and replica.
ES itself runs as a multi-node distributed cluster, with shards spread across the nodes; different nodes can run on different machines.
By default, ES never places the primary and a replica of the same shard on the same node, so even if one node drops out of the cluster, no data is lost because another copy still exists in the cluster.
The figure shows P1 and its two R1 replicas distributed over three nodes (Node1, Node2, Node3), all carrying exactly the same data. Even if Node1 and Node2 go down at the same time, the remaining R1 in the cluster keeps read and write services available, and no data is lost.
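For reference, the layout in the figure corresponds to index settings along these lines; this is only an illustrative sketch (the index name is a placeholder), not the exact request used in the course:

```
PUT index1
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 2
  }
}
```

With two primary shards and two replicas per primary, each shard exists as three copies that ES spreads across different nodes.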
From ES's distributed architecture, we can summarize some of its benefits:
- Distributed search engine: scales linearly to improve system performance.
- Multiple shards share the load, supporting highly concurrent writes and queries.
- Multi-copy storage: no data loss when a node fails.
- Rack and data-center awareness: replicas of the same shard can be assigned to different data centers, avoiding data loss when a data center fails and achieving master-slave disaster recovery.
- Built-in, powerful cluster management: flexible and elastic scaling with automatic shard rebalancing; nodes can be added or removed at any time as business needs change.
- RESTful interface, easy to develop and debug.
- Powerful aggregation and analysis capabilities for complex statistical analysis (bucketing, geo hash, multi-level aggregations, etc.).
By default, after downloading and unpacking the ES distribution, go to the ES directory and run the start command; ES will start in the background and listen on port 9200 of the machine. We can then read and write data through this port, with no additional ES configuration required.
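The start command referred to here typically looks like the following (a sketch assuming the tar.gz install layout; the -d flag runs ES as a background daemon):

```bash
cd elasticsearch-6.x.x    # the unpacked ES directory (version number is a placeholder)
./bin/elasticsearch -d    # start ES in the background, listening on port 9200 by default
curl localhost:9200       # verify the node is up and responding
```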
The second part of the ES section covers a special ES module, the Ingest Pipeline, which pre-processes data before it is written.
Before data is actually written to an ES Index, it can be modified by an Ingest Pipeline. A pipeline defines a number of processors, and different processors transform the incoming data in different ways. For example, the Grok Processor can parse and format logs in place of Logstash.
The diagram at the bottom shows how the Grok Processor works. For text written to ES (the raw input), you define an expression (a Grok Pattern) that specifies how the input text is parsed: the text is split into parts according to certain rules, and each part becomes a separate field. After parsing, the structured data shown at the bottom is produced.
The so-called Grok Pattern is essentially a regular expression, except that ES defines aliases for common regular expressions to make them easier to use. The so-called text parsing is just regular matching: the matched parts are placed into separate fields, forming structured data.
The data in the figure is a sample log, divided into three parts, each parsed into a field by regular matching:
- The red part is parsed as the time field, representing when the log was generated.
- The yellow part is parsed as the client field, representing the IP address of the client accessing the service.
- The blue part is parsed as the duration field, representing the elapsed time of the request.
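As an illustration, a grok processor for a log line of this shape might look like the following; the source field name (message) and the exact pattern are assumptions for the sake of the example, not the course's actual configuration:

```json
{
  "grok": {
    "field": "message",
    "patterns": [
      "%{TIMESTAMP_ISO8601:time} %{IP:client} %{NUMBER:duration}"
    ]
  }
}
```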
So how should a Pipeline be defined?
The figure shows the pipeline used in this course. Following the ES API, we can define a pipeline with a REST PUT request.
The _ingest/pipeline in the URL is the fixed API path for defining a pipeline in ES, and the final apache_log is the name of the pipeline (pipeline_name) that we define, which is used later when writing data. A section of apache log will be collected later with Filebeat, so the pipeline here is named apache_log.
The HTTP body, in JSON format, defines processors, which here contains three processors: grok, date, and date_index_name.
- grok processor: covered above. The pattern used here is an expression predefined inside ES for parsing apache logs, which we can use directly.
- date processor: converts the timestamp field produced by grok to the Date type. Fields produced by grok are strings by default, so the conversion is needed here.
- date_index_name processor: logs are usually time-related, and to simplify management (such as deleting expired logs), an index usually covers a time range, for example one index per day: the day's data is written to the same index, and the next day's logs go to a new index. The date_index_name processor automatically generates the index name from the time field (the timestamp field produced above), so the index name does not have to be specified when writing data. The index name defined in the figure is the prefix apache_log@ plus the current date; the actual resulting index name is shown at the bottom of the figure.
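Putting the pieces together, the pipeline definition has roughly the following shape. This is a hedged sketch: the built-in %{COMBINEDAPACHELOG} grok pattern, the date format, and the date_rounding value shown here are assumptions to make the example self-contained, and the pipeline in the figure may use different values:

```
PUT _ingest/pipeline/apache_log
{
  "description": "Parse apache access logs before indexing",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{COMBINEDAPACHELOG}"]
      }
    },
    {
      "date": {
        "field": "timestamp",
        "target_field": "timestamp",
        "formats": ["dd/MMM/yyyy:HH:mm:ss Z"]
      }
    },
    {
      "date_index_name": {
        "field": "timestamp",
        "index_name_prefix": "apache_log@",
        "date_rounding": "d"
      }
    }
  ]
}
```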
This concludes the ES portion of the presentation.
The second part introduces Filebeat, which sits at the most upstream end of the Elastic Stack and is responsible for log collection. The focus is on the basics of Filebeat and how to configure and start it.
The basics of Filebeat
Filebeat is a member of the Beats family; Beats includes many other tools, and interested readers can learn more on the official website.
As shown in the first image, Filebeat is a text collector: it listens to text files, much like the Linux tail -f command, continuously collecting new content appended to the files, and then sends the collected data to Logstash or Elasticsearch.
It needs to be installed in the production environment, on the machines that generate the log files; but it is written in Go, so it is efficient, and it is simple, relatively lightweight, and consumes few machine resources.
It can resume collection after a restart. Filebeat saves the metadata of the files it is currently listening to in a registry file. As shown in the second figure, the file path, inode information, and the offset the collector has read up to are all recorded. When the Filebeat process exits and later restarts, it reads the registry file and, for each file being listened to, continues reading from the recorded offset.
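An entry in the registry file looks roughly like this (the exact format differs between Filebeat versions; the path and numbers below are placeholders):

```json
[
  {
    "source": "/var/log/apache2/access.log",
    "offset": 73824,
    "timestamp": "2018-05-21T10:15:00.000Z",
    "ttl": -1,
    "type": "log",
    "FileStateOS": { "inode": 278522, "device": 2049 }
  }
]
```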
Filebeat is also back-pressure sensitive: it dynamically adjusts the rate at which it sends data based on the current load of the ES cluster, preventing the cluster from being overloaded.
Basic configuration of filebeat
The Filebeat configuration file (filebeat.yml) is in the Filebeat root directory, in YAML format. By default, we only need to modify the input (which files to collect) and the output (where to send the data).
- enabled: defaults to false; Filebeat only collects the file's contents after it is changed to true.
- paths: the paths of the files to collect. It is a list, so multiple paths can be specified, and Filebeat listens to the files at all of them at the same time. paths also supports wildcards, which makes it convenient to configure multiple files under one directory.
output: the logs need to be written to ES, so output.elasticsearch needs to be modified:
- hosts: the IP and port of each node in the ES cluster. It is a list, so the addresses of multiple cluster nodes can be configured.
- pipeline: the pipeline used when writing data. Here, fill in the apache_log pipeline created earlier in ES. A minimal configuration sketch is shown below.
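A minimal filebeat.yml sketch under these assumptions (the log path and node addresses are placeholders, and the input section key differs slightly between Filebeat versions):

```yaml
filebeat.inputs:
  - type: log
    enabled: true                      # must be set to true or the input is ignored
    paths:
      - /var/log/apache2/access.log*   # wildcards let one entry cover rotated files

output.elasticsearch:
  hosts: ["10.0.0.1:9200", "10.0.0.2:9200"]   # addresses of the ES cluster nodes
  pipeline: "apache_log"                      # the ingest pipeline created earlier
```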
With this, Filebeat is configured; to start it, just execute the filebeat binary. To run it in the background, you can start it with the nohup command. If you are worried about Filebeat consuming too many machine resources, you can limit its usage by pinning it to specific cores with taskset.
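For example (illustrative commands; adjust paths and core numbers to your environment):

```bash
# run Filebeat in the background with nohup
nohup ./filebeat -c filebeat.yml >/dev/null 2>&1 &

# or additionally pin it to one CPU core with taskset to cap its resource usage
taskset -c 0 nohup ./filebeat -c filebeat.yml >/dev/null 2>&1 &
```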
When the volume of collected logs is relatively large, the default configuration may not meet the demand, so some commonly used tuning parameters are given here for reference. Of course, the optimal parameter values vary from scenario to scenario, and you need to make the appropriate trade-offs and adjustments based on your actual usage. These parameters are not covered in detail here; refer to the official Filebeat documentation. You can share your questions in the comments, and when I have time I will also write up the reasons for and effects of adjusting these parameters.
Introduction to Kibana
The last section introduces Kibana, which sits at the most downstream end of the Elastic Stack and mainly provides data analysis and visualization. The focus is on Kibana's basic features and how to use it to query data and generate visual charts.
Kibana provides the ability to query, visualize, and statistically analyze data.
It is a separate process that needs to be downloaded and deployed independently. It can only bind to one ES cluster, so it cannot query data from multiple ES clusters at the same time.
- Search: the Discover interface lets you retrieve target data with simple query criteria, improving the efficiency of problem diagnosis.
- Visualize: the Visualize feature, Kibana's biggest highlight, uses ES's aggregation query API to generate the various charts shown in the figure.
- Data management: the Console provides a command-line interface through which you can add, delete, modify, and query data and manage the ES cluster.
- X-Pack monitoring: when used together with X-Pack, a value-added service provided by ES, Kibana can also provide monitoring capabilities, which are outside the scope of this course.
Starting Kibana is also very simple: by default, just execute the kibana binary in the kibana/bin directory. It connects to port 9200 on the local machine by default, which is ES's default listening port.
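For example (a sketch; Kibana serves its UI on port 5601 by default, and the ES address can be changed in config/kibana.yml):

```bash
cd kibana-6.x.x                       # the unpacked Kibana directory (version number is a placeholder)
nohup ./bin/kibana >/dev/null 2>&1 &  # connects to Elasticsearch at http://localhost:9200 by default
```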
To query data via Kibana:
Step 1: First tell Kibana which Index we expect to query. As shown in the first image, on the Management page, configure an index pattern to match the target Index. Filebeat writes data through the pipeline named apache_log, which generates indices whose names begin with apache_log@, so specify an index pattern with the apache_log@ prefix so that it matches the indices created by the pipeline, as shown in the blue section of the figure.
Step 2: Select the time field for the data; here it is the timestamp field defined in the pipeline.
Step 3: Search for data in the Discover interface. Enter the query criteria in the text box marked in red in the figure. For example, to query apache logs with a response code of 403, just type response:403. Shown below is the matching apache log; it is structured data, and you can see that fields such as the client IP, verb, and request have been parsed out.
The specific syntax of the query can be found in the official Kibana documentation.
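A few illustrative query strings for the Discover search box (the field names assume the apache log fields parsed by the pipeline above):

```
response:403                      # apache requests that returned 403
response:[400 TO 599]             # all error responses
clientip:10.0.0.1 AND verb:GET    # GET requests from one client IP
```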
The main interfaces related to Kibana data visualization are Visualize, Timelion, and Dashboard.
Because the visualization charts are built on ES's aggregation queries, a certain understanding of aggregations is required. The focus here is therefore on introducing the functionality; for the specific steps to create charts, refer to Kibana's official documentation once you understand ES's aggregation queries.
The Visualize interface is mainly used to create individual charts, such as the first pie chart, which shows that requests with a response code of 200 account for 82.06% of all apache requests. The second chart is an area chart showing apache's data throughput at each moment.
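A chart like the pie of response codes is backed by an aggregation query of roughly this form (illustrative only; the index pattern and the field or subfield names depend on the actual mapping):

```
GET apache_log@*/_search
{
  "size": 0,
  "aggs": {
    "by_response": {
      "terms": { "field": "response.keyword" }
    }
  }
}
```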
The Timelion interface is mainly used to examine the relationship between different metrics at the same point in time. The red vertical lines in the two charts in the figure mark the same moment; by reading off the value of each chart there, you can see how different metrics correlate at that moment, which makes problems easier to locate and analyze.
The Dashboard page lets you place the various charts created above on a single dashboard, then name and save it. This way, every time you log into Kibana you can see the key charts and metrics at a glance directly from the Dashboard. A Dashboard can also generate a URL link for easy sharing with others, and it provides iframe links so it can easily be embedded in another system's front page.
This concludes the course. Finally, here is some of the work done by our ES team in the Infrastructure Department. We have developed two products based primarily on Elasticsearch: a native Elasticsearch service and a time-series database, CTSDB.
This course is short but covers a lot of material, so the use of ES is only taught in a fairly simple way. For more on using and tuning ES, see the article "Elasticsearch Tuning in Practice", or scan the QR code above to read it; your feedback is welcome.
For an introduction to time-series data and Tencent Cloud CTSDB, see the article "Tencent's Only Time-Series Database: CTSDB Demystified", or scan the QR code above to read it; your feedback is welcome.
If you need either of the above two products, you can search for CES and CTSDB on Tencent Cloud's official website; more detailed introductions are available in the column.