A Brief Analysis of Real-Time Analysis Systems (Hive/HBase/Impala)


1. What is a real-time analysis (online query) system?

In the big data field, real-time analysis (online query) is one of the most common scenarios. Such systems are typically used for customer complaint handling, real-time data analysis, online queries, and so on. Because these are query applications, they usually share the following characteristics:

a. Low latency (on the order of seconds).

b. Query conditions that range from complex (multiple dimensions, with the set of dimensions not fixed in advance) to simple (lookup by ID).

c. Large query scope (the tables queried typically hold billions of records).

d. Small result sets (from a few dozen up to a few thousand rows).

e. High concurrency requirements (hundreds or thousands of simultaneous queries).

f. SQL support (there is basically an industry consensus on this, because it is hard to find analysts who can both analyze data and write Java code).

Traditionally, a data warehouse has taken on this task, handling complex multi-dimensional queries by creating indexes. Traditional data warehouses also have obvious drawbacks: poor scalability, high index-creation cost, indexes that easily become ineffective, and so on. When queries are complex, neither traditional systems nor Hadoop currently has a particularly good solution. If the dimensions are not fixed, indexes either cannot be created or cost too much, and the only option is usually a brute-force full scan.

Systems that fully solve real-time analysis are still being explored. Below are several common solutions in the Hadoop ecosystem.

2. Hive

Hive in one sentence: Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables, provides SQL query capability, and converts SQL statements into MapReduce jobs for execution. Hive supports HQL (HiveQL), a SQL-like language.

This very mechanism is the cause of Hive's biggest drawback: it is slow. MapReduce scheduling is only suitable for long-running batch jobs; for short, fast workloads such as queries, its overhead is too high.
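To make the SQL-to-MapReduce translation concrete, here is a minimal single-process sketch of how a query such as `SELECT region, COUNT(*) FROM complaints GROUP BY region` breaks down into map, shuffle, and reduce phases. The table, its columns, and all function names are hypothetical, chosen only for illustration; real Hive compiles the query into jobs that run on a Hadoop cluster, and this toy leaves out the job scheduling and startup work that makes short queries expensive.

```python
from collections import defaultdict

# Hypothetical rows of a "complaints" table: (user_id, region)
ROWS = [
    (1, "north"),
    (2, "south"),
    (3, "north"),
    (4, "east"),
    (5, "north"),
]

def map_phase(rows):
    """Emit one (region, 1) pair per row, as a mapper would for COUNT(*) ... GROUP BY region."""
    for _user_id, region in rows:
        yield region, 1

def shuffle_phase(pairs):
    """Group emitted values by key, as the shuffle step does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values for each key to produce the final aggregate."""
    return {key: sum(values) for key, values in groups.items()}

def run_group_by_count(rows):
    return reduce_phase(shuffle_phase(map_phase(rows)))

result = run_group_by_count(ROWS)
```

Even for five rows, the query passes through three distinct phases; on a cluster each phase involves task scheduling and data movement, which is acceptable for a long batch job but dominates the cost of a query that should return in seconds.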

Why MapReduce is only suitable for batch jobs is not explained here; readers are encouraged to look into the underlying principles. There is plenty of industry analysis on this point, and it is also what gave rise to Spark and a range of similar solutions.

3. HBase

HBase is a distributed, column-oriented open source database. The technology derives from the Google paper "Bigtable: A Distributed Storage System for Structured Data" by Chang et al. Just as Bigtable builds on the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase began as a subproject of Apache Hadoop and is now a top-level Apache project. Unlike a typical relational database, HBase is suited to storing unstructured data. Another difference is that it uses a column-based rather than a row-based model.

At its core, HBase abstracts data into tables, and a table has only a rowkey and column families. The rowkey is the primary key of a record, so a record is easy to find by key/value lookup. Column families store the actual data. Data can be retrieved only by primary key (rowkey) or by a range of primary keys, and only single-row transactions are supported (complex operations such as multi-table joins can be implemented through Hive). HBase is mainly used to store unstructured and semi-structured, loosely organized data.
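The access pattern described above can be illustrated with a small in-memory stand-in. This is a sketch under stated assumptions, not the HBase client API: `ToyTable`, its methods, and the `info:city` column name are all hypothetical. It shows why lookups by rowkey or rowkey range are cheap (rows are kept sorted by key), while a filter on any other column has to look at every row.

```python
import bisect

class ToyTable:
    """In-memory stand-in for a table whose rows are kept sorted by rowkey."""

    def __init__(self):
        self._keys = []   # rowkeys, kept in sorted order
        self._rows = {}   # rowkey -> {"family:qualifier": value}

    def put(self, rowkey, columns):
        if rowkey not in self._rows:
            bisect.insort(self._keys, rowkey)
            self._rows[rowkey] = {}
        self._rows[rowkey].update(columns)

    def get(self, rowkey):
        """Point lookup by primary key."""
        return self._rows.get(rowkey)

    def scan(self, start, stop):
        """Range scan over rowkeys in [start, stop), located by binary search."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

    def filter_by_column(self, column, value):
        """No rowkey in the query: every row has to be examined."""
        return [k for k in self._keys if self._rows[k].get(column) == value]

table = ToyTable()
table.put("user003", {"info:city": "Beijing"})
table.put("user001", {"info:city": "Shanghai"})
table.put("user002", {"info:city": "Beijing"})

point = table.get("user002")
ranged = [k for k, _ in table.scan("user001", "user003")]
by_city = table.filter_by_column("info:city", "Beijing")
```

The first two calls touch only the rows they return; `filter_by_column` walks the whole table, which is the full-scan cost a multi-dimensional query without a rowkey has to pay.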

Precisely because of this structure, HBase is very effective for applications that query by primary key (for example, by user ID), and results come back very quickly. Queries that have no primary key and filter on multiple dimensions are very difficult. The industry has been trying to solve this problem and has implemented several technical solutions on top of HBase, but the results are largely unsatisfactory:

a. Huawei's secondary index. The core idea is to imitate a database by building indexes on the columns that need to be queried. The problems are slower loading and a high data expansion rate, so you cannot build many secondary indexes: one or two at most.

b. HBase's own coprocessors. Queries that do not include a rowkey are handled by coprocessors, which scan in parallel using threads.

c. Phoenix on HBase. Phoenix lets developers run SQL queries on HBase. The Phoenix query engine converts a SQL query into one or more HBase scans and orchestrates their execution to produce a standard JDBC result set. For simple queries, its performance even exceeds that of Hive.
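The trade-off behind the secondary-index approach in item a can be sketched as follows. This is a generic illustration of the idea, not Huawei's implementation; `IndexedTable`, its `writes` counter, and the column names are hypothetical. Each indexed column gets its own value-to-rowkeys mapping that has to be maintained on every write, which is where the slower loading and the data expansion come from.

```python
from collections import defaultdict

class IndexedTable:
    """Toy key-value table with optional secondary indexes on chosen columns."""

    def __init__(self, indexed_columns):
        self.rows = {}
        self.indexes = {col: defaultdict(set) for col in indexed_columns}
        self.writes = 0   # counts physical writes to illustrate the extra cost

    def put(self, rowkey, columns):
        old = self.rows.get(rowkey, {})
        self.rows[rowkey] = {**old, **columns}
        self.writes += 1                      # write to the main table
        for col, index in self.indexes.items():
            if col in columns:
                if col in old:
                    index[old[col]].discard(rowkey)   # drop the stale index entry
                index[columns[col]].add(rowkey)
                self.writes += 1              # extra write per indexed column

    def find(self, column, value):
        if column in self.indexes:
            # Indexed column: direct lookup of matching rowkeys.
            return sorted(self.indexes[column].get(value, ()))
        # Unindexed column: fall back to scanning every row.
        return sorted(k for k, row in self.rows.items() if row.get(column) == value)

t = IndexedTable(indexed_columns=["city"])
t.put("u1", {"city": "Beijing", "plan": "gold"})
t.put("u2", {"city": "Shanghai", "plan": "gold"})
t.put("u1", {"city": "Shanghai"})   # update moves u1 to another index entry

in_shanghai = t.find("city", "Shanghai")
in_beijing = t.find("city", "Beijing")
gold = t.find("plan", "gold")
```

Three logical writes cost six physical ones with a single index; every additional indexed column adds another write per row, which matches the observation that only one or two secondary indexes are practical.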

4. Impala

Impala is a real-time interactive SQL query tool for big data, developed by Cloudera and inspired by Google's Dremel. Impala does not use the slow Hive+MapReduce batch processing. Instead, it uses a distributed query engine similar to those found in commercial parallel relational databases (made up of three components: Query Planner, Query Coordinator, and Query Exec Engine). It can query data directly from HDFS or HBase using SELECT, JOIN, and aggregate functions, which greatly reduces latency. The architecture is shown in Figure 1. Impala mainly consists of Impalad, State Store, and CLI.

Impalad: runs on the same node as the DataNode and is represented by the Impalad process. It receives query requests from clients (the Impalad that receives a query request becomes the Coordinator; the Coordinator calls the Java front end through JNI to parse the SQL statement and generate the query plan tree, and the scheduler then distributes the execution plan to the other Impalad instances that hold the relevant data), reads and writes data, executes queries in parallel, and streams the results back over the network to the Coordinator, which returns them to the client. Impalad also stays connected to the State Store in order to determine which Impalad instances are healthy and can accept new work. Impalad starts three ThriftServers, beeswax_server (for client connections), hs2_server (which borrows Hive metadata), and be_server (for internal use between Impalad instances), plus one ImpalaServer service.

Impala State Store: tracks the health status and location of the Impalad instances in the cluster and is represented by the statestored process. It creates multiple threads to handle Impalad registrations and subscriptions, and it maintains a heartbeat connection with each Impalad. Every Impalad caches a copy of the State Store information. When the State Store goes offline (an Impalad that finds the State Store offline enters recovery mode and retries registration repeatedly; once the State Store rejoins the cluster, the Impalad automatically returns to normal and updates its cached data), an Impalad can still work because it holds that cache. However, if some Impalad instances fail in the meantime, the cached data cannot be updated, so an execution plan may be assigned to a failed Impalad and the query fails.
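The failure mode described above, where a stale membership cache sends work to a dead node, can be sketched with a toy model. Nothing here is Impala code: the `StateStore` and `Coordinator` classes, their methods, and the node names are hypothetical, and real heartbeats, subscriptions, and recovery mode are reduced to a single `refresh()` call.

```python
class StateStore:
    """Toy membership service: knows which nodes are healthy."""

    def __init__(self, nodes):
        self.healthy = set(nodes)
        self.online = True

    def mark_failed(self, node):
        self.healthy.discard(node)

class Coordinator:
    """Toy coordinator that plans queries from a cached copy of the membership."""

    def __init__(self, statestore):
        self.statestore = statestore
        self.cached_members = set(statestore.healthy)

    def refresh(self):
        """Update the cached membership only if the state store is reachable."""
        if self.statestore.online:
            self.cached_members = set(self.statestore.healthy)
            return True
        return False

    def run_query(self, actually_alive):
        """Assign one fragment per cached member; fail if any target is really down."""
        targets = sorted(self.cached_members)
        dead = [n for n in targets if n not in actually_alive]
        if dead:
            raise RuntimeError(f"fragment sent to failed node(s): {dead}")
        return targets

store = StateStore(["n1", "n2", "n3"])
coord = Coordinator(store)

store.online = False           # the state store goes offline
alive = {"n1", "n3"}           # n2 crashes while the state store is down
refreshed = coord.refresh()    # the cache cannot be updated
try:
    coord.run_query(alive)
    failed = False
except RuntimeError:
    failed = True

store.mark_failed("n2")        # the state store rejoins and learns n2 is gone
store.online = True
coord.refresh()
recovered_targets = coord.run_query(alive)
```

While the state store is down, queries keep working only as long as the cached membership happens to stay accurate; the first node failure in that window turns into a failed query.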

CLI: a command-line tool for user queries (Impala Shell is implemented in Python). Impala also provides Hue, JDBC, and ODBC interfaces.

Impala's architecture resembles that of the distributed database Greenplum: a large query is broken down into subqueries, which are distributed to the underlying nodes for execution, and the results are merged at the end. Put plainly, it achieves high speed through brute-force scans run concurrently across many threads.
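The scatter-gather pattern described here (split the scan, run the pieces concurrently, merge the partial results) can be sketched in a few lines. This is an illustration under assumptions, not Impala's engine: the function names, the sample partitions, and the predicate are hypothetical, and Python threads merely stand in for separate nodes; they do not give real CPU parallelism for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_partition(partition, predicate):
    """Worker side: brute-force scan of one data partition."""
    return [row for row in partition if predicate(row)]

def coordinator_query(partitions, predicate, max_workers=4):
    """Coordinator side: fan the scan out to workers, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = pool.map(lambda p: scan_partition(p, predicate), partitions)
        merged = []
        for part in partials:   # map() yields results in partition order
            merged.extend(part)
    return merged

# Hypothetical call records, already split into three partitions.
PARTITIONS = [
    [{"id": 1, "region": "north", "secs": 30}, {"id": 2, "region": "south", "secs": 95}],
    [{"id": 3, "region": "north", "secs": 120}, {"id": 4, "region": "east", "secs": 10}],
    [{"id": 5, "region": "north", "secs": 75}],
]

# A multi-dimensional condition that no single-column index would cover.
hits = coordinator_query(
    PARTITIONS, lambda r: r["region"] == "north" and r["secs"] > 60
)
hit_ids = [r["id"] for r in hits]
```

No index is consulted at any point: every row is read, and the speed comes only from how many partitions can be scanned at once.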

The architecture is ideal on paper, but reality is harsh. In actual use, Impala's performance and stability fall well short. In particular, although Impala claims to support both HDFS and HBase, in practice its performance on HDFS is not good enough and its performance on HBase is poor. Problems such as memory overflows also occur frequently and remain to be solved.

5. Conclusion

At present, the industry does not have a perfect solution. The usual approaches are:

a. Organize data in advance according to the query results. Every business is different; to make queries fast, analyze the scenarios ahead of time and, when the data is loaded, organize it according to the results the queries will return. This is what applications such as Weibo do: they store data in advance in the form in which it will be displayed.

b. For multi-dimensional queries whose dimensions are not fixed, Hadoop and traditional parallel database architectures are currently converging, and I believe they will eventually arrive at the same place. Impala still has a future.

c. Converge multiple query engines. We usually want a single copy of the data to serve many applications, handling both fast direct lookups by user ID and complex multi-dimensional analysis. To support multiple applications, the convergence of features across query engines is unavoidable. Hopefully Impala will later solve its performance problems on HBase.

d. Speed things up with high-speed hardware. Flash cards are getting cheaper and cheaper, so data that needs fast queries can be moved onto flash or other high-speed hardware.
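Approach a, organizing data at load time in the shape the query will ask for, can be sketched as follows. This is a minimal illustration under assumptions: the `PrecomputedView` class, the per-user-per-day query shape, and the field names are hypothetical and not taken from any particular system.

```python
from collections import defaultdict

class PrecomputedView:
    """Keeps the answer to one known query shape up to date at ingest time."""

    def __init__(self):
        # (user_id, day) -> running totals, i.e. the shape the query will ask for
        self.totals = defaultdict(lambda: {"calls": 0, "secs": 0})

    def ingest(self, user_id, day, secs):
        """Update the stored result as each raw record arrives."""
        entry = self.totals[(user_id, day)]
        entry["calls"] += 1
        entry["secs"] += secs

    def query(self, user_id, day):
        """Single key lookup; no scan over the raw records."""
        return dict(self.totals.get((user_id, day), {"calls": 0, "secs": 0}))

view = PrecomputedView()
view.ingest("u1", "2014-01-01", 30)
view.ingest("u1", "2014-01-01", 45)
view.ingest("u2", "2014-01-01", 10)

u1_day = view.query("u1", "2014-01-01")
unknown = view.query("u3", "2014-01-01")
```

The query becomes a primary-key lookup, which is exactly the access pattern HBase handles well; the price is that the view answers only the question it was built for.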

