
Big Data from the Ground Up Series (II) - Data Benchmarking with TPC-DS


1. Preparation of tools

I won't spend much time introducing what TPC-DS is; the official documentation covers it well.

It provides a data generator and a set of benchmark queries, so in the big data space we can use TPC-DS to generate large volumes of data and run the queries it supplies to complete a performance benchmark. Download the relevant package from the official website:

http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp

All right, the package has been downloaded.

Yeah, that's right. That's the one.

2. TPC-DS table design

Continuing through the official documentation, we learn that the benchmark generates 24 tables: 7 fact tables (store_sales, store_returns, catalog_sales, catalog_returns, web_sales, web_returns and inventory) and 17 dimension tables.

Roughly on page 18 there is the following diagram.

The following diagram shows the design of the fact tables.

The following figure shows the number of rows in each table at the different data sizes (scale factors).

OK, the above mainly covers the TPC-DS table design and the number of rows generated at each data volume. Now let's actually use the tool to generate test data; for this test we will generate a 10 GB data set.
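The compile step itself is straightforward; a minimal sketch, assuming a typical v2.x toolkit layout (the archive name below is illustrative):

    unzip tpc-ds-tool.zip                  # archive name is illustrative
    cd DSGen-software-code-*/tools         # directory name varies by toolkit version
    make                                   # some versions require: make OS=LINUX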

Once compiled, we will see the dsdgen (data generator) and dsqgen (query generator) executables in the tools directory.

The parameters we use: a scale of 10, which represents 10 GB of data; an output directory of /home/tmp/data, where the generated data is exported; and 24 parallel streams.
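Put together, the invocation looks roughly like the following; a sketch only, since the original screenshot isn't reproduced here, and the loop-based parallelism is my assumption:

    cd tools
    # Scale factor 10 (~10 GB), output to /home/tmp/data, split into 24 streams;
    # each -child generates one chunk, and the chunks run concurrently.
    for i in $(seq 1 24); do
        ./dsdgen -scale 10 -dir /home/tmp/data -parallel 24 -child $i &
    done
    wait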

After a long wait...... we find that the generated data is only 2.1 GB...... That does seem a bit small, so let's do it all over again and

generate 100 GB of data to see how big the total comes out~~

Hey, that's more like it. Exhausting.

3. Using hive-benchmark to generate data and import it into a Hive table

Well, having introduced the TPC-DS table structure and so on, rather than playing with the raw toolkit by hand, we use:

https://github.com/krutivan/hive-benchmark

After downloading the file, transfer it to the server.
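(Any of the usual ways of getting it onto the box works; the host name and file name below are illustrative.)

    # Clone directly on the server, or download the zip locally and copy it over.
    git clone https://github.com/krutivan/hive-benchmark.git
    # or:
    scp hive-benchmark-master.zip user@hdp-node:/home/tmp/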

Once compiled, we can start generating the data. Note that the tool generates its data through Hadoop MapReduce jobs, and since we are using the HDP distribution, we need to make the corresponding adjustments:
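The screenshot of that step isn't reproduced here; as a rough sketch, assuming the repository follows the familiar hive-testbench layout (the script names, environment variables and paths below are my assumptions, not taken from the repo):

    # Point the tool at HDP's Hadoop/Hive clients, then start generation at scale 100.
    export HADOOP_HOME=/usr/hdp/current/hadoop-client
    export HIVE_HOME=/usr/hdp/current/hive-client
    ./tpcds-build.sh                           # compile the bundled data generator
    ./tpcds-setup.sh 100 /tmp/tpcds-generate   # scale factor 100, HDFS staging dir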

It will then automatically generate the 100 GB of data through the MapReduce job; this takes about an hour and a half, so take your time and wait......

OK, after a long wait, the data generation finished. We then hit the error shown in the figure above; debugging showed that when I installed HDP, the maximum container memory was set to 4 GB, while the Tez container size was 11 GB. After changing the Tez container size down to 4 GB, we continue.
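For reference, the change boils down to lowering Hive's Tez container size so it fits under YARN's 4 GB per-container cap; a minimal sketch (the 4096 MB value mirrors the cluster described above):

    # In hive-site.xml (or via Ambari -> Hive -> Configs):
    #   hive.tez.container.size = 4096   (MB; must not exceed yarn.scheduler.maximum-allocation-mb)
    # Or just for the current session, from the Hive CLI, before re-running the failed step:
    set hive.tez.container.size=4096;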

After another long wait, the import completes.

Note that some of the queries shipped with this tool are not supported by Hive, because they are written against the SQL standard rather than the Hive SQL dialect!!! Roughly 50+ of the statements are usable.
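For example, one of the supported queries can be run straight from the Hive CLI, roughly like this (the database name and query-file path follow the usual hive-testbench naming and are assumptions):

    # Run query 55 against the 100 GB database the tool loaded.
    hive --database tpcds_bin_partitioned_orc_100 -f sample-queries-tpcds/query55.sql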

