How to Conduct a TPC-DS Test

Source: Internet
Author: User
Tags: benchmark
1 Introduction

With the commercialization of open-source technologies such as Hadoop, MapReduce, Spark, HDFS, and HBase, big data management technology has developed by leaps and bounds. How to compare different data management systems objectively, that is, how to choose a big data test benchmark, has become an important research topic.

The Transaction Processing Performance Council (TPC) is currently the best-known non-profit standards organization for data management system benchmarks. It defines multiple standard test suites for objective, reproducible evaluation of database performance. Over the past 20-odd years, the organization has published a variety of database benchmarks, such as TPC-A, TPC-D, TPC-H, and TPC-DS. TPC-DS, the Decision Support (DS) suite, is a standard set of SQL tests for evaluating decision support systems (data warehouses). The suite covers complex applications such as statistics over large data sets, report generation, online queries, and data mining, and its test data and value distributions are skewed, consistent with real data. TPC-DS is thus a test suite very close to real-world scenarios, and also a rather difficult one.

TPC comprises more than 10 server vendors (Huawei is the only Chinese company in the organization). It develops the standard specifications and the performance and price metrics for business application benchmarks, and manages the publication of test results. TPC benchmark results are core technical indicators of server-class equipment performance.


2 TPC-DS Profile

The TPC-DS benchmark is the next-generation decision support benchmark that TPC introduced to replace TPC-H. Before discussing TPC-DS, let us briefly introduce TPC-H.

2.1 TPC-H

TPC-H is a decision support benchmark modeled on consumer retail; it defines 8 tables and 22 queries and follows SQL-92. The TPC-H database schema follows the third normal form. Scholars such as Professor Ye Xiaojun note that "its table data features are simple (for example, the data is not skewed), its data maintenance functions only limit the potential use of indexes, and it does not test the DBMS's ability to perform real data maintenance operations, i.e., extract, transform, and load (ETL)." Meanwhile, emerging data warehouses have begun to adopt new schemas, such as the star schema and the snowflake schema, and TPC-H can no longer accurately reflect the real performance of today's database systems. To this end, TPC launched TPC-DS, a new-generation benchmark for decision support applications.

2.2 TPC-DS

TPC-DS uses multidimensional data schemas such as the star schema and the snowflake schema. It contains 7 fact tables and 17 dimension tables, with an average of 18 columns per table. Its workload consists of 99 SQL queries, covering the core of SQL-99 and SQL:2003 as well as OLAP.

These characteristics of TPC-DS closely resemble big data analysis and mining applications. Big data technologies such as Hadoop also perform large-scale analysis and deep mining over massive data, including interactive online queries and statistical reporting, and real big data is likewise low in quality and unevenly distributed. TPC-DS has therefore become the best suite for objectively measuring the many Hadoop distributions and SQL-on-Hadoop technologies. The benchmark has several key features:

  • A total of 99 test cases, following the SQL-99 and SQL:2003 syntax standards; the SQL cases are relatively complex;

  • The volume of data analyzed is large, and the test cases answer real business questions;

  • The test cases cover a variety of business models (analytical reporting, iterative online analysis, data mining, etc.);

  • Almost all test cases place heavy demands on both IO and CPU.

Ye Xiaojun and other scholars have summarized a subset of these queries, as shown in Table 1.

For complete information on this benchmark, refer to http://www.tpc.org/tpcds/.

3 TPC-DS Test

The TPC-DS test has four main parts: environment preparation, SQL statement compatibility testing and statement modification, the TPC-DS test itself, and collation of the test results. The SQL statement compatibility test is performed on a virtual machine cluster with a 1GB data volume, while the TPC-DS test is performed with a 500GB data volume.

3.1 Environment Preparation

3.1.1 Local Configuration

Because TPC-DS is a performance test, the test environment should be clean.

To ensure that the operating system is clean, you can rebuild the RAID and reinstall the operating system through the iLO port. Then, based on the required environment, complete the installation of the test cluster by following the Ambari and HDP cluster installation manual on physical machines, or the Cloudera Manager and CDH cluster installation manual (to be completed). Finally, change the cluster configuration to the test configuration and record it.

3.1.2 TPC Resources

This section describes how to obtain the test resources from TPC, including generating the 1GB and 500GB data sets, the table-creation statements, and the SQL statements.

1. Download the TPC-DS archive from the official website

URL: http://www.tpc.org/information/current_specifications.asp

2. Unzip under Linux (Windows is not covered here):

[root@node1 tpc]# unzip dstools.zip

3. Run make in the tools directory

[root@node1 tpc]# cd /tpcdsversion1.3.1/tools

[root@node1 tools]# make

Note: if gcc is not installed, make will fail with an error.
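The download-and-build steps above can be sketched as a small shell helper that first checks for the tools the build needs; the archive and directory names are illustrative and should be matched to the version actually downloaded from tpc.org:

```shell
# Check that the build prerequisites are on PATH before running make.
check_prereqs() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool"; return 1; }
    done
}

# Illustrative usage; dstools.zip is the archive downloaded from tpc.org.
if check_prereqs unzip gcc make && [ -f dstools.zip ]; then
    unzip dstools.zip -d tpcds-kit
    (cd tpcds-kit/tools && make)
fi
```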

4. Use ./dsdgen in the tools directory to generate data.

--help      view help

Common options:

-dir        output directory
-scale      data size (GB)
-delimiter  field separator, default |
-terminate  whether each line ends with a delimiter, y or n

Example: in the /usr/datas directory, generate 1GB of data with | as the delimiter and no delimiter at line ends:

./dsdgen -dir /usr/datas -scale 1 -terminate n

In this way you can generate the 1GB and 500GB data sets used for the tests. Data is generated for 25 tables: the 7 fact tables and 17 dimension tables described above, plus 1 version table, dbgen_version.dat.
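For the 500GB run, generation can be split across parallel streams; dsdgen exposes -parallel and -child flags for this, though the flag names should be checked against ./dsdgen --help for your version. A minimal sketch, with a small helper that assembles the per-stream arguments:

```shell
# Build the dsdgen argument list for one of N parallel generation streams.
stream_args() {  # args: scale streams child outdir
    echo "-dir $4 -scale $1 -terminate n -parallel $2 -child $3"
}

SCALE=500 STREAMS=8 OUT=/usr/datas
if [ -x ./dsdgen ]; then
    for child in $(seq 1 "$STREAMS"); do
        ./dsdgen $(stream_args "$SCALE" "$STREAMS" "$child" "$OUT") &
    done
    wait   # each stream writes its own .dat chunks into $OUT
fi
```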

5. Generate the SQL test cases with ./dsqgen

--help      view help

Common options:

-input      the template list to read test cases from; generally /query_templates/templates.lst
-directory  the template directory; generally -directory ../query_templates
-dialect    the target database dialect; the /query_templates directory includes oracle, db2, sqlserver, and so on
-scale      data size (GB)

Special note: at the end of each query template in the /query_templates directory, add:

define _END = "";

A fixed copy is available on the internal FTP server and can be copied over: \\100.0.1.1\admin\soft\tpc\query_templates

Example: ./dsqgen -input ../query_templates/templates.lst -directory ../query_templates -dialect oracle -scale 1

This generates query_0.sql in the tools directory, which contains the generated SQL test cases.
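For running and recording cases individually, it is convenient to split query_0.sql into one file per query. dsqgen marks each case with a "-- start query" comment line, which the following sketch keys on (adjust the pattern if your version's markers differ):

```shell
# Split dsqgen's combined output into query_01.sql .. query_99.sql,
# starting a new file at each "-- start query" marker line.
split_queries() {  # $1 = combined file produced by dsqgen
    awk '/^-- start query/ { f = sprintf("query_%02d.sql", ++n) }
         f != "" { print > f }' "$1"
}

# split_queries query_0.sql
```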

6. Table-creation statements

The table-creation statements are in the tpcds.sql file in the /tpcdsversion1.3.1/tools directory. Many data platforms cannot use them directly, and they need to be modified. The modifications mainly involve adjusting data types to those supported by the target environment, making some basic syntax corrections, and specifying in the table definition the delimiter used when the data was generated.

Take Hive as an example: change integer to int and time to timestamp, remove not null, delete the primary key clauses, specify the separator with row format delimited fields terminated by '|'; and so on.
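The Hive-oriented rewrites above can be roughly automated with sed; this is only a sketch, and the output should still be reviewed by hand (for example, deleting a primary key line can leave a trailing comma that must be removed manually):

```shell
# Rewrite tpcds.sql DDL for Hive: integer -> int, time -> timestamp,
# drop "not null" and "primary key", and append the field delimiter clause.
to_hive_ddl() {  # $1 = input DDL file, $2 = output file
    sed -e 's/\binteger\b/int/g' \
        -e 's/\btime\b/timestamp/g' \
        -e 's/[[:space:]]*not null//g' \
        -e '/primary key/d' \
        -e "s/^);/) row format delimited fields terminated by '|';/" \
        "$1" > "$2"
}

# to_hive_ddl tpcds.sql tpcds_hive.sql
```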

7. Test case reference results

The reference results are under the directory /tpcdsversion1.3.1/answer_sets.

3.2 SQL Statement Compatibility Test and Statement Modification

3.2.1 SQL Statement Compatibility Test

Before testing, the 99 statements need to be reformatted, since the generated script may contain tabs, which can cause unexpected errors when the statements are copied or executed directly.

Then create the database and build the tables inside it. Note that if you do not create a new database, the operations will be performed in the default database.

The SQL statement support test can then be run by traversing the queries over a small data volume (1GB). This can be done on a virtual machine, because this stage only tests compatibility and places few demands on performance.

First, load the local data into the tables just created.
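Once the data is loaded, the traversal and pass/fail recording can be sketched as a loop over the per-query files; the runner command here is a placeholder to replace with your platform's CLI (for example hive -f or beeline -f):

```shell
# Run every query file with the given command, recording which cases
# complete and which fail, for later triage and modification.
run_compat_test() {  # $1 = command that executes one .sql file
    : > pass.log
    : > fail.log
    for q in query_*.sql; do
        [ -f "$q" ] || continue
        if $1 "$q" > "$q.out" 2> "$q.err"; then
            echo "$q" >> pass.log
        else
            echo "$q" >> fail.log
        fi
    done
}

# run_compat_test "hive -f"   # assumption: Hive CLI on PATH
```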

After the load completes, check that the table data is complete, then start the SQL compatibility test. The main method is to traverse and execute each test case: if a SQL statement completes, record its completion; if it does not, record the system's error message; continue until all statements have been tested.

3.2.2 Statement Modification

After the compatibility test in the previous phase, summarize the errors from the SQL statements that could not complete, categorize the problems, and modify statements with the same problem together.

Modifications must follow the principle of not changing the query content or the basic logic of the original statement. Under this principle, make small-scale adjustments, such as adjusting a data type without changing values, adding aliases, or adding limit 1 to a comparison subquery. It is best to keep a rough record of how each modification was made: this helps when the same problem appears in other statements, and also makes it easier to compile the test results later.

At the same time, record the statements that cannot be supported at all, such as those using rollup, intersect, and so on. When similar problems are encountered, you can quickly refer to the previous handling and either modify the statement or skip it directly.

Every modified statement needs to be retested; to keep this efficient, test it over the small data volume (1GB). Other errors may appear during these tests and need to be dealt with immediately.

3.3 TPC-DS Test

After the SQL statement compatibility test is complete, compile the statements that can run into a script. Repeat the table creation and load process in the test environment; after the 500GB data import completes, execute this script to carry out the TPC-DS test. Because these SQL cases take a long time to run, the test may need to use nighttime hours; otherwise, it proceeds in the same way as the compatibility test.

When executing the load and the SQL statements, keep logs. Taking SQL statements in Hive as an example: first switch to root, create a new log file and change its read and write permissions, and then execute the script.
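A minimal sketch of such a logged run, timing each statement and appending the duration to a summary log (the runner is again a placeholder for your platform's CLI):

```shell
# Run one .sql file, capture its output in logs/, and append
# "<file> <seconds>" to logs/summary.log for the results table later.
timed_run() {  # $1 = runner command, $2 = sql file
    start=$(date +%s)
    $1 "$2" > "logs/$(basename "$2").log" 2>&1
    end=$(date +%s)
    echo "$(basename "$2") $((end - start))" >> logs/summary.log
}

mkdir -p logs
# for q in query_*.sql; do timed_run "hive -f" "$q"; done   # overnight run
```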

You can also change the HDFS permissions in the configuration (hdfs-site), setting dfs.permissions.enabled to false, and then execute the script as the root user.

If you want to check how a statement is executing, you can inspect the log with cat. If the script hits an error, stop it in time, export the log, analyze the cause of the error, and rerun it or move on to the next statement.

After all the statements have been executed, export the logs and proceed to the next stage.

3.4 Collating the Test Results

The main points of interest in this test are:

1. Which SQL statements are supported;

2. Which SQL statements can be supported after modification, and how they were modified;

3. The execution time of each query that completes (this represents relative performance);

4. The accuracy of the query results.

Based on the above factors and the specific needs of the test, analyze and process the results:

1. SQL statement support is distinguished mainly by the number of statements that pass the SQL compatibility test;

2. The modifications can be collated by comparing the modified SQL against the generated query_0.sql with a text comparison tool such as Beyond Compare, and the changes and unsupported statements compiled into statistics;

3. The SQL execution times of each platform, taken from the logs, can be made into a table to compare the performance of the platforms intuitively;
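Assuming each platform produced a summary.log of "query seconds" lines as in section 3.3, the per-platform timings can be folded into one table with join and awk (file names are illustrative; both logs must list the queries in the same sorted order):

```shell
# Merge two "query seconds" summary logs into a tab-separated table.
merge_summaries() {  # $1, $2 = summary logs from two platforms
    join "$1" "$2" |
        awk 'BEGIN { print "query\tplatform_a\tplatform_b" }
             { print $1 "\t" $2 "\t" $3 }'
}

# merge_summaries logs_a/summary.log logs_b/summary.log
```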

4. To understand the specific level of support, you can further compare whether the test results are correct or wrong and analyze the reasons.

4 References

[1] www.tpc.org

[2] Chen Dan, Ye Xiaojun, et al. Implementation of a TPC-DS performance testing tool. Journal of Computer Applications, Vol. 31, No. 9.
