As an off-line data analysis system, TDW has a good performance advantage in dealing with massive data by parallel computing. But Tencent knows, want to use a all-inclusive system to solve all the problems are generally unrealistic, the same, TDW also has its disadvantages, such as small data processing performance, Update/delete performance, interface is not rich. So Tencent introduced a powerful open source database PostgreSQL, and did some function expansion to make it have the ability to access TDW data, while Tencent developed a new storage engine in TDW, which Tencent called Pgdata storage engine, Enables TDW to read and write data in PostgreSQL.
Application Overview in Tencent
- The main business scenario is OLAP data analysis
- Mostly internal systems, with a small amount applied to external services
- exists as a supplement to the TDW system
- The main application forms for the business use of TDW provided by TPG Services
- TDW team is responsible for machine, operation and technical support
- Business submission application can be used
- PG and MySQL
- MySQL can support, then use MySQL first
- MySQL is not satisfied, and then consider using PG
The realization of TDW and PostgreSQL access functions is a powerful complement to TDW, which is embodied in the following 3 points:
- Make up the short board with not rich TDW interface. TDW lacks standardized jdbc/odbc, programming interfaces are not rich, and PostgreSQL has a powerful community that provides interfaces to various languages, such as JDBC/ODBC, Shell, C + +, C #, Python, and Perl, The user accesses the data in TDW through these rich interfaces and the PostgreSQL TDW bridging tool Tdwlink developed by Tencent.
- Make up TDW small data analysis efficiency at the end of the short plate. TDW in mass data processing, it can play the advantages of parallel execution. But for small data analysis, its performance is inferior to the traditional db. With PostgreSQL, for data analysis within 10GB, better performance and time response can be obtained, typically returning results at the second level, and TPG has an advantage in this scenario compared to the TDW-minute response.
- As a TDW pgdata storage engine, it compensates for the short board under TDW update/delete efficiency. TDW as a data warehousing system, for record-level update/delete support is not very good. Record-level update/delete in TDW can result in overrides of the entire table, which means that even a delete of a piece of data can cause the entire table to be rewritten, consuming a lot of system resources. TPG as a traditional database, record-level update and delete efficiency is very high.
- TDW data mart, results show
- External provision of JDBC/ODBC and other standardized interface
- Improving the function of small data analysis in colleges and universities
- Improving the Update/delete function of colleges and universities
Why should TPG be introduced?
Challenges of TDW Application promotion
- TDW off-line analysis, can not meet the needs of business results Library
- TDW has no standard JDBC, ODBC interface and is difficult to dock with commercial tools
- TDW processing small data, doing update and delete inefficient
- TDW is not omnipotent, it is impossible to meet all the application scenarios
- Need to build a set of RDBMS as a complement to TDW
- To facilitate the user to migrate existing business, do have tools to do migration
- To have a standard JDBC, ODBC interface
- Better performance
- Features easy to expand
- The perfect DB feature
- The SQL standard supports better
- Support Pl/pgsql and many other process languages
- Supports advanced features such as views, analytic functions, CTE, etc.
- OLAP performance over MySQL
- Complex SQL 10 times times Higher performance +
- Cost-based SQL optimization, tuning means more
- Partial index, function index, cluster index
- Plug-in extension of functionality
- Have access to MySQL, Redis, text, and other external data source Plug-ins
- It's easy to develop plug-ins to access TDW
- TB database backup and recovery (based on ZFS snapshot technology)
- Faster, snapshots of terabytes of data take less than 1 seconds
- Small footprint, newly generated snapshots hardly occupy space
- Support for snapshot incremental backups, support for fast rollback
The position of the PostgreSQL system in the TDW biosphere is shown in the following illustration, TPG is a name for Tencent's PostgreSQL after the expansion:
The following two parts of the TDW and PostgreSQL of the exchange of visits to ask the function to do an introduction, that is, Pgdata storage engine and Tdwlink functions.
The TDW itself supports a variety of storage formats, including Textfile,rcfile,pbfile, on which Tencent develops a new storage engine, the Pgdata storage engine, that transparently stores and accesses the data in PostgreSQL, as shown below:
On use, you only need to specify the Pgdata storage engine when creating a table, such as using the following statement to create this type of appearance
Create external Table foo (idx bigint, str string) stored as Pgdata
In the subsequent access process and the use of other TDW tables generally, here is a brief description of the implementation of the access, the data flow is as follows
- When parsing SQL query statements, first decompose each subquery, and then for each subquery will always first judge the table involved in the current query is Pgdata appearance, if there is no DB appearance is the normal query statement execution process;
- If there is a pgdata appearance, then check whether the appearance data involved in this query has been imported, if the import directly reuse the imported data, if not the corresponding subquery processing, convert it to the corresponding standard relational database SQL statements, and then get the connection information from the table, Use JDBC to execute query statements;
- Import the data from the query's results into a random HDFs directory in the TMP directory, which is generated using UUID, so you can guarantee uniqueness without conflict with other query statements, and then set it to the HDFs file path where the data for the appearance is located. The MapReduce task automatically reads the path as the input path to the task;
- Clears the temporary file after the query has finished executing and the query results have been successfully obtained. If an exception is found in the query, it automatically cleans up the junk data that has been imported.
The SQL standard contains a feature named Sql/med, "SQL Management of External Data", PostgreSQL in the 9.1 version of 2011, using a type called foreign data The wrappers mechanism has read-only support for this standard, and developers only need to develop appropriate plug-ins for the corresponding data source to access the remote data source via PostgreSQL. At present, the community has a lot of plug-ins based on this function, such as ORACLE_FDW,MYSQL_FDW,REDIS_FDW, Tencent team based on the characteristics of PostgreSQL, the development of the ability to access TDW data plug-ins, called TDW_FDW, Refer to the following figure:
To facilitate users to use this feature, Tencent developed a stored procedure Tdwlink, users only need to configure the corresponding foreign server and authentication information, can be very convenient to use, such as Tencent has defined the TDW_SVR this server, To access the test table of the test library in TDW, you need only select Tdw_meta.tdwlink (' tdw_svr ', ' Test ', ' select col1, col2 from Test '). Since this function is provided through PostgreSQL, it can be used with any interface provided by PostgreSQL, thus indirectly extending the access interface that TDW has. In addition, Tencent's TDW_FDW plug-in is based on the PostgreSQL 9.1 version of the development, this version of the foreign data wrappers can only provide read-only access to external data sources, so Tencent's TDW_FDW only to provide read-only access to TDW, And the latest release of the PostgreSQL9.3 version of the foreign Data wrappers has provided written support, this is Tencent's TDW_FDW Plug-ins can be further functional expansion of the place.
Open Source plugin for use
- Zoning Management: Transformation based on Pg_partman
- Monitoring: Check_postgres and Pgbadger
- Cache warming: Pgfincore
- Mutual access between instances: Dblink
- Read MySQL data: MYSQL_FDW
- Chinese Full-text Search: Zhparser
Current operating conditions
- 11 sets of TPG instances, totalling about 40 machines
- Used to store about 30TB, maximum instance storage up to 5TB
- TPG's total number of households in the company is 100 +
- No incident due to PG itself, user evaluation positive
- External: User Report
- Internal: TDW system, more than 30 business reporting system, data extraction system, BI system, marketing system, etc.
Application of future planning
- Provide cluster PG-XC service for business
- Richer application scenarios
- Geographic information (POSTGIS)
- Machine learning and data Mining (Madlib)
- R statistical analysis (PL/R)
- Self-service of access flow
- Cloud the services of TPG and reduce the amount of manpower involved in operation and maintenance