In 2017, the double eleven refreshed the record again. The transaction created a peak of 325,000 pens/second and a peak payment of 256,000 pens/second. Such transactions and payment records will form a real-time order feed data stream, which will be imported into the active service system of the data operation platform.
The active service of the data operation platform, based on these combined data, real-time analysis, real-time public opinion display, real-time identification of objects that need active services, etc., to achieve an intelligent service operation platform.
Real-time analysis solution through Alibaba Cloud RDS PostgreSQL and HybridDB for PGSQL:
- Withholding hundreds of thousands of pens/s of write throughput and doing data cleaning, it is several times the transaction
- Real-time analysis of minute-level delays, 5 billion-level tables associated with second-order responses
- Real-time discovery of transaction anomalies to enhance Taobao's user experience.
Business background
An e-commerce business usually involves merchants, stores, logistics, users, payment channels, loan channels, goods, platforms, small two, advertisers, manufacturers, distributors, shopkeepers, store clerk, supervisors, tax, quality inspection, etc. The activities of these objects generate a large amount of business data such as browsing, orders, complaints, refunds, disputes, etc.
And any business involves many different business systems. In these business systems, in order to locate problems, operational needs, analysis needs, or other requirements, a buried point is set in the system to record the logs generated by the user's behavior in the business system, also called FEED logs. For example, the order system, interlocking in the business system, from the shopping cart, placing orders, payment, delivery, receipt (and disputes, refunds, etc.), an order usually produces a number of associated records.
The attributes generated by each link may be different. It is possible that new attributes are generated, and it is possible to change existing attribute values. For ease of analysis, it is often necessary to combine several records (several attributes) generated by the order throughout the process into one record (order size table).
The active service of the data operation platform, based on these combined data, real-time analysis, real-time public opinion display, real-time identification of objects that need active services, etc., to achieve an intelligent service operation platform.
Difficulty
The project not only has complex business logic, but also has high real-time and performance requirements. The specific ones are:
Extreme performance of complex queries, such as 5 billions of table associations
Real-time, requiring a delay within minutes
High concurrent write
The throughput pressure is up to hundreds of thousands of pens per second
The data for each SQL analysis is always at terabytes
Response time requires seconds, milliseconds
In addition to the requirements of real-time, in the process of writing, there are actions such as switching, merging, and cleaning of data. If you have done database or data analysis, you will know that writing, switching, merging, and cleaning up to 800,000/s throughput per second is not particularly difficult; the millisecond analysis of terabytes of data alone is not counted. difficult.
However, it is not so easy to provide real-time writes while providing a second-order real-time analysis of minute-level delays and doing reasonable scheduling.
Program
To support such business needs, the schemes shown are as follows:
Among them:
RDS PostgreSQL is a cloud version developed by Alibaba Cloud based on the open source relational database PostgreSQL.
HybridDB for PostgreSQL is a distributed analysis database of MPP architecture. It has excellent performance in multi-table correlation, complex query, real-time statistics, circle and many other aspects, and supports various unique features such as JSON, GIS, HLL valuation and so on.
OSS is a massive, secure, low-cost, and highly reliable cloud storage service launched by Alibaba Cloud. It is used as offline storage for data.
The most critical is to achieve transparent access to offline storage OSS by RDS PostgreSQL and HybridDB for PostgreSQL.
In this scheme, multiple PostgreSQL accepts the writing of the service, completes the cleaning of the data in each RDS PostgreSQL, and then writes the cleaned data into the elastic storage OSS by operating an external table (like a heap table); After the write is complete, HybridDB for PostgreSQL also loads the data from the OSS into the HybridDB in parallel by operating the external table (similar to the heap table). In HybridDB, realize millisecond-level queries of tens and hundreds of terabytes of data.
In PostgreSQL, create an external table:
#Create plugins, execute once per library
Create extension oss_fdw;
# Create server, create one for each OSS bucket
CREATE SERVER ossserver FOREIGN DATA WRAPPER oss_fdw OPTIONS
(host 'oss-cn-hangzhou-zmf.aliyuncs.com' , id 'xxx', key 'xxx', bucket 'mybucket');
# Create oss external table, each OSS object that needs to operate corresponds to a table
CREATE FOREIGN TABLE ossexample
(date text, time text, volume int)
SERVER ossserver
OPTIONS ( filepath 'osstest/example.csv', delimiter ',' ,
Format 'csv', encoding 'utf8', PARSE_ERRORS '100');
This creates a table that maps to the OSS object. Reading and writing to ossexample is the reading and writing of OSS. After the data is written to "local_tbl", execute the following SQL:
#Write to OSS
Insert into ossexample
Select date, time, volume) from local_tbl where date > '2017-09-20';
The data that satisfies the filter condition in the table "local_tbl" is written in the object "osstest/example.csv" corresponding to the OSS.
In HybridDB for PostgreSQL, OSS is also read and written in a similar manner. Throughout the process, the user sees just a piece of SQL. as follows:
# Create an external table for exporting data to OSS
Create WRITABLE external table ossexample_exp
(date text, time text, volume int)
Location('oss://oss-cn-hangzhou.aliyuncs.com
Prefix=osstest/exp/outfromhdb id=XXX
Key=XXX bucket=testbucket') FORMAT 'csv'
DISTRIBUTED BY (date);
# Create a heap table, the data is loaded into this table
Create table example
(date text, time text, volume int)
DISTRIBUTED BY (date);
#Data parallel loading from ossexample into example
Insert into example select * from ossexample;
The execution of the INSERT statement will write the data in the "osstest/exp/outfromhdb" file in parallel to the table "example". The principle is as follows:
HybridDB reads OSS.jpeg
HybridDB is a distributed database. In a HybridDB for PostgreSQL cluster, there is one Master and multiple Segments. The number of Segments can be expanded horizontally. Segment is responsible for storing and analyzing data, and Master is the main portal to accept query requests and distribute.
By reading data from the OSS in parallel through each segment, the entire cluster can achieve a fairly high throughput, and this capability increases linearly with the number of segments.
Program advantage
The above solution does not seem complicated at first, but solves the following problems:
Performance
Combines the superior concurrent write performance of PostgreSQL with the superior analysis performance of HybridDB.
A single RDS PostgreSQL can even support millions of writes; after writing PostgreSQL, it is loaded into HybridDB in batches, making PostgreSQL and HybridDB seamlessly connected, and real-time millisecond-level queries can be achieved by using MPP's superior analysis performance.
2. Data handling and cleaning
In the traditional field of analysis, data handling is often a relatively heavy and poor performance, resulting in a long distance between TP and AP, only in a different way and rhythm. And if it is the handling of heterogeneous databases, the pain index will go up the ladder.
If these can be operated through SQL, the cleaning and handling of data is ultimately just the definition and execution of SQL.
In the above figure, both RDS PostgreSQL and HybridDB for PostgreSQL have the ability to read and write directly to OSS, which can be easily concatenated. With reasonable scheduling and encapsulation, functions that would otherwise require a lot of work can be realized at a lower cost.
3. Unification of hot and cold data
By operating offline storage, cold data can be placed in OSS, and hot data can be placed in PostgreSQL or HybridDB for PostgreSQL. The same processing can be used to achieve unified processing of hot and cold data.
4. Dynamically adjust resources
One of the benefits of cloud ecology is dynamics and flexibility. RDS PostgreSQL resources can be dynamically adjusted at any time without affecting any availability, which is equivalent to fueling the aircraft in the air; while the expansion and shrinkage of HybridDB is a second-level switch. The flexibility of OSS itself also allows the customer to put as much data as possible.
Therefore, it brings the following advantages:
Compared with the traditional data analysis scheme, SQL is used as a unified way to manage data and reduce heterogeneity;
Dynamic resource scheduling, reducing costs
The boundaries between hot and cold data are blurred and directly access each other
TP, AP integration
There is no limit to the number of RDS PostgreSQL; there is no limit to the number of HybridDB clusters.
Alibaba Cloud Database PostgreSQL
Alibaba Cloud Database PostgreSQL, based on an open source relational database known as "Most Advanced." In the StackOverflow 2017 developer survey, PostgreSQL can be said to be "the developer's favorite and most desirable relational database in the annual statistics."
The advantages of PostgreSQL are as follows:
Stable
PostgreSQL's code quality is recognized by many people, and there are often people who laugh that the developers of PG are Virgo. Basically, a large version of PG is released, and it can be produced after three or two small versions. This is a place worthy of praise. This is evident from the beautiful commit log of PostgreSQL.
Thanks to PostgreSQL's multi-process architecture, a connection exception does not affect the main process and other connections, resulting in good stability.
Performance
We have some internal performance data. The performance test of TPCC shows that the performance of PostgreSQL is basically on the same level as the commercial database.
Rich
The richness of PostgreSQL is the most worthwhile place to talk about. Because it is so rich that I don't know how to focus. Here are just a few of the points that are considered interesting (query, type, function):
Rich in inquiry
Not to mention HASH\Merge\NestLoop JOIN, as well as recursion, connect by, window, rollup\cube\grouping sets, materialized view, SQL standard, etc., as well as various full-text searches, regular expressions, fuzzy queries. , similarity, etc. Beyond these, the most important is PostgreSQL's powerful cost-based optimizer, combined with parallel execution (parallel scanning, parallel JOIN, etc.) and various cost factors, bringing a wide variety of flexible and efficient query support.
Rich type
Such as high precision numeric, floating point, auto increment sequence, currency, byte stream, time, date, time stamp, boolean, enumeration, plane geometry, solid geometry, multidimensional geometry, earth, PostGIS, network, bitstream, full text search , UUID, XML, JSON, array, composite type, domain type, range, tree type, chemical type, gene sequence, FDW, large object, image, etc.
[PS: The array here allows the user to manipulate the data in the database as if it were an array in JAVA. For example, item[0][1] represents an element in a two-dimensional array, and item can be used as a field in a table. ]
Or, if the above is not enough, you can customize your own type (create type), and you can perform operator overloading for these types, such as implementing IP type addition, subtraction, multiplication and division (the operation definition depends on the concrete implementation, meaning: you want Let what the addition of IP looks like).
There are also various types of indexes, such as btree, hash, gist, sp-gist, gin, brin, bloom, rum index, and so on. You can even customize specific index and index scans for the types you define.
Rich in function
PostgreSQL has an unparalleled feature - plugins. It uses the Hook in the kernel code, allowing you to add any functions, such as PostGIS, JSON, genes, etc., without modifying the database kernel code. It does a lot of customization in the plugin without affecting any of them. The kernel code thus meets a wide variety of needs. And PostgreSQL plugins, countless.
The FDW mechanism allows you to access other data sources, such as Hadoop, MySQL, Oracle, Mongo, etc., in the same PostgreSQL as you would a local table, without taking up too much resources from the PG. For example, the OSS_FDW developed by our team is used to read and write OSS.
As for the other, for a simple example, PostgreSQL's DDL (such as addition and subtraction fields) can be done in a transaction [PS: PostgreSQL is Catalog-Driven, DDL modification can be understood as a record modification]. This point, I believe that students who do business will have experience.
On the basis of the open source version, Alibaba Cloud Database PostgreSQL adds HA, seamless expansion, automatic backup, recovery and non-aware switching, offline storage transparent access, diagnosis and optimization and many other functions to relieve the worries of use. More suggestions to visit the product page of Alibaba Cloud official website (see below for reference).
Alibaba Cloud HybridDB for PostgreSQL
HybridDB for PostgreSQL is a distributed analytical database of MPP architecture. Based on open source Greenplum, it has excellent performance in many aspects such as multi-table association, complex query, real-time statistics and circle. On this basis, Alibaba Cloud HybridDB for PostgreSQL provides a variety of unique features such as JSON, GIS, HLL valuation, backup recovery, abnormal auto-repair, etc.; and has many performance optimizations in METASCAN and other aspects, compared to the open source version. Improvement.
Alibaba Cloud HybridDB for PostgreSQL has the following features:
Real-time analysis
Support SQL syntax for real-time analysis of distributed GIS geographic information data types, assist with Internet of Things, Internet to achieve LBS location service statistics; support SQL syntax for distributed JSON, XML, fuzzy string and other data real-time analysis, help financial, government and enterprise industries to achieve Message data processing and fuzzy text statistics.
Stable and reliable
Support distributed ACID data consistency, achieve cross-node transaction consistency, all data two-node synchronous redundancy, SLA guarantee 99.9% availability; distributed deployment, computing unit, server, cabinet triple protection, improve important data infrastructure security.
Easy to use
Rich OLAP SQL syntax and function support, many Oracle function support, the industry's popular BI software can be used directly online; can communicate with cloud database RDS (PostgreSQL/PPAS), realize OLTP+OLAP (HTAP) mixed transaction analysis solution .
Support distributed SQL OLAP statistics and window functions, support distributed PL/pgSQL stored procedures, triggers, and realize distributed computing process development on the database side.
Hybrid analysis of geographic data in line with international OpenGIS standards, through a single SQL can be used to analyze geographic information from massive data, such as: human flow, area statistics, whereabouts and so on.
Excellent performance
Supports row and column hybrid storage, performance of inventory in OLAP analysis is up to 100 times performance improvement compared to row storage; supports high-performance OSS parallel data import, avoiding performance bottleneck of single-channel import.
Based on distributed massively parallel processing, with the addition of linear expansion storage and computing capabilities of computing units, the OLAP computing performance of each computing unit is fully utilized.
Flexible expansion
On-demand computing unit, CPU, memory, storage space expansion, OLAP performance smoothly rises to hundreds of terabytes; support transparent OSS data operations, non-online analysis of cold data can be flexibly transferred to OSS object storage, data storage Capacity expands indefinitely.
Through MySQL database, high-performance data can be imported through mysql2pgsql, and the industry's popular ETL tools can support ETL data import with HybridDB as the target.
The formatted file stored in the OSS can be used as a data source for real-time operation through the external table mode, and data query is implemented using standard SQL syntax.
Support data transparent flow from PostgreSQL/PPAS, continuous increments without programming, simplify maintenance, high-performance internal data modeling and data cleaning after data storage.
Safety
The IP whitelist configuration supports up to 1000 server IP addresses that allow RDS instances to be connected, and direct risk control from the access source.
DDOS protection, real-time monitoring at the network entrance, when the large traffic attack is found, the source IP is cleaned, and the black hole can be directly pulled when the cleaning is invalid.
Conclusion
Using Alibaba Cloud's cloud ecosystem, RDS PostgreSQL, HybridDB for PostgreSQL and other cloud services to help enterprises build intelligent enterprise data BI platform, HybridDB for PostgreSQL is also the core engine of enterprise big data real-time analysis operation and storage. A closed loop of business data that enables enterprises to move from online services to real-time data analysis in the cloud.