Database architecture practices (repost)

Source: Internet
Author: User
Tags: db2, rollback, sorts, unique id

1. Industry challenge: four schemes for "cross-database paging"

1). Method one: global view

A. Change "order by time offset X limit Y" to "order by time offset 0 limit X+Y" on every database

B. The service layer merges the resulting N*(X+Y) rows in memory, sorts them, and takes the Y records after offset X

With this approach, performance degrades steadily as the user pages deeper.
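A minimal sketch of the global view method; the shard lists and their contents are illustrative stand-ins for each database's sorted `time` column:

```python
import heapq

def global_view_page(shards, offset, limit):
    """Fetch page [offset, offset+limit) across shards by pulling the
    first offset+limit rows from every shard and merging in memory."""
    # Each shard returns its own first offset+limit rows, already sorted by time.
    per_shard = [sorted(rows)[:offset + limit] for rows in shards]
    merged = list(heapq.merge(*per_shard))  # global sort of N*(offset+limit) rows
    return merged[offset:offset + limit]

# Two shards holding timestamps; page with offset=2, limit=2 of the global order.
shard_a = [1, 4, 6]
shard_b = [2, 3, 5]
print(global_view_page([shard_a, shard_b], 2, 2))  # [3, 4]
```

Note how the amount of data pulled and merged grows with the offset, which is exactly why deep pages get slow.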

2). Method two: business compromise - forbid page jumps

A. Fetch the first page normally and record time_max, the largest time value on that page

B. For every subsequent page, change "order by time offset X limit Y" to "order by time where time > $time_max limit Y"

Each query returns only one page of data, so performance stays constant.
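A small query-builder sketch of the "no page jump" rewrite; the table name and column names are illustrative:

```python
def page_query(table, limit, time_max=None):
    """Build the query for the next page; time_max is the largest `time`
    seen on the previous page (None means this is the first page)."""
    if time_max is None:
        return f"SELECT * FROM {table} ORDER BY time LIMIT {limit}"
    # The continuation query never uses OFFSET, so cost does not grow with depth.
    return (f"SELECT * FROM {table} WHERE time > {time_max} "
            f"ORDER BY time LIMIT {limit}")

print(page_query("msg", 100))
print(page_query("msg", 100, time_max=1548))
```

The trade-off is that the client can only move to the "next page", never jump to page N directly.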

3). Method three: business compromise - accept fuzzy data (assume the N databases hold data evenly, sacrificing some accuracy)

A. Change "order by time offset X limit Y" to "order by time offset X/N limit Y/N" on each database

4). Method four: secondary query

A. Change "order by time offset X limit Y" to "order by time offset X/N limit Y" on each of the N databases

B. Find time_min, the minimum time value among all returned rows

C. Issue a second query to each database: "order by time where time between $time_min and $time_i_max", where time_i_max is the largest value database i returned in the first round

D. From the second round of results, determine time_min's offset within each database, and sum them to obtain time_min's global offset

E. Knowing time_min's global offset, the service can derive the correct global "offset X limit Y" page
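A toy simulation of the secondary-query idea, assuming each list stands in for one database's sorted `time` column; it shows how time_min's global offset is recovered by summing its per-shard offsets:

```python
def two_phase_offset(shards, x, y):
    """Return (time_min, global offset of time_min) for the secondary-query
    method with N shards, global query `offset x limit y`."""
    n = len(shards)
    # Phase 1: each shard executes ORDER BY time OFFSET x/n LIMIT y.
    phase1 = [sorted(s)[x // n : x // n + y] for s in shards]
    time_min = min(rows[0] for rows in phase1 if rows)
    # Phase 2: each shard reports how many of its rows precede time_min;
    # summing those per-shard offsets gives time_min's global offset.
    global_offset = sum(sum(1 for v in s if v < time_min) for s in shards)
    return time_min, global_offset

shards = [[1, 3, 5, 7, 9], [2, 4, 6, 8, 10]]
tmin, off = two_phase_offset(shards, x=4, y=2)
print(tmin, off)  # 5 4 -> value 5 sits at global offset 4 in the merged order
```

Once time_min's global offset is known, the page at global offset X can be cut out of the second-round results exactly.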

2. Horizontal sharding architecture practices for single-key businesses

1). Horizontal segmentation methods

A. Range method: using the user center's business primary key uid, split the data horizontally across two DB instances:

user-db1: stores uid data in the range 0 to 10 million

user-db2: stores uid data in the range 10 million to 20 million

a). Advantages of the range method: the segmentation strategy is simple - given a uid and its range, the user center can quickly locate which database holds the data; expansion is simple - if capacity runs out, just add a user-db3

b). Deficiencies of the range method:

The uid must have an incremental property

Data volume is uneven: a newly added user-db3 holds relatively little data at first

Request volume is uneven: newly registered users are generally more active, so user-db2 tends to carry a higher load than user-db1, leaving server utilization unbalanced

B. Hash method: also using the user center's business primary key uid, split the data horizontally across two DB instances:

user-db1: stores data for uids where uid % 2 == 1

user-db2: stores data for uids where uid % 2 == 0

a). Advantages of the hash method: the segmentation strategy is simple - given a uid and its hash, the user center can quickly locate which database holds the data

Data balance: as long as uids are uniformly distributed, data is spread evenly across the databases

Request balance: as long as uids are uniformly distributed, load is spread evenly across the databases

b). Deficiency of the hash method: expansion is a problem - if capacity runs out and a database must be added, re-hashing may require data migration, and how to migrate that data smoothly is an issue that must be solved
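The two routing strategies above can be sketched side by side; the shard count, range size, and database names are illustrative assumptions:

```python
RANGE_SIZE = 10_000_000  # assumed: each range shard holds 10M uids
N_HASH = 2               # assumed: two hash shards

def route_by_range(uid):
    """Range method: consecutive uid blocks map to consecutive databases."""
    return f"user-db{uid // RANGE_SIZE + 1}"

def route_by_hash(uid):
    """Hash method: uid % N picks the database."""
    return f"user-db{uid % N_HASH + 1}"

print(route_by_range(12_000_000))  # user-db2 (second 10M block)
print(route_by_hash(124))          # user-db1 (124 % 2 == 0)
```

The hash routing line is also where the expansion pain lives: changing `N_HASH` changes almost every uid's target database, which is what forces a migration.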

2). Problems after sharding the user center:

Queries on the uid attribute can be routed directly to a database - for example, a query for uid=124 goes straight to user-db1. Queries on non-uid attributes, such as login_name, are painful:

To look up login_name=shenjian, you often have to scan every database, because there is no way to know which one holds the row; with many databases, performance drops sharply.

3). In the author's experience, queries on non-uid attributes in the user center typically come from two kinds of business requirements:

A. User side (foreground access), most typically two kinds of requests:

User login: look up the user's record by login_name/phone/email - about 1% of requests

User info query: after login, look up the user's record by uid - about 99% of requests

User-side queries are basically single-record lookups with heavy traffic; the service must be highly available, and the consistency requirements are high.

B. Operations side (background access): access patterns vary with product and operational needs - queries by age, gender, avatar, login time, registration time, and so on.

Operations-side queries are basically batch paging queries. Since it is an internal system, traffic is very low, availability requirements are modest, and consistency requirements are not strict.

What architecture should be used to serve these two very different kinds of requirements?

4). User center: horizontal sharding architecture ideas

When the user center's data volume is large, shard horizontally by uid. For queries on non-uid attributes, the core design ideas are:

For the user side, use the "map non-uid attributes to uid" class of schemes

For the operations side, use the "foreground/background separation" architecture

5). User center: user-side best practices

A. Index table method

Idea: uid routes directly to a database; login_name does not. If login_name can be resolved to a uid, the problem is solved.

Solution:

Build an index table recording the login_name -> uid mapping

When a request comes in by login_name, first query the index table for the uid, then route to the corresponding database

The index table has few columns, so it can hold a great deal of data and usually does not need to be sharded

If it does grow too large, it can itself be sharded by login_name

Potential drawback: one extra database query, so some performance loss
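A minimal sketch of the index table method; here a dict stands in for the real index table, and the names and the %2 routing rule are illustrative:

```python
# Hypothetical index table mapping login_name -> uid
# (in a real system this is a database table, not a dict).
index_table = {"shenjian": 124}

def query_by_login_name(name):
    uid = index_table[name]       # the extra lookup: the "one more query" cost
    db = f"user-db{uid % 2 + 1}"  # then route by uid as usual
    return uid, db

print(query_by_login_name("shenjian"))  # (124, 'user-db1')
```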

B. Cache mapping method

Idea: querying the index table is slow; keeping the mapping in a cache performs better

Solution:

A login_name query first looks up the uid in the cache, then routes to the database by uid

On a cache miss, scan all the databases to find the uid for the login_name, and put the mapping into the cache

The login_name -> uid mapping never changes, so once cached it never needs updating or eviction, and the hit rate is extremely high

If the data volume is too large, the cache can itself be sharded by login_name

Potential drawback: one extra cache query
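The cache mapping method can be sketched as follows; `scan_all_shards` is a hypothetical stand-in for the expensive cross-database scan on a miss:

```python
cache = {}  # login_name -> uid; entries never change, so no eviction needed

def lookup_uid(name, scan_all_shards):
    """Return the uid for `name`, falling back to a full scan on a cache miss."""
    if name not in cache:
        cache[name] = scan_all_shards(name)  # expensive, happens once per name
    return cache[name]

# scan_all_shards stands in for the real cross-shard query (assumption).
uid = lookup_uid("shenjian", lambda n: 124)
print(uid, lookup_uid("shenjian", lambda n: -1))  # second call hits the cache
```

The second call never invokes the scan function, which is why the hit rate stays near 100% once the cache is warm.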

C. Generating the uid from login_name

Idea: avoid the remote lookup entirely by deriving the uid directly from login_name

Solution:

At registration time, a designed function generates the uid from login_name, uid = f(login_name), and the row is inserted into the database chosen by uid

When a request comes in by login_name, first recompute uid = f(login_name), then route to the corresponding database by uid

Potential drawback: designing such a function takes real skill, and there is a risk of uid generation collisions

D. Folding a login_name "gene" into the uid

Idea: if login_name cannot generate the whole uid, extract a "gene" from login_name and fold it into the uid. Suppose there are 8 databases routed by uid % 8; then the low 3 bits of the uid decide which database a row lands on, and those 3 bits are the "gene".

Solution: at registration time, a designed function derives a 3-bit gene from login_name, login_name_gene = f(login_name)

At the same time, generate a 61-bit globally unique id as the user's identifier

Use the 3-bit login_name_gene as the low bits of the uid

Assemble the 64-bit uid from the id and the login_name_gene, and insert the row into the database chosen by uid

When a request comes in by login_name, recompute the 3-bit gene, login_name_gene = f(login_name), and route directly to the database by the gene
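A sketch of the gene method; the choice of md5 as the gene function is an assumption for illustration, and any stable hash of login_name would do:

```python
import hashlib

N_SHARDS = 8  # uid % 8 routing => the low 3 bits are the "gene"

def gene(login_name):
    """3-bit gene derived from login_name (the hash choice is an assumption)."""
    return int(hashlib.md5(login_name.encode()).hexdigest(), 16) & 0b111

def make_uid(global_id, login_name):
    """61-bit globally unique id + 3-bit gene => 64-bit uid."""
    return (global_id << 3) | gene(login_name)

uid = make_uid(123456, "shenjian")
# Routing by uid and routing by login_name land on the same shard:
assert uid % N_SHARDS == gene("shenjian")
```

Because the low 3 bits of every uid are the gene, a login_name query can skip the uid entirely and still hit the right database.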

E. User center: operations-side best practices

On the user-facing foreground side, business needs are basically single-row lookups; building a mapping from the non-uid attributes login_name/phone/email to uid solves the problem.

On the background operations side, the needs are different: they are basically batch paging queries, which are computation-heavy and return large result sets, consuming considerable database resources.

If the foreground and background businesses share the same services and database, the background's low-frequency but inefficient batch queries can occasionally spike the database CPU to 100%, disrupting normal foreground user access (for example, login timeouts).

Moreover, to satisfy the background's many "exotic" query needs, all kinds of indexes tend to get added to the database. These indexes consume a great deal of memory and noticeably degrade the foreground's uid/login_name query and write performance, increasing processing time.

For this kind of business, the "foreground/background separation" architecture should be used:

The user-facing foreground architecture stays unchanged; the operations-facing background needs are extracted into an independent web/service/db stack, decoupling the systems. For background services that are "complex in business", "low in concurrency", "without high-availability requirements", and "tolerant of some delay":

The service layer can be dropped, with the operations background web layer accessing the DB directly through a DAO

No reverse proxy or cluster redundancy is needed

There is no need to hit the live database; data can be synchronized asynchronously via MQ or offline

When the data volume is very large, designs suited to big data - an "external index" or Hive - can be used, accepting higher latency

F. Summary

Taking the "user center" as a typical "single key" business, this section introduced the key points of its horizontal sharding architecture.

Horizontal segmentation methods: range method; hash method

Problem after segmentation: queries by uid route directly to a database; queries by non-uid attributes cannot

Typical businesses that query by non-uid attributes: the user side - foreground access, single-record lookups, heavy traffic, requiring high availability and high consistency; the operations side - background access with varied patterns driven by product and operational needs, basically batch paging queries from an internal system with very low traffic, modest availability requirements, and relaxed consistency requirements

Design ideas for these two kinds of business:

For the user side, use the "map non-uid attributes to uid" class of schemes

For the operations side, use the "foreground/background separation" architecture

Best practices for "mapping non-uid attributes to uid" on the user-facing foreground side:

Index table method: record the login_name -> uid mapping in the database

Cache mapping method: record the login_name -> uid mapping in a cache

Generate the uid from login_name

Fold a login_name gene into the uid

Best practices for "foreground/background separation" on the operations background side:

Decouple the foreground and background web/service/db stacks, so that inefficient background queries cannot cause foreground query jitter

Redundant data designs can be adopted

An "external index" (such as an ES search system) or "big data processing" (such as Hive) can satisfy the background's unusual query needs

3. Schema design for 10 billion rows with 10,000 attributes

1). What is the version + ext scheme for database extension?

Use an ext column to carry the personalized attributes of different business needs, and a version column to identify what the fields inside ext mean.

Advantages: a. attributes can be extended dynamically at any time - good extensibility; b. old and new data can coexist - good compatibility

Deficiencies: a. fields inside ext cannot be indexed; b. the keys inside ext are highly redundant, so short keys are recommended
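A small sketch of version + ext; the column names, versions, and schema registry are illustrative assumptions:

```python
import json

# Hypothetical registry: which fields each ext version contains.
ext_schemas = {
    1: ["phone"],
    2: ["phone", "wechat"],
}

# Old and new rows coexist; `version` tells the reader how to parse `ext`.
row_old = {"uid": 1, "version": 1, "ext": json.dumps({"phone": "138..."})}
row_new = {"uid": 2, "version": 2,
           "ext": json.dumps({"phone": "139...", "wechat": "sj"})}

for row in (row_old, row_new):
    fields = ext_schemas[row["version"]]
    ext = json.loads(row["ext"])
    print(row["uid"], {k: ext[k] for k in fields})
```

Adding a new attribute means registering version 3, not running an ALTER TABLE, which is the extensibility the scheme buys.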

2). How to store heterogeneous data of different categories uniformly - use an approach similar to version + ext:

tiezi(tid, uid, time, title, cate, subcate, xxid, ext)

A. Extract the common fields and store them as regular columns

B. Let cate, subcate, xxid, etc. define what ext means (somewhat like version)

C. Use ext to store the personalized needs of different business lines

3). With the storage of mass heterogeneous data solved, new problems arise:

A. Every row stores its ext keys again, occupying a lot of space - the storage should be compressed

B. cateid is no longer sufficient to describe the contents of ext; categories are hierarchical with uncertain depth, so ext should become self-describing

C. Attributes must remain addable at any time to preserve extensibility

4). Unified category/attribute service

Abstract a unified category and attribute service to manage this metadata on its own, and represent the JSON keys in the post table's ext column by numbers, reducing storage space.

What each number means, which sub-category it belongs to, and the validation constraints on its values are all stored centrally in the category/attribute service.

In addition, if the value of a key in ext is validated not by a regular expression but as an enumeration, an enumeration table of permitted values is needed for validation.

5). Unified search service

When the data volume is large, no combination of composite indexes can satisfy the query needs across all the different attributes.

The early engineers at 58.com settled on the "external index, unified search service" technical route from the start:

A. The database serves forward lookups by post id (tid)

B. All personalized search needs other than "by post id" go through the external index uniformly

6). Operations on metadata and index data then follow:

A. Forward lookups by post tid go directly to the post service

B. When a post is modified, the post service notifies the search service, which updates the index

C. Complex queries over posts are satisfied through the search service

4. Second-level smooth database expansion

1). Deployment scheme:

A. Under the high concurrency and heavy traffic of internet architectures, the database generally sits behind a service layer. The service layer records the mapping from "business database name" to "database instance" and routes SQL statements to the databases for execution through a connection pool

B. As data volume grows, the data is split horizontally and distributed across different database instances (even different physical machines) to reduce per-instance volume and improve performance

C. Internet architectures must keep the database highly available; a common approach is dual-master synchronization + keepalived + virtual IP

D. Combining (B) and (C), the actual production architecture provides both horizontal splitting and high-availability guarantees

Question: if data volume keeps growing and 2 databases can no longer carry the load, what then?

Answer: keep splitting horizontally into more databases, reducing per-database data volume and adding master instances (machines) to improve performance.

2). Stop-the-service scheme: pause all services and migrate the data.

Rollback scheme: if the migration fails, or post-migration testing fails, switch the configuration back to the original X databases, restore service, and post an announcement to try again another day.

Scheme advantage: simple

Scheme disadvantages: a. service is stopped, so no high availability;

b. The engineers are under time pressure - all work must finish within the window, and experience says pressure makes mistakes more likely (which here is fatal)

c. If a problem is missed the first time and only discovered after service has resumed for a while, rollback is difficult, and rolling back may lose some of the data written since

3). The second-level, smooth, elegant scheme

A. Modify the configuration

There are two main changes:

a). Give the machine hosting each DB instance dual virtual IPs: the original %2=0 database keeps virtual ip0 and gains a virtual ip00; the %2=1 database is handled the same way

b). Modify the service's configuration (whether in a configuration file or a configuration center), changing the 2-database configuration into a 4-database configuration, while preserving the mapping to the old databases:

The %2=0 portion becomes %4=0 and %4=2;

The %2=1 portion becomes %4=1 and %4=3;

This ensures that after the split, data is still routed correctly.
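The routing invariant behind step b) can be checked mechanically: every new %4 target must be a library that already held the uid's data under the old %2 scheme (it was one member of the old dual-master pair). The IP names below follow the text; the check itself is an illustrative sketch:

```python
# Old routing: uid % 2 -> {0: "ip0", 1: "ip1"} (each a dual-master pair).
# New routing after the split: uid % 4 -> one member of the old pair,
# which already holds the data thanks to dual-master synchronization.
new_route = {0: "ip0", 2: "ip00", 1: "ip1", 3: "ip11"}

# Invariant: %4==0 and %4==2 both come from the old %2==0 pair, and
# %4==1 and %4==3 from the old %2==1 pair, so no row is ever mis-routed.
for uid in range(100):
    assert (uid % 4) % 2 == uid % 2
print("routing invariant holds")
```

This invariant is exactly why the expansion needs no data movement up front: the new libraries start out as full copies of their old pair.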

B. Reload configuration, instance expansion

The service layer reloads its configuration, which can happen in several ways:

a). The primitive way: restart the service and read the new configuration file

b). The advanced way: the configuration center signals the service to re-read its configuration and re-initialize its database connection pools

Either way, after the reload the database instance expansion is complete: where 2 DB instances served before, 4 DB instances serve now. This process can generally finish within seconds.

The reload can be performed as a rolling restart, with no impact on the correctness or availability of the service:

a). Even while %2 and %4 routing coexist, data correctness is unaffected, because dual-master synchronization is still running within each pair

b). Each service instance finishes its reload before taking traffic, and redundant service instances preserve high availability

After the instance expansion, you will notice that the amount of data in each database has not actually dropped - so the third step does some finishing work.

C. Finishing work, data contraction:

Several finishing tasks remain:

a). Turn the dual virtual IPs back into single virtual IPs

b). Remove the old dual-master synchronization, so the libraries in each former pair no longer sync with each other

c). Set up new dual-master synchronization partners to keep high availability

d). Delete the redundant data - for example, on the library behind ip0, delete all rows where uid%4=2, leaving it to serve only the %4=0 data

After this, each database holds half its former data, and the contraction is complete.
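Step d) can be pictured as a toy simulation; the uid range is illustrative, and the filter corresponds to something like `DELETE FROM user WHERE uid % 4 = 2` on the real library:

```python
# uids stored on the library behind ip0: it served %2==0, i.e. %4 in {0, 2}.
rows = [uid for uid in range(20) if uid % 2 == 0]

# Finishing step d): drop the %4==2 rows, keep only what this library serves.
kept = [uid for uid in rows if uid % 4 == 0]
print(kept)  # [0, 4, 8, 12, 16] -> half the rows remain
```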

5. Smooth migration of 10 billion rows of data, with no impact on service

For the many internet business scenarios with "large data volumes, high concurrency, and high business complexity", when the need arises for

A. Changes to the underlying table structure

B. Changes to the number of databases

C. Changes to the underlying storage medium

there are two common schemes for "migrating data smoothly, with no downtime during the migration and guaranteed service continuity".

1). Log-chasing method, five steps:

A. Upgrade the service to log every "data modification on the old database"

B. Develop a small data-migration tool to migrate the data

C. Develop a small log-replay tool to close the data gap

D. Develop a small data-comparison tool to verify data consistency

E. Cut traffic over to the new database, completing the smooth migration

2). Double-write method, four steps:

A. Upgrade the service so that every "data modification on the old database" is double-written to the new database

B. Develop a small data-migration tool to migrate the data

C. Develop a small data-comparison tool to verify data consistency

D. Cut traffic over to the new database, completing the smooth migration

6. Three schemes for redundant data in MySQL

1). Why redundant data

Example: an order business where both buyers and sellers need to query orders:

order(oid, info_detail);

t(buyer_id, seller_id, oid);

If the tables are sharded by buyer_id, queries by seller_id must scan multiple databases.

If the tables are sharded by seller_id, queries by buyer_id must scan multiple databases.

Data redundancy can satisfy the query needs on buyer_id and seller_id separately:

t1(buyer_id, seller_id, oid)

t2(seller_id, buyer_id, oid)

The same data is stored in two redundant copies: one sharded by buyer_id to serve buyers' queries, one sharded by seller_id to serve sellers' queries.

2). Service synchronous double-write

As the name implies, the redundant data is written synchronously by the service layer, in steps 1-4:

The business caller invokes the service to insert data

The service first inserts the t1 row

The service then inserts the t2 row

The service returns success to the business caller

Advantages:

Not complex: the service layer changes one write into two

Data consistency is relatively high (success is not returned until both writes complete)

Disadvantages:

Request processing time increases (two inserts, so roughly double the time)

Data can still end up inconsistent - for example, if the service restarts after the t1 write completes in step two, the row is never written to t2
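The failure window described above can be demonstrated with a toy version of synchronous double-write; the lists stand in for the two redundant tables, and `crash_between` simulates a service restart:

```python
def create_order(buyer_id, seller_id, oid, t1, t2, crash_between=False):
    """Synchronous double-write: insert into t1, then t2, then return.
    If the service dies between the two inserts, the tables diverge."""
    t1.append((buyer_id, seller_id, oid))
    if crash_between:
        raise RuntimeError("service restarted between writes")
    t2.append((seller_id, buyer_id, oid))

t1, t2 = [], []
create_order(1, 9, 100, t1, t2)          # normal case: both tables written
try:
    create_order(2, 8, 101, t1, t2, crash_between=True)
except RuntimeError:
    pass
print(len(t1), len(t2))  # 2 1 -> the inconsistency described above
```

Repairing such divergence is what the data-comparison tools in section 5 are for.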

3). Service asynchronous double-write

The double write is no longer done inline by the service; instead, the service layer sends an asynchronous message over a message bus to a dedicated data-replication service, which writes the redundant data, in steps 1-6:

The business caller invokes the service to insert data

The service first inserts the t1 row

The service sends an asynchronous message to the message bus (fire and forget, usually very fast)

The service returns success to the business caller

The message bus delivers the message to the data synchronization center

The data synchronization center inserts the t2 row

Advantage: short request processing time (only one insert)

Disadvantages: system complexity increases - a new component (the message bus) and a new service (the dedicated data-replication service) are introduced

Because success is returned to the business caller before the row necessarily reaches t2, there is a window of inconsistency (the window is very short, and the data is eventually consistent)

If the message bus loses a message, the redundant tables become inconsistent

When the system is sensitive to processing time, this second scheme is preferred over synchronous double-write. But whether the double-write is synchronous or asynchronous, the service must still concern itself with the complexity of the "redundant data"; to decouple "data redundancy" from the service entirely, a third common scheme is used.

4). Offline asynchronous double-write:

To shield the service from the complexity of the "redundant data", the double write is done neither by the service layer nor inline, but by an offline service or task, in steps 1-6:

The business caller invokes the service to insert data

The service first inserts the t1 row

The service returns success to the business caller

The write is recorded in the database's log

An offline service or task reads the database log

The offline service or task inserts the t2 row

Advantages: the double-write is fully decoupled from the business; request processing time is short (only one insert)

Disadvantages: because success is returned before the row necessarily reaches t2, there is a window of inconsistency (the window is very short, and the data is eventually consistent)

Data consistency depends on the reliability of the offline service or task
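The offline path above can be sketched end to end; the in-memory `log` list is an illustrative stand-in for something like the database binlog, and the function names are hypothetical:

```python
t1, t2, log = [], [], []

def service_insert(buyer_id, seller_id, oid):
    """The service writes t1 and the log, then returns immediately."""
    t1.append((buyer_id, seller_id, oid))
    log.append((buyer_id, seller_id, oid))  # e.g. the database binlog
    return "ok"                             # returns before t2 is written

def offline_task():
    """A separate task later replays the log into the redundant table t2."""
    while log:
        buyer_id, seller_id, oid = log.pop(0)
        t2.append((seller_id, buyer_id, oid))

service_insert(1, 9, 100)
assert len(t2) == 0          # the inconsistency window: t2 lags behind t1
offline_task()
assert t2 == [(9, 1, 100)]   # eventually consistent
```

The two asserts trace exactly the trade-off in the text: a short window where t2 lags, closed once the offline task catches up.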

5). Summary:

For internet business scenarios with large data volumes, the pattern is usually:

Use horizontal splitting to reduce the data volume of a single database

Use denormalized, redundant data to satisfy query needs along different dimensions

Service synchronous double-write is the easy way to implement data redundancy

To reduce latency, it can be optimized into service asynchronous double-write

To shield the service from the complexity of the redundant data, it can be further optimized into offline asynchronous double-write

Content reposted from the public account: The Architect's Path
