Original: http://www.p2pquan.com/article-740-1.html
First, Introduction
As Internet finance continues to heat up, more and more banks are releasing their own Internet financial products. But the "high concurrency and big data" characteristics of Internet products pose new challenges to the traditional core system architecture of banks.
1. Core technical characteristics of the Internet
The core technical features of today's Internet can be summarized as: distributed, easy to scale, built on large numbers of low-end machines, and based on open source software at the lower layers. A distributed architecture can be scaled out to absorb the influx of customers from the Internet. Likewise, a big data platform for customer behavior analysis needs to be a distributed system, the most typical example being a Hadoop cluster.
2. Technical characteristics of traditional banking systems
The technical characteristics of traditional banking systems can be summarized as: specialized equipment, closed source software at the lower layers, high cost, and stable systems. Because the financial industry demands high stability, reliability and security from its core systems, most banks today use IBM's end-to-end solution; there are few alternatives, and IBM retains a monopoly position in this market.
3. The challenge comes mainly from the upgraded customer experience
As the number of Internet customers grows, the volume of customer actions and system responses also grows explosively.
Traditional electronic banking provides passive services: the product only offers functions, and customers have to query and operate it themselves. Internet products lean toward active services that present customer-related data in a vivid, intuitive way. As a result, a single request, whether initiated by the customer or by the product, has to interact with several functional interfaces of the core system at the same time. Multiply that by the scale of the Internet customer base and the pressure on the system becomes very large, which also brings great risk.
4. Product design should be reasonable
In most cases the architecture of the core system cannot be changed. We therefore have to avoid high-concurrency interactions as far as possible through product design and the application systems, but not at the expense of the customer experience.
If high concurrency does cause a bottleneck in the core system, first make sure the application system keeps supporting the product: at the very least the product must not crash and leave customers more disappointed than they expected. Second, the product's error messages should be friendly enough to reduce customers' negative emotions. Finally, control the volume of requests being initiated; for example, temporarily requiring a verification code reduces how often customers can send requests.
The following takes the Pleasant Loan system architecture as an example. Pleasant Loan architect June Sun briefly introduces the practical problems encountered while developing the system and how they were solved, focusing on the high-concurrency solutions used in its loan and financial management systems.
Second, Version Iterations of the Pleasant Loan System
1.0 version -- The troubles of simplicity
Before any iteration, the Pleasant Loan system consisted of one front end, one back office and one DB, with the front end deployed on multiple machines.
The software was also split into the three most traditional layers: the first is the controller, the second is the service, and the third is the DB. This kind of system is clearly not suited to the Internet and has some unavoidable problems. First, once the number of users and of concurrent online users grows, this deployment creates bottlenecks in both the servers and the data. Second, as the team gets bigger, all developers work on the same system and conflicts become severe.
1.5 version -- Trying to "take a tonic"
To address these problems they made some changes, which June Sun calls "taking a tonic". Taking a tonic has one obvious characteristic: the effect is immediate, but the side effects are also large.
First, they paid more attention to performance at the page layer: browser-side optimization, compressed transfer, and pages tuned with YSlow. In front of the pages they added a CDN, static pages and even a reverse proxy, which together can absorb 80% of the traffic.
Behind the page layer they added a cluster, which can block roughly another 80% of the traffic that reaches it. Finally, they split the system vertically by business into multiple systems, such as the app back end, the web site, credit review, crawling, campaigns and reports. The database also changed: it started as one host with one database and became master-slave, with one master and multiple slaves.
This setup can support users on the order of a million. The remaining constraint is the database: each of the two layers blocks 4/5 of the traffic it receives, so only 1/5 × 1/5 = 1/25 of the original traffic reaches the database, which in effect expands the database's concurrency capacity 25 times. But the business grew far more than 25 times, so the database remained a major bottleneck.
The second problem is team division. Each team builds its own system, but all of them still use the same database, so designing and changing the database is very troublesome: every change means asking the other teams, "If I change this, how will it affect you?"
The third problem is also tricky: with heavy use of caches, data timeliness and consistency problems became more and more serious. June Sun recalls a financial product in April whose inventory was off by more than a factor of two. It went online punctually at 10 o'clock, was taken offline once the problem was found, and stayed offline for half a day. The cause was a cache, and it was very hard to track down.
2.0 version -- "Special mess hall" refinement
To solve the problems of version 1.5, June Sun's team needed fine-grained optimization, which he calls the "special mess hall".
First, divide data ownership sensibly, optimize query efficiency and shorten database transaction time. Second, split by system, with each system owning a fixed set of tables. As a daily routine, have ops find the slowest queries in production and optimize them. Third, for transactions, shorten their duration as much as possible.
Then they began to focus on code quality, execution efficiency and concurrency issues. Once the user base reaches this size, users effectively test the system for you. For example, the same user logs in with the same account on two clients and taps withdraw on both at the same moment; if the program is not written carefully, it may well let him withdraw twice. Finally, distinguish between strong consistency and eventual consistency, and use caching and read-write separation sensibly to handle each.
Version 2.0 solved many performance problems but brought new ones. First, there are more and more systems and the dependencies between them become complex; it is very easy to end up with circular calls, where A calls B and B calls C in a loop. Second, inter-system calls increase, and an upstream system can crush a downstream one. Third, and a real headache, with so many systems it becomes harder and harder to track down problems in production: when a system is deployed on many machines, checking the logs machine by machine is very difficult.
On this basis they did several things. The first is rate limiting, which is usually based on two measures. Limiting the maximum number of active threads suits high-cost tasks, while limiting the number of executions per second suits low-cost tasks (in effect a cap on the number of requests to an interface). The second is a recommendation: unify return values between internal systems as far as possible, and have every return value record a return status (business normal, business exception, program exception) and an error description. Third, let the RPC framework carry out the rate limiting.
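As an illustration only, the two limits and the unified return value could look like the sketch below. All names here (`reply`, `ActiveLimiter`, `PerSecondLimiter`, `guarded_call`) are hypothetical and are not taken from the Pleasant Loan code base. Treating `ValueError` as the business exception and reporting a rejected call as a program exception are assumptions made for the example; a real system would normally rely on its RPC framework for this.

```python
import threading
import time


def reply(status, data=None, error=""):
    """Unified return value: a status, an error description, and the payload."""
    return {"status": status, "error": error, "data": data}


class ActiveLimiter:
    """Caps how many calls run at the same time (for high-cost tasks)."""

    def __init__(self, max_active):
        self._slots = threading.BoundedSemaphore(max_active)

    def try_enter(self):
        # Non-blocking: reject right away instead of queueing the caller.
        return self._slots.acquire(blocking=False)

    def leave(self):
        self._slots.release()


class PerSecondLimiter:
    """Caps calls per one-second window (for low-cost tasks)."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self._max = max_per_second
        self._clock = clock
        self._lock = threading.Lock()
        self._window = None
        self._count = 0

    def try_enter(self):
        with self._lock:
            window = int(self._clock())
            if window != self._window:
                # A new one-second window has started: reset the counter.
                self._window, self._count = window, 0
            if self._count >= self._max:
                return False
            self._count += 1
            return True

    def leave(self):
        pass  # nothing to release for a counter-based limit


def guarded_call(limiter, task):
    """Run task under the limiter and always return a unified reply."""
    if not limiter.try_enter():
        return reply("PROGRAM_EXCEPTION", error="rate limited")
    try:
        return reply("OK", data=task())
    except ValueError as exc:  # assumption: ValueError marks a business failure
        return reply("BUSINESS_EXCEPTION", error=str(exc))
    except Exception as exc:
        return reply("PROGRAM_EXCEPTION", error=repr(exc))
    finally:
        limiter.leave()
```

A caller only ever inspects `status`, so upstream systems can tell a throttled call from a failed business rule without parsing exception text.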
As for tracking down problems: in the deployment diagram of the Pleasant Loan log system, the business systems are on the far left. Logs are collected from the business systems into a Kafka queue and are finally viewed with Kibana and a system developed in-house. Scattered logs are hard to search; once they are centralized in one place, problems are much easier to find.
On the software side, Pleasant Loan uniformly uses Slf4j+logback to output logs and relies on the log system to stitch them together. Some hidden parameters are passed between every server and client and follow the call chain step by step; this is implemented with AOP. Which parameters need to be passed for log stitching, and which should be printed in the log? They are: time (to the millisecond), serial number (a UUID, a unique value generated for each request), user session, device number, caller time (for the app, the phone's local time), local IP, client IP, the user's real IP, and the number of systems crossed. With these, finding a problem is very easy.
After 2.0, the Pleasant Loan site was basically a medium-to-large site and would not face major performance problems in the short term, but they had to keep moving forward.
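The parameter list above can be pictured as a small context object that travels with every call. The sketch below is a language-neutral illustration in Python, not the Slf4j+logback and AOP implementation the article describes; the names `TraceContext`, `new_trace`, `to_headers`, `from_headers` and `log_line`, and the header keys, are all hypothetical.

```python
import time
import uuid
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class TraceContext:
    serial_no: str        # UUID generated once per request
    session: str          # user session
    device_no: str        # device number
    caller_time_ms: int   # caller time; for the app, the phone's local time
    client_ip: str        # IP of the immediate caller
    real_ip: str          # the user's real IP
    hops: int = 0         # number of systems crossed so far


def new_trace(session, device_no, caller_time_ms, client_ip, real_ip):
    """Create the context at the edge, where the request first arrives."""
    return TraceContext(uuid.uuid4().hex, session, device_no,
                        caller_time_ms, client_ip, real_ip)


def to_headers(ctx):
    """Serialize the context for an outgoing call, counting one more hop."""
    forwarded = replace(ctx, hops=ctx.hops + 1)
    return {key: str(value) for key, value in asdict(forwarded).items()}


def from_headers(headers):
    """Rebuild the context on the receiving side."""
    return TraceContext(
        serial_no=headers["serial_no"],
        session=headers["session"],
        device_no=headers["device_no"],
        caller_time_ms=int(headers["caller_time_ms"]),
        client_ip=headers["client_ip"],
        real_ip=headers["real_ip"],
        hops=int(headers["hops"]),
    )


def log_line(ctx, local_ip, message, now_ms=None):
    """Format one log line that carries every stitching field."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    fields = {"time_ms": now_ms, "local_ip": local_ip, **asdict(ctx)}
    prefix = " ".join(f"{key}={value}" for key, value in fields.items())
    return f"{prefix} {message}"
```

Because every system logs the same `serial_no`, a single search on that value in the centralized log store returns the whole call chain, and `hops` shows how deep each line sits in it.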
3.0 version -- Splitting and service-orientation
In summary, 3.0 is about service-orientation, or put more plainly, splitting: vertical splitting, plus horizontal splitting by system layer on top of it. So how should service-orientation be done?
First, when splitting by business, you can make a coarse split into basic services and business services. Basic services in turn include non-business basic services and business-related basic services. Non-business basic services obviously have little to do with other systems. Business-related basic services are characterized by a loose relationship with the business: their only connection to the business systems is the association between primary keys and foreign keys.
Pleasant Loan splits naturally into two systems, the loan business and the financial management business. The loan business can be further split into the back office, the web site and partner channels. Each side also has a basic service, a system that provides common functions and interfaces.
The basic service is split into two parts, one for basic services and one for after-sale services. During the split they hit a problem: how to separate the financial management and loan businesses, that is, the matching between the two and the creditor's-rights relationships. This could not be solved simply by extracting one function and promoting it to a service.
Splitting a system only looks easy. June Sun summarizes the approach as follows:
First, appropriate redundancy: redundancy ensures that the database can still run join queries. Most of the time you are not building a new system but changing an existing one, and some redundancy lets existing queries keep working without modification.
Second, data replication, provided that only the system that owns the data has the right to modify it and to initiate replication. This suits global configuration: for example, every Pleasant Loan company has a table recording the country's administrative regions, which is used in every system. Not every system needs to fetch it through an interface call; each system can keep a redundant copy.
Third, how to verify a database split: you do not have to physically split the database in two to verify it. You can create two accounts on the one database, grant each account permissions only on the tables on its side of the split, and verify the effect of the split directly through those accounts.
Fourth, plan services ahead of time: separate read-heavy from write-heavy services and fast-request from slow-request services; different kinds of service need to be deployed separately.
Finally, the same data must not be controlled by more than one system, and the same system must not be the responsibility of more than one person.
4.0 version --Cloud outlook
With the above done, version 3.0 was nearly complete, but Pleasant Loan still has a lot to do afterward: whether 4.0 should build a cloud platform, plan multi-site deployment, split tables once they become very large, move off IOE, or use Docker for rapid deployment. These are things to consider for 4.0 or 5.0 in the future.
Third, Optimizing the Pleasant Loan Financial Management System
Reasonable estimation of traffic--strong consistency and eventual consistency
The three screens in question are the home page, the list page and the detail page.
First, estimate the traffic reasonably and distinguish which traffic requires strong consistency and which does not.
Estimation method one: normal daily PV × (24 / peak hours).
Estimation method two: number of online users during the peak period × average operations per user / peak period duration. Take Pleasant Loan's financial management product as an example: at peak there are 20,000 users, each performing about 20 operations on average, and within about 2 minutes essentially all the creditor's rights are snapped up. That works out to roughly 3,000 requests per second.
Then distinguish what is strongly consistent from what is eventually consistent, and how much traffic falls into each. Strongly consistent data must be the most accurate data: it cannot be served through read-write splitting and must always be correct. Eventually consistent data has lower timeliness requirements; it is enough that the final result is consistent.
The 20 operations include registration, registration verification code, login, gesture-password unlock, home page, browsing the product list and so on. Some of them, such as product balance, order creation, payment SMS and payment, have strong consistency requirements. These account for about 1/7 of the operations, so the strongly consistent traffic is about 500 requests per second.
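The arithmetic behind these figures can be checked with a few lines. This is only a worked version of estimation method two using the numbers quoted in the text; the function name `peak_requests_per_second` is made up for the example.

```python
def peak_requests_per_second(online_users, ops_per_user, peak_seconds):
    """Estimation method two: users x operations per user / peak duration."""
    return online_users * ops_per_user / peak_seconds


# 20,000 users, about 20 operations each, compressed into about 2 minutes.
total = peak_requests_per_second(20_000, 20, 2 * 60)

# About 1/7 of the operations need strong consistency.
strong = total / 7

print(round(total))   # 3333, which the text rounds to "about 3,000"
print(round(strong))  # 476, which the text rounds to "about 500"
```

The split matters because the two kinds of traffic are sized and handled separately: the larger share can be served from caches and read replicas, while only the smaller share has to go through the locking path.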
Eventually consistent traffic is simple to handle and can be solved by adding machines: add application servers to process more concurrency, and use database read-write separation. For data with high real-time requirements, use read-write separation and shorten the cache time; for data with low real-time requirements, use a longer cache time.
Strongly consistent traffic is generally handled with locking: you can use database locks, ZK (ZooKeeper), or a queuing mechanism directly. Database locks are basically adequate below about 2,000 requests per second, and in that range they can completely prevent concurrency problems. The first method uses a transaction: update the shared resource, then query it again and check the result. If the result is valid, continue; if it is not, roll back. The second method needs no transaction: add a condition to the update that checks whether the resource is still sufficient, and if no row is updated, return immediately.
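A minimal sketch of the two database approaches, using Python's built-in `sqlite3` module as a stand-in for the production database. The table `product`, its columns and the function names are assumptions made for the example; in a server database the `UPDATE` would hold a row lock until commit, which is what serializes concurrent buyers.

```python
import sqlite3


def open_demo_db(stock):
    """Create an in-memory database with one product row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE product (id INTEGER PRIMARY KEY, stock INTEGER NOT NULL)")
    conn.execute("INSERT INTO product (id, stock) VALUES (1, ?)", (stock,))
    conn.commit()
    return conn


def buy_with_transaction(conn, product_id, amount):
    """Method 1: update, re-read, and roll back if the result is invalid."""
    try:
        conn.execute("UPDATE product SET stock = stock - ? WHERE id = ?",
                     (amount, product_id))
        (stock,) = conn.execute("SELECT stock FROM product WHERE id = ?",
                                (product_id,)).fetchone()
        if stock < 0:
            conn.rollback()   # oversold: undo the update
            return False
        conn.commit()
        return True
    except Exception:
        conn.rollback()
        raise


def buy_with_condition(conn, product_id, amount):
    """Method 2: one guarded UPDATE; zero affected rows means not enough stock."""
    cur = conn.execute(
        "UPDATE product SET stock = stock - ? WHERE id = ? AND stock >= ?",
        (amount, product_id, amount))
    conn.commit()
    return cur.rowcount == 1
```

The second form does the check and the change in a single statement, so there is no window between reading the stock and writing it.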
What if the traffic still cannot be handled?
Doing all this already withstands very heavy traffic, but the business may keep growing. What if the system still cannot cope?
The first principle is to avoid distributed operations; the best approach is a single point with queued processing.
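One way to picture "single point, queued processing" is a single worker that owns the inventory and handles requests strictly one at a time. This is an in-process sketch under that assumption, with a hypothetical class name `SingleWorker`; the article does not say how the real queue is implemented, and in production it would more likely be a message queue in front of one consumer.

```python
import queue
import threading


class SingleWorker:
    """All purchases go through one queue and one thread, so no lock is needed."""

    def __init__(self, stock):
        self._stock = stock
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            amount, result = self._requests.get()
            if amount is None:          # shutdown signal
                break
            ok = amount <= self._stock  # only this thread touches _stock
            if ok:
                self._stock -= amount
            result.put(ok)

    def buy(self, amount):
        """Enqueue a purchase and wait for the worker's answer."""
        result = queue.Queue(maxsize=1)
        self._requests.put((amount, result))
        return result.get()

    def close(self):
        self._requests.put((None, None))
        self._thread.join()
```

Callers may arrive from any number of threads, but the decision is made in arrival order by one thread, which is why the stock can never go negative.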
Second, if the concurrency at that single point is too high, split the lock granularity in a suitable way. For example, alongside the 12-month product, add a 9-month term, a 6-month term and so on. If that is still not enough, sell the products by province, with each province having its own inventory, which makes the granularity much finer.
Third, add degradation to the requirements: service quality can be reduced appropriately as long as normal use is not affected. Adjust the requirements a little and lengthen the time users wait for a result. If users wait 2 seconds, can the system then handle 4,400 requests every two seconds? The answer is yes: users can be asked to wait a moment, and with suitable interaction design they still have a good experience. Finally, adjust the operations strategy to spread out the times when users are concentrated and active.
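Splitting the lock granularity amounts to giving each slice of inventory its own lock, so buyers of different slices never wait on each other. The sketch below assumes an in-process lock per shard and hypothetical names (`ShardedInventory`, the shard keys); in the article's setting the shards would be separate inventory rows or products, for example one per term or per province.

```python
import threading


class ShardedInventory:
    """One independent counter and lock per shard (e.g. per province)."""

    def __init__(self, stock_by_shard):
        self._stock = dict(stock_by_shard)
        self._locks = {name: threading.Lock() for name in self._stock}

    def buy(self, shard, amount):
        # Only buyers of the same shard contend for this lock.
        with self._locks[shard]:
            if self._stock[shard] < amount:
                return False
            self._stock[shard] -= amount
            return True

    def remaining(self, shard):
        with self._locks[shard]:
            return self._stock[shard]


# Hypothetical shard names, for illustration only.
inventory = ShardedInventory({"province_a": 100, "province_b": 50})
inventory.buy("province_a", 30)
```

The trade-off is that stock can sell out in one shard while another still has some left, so the split has to match how the product is actually sold.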
The challenge of Internet high concurrency to financial system architecture from the view of pleasant loan system architecture