The Evolution of the Wireless Architecture of the TripAdvisor Website


What began as a single stand-alone system has grown into hundreds of distributed deployments. This article shares the problems we encountered, and the solutions we adopted, as the mobile website and wireless systems grew from small to large, in the hope that it offers some useful reference. It covers four aspects: the service-oriented transformation, the pain of the Nanjing and Beijing data centers, performance optimization practice, and the evolution of the app client technology.

Service-Oriented Transformation

Our service-oriented transformation started in 2011, when we mainly extracted the member service. In 2012 we built the Search 2.0 service. 2013 was the year of the biggest push: Search 3.0, the price center, the order center, and the product base data system were all turned into services. In 2014 came the TSP (service management platform), shared business systems, and the resource search service, followed in 2015 by production category services and the open API service.

As this timeline shows, our service-oriented transformation was not achieved overnight; it was a long process, and each split was like changing the tires on a car moving at high speed. Notice that we split out Search 2.0 in 2012 and launched Search 3.0 soon afterwards, in 2013.

The difference between the two versions is this. When we built Search 2.0 we had no experience: although we used Solr, a very mature open source search engine, to build the platform, we never clearly defined the boundary between the search platform and the business systems. The search platform ended up carrying very heavy logic and was used as a data aggregation platform, with both the site's list pages and detail pages pulling their data out of search. As a result, the logic for assembling data sources inside search became very complex, the search developers spent 70% of their time on integration logic with business systems, indexing efficiency was low, and performance was unstable, so it was gradually retired. Having learned that lesson, we built the Search 3.0 platform to provide list search only, with unified list fields. The data push logic was moved out of search: each product system pushes its own data, and search itself focuses on performance and stability, gradually adding intelligent ranking and manual intervention over search results. To date, Search 3.0 is the most stable system in our company.
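
As a rough illustration of this push model (not the actual interface of our search platform), the sketch below shows how a product system might push a unified list document to a hypothetical Search 3.0 indexing endpoint; the URL, document fields, and error handling are all assumptions.

```go
// Hypothetical sketch: a product system pushing a unified list document to
// the search platform. The endpoint and fields are illustrative only.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// ListDoc is a hypothetical unified list document accepted by Search 3.0.
type ListDoc struct {
	ProductID int64    `json:"product_id"`
	Title     string   `json:"title"`
	Price     int64    `json:"price_cents"`
	Tags      []string `json:"tags"`
}

func pushToSearch(doc ListDoc) error {
	body, err := json.Marshal(doc)
	if err != nil {
		return err
	}
	// Hypothetical internal indexing endpoint of the search platform.
	resp, err := http.Post("http://search.internal/api/v3/list/index",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("index push failed: %s", resp.Status)
	}
	return nil
}

func main() {
	doc := ListDoc{ProductID: 12345, Title: "Sample tour", Price: 199900, Tags: []string{"beach"}}
	if err := pushToSearch(doc); err != nil {
		fmt.Println("push failed:", err)
	}
}
```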

Within this service-oriented process, two services stand out technically: the price calculation service and the service governance platform.

Price Calculation Service

Technically, the price calculation service has two difficulties: a product group's price depends on many factors along a deep dependency path, and those factors change price frequently, especially in the high season. The service therefore has to be designed for large capacity while remaining close to real time.

The price calculation service has been built up since 2013 and has gone through four phases: a synchronous architecture, an asynchronous architecture, a concurrent architecture, and a distributed architecture, as shown in Figure 1.


Figure 1: Service-oriented transformation: the price calculation service

Synchronous architecture: systems interact mainly through interfaces. Other systems call an interface to ask the price center to start a calculation, and the price center calls their interfaces to fetch all the resources the price depends on. The whole calculation runs as a serial pipeline, and the efficiency is low; it was only sufficient for small-scale computation.

Asynchronous architecture: systems interact through MQ, and the price center reads the data it depends on from the other systems' databases, which speeds up data access. The calculation is split into two stages: first, for each resource offered by multiple vendors, the lowest cost price of the resource is computed; then the lowest product price is computed from those results. This architecture reads data far more efficiently than the synchronous one and accelerates calculation by pre-generating data, improving overall performance by up to 3 times.
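
The two-stage calculation can be sketched as follows; the types, field names, and the simplified "sum of cheapest resources" product rule are illustrative assumptions, not our actual pricing logic.

```go
// Minimal sketch of the two-stage calculation: first the lowest vendor cost
// per resource, then the lowest product price built from those results.
package main

import "fmt"

// VendorQuote is a hypothetical cost quote from one vendor for one resource.
type VendorQuote struct {
	ResourceID int64
	Cost       int64 // in cents
}

// Stage 1: lowest cost per resource across all vendors.
func lowestCostPerResource(quotes []VendorQuote) map[int64]int64 {
	lowest := make(map[int64]int64)
	for _, q := range quotes {
		if c, ok := lowest[q.ResourceID]; !ok || q.Cost < c {
			lowest[q.ResourceID] = q.Cost
		}
	}
	return lowest
}

// Stage 2: lowest product price as the sum of its required resources'
// lowest costs (a simplification of the real product assembly rules).
func lowestProductPrice(required []int64, lowest map[int64]int64) int64 {
	var total int64
	for _, rid := range required {
		total += lowest[rid]
	}
	return total
}

func main() {
	quotes := []VendorQuote{{1, 50000}, {1, 48000}, {2, 30000}}
	lowest := lowestCostPerResource(quotes)
	fmt.Println(lowestProductPrice([]int64{1, 2}, lowest)) // 78000
}
```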

Concurrent architecture: first, the price library's own data (resource cost prices, product group departure periods) was split across databases and tables to increase data capacity. Then products were divided by access frequency into hot and cold data, reducing the calculation frequency of cold data and increasing that of hot data, and an in-memory data structure organized along three resource dimensions improved read and write efficiency during calculation. Overall performance is 3.5 times better than the asynchronous architecture, and the price calculation time for each departure period of a group is kept below 200 ms.
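
A minimal sketch of the hot/cold scheduling idea, assuming hypothetical recalculation intervals and a placeholder recalculation function (real intervals would be much longer):

```go
// Sketch: hot product groups are recalculated more often than cold ones.
package main

import (
	"fmt"
	"time"
)

func recalc(group string) {
	// Placeholder for the real per-departure-period price calculation.
	fmt.Println("recalculating", group, "at", time.Now().Format(time.Kitchen))
}

func schedule(groups []string, every time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			for _, g := range groups {
				recalc(g)
			}
		case <-stop:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	go schedule([]string{"hot-group-1"}, 5*time.Second, stop)   // hot: frequent
	go schedule([]string{"cold-group-9"}, 30*time.Second, stop) // cold: rare
	time.Sleep(1 * time.Minute)
	close(stop)
}
```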

Distributed architecture: by parsing the binlog of the databases we depend on, the dependent data is converted into an in-memory database structure suited to the calculation, further improving read efficiency and removing the heavy dependence on the database itself. Sharded MQ enables local access and local computation, and Unix domain sockets are used for local communication, so that the resources and traffic each compute instance depends on are confined to its own server, maximizing I/O capability and minimizing I/O loss. Overall performance is 2 times better than the concurrent architecture, and the price calculation time for each departure period is kept below 100 ms.
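
The local-communication piece can be illustrated with a small Unix domain socket server; the socket path and the request/response format are assumptions.

```go
// Sketch: local inter-process communication over a Unix domain socket, the
// mechanism used to keep each compute instance's traffic on its own server.
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
)

const socketPath = "/tmp/price-calc.sock" // hypothetical path

func serve() error {
	_ = os.Remove(socketPath) // clean up a stale socket from a previous run
	ln, err := net.Listen("unix", socketPath)
	if err != nil {
		return err
	}
	defer ln.Close()
	for {
		conn, err := ln.Accept()
		if err != nil {
			return err
		}
		go func(c net.Conn) {
			defer c.Close()
			// Echo one request line back as a stand-in for a price lookup.
			line, _ := bufio.NewReader(c).ReadString('\n')
			fmt.Fprintf(c, "price-for %s", line)
		}(conn)
	}
}

func main() {
	if err := serve(); err != nil {
		fmt.Println("server error:", err)
	}
}
```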

After the optimizations of these phases, the overall architecture of the price calculation service is shown in Figure 2.


Figure 2: Overall architecture of the price calculation service

The cost-calculation node among the distribution nodes is a preprocessing node that mainly computes resource cost prices, while the compute nodes on the physical machines are the units that actually execute price calculations. The scheduling node shards price calculations across machines according to a routing rule, and binlog synchronization follows a similar rule to synchronize data to the storage nodes on the corresponding physical machines, achieving local storage and local computation overall.
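
A toy version of such a routing rule is shown below; the modulo-based sharding and the node names are assumptions, since the article does not specify the actual algorithm.

```go
// Sketch: shard price calculations by product group so that binlog data and
// computation land on the same physical machine (local storage, local compute).
package main

import "fmt"

var computeNodes = []string{"node-a", "node-b", "node-c"} // hypothetical nodes

// nodeFor maps a product group ID to a compute node. Binlog synchronization
// would apply the same function so data and computation are co-located.
func nodeFor(productGroupID int64) string {
	return computeNodes[int(productGroupID%int64(len(computeNodes)))]
}

func main() {
	for _, id := range []int64{101, 102, 103} {
		fmt.Printf("group %d -> %s\n", id, nodeFor(id))
	}
}
```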

As of May 2015, the price calculation service performs about 900 million calculations per day, and each departure period is calculated on average twice a day. The price calculation service keeps iterating on I/O capability and computational efficiency, and we look forward to an even better architecture in the future.

Service Governance Platform

As the service-oriented transformation deepened and each system exposed more and more interfaces, the system as a whole gradually developed problems: a mesh of interface calls; cyclic dependencies among interfaces that could trigger avalanche effects; a lack of monitoring of service invocations; and load balancing implemented in hardware, which was hard to maintain. To address these issues we urgently needed a service governance platform to manage all of our services.

Starting from an open source service governance platform, we made some customizations and quickly built the TripAdvisor service governance platform, whose architecture is shown in Figure 3.


Figure 3: The TripAdvisor service governance platform

The registry is deployed as a cluster in master-slave mode: the master handles service address changes and heartbeat keep-alive, while the slaves serve queries. Master and slaves maintain long connections with heartbeats between them. If the master goes down, a slave takes over and changes its identity; each deployed registry instance can accept long-lived client connections only after it has acquired the master identity. When a service provider or consumer detects that the master is down, it tries to connect to the new master and establish a long connection. The registry persists the service list in a SQLite database, keeps the list of available service addresses in a highly available in-memory cache, and maintains long connections with heartbeats to both service providers and service consumers.
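
The heartbeat over a long connection might look roughly like the sketch below; the registry address, interval, and payload are assumptions.

```go
// Sketch: a service instance keeps a long TCP connection to the registry
// master and sends periodic heartbeats over it.
package main

import (
	"fmt"
	"net"
	"time"
)

func keepHeartbeat(addr string, interval time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
	if err != nil {
		return err
	}
	defer conn.Close()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		if _, err := conn.Write([]byte("PING\n")); err != nil {
			// A write failure is treated as a lost master; the caller would
			// re-resolve the current master and reconnect.
			return fmt.Errorf("heartbeat lost: %w", err)
		}
	}
	return nil
}

func main() {
	// Hypothetical registry master address.
	if err := keepHeartbeat("registry-master.internal:7001", 5*time.Second); err != nil {
		fmt.Println(err)
	}
}
```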

When a service provider starts, it registers with the registry through a common client component, and the registry updates the list of available service addresses. If the service has no audit record, it is treated as a new service pending approval: after it is submitted to the registry, it is not added to the available service list and cannot be accessed or discovered by service consumers until it has been manually approved on the administration page.

If a service provider goes down and its heartbeat stops, the registry updates the list of available service addresses, removes all of that provider's services, and issues a change notification. The heartbeat has a reconnection keep-alive mechanism, and a connection with no heartbeat within a certain time is closed. The service provider uses a connection pool that controls the number of long connections and sets a maximum; once the maximum is reached, new connections are rejected, guaranteeing the availability of the current system.
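
The connection cap could be implemented along these lines; the limit value and the handler are placeholders.

```go
// Sketch: a provider-side cap on long connections. When the configured
// maximum is reached, new connections are rejected rather than queued.
package main

import (
	"fmt"
	"net"
)

const maxConns = 200 // hypothetical per-provider limit

func handle(c net.Conn) {
	// Placeholder for the real RPC handling loop.
	fmt.Fprintln(c, "OK")
}

func serve(ln net.Listener) {
	slots := make(chan struct{}, maxConns) // counting semaphore
	for {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		select {
		case slots <- struct{}{}: // a slot is free: accept the consumer
			go func(c net.Conn) {
				defer func() { c.Close(); <-slots }()
				handle(c)
			}(conn)
		default: // at the limit: refuse, protecting existing connections
			conn.Close()
		}
	}
}

func main() {
	ln, err := net.Listen("tcp", ":7002")
	if err != nil {
		panic(err)
	}
	serve(ln)
}
```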

The administration page can query services, view service details and the list of available service addresses, view the list of a service's consumers, approve newly launched services, forcibly take services offline, adjust a service's load balancing policy in real time, and demote, promote, disable, or enable a given service provider.

The Pain of the Nanjing and Beijing Data Centers

This section introduces the data center deployment strategy of the TripAdvisor website. Until 2014 we basically maintained a dual Nanjing-Beijing data center architecture, which was reasonable under the circumstances of the time, but as traffic grew it gradually ran into problems, so in 2015 we switched to a single data center in Nanjing. In the future we will evolve towards a more stable, highly available "two locations, three centers" architecture.

At the time it was designed, the dual Nanjing-Beijing data center strategy served the business very well. Before 2010 more than 70% of orders were telephone orders, and travel orders have a relatively complex booking flow that requires a lot of manual involvement from customer service, so the order system had to be deployed in the Nanjing data center to give our customer service agents a good experience. At the same time, to give Internet users better connectivity, the website needed to be deployed in Beijing. Under this architecture we did a lot of system optimization work, mainly to solve the problem of data synchronization between the two remote data centers.

First, for the "read much, write little" characteristic of website data, we designed each subsystem along the lines of the typical design shown in Figure 4.


Figure 4: Typical subsystem design for the "read much, write little" characteristic of website data

Nanjing and Beijing are synchronized through the database master-slave replication mechanism: applications in the Beijing data center read the Beijing database and write through a leased line to the Nanjing database, keeping the data on both sides consistent.

This design ran well while volumes were small, but when the leased line was unstable, problems appeared. The most common was data synchronization delay: for example, a user who had just registered on the website could not log in immediately. To address this we adopted a fuse (circuit breaker) design: a dedicated process monitors the database synchronization delay and, when the delay reaches an upper limit, tries to synchronize over a public VPN instead, switching back when the leased line recovers.
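
A simplified sketch of such a fuse is shown below; how the replication delay is measured and how the switch to the VPN is performed are assumptions.

```go
// Sketch: monitor master-slave replication delay and fail over to a backup
// synchronization channel (the public VPN) when it exceeds a threshold.
package main

import (
	"fmt"
	"time"
)

const delayLimit = 30 * time.Second // hypothetical threshold

// replicationDelay would query the slave (e.g. heartbeat table timestamps)
// to measure how far it lags behind the master. Stubbed for the sketch.
func replicationDelay() time.Duration {
	return 5 * time.Second
}

func monitor(useVPN func(bool)) {
	onVPN := false
	for range time.Tick(10 * time.Second) {
		d := replicationDelay()
		switch {
		case d > delayLimit && !onVPN:
			onVPN = true
			useVPN(true) // leased line is lagging: fail over to the public VPN
		case d <= delayLimit/2 && onVPN:
			onVPN = false
			useVPN(false) // line has recovered: switch back
		}
	}
}

func main() {
	monitor(func(vpn bool) { fmt.Println("sync over VPN:", vpn) })
}
```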

In addition, to control the volume of synchronized data, all data synchronization uses compression to minimize the amount transferred. We also kept expanding the capacity of the leased line.

As the business kept growing and the volume of synchronized data grew with it, the challenges of this deployment architecture increased, and in early 2015 we finally merged the two data centers. The biggest challenge was the network conditions of the Nanjing data center: at the time we could not get access to a good multi-line BGP facility, so to give users good network service we adopted a dynamic CDN solution in which the Nanjing data center exposes only China Telecom egress IPs. Unicom and China Mobile users are routed by dynamic DNS resolution to nearby relay servers, which then reach the Nanjing Telecom lines over optimized routes. This solution provides good network service to users nationwide.

In terms of overall server deployment cost we saved at least 30%: first by no longer deploying the same systems twice in Nanjing and Beijing, and second by saving a large amount of leased line cost.

The current single data center strategy is a transitional arrangement; to further improve system availability and data security, we will move towards a standard "two locations, three centers" deployment strategy.

Performance optimization

For performance optimization we mainly introduce a few tools we built up along the way. Our approach is: first, keep pushing architecture evolution and system partitioning forward and expand resources in advance, to guarantee overall carrying capacity; then keep improving monitoring, make performance indicators concrete, and find and fix problems, to guarantee overall stability. This is achieved mainly through three tools: Codis, BWT, and OSS.

Codis is Wandoujia's distributed Redis cluster solution, developed in Go and C and fully compatible with Twemproxy. The Codis layer handles request forwarding, non-stop data migration, and so on, and all of this underlying work is transparent to the client; in short, the client can simply assume it is talking to a Redis service with unlimited memory. From no cache, to file caches, to memcached, to today's Codis cache, caching has become a necessity for a large architecture.

After adopting Codis, the application side no longer needs to care where the cache is stored, about cache expansion and data migration, or about cache data consistency, which greatly improves development and maintenance efficiency.
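
The following sketch shows what this transparency means in practice: the application connects to a Codis proxy exactly as it would connect to a single Redis instance. It uses the open source go-redis client; the proxy address and key are assumptions.

```go
// Sketch: plain Redis client code pointed at a Codis proxy address.
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	// The address points at a Codis proxy, but the code is ordinary Redis code.
	rdb := redis.NewClient(&redis.Options{Addr: "codis-proxy.internal:19000"})
	if err := rdb.Set(ctx, "page:cache:home", "<html>...</html>", 0).Err(); err != nil {
		fmt.Println("set failed:", err)
		return
	}
	val, err := rdb.Get(ctx, "page:cache:home").Result()
	if err != nil {
		fmt.Println("get failed:", err)
		return
	}
	fmt.Println(len(val), "bytes cached")
}
```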

BWT is a self-developed active cache update service built to further improve page generation efficiency. When an application system changes data, it pushes the data that needs updating to BWT, and BWT updates the cache according to the configured update policy. Data pushed by application systems is generally updated after a delay of about 3 minutes. BWT also analyses logs to identify hot data and refreshes it automatically at configured intervals. During the update process, updates are automatically paused if the target machine is under high load.
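
A minimal sketch of this delayed-update behaviour, with a stubbed load probe and refresh function, might look like this:

```go
// Sketch: a changed key is queued and refreshed after a delay, and updates
// are skipped while the target machine reports high load.
package main

import (
	"fmt"
	"time"
)

const updateDelay = 3 * time.Minute

func targetOverloaded() bool { return false } // stub for a real load probe

func refresh(key string) { fmt.Println("regenerating cache for", key) }

func worker(updates <-chan string) {
	for key := range updates {
		time.Sleep(updateDelay) // honor the ~3 minute delay before refreshing
		for targetOverloaded() {
			time.Sleep(30 * time.Second) // back off while the target is busy
		}
		refresh(key)
	}
}

func main() {
	updates := make(chan string, 1024)
	go worker(updates)
	updates <- "product:12345:detail" // pushed by an application system
	time.Sleep(4 * time.Minute)
}
```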

OSS is another self-developed system: a website operation monitoring system whose initial goal was to monitor and manage the performance, availability, and security of the website. It was later split out as an independent operation monitoring system providing monitoring services for all systems. Figure 5 shows the architecture of OSS.

Figure 5: Architecture of the OSS website operation monitoring system

Its main feature is that application systems send logs over UDP, minimizing the performance cost of log sending to the application. Logs are received through an NSQ queue, a consumer process written in Go processes them into a database, and the various statistical reports are finally rendered on web pages.
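
A minimal sketch of the fire-and-forget UDP log sending, assuming a hypothetical collector address and log format:

```go
// Sketch: the application writes a small UDP datagram and never blocks on
// the monitoring system.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("udp", "oss-collector.internal:8125") // hypothetical collector
	if err != nil {
		fmt.Println("logging disabled:", err)
		return
	}
	defer conn.Close()

	// A UDP write is best-effort: if the collector is down the datagram is
	// simply lost, so application latency is unaffected.
	msg := fmt.Sprintf("api=/order/create|cost_ms=%d|ts=%d", 42, time.Now().Unix())
	if _, err := conn.Write([]byte(msg)); err != nil {
		fmt.Println("log send failed:", err)
	}
}
```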

The various faults of the website can be discovered quickly from the error and performance charts. The system relies mainly on interface monitoring, slow SQL monitoring, memcached monitoring, Redis monitoring, and single-page performance monitoring.

App Client Technology Evolution

This part introduces our experience developing the TripAdvisor app, focusing on two aspects: online hot patching and front-end resource staticization.

Online Hot Patches

Because the app is released as a client package, fixing a bug in a package that has already shipped is a real headache. The traditional remedies are: server-side shielding, that is, temporarily blocking the problematic feature; jumping to an H5 page, redirecting the problematic native page to a corresponding H5 page; or an emergency release of a new version. Each has limitations: server-side shielding increases server-side code complexity and hides the feature; jumping to H5 degrades the user experience; and an emergency release increases operational cost and also degrades the user experience.

For this reason we introduced Alibaba's online hot patch technology, so that when a problem occurs we can quickly release a patch package that fixes it.

Static Front-End Resources

Because H5 has a short development cycle and is easy to deploy, the TripAdvisor app contains a large number of H5 pages, but for H5 pages the loss of user experience is also obvious. To render page elements faster and present them to the user sooner, we adopted a front-end resource staticization solution whose main idea is to load the static resources of H5 pages early. The implementation has the following points:

    1. Load static resources asynchronously: when the user opens the app, static files are downloaded or updated asynchronously.

    2. Optimize rendering to reduce unnecessary overhead. By optimizing the DOM layout, the pre-loaded static resources can be packaged, prioritized, fetched from the server, and then rendered, which speeds up first entry and reduces the number of first-screen DOM renders. Because WebView performance inside the app is lower than in the mobile browser, we use lazy loading, step-by-step loading, and an optimized rendering structure to cut unnecessary rendering overhead, for example by reducing the number of very expensive scrolling carousels. We also optimize interactions that cause DOM reflow and repaint: keeping DOM rearrangement to a minimum, separating newly added layers from the original DOM structure, and using some 3D CSS so that the GPU helps with page repainting.

These are some of the practical points from our architecture evolution. Although they may seem scattered, they are essentially introduced from the following three architectural perspectives:

Logical architecture: service orientation, i.e. how to abstract common business functions and provide them to other systems as services.
Physical architecture: the design of the Nanjing and Beijing data centers, the problems encountered, and the solutions.
System architecture: the non-functional side, such as performance optimization and app client performance improvement practice.

