Ctrip case study: using HTML5 to monitor website performance


By Shingdong, Liu, Xie School

HTML5 brings many new elements to the web. It not only makes websites more attractive and the interactive experience smoother, but also makes possible many functions that could not be implemented before. This article focuses on the new features that HTML5 brings to website performance monitoring and shares Ctrip's practical experience in this area.

The current state of website performance monitoring

Website performance is receiving more and more attention, because it directly affects the user experience. Most Internet companies monitor only server performance and network I/O. This has given rise to a number of third-party website performance monitoring companies, such as Keynote, Jiankongbao ("monitoring treasure"), GA and so on. They all have one thing in common: they deploy monitoring points in major cities across the country and periodically access pages to obtain performance data. The data is then aggregated into reports and made available to the end user.

Advantages and disadvantages of third-party monitoring

Third-party monitoring has the following advantages and disadvantages.

Advantages

No changes to existing program code are needed. Because third-party monitoring uses active access and collection, you only need to configure the URLs of the relevant pages in the management interface, and the service simulates user visits. Developers do not need to be involved.

Rich data can be collected. Because the browsers used for simulated access are deployed by the vendor, the vendor can add custom plug-ins or integrate other performance tools on the client and collect all kinds of performance data programmatically.

The data volume is small and easy to aggregate. This depends on how the vendor deploys its monitoring points, but because of cost, vendors generally deploy only in first- and second-tier cities, so the data is relatively easy to summarize and less complex to analyze.

Problems can be reproduced and verified. Because the data is rich and the monitoring point where the problem occurred is under control, problems are easy to reproduce and troubleshoot.

Disadvantages

Large up-front investment. Deploying monitoring points requires a lot of equipment. If the points serve only one company, the cost-effectiveness is poor and a large amount of money is needed.

It is difficult to test under different browsers. Monitoring points cannot cover every browser that users run. Different businesses have different customer groups, so browser shares differ as well. As a result, monitoring points generally use only IE and WebKit-based browsers.

Feedback is delayed. A monitoring point usually does not monitor a single page. It runs many tests, and to avoid interference (especially for performance tests) they are queued and run one after another. This makes one cycle very long, and the final report is available only after all monitoring points have finished, so it cannot reflect the current state in time.

It is difficult to monitor pages that can only be reached through a strongly dependent process. The booking process, for example, requires posting a large amount of information and is time-sensitive, which is a challenge for monitoring points. Some vendors now offer simple scripting capabilities, but these cannot keep up with increasingly complex business requirements.

Monitoring points are limited and cannot cover the whole user base. More monitoring points can be added, but they can never cover every network environment, so the data is only a reference and does not represent the real user experience.

What did HTML5 bring us?

The new performance timing standard introduced with HTML5 has been implemented in IE9 and in the latest Firefox and Chrome. Its precision reaches the millisecond level, and its detailed timestamps let us derive many key metrics.

Here, let's take a quick look at some of the available indicators (Figure 1). Many of them help us understand client performance and the customer experience. For example, server-side processing time + network transmission time (relatively short) = responseStart - requestStart, and client white-screen time = domInteractive - navigationStart (or responseStart in place of navigationStart).

  

Figure 1 HTML5 performance metrics (image from the official W3C website)

For Ctrip, the main indicators we monitor are the following.

1. Total time: from the start of page navigation to the page's onload;

2. DNS (domain name resolution time): from the start of domain name resolution for the page to its completion;

3. Connect (time to establish the TCP connection with the server): from initiating the TCP connection to completion of the three-way handshake;

4. Request time: from sending the page request until the server returns the first byte;

5. Response time: from receiving the first byte from the server until the page itself has finished downloading;

6. DomReady (page DOM tree parsing): from the start of page navigation until the page's DOM elements are stable.

Next we look at the advantages and disadvantages of client-side data collection.
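
The six indicators above could be derived from a timing object roughly as follows. This is a minimal sketch, not Ctrip's actual code: the attribute names are those of the browser's performance.timing object, but which attribute marks the end of each interval (for example loadEventStart for onload, or domContentLoadedEventStart for DomReady) is an assumption, as are the function name and sample values.

```javascript
// Derive the six indicators from a Navigation Timing object.
// In a browser, pass window.performance.timing; all results are in milliseconds.
// The choice of end points (loadEventStart, domContentLoadedEventStart) is an
// assumption; the article does not say which attributes Ctrip uses.
function computeMetrics(t) {
  return {
    total: t.loadEventStart - t.navigationStart,        // 1. navigation start to onload
    dns: t.domainLookupEnd - t.domainLookupStart,       // 2. DNS resolution
    connect: t.connectEnd - t.connectStart,             // 3. TCP connection setup
    request: t.responseStart - t.requestStart,          // 4. request sent to first byte
    response: t.responseEnd - t.responseStart,          // 5. first byte to last byte
    domReady: t.domContentLoadedEventStart - t.navigationStart // 6. DOM ready
  };
}

// Made-up sample values so the function can be tried outside a browser.
const sampleTiming = {
  navigationStart: 1000, domainLookupStart: 1010, domainLookupEnd: 1030,
  connectStart: 1030, connectEnd: 1080, requestStart: 1085,
  responseStart: 1200, responseEnd: 1260,
  domContentLoadedEventStart: 1500, loadEventStart: 1900
};
console.log(computeMetrics(sampleTiming));
```

Keeping the calculation in a pure function that takes the timing object as a parameter makes it easy to test without a browser.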

Advantages

Real performance data from customer visits. Customers may be doing many other things while visiting the site, and other network applications may be consuming bandwidth, so real user data is what represents the customer's experience.

Browsers and operating system platforms can be distinguished. This is especially valuable when users run a variety of shell browsers and custom browsers, because ordinary tests cannot cover such a complex browsing environment.

Wide coverage and a more balanced geographical distribution. Compared with third-party monitoring, we can rely on JavaScript to collect data from every region, even overseas. The larger the site, the more meaningful this is: the data reflects users' local network conditions and shows the effect of CDN acceleration.

Native browser support and high precision. Millisecond-level precision matters for DNS time, connect time, and the execution time of browser initialization events.

Disadvantages

Nothing can be done for older browsers. Performance data collection requires HTML5 support, and the biggest drawback is that IE6, IE7 and IE8 do not support this standard. However, given how fast HTML5 is spreading, the problem will fade as newer browser versions are released, so we do not need to worry much.

A small amount of JavaScript code has to be deployed. As with the Google Analytics loading mechanism, the code must be embedded at the bottom of each page. The workload depends on the site architecture. If there is a shared footer, it is actually very small.

Problems cannot be reproduced. Because the data comes from customers, the state at the time cannot be preserved and the customer's environment is hard to simulate, which makes troubleshooting harder.

Ctrip's best practices

Ctrip has accumulated some experience in data collection. The main implementation approach and environment setup are shown in Figure 2.

  

Figure 2 Main implementation approach and environment setup

JavaScript collection / data postback

When the page has loaded, the JavaScript code deployed on the page reads the performance information from the performance.timing object, assembles the data into URL parameters, and sends it to the collector server by simulating an image request. The image request works the same way as in Google Analytics, that is, new Image().src = url. This method is widely used, works across domains, and has good compatibility.

This postback method also has a shortcoming. As we all know, the length of the parameters of a GET request is limited, so we have to watch the length of the postback data and truncate overlong information. Otherwise, the overlong request may be discarded outright, which hinders subsequent processing and analysis.
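
One way to combine the image postback with the length limit is sketched below. This is an illustration under stated assumptions rather than Ctrip's script: the collector host name, the 2000-character cap and both function names are hypothetical, and real limits vary by browser and server.

```javascript
// Hypothetical collector endpoint and an assumed conservative URL length cap.
const COLLECTOR_URL = "http://collector.example.com/_.gif";
const MAX_URL_LENGTH = 2000;

// Build the postback URL, trimming values that would push it past maxLength.
// Fields that do not fit even when empty are dropped, along with all later fields.
function buildBeaconUrl(baseUrl, data, maxLength) {
  const parts = [];
  let length = baseUrl.length;
  for (const [key, raw] of Object.entries(data)) {
    // Split by code point so trimming never cuts a surrogate pair in half.
    const chars = Array.from(String(raw));
    const encodedKey = encodeURIComponent(key);
    let pair = encodedKey + "=" + encodeURIComponent(chars.join(""));
    // The extra 1 accounts for the "?" or "&" separator in front of the pair.
    while (length + 1 + pair.length > maxLength && chars.length > 0) {
      chars.pop();
      pair = encodedKey + "=" + encodeURIComponent(chars.join(""));
    }
    if (length + 1 + pair.length > maxLength) break;
    parts.push(pair);
    length += 1 + pair.length;
  }
  return parts.length > 0 ? baseUrl + "?" + parts.join("&") : baseUrl;
}

// Send the data by requesting an image; returns false outside a browser.
function sendBeacon(url) {
  if (typeof Image === "undefined") return false;
  new Image().src = url;
  return true;
}

// Example: in a page this would run after onload.
sendBeacon(buildBeaconUrl(COLLECTOR_URL, { dns: 20, connect: 50, total: 900 }, MAX_URL_LENGTH));
```

Putting the short numeric fields first means that a long free-text field at the end is the one that gets truncated.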

Nginx receive / write log

The collector service is a high-performance Nginx cluster. To minimize the client resources used when returning data, Nginx only writes a log entry and does no other processing. This lets the client's postback request complete and close quickly, minimizing the impact on the user experience. The common access log format of Nginx (and Apache) contains the full URL of the GET request, and the performance data we send back is recorded in the URL's parameters.

To optimize the load capacity of the collector cluster, we need to tune Linux, Nginx and so on.

On Linux, the maximum number of open files is one of the most critical parameters. Regular web servers often run dynamic scripts such as PHP and JavaScript, and each request involves database operations, so handling around 1,000 concurrent connections is already considered good, and the default Linux server configuration is usually sufficient for that level. Our scenario is very different: we do almost no processing and only write the access log. Nginx is known for its concurrency performance, and official figures say it can support 100,000 concurrent connections. On Linux, each connection corresponds to a socket file descriptor, so the maximum concurrency is bounded by the system's maximum number of open files. In addition, some network-related kernel parameters also need to be tuned for the application scenario.

On the Nginx side, we removed unneeded features and kept the HttpEmptyGifModule. This module returns a single 1x1-pixel GIF image for every incoming request. Because the image data is only a few bytes held in memory, Nginx can respond to client requests very quickly.
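
To illustrate how the performance data can later be recovered from such a log, here is a minimal sketch that extracts the URL parameters from one access-log line. It assumes the request is logged in the usual quoted form "GET <url> HTTP/x.y". The function name and the sample line are made up for illustration.

```javascript
// Extract the query-string parameters from one access-log line.
// Assumes the request appears as "GET <path>?<query> HTTP/x.y" in double quotes.
// Returns null if the line contains no GET request, {} if it has no parameters.
function parseBeaconLine(line) {
  const match = /"GET\s+(\S+)\s+HTTP\/[\d.]+"/.exec(line);
  if (!match) return null;
  const queryStart = match[1].indexOf("?");
  if (queryStart === -1) return {};
  const params = new URLSearchParams(match[1].slice(queryStart + 1));
  const result = {};
  for (const [key, value] of params) {
    result[key] = value; // values stay strings; convert to numbers downstream
  }
  return result;
}

// Made-up sample line.
const sampleLine =
  '203.0.113.7 - - [10/Jan/2013:12:00:00 +0800] ' +
  '"GET /_.gif?dns=20&connect=50&total=900 HTTP/1.1" 200 43 "-" "Mozilla/5.0"';
console.log(parseBeaconLine(sampleLine));
```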

location = /_.gif { empty_gif; }

With the configuration above, a request to http://yourdomain/_.gif returns a 1x1-pixel GIF image, and the response is very fast.

Read log / send to queue

A dedicated log shipping agent follows the log with a tail-like mechanism and sends new log entries to a message queue in real time for subsequent processing. This step serves two purposes. First, it gathers the logs scattered across the cluster into one place, which is a log aggregation process. Second, it decouples the analysis program from the Nginx servers, which protects the high availability of the Nginx cluster as much as possible and also maximizes the availability of the raw data.

Read from queue / real-time analysis on the Storm cluster

The back-end data analysis program uses Storm, a distributed real-time stream processing framework. With this framework, processing capacity can be scaled out to handle the massive amount of collected data, and the latency of real-time stream computation is very small, so page performance information is available in real time and timely warnings become possible.

Storm abstracts data processing into a topology composed of logical units (Figure 3). Each logical unit consists of an operation plus its input and output. In Storm's terms, these logical units fall into two main classes, spout and bolt, where a spout is a source of data and a bolt processes the data it receives.

  

Figure 3 Schematic diagram of a Storm topology (from the official Storm website)

These topologies are distributed to the physical nodes in the cluster, which allows the data to be computed quickly and efficiently.

The work we do on the Storm cluster includes analysis of browser, operating system and geographical location. The analyzed data directly supports arbitrary queries and reports by region, network operator, system platform, browser type, and specific page.
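
The kind of per-dimension aggregation described here can be sketched in a few lines. Note that this is a single-process illustration of the grouping logic only, not Storm code, and the record field names (browser, region, total) are hypothetical.

```javascript
// Group records by one dimension and compute the count and average of a metric.
// records: array of objects such as { browser, region, total }, where total is
// a duration in milliseconds. The field names are assumptions for illustration.
function averageBy(records, dimension, metric) {
  const groups = new Map();
  for (const record of records) {
    const key = record[dimension];
    const group = groups.get(key) || { count: 0, sum: 0 };
    group.count += 1;
    group.sum += record[metric];
    groups.set(key, group);
  }
  const result = {};
  for (const [key, group] of groups) {
    result[key] = { count: group.count, average: group.sum / group.count };
  }
  return result;
}

// Made-up sample data.
const sampleRecords = [
  { browser: "IE9", region: "Shanghai", total: 900 },
  { browser: "IE9", region: "Beijing", total: 1100 },
  { browser: "Chrome", region: "Shanghai", total: 600 }
];
console.log(averageBy(sampleRecords, "browser", "total"));
console.log(averageBy(sampleRecords, "region", "total"));
```

Because only a count and a sum are kept per group, partial results from several nodes can be merged by adding them, which is what makes this kind of statistic easy to distribute.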

Generate real-time reports / warning emails / warning messages

With the powerful real-time processing capability of the Storm framework, log analysis can produce real-time reports quickly. In addition, by comparing against historical data, we can raise early warnings on abnormal performance data, including sending warning emails and warning messages.

Real-time reports are computed directly in memory with the help of Storm's DRPC (Distributed Remote Procedure Call) (Figure 4), which aggregates the most recent data cached on each processing node to generate a report.

  

Figure 4 Schematic diagram of Storm distributed RPC (from the official Storm website)

We apply a set of rules to the real-time data stream to detect warning conditions. When a warning is triggered, the relevant information is sent to the alarm system as an event. According to its configuration, the alarm system sends warning emails or SMS messages to the relevant personnel.

Daily, weekly, quarterly and yearly summaries

On the basis of Storm's output data, scheduled jobs summarize the data by day, week, quarter and year. The summary data is convenient for historical queries and provides a reference for real-time warnings and long-term performance evaluation. At the same time, since the old data is kept at several granularities, outdated bulk detail data can be discarded gradually, reducing the pressure on the database.
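
The article does not describe the rules themselves. As one hypothetical example, a rule could compare the current average of a metric with its historical baseline and raise an event when the increase exceeds a tolerance. The function name, the event fields and the 30% threshold below are all assumptions.

```javascript
// Hypothetical warning rule: fire when the current average exceeds the
// historical baseline by more than `tolerance` (0.3 means 30%).
// Returns an event object for the alarm system, or null if nothing is wrong.
function checkAlert(metricName, currentAvg, baselineAvg, tolerance) {
  if (!(baselineAvg > 0)) return null; // no usable history to compare against
  const increase = (currentAvg - baselineAvg) / baselineAvg;
  if (increase <= tolerance) return null;
  return {
    metric: metricName,
    current: currentAvg,
    baseline: baselineAvg,
    increase: increase
  };
}

// Example with made-up numbers: total load time rose from 1000 ms to 1500 ms.
console.log(checkAlert("total", 1500, 1000, 0.3));
console.log(checkAlert("total", 1100, 1000, 0.3));
```

Returning a plain event object keeps the rule separate from delivery, so the alarm system can decide whether to send an email or an SMS.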
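
A daily rollup of this kind might look like the sketch below, which reduces detail records to one summary row per day so that the detail rows can later be discarded. The record shape ({ timestamp, total }), the use of UTC days and the function name are assumptions for illustration.

```javascript
// Roll up detail records into one summary row per UTC day.
// records: array of { timestamp, total }, where timestamp is in milliseconds
// since the epoch and total is a duration in milliseconds (assumed shape).
function summarizeByDay(records) {
  const days = new Map();
  for (const { timestamp, total } of records) {
    const day = new Date(timestamp).toISOString().slice(0, 10); // "YYYY-MM-DD"
    const summary = days.get(day) || { day: day, count: 0, sum: 0, max: 0 };
    summary.count += 1;
    summary.sum += total;
    summary.max = Math.max(summary.max, total);
    days.set(day, summary);
  }
  // Rows come out in the order each day was first seen.
  return Array.from(days.values()).map((s) => ({
    day: s.day,
    count: s.count,
    average: s.sum / s.count,
    max: s.max
  }));
}

// Made-up sample data spanning two days.
console.log(summarizeByDay([
  { timestamp: Date.UTC(2013, 0, 10, 1), total: 800 },
  { timestamp: Date.UTC(2013, 0, 10, 23), total: 1000 },
  { timestamp: Date.UTC(2013, 0, 11, 0), total: 600 }
]));
```

Weekly, quarterly and yearly summaries can be built the same way from the daily rows, provided each row keeps its count so that averages are weighted correctly.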

Is there anything else we can do?

Building the whole environment takes a lot of manpower and resources, and many people may doubt its value. We would like to say that client-side data collection is well worth the investment. Besides the page timing data described above, the same infrastructure can be reused in many other ways.

One example is collecting client-side JavaScript errors, using a combination of try/catch and onerror. At Ctrip, we also classify this kind of data as website performance data: JavaScript errors directly affect users' impression of the website and their spending on it, which is directly tied to profit and cannot be neglected.

Another example is collecting user behavior data, which yields the basic access information for a page. Users' navigation paths and all their actions on a page can help improve existing products and, where conditions permit, can be combined with A/B testing to provide valuable input for new product development.

Therefore, we can use our imagination and broaden our thinking to obtain more meaningful information and carry out more meaningful research.

(All three authors are from Ctrip.)
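
Combining try/catch with onerror could look roughly like this. It is a sketch under assumptions, not Ctrip's collector: the function names and the shape of the report object are hypothetical, and `report` stands for whatever sends the data back (for example an image request).

```javascript
// Wrap a function so that any exception it throws is reported, then rethrown.
function guard(fn, report) {
  return function (...args) {
    try {
      return fn.apply(this, args);
    } catch (err) {
      report({
        message: String((err && err.message) || err),
        stack: (err && err.stack) || ""
      });
      throw err;
    }
  };
}

// Catch errors that escape every try/catch. `win` is the browser's window
// object; any handler that was already installed is still called.
function installOnError(win, report) {
  const previous = win.onerror;
  win.onerror = function (message, source, lineno, colno, error) {
    report({
      message: String(message),
      source: source,
      lineno: lineno,
      colno: colno,
      stack: (error && error.stack) || ""
    });
    // Returning false leaves the browser's default error handling in place.
    return typeof previous === "function" ? previous.apply(this, arguments) : false;
  };
}

// In a page (skipped outside a browser):
if (typeof window !== "undefined") {
  installOnError(window, (info) => console.log("collected", info));
}
```

Because guard rethrows, an error caught there will usually reach onerror as well, so a real collector would need to deduplicate reports or stop rethrowing.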

This article is selected from Programmer magazine, 2013, issue 1. Reprinting without permission is not allowed. To reprint, please contact market@csdn.net
