Large-Scale Website Technical Architecture

Source: Internet
Author: User
Tags: failover, website performance

Core architecture elements of a large site

1. Performance

2. Availability

3. Flexibility

4. Extensibility

5. Security

Instantaneous response: The high-performance architecture of the website

1. Website Performance test:

1). Website performance from different perspectives

A. Site performance from the user's perspective: determined by the user's computer, the communication time with the Web server, the Web server's processing time, the browser's parsing and rendering time, and so on.

B. Site performance from the developer's perspective:

C. Site performance from the operations perspective: optimizing the backbone network, optimizing resource utilization with virtualization technology, and so on.

2). Performance Test Indicators

A. Response time: the time of a single request is too short to measure directly, so repeat the request, say, 10,000 times, measure the total time required for the 10,000 executions, and then divide by 10,000 to obtain the response time of a single request.
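As a rough sketch in Python (the `handle_request` function is a purely illustrative stand-in for real request handling), the repeat-and-average measurement looks like this:

```python
import time

def handle_request():
    # Stand-in for real request handling; purely illustrative.
    return sum(range(100))

N = 10_000
start = time.perf_counter()
for _ in range(N):
    handle_request()
total = time.perf_counter() - start

# The response time of a single request is the total divided by N.
avg_response_time = total / N
```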

B. Concurrency: the number of requests the system processes simultaneously; this number also reflects the system's load characteristics. A test program measures the system's concurrency by simulating concurrent users with multiple threads. To simulate real user behavior, the test program does not fire requests back to back: it inserts a random wait time between two consecutive requests. This pause is called think time.
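A minimal sketch of such a test program (the request itself is a stand-in; the user counts and think-time range are illustrative), with a random think time between consecutive requests:

```python
import random
import threading
import time

results = []  # one entry per completed request

def simulated_user(n_requests):
    for _ in range(n_requests):
        results.append(1)  # stand-in for sending one request
        # Think time: a random pause between two requests,
        # mimicking a real user pausing between actions.
        time.sleep(random.uniform(0.001, 0.005))

# 10 concurrent simulated users, 5 requests each
threads = [threading.Thread(target=simulated_user, args=(5,))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```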

C. Throughput: The number of requests processed per unit of time, reflecting the overall processing capacity of the system.

D. Performance counters: data metrics that describe the performance of a server or operating system, including system load, the number of objects and threads, memory usage, CPU usage, and disk and network IO. These counters are also important parameters for system monitoring: alarm thresholds are set for them, and when the monitoring system finds that a counter exceeds its threshold, it alerts the operations and development staff so that system anomalies are discovered and handled in time.

3). Performance test Method:

A. Performance testing: taking the performance indicators planned at design time as the expected goal, continuously apply pressure to the system to verify that, within the acceptable range of resource usage, it can reach the expected performance.

B. Load testing: keep increasing concurrent requests to raise the pressure on the system until one or more performance indicators reach a safety threshold, for example a resource becomes saturated. Beyond that point, applying more pressure no longer improves the system's processing capacity; it declines instead.

C. Stress testing: beyond the safe load, continue applying pressure until the system crashes or can no longer process any request, in order to find the maximum pressure the system can withstand.

D. Stability testing: under given hardware, software, and network conditions, apply a certain business load and let the system run for a long period to check its stability. A stability test does not apply pressure to the system uniformly.

E. Test results: the site's daily operating range, the system's maximum load point, and the system's crash point.

4). Performance test report: should include concurrency, response time, TPS (transactions per second), error rate (%), load, memory (GB), and remarks.

5). Performance optimization Strategy

A. Performance analysis: examine the logs of each step in request processing, find which step's response time is unreasonable or exceeds expectations, then examine the monitoring data and determine whether the main limiting factor is memory, disk, network, or CPU; a code problem or an unreasonable architecture design; or simply insufficient system resources.

B. Performance optimization

2. Web Front End performance optimization

1). Browser Access Optimization:

A. Reducing HTTP requests

B. Using browser caching: cache static resources. When a changed static resource must reach client browsers promptly, this can be done by changing the file name. When many static resources change at once, consider rolling the updates out batch by batch.

C. Enable compression

D. Put CSS at the top of the page and JavaScript at the bottom: if the page needs the JS while parsing, placing it at the bottom is not appropriate.

E. Reducing cookie transmission

2). CDN acceleration: a CDN (content delivery network) is deployed at the network providers closest to end users, so a user's network request always reaches the nearest CDN server first. The CDN caches some of the site's static resources (data that changes rarely) and can return them to the user at the fastest possible speed. Video sites and portals, for example, cache the hot content with the highest user access volume in the CDN.

3). Reverse proxy: the reverse proxy is part of the site's front-end architecture, deployed at the front of the site. When a user request reaches the site's data center, it first hits the reverse proxy server, which caches the site's static resources and can return them to the user without forwarding the request to the application server. All requests from the Internet must pass through the proxy server, which thereby stands as a barrier between the Web server and possible network attacks, protecting it.

3. Application Server Performance Optimization:

1). Distributed cache: the first law of website performance optimization: prioritize caching.

A. Fundamentals of caching: caching means storing data in relatively high-speed storage media for the system to use. On the one hand, cache access is fast, which reduces data access time; on the other hand, if the cached data is the result of computation, it can be used directly without repeating the calculation, so caching also reduces computation time.

The essence of a cache is an in-memory hash table, and site data access usually follows the 80/20 rule.

B. Using the cache appropriately: frequently modified data (unsuitable for caching); data with no access hot spots (not worth caching); data inconsistency and dirty reads (beware); cache availability (after a cache crash, can the database withstand the access that falls through to it?); cache warm-up (load cache data in advance); cache penetration (a simple countermeasure is to cache nonexistent data too, with a null value).

In general, the data's read/write ratio should be at least 2:1, i.e. a cached value should be read at least twice before the data is updated; only then is caching worthwhile.
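A minimal cache-aside sketch of the null-caching defense against cache penetration (the dict-backed `cache` and `db` are illustrative stand-ins for a real cache such as memcached and a real database):

```python
cache = {}                    # stand-in for a cache such as memcached
db = {"user:1": "alice"}      # stand-in for the database

MISSING = object()            # sentinel meaning "known not to exist"

def get(key):
    if key in cache:
        value = cache[key]
        return None if value is MISSING else value
    value = db.get(key)       # cache miss: fall through to the database
    # Cache the miss too, so repeated lookups of a nonexistent key
    # do not keep hitting the database (cache penetration defense).
    cache[key] = MISSING if value is None else value
    return value
```

After the first `get("user:99")`, the miss itself is cached, so a second lookup for the same nonexistent key never reaches the database.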

C. Distributed cache architecture: one kind is the distributed cache that requires update synchronization, represented by JBoss Cache; the other is the distributed cache whose servers do not communicate with each other, represented by memcached.

D. memcached: simple communication protocol, rich client libraries, high-performance network communication, efficient memory management, and a server cluster architecture in which servers do not communicate with each other.

2). Asynchronous operations: using a message queue to make calls asynchronous improves the extensibility of the website; in fact, it can also improve performance, since a message queue has a good peak-shaving effect.

Note that because a response is returned to the user as soon as the data is written to the message queue, the data may still fail subsequent business checks, database writes, and so on. Therefore, after a message queue is introduced for asynchronous business processing, the business process must be adjusted to match. For example, if order data is written to the message queue as soon as an order is submitted, the site cannot immediately tell the user the order succeeded; only after the order consumer process has actually processed the order, perhaps even after the goods have left the warehouse, should the user be notified by e-mail or SMS that the order succeeded, to avoid trading disputes.

Anything that can be done later should be done later.
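The order flow above can be sketched with Python's in-process `queue.Queue` standing in for a real message queue (the order IDs and the consumer's work are illustrative):

```python
import queue
import threading

order_queue = queue.Queue()
processed = []

def order_consumer():
    while True:
        order = order_queue.get()
        if order is None:           # shutdown signal
            break
        # Real work would happen here: business checks, database
        # writes, then notifying the user by e-mail or SMS.
        processed.append(order)
        order_queue.task_done()

consumer = threading.Thread(target=order_consumer)
consumer.start()

# The web tier only enqueues and returns at once: the user is told
# "order submitted", not yet "order succeeded".
for order_id in (101, 102, 103):
    order_queue.put(order_id)

order_queue.join()                  # wait until every order is consumed
order_queue.put(None)
consumer.join()
```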

3). Using the cluster

4). Code optimization:

A. Multithreading: from the perspective of resource utilization, there are two main reasons for using multithreading: IO blocking and multiple CPUs. Number of threads to start = [task execution time / (task execution time - IO wait time)] × number of CPU cores.

The main ways to achieve thread safety are designing objects to be stateless, using local variables, and using locks when accessing resources concurrently.
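The thread-count formula above can be turned into a small helper (the example numbers are illustrative):

```python
def optimal_thread_count(task_time, io_wait_time, cpu_cores):
    """Threads to start = [task time / (task time - IO wait)] * cores."""
    compute_time = task_time - io_wait_time
    return int(task_time / compute_time) * cpu_cores

# A task taking 80 ms in total, 60 ms of which is IO wait,
# on a 4-core machine: 80 / (80 - 60) * 4 = 16 threads.
print(optimal_thread_count(80, 60, 4))  # 16
```

Note the limiting case: a pure CPU-bound task (zero IO wait) gives exactly one thread per core, which matches intuition.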

B. Resource reuse: there are two main patterns: Singleton and object pool.

C. Data structure: hash, etc.

D. Garbage collection

4. Storage Performance Optimization:

1). Mechanical HDD VS solid-State Drive

2). B + Tree VS LSM tree

3). RAID VS HDFS

A. RAID (Redundant Array of Inexpensive Disks) technology is mainly designed to improve disk access speed and enhance disk availability and fault tolerance. Server-class computers currently support plugging in many disks (8 or more), and RAID technology implements concurrent read/write and data backup across multiple disks.

RAID0: data is split into N pieces according to the number of disks and written to the N disks concurrently, so the overall write speed is N times that of a single disk. Reads work the same way. But if any one disk is damaged, data integrity is destroyed.

RAID1: when writing, data is written to two disks at the same time, so no single disk failure causes data loss; inserting a new disk repairs it automatically by copying. High reliability.

RAID10: combines the RAID0 and RAID1 schemes: all disks are divided into two halves, and data is written to both halves simultaneously (equivalent to RAID1), while within each half of N/2 disks, RAID0 is used for concurrent access. Both reliability and performance improve, but RAID10's disk utilization is low.

RAID3: data is split and written concurrently to N-1 disks, with the checksum written to the Nth disk; the checksum disk is therefore rewritten on every write.

RAID5: similar to RAID3, but the checksum data is written spirally across all disks, avoiding the problem of one frequently rewritten checksum disk wearing out.

RAID6: data is written to N-2 disks, and the checksums are written spirally to two disks.
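The parity idea behind RAID 3/5/6 can be demonstrated with XOR (a toy sketch: real controllers operate on whole sectors, not small integers):

```python
def parity(blocks):
    """XOR all blocks together; the result is the checksum block."""
    p = 0
    for b in blocks:
        p ^= b
    return p

data = [0b1010, 0b0110, 0b1100]   # data blocks on disks 1..N-1
check = parity(data)              # checksum block on disk N

# Suppose disk 2 fails. Its block is recovered by XOR-ing the
# surviving data blocks with the checksum block.
recovered = parity([data[0], data[2], check])
assert recovered == data[1]
```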

B. HDFS (Hadoop Distributed File System) performs concurrent read/write and backup of data across the multiple servers of the storage cluster; it can be seen as providing RAID-like capability at server-cluster scale, so disk-level RAID is not required.

HDFS manages file content in blocks: a file is divided into several blocks, and when an application writes a file, each block is replicated to two other machines. HDFS thus ensures each block has three replicas, so even if two servers go down, the data can still be accessed, which is equivalent to RAID1's data replication.

When the file is processed and computed, the MapReduce concurrent computation framework can start multiple compute subtasks that read multiple blocks of the file at the same time and process them concurrently, which is equivalent to RAID0's concurrent access.

5. Website performance is a subjective feeling of the user, and the ultimate goal of performance optimization is to improve the user experience, i.e. to make users feel that the site is fast. Detached from this purpose, the pursuit of so-called high performance for its own sake has little significance. And whether the experience feels fast or slow can be improved by technical means as well as by optimizing the interaction experience.

Foolproof: the high-availability architecture of the website

1. Measuring and assessing website availability:

1). Website availability metrics: website unavailability is also called site failure, and the industry typically measures a site's availability by how many nines it achieves.
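The nines translate directly into allowed annual downtime; a quick calculator (using a simplified 365-day year):

```python
def annual_downtime_minutes(nines):
    """Allowed downtime per year for a given number of nines
    of availability (e.g. 4 nines = 99.99% available)."""
    unavailability = 10 ** (-nines)
    return unavailability * 365 * 24 * 60

# 2 nines (99%)    -> 5256 minutes, about 3.65 days per year
# 4 nines (99.99%) -> about 53 minutes per year
print(round(annual_downtime_minutes(4), 2))  # 52.56
```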

2). Website availability assessment:

2. Highly available website architecture: the main purpose of high-availability architecture design is to ensure that when server hardware fails, the service remains available and the data remains saved and accessible. The primary means of achieving high availability are redundant backup of data and services, and failover: when a disk is corrupted, the data is read from the backup disk.

3. High-availability applications:

1). Failover of stateless services through load balancing: for an application server cluster, the mechanism that monitors service availability in real time and transfers traffic automatically is load balancing. The load balancer determines whether a server is available through a heartbeat monitoring mechanism.

2). Application Server cluster Session management: Under the cluster environment, the session management mainly has the following methods:

A. Session replication: not suitable for large websites.

B. Session binding: can be implemented with the load balancer's source-address hashing algorithm, so that the load balancer always distributes requests originating from the same IP to the same server. However, this approach does not meet the system's high-availability requirement, because once a server goes down, its sessions are gone. It is rarely adopted.
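Source-address hashing can be sketched as follows (the server names are hypothetical):

```python
import hashlib

servers = ["app-1", "app-2", "app-3"]  # hypothetical application servers

def route(client_ip):
    """Source-address hashing: requests from the same IP always
    land on the same application server."""
    h = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

# Session binding: repeated requests from one IP stick to one server,
# but if that server goes down, its sessions are lost with it.
assert route("10.0.0.7") == route("10.0.0.7")
```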

C. Using cookies to record the session: the session is kept on the client; each request to the server carries the session, and after handling the request the server writes the modified session back in the response. Disadvantages: limited by cookie size, affects performance, cannot be used if the browser disables cookies, and so on.

D. Session server: a dedicated session server (or cluster) manages sessions, and the application server accesses the session server every time it reads or writes a session.

This solution actually splits the application server's state apart, into a stateless application server and a stateful session server, and then designs an architecture suited to the different characteristics of each.

For the stateful session server, a relatively simple method is to build on distributed caches, databases, and the like, wrapping them to meet the session storage and access requirements. If the business scenario has higher requirements for session management, such as integrating single sign-on (SSO) or user services into the session service, a dedicated session service management platform needs to be developed.

4. Highly available services: reusable service modules provide basic public services for business products; in a large website these services are usually deployed in a distributed way and called remotely by concrete applications. Reusable services, like applications, are stateless, so a load-balanced failover strategy can be used to achieve high availability. Beyond that, there are several high-availability service strategies used in practice:

1). Tiered management: in operations, core applications and services preferentially use better hardware, and their operational incidents get the fastest response. Services are also isolated as necessary at deployment time to avoid cascading failures: low-priority services are started in separate threads or deployed on separate virtual machines, and core services and data may even be deployed in data centers in different regions.

2). Timeout settings: because of service outages, thread deadlocks, and similar problems, an application's call to a service may stop responding, which leaves user requests unanswered for a long time, ties up application resources, and prevents requests from being transferred to a healthy server in time.

Set a service invocation timeout in the application; once it is exceeded, the communication framework throws an exception, and the application, according to its service scheduling policy, can choose to retry or to transfer the request to another server providing the same service.
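A sketch of timeout plus transfer-on-failure using raw sockets (the host names are hypothetical; in a real system this logic lives inside the RPC/communication framework):

```python
import socket

SERVERS = [("service-a.internal", 8080),   # hypothetical replicas
           ("service-b.internal", 8080)]

def call_service(payload, servers=SERVERS, timeout=2.0):
    """Try each replica in turn; on timeout or connection error,
    transfer the request to the next server providing the service."""
    for host, port in servers:
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.settimeout(timeout)     # also bound the read
                s.sendall(payload)
                return s.recv(4096)
        except OSError:                   # timeout, refused, unreachable...
            continue                      # fail over to the next replica
    raise RuntimeError("all replicas failed")
```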

3). Asynchronous invocation: application calls to services are completed asynchronously, for example through a message queue. Of course, not all service calls can be asynchronous: for a query of user information, the asynchronous approach would lengthen the response time, and operations that must confirm the service call succeeded before proceeding to the next step are likewise unsuitable for asynchronous invocation.

4). Service degradation: during peak access periods, services need to be degraded to guarantee that core applications and functions run properly. There are two ways to degrade: denying service and shutting down functions.

Denying service: refuse calls from low-priority applications to reduce the number of concurrent service calls and ensure the core applications work properly; or randomly deny a portion of request calls to save resources, letting the remaining requests succeed rather than having every request fail.

Shutting down functions: turn off some unimportant services, or some unimportant functions within a service, to save system overhead and free resources for the important services and functions.

5). Idempotent design: ensure that repeated invocations of a service produce the same result as a single invocation, i.e. the service is idempotent.
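A minimal sketch of idempotence via a unique operation ID (the in-memory set and account data are illustrative stand-ins for persistent deduplication storage):

```python
applied = set()                 # IDs of operations already applied
balance = {"alice": 100}        # illustrative account data

def deduct(op_id, account, amount):
    """Idempotent deduction: replaying the same op_id has no
    further effect and returns the same result."""
    if op_id in applied:
        return balance[account]      # duplicate call: do nothing
    applied.add(op_id)
    balance[account] -= amount
    return balance[account]

assert deduct("op-1", "alice", 30) == 70
assert deduct("op-1", "alice", 30) == 70   # retry is safe: still 70
```

This is why callers can safely retry after a timeout: whether the first attempt actually succeeded or not, the replay cannot deduct twice.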

5. Highly available data: the means of ensuring highly available data storage are mainly data backup and failover mechanisms. Data backup ensures that the data has multiple copies, so the failure of any one copy does not cause permanent data loss, achieving complete data persistence. The failover mechanism ensures that when one copy of the data is inaccessible, access can be switched quickly to another copy, keeping the system available.

High availability of cache services: a simple way is to share one distributed cache cluster across the whole site; individual applications and products do not deploy their own cache servers, but simply request resources from the shared cache cluster.

1). CAP principle: the CAP principle holds that a storage system providing data services cannot simultaneously satisfy Consistency, Availability, and Partition tolerance (the system remains scalable across network partitions).

Highly available data has the following layers of meaning: data persistence, data accessibility, and data consistency.

In large websites, it is common to strengthen the availability (A) and partition tolerance (P) of the distributed storage system and, to some extent, give up consistency (C). Specifically, data consistency comes in the following forms:

Strong consistency: the data in every replica is always consistent in physical storage.

User-level consistency: replicas may be inconsistent with each other, but through error-correction and validation mechanisms, a single correct result is returned to the user.

Eventual consistency: replicas may be inconsistent for a while, but after a period of time the data eventually becomes consistent.

2). Data backup:

Cold backup: its advantages are simplicity and low cost, both in expense and technical difficulty. Its disadvantages are that it cannot guarantee the data ends up consistent, and because the backup lags behind the data, availability cannot be guaranteed either.

Hot backup: Asynchronous hot backup and synchronous hot backup

A. Asynchronous hot backup: writes to the multiple data copies complete asynchronously; the application receives a successful write response from the data service as soon as one copy is written, and the storage system then writes the other replicas asynchronously.

B. Synchronous hot backup: writes to the multiple data copies complete synchronously; by the time the application receives a successful response from the data service, all copies of the data have already been written successfully.

3). Failover:

A. Failure confirmation: two ways: heartbeat detection and access-failure reports from applications. For an application's access-failure report, the control center still needs to send a heartbeat check to confirm before declaring the server down.

B. Access transfer: After confirming that a storage server is down, data read-write access needs to be rerouted to other servers.

C. Data recovery: when a server goes down, the number of data replicas falls below the configured value; data must therefore be copied from healthy servers to restore the replica count to the set value.

6. Software Quality assurance for highly available websites:

1). Website releases: a release is usually carried out by a release script. During the release process, the servers shut down at each step are only a small part of the cluster and resume serving as soon as their release finishes, so the release process as a whole does not affect users.

2). Automated testing: automated testing tools complete system deployment, test data generation, test execution, test report generation, and the rest of the testing process in one click.

3). Pre-release verification: sandbox testing in an environment kept as consistent as possible with the live online environment.

4). Code Control:

A. Trunk development, branch release: code changes are made on the trunk; when a release is needed, a branch is pulled from the trunk and becomes the release version. If the version has a bug, fixes continue to be made and released on the branch, and the fixes are merged back to the trunk, until the next trunk release.

B. Branch development, trunk release: no changes are made directly on the trunk. To develop a new feature or fix a bug, a branch is pulled from the trunk for development; once development is complete and tests pass, it is merged back to the trunk, and releases are made from the trunk. The trunk code is thus always the latest releasable version.

Each of these two approaches has advantages and disadvantages. With trunk development and branch release, the trunk code reflects the current state of the whole application at a glance, which is easy to manage and control and also good for continuous integration. With branch development and trunk release, branches are independent and do not interfere with each other, allowing development efforts with different release cycles to proceed within the same application.

5). Automated release: the less human intervention and the higher the degree of automation, the smaller the chance of introducing a fault, and the more likely the release "train" departs on time.

6). Grayscale release: even after a release apparently succeeds, failures caused by software problems may still be discovered, requiring a rollback: uninstalling the just-released software and restoring the previous version. To cope with this, large websites use a grayscale release model: the cluster is divided into several parts, and only one part is released each day.

Grayscale release is also commonly used for user testing: a new version is released on a subset of servers, user behavior is monitored and user experience reports are collected, and users' satisfaction with the two versions is compared to decide the final release. This method is called A/B testing.

7. Website operation monitoring: a system without monitoring is not allowed to go online.

1). Monitoring Data acquisition

A. User behavior log collection: user behavior logs record all the actions a user performs in the browser, along with the operating environment, including operating system and browser version, IP address, page access path, page dwell time, and so on. These data are very important for computing the site's PV/UV metrics, analyzing user behavior, optimizing website design, and personalized marketing and recommendation.

There are two ways to collect user behavior logs:

A1. Server-Side Log collection

A2. Client-side (browser) log collection: a JS script embedded in the page collects the user's real operations, so it is more accurate than server-side log collection; the disadvantage is that it is more troublesome.

In addition, the volume of user logs for a large website is staggering, and the pressure on data storage and computation is high; many websites are now building log statistics and analysis tools on top of the real-time computation framework Storm.

B. Server performance monitoring: collect server performance metrics such as system load, memory footprint, disk IO, and network IO, so as to raise fault warnings early and judge the application's status in time. Taking precautions and nipping failures in the bud is very important.

Ganglia, an open-source performance monitoring tool widely used by websites today, supports large-scale server clusters and can graphically display real-time performance curves in the browser.

C. Runtime data reports: the site also needs to monitor technology- and business-related metrics for specific business scenarios, such as cache hit ratio, average response latency, messages sent per minute, total tasks waiting to be processed, and so on.

2). Monitoring management: once monitoring data is collected, besides being used for system performance evaluation and cluster scalability prediction, it can also drive risk warnings based on real-time data, as well as server failover and automatic load adjustment, maximizing the utilization of all the cluster's resources.

A. System alarms: the monitoring management system can be configured with alarm thresholds and on-call staff contacts. Besides e-mail and instant messaging tools, alarms can also be delivered by SMS or voice call, so that even an engineer thousands of miles away, asleep at night, can be notified in time and respond quickly.

B. Failover: besides failing over when an application finds access failing, the monitoring system can proactively notify the application of a failure and trigger failover.

C. Automatic graceful degradation: graceful degradation means that, to cope with a sudden peak of access, the website proactively shuts down some functions and releases some system resources, as a means of guaranteeing normal access to the site's core functions.
