Architect-Shanda Xu Shiwei vs Kingsoft Zhang Yan

Source: Internet
Author: User

Xu Shiwei: As a system architect, what measures do you usually take to ensure high website availability (that is, to reduce downtime)?

Zhang Yan: Many factors can cause website faults and so reduce availability: server hardware faults, software faults, IDC (data center) faults, bugs not caught before a program goes online, distributed attacks, and sudden surges in traffic.

A good website architecture should avoid single points of failure (SPOF) as far as possible, such as having only one server, one database, or one software node. Once a single point fails, the website becomes unavailable; restoring normal service may take a long time, or may not be possible at all. Load-balanced clusters, two-node hot standby, and distributed processing can all be used to eliminate single points of failure. For example, web servers that serve the same business, and MySQL slave databases, can each be placed behind a load balancer as a cluster. When a server or service in the cluster fails, it is removed automatically in real time. Users do not notice, access to the website as a whole is unaffected, and operations engineers have enough time to troubleshoot and fix the fault.
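The automatic removal of failed servers described above can be sketched as a pool that skips unhealthy backends. This is only an illustration in Python: the `BackendPool` class, the backend names, and the `is_healthy` probe are assumptions, not part of the system discussed in the interview.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable, List


@dataclass
class BackendPool:
    """Round-robin pool that skips backends failing a health check."""
    backends: List[str]
    is_healthy: Callable[[str], bool]  # hypothetical probe, e.g. a TCP or HTTP check
    _counter: count = field(default_factory=count, repr=False)

    def healthy_backends(self) -> List[str]:
        return [b for b in self.backends if self.is_healthy(b)]

    def pick(self) -> str:
        """Return the next healthy backend in round-robin order."""
        alive = self.healthy_backends()
        if not alive:
            raise RuntimeError("no healthy backend available")
        return alive[next(self._counter) % len(alive)]


# Example: web2 is down, so requests alternate between web1 and web3.
down = {"web2:80"}
pool = BackendPool(["web1:80", "web2:80", "web3:80"], lambda b: b not in down)
picks = [pool.pick() for _ in range(4)]
```

Because the failed backend is simply never picked, clients keep being served while it is repaired.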

For important MySQL master databases, we usually implement hot standby at the hardware or software layer to avoid a SPOF. The more complex a device is, the more likely it is to fail. Assuming the disks are not damaged, a server is far more likely to go down because of the applications running on it than a simple disk array is to fail. So, to solve the problem at the hardware layer, you can install the same database version with the same configuration on two servers, connect both to one disk array over SAS or SCSI cables, and store the database data files on the disk array. Normally, server A mounts the disk array partition, starts MySQL, and binds a virtual IP address. If server A goes down, server B mounts the disk array partition, starts MySQL, and takes over the virtual IP address. To solve the problem at the software layer, you can use DRBD or similar software for mirroring.
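The takeover order for the shared disk array setup can be written down as a small planning function. This is a minimal sketch: the node names, the `plan_failover` helper, and the step labels are hypothetical, and it only lists the actions rather than executing anything.

```python
from typing import Dict, List

# The three takeover actions, in the order given in the text.
TAKEOVER_STEPS = ["mount disk array partition", "start MySQL", "bind virtual IP"]


def plan_failover(active: str, alive: Dict[str, bool]) -> List[str]:
    """Return the actions needed so that a live node serves the database."""
    if alive.get(active, False):
        return []  # the active node is healthy; nothing to do
    standby = [node for node, up in alive.items() if up and node != active]
    if not standby:
        raise RuntimeError("no live standby node")
    return [f"{standby[0]}: {step}" for step in TAKEOVER_STEPS]


# Example: server A is down, so server B performs the takeover steps in order.
plan = plan_failover("A", {"A": False, "B": True})
```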

IDC faults are unlikely, but when one happens its impact is the greatest. If all servers are hosted in a single IDC, then when that IDC suffers a prolonged traffic attack, a power failure, a network outage, or a network block imposed by local policy, only the IDC itself can resolve it, and that takes a long time. If the budget allows, distribute the website's servers across two or more IDCs. When one IDC fails, you can temporarily switch DNS to another IDC to restore service first.

Although testers test program code strictly before it goes online, the test environment and the production environment are never identical, so bugs that severely affect performance or normal service are often discovered only after launch. This requires the ability to roll back quickly to the previous known-good version once a bug is found. On top of SVN, we developed a web-based code publishing system that records file changes between versions and supports one-click release and rollback of program code across multiple web servers.
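The bookkeeping behind one-click rollback can be sketched as follows. This is an illustration only, not the publishing system described above: the `ReleaseHistory` class, the revision numbers, and the file names are assumptions, and the step that actually copies files to the web servers is left out.

```python
from typing import Dict, List


class ReleaseHistory:
    """Tracks released revisions and the files changed by each release."""

    def __init__(self) -> None:
        self._releases: List[Dict] = []

    def release(self, revision: int, changed_files: List[str]) -> None:
        self._releases.append({"revision": revision, "files": sorted(changed_files)})

    @property
    def current(self) -> int:
        if not self._releases:
            raise RuntimeError("nothing released yet")
        return self._releases[-1]["revision"]

    def rollback(self) -> Dict:
        """Drop the latest release; report the revision to restore and the files to redeploy."""
        if len(self._releases) < 2:
            raise RuntimeError("no previous release to roll back to")
        bad = self._releases.pop()
        return {"restore_revision": self.current, "files": bad["files"]}


history = ReleaseHistory()
history.release(100, ["index.php"])
history.release(101, ["lib/db.php", "index.php"])
plan = history.rollback()  # revision 101 turned out to be faulty
```

Recording which files each release touched means a rollback only has to redeploy those files at the earlier revision.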

Among DDoS (distributed denial-of-service) attacks, those based on half-open connections and spoofed IP addresses are relatively easy to handle with firewalls. Distributed CC attacks, which specifically target complex dynamic application URLs, are harder: they come from real IP addresses and send real HTTP requests, imitate the User-Agent of a regular browser, keep the number of requests per second from any single IP address low, and use thousands of attack sources, so they are difficult to distinguish from normal access. However, a browser that visits a URL normally also loads the JavaScript, CSS, images, and other files that the page references. When a CC attack occurs, analyze the logs promptly to find the URL with an abnormal increase in traffic, then use a prepared shell script to find the IP addresses that request only that URL without loading the files it references, and block those IP addresses automatically.
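The log analysis described above is done with a shell script in the interview; the same idea can be sketched in Python. Everything specific here is an assumption for illustration: the `suspicious_ips` helper, the `min_hits` threshold, the paths, and the sample log, which uses documentation-only IP ranges.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def suspicious_ips(
    requests: Iterable[Tuple[str, str]],  # (client_ip, request_path) pairs from the access log
    target_url: str,
    asset_paths: Set[str],
    min_hits: int = 3,  # hypothetical threshold to ignore one-off visits
) -> Set[str]:
    """Return IPs that hit the target URL repeatedly but never load its assets."""
    hits: Dict[str, int] = defaultdict(int)
    loaded_assets: Set[str] = set()
    for ip, path in requests:
        if path == target_url:
            hits[ip] += 1
        elif path in asset_paths:
            loaded_assets.add(ip)
    return {ip for ip, n in hits.items() if n >= min_hits and ip not in loaded_assets}


log = [
    ("192.0.2.10", "/search.php"),
    ("192.0.2.10", "/static/site.css"),
    ("192.0.2.10", "/search.php"),
    ("192.0.2.10", "/search.php"),
    ("198.51.100.7", "/search.php"),
    ("198.51.100.7", "/search.php"),
    ("198.51.100.7", "/search.php"),
    ("203.0.113.5", "/search.php"),
]
blocked = suspicious_ips(log, "/search.php", {"/static/site.css", "/static/site.js"})
```

A returning visitor whose browser has cached the assets can look similar to an attacker in the log, which is why the sketch requires repeated hits before flagging an address.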

When designing the system architecture, you must plan from the start for sudden traffic that is several times higher than current levels. For online gaming sites, traffic is strongly affected by concentrated advertising campaigns and online events, and peak bandwidth is not fixed. For static content, you can use a commercial CDN billed on actual usage. For dynamic content, if a sudden surge exceeds the capacity of the existing servers, the simplest temporary measure is to add servers. Bringing new servers online takes time, however, so you can instead borrow servers of other services in the same IDC, start a group of new processes on different ports, and add them to the existing load-balancing pool. You can also temporarily disable some secondary functions of the website to reduce server load.
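Temporarily disabling secondary functions under load can be sketched as a simple switch. The feature names, the core/secondary tagging, and the capacity figures below are all hypothetical.

```python
from typing import Dict, List

# Hypothetical features; True marks a core feature, False a secondary one.
FEATURES: Dict[str, bool] = {
    "login": True,
    "payment": True,
    "comments": False,
    "recommendations": False,
}


def enabled_features(current_rps: float, capacity_rps: float) -> List[str]:
    """Serve everything under normal load; keep only core features when overloaded."""
    overloaded = current_rps > capacity_rps
    return sorted(name for name, core in FEATURES.items() if core or not overloaded)


normal = enabled_features(current_rps=500, capacity_rps=1000)
peak = enabled_features(current_rps=1500, capacity_rps=1000)
```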

Xu Shiwei: Do you have any experience in task splitting? What measures do you use to ensure task independence?

Zhang Yan: I believe many people have run into this situation: modifying an old project to add new functions takes no less time than building a new project that includes all the functions from scratch. A project that needs long-term maintenance will inevitably see experienced employees leave and new employees take over. In many cases, the maintainability of the code determines a project's life span. When a new employee, under the pressure of a fixed development schedule, faces a huge project with incomplete documentation and unfamiliar, complex functions, understanding the logic of every function in a short time is not easy. That is why tasks need to be split. Once a large task is divided into small modules, the code of each module can be independent and the modules do not affect one another, which also greatly improves maintainability.

For task splitting, let me describe the architecture of two important projects I have been responsible for this year. The first is the user behavior analysis system for the Kingsoft game website. Data mining and computation need a lot of memory and CPU, a single server does not have enough processing capacity, and a commercial distributed data warehouse is too expensive, so the task could only be split at the application level. We first split the whole data mining task into multiple data mining plug-ins, one per data indicator to be mined. Each plug-in can run on a different server, and multiple plug-ins can be deployed across multiple servers at the same time. If several plug-ins use the same data item, the data is copied redundantly so that each plug-in has its own copy. As a result, the plug-ins do not interact with or depend on one another, which preserves their computation speed on large volumes of data.
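The plug-in split with redundant data copies can be sketched like this. The two plug-ins, the record fields, and the `run_plugins` helper are assumptions for illustration; in the system described above each plug-in could run on a different server, whereas this sketch runs them one after another in a single process.

```python
import copy
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Plugin = Callable[[List[Record]], Any]


def active_users(records: List[Record]) -> int:
    """Hypothetical indicator: number of distinct users."""
    return len({r["user"] for r in records})


def revenue(records: List[Record]) -> int:
    """Hypothetical indicator: total amount paid."""
    return sum(r["paid"] for r in records)


def run_plugins(records: List[Record], plugins: Dict[str, Plugin]) -> Dict[str, Any]:
    """Run each plug-in on its own copy of the data so no plug-in depends on another."""
    results = {}
    for name, plugin in plugins.items():
        private_copy = copy.deepcopy(records)  # redundant copy, one per plug-in
        results[name] = plugin(private_copy)
    return results


records = [
    {"user": "u1", "paid": 10},
    {"user": "u2", "paid": 0},
    {"user": "u1", "paid": 5},
]
results = run_plugins(records, {"active_users": active_users, "revenue": revenue})
```

The redundant copy costs storage, but it removes any shared state, so a plug-in can be moved to another server without touching the others.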

The second is the new version of the Kingsoft game operation management system. The whole task is divided into three parts: the PHP web management interface, the PHP web API function interface, and the C/C++ middleware engine. This is a layered split. The "PHP web management interface" at the top calls the "PHP web API function interface", which calls the "C/C++ middleware engine" running on the game server, which in turn communicates with the "game server process" through TCP or UDP binary protocols, signals, the command line, and other methods. The four are relatively independent, with no code coupling between them, and interact only through the API of each layer. The "PHP web management interface" is responsible for the general user interface. The "PHP web API function interface" is further divided by game module and sub-function module, and the function modules interact with one another through internal APIs. The "C/C++ middleware engine" is large and comprehensive: it does not process specific commands itself, and it supports most communication methods, including TCP, UDP, HTTP, HTTPS/SSL, signals, and the command line, so that it can interact with all types of game servers. This is a system architecture driven entirely by APIs. To connect a new game to the operation management system, you only need to add a module to the "PHP web API function interface"; to add a new management function for a game, you only need to add a sub-module to the "PHP web API function interface". Splitting the task this way simplifies complex functions and shortens the time needed to connect a new game to 1 to 2 weeks.
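The idea that a new game or function only needs a new module in the API layer can be sketched as a registry. The real layers are written in PHP and C/C++; this Python sketch only shows the shape of the design, and the game name, function name, and `send_to_middleware` stand-in are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry: game name -> {function name -> handler}.
API_MODULES: Dict[str, Dict[str, Callable[..., str]]] = {}


def register(game: str, function: str):
    """Decorator that adds a handler to the API layer for one game function."""
    def decorator(handler: Callable[..., str]) -> Callable[..., str]:
        API_MODULES.setdefault(game, {})[function] = handler
        return handler
    return decorator


def send_to_middleware(game: str, command: str) -> str:
    """Stand-in for the middleware engine: it relays commands without interpreting them."""
    return f"[{game}] {command}"


@register("game_a", "kick_player")
def kick_player(player_id: int) -> str:
    return send_to_middleware("game_a", f"kick {player_id}")


def call_api(game: str, function: str, **params) -> str:
    """Entry point used by the management interface layer."""
    try:
        handler = API_MODULES[game][function]
    except KeyError:
        raise LookupError(f"unknown API: {game}/{function}") from None
    return handler(**params)


reply = call_api("game_a", "kick_player", player_id=7)
```

Supporting another game then means registering more handlers; the management interface and the middleware stand-in stay unchanged.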

Xu Shiwei: What measures do you use to ensure product quality? How often do you prefer to update your website?

Zhang Yan: The quality of web products is mainly reflected in architecture, function, performance, security, code consistency, and compatibility.

In terms of architecture, I first design an architecture proposal, then bring in the people involved in the project and members of the expert group to discuss and evaluate its advantages and disadvantages, suggest improvements, and confirm that it is feasible. The technical solutions for all important projects must be reviewed by the expert group.

In terms of function, performance, and security, dedicated testers perform functional testing, stress testing, and security scanning. The testing environment is divided into an offline test environment and an online pre-release test environment.

In terms of code consistency, we developed a web-based configuration management platform and related PHP extensions, which system engineers use to manage configuration information centrally. In new projects, IP addresses and ports for services such as MySQL and memcached no longer appear in PHP configuration files; they are replaced by variables supplied by the configuration management platform. Code moves from the development environment to the offline test environment and on to the online production environment, and each environment connects to different databases, so PHP developers often mixed up the settings or forgot to change them. With the configuration management platform, the configuration files in the PHP code do not need to be modified in any of the four environments. This keeps the code consistent, reduces the error rate, and protects product quality.
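The effect of replacing hard-coded addresses with platform-supplied variables can be sketched as a lookup keyed by environment. The real mechanism is a PHP extension; the `get_config` helper, the variable name `MYSQL_MAIN`, the environment names, and the addresses (documentation-only ranges) are assumptions for illustration.

```python
from typing import Dict

# Hypothetical data as a configuration platform might hold it:
# environment -> variable name -> value.
CONFIG_STORE: Dict[str, Dict[str, str]] = {
    "development": {"MYSQL_MAIN": "127.0.0.1:3306"},
    "offline_test": {"MYSQL_MAIN": "192.0.2.20:3306"},
    "production": {"MYSQL_MAIN": "198.51.100.30:3306"},
}


def get_config(name: str, environment: str) -> str:
    """Resolve a named variable for the environment the code is running in."""
    try:
        return CONFIG_STORE[environment][name]
    except KeyError:
        raise LookupError(f"no value for {name!r} in {environment!r}") from None


# Application code refers only to the variable name, never to an address.
dev_dsn = get_config("MYSQL_MAIN", "development")
prod_dsn = get_config("MYSQL_MAIN", "production")
```

Since the application code contains only the variable name, the same file can be promoted from one environment to the next without edits.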

In terms of compatibility, we keep the development, test, and online environments identical, from the operating system to the PHP and MySQL versions. All web services run on CentOS Linux. Most PHP programmers are used to writing code on Windows, but some interfaces and PHP extensions that our programs call run only on Linux. For this reason we developed a small tool that maps the nginx virtual hosts and program files that each programmer creates on their own Windows machine onto a Linux server, where PHP-CGI executes the PHP code stored on Windows. PHP programmers can therefore edit code locally, save it, and debug it without affecting one another. After debugging, they right-click in Windows to commit the modified code to the SVN repository.

In the Web 2.0 era, website updates must happen in real time. This applies not only to dynamic websites: static content must also be published in real time. We developed an open-source tool named sersync (http://code.google.com/p/sersync/) that uses inotify in the Linux 2.6 kernel to monitor file system events. When any file under a monitored directory changes, sersync captures the event from the kernel and uses rsync to synchronize the file to the CDN origin server. sersync synchronizes only the individual files or directories that were added, deleted, or modified. Unlike rsync mirror synchronization, it does not need to compare the tens of millions of files in the entire directory tree on both servers, and it supports multi-threaded synchronization, so it is very efficient. In the CMS content publishing system for the Kingsoft game official website, whether editors upload images, videos, and attachments through the web or FTP, or system engineers add, modify, and delete files directly on the CMS publishing server, sersync automatically synchronizes the added, deleted, and modified files to the CDN origin server. Once the files are synchronized, the CDN cache refresh API is called automatically to refresh the URLs of the modified or deleted files.
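The event-driven approach can be sketched by turning file events into one rsync invocation per changed path. This is not sersync's actual implementation or command line: the `rsync_commands` helper, the event names, and the `cdn-origin::web/` destination are assumptions. The sketch only builds argument lists (meant to be run from the watched directory) and does not execute them. Deletions are handled here by syncing the parent directory with `--delete`, which compares that one subtree rather than the whole site.

```python
import posixpath
from typing import Iterable, List, Tuple


def rsync_commands(events: Iterable[Tuple[str, str]], dest: str) -> List[List[str]]:
    """Turn (event, relative_path) pairs into rsync argument lists, one per changed path."""
    commands: List[List[str]] = []
    seen = set()
    for event, rel_path in events:
        if event == "delete":
            # Sync only the parent directory so the removed file disappears remotely.
            parent = posixpath.dirname(rel_path)
            src = f"./{parent}/" if parent else "./"
            args = ["rsync", "-a", "--delete", src, posixpath.join(dest, parent, "")]
        else:
            # -R keeps the relative path, so only this one file is transferred.
            args = ["rsync", "-a", "-R", f"./{rel_path}", dest]
        key = tuple(args)
        if key not in seen:  # collapse repeated events for the same path
            seen.add(key)
            commands.append(args)
    return commands


events = [
    ("modify", "img/a.png"),
    ("modify", "img/a.png"),
    ("delete", "img/old.png"),
]
commands = rsync_commands(events, "cdn-origin::web/")
```

Each command touches one file or one directory, which is what avoids the full comparison that a mirror-style rsync run performs.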

Xu Shiwei: What do you usually pay attention to when interviewing candidates? Which questions do you ask most often?

Zhang Yan: First, the basic skills required for the position, which I will not elaborate on.

Second, I focus on project experience and accumulated practice rather than academic qualifications or years of work. Doing a project is like fighting a battle. After a hundred battles, the experience of success lets you work with more ease, and the experience of failure helps you avoid many detours.

Third, proficiency in one or two technical fields, or more. People who can specialize are able to master a technical field in depth, and I believe they can readily take on and do well in new projects in unfamiliar technical fields or where they have no prior experience.

Fourth, I look at the breadth of a candidate's knowledge. Today's projects have moved past the age of the lone hero and depend on teamwork. With broad knowledge, even if the depth in non-specialist fields is limited, you know both your own area and those of others and can view problems from a higher vantage point, which clearly benefits collaborative development and project integration.

Fifth, good comprehension, thinking, design, and innovation abilities. If basic skills are lacking, they can be learned; if experience is insufficient, it can be accumulated; if a technology is not yet mastered, it can be studied; if knowledge is not broad enough, it can be expanded. Cultivating these four abilities, however, is very difficult. They are indispensable for building an excellent team, and their importance even exceeds that of the four requirements above.

I do not have a fixed set of questions, but almost all the questions I ask relate to these aspects.

Xu Shiwei: Have you ever tried to open-source your own code? What do you think about the current state of the open-source community in China?

Zhang Yan: Whether to open-source your own code depends closely on the nature of your company or department. In an R&D-driven company or department, code is the lifeblood of the business. Because such a company must compete on technology and maintain its technical lead, it is hard for it to support open source. In an operations-driven company or department, technology is one of the tools for improving the quality and level of operations. Extracting purely technical code or products from the company's business products and open-sourcing them pushes the company's internal technical products toward the standards expected of open-source products. It also draws on the usage, feedback, and opinions of more users to uncover latent bugs and raise code quality and technical skill, all of which benefits operations. I have open-sourced some of my own code, such as the simple message queue service httpsqs (http://code.google.com/p/httpsqs/) and the MySQL HTTP/REST client mysql-udf-http (http://code.google.com/p/mysql-udf-http/), and I also encourage team members to try open source, for example with the automatic synchronization tool sersync mentioned earlier.

The open-source community in China is growing, and many well-known Internet companies have open-sourced some of their own products. Most people, however, stop at using open-source products, exchanging technical knowledge, and localization; few actually contribute code. Many open-source products are maintained only by the original author or the original company's team. China's open-source community still has a long way to go.
