Favoring MySQL, Wix Uses 4 Web Servers to Support 54 Million User Sites

Source: Internet
Author: User

"Editor's note" Wix has been in business for a long time. Since it launched its HTML5-based WYSIWYG website platform, users have created more than 54 million sites on it, and most of these sites get fewer than 100 page views per day. Because each page gets so few views, traditional caching strategies do not apply. Even so, the company serves these sites with only 4 web servers. Recently Aviran Mordo, Wix's chief backend engineer, shared their strategy in a talk titled "Wix Architecture at Scale". Below is the summary by Todd Hoff, founder of High Scalability:

The following is the translation:

Wix's approach to scalability can be summed up in one word: custom. After carefully reviewing their system, they evolved it to achieve high availability and high performance.

Wix uses multiple data centers as well as cloud services, which is uncommon, and replicates data to Google Compute Engine and AWS. They have a specific strategy for failing over between them.

From beginning to end, Wix has not used transactions. Instead, all data is immutable, and they use a very simple eventual consistency strategy for their use cases. Wix is not a fan of caching; in short, they have not built an elaborate caching layer. Instead, they put most of their effort into optimizing the rendering path, so that every page displays in under 100 milliseconds.

Wix started with a very small, monolithic system and, as the business grew, moved naturally to a service-oriented architecture. Throughout that transition they used a very deliberate strategy for identifying services, which made it easy to focus each effort on a single concern.

System Statistics

- 54 million websites, with 1 million new websites added each month
- 800+ TB of static data, with 1.5 TB of new files per day
- 3 data centers plus two cloud services (Google and Amazon)
- 300 servers
- 700 million HTTP requests per day
- 600 employees, 200 of them in research and development
- About 50 services in the system
- 4 public web servers serving 45 million websites
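To put the traffic and storage figures in perspective, converting the daily totals to per-second rates gives the following. These are 24-hour averages only (peak rates will be higher), and decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) are assumed:

```latex
\frac{700 \times 10^{6}\ \text{requests/day}}{86\,400\ \text{s/day}} \approx 8\,100\ \text{requests/s}
\qquad
\frac{1.5 \times 10^{12}\ \text{B/day}}{86\,400\ \text{s/day}} \approx 17.4 \times 10^{6}\ \text{B/s} \approx 17.4\ \text{MB/s}
```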

System Components

- MySQL
- Google and Amazon cloud services
- CDN (content delivery network)
- Chef

System Evolution

1. The system started as a simple monolith with a single application server. This is the simplest way for anyone to get started: very flexible and easy to change.

Tomcat, Hibernate, and a custom web framework. Stateful logins. No attention paid to performance or scalability.

2. Two years later.

Still a single monolithic server supporting everything. By now there was a sizable development team and a sizable user base to support. Dependency problems: a change in one place usually forced changes across the entire system, and a failure in an unrelated area often brought down the whole system.

3. The time has come to split the system.

Moving to a service-oriented architecture is not easy. For example, how do you split one function into two services? They focused on what users do in the system and identified 3 main categories: editing a site, viewing sites built with Wix, and serving media. Editing a site involves data validation, security and authentication of server data, data consistency, and a large number of data-modification operations. Once a site is built, users view it; across the whole system, viewers outnumber editors by 10 to 1. So the focus shifted to:

- High availability. HA became the most important property of the system, because users run their businesses on these sites.
- Performance.
- High traffic volume.
- The long tail. There are a huge number of websites on the platform, but most of them are very small: a single site may get only 10 or 100 page views per day. Because of this, caching does little to help the system scale and becomes very inefficient.
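To see why caching does so little for the long tail, this small simulation measures the hit rate of an LRU cache. It is a sketch under an idealized assumption: traffic is spread evenly across sites (real traffic is skewed), and the site count, cache size, and request count are made-up parameters, not Wix figures.

```python
import random
from collections import OrderedDict


def lru_hit_rate(num_sites: int, cache_size: int, num_requests: int, seed: int = 0) -> float:
    """Simulate an LRU page cache in front of many rarely visited sites.

    Assumption: every site is equally likely to be requested, an idealized
    stand-in for a long tail where each site gets only a few views a day.
    """
    rng = random.Random(seed)  # seeded so the result is reproducible
    cache = OrderedDict()      # site id -> cached page (None in this sketch)
    hits = 0
    for _ in range(num_requests):
        site = rng.randrange(num_sites)
        if site in cache:
            hits += 1
            cache.move_to_end(site)        # mark as most recently used
        else:
            cache[site] = None             # miss: render the page and store it
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / num_requests


if __name__ == "__main__":
    print(f"long tail : {lru_hit_rate(100_000, 1_000, 200_000):.1%}")
    print(f"few sites : {lru_hit_rate(1_000, 1_000, 200_000):.1%}")
```

With 100,000 equally popular sites and room for 1,000 pages, the steady-state hit rate is roughly cache size divided by site count, about 1%; when the cache can hold every site, nearly every request hits. Skewed real traffic helps popular sites, but the rarely visited majority still mostly misses.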

Serving media is the second largest service. It covers HTML, JavaScript, CSS, and images, and needs a way to handle a large volume of requests against 800 TB of data; here, caching static content is the key to success. The new system looks like a networking layer sitting beneath three service sections: the update section (any operation that modifies data), the media section (serves static content, read-only), and the public section (the first place a file is viewed, read-only).

Service Building Guidelines

- Each service has its own database, and each database can be written by only one service.
- A database is accessed only through its service's API. This separates concerns and hides the data model from other services.
- For performance reasons, other services may be given read-only access to the database, but it is still written by only one service.
- Services are stateless, which makes horizontal scaling easy: business growth is handled simply by adding servers.
- No transactions. Apart from billing/financial transactions, no service uses transactions; the idea is to avoid the cost of database transactions and so improve performance. Without transactions, developers must design data models that provide the needed transactional guarantees themselves, so that inconsistent states cannot arise.
- Caching is not part of the design of a new service. First make the service as fast as possible, deploy it to production quickly, and watch how it behaves. Add caching only to fix a performance problem that cannot be solved by optimizing the code.
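The ownership rules above can be sketched in a few lines. This is a minimal illustration, not Wix's code: `SiteService`, `save_site`, and `read_only_view` are hypothetical names, and an in-memory dict stands in for the service's database. Python's `types.MappingProxyType` plays the role of the read-only access handed to other services.

```python
from types import MappingProxyType
from typing import Any, Dict, Mapping, Optional


class SiteService:
    """Hypothetical service that is the only writer of its own database.

    A plain dict stands in for the service's private MySQL schema.
    """

    def __init__(self) -> None:
        # Private storage: nothing outside this class should write to it.
        self._db: Dict[str, Mapping[str, Any]] = {}

    def save_site(self, site_id: str, data: Dict[str, Any]) -> None:
        """Write API: the only path through which the data may change."""
        if not site_id:
            raise ValueError("site_id must be non-empty")
        # Store a read-only copy so later edits to `data` cannot leak in.
        self._db[site_id] = MappingProxyType(dict(data))

    def get_site(self, site_id: str) -> Optional[Mapping[str, Any]]:
        """Read API for callers that go through the service."""
        return self._db.get(site_id)

    def read_only_view(self) -> Mapping[str, Mapping[str, Any]]:
        """Direct read-only access granted to other services for performance.

        MappingProxyType is a live view: it sees new writes but rejects
        item assignment and deletion with TypeError.
        """
        return MappingProxyType(self._db)
```

In a real deployment the same separation would come from database permissions (for example, a read-only database user for other services) rather than from a language-level proxy.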

Update Service


- The update service must handle a large number of files.
- Data is stored in MySQL as immutable JSON pages, about 2.5 million per day.
- MySQL makes a great key-value store. The key is a hash of the file, so keys are immutable, and primary-key access to MySQL performs very well.
- Acceptable scalability. What trade-off did Wix make on scalability? Wix does not use NoSQL because NoSQL systems tend to sacrifice consistency and developers are usually not equipped to deal with that, so sticking with MySQL is perfectly reasonable.
- Active database. To make room for frequently visited sites, cold data for all sites (usually data more than 3 months old) is moved to other databases, which are much slower but have high capacity. This leaves room for user growth. The large archive databases are very slow, but given how rarely that data is used, this is not a problem. Once archived data is accessed, it is moved back to the active database so that the next access is fast.
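The two storage ideas above, immutable pages keyed by a hash of their content and cold pages moved to an archive tier and promoted back on access, can be sketched together. Assumptions: SHA-256 over a canonical JSON serialization is used as the hash (the text does not name one), "older than 3 months" is approximated as 90 days since last access, and two dicts stand in for the active and archive databases. `PageStore` and its methods are hypothetical.

```python
import hashlib
import json
import time
from typing import Any, Dict, Optional, Tuple

# Assumption: "older than about 3 months" approximated as 90 days since last access.
ARCHIVE_AFTER_SECONDS = 90 * 24 * 3600


class PageStore:
    """Immutable JSON pages keyed by a content hash, with active/archive tiers.

    Two dicts stand in for the fast active MySQL database and the slower,
    high-capacity archive database.
    """

    def __init__(self) -> None:
        self.active: Dict[str, Tuple[str, float]] = {}  # key -> (json, last access)
        self.archive: Dict[str, str] = {}                # key -> json

    @staticmethod
    def _now(now: Optional[float]) -> float:
        return time.time() if now is None else now

    def put(self, page: Dict[str, Any], now: Optional[float] = None) -> str:
        # Canonical serialization: equal pages always produce the same key.
        page_json = json.dumps(page, sort_keys=True, separators=(",", ":"))
        key = hashlib.sha256(page_json.encode("utf-8")).hexdigest()
        if key not in self.active and key not in self.archive:
            self.active[key] = (page_json, self._now(now))  # written once, never updated
        return key

    def get(self, key: str, now: Optional[float] = None) -> Dict[str, Any]:
        if key in self.active:
            page_json = self.active[key][0]
        else:
            # Slow path; raises KeyError for an unknown key.
            page_json = self.archive.pop(key)
        # Record the access; an archived page is promoted back to the active tier.
        self.active[key] = (page_json, self._now(now))
        return json.loads(page_json)

    def archive_cold(self, now: Optional[float] = None) -> int:
        """Move pages not read for ARCHIVE_AFTER_SECONDS to the archive tier."""
        cutoff = self._now(now) - ARCHIVE_AFTER_SECONDS
        cold = [k for k, (_, last) in self.active.items() if last < cutoff]
        for key in cold:
            self.archive[key] = self.active.pop(key)[0]
        return len(cold)
```

Because a key is derived from content, saving the same page twice is harmless, and a key can never point at different data later, which is what makes primary-key lookups safe to cache or replicate.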

Building High Availability for the Update Service

When data volumes grow past a certain size, high availability for everything is hard to guarantee. So focus on the critical path, which for a website is undoubtedly its content. If a decorative part of a site has a problem, it does not fatally affect the site's usability. For a website, the critical path is the only real concern.
- Guard against database crashes. To fail over quickly, keep a backup of the database and be able to switch to it quickly when something goes wrong.
- Protect data integrity. The threat is not necessarily malicious: a bug can also corrupt the data store. All data is immutable, and every change is saved as a new revision, so in the worst case, even if data is corrupted beyond repair, it can be reverted to an earlier revision.
- Prevent unavailability. Unlike desktop applications, websites must be reachable anytime, from anywhere. It is therefore important to back up data to data centers in different geographic locations and to different cloud environments, which gives the system sufficient resilience.
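The data-integrity rule (never overwrite, always add a revision) can be illustrated with a tiny append-only history. This is a sketch under assumptions: `RevisionedPage` is a hypothetical name, revisions are kept in a Python list instead of a database, and reverting is modeled as moving a pointer back rather than writing a new revision.

```python
from typing import List


class RevisionedPage:
    """Append-only revision history for one page (hypothetical sketch).

    Every save appends a new immutable revision; nothing is overwritten,
    so a bad or corrupted revision can always be rolled back.
    """

    def __init__(self, initial: str) -> None:
        self._revisions: List[str] = [initial]
        self._live = 0  # index of the revision currently served

    def save(self, content: str) -> int:
        self._revisions.append(content)  # older revisions stay untouched
        self._live = len(self._revisions) - 1
        return self._live

    @property
    def live(self) -> str:
        return self._revisions[self._live]

    def revert_to(self, revision: int) -> None:
        """Point the page back at an earlier, known-good revision."""
        if not 0 <= revision < len(self._revisions):
            raise IndexError(f"no revision {revision}")
        self._live = revision

    def history(self) -> List[str]:
        return list(self._revisions)  # a copy, so callers cannot rewrite history
```

Even after the bad revision is reverted, it stays in the history, which helps when investigating the bug that produced it.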

Clicking the "Save" button on a website makes the editor session send a JSON file to the update server. The server writes the page to the active MySQL server, which is replicated to another data center. After the data is saved locally, an asynchronous process uploads the changes to the static grid, the so-called media section. Once the data reaches the static grid, a notification is sent to the archive service running on Google Compute Engine. The archive service connects to the static grid, downloads the modified page, and stores it in Google's cloud. A notification is then sent back to the editor saying the page has been stored in GCE. The system also saves another copy to Amazon based on the GCE data. When the final notification is received, the data exists in 3 copies: the database, the static grid, and GCE. The current version has 3 copies; older versions have 2. The process is self-healing: if anything fails, all unfinished changes are uploaded again the next time the user updates their site. Orphaned files are removed by garbage collection.

Modeling Data Without Database Transactions

Service owners never want this to happen: a user edits two pages at the same time, and only one of them ends up stored in the database, leaving an inconsistent state. To avoid this, all the JSON files are taken and saved to the database one after another. Only when all of them have been saved is a final save command issued, containing the list of IDs of all the saved pages that were uploaded to the static server (each ID is a hash, which is also the file name on the static server).
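The save protocol above, combined with the self-healing upload described earlier, can be sketched as follows. Names such as `SiteSaver`, `upload`, and `pending_upload` are hypothetical; dicts stand in for MySQL and the static grid; page IDs are SHA-256 hashes of the page content (an assumed hash); and the upload runs synchronously here for simplicity, whereas the text describes it as asynchronous.

```python
import hashlib
from typing import Callable, Dict, List, Set


class SiteSaver:
    """Transaction-free save: write every page first, publish the manifest last."""

    def __init__(self, upload: Callable[[str, str], None]) -> None:
        self.db: Dict[str, str] = {}           # page id -> page JSON (stand-in for MySQL)
        self.manifests: List[List[str]] = []   # committed site versions
        self.pending_upload: Set[str] = set()  # saved locally, not yet on the static grid
        self._upload = upload                  # may raise OSError on failure

    def save_site(self, pages: List[str]) -> List[str]:
        ids: List[str] = []
        for page_json in pages:  # 1. store each page, one after the other
            page_id = hashlib.sha256(page_json.encode("utf-8")).hexdigest()
            self.db[page_id] = page_json  # idempotent: same content, same id
            self.pending_upload.add(page_id)
            ids.append(page_id)
        # 2. Only after every page is stored, record the manifest of page ids.
        #    A crash before this point leaves orphan pages (garbage-collectable)
        #    but never a site version that references a missing page.
        self.manifests.append(ids)
        self._flush_uploads()
        return ids

    def _flush_uploads(self) -> None:
        # 3. Self-healing: uploads that failed on an earlier save are retried now.
        for page_id in sorted(self.pending_upload):  # iterate over a copy
            try:
                self._upload(page_id, self.db[page_id])
            except OSError:
                continue  # leave it pending for the next save
            self.pending_upload.discard(page_id)
```

The manifest acts as the commit point: readers only follow manifests, so they never observe a half-saved site, and no database transaction is needed.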

Media Section

- Stores a large number of files: 800 TB of user media files, an average of 3 million new files per day, and 500 million records.
- Image manipulation. Images are transformed for different devices and screens; watermarks can be inserted as needed, and audio formats can be converted.
- They built a consistent distributed file system, replicated across multiple data centers with failover between data centers.
- Running it was painful: 32 servers, with the count doubling every 9 months. They plan to move to the cloud for better scalability.
- Forget about vendor lock-in. Because everything goes through APIs, they can migrate between cloud providers within weeks just by changing the implementation.
- Trouble on Google Compute Engine. When they migrated from their data center to GCE, they quickly ran into limitations of Google's cloud services; after Google made some changes, the system works normally.
- Data is immutable, which makes it very cache-friendly.
- Image requests go to the CDN first. If the requested image is not in the CDN, the request goes to their main data center in Austin. If the image is not found there, the next place to look is Google's cloud. If it is still not found, the next location is the data center in Tampa.
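The lookup order for images amounts to a simple fallback chain. In this hypothetical sketch each tier is modeled as a function returning the file's bytes or `None`; the tier names and the `fetch_media` and `dict_tier` helpers are illustrative, not Wix APIs. Because files are immutable, whichever tier answers first holds the correct content, so no version check is needed.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A tier lookup returns the file's bytes, or None if this tier does not have it.
Lookup = Callable[[str], Optional[bytes]]


def fetch_media(path: str, tiers: List[Tuple[str, Lookup]]) -> Tuple[str, bytes]:
    """Return (tier name, content) from the first tier that has the file."""
    for name, lookup in tiers:
        data = lookup(path)
        if data is not None:
            return name, data
    raise FileNotFoundError(path)


def dict_tier(contents: Dict[str, bytes]) -> Lookup:
    """Hypothetical in-memory tier used for illustration and testing."""
    return contents.get


# Order taken from the text: CDN, Austin data center, Google Cloud, Tampa.
example_tiers = [
    ("cdn", dict_tier({})),
    ("austin", dict_tier({})),
    ("google-cloud", dict_tier({"logo.png": b"\x89PNG..."})),
    ("tampa", dict_tier({"logo.png": b"\x89PNG..."})),
]
```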

Public Section

- Resolves the URL (among 45 million websites) and dispatches it to the appropriate renderer, which turns it into HTML, sitemap XML, or robots.txt.
- A public SLA of under 100 milliseconds response time at peak. Websites must be highly available and very fast, but caching does not help.
- When a user edits and publishes a page, the manifest listing the page's elements is pushed to the public section, together with the routing table.
- Minimize hops. Resolving a route takes 1 database call, dispatching the request to a renderer takes 1 RPC call, and fetching the site manifest takes 1 database call.
- Lookup tables are cached in memory and refreshed every 5 minutes.
- Data is not stored in the same format the editor uses. It is stored denormalized, optimized for reads by primary key, so everything needed comes back in a single request.
- Minimize business logic. Data is denormalized and precomputed. At this scale, every operation that happens within a second is multiplied by 45 million, so every operation performed on the public servers must be justified.

Page Rendering

The HTML returned by a public server is a bootstrap page: a JavaScript shell containing JSON data for the whole site manifest and the dynamic data. Rendering happens on the client. Today's laptops and mobile devices are powerful enough to take this on. JSON was chosen because it is easy to parse and compresses well. Client bugs are easy to fix: fixing one only requires redeploying the client code, whereas with server-side rendering the HTML would be cached, so fixing a bug would require re-rendering thousands of sites.
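Putting the public path together: a route lookup served from an in-memory table that is refreshed every 5 minutes, and a bootstrap page that ships the site's JSON to the browser for client-side rendering. This is a hypothetical Python sketch of the idea only; `RoutingTable`, `bootstrap_html`, `window.__SITE__`, and `/renderer.js` are made-up names, and the real rendering is done by JavaScript in the browser.

```python
import html
import json
import time
from typing import Any, Callable, Dict, Optional

REFRESH_SECONDS = 5 * 60  # lookup tables are refreshed every 5 minutes


class RoutingTable:
    """In-memory URL -> site id table that reloads itself when it gets stale."""

    def __init__(self, load: Callable[[], Dict[str, str]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load                # e.g. one bulk query against the routing DB
        self._clock = clock
        self._table: Dict[str, str] = {}
        self._loaded_at = float("-inf")  # forces a load on first use

    def resolve(self, url: str) -> Optional[str]:
        now = self._clock()
        if now - self._loaded_at >= REFRESH_SECONDS:
            self._table = self._load()
            self._loaded_at = now
        return self._table.get(url)


def bootstrap_html(manifest: Dict[str, Any]) -> str:
    """Return a minimal HTML shell with the site's JSON embedded for the client."""
    # Escape "<" so content such as "</script>" cannot end the script tag early.
    payload = json.dumps(manifest).replace("<", "\\u003c")
    title = html.escape(str(manifest.get("title", "")))
    return (
        f"<!DOCTYPE html><html><head><title>{title}</title></head><body>"
        '<div id="root"></div>'
        f"<script>window.__SITE__ = {payload};</script>"
        '<script src="/renderer.js"></script>'
        "</body></html>"
    )
```

The server's work per request is a dictionary lookup plus string assembly; all layout work is pushed to the client, which matches the "minimize business logic" rule.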

High Availability of the Public Section

Although the goal is to always be available, unexpected things happen. Normally, the browser sends a request, which reaches a data center and passes through a load balancer to a public server; the public server resolves the route and hands it to the renderer, the HTML goes back to the browser, and the browser runs the JavaScript. The browser then sends a request to the file service, which replays the request just as the browser did and stores the result in a cache. What happens when things fail:
- Losing a data center: all the UPSes fail and the data center goes down. DNS is changed so that requests go to the secondary data center.
- Losing the public section: all public servers are lost, for example when a load balancer configuration change is only half applied, or when a bad version is deployed and the servers start failing. Wix solved this with custom load balancer code: when the public servers are lost, requests are routed to the file service's cache, so sites keep being served even if the system has not yet recovered after the alert.
- Poor network connectivity: the browser's request reaches a data center, goes through the load balancer, and the HTML is returned. Now the JavaScript must fetch all the JSON data and pages. It goes to the CDN, then to the static grid, and fetches all the files needed to render the site. When the network is very laggy, files may not come back at all. The JavaScript then makes a choice: if it cannot get a file from the primary location, it fetches it from the file service.
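In Wix's case the poor-connectivity fallback is decided by the JavaScript running in the browser; this Python sketch only illustrates the logic. It assumes a hypothetical `primary` fetch (CDN / static grid) and `fallback` fetch (file service), and the timeout is an illustrative parameter, not a Wix setting.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable


def fetch_with_fallback(primary: Callable[[], bytes],
                        fallback: Callable[[], bytes],
                        timeout_s: float,
                        pool: ThreadPoolExecutor) -> bytes:
    """Try the primary source; on a network error or timeout, use the fallback.

    primary  : e.g. a fetch from the CDN / static grid (hypothetical)
    fallback : e.g. a fetch from the file service (hypothetical)
    """
    future = pool.submit(primary)
    try:
        return future.result(timeout=timeout_s)
    except (FutureTimeout, OSError):
        # A slow primary keeps running in its worker thread; cancel() only
        # helps if it has not started yet, so we simply stop waiting for it.
        future.cancel()
        return fallback()
```

Only network-style failures (`OSError` and timeouts) trigger the fallback; other exceptions from `primary` still propagate, so genuine bugs are not silently masked.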

Lessons Learned

- Identify the critical path and the concerns of the business. Understand how the product works, develop usage scenarios, and make sure your effort goes where it pays off.
- Use multiple clouds and multiple data centers. Build redundancy on the critical path for better availability.
- Transform data to minimize process hops; it is all for performance.
- Anticipate network jitter and do everything you can to reduce its impact.
- Leverage powerful client CPUs, and build redundancy on the critical path for availability.
- Start small, get it running, then look for the next decision. From beginning to end, Wix first focused on making the service work well, and then methodically moved to a service-oriented architecture.
- The long tail needs a different approach. Instead of caching everything, Wix improves its services by optimizing the rendering path and keeps data in both active and archive databases.
- Go immutable. Immutability has a far-reaching impact on a service's architecture, affecting everything from the back end to the client, and it is an elegant solution to many problems.
- Vendor lock-in is a myth. Everything is accessed through APIs, so migrating to a different cloud vendor only requires changing the implementation and takes a few weeks.
- The biggest bottleneck is data. Moving large amounts of data between cloud environments is extremely difficult.

Original link: Nifty Architecture Tricks From Wix - Building A Publishing Platform At Scale (translated by Dongyang; edited by Zhonghao)
