Wix has been operating websites for a long time. Since launching its HTML5-based WYSIWYG web publishing platform, users have built more than 54 million sites on it, and most of them receive fewer than 100 page views (PV) per day. Because each page gets so little traffic, traditional caching strategies do not apply. Even so, the company serves this traffic with only 4 web servers. Recently, Aviran Mordo, Wix's head of back-end engineering, shared their strategy in a talk titled "Wix Architecture at Scale". Below is the summary written by Todd Hoff, founder of High Scalability:
The following is a translation.
Wix's approach to scalability can be summed up in one word: "customization". After carefully examining their system, they improved it for high availability and high performance.
Wix uses multiple data centers and cloud services. Less commonly, they also replicate data to both Google Compute Engine and AWS, and they have a specific fallback strategy for failover.
Wix avoids transactions throughout. Instead, all data is immutable, and they use a very simple eventual consistency strategy that fits their use cases. Wix is not a caching enthusiast either: they do not build a big caching layer. Instead, they put most of their energy into optimizing the rendering path so that every page displays in under 100 milliseconds.
Wix started very small, with a monolithic architecture, and moved to a service-oriented architecture as the business grew. Throughout that transition they followed a very deliberate process for identifying services, one that is useful to anyone considering the same move.
System statistics
54 million sites, 1 million new sites per month
800+ TB of static data, 1.5 TB of new files per day
3 data centers + two cloud services (Google and Amazon)
300 servers
700 million HTTP requests per day
600 employees in total, 200 of them in R&D
About 50 services in the system
4 public web servers serve 45 million websites
System components
MySQL
Google and Amazon Cloud services
CDN (content distribution network)
Chef
System evolution
1. The system began as a simple monolithic architecture with a single application server. This is the simplest way for anyone to start: very flexible and easy to update.
Tomcat, Hibernate, and a custom web framework.
Stateful logins.
No consideration given to performance or scalability.
2. Two years later.
A single monolithic server still handled everything.
Both the development team and the user base had reached a size the monolith struggled to support.
Dependency problems. A change in one place usually rippled through the entire system, and a failure in an unrelated area usually caused system-wide downtime.
3. Time to split the system apart.
They moved to a service-oriented architecture, but that is not easy to do. For example, how do you split a piece of functionality into two services?
They looked at what users do in the system and found three main categories: editing websites, viewing websites built with Wix, and serving media.
Editing a website involves validating data from the server, security and authentication, data consistency, and a large number of data modification requests.
Once a website is built, users view it. Across the whole system, viewers outnumber editors 10 to 1. So the focus shifts to:
High availability. HA is the most important feature, because the sites are the users' businesses.
High performance.
High traffic volume.
The long tail. There are a great many websites on the platform, but most are very small. A single site may get only 10 or 100 PV per day. With traffic spread this thinly, few pages are requested often enough to stay in a cache, so caching is very inefficient and does little to help the system scale.
Media serving is the second largest service, covering HTML, JavaScript, CSS, and images. They need to serve a high volume of requests over 800 TB of data, and here caching static content is the key to success.
The new system looks like a networking layer beneath three service segments: the editor segment (any operation that modifies data), the media segment (serves static content, read-only), and the public segment (the first place a file is viewed, read-only).
Service Building Guidelines
Each service has its own database, and each database can be written to by only one service.
A database can be accessed only through its service's API. This separates concerns and hides the data model from other services.
For performance reasons, other services may be given read-only access to a database, but still only one service can write to it.
Services are stateless. This makes horizontal scaling easy: business growth is handled simply by adding more servers.
No transactions. Apart from billing and financial transactions, no service uses database transactions. The idea is to avoid the cost of database transactions and so improve performance. Without transactions, developers must design the data model so that it provides the needed transactional behavior and avoids inconsistent states.
Caching is not a building block when designing a new service. First make the service itself as fast as possible, then deploy it to production quickly and observe how it behaves. Use caching to solve a performance problem only when the code cannot be optimized any further.
Editor segment
The editor segment, which handles all site updates, must deal with a large number of files.
Data is stored in MySQL as immutable JSON pages, about 2.5 million per day.
MySQL is a great key-value store. The key is derived from a hash of the file, so the key is immutable, and accessing MySQL by primary key is very fast.
Scalability is about trade-offs. Which trade-offs did Wix make? Wix avoided NoSQL because NoSQL stores tend to sacrifice consistency and most developers do not know how to deal with that, so they stuck with MySQL.
Active and archive databases. To keep the database that serves frequently accessed sites small and fast, cold data (typically data more than 3 months old) is moved to an archive database that is much slower but has very high capacity.
This leaves plenty of room for user growth. The large archive database is very slow, but that is not a problem given how rarely its data is used. When archived data is accessed, it is moved back to the active database so that the next access is fast.
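The key scheme and the two storage tiers described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Wix's implementation: two in-memory dicts stand in for the active and archive MySQL databases, SHA-256 is an assumed hash function, and the names content_key, TieredPageStore, and archive_cold are hypothetical.

```python
import hashlib
import json


def content_key(page: dict) -> str:
    """Derive an immutable key from the page content (SHA-256 is an assumption)."""
    canonical = json.dumps(page, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TieredPageStore:
    """Two dicts stand in for the active and archive databases."""

    def __init__(self) -> None:
        self.active: dict[str, str] = {}
        self.archive: dict[str, str] = {}

    def put(self, page: dict) -> str:
        key = content_key(page)
        # Pages are immutable: identical content always maps to the same key,
        # so saving the same content twice is a harmless no-op.
        self.active.setdefault(key, json.dumps(page, sort_keys=True))
        return key

    def archive_cold(self, key: str) -> None:
        """Move a page that has gone cold to the slow, high-capacity tier."""
        if key in self.active:
            self.archive[key] = self.active.pop(key)

    def get(self, key: str) -> dict:
        if key in self.active:
            return json.loads(self.active[key])
        # The first read after archiving is slow; promote the page so that
        # later reads hit the active tier. Raises KeyError for unknown keys.
        raw = self.archive.pop(key)
        self.active[key] = raw
        return json.loads(raw)


store = TieredPageStore()
key = store.put({"title": "Home", "widgets": ["gallery"]})
store.archive_cold(key)
page = store.get(key)  # served from the archive tier, then promoted
```

Because the key is a pure function of the content, a lookup is always a primary-key read, and a page never needs an in-place update.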
High availability for the editor segment
Once the data volume reaches a certain size, it is hard to guarantee high availability for everything. So focus on the critical path, which for a website is the site's content. If a decorative part of the site has a problem, the site as a whole remains usable. The critical path is therefore where the protection effort goes.
Protect against database crashes. To fail over as quickly as possible, replicate the database and switch to the replica when the primary fails.
Protect against data corruption. Corruption does not have to be malicious; a bug can damage stored data too. All data is immutable, and a revision is kept for every change. In the worst case, if corrupted data cannot be repaired, they can revert to an earlier good revision.
Protect against unavailability. Unlike a desktop application, a website must be reachable anytime, anywhere. So data is replicated across data centers in different geographic locations and across different clouds, which makes the system very resilient.
Clicking "Save" in the website editor sends a JSON file to the editor server.
The server sends the page to the active MySQL server, which is replicated to another data center.
After the page is saved locally, an asynchronous process uploads it to a static grid, which is the media segment.
After the data is uploaded to the static grid, a notification is sent to the archive service running on Google Compute Engine (GCE). The archive service connects to the static grid, downloads the page, and stores a copy in Google's cloud.
A notification is then sent back to the editor saying that the page has been stored in GCE.
Another copy is saved to Amazon from the GCE data.
When the final notification arrives, the data exists in 3 copies: one in the database, one in the static grid, and one in GCE.
The current revision has 3 copies; older revisions have 2.
The process is self-healing. If something fails, everything that was not uploaded is uploaded again the next time the user updates their site.
Orphaned files are garbage collected.
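The self-healing behavior of this save flow can be illustrated with a toy model. The sketch below is an assumption-laden simplification: replication runs inline rather than asynchronously, dicts stand in for the real stores, a full scan replaces whatever bookkeeping the real system uses, and SavePipeline, STAGES, and locations are hypothetical names.

```python
# Replication order taken from the text; the stage names are illustrative.
STAGES = ["static_grid", "gce", "amazon"]


class SavePipeline:
    """Toy model: write to the database first, then copy down the chain."""

    def __init__(self) -> None:
        self.database: dict[str, str] = {}
        self.stores: dict[str, dict[str, str]] = {name: {} for name in STAGES}
        self.down: set[str] = set()  # stages simulated as unreachable

    def save(self, page_id: str, body: str) -> None:
        self.database[page_id] = body  # the only step the user waits for
        self.replicate()

    def replicate(self) -> None:
        """Push every page as far down the chain as it can currently go."""
        for page_id, body in self.database.items():
            for stage in STAGES:
                if page_id in self.stores[stage]:
                    continue  # this copy already exists
                if stage in self.down:
                    break  # stop here; a later save retries
                self.stores[stage][page_id] = body

    def locations(self, page_id: str) -> list[str]:
        return [stage for stage in STAGES if page_id in self.stores[stage]]


pipeline = SavePipeline()
pipeline.down.add("gce")  # simulate an outage
pipeline.save("page-1", '{"title": "Home"}')
stuck = pipeline.locations("page-1")  # only the static grid has it

pipeline.down.clear()  # outage over
pipeline.save("page-2", '{"title": "About"}')
healed = pipeline.locations("page-1")  # the next save finished the job
```

The point of the design is that no separate repair job is needed: the next ordinary save carries any unfinished uploads along with it.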
Modeling data without database transactions
What the service owners never want to happen is this: a user edits two pages at the same time, but only one of them is saved to the database, leaving the site in an inconsistent state.
Take all the JSON files and save them to the database one after another. Once all of them are saved, issue a final save command containing a manifest: the list of IDs of all saved pages that were uploaded to the static servers (each ID is a hash of the content and is also the file name on the static server).
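A minimal sketch of this pattern follows, assuming SHA-256 as the content hash and plain dicts as tables. The names save_site, page_id, and FlakyTable are hypothetical, and FlakyTable exists only to simulate a database failure partway through a save.

```python
import hashlib


def page_id(body: str) -> str:
    """Content hash used as the page ID (SHA-256 is an assumption)."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def save_site(pages_table: dict, manifests: dict, site: str, pages: dict) -> dict:
    """Save each page, then publish the manifest as the final single write."""
    manifest = {}
    for name, body in pages.items():
        pid = page_id(body)
        if pid not in pages_table:
            pages_table[pid] = body  # insert only; pages are never updated
        manifest[name] = pid
    # Readers follow the manifest, so the new revision becomes visible only
    # here, after every page it references has been stored.
    manifests[site] = manifest
    return manifest


class FlakyTable(dict):
    """Test double: accepts a fixed number of inserts, then raises."""

    def __init__(self, writes_left: int) -> None:
        super().__init__()
        self.writes_left = writes_left

    def __setitem__(self, key, value):
        if self.writes_left <= 0:
            raise OSError("database unavailable")
        self.writes_left -= 1
        super().__setitem__(key, value)


table = FlakyTable(writes_left=3)
manifests: dict = {}
v1 = save_site(table, manifests, "site-1", {"home": "home v1", "about": "about v1"})
try:
    # The second page of this save fails, so the manifest is never replaced.
    save_site(table, manifests, "site-1", {"home": "home v2", "about": "about v2"})
except OSError:
    pass
```

After the failed save the site still points at the complete first revision, and the one page that did get written is simply an unreferenced orphan.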
Media segment
Stores a huge number of files: 800 TB of user media files, an average of 3 million files uploaded per day, and 500 million records.
Images are modified. They are adapted for different devices and screens. Watermarks can be inserted as needed, and audio formats can be converted.
They built a consistent distributed file system that is replicated across multiple data centers and fails over between them.
A pain to run. 32 servers, doubling every 9 months.
They plan to migrate to the cloud for better scalability.
Vendor lock-in is a myth. Because everything goes through APIs, moving to another cloud provider takes only a few weeks: just change the implementation.
They broke Google Compute Engine. When they migrated from their data center to GCE, they quickly hit the limits of Google's cloud. After Google made some changes, the system now works normally.
The data is immutable, which makes it highly cacheable.
An image request goes to the CDN first. If the image is not in the CDN, the request goes to their main data center in Austin. If the image is not there, the request goes to Google's cloud. If it is still not found, the request goes to the data center in Tampa.
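The lookup order for images, and the idea that every backend sits behind the same small API, can be sketched as follows. This is an illustration only: BlobStore, InMemoryStore, and fetch_image are hypothetical names, and in-memory dicts stand in for the CDN, the data centers, and the cloud.

```python
from typing import Optional, Protocol


class BlobStore(Protocol):
    """The only interface callers depend on; backends are swappable."""

    name: str

    def get(self, key: str) -> Optional[bytes]: ...


class InMemoryStore:
    """Stand-in backend; a real one would wrap a CDN or a cloud bucket."""

    def __init__(self, name: str, blobs: Optional[dict] = None) -> None:
        self.name = name
        self.blobs = dict(blobs or {})

    def get(self, key: str) -> Optional[bytes]:
        return self.blobs.get(key)


def fetch_image(key: str, chain: list[BlobStore]) -> tuple[str, bytes]:
    """Try each location in order and return the first hit with its source."""
    for store in chain:
        data = store.get(key)
        if data is not None:
            return store.name, data
    raise KeyError(key)


chain = [
    InMemoryStore("cdn"),
    InMemoryStore("austin"),
    InMemoryStore("google_cloud", {"logo.png": b"png-bytes"}),
    InMemoryStore("tampa", {"logo.png": b"png-bytes", "old.png": b"old-bytes"}),
]
source, data = fetch_image("logo.png", chain)
```

Because callers see only the get method, replacing one backend with another provider's implementation does not touch the lookup code, which is the sense in which the APIs make migration cheap.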
Public segment
Resolves the URL (among 45 million websites) and dispatches it to the appropriate renderer, which produces HTML, sitemap XML, or robots.txt.
The public SLA is a response time under 100 milliseconds at peak traffic. Websites must be highly available and very fast, yet caching does not help.
When a user edits a page and publishes it, the manifest (the list of references to the site's pages) is pushed to the public environment along with the routing table.
Minimize hops to other services. Resolving a route takes 1 database call. Dispatching the request to a renderer takes 1 RPC call. Fetching the site manifest takes 1 more database call.
Lookup tables are cached in memory and refreshed every 5 minutes.
Data is not stored in the same format the editor uses. It is stored in a denormalized format optimized for reads by primary key, so everything needed is returned in a single request.
Minimize business logic. The data is denormalized and precomputed. At this scale, every millisecond an operation adds is multiplied by 45 million, so every operation on the public servers has to be justified.
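The in-memory lookup table with a 5-minute refresh can be sketched like this. It is one possible reading of the text: the table here is reloaded lazily on the first read after the interval expires, whereas the real system may refresh in the background. RefreshingLookup, load_routes, and the injected clock are hypothetical and exist to keep the example deterministic.

```python
import time
from typing import Callable, Optional


class RefreshingLookup:
    """Serve lookups from memory; reload the whole table once it is stale."""

    def __init__(
        self,
        loader: Callable[[], dict],
        ttl_seconds: float = 300.0,  # 5 minutes, as in the text
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._table = loader()
        self._loaded_at = clock()

    def get(self, key: str) -> Optional[str]:
        now = self._clock()
        if now - self._loaded_at >= self._ttl:
            self._table = self._loader()  # the only trip to the database
            self._loaded_at = now
        return self._table.get(key)


fake_now = [0.0]  # controllable clock for the demonstration
load_count = [0]


def load_routes() -> dict:
    load_count[0] += 1
    return {"example.com/": "html-renderer"}


routes = RefreshingLookup(load_routes, clock=lambda: fake_now[0])
first = routes.get("example.com/")

fake_now[0] = 299.0
routes.get("example.com/")
loads_before = load_count[0]  # still the initial load

fake_now[0] = 300.0
routes.get("example.com/")
loads_after = load_count[0]  # one reload after the interval elapsed
```

Between reloads, resolving a route costs a dictionary read rather than a database call, which is how a request stays within a fixed, small number of hops.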
Page rendering
The HTML returned by a public server is bootstrap HTML: a shell that imports JavaScript and embeds JSON data referencing the site manifest and dynamic data.
Rendering is offloaded to the client. Today's laptops and mobile devices are powerful enough to handle it.
JSON was chosen because it is easy to parse and compresses well.
Bugs are easier to fix on the client. Fixing a client bug only requires redeploying the client code. With server-side rendering the HTML is cached, so fixing a bug means re-rendering thousands of sites.
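To make the idea of a bootstrap shell concrete, here is a small sketch of a server-side function that returns such a page. The markup, the element IDs, the script URL, and the name bootstrap_html are all assumptions for illustration; the source does not describe the actual shape of Wix's HTML.

```python
import html
import json


def bootstrap_html(site_data: dict, script_url: str) -> str:
    """Return a shell page: one script import plus embedded JSON, no markup."""
    # Escape "</" so site content cannot close the script element early.
    # "\/" is a valid JSON escape, so the data still parses to the same value.
    payload = json.dumps(site_data).replace("</", "<\\/")
    return (
        "<!doctype html>\n"
        "<html><head>"
        f'<script src="{html.escape(script_url, quote=True)}" defer></script>'
        "</head><body>"
        '<div id="site-root"></div>'
        f'<script type="application/json" id="site-data">{payload}</script>'
        "</body></html>"
    )


site_data = {"manifest": ["3f2a", "9c1d"], "title": "My </script> shop"}
shell = bootstrap_html(site_data, "/static/renderer.js")
```

The server does no layout work at all here; the client script reads the embedded JSON and builds the page, so a rendering fix ships by replacing the one script file.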
High availability for the public segment
The goal is to be always available, but things go wrong.
On a good day: the browser sends a request, which reaches a data center, passes through a load balancer to a public server, has its route resolved, and is passed to the renderer. The HTML returns to the browser, which runs the JavaScript. The browser then sends a request to the archive service, which replays the request the same way the browser did and stores the result in a cache.
When a data center is lost: all the UPSs failed and the data center went down. DNS was changed and requests were sent to the secondary data center.
When the public segment is lost: this happened when a load balancer received only half of its configuration, taking out all the public servers. It can also happen when a bad version is deployed and the servers start returning errors. Wix handles this with custom code in the load balancer: when the public servers are unavailable, requests are routed to the archive service to fetch the cached copy, so sites keep working even while the alarms are going off and the system has not yet recovered.
When the network is poor: the browser sends a request, which reaches a data center, passes through the load balancer, and gets HTML back. Now the JavaScript must fetch all the pages and JSON data. It goes to the CDN and then to the static grid to fetch all the files needed to render the site. On a very slow or unreliable network, files may fail to come back. So the JavaScript has a fallback: if it cannot get a file from the primary location, it fetches it from the archive service.
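The load balancer fallback for a lost public segment can be sketched as below. This is a simplified model, not the real load balancer code: a dict stands in for the archive service's cache, the good-day path writes to that cache directly instead of having the archive service replay the request, and serve and PublicUnavailable are hypothetical names.

```python
from typing import Callable


class PublicUnavailable(Exception):
    """Raised when no public server can answer (hypothetical signal)."""


def serve(url: str, render: Callable[[str], str], archive_cache: dict) -> str:
    """Render normally; fall back to the archived copy if rendering is down."""
    try:
        page = render(url)
    except PublicUnavailable:
        if url in archive_cache:
            return archive_cache[url]  # bad day: last known good copy
        raise  # nothing archived yet: surface the outage
    archive_cache[url] = page  # good day: keep the archived copy fresh
    return page


def healthy(url: str) -> str:
    return f"<html>{url}</html>"


def broken(url: str) -> str:
    raise PublicUnavailable(url)


cache: dict = {}
good = serve("example.com/", healthy, cache)  # normal path, fills the cache
fallback = serve("example.com/", broken, cache)  # outage, served from cache
```

Note that this cache is populated as a side effect of normal traffic and is read only during an outage, so it supports availability without being part of the normal request path.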
Lessons learned
Identify your business's critical path and concerns. Understand how the product works, develop usage scenarios, and focus your effort where it pays off most.
Go multi-cloud and multi-data-center. Build redundancy on the critical path for availability.
Denormalize data and minimize out-of-process hops, both for performance. Precompute, and do everything you can to reduce network chatter.
Take advantage of the client's CPU power.
Start small, get it running, and then decide what to do next. Wix first did what was needed to make the product work, and then moved methodically to a service-oriented architecture.
The long tail calls for a different approach. Instead of caching everything, Wix optimizes the rendering path and keeps data in both active and archive databases.
Go immutable. Immutability has far-reaching effects on an architecture, touching everything from the back end to the client, and it is an elegant solution to many problems.
Vendor lock-in is a myth. Everything is behind APIs, so moving to a different cloud vendor takes only a few weeks of changing the implementation.
The real bottleneck is data. Moving large amounts of data between cloud environments is extremely hard.