WhatsApp's Erlang World

Notes compiled from Rick's two PPTs

Downloads: 2012, 2013. After using Erlang for six months, revisiting these two PPTs reveals even more that is worth learning. Notes organized from the PPTs follow:

-Prefer os:timestamp to erlang:now

The use of erlang:now() should essentially be banned: use it even slightly too much and the node's %si (softirq CPU) is saturated, dropping overall performance by an order of magnitude.
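
A minimal sketch of the substitution (module and function names here are illustrative, not from the slides): os:timestamp/0 returns the same {MegaSecs, Secs, MicroSecs} tuple as erlang:now/0 but does not go through the VM-global uniqueness guarantee that makes erlang:now/0 a contention point under load.

    -module(ts_demo).
    -export([elapsed_us/1]).

    %% Measure how long Fun takes, in microseconds, using os:timestamp/0
    %% instead of erlang:now/0.
    elapsed_us(Fun) ->
        T0 = os:timestamp(),
        Fun(),
        timer:now_diff(os:timestamp(), T0).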


-Implement cross-node gen_server calls without using monitors (reduces dist traffic and proc link lock contention)

You can implement your own RPC module so that every call is not a monitor-call-demonitor sequence, which costs too much; instead, set up a permanent monitor on the first call and reuse it.
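
A minimal sketch of the idea, under the assumption that the remote registered server answers with {reply, Ref, Result}; the module name and message protocol are hypothetical. Instead of monitoring the callee for every request, the caller simply waits with a timeout; a longer-lived variant would set up one monitor per peer node on the first call and keep it.

    -module(lite_rpc).
    -export([call/4]).

    %% Cross-node request/response without a per-call monitor: rely on a
    %% receive timeout to detect a dead or unreachable peer.
    call(Node, Name, Request, Timeout) ->
        Ref = make_ref(),
        {Name, Node} ! {call, {self(), Ref}, Request},
        receive
            {reply, Ref, Reply} ->
                {ok, Reply}
        after Timeout ->
            {error, timeout}
        end.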

-Partition ETS and Mnesia tables and localize access to a smaller number of processes

To reduce lock contention, ETS and Mnesia tables are split, and as few processes as possible access the same ETS/Mnesia table; highly concurrent access severely hurts performance and burns CPU.

WhatsApp limits the number of processes accessing a single ETS or Mnesia table to 8, which keeps lock contention under control.
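
A minimal sketch of the partitioning idea (table names and the shard count are illustrative): hash the key to one of 8 ETS tables so no single table is hammered by every process. In the real system each shard would additionally be owned and accessed by its own small set of processes.

    -module(ets_shard).
    -export([init/0, table_for/1]).

    -define(N_SHARDS, 8).

    %% Create 8 named ETS tables, one per shard.
    init() ->
        [ets:new(shard_name(I), [named_table, public, set])
         || I <- lists:seq(0, ?N_SHARDS - 1)],
        ok.

    %% Route a key to its shard's table by hashing.
    table_for(Key) ->
        shard_name(erlang:phash2(Key, ?N_SHARDS)).

    shard_name(I) ->
        list_to_atom("msg_shard_" ++ integer_to_list(I)).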

-Small Mnesia clusters

Use small Mnesia clusters. The PPT shows that WhatsApp operates its Mnesia tables with optimized async_dirty operations, does not use transactions, and typically runs clusters of only two nodes (one master and one standby, with only the master being operated on).

Mnesia clustering works well, but large clusters hurt performance too much. In a test of a 20-node cluster, running read-update transactions on rows of a few dozen bytes from one node reached only about 8K/s, and that node's gigabit NIC was already saturated.
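
A minimal sketch of the async_dirty style described above, assuming a hypothetical counter RAM table already exists on the two-node cluster; there is no transaction, so the read-modify-write is cheap but offers no isolation guarantees.

    -module(counter_db).
    -export([bump/1]).

    -record(counter, {key, value = 0}).

    %% Read-update a record with async_dirty instead of a transaction.
    bump(Key) ->
        mnesia:async_dirty(
          fun() ->
                  V = case mnesia:read(counter, Key) of
                          [#counter{value = Old}] -> Old;
                          [] -> 0
                      end,
                  mnesia:write(#counter{key = Key, value = V + 1})
          end).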

-Offload SSL termination to stud

WhatsApp multimedia files are downloaded over HTTPS; stud (https://github.com/bumptech/stud) is placed in front of the Yaws service as an SSL-terminating proxy.

In one test, a 16 GB machine handled up to 200,000 connections at about 2,500 new connections per second. SSL connection setup is CPU-intensive, and a pure-Erlang implementation is too inefficient, so its performance with large numbers of short-lived connections is not promising. stud uses a large number of loopback addresses to work around the 64K TCP port limit; a very high connection count also costs memory, and having stud forward requests to the back end over RPC might address that as well.

-Chat message encryption

https://github.com/davidgfnet/wireshark-whatsapp/blob/master/protocol-spec.txt

The SSL handshake is too complex and makes connection setup slow, so chat messages are encrypted with RC4, using the password plus a random nonce as the key; this even eliminates the key-exchange round trip.

WhatsApp Storage Solutions:

-Mnesia in memory (used like Redis as a cache; the advantage is that it is more flexible):

The in-memory Mnesia database uses approximately 2 TB of RAM and stores 18 billion records across 16 shards.

Only messages and multimedia that are in transit are stored; while multimedia is being delivered, information about the message is kept in the database.

Because WhatsApp does not keep users' message history long-term, a message is deleted once it has been delivered. Messages are read quickly by users; 50% are read within 60 seconds.

Therefore all messages are designed to live in in-memory Mnesia: if a message is delivered within a certain time it is simply deleted, and it does not sit in a backlog for long before it would need to be written to disk.
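
A minimal sketch of such a transient, memory-only table (record layout and names are hypothetical): ram_copies keeps everything in RAM, and delivered messages are deleted immediately.

    -module(transit_store).
    -export([create_table/1, put_msg/1, delivered/1]).

    -record(msg, {id, to, payload}).

    %% RAM-only Mnesia table for messages in transit.
    create_table(Nodes) ->
        mnesia:create_table(msg,
                            [{ram_copies, Nodes},
                             {attributes, record_info(fields, msg)}]).

    put_msg(M = #msg{}) ->
        mnesia:async_dirty(fun() -> mnesia:write(M) end).

    %% Once the recipient has acknowledged the message, drop it entirely.
    delivered(Id) ->
        mnesia:async_dirty(fun() -> mnesia:delete({msg, Id}) end).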

-Mnesia on disc (user information, offline messages):

Index information for image, audio, and video files; the actual files are stored on disk (per the PPT, roughly 189 GB of this storage per node, not 500 GB of memory per machine).

User information, messages, etc.

[Repost] WhatsApp's Erlang World (an excellent article)

http://www.csdn.net/article/2014-04-04/2819158-how-whatsapp-grew-to-nearly-500-million-users-11000-cores-an/1

Earlier we shared High Scalability founder Todd Hoff's summary of WhatsApp's early architecture, which included many Erlang optimizations to support 2 million concurrent connections on a single server, support every type of phone, and provide a great user experience. Two years later, how does WhatsApp support 10 times the previous traffic and the rapid growth of the application? Here is Todd's new summary.

Here is a summary of some of the major changes at WhatsApp over the past two years:

1. Change at WhatsApp can be seen along every dimension, yet the number of engineers has stayed the same. WhatsApp now has more hosts, more data centers, more memory, more users, and more scaling problems, but what they are proudest of is the 10-person engineering team: each engineer supports an average of 40 million users. This is also a success of the cloud era: engineers only develop software, while network, hardware, and data center operations are handled elsewhere.

2. Before, facing surging load, they had to make a single server support as many connections as possible; now they have stepped out of that era. Of course, to control overall cost, they still need to limit the number of hosts and make SMP hosts run efficiently.

3. The benefits of transience. Now that the architecture carries multimedia, pictures, text, and audio, not having to store all of this large content for the long term greatly simplifies the system, and the architectural focus stays on throughput, caching, and sharding.

4. The world of Erlang. Even though what they run is still a distributed system and the problems they meet are broadly similar to everyone else's, using Erlang from start to finish is truly commendable.

5. Mnesia, the Erlang database, seems to have become a major source of their problems. It makes one wonder whether insisting on Erlang everywhere is somewhat blind, and whether there are better alternatives.

6. Problems at this scale are as numerous as you can imagine: huge numbers of connections, queues growing too long because of prioritized operations, timers, code that behaves differently under different loads, high-priority messages not getting processed under high load, one operation accidentally interrupted by another, failures caused by resource problems, and compatibility across different phone platforms. A giant architecture is not built overnight.

7. Rick's ability to find and deal with problems is amazing, even astonishing.

Rick's talks are always wonderful, and he is happy to share many details that can only be learned from a production environment. Here is a summary of his latest talk:

-Statistics

465 million monthly users

19 billion messages received per day, 40 billion messages sent

600 million photos, 200 million voice messages, 100 million videos

147 million peak concurrent connections (phones connected to the system)

230,000 peak logins per second (phones coming online and going offline)

324,000 messages flowing in per second at peak, 712,000 flowing out

About 10 engineers work on Erlang, responsible for both development and operations

Holiday peaks:

Christmas Eve outbound traffic peaked at 146 Gb/s, with a considerable share of bandwidth serving phones

360 million video downloads on Christmas Eve

Approximately 2 billion photo downloads on New Year's Eve

A single photo downloaded 32 million times on New Year's Eve

-Stack

Erlang R16B01 (with their own patches applied)

FreeBSD 9.2

Mnesia (Database)

Yaws

Use of SoftLayer cloud services and physical servers

-Hardware

Approximately 550 servers plus standby hardware

About 150 chat servers (each handling roughly 1 million phones; 150 million connections at peak)

About 250 multimedia servers

Dual 2690v2 Ivy Bridge 10-core processors (40 hardware threads in total with Hyper-Threading)

Database nodes have 512 GB of memory

Standard compute nodes have 64 GB of memory

SSDs are used primarily for reliability, and also to store video when storage resources run low

Dual-link GigE x2 (public network facing users, private network for the back-end systems)

The Erlang system uses more than 11,000 cores

-System Overview

A single-minded love of Erlang

A very good language, well suited to a small engineering team.

Very good SMP scalability. They can run highly provisioned hosts and benefit from keeping the node count low; operational complexity scales with the number of nodes, not the number of cores.

You can quickly update your code.

Scaling is like minesweeping: they can usually find and clear problems before they blow up. World-scale events act as stress tests of the system, especially football matches, which bring very large spikes. Server failures (usually memory), network failures, and bad software pushes all test the system.

The traditional architecture

Mobile clients connect to MMS (multimedia) servers

Chat servers connect to transient offline storage; message delivery between users is handled by the back-end system

Chat servers connect to databases such as Account, Profile, Push, Group, and so on

Messages sent to phones:

Text messages

Notifications: group messages, profile photo changes, etc.

Presence messages: typing status, away status, online or offline, etc.

Multimedia database

The in-memory Mnesia database uses approximately 2 TB of RAM and stores 18 billion records across 16 shards.

Only messages and multimedia that are in transit are stored; while multimedia is being delivered, information about the message is kept in the database.

A single server now runs only 1 million concurrent connections, compared with 2 million two years ago, because each server does more now:

As the number of users grows, WhatsApp wants to keep more headroom on each server to handle spikes.

Many features that previously did not run on these servers have been moved onto them, so the servers are busier.

-Decoupling

Isolate bottlenecks so they do not spread through the whole system

Tight coupling can cause cascading failures

Front-end systems and back-end systems are first separated

Isolate everything so that there is no impact between components.

Keep as much throughput as possible while the problem is being resolved.

Process asynchronously to minimize the impact of latency on throughput

When delays are unpredictable and occur at different points, asynchrony preserves throughput as much as possible.

Decoupling allows the system to run as fast as possible.

Avoid head-of-line (HOL) blocking

Head-of-line blocking is when the item being processed at the front of a queue starves everything behind it.

Separate read and write queues. In particular, when transactions are executed on a table, write delays should not spill over into the read queue; reads are usually fast, so any blocking hurts read performance.

Separate the internal queues between nodes. If a node, or the network link to a node, has a problem, it can block other work in the application. Therefore messages destined for different nodes are handled by different processes (lightweight concurrency in Erlang), so only traffic to the problem node backs up while other messages keep flowing freely and the problem stays isolated; Mnesia was also patched to keep async_dirty response times in check. The application is decoupled from the send path, so a failing node does not turn into a load problem.
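
A minimal sketch of this per-destination-node isolation (names are hypothetical, not WhatsApp's code): one lightweight sender process per remote node, so a slow or dead node only backs up its own sender's mailbox, and nosuspend keeps the sender itself from being suspended by a congested distribution link.

    -module(node_senders).
    -export([start_sender/1, send_to/3, sender_loop/1]).

    %% Spawn and register one forwarding process for a destination node.
    start_sender(Node) ->
        Pid = spawn(?MODULE, sender_loop, [Node]),
        register(sender_name(Node), Pid),
        Pid.

    %% Queue a message with the per-node sender instead of sending directly.
    send_to(Node, Name, Msg) ->
        sender_name(Node) ! {forward, Name, Msg},
        ok.

    sender_loop(Node) ->
        receive
            {forward, Name, Msg} ->
                %% nosuspend: never block this process on a busy dist link.
                erlang:send({Name, Node}, Msg, [nosuspend]),
                sender_loop(Node)
        end.

    sender_name(Node) ->
        list_to_atom("sender_" ++ atom_to_list(Node)).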

A FIFO model is used in scenarios with unpredictable delays.

-Meta clustering

This part appears around the 29-minute mark of the talk; unfortunately, it does not carry much information.

A method was needed to keep the size of any single cluster under control and to allow clusters to span long distances.

They built wandist, a distribution transport based on gen_tcp, connecting only the nodes that need to talk to each other.

A transparent routing layer built on pg2 forms a single-hop routing and dispatch system.

For example: two main clusters in two data centers, two multimedia clusters in two different data centers, and one shared global cluster spanning the two data centers, all connected with wandist.

Example

Use async_dirty to avoid Mnesia transaction coupling; in most cases transactions are not used at all.

call is used only when a result must come back from the database; everything else uses cast to preserve the asynchronous model. In Erlang, a caller blocks while waiting for the handle_call reply (and its message queue grows), whereas handle_cast causes no blocking because no result is expected.

Calls use timeouts rather than monitors, reducing contention on the remote process and the amount of data carried over distribution.

If you only need best-effort delivery, cast with nosuspend. This keeps a node from being dragged down by downstream problems: whether it is a node failure or a network problem (in which case the outgoing data is buffered on the sending node), a process that would otherwise be suspended by the distribution mechanism when sending could cause cascading failures, with everyone waiting and no work getting done.

Use large send buffers to reduce the impact of the network and downstream systems.
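
A minimal sketch of the call/cast split described above (server name and messages are hypothetical): call with an explicit timeout only when a result must come back, cast everywhere else so the caller never blocks.

    -module(profile_client).
    -export([fetch/2, update_async/3]).

    %% Synchronous read: a result is needed, so call with a bounded timeout
    %% (5 seconds here) rather than waiting forever or monitoring per call.
    fetch(Server, UserId) ->
        gen_server:call(Server, {fetch, UserId}, 5000).

    %% Fire-and-forget write: cast keeps the asynchronous model, so a slow
    %% server never blocks the caller.
    update_async(Server, UserId, Fields) ->
        gen_server:cast(Server, {update, UserId, Fields}).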

Parallelization

-Task Assignment

Work needs to be spread across 11,000 cores.

They started with single-threaded gen_servers, then built a gen_factory responsible for passing work between multiple nodes.

At a certain load level, the dispatch process itself becomes the bottleneck, not just an execution-time problem.

So they created gen_industry, a layer above gen_factory, to ingest all inputs in parallel and dispatch them to workers immediately.

Worker nodes are addressed by key, much like a database lookup. For work with nondeterministic latency, such as IO, a FIFO model is used to avoid blocking.
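
A minimal sketch of key-based dispatch to a pool of workers (the worker naming scheme and count are hypothetical): the dispatcher hashes the key to pick a worker and hands the job over with a cast, so slow or IO-bound work never blocks the dispatcher itself.

    -module(work_dispatch).
    -export([dispatch/2]).

    -define(N_WORKERS, 32).

    %% Hash the key to one of N registered workers (worker_0 .. worker_31)
    %% and forward the job asynchronously.
    dispatch(Key, Job) ->
        I = erlang:phash2(Key, ?N_WORKERS),
        Worker = list_to_atom("worker_" ++ integer_to_list(I)),
        gen_server:cast(Worker, {job, Key, Job}).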

-Split Service

Services are split between 2 and 32 ways; most are split 32 ways.

pg2 addressing: distributed process groups are used to address shards across the cluster (see the sketch below).

Nodes run in master/slave pairs for disaster recovery.

Access to a single ETS or Mnesia table is limited to 8 processes, which keeps lock contention under control.
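
A minimal sketch of pg2-based shard addressing (the group naming scheme is an illustrative assumption; pg2 was the OTP module of that era and has since been replaced by pg): shard owners join a process group, and callers look up the closest member.

    -module(shard_route).
    -export([join_shard/2, whereis_shard/2]).

    %% Called by a shard-owning process: create the group if needed and join.
    join_shard(Service, ShardNo) ->
        Group = {Service, ShardNo},
        ok = pg2:create(Group),
        pg2:join(Group, self()).

    %% Called by clients: prefer a group member on the local node.
    whereis_shard(Service, ShardNo) ->
        pg2:get_closest_pid({Service, ShardNo}).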

-Mnesia

Because transactions are not used to ensure consistency, access to a record is serialized through a single process per node. A key is hashed to a shard, which maps to one Mnesia fragment, which is dispatched to one factory and then to one node. Thus access to every single record goes through a single Erlang process.

Each Mnesia fragment is only read or written at the application level on one node, so the replication stream only needs to flow in one direction.

Once there is a replication stream between nodes, the update rate of the shards becomes a bottleneck. They patched OTP to run multiple transaction managers for async_dirty so that records can be modified in parallel, which yields more throughput.

Another patch lets a Mnesia database be split across multiple directories, which means it can write to multiple drives, directly increasing disk throughput. The issue is that Mnesia IO peaks on a single disk; spreading IO across multiple disks, and even adding SSDs, further improves scalability and performance.

Mnesia "islands" were reduced to 2 nodes each; an "island" is one Mnesia cluster. So when a table is split 32 ways, 16 islands support that table. This makes schema operations easier, because only two nodes need to coordinate, and bringing up 1 or 2 nodes at a time reduces load times.
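
A minimal sketch of a 32-way fragmented table accessed through mnesia_frag in async_dirty mode (the table name and attributes are hypothetical); the frag_properties tell Mnesia how many fragments to create and which nodes may hold them.

    -module(frag_setup).
    -export([create_fragmented/2, read_frag/2]).

    %% Create a table split into 32 fragments across the given node pool.
    create_fragmented(Tab, Nodes) ->
        mnesia:create_table(Tab,
                            [{frag_properties,
                              [{n_fragments, 32},
                               {node_pool, Nodes}]},
                             {attributes, [key, value]}]).

    %% Read through the mnesia_frag access module so the key is routed to
    %% the right fragment, without a transaction.
    read_frag(Tab, Key) ->
        mnesia:activity(async_dirty,
                        fun() -> mnesia:read(Tab, Key) end,
                        [],
                        mnesia_frag).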

Alerts are set up so that Mnesia network partitions are handled quickly: keep the system running, then manually reconcile and merge them.

-Optimizations

Under peak load, the offline storage system was a very large bottleneck; messages could not be pushed into the system any faster.

Messages are read quickly by users; 50% are read within 60 seconds.

A write-back cache was added so that messages can be delivered before they are written to the file system; the cache hit rate is as high as 98%.

If the IO system is blocked by load, the cache provides extra buffering for message delivery until the IO system recovers.

The BEAM (Erlang VM) was patched to implement asynchronous file IO and avoid thread blocking, round-robining file-system port requests across all async worker threads, which smooths out writes in the presence of large mailboxes and slow disks.

Keep large mailboxes out of the cache. Some users join a huge number of groups and receive thousands of messages per hour; they pollute the cache and slow everything down, so they are evicted from the cache. Note that handling disproportionately heavy users is a problem for every system, including Twitter.

A large number of fragments is used to fix slow access to Mnesia tables.

The account table was split into 512 fragments spread across the islands, which means there is a sparse mapping between users and the 512 fragments; most fragments are empty and idle.

Doubling the number of hosts actually reduced throughput. Record access was slow because the hash chain size exceeded 2K when the target is 7.

What was happening was that the hashing scheme created a large number of empty buckets and some extremely long ones. A two-line change fixed this and improved performance by 4 to 1.

Patches

Contention on the timer wheel: with millions of connections per host, a timer is created or reset whenever anything changes on a connection, which amounts to hundreds of thousands of timer operations per second. The lock on the single timer wheel was a major source of contention; the solution was to create multiple timer wheels.

mnesia_tm is one big select loop, and under load it can build up a backlog of transactions even when it is not saturated; it was patched to collect the transaction stream and save it for later processing.

Multiple mnesia_tm async_dirty senders were added.

There are many cross-cluster operations, so Mnesia is best loaded from nearby nodes.

Round-robin scheduling was added for asynchronous file IO.

The ETS hash is seeded to avoid coinciding with phash2.

The ETS main/name tables were optimized for scale.

Don't queue Mnesia dumps: when too many dumps pile up in the queue, schema operations become infeasible.

-The February 22 outage

Even with all this effort, downtime is unavoidable, and it happened at the worst possible time: a 210-minute outage right after the Facebook acquisition.

A change in load surfaced the problem, but the outage itself was caused by a routing problem in the back-end network.

A router took down the LAN, causing a large number of nodes in the cluster to disconnect and reconnect. When the nodes reconnected, the cluster ended up in an unstable state it had never been in before.

In the end they had to stop the system to repair it, which had not happened in years.

In the post-mortem they found an over-coupled subsystem: during the disconnect/reconnect storm, pg2 was generating n^3 messages, and message queues shot from 0 to 4 million entries within seconds, so they rolled out a patch.

-Feature Release

It is impossible to simulate traffic at this scale, especially peaks such as the moment of the New Year's bell. So they can only roll features out slowly: release to a small amount of traffic first, iterate quickly until it runs well, and then roll it out to the other clusters.

A release is a rolling update. Everything is redundant: if they want to do a BEAM upgrade, they install it and then restart the nodes in the cluster to pick it up. Hot patches exist but are rarely used; ordinary upgrades are already cumbersome enough.
