LinkedIn Data Architecture Analysis

LinkedIn is one of the most popular professional social networking websites today. This article describes how LinkedIn manages its data. If you disagree with any point in this article, or notice an omission, please feel free to let me know.

LinkedIn.com Data Use Cases

Below are some data use cases you may have encountered while browsing LinkedIn's web pages:

  • An update to a member's profile appears on the recruiter search page in near real time.
  • The same profile update appears on the pages of the member's connections in near real time.
  • A shared update appears on the news feed page in near real time.
  • The update then propagates to other read-only pages, such as "People You May Know," "Who's Viewed Your Profile," and "Related Searches."

Remarkably, given a decent connection, these pages load within milliseconds. Let's pay tribute to LinkedIn's engineering team!

Like many startups, LinkedIn in its early days stored member profiles and connections in a handful of tables in a single RDBMS (relational database management system). Primitive, isn't it? The RDBMS was later extended with two additional database systems: one to support full-text search over member profiles, and the other to serve the social graph. Both obtained the latest data through Databus, a change data capture system whose main goal is to capture changes to datasets in the source of truth (such as Oracle) and propagate those changes to the additional database systems.

Before long, however, this architecture could no longer meet the website's data requirements, because Brewer's CAP theorem says a distributed system cannot provide all three of the following guarantees at once:

  • Consistency: all clients see the same data at the same time.
  • Availability: every request receives a response, whether it succeeds or fails.
  • Partition tolerance: message loss or the failure of part of the system does not stop the system as a whole from operating.
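Databus's real interface is considerably richer, but the core idea, capturing committed changes in order and fanning them out to subscribed derived systems, can be sketched in a few lines. Everything below (ChangeEvent, ChangeCaptureRelay, the field names) is an invented illustration, not Databus's actual API:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ChangeEvent:
    """One row change captured from the source database (invented shape)."""
    scn: int                  # source commit sequence number, used for ordering
    table: str
    key: str
    after_image: dict[str, Any]

class ChangeCaptureRelay:
    """Fans captured source changes out to derived systems.

    Mimics the role Databus plays: consumers subscribe, and every
    committed change is delivered to each of them in commit order.
    """
    def __init__(self) -> None:
        self._consumers: list[Callable[[ChangeEvent], None]] = []
        self._last_scn = 0

    def subscribe(self, consumer: Callable[[ChangeEvent], None]) -> None:
        self._consumers.append(consumer)

    def publish(self, event: ChangeEvent) -> None:
        # Deliver in commit order so every derived store converges
        # to the same state as the source.
        assert event.scn > self._last_scn, "events must arrive in commit order"
        self._last_scn = event.scn
        for consumer in self._consumers:
            consumer(event)

# Example: a profile update flows to a derived search index.
search_index: dict[str, dict] = {}
relay = ChangeCaptureRelay()
relay.subscribe(lambda e: search_index.update({e.key: e.after_image}))
relay.publish(ChangeEvent(scn=1, table="member", key="member:42",
                          after_image={"name": "Ada", "skill": "Hadoop"}))
print(search_index["member:42"])  # {'name': 'Ada', 'skill': 'Hadoop'}
```

Delivering events strictly in commit order is what lets every subscriber converge to the same state as the source, which is the timeline consistency discussed next.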

Guided by this trade-off, LinkedIn's engineering team implemented what they call timeline consistency (the eventual consistency of the nearline systems, explained below) together with the other two properties: availability and partition tolerance. The following sections describe LinkedIn's current data architecture.

The early architecture was clearly insufficient if LinkedIn's data architecture was to process millions of member-related transactions in under a second. The engineering team therefore devised a three-tier data architecture consisting of online, offline, and nearline data systems. Broadly, LinkedIn's data lives in the following systems (see the figure below):

  • RDBMS
    • Oracle
    • MySQL (the underlying storage for Espresso)
  • NoSQL
    • Espresso (a document-oriented NoSQL data store developed by LinkedIn)
    • Voldemort (a distributed key-value store)
    • HDFS (stores the data for Hadoop map-reduce jobs)
  • Caching
    • Memcached
  • Lucene-based indexes
    • Lucene indexes storing data for features such as search and the social graph
    • Indexes used by Espresso

 Figure: LinkedIn's database systems, including Databus, NoSQL stores, RDBMSs, and indexes.

The data stores mentioned above fall into three different types of systems, each explained in turn below.

Online Database Systems

The online systems handle real-time user interaction. The primary databases, such as Oracle, back user-facing write operations and a small share of reads. Taking Oracle as the example, the Oracle master performs all write operations. LinkedIn has recently been developing another data system, Espresso, to meet increasingly complex data requirements that an RDBMS such as Oracle seems unable to serve. Can they move all or most of the data out of Oracle and into a NoSQL data store like Espresso? Let's wait and see.
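As a rough picture of the write-to-master, read-from-elsewhere split this section describes, here is a minimal sketch. The class and method names are invented, and real replication is asynchronous rather than the instantaneous copy shown here:

```python
import random

class ReplicatedStore:
    """Toy primary/replica split: writes go to one primary, reads to replicas."""
    def __init__(self, replica_count: int = 2) -> None:
        self.primary: dict[str, str] = {}
        self.replicas = [dict() for _ in range(replica_count)]

    def write(self, key: str, value: str) -> None:
        self.primary[key] = value          # all writes hit the primary
        for replica in self.replicas:      # replication (instantaneous here)
            replica[key] = value

    def read(self, key: str):
        # Reads are spread across replicas to keep load off the primary.
        return random.choice(self.replicas).get(key)

store = ReplicatedStore()
store.write("member:42:title", "Data Engineer")
print(store.read("member:42:title"))
```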

Espresso is a horizontally scalable, document-oriented NoSQL data store that supports indexing, timeline consistency, and high availability. It is intended to replace the traditional Oracle databases that back the company's web operations, and it was originally designed to improve the availability of LinkedIn's InMail messaging service. The following applications currently use Espresso as their source-of-truth system. It is impressive that one NoSQL store handles the data needs of so many applications (a toy illustration of the document model follows the list):

  • Messages between members
  • Social activity, such as status updates
  • Article sharing
  • Member profiles
  • Company information
  • News articles
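As promised above, here is a toy illustration of the document model. Espresso actually exposes a richer interface with hierarchical keys; the DocumentStore class below only mimics the idea that related documents (for example, one member's mailbox) share a partition key and can be read together. All names are invented:

```python
class DocumentStore:
    """Toy document store: documents grouped under a partition key."""
    def __init__(self) -> None:
        self._partitions: dict[str, dict[str, dict]] = {}

    def put(self, partition: str, doc_id: str, doc: dict) -> None:
        self._partitions.setdefault(partition, {})[doc_id] = doc

    def get(self, partition: str, doc_id: str):
        return self._partitions.get(partition, {}).get(doc_id)

    def scan(self, partition: str) -> list[dict]:
        # Fetch every document under one partition key, e.g. a member's inbox.
        return list(self._partitions.get(partition, {}).values())

inbox = DocumentStore()
inbox.put("member:42", "msg:1", {"from": "member:7", "body": "Hi!"})
inbox.put("member:42", "msg:2", {"from": "member:9", "body": "Job offer"})
print(len(inbox.scan("member:42")))  # -> 2
```

Grouping one member's documents under a single partition key is what lets a feature like InMail read and update a whole mailbox in one place.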

Offline Database Systems

The offline system consists mainly of Hadoop and a Teradata data warehouse, used for batch processing and analytics. It is called offline because it operates on data in batches. Apache Azkaban manages the Hadoop and ETL jobs: these jobs pull data from the primary or source systems and hand it to map-reduce for processing. The results are stored in HDFS, and 'consumers' (for example, Voldemort) are then notified to pull the data and swap in the new index, ensuring that the latest data is served (a sketch of this build-and-swap handoff appears after the list below).

Nearline Database Systems (timeline consistency)

The goal of the nearline systems is timeline consistency (or eventual consistency). They back features like 'People You May Know' (a read-only dataset), search, and the social graph; the data behind these features is constantly updated, but their latency requirements are not as strict as those of the online systems. Below are several different types of nearline systems:
  • Voldemort is a key-value store that serves the read-only pages in the system. Voldemort's data comes from the Hadoop pipeline (with Azkaban orchestrating the map-reduce job execution plans). These are nearline systems in the sense that they obtain their data from an offline system such as Hadoop. The data on the following pages comes from Voldemort:
      • People You May Know
      • Viewers of this profile also viewed
      • Search
      • Jobs you may be interested in
      • Events you may be interested in
  • Several different indexes updated by Databus, the change data capture system:
      • The 'member search index' used by SeaS (Search as a Service). When you search for members on LinkedIn, the results come from this search index; the feature is especially helpful to recruiters.
      • The social graph index, which powers the display of members and relationships in a user's network. With this index, users see changes in their network in near real time.
      • Read replica sets serving member data, accessed by the 'standardization service'. A read replica set is a copy of the source database, kept in sync with updates to the source. The main reason for adding read replicas is that read queries can be spread across them, relieving pressure on the source database, which handles the user-initiated writes.
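The offline-to-nearline handoff described above, where a batch job produces a complete dataset and the serving store swaps it in, can be sketched as follows. Voldemort's read-only stores follow roughly this build-and-push pattern, but the ReadOnlyStore class, its method names, and the toy batch job are invented for illustration:

```python
class ReadOnlyStore:
    """Serves a derived dataset and atomically swaps in new batch output."""
    def __init__(self) -> None:
        self._live: dict[str, list[str]] = {}

    def get(self, key: str) -> list[str]:
        return self._live.get(key, [])

    def swap(self, new_dataset: dict[str, list[str]]) -> None:
        # Single reference assignment: readers see the old dataset or the
        # new one, never a half-updated mix.
        self._live = new_dataset

def batch_job() -> dict[str, list[str]]:
    # Stand-in for a map-reduce job computing "People You May Know".
    return {"member:42": ["member:7", "member:9"]}

pymk = ReadOnlyStore()
pymk.swap(batch_job())        # the 'consumer' pulls fresh data and swaps
print(pymk.get("member:42"))  # -> ['member:7', 'member:9']
```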
The following use case shows how change data capture events flow through Databus into the nearline systems. Suppose you update your profile with your latest skills and position, and you also accept a connection request. Here is what happens inside the system:

  • The updates are written to the Oracle master database.
  • Databus then does its wonderful work to achieve timeline consistency (a sketch of this fan-out follows the list):
      • It propagates the profile changes, such as the latest skills and position, to the standardization service.
      • It propagates the same changes to the search index service.
      • It propagates the connection change to the graph index service.
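To make those three hops concrete, here is a minimal sketch of one captured write fanned out to standardization, search, and graph consumers. The event shape, the consumer functions, and the standardization rule are all invented:

```python
def standardize(event: dict, standardized: dict) -> None:
    # Map the free-text title to a canonical one (toy rule, not LinkedIn's).
    title = event["profile"]["title"]
    standardized[event["member"]] = ("Software Engineer"
                                     if "engineer" in title.lower() else title)

def index_profile(event: dict, search_index: dict) -> None:
    # Make the updated skills and position searchable.
    search_index[event["member"]] = event["profile"]

def index_connections(event: dict, graph_index: dict) -> None:
    # Record the newly accepted connection in the graph index.
    graph_index.setdefault(event["member"], []).extend(event["new_connections"])

# One captured write from the Oracle master, delivered to all three consumers.
event = {
    "member": "member:42",
    "profile": {"title": "Sr. Engineer", "skills": ["Hadoop", "Kafka"]},
    "new_connections": ["member:7"],
}
standardized, search_index, graph_index = {}, {}, {}
for consumer, target in ((standardize, standardized),
                         (index_profile, search_index),
                         (index_connections, graph_index)):
    consumer(event, target)
print(standardized, graph_index)
```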

Data Architecture Lessons

If you want to design a data architecture that, like LinkedIn.com's, supports data consistency, high scalability, and high availability, consider the following lessons:
  • Database read/write splitting: plan for two classes of databases. The first is the 'source of truth' system; the second is a class of derived database systems that serve reads. The rule of thumb is to separate the databases taking user-initiated writes from the databases serving user reads.
  • Derived database systems: route user reads to derived databases or read replica sets. Derived database systems can be built on:
      • Lucene indexes
      • NoSQL data stores such as Voldemort, Redis, Cassandra, and MongoDB
  • For user reads, build indexes or key-value datasets (via Hadoop map-reduce or similar systems) from the primary source-of-truth database, and propagate user-initiated changes on the primary to those indexes and derived datasets as well.
  • To keep the derived data up to date, choose between application dual writes (the application layer writes to both the primary database and the derived systems) and log mining (a batch process reads the transaction commit logs of the primary data store); a small sketch contrasting the two follows this list.
  • When building derived data, you can run Hadoop-based map-reduce jobs, update HDFS, and then notify the derived data store (such as a Voldemort-style NoSQL store) to pull the data.
  • For data consistency, you can build these data repositories as distributed systems: every node in the cluster carries both master and slave roles, and all nodes host horizontally scalable data shards.
  • To maximize the uptime of these distributed data stores, use a cluster management tool such as Apache Helix.
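As referenced in the list above, here is a small sketch contrasting application dual writes with log mining. It is a toy comparison under invented names, not a real driver or CDC API:

```python
# Strategy 1: application dual writes -- the app writes both stores itself.
def dual_write(primary: dict, derived: dict, key: str, value: str) -> None:
    primary[key] = value
    derived[key] = value   # risk: if this write fails, the stores diverge

# Strategy 2: log mining -- a separate process tails the commit log,
# so the derived store can catch up even after a failure.
commit_log: list[tuple[str, str]] = []

def logged_write(primary: dict, key: str, value: str) -> None:
    primary[key] = value
    commit_log.append((key, value))

def replay(derived: dict, from_offset: int) -> int:
    for key, value in commit_log[from_offset:]:
        derived[key] = value
    return len(commit_log)   # new offset: resume here after a crash

primary, derived = {}, {}
logged_write(primary, "member:42:skill", "Hadoop")
offset = replay(derived, 0)
assert derived == primary
```

The trade-off this illustrates: dual writes are simple but can silently diverge on partial failure, while a mined commit log gives the derived store a durable position to resume from.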
References
  • Siddharth Anand, LinkedIn Data Infrastructure (paper)
  • https://github.com/linkedin/databus
  • http://gigaom.com/2013/03/03/how-and-why-linkedin-is-becoming-an-engineering-powerhouse/
  • http://highscalability.com/blog/2012/3/19/linkedin-creating-a-low-latency-change-data-capture-system-w.html
