LinkedIn Data Architecture Analysis

LinkedIn is one of the most popular professional social networking websites today. This article describes how LinkedIn manages its data. If you disagree with any point in this article, or notice an omission, please feel free to let me know.

LinkedIn.com Data Use Cases

Below are some data use cases you may have encountered while browsing LinkedIn's web pages:

  • An update to a member's profile appears on the recruiter search page in near real time.
  • The same profile update appears on the pages of the member's connections in near real time.
  • A shared update appears on the news feed page in near real time.
  • The update then propagates to other read-only pages, such as "People You May Know," "Who's Viewed Your Profile," and "Related Searches."

Remarkably, given a decent connection, these pages load within milliseconds. Let's pay tribute to LinkedIn's engineering team!

Like many startups, LinkedIn in its early days stored member profiles and connections in a handful of tables in a single RDBMS (relational database management system). Primitive, isn't it? The RDBMS was later extended with two additional database systems: one to support full-text search over member profiles, and the other to serve the social graph. Both obtained the latest data through Databus, a change data capture system whose main goal is to capture changes to datasets in the source of truth (such as Oracle) and propagate those changes to the additional database systems.

Before long, however, this architecture could no longer meet the website's data requirements, because Brewer's CAP theorem says a distributed system cannot provide all three of the following guarantees at once:

  • Consistency: all clients see the same data at the same time.
  • Availability: every request receives a response, whether it succeeds or fails.
  • Partition tolerance: message loss or the failure of part of the system does not stop the system as a whole from operating.
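Databus's real interface is considerably richer, but the core idea, capturing committed changes in order and fanning them out to subscribed derived systems, can be sketched in a few lines. Everything below (ChangeEvent, ChangeCaptureRelay, the field names) is an invented illustration, not Databus's actual API:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ChangeEvent:
    """One row change captured from the source database (invented shape)."""
    scn: int                  # source commit sequence number, used for ordering
    table: str
    key: str
    after_image: dict[str, Any]

class ChangeCaptureRelay:
    """Fans captured source changes out to derived systems.

    Mimics the role Databus plays: consumers subscribe, and every
    committed change is delivered to each of them in commit order.
    """
    def __init__(self) -> None:
        self._consumers: list[Callable[[ChangeEvent], None]] = []
        self._last_scn = 0

    def subscribe(self, consumer: Callable[[ChangeEvent], None]) -> None:
        self._consumers.append(consumer)

    def publish(self, event: ChangeEvent) -> None:
        # Deliver in commit order so every derived store converges
        # to the same state as the source.
        assert event.scn > self._last_scn, "events must arrive in commit order"
        self._last_scn = event.scn
        for consumer in self._consumers:
            consumer(event)

# Example: a profile update flows to a derived search index.
search_index: dict[str, dict] = {}
relay = ChangeCaptureRelay()
relay.subscribe(lambda e: search_index.update({e.key: e.after_image}))
relay.publish(ChangeEvent(scn=1, table="member", key="member:42",
                          after_image={"name": "Ada", "skill": "Hadoop"}))
print(search_index["member:42"])  # {'name': 'Ada', 'skill': 'Hadoop'}
```

Delivering events strictly in commit order is what lets every subscriber converge to the same state as the source, which is the timeline consistency discussed next.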

Guided by this trade-off, LinkedIn's engineering team implemented what they call timeline consistency (the eventual consistency of the nearline systems, explained below) together with the other two properties: availability and partition tolerance. The following sections describe LinkedIn's current data architecture.

The early architecture was clearly insufficient if LinkedIn's data architecture was to process millions of member-related transactions in under a second. The engineering team therefore devised a three-tier data architecture consisting of online, offline, and nearline data systems. Broadly, LinkedIn's data lives in the following systems (see the figure below):

  • RDBMS
    • Oracle
    • MySQL (the underlying storage for Espresso)
  • NoSQL
    • Espresso (a document-oriented NoSQL data store developed by LinkedIn)
    • Voldemort (a distributed key-value store)
    • HDFS (stores the data for Hadoop map-reduce jobs)
  • Caching
    • Memcached
  • Lucene-based indexes
    • Lucene indexes storing data for features such as search and the social graph
    • Indexes used by Espresso

 Figure: LinkedIn's database systems, including Databus, NoSQL stores, RDBMSs, and indexes.

The data stores mentioned above fall into three different types of systems, each explained in turn below.

Online Database Systems

The online systems handle real-time user interaction. The primary databases, such as Oracle, back user-facing write operations and a small share of reads. Taking Oracle as the example, the Oracle master performs all write operations. LinkedIn has recently been developing another data system, Espresso, to meet increasingly complex data requirements that an RDBMS such as Oracle seems unable to serve. Can they move all or most of the data out of Oracle and into a NoSQL data store like Espresso? Let's wait and see.
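As a rough picture of the write-to-master, read-from-elsewhere split this section describes, here is a minimal sketch. The class and method names are invented, and real replication is asynchronous rather than the instantaneous copy shown here:

```python
import random

class ReplicatedStore:
    """Toy primary/replica split: writes go to one primary, reads to replicas."""
    def __init__(self, replica_count: int = 2) -> None:
        self.primary: dict[str, str] = {}
        self.replicas = [dict() for _ in range(replica_count)]

    def write(self, key: str, value: str) -> None:
        self.primary[key] = value          # all writes hit the primary
        for replica in self.replicas:      # replication (instantaneous here)
            replica[key] = value

    def read(self, key: str):
        # Reads are spread across replicas to keep load off the primary.
        return random.choice(self.replicas).get(key)

store = ReplicatedStore()
store.write("member:42:title", "Data Engineer")
print(store.read("member:42:title"))
```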

Espresso is a horizontally scalable, document-oriented NoSQL data store that supports indexing, timeline consistency, and high availability. It is intended to replace the traditional Oracle databases that back the company's web operations, and it was originally designed to improve the availability of LinkedIn's InMail messaging service. The following applications currently use Espresso as their source-of-truth system. It is impressive that one NoSQL store handles the data needs of so many applications (a toy illustration of the document model follows the list):

  • Messages between members
  • Social activity, such as status updates
  • Article sharing
  • Member profiles
  • Company information
  • News articles
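As promised above, here is a toy illustration of the document model. Espresso actually exposes a richer interface with hierarchical keys; the DocumentStore class below only mimics the idea that related documents (for example, one member's mailbox) share a partition key and can be read together. All names are invented:

```python
class DocumentStore:
    """Toy document store: documents grouped under a partition key."""
    def __init__(self) -> None:
        self._partitions: dict[str, dict[str, dict]] = {}

    def put(self, partition: str, doc_id: str, doc: dict) -> None:
        self._partitions.setdefault(partition, {})[doc_id] = doc

    def get(self, partition: str, doc_id: str):
        return self._partitions.get(partition, {}).get(doc_id)

    def scan(self, partition: str) -> list[dict]:
        # Fetch every document under one partition key, e.g. a member's inbox.
        return list(self._partitions.get(partition, {}).values())

inbox = DocumentStore()
inbox.put("member:42", "msg:1", {"from": "member:7", "body": "Hi!"})
inbox.put("member:42", "msg:2", {"from": "member:9", "body": "Job offer"})
print(len(inbox.scan("member:42")))  # -> 2
```

Grouping one member's documents under a single partition key is what lets a feature like InMail read and update a whole mailbox in one place.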

Offline Database Systems

The offline system consists mainly of Hadoop and a Teradata data warehouse, used for batch processing and analytics. It is called offline because it operates on data in batches. Apache Azkaban manages the Hadoop and ETL jobs: these jobs pull data from the primary or source systems and hand it to map-reduce for processing. The results are stored in HDFS, and 'consumers' (for example, Voldemort) are then notified to pull the data and swap in the new index, ensuring that the latest data is served (a sketch of this build-and-swap handoff appears after the list below).

Nearline Database Systems (timeline consistency)

The goal of the nearline systems is timeline consistency (or eventual consistency). They back features like 'People You May Know' (a read-only dataset), search, and the social graph; the data behind these features is constantly updated, but their latency requirements are not as strict as those of the online systems. Below are several different types of nearline systems:
  • Voldemort is a key-value store that serves the read-only pages in the system. Voldemort's data comes from the Hadoop pipeline (with Azkaban orchestrating the map-reduce job execution plans). These are nearline systems in the sense that they obtain their data from an offline system such as Hadoop. The data on the following pages comes from Voldemort:
      • People You May Know
      • Viewers of this profile also viewed
      • Search
      • Jobs you may be interested in
      • Events you may be interested in
  • Several different indexes updated by Databus, the change data capture system:
      • The 'member search index' used by SeaS (Search as a Service). When you search for members on LinkedIn, the results come from this search index; the feature is especially helpful to recruiters.
      • The social graph index, which powers the display of members and relationships in a user's network. With this index, users see changes in their network in near real time.
      • Read replica sets serving member data, accessed by the 'standardization service'. A read replica set is a copy of the source database, kept in sync with updates to the source. The main reason for adding read replicas is that read queries can be spread across them, relieving pressure on the source database, which handles the user-initiated writes.
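The offline-to-nearline handoff described above, where a batch job produces a complete dataset and the serving store swaps it in, can be sketched as follows. Voldemort's read-only stores follow roughly this build-and-push pattern, but the ReadOnlyStore class, its method names, and the toy batch job are invented for illustration:

```python
class ReadOnlyStore:
    """Serves a derived dataset and atomically swaps in new batch output."""
    def __init__(self) -> None:
        self._live: dict[str, list[str]] = {}

    def get(self, key: str) -> list[str]:
        return self._live.get(key, [])

    def swap(self, new_dataset: dict[str, list[str]]) -> None:
        # Single reference assignment: readers see the old dataset or the
        # new one, never a half-updated mix.
        self._live = new_dataset

def batch_job() -> dict[str, list[str]]:
    # Stand-in for a map-reduce job computing "People You May Know".
    return {"member:42": ["member:7", "member:9"]}

pymk = ReadOnlyStore()
pymk.swap(batch_job())        # the 'consumer' pulls fresh data and swaps
print(pymk.get("member:42"))  # -> ['member:7', 'member:9']
```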
The following use case shows how change data capture events flow through Databus into the nearline systems. Suppose you update your profile with your latest skills and position, and you also accept a connection request. Here is what happens inside the system:

  • The updates are written to the Oracle master database.
  • Databus then does its wonderful work to achieve timeline consistency (a sketch of this fan-out follows the list):
      • It propagates the profile changes, such as the latest skills and position, to the standardization service.
      • It propagates the same changes to the search index service.
      • It propagates the connection change to the graph index service.
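To make those three hops concrete, here is a minimal sketch of one captured write fanned out to standardization, search, and graph consumers. The event shape, the consumer functions, and the standardization rule are all invented:

```python
def standardize(event: dict, standardized: dict) -> None:
    # Map the free-text title to a canonical one (toy rule, not LinkedIn's).
    title = event["profile"]["title"]
    standardized[event["member"]] = ("Software Engineer"
                                     if "engineer" in title.lower() else title)

def index_profile(event: dict, search_index: dict) -> None:
    # Make the updated skills and position searchable.
    search_index[event["member"]] = event["profile"]

def index_connections(event: dict, graph_index: dict) -> None:
    # Record the newly accepted connection in the graph index.
    graph_index.setdefault(event["member"], []).extend(event["new_connections"])

# One captured write from the Oracle master, delivered to all three consumers.
event = {
    "member": "member:42",
    "profile": {"title": "Sr. Engineer", "skills": ["Hadoop", "Kafka"]},
    "new_connections": ["member:7"],
}
standardized, search_index, graph_index = {}, {}, {}
for consumer, target in ((standardize, standardized),
                         (index_profile, search_index),
                         (index_connections, graph_index)):
    consumer(event, target)
print(standardized, graph_index)
```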

Data Architecture Lessons

If you want to design a data architecture that, like LinkedIn.com's, supports data consistency, high scalability, and high availability, consider the following lessons:
  • Database read/write splitting: plan for two classes of databases. The first is the 'source of truth' system; the second is a class of derived database systems that serve reads. The rule of thumb is to separate the databases taking user-initiated writes from the databases serving user reads.
  • Derived database systems: route user reads to derived databases or read replica sets. Derived database systems can be built on:
      • Lucene indexes
      • NoSQL data stores such as Voldemort, Redis, Cassandra, and MongoDB
  • For user reads, build indexes or key-value datasets (via Hadoop map-reduce or similar systems) from the primary source-of-truth database, and propagate user-initiated changes on the primary to those indexes and derived datasets as well.
  • To keep the derived data up to date, choose between application dual writes (the application layer writes to both the primary database and the derived systems) and log mining (a batch process reads the transaction commit logs of the primary data store); a small sketch contrasting the two follows this list.
  • When building derived data, you can run Hadoop-based map-reduce jobs, update HDFS, and then notify the derived data store (such as a Voldemort-style NoSQL store) to pull the data.
  • For data consistency, you can build these data repositories as distributed systems: every node in the cluster carries both master and slave roles, and all nodes host horizontally scalable data shards.
  • To maximize the uptime of these distributed data stores, use a cluster management tool such as Apache Helix.
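As referenced in the list above, here is a small sketch contrasting application dual writes with log mining. It is a toy comparison under invented names, not a real driver or CDC API:

```python
# Strategy 1: application dual writes -- the app writes both stores itself.
def dual_write(primary: dict, derived: dict, key: str, value: str) -> None:
    primary[key] = value
    derived[key] = value   # risk: if this write fails, the stores diverge

# Strategy 2: log mining -- a separate process tails the commit log,
# so the derived store can catch up even after a failure.
commit_log: list[tuple[str, str]] = []

def logged_write(primary: dict, key: str, value: str) -> None:
    primary[key] = value
    commit_log.append((key, value))

def replay(derived: dict, from_offset: int) -> int:
    for key, value in commit_log[from_offset:]:
        derived[key] = value
    return len(commit_log)   # new offset: resume here after a crash

primary, derived = {}, {}
logged_write(primary, "member:42:skill", "Hadoop")
offset = replay(derived, 0)
assert derived == primary
```

The trade-off this illustrates: dual writes are simple but can silently diverge on partial failure, while a mined commit log gives the derived store a durable position to resume from.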
References
  • Siddharth Anand, LinkedIn Data Infrastructure (paper)
  • https://github.com/linkedin/databus
  • http://gigaom.com/2013/03/03/how-and-why-linkedin-is-becoming-an-engineering-powerhouse/
  • http://highscalability.com/blog/2012/3/19/linkedin-creating-a-low-latency-change-data-capture-system-w.html
