Over the past few years, we have been rebuilding Digg's architecture, which we now call "Digg V4." This article gives an overview of Digg's systems and technologies and a look at what powers the Digg engine. First, let's look at the services Digg provides to its users:
A social news site, a personalized and customizable social news site for each user, an advertising platform, an API service, and blog and documentation sites.
People access these Digg services through a browser or other applications. Users with Digg accounts get a personalized view called "My News." The view that every user can see is called "Hot News." We have digg.com and its mobile version m.digg.com, services.digg.com for the API service, info.digg.com for information, and developers.digg.com for developers. Together these sites provide blogs and documentation for users, news publishers, and developers.
This article focuses on the high-level technology behind Digg's social news products.
What are we trying to do?
We are building a social news site based on stories submitted by users and ads submitted by advertisers.
Story Submission: Registered users submit stories. A story includes a title, a paragraph of description, a media type, a topic, and possibly a thumbnail. These fields are extracted from the article using a series of metadata protocols (the Facebook Open Graph protocol, oEmbed, and so on), and the submitter decides what the final values are before submitting. Advertisers submit their ads to a separate system, and an ad can also become a story if it is Dugg enough.
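As a rough illustration of the extraction step, here is a minimal sketch that reads Open Graph `<meta>` tags from a page using only the Python standard library. It is not Digg's code: the sample page, the `extract_story_fields` helper, and the mapping onto story fields are assumptions made for the example.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> tags from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property") or ""
        content = attr_map.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.properties.setdefault(prop[3:], content)


def extract_story_fields(html):
    """Map Open Graph properties onto hypothetical story fields."""
    parser = OpenGraphParser()
    parser.feed(html)
    og = parser.properties
    return {
        "title": og.get("title"),
        "description": og.get("description"),
        "media_type": og.get("type"),
        "thumbnail": og.get("image"),
    }


# Made-up page used only to exercise the parser.
SAMPLE_PAGE = """
<html><head>
<meta property="og:title" content="How Digg is Built" />
<meta property="og:type" content="article" />
<meta property="og:description" content="An overview of the systems." />
<meta property="og:image" content="http://example.com/thumb.png" />
</head><body></body></html>
"""

fields = extract_story_fields(SAMPLE_PAGE)
```

In a real submission flow the extracted values would only be suggestions, since the submitter can change them before the story is saved.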
Story Lists: In the personalized news product "My News," the stories from the users you follow are shown as "Story Lists," which can be arranged by most recent, media type, story topic, and more.
Story Actions: Users can act on stories by reading, clicking, Digging, burying, commenting, voting, and more. Users who are not registered or logged in can only read and click on stories.
Story Promotion: Every hour, we decide which stories move from the recent story list to the Hot News list. Our algorithm, which is confidential, chooses the stories to promote by looking at user behavior and the classification of each story's content.
How do we achieve it?
Let's take a high-level view of what happens when a user visits the Digg site and performs content-based operations. The picture below shows what the public sees, along with the pages, images, API requests, and other services provided internally.
The diagram above gives a brief picture of our internal systems. The API servers proxy requests to the internal back-end services. The front-end servers are virtually stateless (apart from some caching) and rely on the same service layer. The CMS and advertising systems are not described in detail in this article. Overall, the system can be broadly divided into two categories: synchronous and asynchronous.
1. Synchronous operations that respond to users in real time
Synchronous operations are immediate, fast responses to user requests. They include API requests as well as the asynchronous requests a page makes through AJAX. These operations should usually complete within one or two seconds at most.
2. Offline, bulk asynchronous computation
Besides requests that need a real-time response, there are bulk computational tasks that may be triggered indirectly by a user, but the user does not wait for them to finish. These asynchronous computations can take seconds, minutes, or even hours.
These two parts are shown below:
Here is a closer look at the individual components.
Online system
The programs that serve pages and API requests are written mainly in PHP (front-end web pages, Drupal CMS) and Python (API service, Tornado). The front end calls the back-end services (Python) via the Thrift protocol. Much of the data is cached in in-memory caching systems such as Memcached and Redis.
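One common way to put a cache in front of a back-end call is the cache-aside pattern, sketched below. This is an illustration under stated assumptions, not Digg's implementation: `InMemoryCache` is a dict-based stand-in for Memcached or Redis, and `fetch_story_from_backend` is a hypothetical placeholder for a Thrift call to a back-end service.

```python
import time


class InMemoryCache:
    """Stand-in for Memcached/Redis: maps key -> (value, expiry time)."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._items[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._items[key] = (value, time.monotonic() + ttl_seconds)


backend_calls = []  # records which story ids actually reached the back end


def fetch_story_from_backend(story_id):
    """Hypothetical back-end lookup (a Thrift call in the real system)."""
    backend_calls.append(story_id)
    return {"id": story_id, "title": "Story %d" % story_id}


def get_story(cache, story_id, ttl_seconds=60):
    """Cache-aside read: try the cache, fall back to the back end on a miss."""
    key = "story:%d" % story_id
    cached = cache.get(key)
    if cached is not None:
        return cached
    story = fetch_story_from_backend(story_id)
    cache.set(key, story, ttl_seconds)
    return story


cache = InMemoryCache()
first = get_story(cache, 42)   # miss: goes to the back end
second = get_story(cache, 42)  # hit: served from the cache
```

The second read never reaches the back end, which is the point of caching hot data in memory.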
Messages and events
The online and offline systems are connected synchronously through calls to the primary data stores and the transient / logging systems, and asynchronously through RabbitMQ, which is used as the queuing system. Operations that do not need a synchronous response are placed on a queue, either as events that have happened, such as "a user Dugg a story," or as jobs to perform, such as "calculate this thing."
Batch and asynchronous systems
The messaging system above refers to the queues; this section covers the part that actually executes the tasks taken from those queues. The system takes a task off the queue and performs some computation against the primary storage. The real-time systems and the asynchronous batch systems access the primary storage in the same way.
When a message is found in a queue, a "worker" is called to perform a specific action. Some messages are triggered by events, and others by a timer, a bit like the cron mechanism. The worker then reads and manipulates data in the primary storage or offline storage, records logs in HDFS, and writes the results back to the primary storage so that the online services can use them. Examples include indexing new stories, running the story promotion algorithm, and running analytics jobs.
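The queue-and-worker flow described above can be sketched with the Python standard library. This is only an illustration: `queue.Queue` stands in for a RabbitMQ queue, the `digg_counts` dict stands in for primary storage, and the event name `story_dugg` and its payload fields are made up for the example.

```python
import json
import queue
import threading

events = queue.Queue()  # stand-in for a RabbitMQ queue
digg_counts = {}        # stand-in for primary storage: story_id -> Digg count
STOP = object()         # sentinel that tells the worker to shut down


def publish(event_type, **payload):
    """Online side: enqueue an event and return without waiting for the work."""
    events.put(json.dumps({"type": event_type, "payload": payload}))


def handle_story_dugg(payload):
    """Worker action: update the stored count for the story."""
    story_id = payload["story_id"]
    digg_counts[story_id] = digg_counts.get(story_id, 0) + 1


HANDLERS = {"story_dugg": handle_story_dugg}


def worker():
    """Offline side: take messages off the queue and dispatch them."""
    while True:
        message = events.get()
        try:
            if message is STOP:
                return
            event = json.loads(message)
            handler = HANDLERS.get(event["type"])
            if handler is not None:
                handler(event["payload"])
        finally:
            events.task_done()


thread = threading.Thread(target=worker)
thread.start()

publish("story_dugg", story_id=7, user_id=1)
publish("story_dugg", story_id=7, user_id=2)
publish("story_dugg", story_id=9, user_id=1)

events.put(STOP)
thread.join()
```

The publisher never blocks on the computation; results only become visible to the online side once the worker has written them back to storage.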
Data storage
Digg stores data in different systems depending on the type of data and how it is used. Of course, sometimes the choice is simply due to historical reasons.
Cassandra: Information about "object-like" things such as stories, users, and Digg actions is stored in Cassandra. We use Cassandra 0.6, which does not support secondary indexes, so we build the indexes in the application layer and store them alongside the data. For example, our user data layer provides an interface for querying user information by username and email address, which lets services look up a user by username or email instead of by user ID. Here we use the Python Lazyboy wrapper.
HDFS: Logs of site and API events and of user activity are stored here. They are used mainly for log storage and analytical computation, using Hive on top of Hadoop to run MapReduce jobs.
MogileFS: A distributed file storage system, used to store binary files such as user avatars and screenshots. A CDN sits in front of the file store.
MySQL: At the moment, our story promotion feature uses MySQL to store the data used by the story promotion algorithm and its computed results, because this feature requires a large number of JOIN operations, which naturally do not suit the other types of data storage. HBase also looks like a good option to consider.
Redis: Because of Redis's high performance and flexible data structures, we use it to store each user's news data, which is different for every user and needs to be updated promptly. We also use Redis to provide the Digg Streaming API and the real-time view and click count services. As a memory-based system, it offers very low latency.
SOLR: Used to build the full-text indexing system, which provides full-text search over story content, topics, and so on.
Scribe: A log collection system, more powerful and simpler than syslog-ng. The logs it collects are put into HDFS for analysis and computation.
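The application-level secondary index mentioned for Cassandra can be sketched as follows. This is a minimal illustration, not the Lazyboy API or Digg's schema: plain dicts stand in for Cassandra column families, and the `UserStore` class and its method names are hypothetical.

```python
class UserStore:
    """Dict-backed stand-in for a primary table plus hand-maintained indexes."""

    def __init__(self):
        self.users = {}           # user_id -> user record (primary data)
        self.id_by_username = {}  # username -> user_id (index kept by the app)
        self.id_by_email = {}     # email -> user_id (index kept by the app)

    @staticmethod
    def _normalize_email(email):
        return email.strip().lower()

    def save(self, user_id, username, email):
        """Write the record and keep both indexes in step with it."""
        email = self._normalize_email(email)
        old = self.users.get(user_id)
        if old is not None:
            # Remove index entries that pointed at the previous values.
            self.id_by_username.pop(old["username"], None)
            self.id_by_email.pop(old["email"], None)
        self.users[user_id] = {"id": user_id, "username": username, "email": email}
        self.id_by_username[username] = user_id
        self.id_by_email[email] = user_id

    def get_by_username(self, username):
        user_id = self.id_by_username.get(username)
        return self.users.get(user_id) if user_id is not None else None

    def get_by_email(self, email):
        user_id = self.id_by_email.get(self._normalize_email(email))
        return self.users.get(user_id) if user_id is not None else None


store = UserStore()
store.save(1, "alice", "Alice@example.com")
found_by_email = store.get_by_email("alice@example.com")

# Renaming the user must also move the index entry.
store.save(1, "alice2", "alice@example.com")
old_name_lookup = store.get_by_username("alice")
new_name_lookup = store.get_by_username("alice2")
```

The cost of this design is that every write touches several places, so the application has to keep the index entries consistent with the record itself.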
Operating system and configuration
Digg currently runs on Debian-based GNU/Linux systems. Configuration is managed with Clusto and Puppet, and ZooKeeper is used for system coordination.
This article was translated by punctuation. Original link: http://about.digg.com/blog/how-digg-is-built