The story behind Facebook's search over one trillion posts

Source: Internet
Author: User
Keywords: cloud computing, big data, Facebook, Graph Search

A while ago, Facebook added post search to Graph Search. Facebook now generates about 1 billion posts a day, and the post index holds about 700 TB of data. At this scale, indexing the posts and building a real-time query system for them is a serious challenge. So how does Facebook deal with it?

Data collection

Facebook's underlying data schema is designed primarily for a rapidly iterated web service, and this was the biggest challenge in building a real-time query system. New features often require schema changes, and Facebook's culture is to make such changes as painless as possible for engineers. However, because wall posts, photos, check-ins, and other features each store their data differently, these variations make it harder to sort posts by time, place, and tag. Currently, about 70 kinds of data are sorted and indexed, many of them specific to particular post types. In addition, the data lives in MySQL databases that serve the production environment. Collecting it therefore adds significant load to databases that are serving production traffic at the same time, so the process must be monitored closely.
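As a rough illustration of why differing storage layouts complicate sorting, the sketch below maps three post types onto one common record before sorting by time. Every field name here (`created`, `upload_ts`, `venue`, and so on) is a hypothetical stand-in, not Facebook's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NormalizedPost:
    post_id: int
    created_at: int          # Unix timestamp in seconds
    place: Optional[str]
    tags: tuple


# Hypothetical raw layouts: each post type keeps time, place, and tags
# under different field names, so each needs its own adapter.
def normalize_wall_post(raw: dict) -> NormalizedPost:
    return NormalizedPost(raw["id"], raw["created"], raw.get("location"),
                          tuple(raw.get("tagged", ())))


def normalize_photo(raw: dict) -> NormalizedPost:
    return NormalizedPost(raw["photo_id"], raw["upload_ts"], raw.get("geo"),
                          tuple(raw.get("people", ())))


def normalize_check_in(raw: dict) -> NormalizedPost:
    return NormalizedPost(raw["checkin_id"], raw["time"], raw["venue"],
                          tuple(raw.get("with", ())))


NORMALIZERS = {
    "wall_post": normalize_wall_post,
    "photo": normalize_photo,
    "check_in": normalize_check_in,
}


def normalize(post_type: str, raw: dict) -> NormalizedPost:
    """Convert a raw record of any known post type to the common shape."""
    return NORMALIZERS[post_type](raw)
```

Once every record has the same shape, sorting by time, place, or tag is a single operation over `NormalizedPost` values, and a new post type only needs one new adapter.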

Index building

After the data is collected, we store it in an HBase cluster and then run Hadoop MapReduce jobs that build the index in a highly parallel way. This step converts the raw data into a search index that works with Unicorn, the search infrastructure. We divide the data into two parts: the document data and the inverted index. The document data for each post contains the information later used for ranking. The inverted index holds what is traditionally thought of as a search index. To build it, you traverse every post and determine which hypothetical search filters it matches.
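The split between document data and inverted index can be sketched in a few lines. This is a single-process toy under assumed names, not the Hadoop job or Unicorn's format: the filter vocabulary (`author:`, `word:`, `place:`, `tag:`) and the post fields are invented for illustration.

```python
from collections import defaultdict


def matching_filters(post: dict) -> set:
    """Return the (hypothetical) search filters that this post satisfies."""
    terms = {f"author:{post['author']}"}
    terms.update(f"word:{w.lower()}" for w in post["text"].split())
    if post.get("place"):
        terms.add(f"place:{post['place']}")
    terms.update(f"tag:{t}" for t in post.get("tags", ()))
    return terms


def build_index(posts):
    """Traverse every post once, producing document data and an inverted index."""
    documents = {}                  # post id -> data kept for ranking later
    inverted = defaultdict(list)    # filter term -> ids of matching posts
    for post in posts:
        documents[post["id"]] = {
            "created_at": post["created_at"],
            "likes": post.get("likes", 0),
        }
        for term in matching_filters(post):
            inverted[term].append(post["id"])
    for ids in inverted.values():
        ids.sort()
    return documents, dict(inverted)


def search(inverted: dict, *terms: str) -> list:
    """Return ids of posts that match all of the given filter terms."""
    postings = [set(inverted.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []
```

A query such as `search(inverted, "author:alice", "word:pizza")` then intersects two posting lists instead of scanning every post.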

Index update

To update the index, we use a technology called Wormhole to subscribe to changes in the MySQL databases. Whenever a new post is created, an existing post is modified or deleted, or data related to a post is edited, we schedule an update for the corresponding post. To reduce code duplication, updates use the same logic as described in the "Data collection" section. The difference is that we deliberately bypass the cache when collecting data, because we do not want to fill it with old data that is unlikely to be requested. When we update the index, we do hit the cache, because we expect the data to have been accessed recently and to still be cached.
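The shared-logic idea, with the cache as the only difference between bulk collection and incremental updates, might look like the sketch below. `PostStore` is a toy stand-in for the production database and its cache, and none of these names reflect Wormhole's real interface.

```python
class PostStore:
    """Toy stand-in for the production database plus its cache."""

    def __init__(self):
        self.database = {}
        self.cache = {}
        self.db_reads = 0

    def write(self, post_id, text):
        # Assumption: production writes refresh the cache (write-through).
        self.database[post_id] = text
        self.cache[post_id] = text

    def delete(self, post_id):
        self.database.pop(post_id, None)
        self.cache.pop(post_id, None)

    def read(self, post_id, use_cache):
        if use_cache and post_id in self.cache:
            return self.cache[post_id]
        self.db_reads += 1               # fall through to the database
        return self.database.get(post_id)


def index_post(index, store, post_id, use_cache):
    """Shared logic: (re)index one post, or drop it if it no longer exists."""
    text = store.read(post_id, use_cache)
    if text is None:
        index.pop(post_id, None)
    else:
        index[post_id] = text.lower().split()


def bulk_collect(index, store):
    # Initial harvest: bypass the cache so old posts do not pollute it.
    for post_id in list(store.database):
        index_post(index, store, post_id, use_cache=False)


def apply_change(index, store, post_id):
    # Change notification: the post was just touched, so the cache is likely warm.
    index_post(index, store, post_id, use_cache=True)
```

Both paths go through `index_post`, so a fix to the indexing logic applies to the initial harvest and to live updates alike.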

Index storage

The post index is much larger than the other search indexes Facebook maintains. Before post search, all of Facebook's search indexes were held entirely in RAM. This is ideal for fast lookups and is feasible for reasonably small indexes. However, holding more than 700 TB in RAM carries a very large overhead, because the index has to be spread across many machines, and coordinating those machines so that they work together imposes a heavy performance cost. This forced the Unicorn team to look for a new way to store the post index. The solution we settled on stores most of the index on solid-state flash and keeps the most frequently accessed data structures in RAM, which preserves performance.
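A minimal sketch of the RAM-plus-flash split follows, assuming the hot set is chosen by access counts and using an ordinary temporary file as a stand-in for flash storage. The class name, the `ram_slots` parameter, and the JSON-lines file layout are all invented for illustration and say nothing about Unicorn's real on-disk format.

```python
import json
import os
import tempfile


class TieredIndex:
    """Keep the hottest posting lists in RAM and the rest in a file."""

    def __init__(self, postings: dict, access_counts: dict, ram_slots: int):
        by_heat = sorted(postings, key=lambda t: access_counts.get(t, 0),
                         reverse=True)
        hot = set(by_heat[:ram_slots])
        self.ram = {t: postings[t] for t in hot}
        self.offsets = {}            # term -> byte offset in the cold file
        fd, self.path = tempfile.mkstemp(suffix=".idx")
        with os.fdopen(fd, "wb") as f:
            for term, ids in postings.items():
                if term in hot:
                    continue
                self.offsets[term] = f.tell()
                f.write(json.dumps(ids).encode() + b"\n")

    def lookup(self, term: str):
        """Return (posting list, tier), where tier is 'ram', 'flash', or 'miss'."""
        if term in self.ram:
            return self.ram[term], "ram"
        if term in self.offsets:
            with open(self.path, "rb") as f:
                f.seek(self.offsets[term])
                return json.loads(f.readline()), "flash"
        return [], "miss"

    def close(self):
        os.remove(self.path)
```

Only the small offset table and the hot posting lists occupy memory. A cold lookup costs one seek and one read, which is the kind of access pattern flash handles far better than spinning disks.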

Result ranking

With 1 trillion posts in the index, most queries return more results than anyone can read, so Facebook had to rank the results. To bring valuable and relevant content to the top, two main techniques are used: query rewriting and dynamic result scoring. Query rewriting happens before the query executes and adds optional clauses that steer the results toward what is more valuable to the user. Result scoring sorts and selects documents based on a set of ranking features extracted from the document data. At present, 100 features in total are extracted and combined with a ranking model to find the best results. As the number of users grows and more feedback comes in, the ranking model will be refined further.
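The two techniques can be illustrated with a toy ranker. The three features, their weights, and the "boost posts by friends" rewrite rule are assumptions made up for this sketch. The real system combines about 100 features with a ranking model that is not described here, so a fixed linear weighting is used only as a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    required: set                                 # terms every result must match
    optional: set = field(default_factory=set)    # terms that only raise the score


def rewrite(query: Query, viewer_friends) -> Query:
    """Query rewriting: add optional clauses before execution."""
    boosts = {f"author:{friend}" for friend in viewer_friends}
    return Query(set(query.required), query.optional | boosts)


def features(doc: dict, query: Query, now: int) -> dict:
    """Extract ranking features from the document data."""
    age_days = (now - doc["created_at"]) / 86400
    return {
        "recency": 1.0 / (1.0 + age_days),
        "likes": min(doc["likes"], 100) / 100,
        "optional_match": len(query.optional & doc["terms"])
                          / max(len(query.optional), 1),
    }


# Placeholder linear model; the weights are arbitrary.
WEIGHTS = {"recency": 0.5, "likes": 0.2, "optional_match": 0.3}


def rank(docs, query: Query, now: int, limit: int = 10) -> list:
    """Result scoring: keep matching documents and return the top ids."""
    def score(doc):
        values = features(doc, query, now)
        return sum(WEIGHTS[name] * value for name, value in values.items())

    candidates = [d for d in docs if query.required <= d["terms"]]
    best = sorted(candidates, key=score, reverse=True)[:limit]
    return [d["id"] for d in best]
```

The optional clauses never exclude a document. They only shift its score, which is what lets a rewrite bias results without shrinking the result set.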

Brief history of the project

Like many other Facebook products, post search was born as a hackathon project. Over the past year, dozens of people on the Graph Search team have implemented most of post search: infrastructure, ranking, and product.
