How does the Pixable architecture support 20 million pictures per day?


Introduction: Following the light-blogging service Tumblr, Pixable is becoming another hot social media property: a photo-sharing hub. Pixable automatically collects your photos from Facebook and Twitter, adding up to 20 million new images a day. How does it crawl, store, and analyze this explosion of data? Pixable CTO Alberto Lopez Toledo and VP of Engineering Julio Viera describe the system architecture in detail.

Pixable aggregates your photos from a variety of social platforms so you never miss an important moment. Currently, Pixable handles more than 20 million new photos each day: crawling, analyzing, and classifying them, and ranking them against the more than 5 billion photos already in the system. Making sense of all this data is a major challenge, and two problems stand out:

1. How to fetch millions of photos per day from Facebook, Twitter, Instagram, and other services in the most efficient way.

2. How to process, organize, index, and store the metadata associated with those photos.

Pixable's infrastructure is, of course, constantly evolving, but we have learned a lot over the past year. As a result, we have been able to build a scalable infrastructure out of the tools, languages, and cloud services we use today (all 80 of our servers run on AWS). This article briefly describes those lessons.

Back-end architecture - where everything happens

Infrastructure - Amazon EC2, our favorite

All of our servers run on Amazon EC2, using CentOS Linux instances ranging from t1.micro to m2.2xlarge. After configuring a server, we capture it as an internal AMI.

We are ready to bring up new instances at any moment to absorb a sudden increase in load, so we always keep just enough capacity running to meet our baseline performance targets.

To cope with these fluctuations, we developed our own auto-scaling technology, which predicts how many instances each service will need at a given time based on current and historical load. We then launch or terminate instances to keep exactly the right amount of resources available, so we avoid paying for servers we do not need. Auto-scaling on Amazon is not easy, because there are many variables to take into account.

For example, it makes no sense to terminate an instance that has only been running for half an hour, because Amazon bills by the full hour. Likewise, Amazon can take more than 20 minutes to start a new instance. So for sudden traffic spikes we schedule instance launches intelligently, starting some immediately and deferring others into the next billing hour, to squeeze every bit of performance out of what we have already paid for. It is a bit like the movie "Moneyball", only with virtualized server instances instead of baseball players.
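To make the billing-hour constraint concrete, here is a minimal sketch of the kind of scale-down check such a system could perform (the function name, thresholds, and inputs are assumptions for illustration, not Pixable's actual auto-scaling code):

```php
<?php
// Illustrative sketch of a billing-hour-aware scale-down check, assuming
// EC2's classic whole-hour billing. Thresholds are made up for the example.

function shouldTerminate($launchTime, $predictedDemand, $runningInstances)
{
    // Never scale below the number of instances we expect to need next.
    if ($runningInstances <= $predictedDemand) {
        return false;
    }

    // Minutes already consumed inside the current (already paid) hour.
    $minutesIntoHour = (int) floor((time() - $launchTime) / 60) % 60;

    // The hour is paid for either way, so only terminate near the hour
    // boundary; until then, keep the instance around as free headroom.
    return $minutesIntoHour >= 55;
}

// Example: launched 35 minutes ago, 4 running, 3 predicted to be needed.
var_dump(shouldTerminate(time() - 35 * 60, 3, 4)); // bool(false) - keep it
```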

Our web site currently runs on Apache + PHP 5.3 (we are gradually moving servers to nginx + php-fpm, which will become our standard configuration). Traffic is spread evenly across availability zones behind Amazon's Elastic Load Balancing, so we can absorb Amazon downtime and price fluctuations. Static content is served from Amazon CloudFront, and Amazon Route 53 provides DNS. Yes, we love Amazon.

Work queues - crawling and ranking photos, sending notifications, and more

Virtually all of Pixable's processing is asynchronous (crawling photos from different users' Facebook accounts, sending notifications, computing per-user rankings, and so on). Dozens of servers are dedicated to fetching photo metadata from the different services and processing it, and that work runs continuously, day and night.

As you would expect, we have many different types of jobs. Some are high priority, such as real-time user requests, messages, and fetching photos for users who are currently active; low-priority work includes crawling photos for offline users and long-running data-enrichment tasks. Although we use the very capable beanstalkd as our work queue service, we built our own management framework on top of it. We call it Auto-Pilot: it manages priorities automatically, giving system resources to high-priority jobs and pausing low-priority ones.
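For a concrete feel of how job priorities are expressed at the queue level, here is a minimal producer sketch using beanstalkd with the Pheanstalk PHP client (tube name, payload fields, and priority values are invented for illustration; this is not the Auto-Pilot code, and the API shown is the Pheanstalk 4.x style). In beanstalkd, a lower priority number is served first:

```php
<?php
// Illustrative producer: enqueue crawl jobs with different priorities.
// Assumes the Pheanstalk client (composer require pda/pheanstalk).
require 'vendor/autoload.php';

use Pheanstalk\Pheanstalk;

$queue = Pheanstalk::create('127.0.0.1');
$queue->useTube('crawl');

// High priority: a user who is active right now (lower number = sooner).
$queue->put(
    json_encode(array('user_id' => 42, 'source' => 'facebook')),
    100,  // priority
    0,    // delay in seconds
    120   // time-to-run before the job is released back to the queue
);

// Low priority: background crawl for an offline user.
$queue->put(
    json_encode(array('user_id' => 99, 'source' => 'twitter')),
    4000, 0, 120
);
```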

We developed fairly elaborate priority rules that take both system performance and user experience into account. Some of the metrics are easy to measure, such as the average wait time of a job or the replication lag between our master and slave servers. Others are more complex, such as the state of our own PHP-based distributed mutex/semaphore environment. We try to strike the best possible trade-off between performance and efficiency.
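One simple ingredient of such rules is aging: a job's effective priority improves the longer it has been waiting, so low-priority work still completes under sustained load. A sketch (the weights are invented for illustration, not Auto-Pilot's actual rules):

```php
<?php
// Illustrative aging rule: boost (lower) a job's priority as it waits,
// capped so background work can never outrank real-time work.
function effectivePriority($basePriority, $enqueuedAt)
{
    $waitedMinutes = max(0, (time() - $enqueuedAt) / 60);

    // Improve the priority by 10 per minute waited, by at most 1000 total.
    $boost = min(1000, (int) ($waitedMinutes * 10));

    return max(0, $basePriority - $boost);
}

// A low-priority job (4000) that has waited two hours behaves like 3000.
echo effectivePriority(4000, time() - 2 * 3600); // 3000
```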

Crawl engine - fetching new photos from Facebook, Twitter, and others, 24x7

We continually improve our crawling technology. It is a complex parallel algorithm that uses a mutual-exclusion lock library we wrote to synchronize all of the processes working on a given user. This algorithm has made crawling Facebook at least five times faster, and we now comfortably fetch more than 20 million new photos every day. That is quite remarkable, considering that a single large query against the Facebook API can take several seconds. We describe the crawl engine in more depth in the follow-up piece after the main article.
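The key idea is that only one process works on a given user at a time. One common way to build such a per-user mutex is with Memcached's atomic add(); the sketch below uses that approach (key names and timeouts are assumptions for illustration, and Pixable's actual lock library is not public):

```php
<?php
// Illustrative per-user crawl lock built on Memcached's atomic add().
class UserCrawlLock
{
    private $memcached;

    public function __construct(Memcached $memcached)
    {
        $this->memcached = $memcached;
    }

    // add() succeeds only if the key does not already exist, so exactly
    // one process wins; the TTL guards against crashed workers.
    public function acquire($userId, $ttlSeconds = 300)
    {
        return $this->memcached->add("crawl_lock:$userId", getmypid(), $ttlSeconds);
    }

    public function release($userId)
    {
        $this->memcached->delete("crawl_lock:$userId");
    }
}

// Usage inside a crawl worker:
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);
$lock = new UserCrawlLock($mc);

if ($lock->acquire(42)) {
    // ... fetch photo metadata for user 42 from Facebook here ...
    $lock->release(42);
} else {
    // Another process is already crawling this user; skip or retry later.
}
```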

Data storage - indexing photos and metadata

Currently, 90% of our data is stored in MySQL (with a distributed cache layer on top), spread across two groups of servers. The first group is a 2-master, 2-slave setup that stores normalized data such as user information, global category settings, and other system parameters.

The second group consists of manually sharded servers that store data about users' photos, such as photo URLs. This data is highly denormalized, to the point where we effectively store it in a NoSQL-like fashion inside MySQL tables (NoSQL-in-MySQL), much as a document store such as MongoDB would. Which explains where the other 10% of our data lives: in MongoDB! We are gradually moving parts of our data to MongoDB, mainly because of its simplicity and flexibility and its built-in sharding and replication.
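As a rough sketch of what manual sharding plus "NoSQL-in-MySQL" can look like (the table layout, shard map, and field names are invented for illustration; Pixable's real schema is not public), a user's photo metadata is routed to a shard by user ID and stored as one denormalized JSON blob per photo:

```php
<?php
// Illustrative only: route a user's photo metadata to a MySQL shard and
// store it as a denormalized JSON document ("NoSQL-in-MySQL").

// Hypothetical shard map: user IDs are routed onto these DSNs.
$shards = array(
    array('dsn' => 'mysql:host=photos-shard-1;dbname=photos'),
    array('dsn' => 'mysql:host=photos-shard-2;dbname=photos'),
);

function shardFor($userId, array $shards)
{
    // Simple modulo routing; range maps or lookup tables are also common.
    return $shards[$userId % count($shards)];
}

function storePhoto(PDO $db, $userId, $photoId, array $meta)
{
    // One row per photo; everything beyond the keys lives in a JSON blob.
    $stmt = $db->prepare(
        'REPLACE INTO user_photos (user_id, photo_id, meta) VALUES (?, ?, ?)'
    );
    $stmt->execute(array($userId, $photoId, json_encode($meta)));
}

// Usage sketch:
$shard = shardFor(42, $shards);
$db = new PDO($shard['dsn'], 'app', 'secret');
storePhoto($db, 42, 'fb_10001', array(
    'url'    => 'https://example.com/photo.jpg',
    'source' => 'facebook',
    'taken'  => '2012-05-01 12:00:00',
));
```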

Logging, profiling, and analysis

We developed a highly flexible logging and profiling framework that lets us log events at fine granularity, down to a single line of code. Every log event is categorized with tags that can be queried later (for example, all events for user X, or all events in module Y). Just as important, we can profile any set of log events on the fly and build real-time performance analysis of the whole system. The logging and profiling system puts very heavy pressure on the storage layer (thousands of updates per second), so we combine two levels of MySQL tables: a fast in-memory buffer table that acts as a bucket for live data, and a set of regular tables that the data is flushed to asynchronously a short time later. This architecture handles more than 15,000 log entries per second. We also have our own event tracking system that records every user action, from logins to shares to individual clicks, so we can later query and analyze these complex sequences of behavior.
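The two-level table idea can be sketched as follows (table names, engines, and the flush cadence are assumptions for illustration, not Pixable's actual schema): writes hit a small MEMORY-engine buffer table, and a background job periodically moves rows into a durable InnoDB table.

```php
<?php
// Illustrative two-level logging: fast writes into a MEMORY buffer table,
// periodic asynchronous flush into a durable InnoDB table.
//
//   CREATE TABLE log_buffer  (ts INT, tag VARCHAR(64), msg VARCHAR(255)) ENGINE=MEMORY;
//   CREATE TABLE log_archive (ts INT, tag VARCHAR(64), msg VARCHAR(255)) ENGINE=InnoDB;

function logEvent(PDO $db, $tag, $msg)
{
    // Hot path: a single cheap insert into the in-memory buffer.
    $stmt = $db->prepare('INSERT INTO log_buffer (ts, tag, msg) VALUES (?, ?, ?)');
    $stmt->execute(array(time(), $tag, $msg));
}

function flushLogs(PDO $db)
{
    // Background job (e.g. run every few seconds): move everything from the
    // buffer into the archive. LOCK TABLES keeps the copy + delete
    // consistent, since MEMORY tables are not transactional.
    $db->exec('LOCK TABLES log_buffer WRITE, log_archive WRITE');
    $db->exec('INSERT INTO log_archive SELECT * FROM log_buffer');
    $db->exec('DELETE FROM log_buffer');
    $db->exec('UNLOCK TABLES');
}
```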

We also rely on the excellent Mixpanel service for higher-level analytics and reporting.

Front ends - simple visualization devices

Pixable runs on a variety of front-end devices, such as the iPhone and iPad. We also have a web site and a mobile web site; both load only a simple skeleton page, and everything else is done on the client side with jQuery and Ajax. Soon all of our web front ends will run from a single code base that automatically adapts to mobile or desktop screens (try http://new.pixable.com). That way we can reuse the same code on the main web site as well as in extensions for Android devices and the Chrome browser. In fact, our Android app simply wraps the mobile web front end: it provides a minimal native frame and displays the contents of the mobile web site inside it.

It may sound a bit harsh, but all of our front ends are "dumb". All the heavy lifting is done in the back end, reached through our own private API. This lets us develop and deploy quickly without forcing the existing user base to change anything. Flexible, baby!

API - connecting our front ends and back end

The API is the glue that makes everything work together. We developed our own private RESTful API in PHP to expose our back-end functionality, and we built versioning into it from the start. When we develop new features or change the response format of an API call, the change applies only to clients that declare support for that API version; we keep all older versions of the API alive, so older mobile clients never have to worry about compatibility issues.
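As a minimal sketch of version-aware responses (the endpoint, field names, and version numbers are invented for illustration; this is not Pixable's actual API code), the client declares the API version it supports and the server shapes the response accordingly:

```php
<?php
// Illustrative only: a client-declared API version selects the response
// shape, so old clients keep getting the format they were built against.

function formatPhoto(array $photo, $apiVersion)
{
    if ($apiVersion < 2) {
        // Legacy shape: flat URL field, as older mobile builds expect.
        return array('id' => $photo['id'], 'url' => $photo['url']);
    }

    // Newer shape: nested URLs plus ranking metadata.
    return array(
        'id'    => $photo['id'],
        'urls'  => array('original' => $photo['url'], 'thumb' => $photo['thumb']),
        'score' => $photo['score'],
    );
}

// The client sends its supported version, e.g. GET /photos?api_version=2
$apiVersion = isset($_GET['api_version']) ? (int) $_GET['api_version'] : 1;

$photo = array(
    'id'    => 'fb_10001',
    'url'   => 'https://example.com/p.jpg',
    'thumb' => 'https://example.com/p_t.jpg',
    'score' => 0.87,
);

header('Content-Type: application/json');
echo json_encode(formatPhoto($photo, $apiVersion));
```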

In the end, the API simply connects our front ends to our back end and does no real work itself. The back end is the core of Pixable.
