Design and implementation of trending hashtags

Source: Internet
Author: User
Tags: natural logarithm, postgres, database

Original link: Trending at Instagram
Translator: Jay Micro Magazine-Zhang Di

With the launch of the "Search and Explore" feature last week, we introduced a new capability: making it easy to find interesting moments on Instagram as they happen. The trending hashtags and trending places in Explore surface the best and most popular content from around the world, which may come from places and accounts you don't yet follow. Building a system that analyzes the more than 70 million new photos shared each day by 200 million people is undoubtedly a challenge. Let's look at how we identify, rank, and present the best trending content on Instagram.

Definition of a Trend


Intuitively, a trending hashtag is one being used more than usual, typically around a specific event happening at a point in time. For example, people don't normally post photos of the northern lights, but on the day we launched the feature, a group of people shared some amazing photos using the #northernlights tag. The chart below shows how usage of this tag spiked over time.

When we wrote this post, #equality was the hottest trending hashtag on Instagram.


Similarly, whenever a large number of people share photos or videos taken at the same place around the same time, that place becomes a trending location. When we wrote this article, the U.S. Supreme Court was trending, as hundreds of people were there sharing photos of demonstrations supporting the same-sex marriage decision.


From the examples above, we can see that a trend has three defining elements:


Popularity: a trend should be something that many people in our community are interested in.


Novelty: a trend should be about something new. People were not posting about it before, or not in large numbers.


Timeliness: the trend should surface on Instagram while the real-world event is happening.


In this article, we discuss the algorithms and systems we use to identify, rank, and present trending content.

Identifying a Trend


Identifying a trend requires us to quantify the difference between the observed activity (the number of photos and videos shared) and the activity we expected. In plain terms, if the observed activity is much higher than expected, we can be confident that something is trending, and we can rank trends by how far they exceed expectations.

Let's go back to the #equality example from the beginning of the article. Usually we observe only a few photos and videos per hour using this tag, but starting around 7:00 am PT (Pacific Time), thousands of people began sharing content with the #equality tag. This means #equality was far more active than we expected. In contrast, more than 100,000 photos and videos are tagged with #love every day, because it is an extremely popular hashtag. Even if we observed more than 10,000 #love posts in a day, that would not be enough to exceed our expectations given its historical average.


For each hashtag and location, we store some data: how many times the tag or location was used in each five-minute window over the last seven days. For simplicity, let's consider only hashtags for now, and let C(h,t) be the count for hashtag h at time t (that is, the number of uses of the tag in the five-minute window from t-5min to t). Because this count varies enormously between tags over time, we normalize it by computing the probability P(h,t) of the hashtag h at time t. Given the historical counts of a tag (a time series), we can build a model that predicts the expected count C'(h,t), and likewise the expected probability P'(h,t). Given these two values for each tag, a common metric for the difference between two probabilities is the KL divergence, which in our case is computed as:

S(h,t) = P(h,t) * ln( P(h,t) / P'(h,t) )


In essence, this score combines the popularity we currently observe, reflected by the probability P(h,t), with the novelty, measured by the ratio of what we observe to what we expect, P(h,t)/P'(h,t). The natural logarithm (ln) moderates the "strength" of the novelty term so that it is comparable to the popularity term. Timeliness is accounted for by the time parameter t, since the score is computed over counts from the most recent observation window.
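To make the score concrete, here is a minimal Python sketch of the formula above. The function name and the example numbers are hypothetical; they only illustrate how the popularity and novelty terms interact.

```python
import math

def trend_score(observed_count, observed_total, expected_count, expected_total):
    """S(h,t) = P(h,t) * ln(P(h,t) / P'(h,t))."""
    p = observed_count / observed_total           # P(h,t): observed probability
    p_expected = expected_count / expected_total  # P'(h,t): expected probability
    if p == 0 or p_expected == 0:
        return 0.0  # the zero-baseline case is handled separately (see below)
    return p * math.log(p / p_expected)

# A rarely used tag that suddenly makes up a large share of posts scores high:
print(trend_score(3000, 100_000, 5, 100_000))      # ~0.19
# An always-popular tag barely exceeds its baseline and scores near zero:
print(trend_score(4500, 100_000, 4200, 100_000))   # ~0.003
```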

A Prediction Problem


How do we calculate the expected baseline probability P'(h,t) from past observations?


Several factors trade off against each other here: the accuracy of the estimate versus the time-and-space complexity of computing it. Usually these are at odds: the more accurate you want to be, the more complex the algorithm must be. We tried different approaches, such as using the value from the same time last week, regression models, and even neural networks. It turned out that the fancy approaches were more accurate but the simple one worked better in practice, so we ended up choosing the maximal probability over the past week's worth of measurements as the baseline (see the sketch after the list below). Why is this a good choice?


① It is easy to compute and its memory requirements are relatively low.


② It is very effective at suppressing tags with high variance that are not actually trending.


③ It identifies new trends quickly.
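As a rough illustration of that choice, the baseline can be taken as the maximal per-interval probability seen over the past week. This is only a sketch; the function name and data layout are assumptions, not the production implementation.

```python
def baseline_probability(weekly_counts, weekly_totals):
    """P'(h,t): the maximal probability observed for the tag over the past week.

    weekly_counts[i] -- uses of the tag in interval i of the past week
    weekly_totals[i] -- total tagged posts in interval i of the past week
    """
    return max(
        (count / total for count, total in zip(weekly_counts, weekly_totals) if total > 0),
        default=0.0,
    )
```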


There are two things we simplified in this explanation, so let's refine the model.


First, although some tags are very popular, most are not, and their five-minute counts are very low or zero. So for historical counts we keep hour-long intervals instead, since five-minute granularity is not needed to compute the baseline probabilities. We also look at data across several hours, which minimizes the noise caused by random usage spikes. There is a trade-off between gathering enough data and discovering trends quickly: the longer the time window, the more data we get, but the slower we are to detect a trend.


Second, if the predicted baseline P'(h,t) is zero even after accumulating a few hours of data, we cannot compute the KL divergence (the denominator would be 0). We handle this with a simple rule: if we have not seen any posts with a given tag in the past, we act as if it had appeared in three posts during that period. Why three? Because most tags never reach three posts per hour, we don't store counts below that threshold at all, which saves a great deal of memory (> 90%); we only keep counts for tags that are used in at least three posts per hour.
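A minimal sketch of that storage rule, assuming each post carries a list of tags; the threshold matches the text above, but the function names and data shapes are illustrative.

```python
from collections import Counter

MIN_POSTS_PER_HOUR = 3  # counts below this threshold are not stored at all

def store_hourly_counts(posts_in_hour):
    """Keep only tags used in at least MIN_POSTS_PER_HOUR posts this hour."""
    counts = Counter(tag for post in posts_in_hour for tag in post["tags"])
    return {tag: n for tag, n in counts.items() if n >= MIN_POSTS_PER_HOUR}

def historical_count(stored_counts, tag):
    """A tag missing from storage is treated as if it appeared 3 times,
    so the baseline probability P'(h,t) can never be zero."""
    return stored_counts.get(tag, MIN_POSTS_PER_HOUR)
```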

Ranking and Blending


The next step is to rank tags by how strongly they are trending. We do this by gathering all candidate tags for a country/language pair (currently launched in the United States) and ordering them by their KL divergence score S(h,t). We noticed, however, that some trends tend to disappear faster than the interest around them. For example, a tag with a large burst of posts is trending at that moment, but the volume naturally drops once the event ends. Its KL divergence score then falls quickly and the tag stops trending, even though people still like to browse its photos and videos for a few hours after the event is over.


To overcome this, we use an exponential decay function to define how long a past trend survives. For each trend we track its maximum KL score Smax(h) and the time tmax at which it occurred, i.e. S(h,tmax) = Smax(h). At any given moment we then compute the exponentially decayed value of Smax(h) for each candidate tag and use it alongside its most recent KL score for ranking:


Sd(h,t) = Smax(h) * (1/2) ^ ((t - tmax) / half-life)


We set the decay parameter half-life to two hours, meaning that Smax(h) is halved every two hours. This way, if a tag or place was a big trend a few hours ago, it can still appear in the trending list, displayed side by side with the newest trends.
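A minimal sketch of the decayed score; the two-hour half-life comes from the text above, and everything else (names, example values) is illustrative.

```python
def decayed_score(s_max, t_hours, t_max_hours, half_life_hours=2.0):
    """Sd(h,t) = Smax(h) * (1/2) ** ((t - tmax) / half-life)."""
    return s_max * 0.5 ** ((t_hours - t_max_hours) / half_life_hours)

# A trend that peaked 4 hours ago still keeps a quarter of its peak score:
print(decayed_score(s_max=0.19, t_hours=16.0, t_max_hours=12.0))  # 0.0475
```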

Grouping Similar Trends


People often use different hashtags to describe the same event, and when the event is popular, several of those tags may trend at once. Because showing multiple tags that all describe the same event makes for a poor user experience, we group tags into "concepts".


For example, consider all the tags used together with #equality. It turns out that #equality is consistently used together with #lovewins, #love, #pride, #lgbt, and many other popular tags. By grouping these tags together, we can show #equality as the trend and filter out the other, near-duplicate tags until we reach the next genuinely different trending tag.


There are two important tasks here: first, figuring out which tags are talking about the same thing, and second, finding the tag that best represents the group. There are also two challenges: we need to capture the similarity between tags, and we have to group them in an unsupervised way, meaning that at any moment we don't know in advance how many clusters there should be.


We use the following signals to measure similarity between tags:


Co-occurrences: sets of tags that people tend to use together, such as #fashionweek, #dress, and #model. Co-occurrence is computed by looking at recent posts and counting how often each tag appears alongside every other tag.


Edit distance: different spellings (or misspellings) of the same tag, such as #valentineday and #valentinesday, rarely co-occur because people seldom use both in the same post. A string-similarity measure catches these spelling variations.


Topic distribution: tags that describe the same thing, such as #gocavs and #gowarriors, can be spelled completely differently and still rarely co-occur. We look at the captions that use these tags and run them through an internal tool that classifies text into a set of predefined topics. For each tag, we build a topic distribution (aggregated from all the captions in which it has appeared) and normalize it with TF-IDF.


Our tag-grouping process computes these similarity measures for each pair of trending tags and decides which pairs are similar enough to be considered the same. During the merge process, entire clusters of tags are also merged if they are sufficiently similar.
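The sketch below shows one way such pairwise similarity and greedy merging could look, combining co-occurrence with string similarity; the threshold, blending rule, and helper names are hypothetical and not Instagram's production values.

```python
from difflib import SequenceMatcher

def pair_similarity(tag_a, tag_b, cooccurrence, counts):
    """Blend a co-occurrence score with a spelling-similarity score."""
    together = cooccurrence.get(frozenset((tag_a, tag_b)), 0)
    co_score = together / max(1, min(counts[tag_a], counts[tag_b]))
    spelling = SequenceMatcher(None, tag_a, tag_b).ratio()
    return max(co_score, spelling)

def group_trends(tags, cooccurrence, counts, threshold=0.7):
    """Greedily merge tags into clusters; the most-used tag represents each group."""
    merged = []
    for tag in sorted(tags, key=counts.get, reverse=True):
        for cluster in merged:
            if pair_similarity(tag, cluster[0], cooccurrence, counts) >= threshold:
                cluster.append(tag)
                break
        else:
            merged.append([tag])
    return merged  # the first tag of each cluster is its representative
```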


Now that we've covered how we determine the most popular tags on Instagram, let's look at how each of these pieces is implemented in the backend.

System Design


The trending backend is designed as a stream-processing application with four nodes connected in a pipeline-like chain:


Each node consumes and produces a stream of log lines. The entry point receives a stream of media-creation events, and the last node outputs a ranking of trending items (tags or locations). Each node has one specific role:


Pre-processor: receives the raw media event and its creator metadata. In this phase we fetch and attach all the data needed by the quality filters in the next step.


Parser: extracts the hashtags or place from a photo or video and applies the quality filters. If a post does not meet our criteria, it is not counted toward any trend.


Scorer: stores the time-aggregated counters for each trend candidate. Our scoring function S(h,t) is computed here, and its value is published every few minutes.


Ranker: aggregates all candidate trends with their scores and produces the final ordering.


Our system processes and stores large amounts of real-time data, so it must be both efficient and fault-tolerant. This stream-lined architecture lets us partition trends across multiple instances of each node, so each instance stores less data and trends are processed in parallel. In addition, failures are isolated to a specific partition: if one instance fails, the trend computation does not fail as a whole.
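To illustrate the shape of this pipeline, here is a minimal sketch with the four stages as chained Python generators; the stage names mirror the roles above, but the bodies are placeholders rather than the real implementation.

```python
def pre_processor(media_events):
    for event in media_events:
        event["creator_meta"] = {}            # fetch metadata needed by the filters
        yield event

def parser(events):
    for event in events:
        if event.get("is_quality", True):     # apply quality filters
            for tag in event.get("tags", []):
                yield tag

def scorer(tags, counters):
    for tag in tags:
        counters[tag] = counters.get(tag, 0) + 1
        yield tag, counters[tag]              # S(h,t) would be derived from these counters

def ranker(scored, top_n=10):
    latest = dict(scored)                     # keep the most recent value per tag
    return sorted(latest.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Events flow through the chain:
# ranker(scorer(parser(pre_processor(event_stream)), counters={}))
```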


So far, we've discussed how trending tags are computed. Serving them adds the components responsible for answering requests for trending tags from the app:

Requests for trending tags and locations from the Instagram app must be served without putting load on the backend. We therefore use a read-through caching layer backed by memcached, with a Postgres database behind it for cache misses. A scheduled task reads the latest rankings, runs the grouping algorithm to merge similar trends, and stores the results in Postgres. This way the Instagram app always gets the latest trends from the cache layer, and we can scale the storage and stream-processing components independently.
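A minimal sketch of that read-through pattern; the cache client interface, key name, TTL, table name, and query are assumptions used only to illustrate the memcached-plus-Postgres layering.

```python
import json

CACHE_TTL_SECONDS = 300  # hypothetical TTL

def get_trending(region, cache, db_conn):
    """Return trending tags for a region, querying Postgres only on a cache miss."""
    key = f"trending:{region}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    with db_conn.cursor() as cur:
        cur.execute(
            "SELECT tag, score FROM trending_tags WHERE region = %s ORDER BY score DESC",
            (region,),
        )
        trends = cur.fetchall()
    cache.set(key, json.dumps(trends), CACHE_TTL_SECONDS)
    return trends
```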

Conclusion


When tackling trends, we tried to break the problem into smaller pieces that could each be handled by a dedicated functional component. This let team members focus on one specific problem before moving on to the next. We hope users enjoy this new feature and that it helps them connect with the world as things happen.



