Filtering microblogging messages for Social TV

Source: Internet
Author: User

Abstract: Filtering microblogging messages for Social TV, a bootstrapping approach to identifying relevant tweets for Social TV.

Social TV was named one of the ten most important emerging technologies in 2010 by the MIT Technology Review.

Social television is a general term for technology that supports communication and social interaction either in the context of watching television or related to TV content.

Some of these systems allow users to read microblogging messages related to the TV program they are currently watching.

The question discussed here is how to filter out the messages related to a TV show. The simplest method, and the one we have been using, is as follows.

Current social TV applications search for these messages by issuing queries to social networks with the full title of the TV program. This naive approach can lead to low precision and recall.

An example makes it clear why both precision and recall are low with this approach.

The popular TV show House is an example that results in low precision.

The word "house" is ambiguous. Besides the TV program, it has many other uses in different contexts, such as the White House, the House of Representatives, a building, a home, and so on. Searching for "house" directly therefore gives low precision.

Continuing with our example for the show House, there are many messages which do not mention the title of the show but make references to users, hashtags, or even actors and characters related to the show. The problem of low recall is more severe for shows with long titles.
Few people write out a full title in a tweet; they usually use an abbreviation instead.

To sum up, the challenges in solving this problem are as follows.
Our task is to retrieve microblogging messages relevant to a given TV show with high precision. Filtering messages from microblogging websites poses several challenges, including:

1. Microblogging messages are short and often lack context. For instance, Twitter messages (tweets) are limited to 140 characters and often contain abbreviated expressions such as hashtags and short URLs.

2. Social media messages often lack proper grammatical structure. Also, users of social networks pay little attention to capitalization and punctuation. This makes it difficult to apply natural language processing technologies to parse the text.

3. Many social media websites offer access to their content through search APIs, but most have rate limits. In order to filter messages we first need to collect them by issuing queries to these services. For each show we require a set of queries which provides the best tradeoff between the need to cover as many messages about the show as possible and the need to respect the API rate limits imposed by the social network. Such queries could include the title of the show and other related strings such as hashtags and usernames related to the show. Determining which keywords best describe a TV show can be a challenge.

4. In the last decade alone, television networks have aired more than a thousand new TV shows. Obtaining training data for every show would be prohibitively expensive. Furthermore, new shows are aired every six months.

I have been thinking about how to solve this problem for a long time. I also considered building a classifier to decide whether a tweet is about a TV show, but never worked out how to do it. This paper presents a method for building such a classifier.

Classification is a mature technology. The key lies in feature selection and in collecting the training set.

We propose a bootstrapping method which is built upon 1) a small set of labeled data, 2) a large unlabeled dataset, and 3) some domain knowledge, to form a classifier that can generalize to an arbitrary number of TV shows.

Because labeling a training set is time-consuming, we use only a small set of labeled data, together with domain knowledge, to select the initial classification features and train an initial classifier. The initial classifier is then run on a large unlabeled dataset. New features are derived from its output, and these are used to build an improved classifier.

This is the general idea of the method. The tests show that the improved classifier achieves much better recall.
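The two-stage flow can be sketched in a few lines. This is only a toy illustration under several assumptions that do not come from the paper: the learner `train` is a simple vote counter, the only derived feature is a hashtag feature, and the second classifier is trained on the hand-labeled messages plus the messages labeled by the first classifier. All function names are hypothetical.

```python
import re

def has_tv_term(text):
    """Initial hand-picked feature: 1 if the message contains a TV-related term."""
    return int(bool({"watching", "episode"} & set(text.lower().split())))

def train(examples, features):
    """Toy learner (an assumption): each feature's weight is the number of
    positive minus negative examples in which the feature fires."""
    weights = [
        sum((1 if label else -1) * feature(text) for text, label in examples)
        for feature in features
    ]

    def classify(text):
        score = sum(w * feature(text) for w, feature in zip(weights, features))
        return int(score > 0)

    return classify

def derive_hashtag_feature(auto_labeled):
    """Build a new feature from the hashtags seen in messages labeled positive."""
    positive_tags = set()
    for text, label in auto_labeled:
        if label:
            positive_tags.update(re.findall(r"#\w+", text.lower()))

    def has_show_hashtag(text):
        return int(bool(positive_tags & set(re.findall(r"#\w+", text.lower()))))

    return has_show_hashtag

def bootstrap(labeled, unlabeled, initial_features):
    """Train classifier #1, label the unlabeled data, derive a feature, train #2."""
    clf1 = train(labeled, initial_features)
    auto_labeled = [(text, clf1(text)) for text in unlabeled]
    new_feature = derive_hashtag_feature(auto_labeled)
    clf2 = train(labeled + auto_labeled, initial_features + [new_feature])
    return clf1, clf2
```

In this sketch the second classifier can accept a message that carries only a show-related hashtag, which the first classifier would miss. That is the kind of recall gain the bootstrapping step is meant to produce.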

Personally, I think the value of this paper lies in its selection of features. Let's look at which features are selected.

Terms related to TV watching

These are general terms commonly associated with watching TV. They are collected manually and make up the following three features.

tv_terms: general terms such as watching, episode, HDTV, Netflix, etc.

network_terms: names of television networks such as CNN, BBC, PBS, etc.

season_episode: some users post messages which contain the season and episode number of the TV show they are currently watching.

"S06e07", "06x07", and even "6.7" are common ways of referring to the sixth season and the seventh episode of a particular TV show. Therefore, we need to use regular expressions to determine whether a season and episode reference is present.

For the above features, when the tweet contains the corresponding term, the feature is 1; otherwise, it is 0.
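A minimal sketch of these three binary features follows. The term lists contain only the examples named above, and the regular expression is an assumption about how the three season/episode forms might be matched; the "6.7" form in particular will also match prices and version numbers, so a real system would need something stricter.

```python
import re

# Illustrative lists only: the text names just a few example terms.
TV_TERMS = {"watching", "episode", "hdtv", "netflix"}
NETWORK_TERMS = {"cnn", "bbc", "pbs"}

# Assumed patterns for forms such as "S06e07", "06x07" and "6.7".
SEASON_EPISODE_RE = re.compile(
    r"\b(?:s\d{1,2}e\d{1,2}|\d{1,2}x\d{1,2}|\d{1,2}\.\d{1,2})\b",
    re.IGNORECASE,
)

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def term_features(text):
    """Return the three binary features: 1 if a matching term is present, else 0."""
    tokens = set(tokenize(text))
    return {
        "tv_terms": int(bool(tokens & TV_TERMS)),
        "network_terms": int(bool(tokens & NETWORK_TERMS)),
        "season_episode": int(bool(SEASON_EPISODE_RE.search(text))),
    }
```

For example, `term_features("Watching House S06e07 on Netflix")` sets tv_terms and season_episode to 1 and leaves network_terms at 0.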

General positive rules

rules_score: the motivation behind the rules_score feature is the fact that many messages which discuss TV shows follow certain patterns.
For example,
<start> watching <show_name>
episode of <show_name>
<show_name> was awesome

If we have such a list of rules, the feature is 1 when the tweet matches one of the rules and 0 otherwise.

The problem is how to find these rules. Of course, we could find them manually one by one, which would also improve accuracy, but that is far too inefficient.

We developed an automatic way to extract such general rules and compute their probability of occurrence.

We start from a manually compiled list of ten unambiguous TV show titles, such as "Mythbusters", "The Simpsons", "Grey's Anatomy", etc. An unambiguous title is one that can only refer to a TV show, in contrast to an ambiguous one such as "House".
We want to extract general rules from TV-related tweets, so we must make sure that the tweets we collect really are about TV. Using unambiguous show titles is a good way to do this, and we have used this approach before.

For each message containing one of these titles, the algorithm replaced the titles of TV shows, hashtags, references to episodes, etc. with general placeholders, then counted the occurrences of trigrams around the keywords.

This is the key step. To extract general rules, we first need to mask the information specific to a particular show, and then count the occurrences of trigrams.
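The masking and counting step might look like the sketch below. The placeholder names, the tokenizer, and the definition of a rule's probability as its relative frequency among all extracted trigrams are assumptions made for illustration; the text does not spell these details out.

```python
import re
from collections import Counter

# Three of the unambiguous titles mentioned in the text, lowercased.
UNAMBIGUOUS_TITLES = ["mythbusters", "the simpsons", "grey's anatomy"]

def generalize(text, titles=UNAMBIGUOUS_TITLES):
    """Replace show-specific strings with placeholders and return the tokens."""
    text = text.lower()
    text = re.sub(r"#\w+", "<hashtag>", text)
    text = re.sub(r"\bs\d{1,2}e\d{1,2}\b", "<episode>", text)
    for title in titles:
        text = text.replace(title, "<show_name>")
    tokens = re.findall(r"<\w+>|[a-z0-9']+", text)
    return ["<start>"] + tokens

def count_rules(messages):
    """Count every trigram that contains the <show_name> placeholder."""
    counts = Counter()
    for message in messages:
        tokens = generalize(message)
        for i in range(len(tokens) - 2):
            trigram = tuple(tokens[i:i + 3])
            if "<show_name>" in trigram:
                counts[trigram] += 1
    return counts

def rule_probabilities(messages):
    """Relative frequency of each rule (an assumed definition of probability)."""
    counts = count_rules(messages)
    total = sum(counts.values())
    return {rule: n / total for rule, n in counts.items()} if total else {}
```

Run on messages such as "Watching Mythbusters right now" and "watching The Simpsons tonight", the trigram `("<start>", "watching", "<show_name>")` is counted twice, which is how a pattern shared across different shows surfaces as a general rule.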

Features related to show titles

Although social media messages lack proper capitalization, when users do capitalize the titles of the shows this can be used as a feature.

title_case: set to 1 if the title of the show is capitalized in the message; otherwise it has the value 0.

titles_match: if any of the titles mentioned in the message is unambiguous, the value of this feature is set to 1.

What is most valuable here is the method the authors propose for deciding whether a title is unambiguous. We used to rely on our own stop-word-based method, but it did not work very well, especially for multi-word titles. The authors propose using WordNet instead, which is a good idea.

We define an unambiguous title to be a title which has zero or one hits when searching for it in WordNet.
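The two title features can be sketched as follows. To keep the example self-contained, the WordNet lookup is passed in as a function `count_senses`, which is a hypothetical helper; in practice it could wrap a WordNet library call (for instance counting the results of NLTK's `wordnet.synsets`), but that wiring is an assumption and is not shown here.

```python
import re

def contains_title(title, text, ignore_case):
    """True if the title occurs in the text as a whole word or phrase."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = r"\b" + re.escape(title) + r"\b"
    return re.search(pattern, text, flags) is not None

def is_unambiguous(title, count_senses):
    """A title is unambiguous if the WordNet lookup returns zero or one hits."""
    return count_senses(title.lower()) <= 1

def title_features(text, titles, count_senses):
    """Compute title_case and titles_match for one message.

    `titles` holds show titles in their canonical capitalization.
    """
    mentioned = [t for t in titles if contains_title(t, text, ignore_case=True)]
    return {
        "title_case": int(
            any(contains_title(t, text, ignore_case=False) for t in mentioned)
        ),
        "titles_match": int(
            any(is_unambiguous(t, count_senses) for t in mentioned)
        ),
    }
```

With a stand-in lookup that reports many senses for "house" and none for "mythbusters", the message "Watching Mythbusters tonight" gets both features set to 1, while "painting the house today" gets both set to 0.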

Features based on domain knowledge crawled from online sources

One of our assumptions is that messages relevant to a show often contain names of actors, characters, or other keywords strongly related to the show.

cosine_characters, cosine_actors, and cosine_wiki: for each of the three features, we compute the cosine similarity between a new message and the information we crawled (from TV.com and Wikipedia) about the show.

This method can greatly improve recall, but it is troublesome to implement. Because of Twitter's access restrictions, we cannot set too many search terms for each show, so we have never used it.
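A minimal sketch of the three cosine features is given below. It uses raw term counts as vector weights, which is an assumption (the text does not say whether any weighting such as TF-IDF is applied), and the `show_info` dictionary is a hypothetical stand-in for the crawled data.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Turn a text into a term-count vector."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def cosine_features(message, show_info):
    """Compare a message with each crawled text about the show.

    `show_info` maps "characters", "actors" and "wiki" to crawled text.
    """
    vector = bag_of_words(message)
    return {
        f"cosine_{name}": cosine_similarity(vector, bag_of_words(text))
        for name, text in show_info.items()
    }
```

A message such as "Hugh Laurie is brilliant" never mentions the title House, yet it scores above zero on cosine_actors when the crawled actor list contains that name. This is why these features help recall.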

Nine initial features are listed above. After the initial classifier has been run on the unlabeled dataset, the following additional features are derived.

pos_rules_score and neg_rules_score are natural extensions of the feature rules_score.

For instance, for the show House we can now learn positive rules such as "episode of House", as well as negative rules such as "in the house" or "the White House".

users_score and hashtags_score

Using messages labeled by classifier #1, we can determine commonly occurring hashtags and users which often talk about a particular show. Furthermore, these features can also help us expand the set of queries for each show, thus improving the recall by searching for hashtags and users related to the show, in addition to the title.
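One possible way to compute such scores is sketched below. The scoring rule (the fraction of messages containing a hashtag or user that classifier #1 labeled as relevant), the minimum count, and the query expansion threshold are all assumptions chosen for illustration; the hashtags in the usage note are made up.

```python
import re
from collections import Counter

HASHTAG = r"#\w+"
USER = r"@\w+"

def extract(pattern, text):
    """Return the distinct hashtags or user mentions in a message."""
    return set(re.findall(pattern, text.lower()))

def tag_scores(labeled_messages, pattern, min_count=2):
    """Score each tag by the share of its messages labeled relevant.

    `labeled_messages` is a list of (text, is_relevant) pairs produced by
    the first classifier. Tags seen fewer than `min_count` times are dropped.
    """
    positive, total = Counter(), Counter()
    for text, is_relevant in labeled_messages:
        for tag in extract(pattern, text):
            total[tag] += 1
            if is_relevant:
                positive[tag] += 1
    return {tag: positive[tag] / n for tag, n in total.items() if n >= min_count}

def message_score(text, scores, pattern):
    """Feature value for a new message: the best score among its tags."""
    return max((scores.get(tag, 0.0) for tag in extract(pattern, text)), default=0.0)

def expand_queries(title, scores, min_score=0.8):
    """Add high-scoring tags to the search queries for a show."""
    return [title] + sorted(tag for tag, s in scores.items() if s >= min_score)
```

Calling `tag_scores(messages, HASHTAG)` gives the hashtag scores and `tag_scores(messages, USER)` the user scores. If a tag such as "#housemd" appears mostly in messages labeled relevant, `expand_queries("house", scores)` adds it to the query set alongside the title.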

We had thought of this before as a way to improve recall, but never implemented it.

rush_period: this feature is based on the observation that users of social media websites often discuss a show while it is on air. When classifying a new message, we check how many mentions of the show there were in the previous window of 10 minutes. If the count exceeds a threshold, the feature is set to 1; otherwise it is 0.
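A sliding-window counter is one simple way to implement this check. The sketch below assumes that the feature is 1 when the number of mentions in the previous 10 minutes exceeds the threshold, that timestamps are in seconds and arrive in order, and that the threshold is a fixed number; how the threshold is actually chosen is not stated here, so the default is a placeholder.

```python
from collections import deque

class RushPeriod:
    """Tracks mentions of one show and flags bursts of activity."""

    def __init__(self, window_seconds=600, threshold=3):
        self.window_seconds = window_seconds
        self.threshold = threshold  # placeholder value, an assumption
        self.mentions = deque()     # timestamps of recent mentions

    def feature(self, timestamp):
        """Return 1 if the previous window holds more mentions than the
        threshold, then record the current message as a mention."""
        cutoff = timestamp - self.window_seconds
        while self.mentions and self.mentions[0] <= cutoff:
            self.mentions.popleft()
        value = int(len(self.mentions) > self.threshold)
        self.mentions.append(timestamp)
        return value
```

With `RushPeriod(threshold=2)`, messages arriving at 0, 60, 120 and 180 seconds yield 0, 0, 0 and 1. A message arriving much later falls back to 0 because the earlier mentions have left the window.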
