Open Information Extraction from Web Search Query Logs


Chapter I. Introduction

Search engines have moved well beyond traditional keyword-in, documents-out retrieval, focusing on user-oriented tasks to improve the search experience: query suggestion, search personalization, and related-link recommendation. These user-centric tasks are supported by mining search query logs. Indeed, the query log captures the user's view of the world and is the key resource for such applications.

Linguistic knowledge extracted from the query log, such as entities and relationships, is of great value to the applications above. However, there has been little research on extracting knowledge from query logs. In this paper, we present the first investigation of open information extraction over query logs. Our goal is to extract user-oriented knowledge from the query log to support these applications.

Traditional information extraction focuses on extracting structured information, such as entities, relationships, and facts, from unstructured text, under two main assumptions: (1) the text resource consists of syntactically and semantically well-formed fragments, such as news corpora or Web documents; (2) the extraction process is bootstrapped from some prior knowledge.

Open Information Extraction (OIE) is a newly emerging information extraction paradigm that relaxes assumption (2): it extracts entities and relationships from web-scale corpora without restriction to a particular domain and without manual input.

In this paper, we go further and explore the usefulness of search query logs for OIE by also dropping assumption (1). We argue that relaxing both assumptions (1) and (2) lets us meet our goal of extracting user-oriented knowledge. In particular, we assume that Web page text and query logs model two different spaces: Web page text models the Web space, while the search query log models the user space.

To adapt OIE to search query logs, several challenges must be addressed. First, to honor the relaxed assumption (2) and mine naturally occurring information from the query log, we need an extraction method that is completely independent of any prior knowledge. Second, queries lack syntactic structure, so we need a robust extraction method that does not rely on traditional natural language processing tools such as POS taggers. Third, queries are short, so we must design an entity representation that properly captures the characteristics of the query log. Finally, although a query log is not as large as a web corpus, it is still a large dataset, so our approach must handle large datasets efficiently.

We present a two-phase approach to OIE over search query logs. The first phase (entity extraction) uses an unsupervised method to extract entities from the query log, applying pattern-based heuristics and statistical measures. The second phase (entity clustering) applies a clustering method that uses various signals from the query log to create classes over those entities. In summary, our main contributions are:

    • We propose and implement a new model for OIE based on search query logs.
    • We introduce an unsupervised technique for extracting entities from this atypical corpus in a domain-independent manner.
    • We describe how to characterize each extracted entity and create classes over those entities.
    • We present a broad evaluation on realistic datasets, showing that the query log is a valuable resource for domain-independent, user-oriented extraction tasks. We also demonstrate the usefulness of our approach in two practical applications: related-entity recommendation for news, and keyword suggestion for paid search.
Chapter II. Query Log Entity Extraction

Entity extraction is a task that plays an important role in NLP and Web-based applications. Historically, entity extraction has been defined as extracting instances of predefined categories. We introduce an unsupervised approach for large-scale, open-domain entity extraction from query logs. To the best of our knowledge, we are the first to propose an algorithm explicitly targeting both of the following goals: (1) extracting entities from query logs; (2) extracting open-domain entities, with no predefined categories.

Starting from raw user search queries, our method first identifies candidate entities, then selects reliable entities from the candidates by computing two corpus-based trust scores and applying a containment filter.

2.1 Generating candidate entities

Open-domain entity extraction from query logs poses several challenges. First, existing class-based approaches have proven to be domain-specific. Second, we extract entities from the query log, an atypical corpus: queries are short and lack syntactic structure, which weakens traditional methods that rely on contextual evidence and syntactic features.

Our approach to generating candidate entities is based on a simple observation: users often build their queries by copying phrases that appear on web pages. Because of this behavior, queries typically preserve surface-level attributes such as capitalization and word breaks. Our approach exploits this observation by identifying maximal sequences of consecutive capitalized words in user queries. In particular, given a query q = q1 q2 ... qn, we define a candidate entity e = e1 e2 ... em as a maximal-length subsequence of q in which every word ei begins with an uppercase letter.
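To make the heuristic concrete, here is a minimal Python sketch; the whitespace tokenization and the first-character uppercase test are our own simplifying assumptions rather than details from the paper.

```python
def candidate_entities(query):
    """Extract maximal runs of consecutive capitalized words from a query."""
    candidates, run = [], []
    for tok in query.split():
        # A token joins the current run if its first character is uppercase.
        if tok[:1].isupper():
            run.append(tok)
        else:
            if run:
                candidates.append(" ".join(run))
            run = []
    if run:
        candidates.append(" ".join(run))
    return candidates

print(candidate_entities("cheap Brad Pitt posters in New York"))
# ['Brad Pitt', 'New York']
```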

Since queries are typed freely, this surface-level technique is far from perfect; for example, some users type their queries entirely in uppercase. We therefore need to identify and discard spurious candidate entities. The method is described below.

2.2 Computing trust scores

Given a candidate entity e = e1 e2 ... em generated in the previous step, we assign it two trust scores: a Web-based representation score and a query-log-based standalone score. The representation score captures the intuition that if e, as capitalized in the query, is a genuine entity, the same case-sensitive form should also dominate in Web page text. More formally, the Web-based representation score R_W(e) is calculated as:

    R_W(e) = |r(e)| / Σ_{o ∈ O(e)} |o|

where |x| is the number of times a string x appears in the Web page corpus, r(e) is the case-sensitive representation of e, and O(e) is the set of all case-insensitive variants of the string e.

The standalone score is based on the observation that a genuine candidate entity e should often appear on its own in the query log: queries where q = e exactly capture the fact that the user wants to learn more about the entity itself. More formally, we calculate the query-log-based standalone score S_Q(e) as:

    S_Q(e) = |{q ∈ Q : q = e}| / |{q ∈ Q : e occurs in q}|

that is, the fraction of queries containing e that consist of e alone.

Having obtained the scores R_W(e) and S_Q(e), we retain the entities satisfying R_W(e) ≥ τ_R and S_Q(e) ≥ τ_S. In our experiments, we estimated τ_R and τ_S on a development set, setting τ_R to 0.1 and τ_S to 0.2.
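A minimal sketch of both trust scores, assuming the corpus statistics have already been aggregated into plain hash maps; the names web_counts and query_counts are ours, and the per-entity linear scans are for clarity only (real use would precompute these statistics in one pass over the data).

```python
# Assumed pre-aggregated statistics (names are ours, for illustration):
#   web_counts:   case-sensitive string -> frequency in the Web page corpus
#   query_counts: full query string     -> frequency in the query log

def representation_score(entity, web_counts):
    """R_W(e): frequency of the case-sensitive form over all case variants."""
    variants = [s for s in web_counts if s.lower() == entity.lower()]
    total = sum(web_counts[s] for s in variants)
    return web_counts.get(entity, 0) / total if total else 0.0

def standalone_score(entity, query_counts):
    """S_Q(e): fraction of queries containing e that are exactly e."""
    target = entity.lower()
    containing = sum(f for q, f in query_counts.items() if target in q.lower())
    exact = sum(f for q, f in query_counts.items() if q.lower() == target)
    return exact / containing if containing else 0.0

# Thresholds tuned on a development set (values from the text).
TAU_R, TAU_S = 0.1, 0.2

def is_reliable(entity, web_counts, query_counts):
    return (representation_score(entity, web_counts) >= TAU_R and
            standalone_score(entity, query_counts) >= TAU_S)
```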

2.3 Applying the containment filter

As a final step, we address the problem of boundary detection. The candidate set may contain many overlapping candidates that merely represent concepts rather than entities, and the scores above may not filter them out. We therefore apply a containment filter: a candidate string that strictly contains another extracted entity is discarded.
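A sketch of the containment filter, under our assumption that containment is tested at word-sequence boundaries:

```python
def containment_filter(entities):
    """Discard any candidate that strictly contains another candidate."""
    kept = []
    entity_set = set(entities)
    for e in entities:
        words = e.split()
        # Check every proper word sub-sequence of e against the candidate set.
        contains_other = any(
            " ".join(words[i:j]) in entity_set
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)
            if " ".join(words[i:j]) != e
        )
        if not contains_other:
            kept.append(e)
    return kept

print(containment_filter(["Brad Pitt", "Brad Pitt Movies", "New York"]))
# ['Brad Pitt', 'New York']
```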

Chapter III. Entity Clustering

We now introduce the clustering method applied to the extracted open-domain entities. The goal is to group entities that are similar in user space. To accomplish this, we first represent each entity as a set of features in that space, and then apply a clustering algorithm to group entities with similar features.

3.1 User space

The context feature space. The basic assumption of the context feature space is that an entity can be effectively represented by the set of contexts in which it appears in the query log. This captures the user's view of the entity.

Our query-log-based features may differ significantly from traditional Web-corpus-based features, because the same entity may be expressed and used differently in the two corpora (that is, usage in Web pages and usage in queries may vary).

To build this context representation, we use the following procedure. For each entity e, we first find all queries in the log that contain e. We then collect the prefix and suffix of each query in which the entity appears (that is, the strings preceding and following it).

Once the contexts of all entities have been counted, we ignore any context occurring fewer than τ times, which avoids the statistical bias caused by sparse data (in our experiments, τ is set to 200). We then calculate the corrected pointwise mutual information (CPMI), following the paper "Discovering word senses from text":

    cpmi(e, c) = M · log( (f(e,c) · f(*,*)) / (f(e) · f(c)) )

where f(e,c) is the number of times e and c occur in the same query, f(e) and f(c) are the numbers of queries in which entity e and context c occur, respectively, and f(*,*) is the total number of co-occurrences over all entities and contexts (that is, the number of queries contributing an entity and a context). M is a correction factor that reduces the statistical error caused by low-frequency entities and low-frequency contexts. Each entity is thus represented as a vector of CPMI values. Note that our approach involves no NLP parsing, since queries have almost no syntactic structure; this keeps the computational complexity low and makes the method easily adaptable to other languages.
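A sketch of the CPMI computation over pre-counted (entity, context) pairs; the exact form of the discount factor M below is our reading of the cited paper, so treat it as an assumption.

```python
import math
from collections import Counter

def cpmi_vectors(cooc, tau=200):
    """Build CPMI feature vectors from (entity, context) co-occurrence counts.

    cooc: Counter mapping (entity, context) -> joint frequency f(e, c).
    """
    f_e, f_c = Counter(), Counter()
    for (e, c), f in cooc.items():
        f_e[e] += f
        f_c[c] += f
    total = sum(cooc.values())                      # f(*,*)

    vectors = {}
    for (e, c), f in cooc.items():
        if f_c[c] < tau:                            # drop sparse contexts
            continue
        pmi = math.log(f * total / (f_e[e] * f_c[c]))
        # Discount factor M, penalizing low-frequency events (assumed form,
        # in the style of Pantel & Lin's discounted PMI).
        m = (f / (f + 1)) * (min(f_e[e], f_c[c]) / (min(f_e[e], f_c[c]) + 1))
        vectors.setdefault(e, {})[c] = pmi * m
    return vectors
```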

The click feature space. During a search session, the user issues a query and the search engine returns a list of URLs. The user then clicks the URLs that express their intent. This interaction is captured by clicks, and most search engines log this behavior as click-through data.

Our main motivation for grouping entities by user click behavior is that different queries whose clicks land on the same URL capture similar user intents; consequently, entities for which users click the same URLs are likely to be similar. We observed that users tend to converge on characteristic URLs for each entity, so grouping entities by clicked URLs can discover synonyms (different surface forms of the same entity) and variants (such as misspellings). To obtain more meaningful clusters, we use the base URL rather than the full clicked URL.

Because of encyclopedia-like sites such as Wikipedia, using the base URL may place dissimilar entities in the same cluster. To mitigate this, in our experiments we used a stop-list, excluding the 5 base URLs with the lowest inverse document frequency, where each entity is treated as a "document".

In practice, each extracted entity e is represented as a vector whose dimensionality equals the number of distinct base URLs clicked across all users, with each dimension corresponding to one URL. The value of entity e's vector for URL j is calculated as:

    v_e(j) = w(e, j) / Σ_{k ∈ U(e)} w(e, k)

where U(e) is the set of base URLs clicked when entity e is issued as a query, and w(e, j) is the number of times URL j is clicked when entity e is issued as a query.
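A sketch of the click-vector construction, using the normalization reconstructed above; taking the host via urlparse is our stand-in for the paper's notion of a base URL.

```python
from collections import defaultdict
from urllib.parse import urlparse

def click_vectors(click_log, stoplist=frozenset()):
    """Build normalized click-feature vectors from (query, clicked_url) pairs."""
    counts = defaultdict(lambda: defaultdict(int))   # entity -> base URL -> clicks
    for query, url in click_log:
        base = urlparse(url).netloc                  # base URL = host part
        if base not in stoplist:
            counts[query][base] += 1

    vectors = {}
    for entity, urls in counts.items():
        total = sum(urls.values())
        vectors[entity] = {u: w / total for u, w in urls.items()}
    return vectors

log = [("brad pitt", "http://www.imdb.com/name/nm0000093/"),
       ("brad pitt", "http://en.wikipedia.org/wiki/Brad_Pitt"),
       ("brat pitt", "http://www.imdb.com/name/nm0000093/")]
print(click_vectors(log, stoplist={"en.wikipedia.org"}))
# {'brad pitt': {'www.imdb.com': 1.0}, 'brat pitt': {'www.imdb.com': 1.0}}
# The misspelling "brat pitt" ends up with the same vector as "brad pitt".
```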

The mixed feature space. We also experiment with a mixed feature space, which takes the normalized union of the context feature space and the click feature space.
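One plausible reading of "normalized union" is to normalize each vector and take the union of their dimensions, tagged by source space; this interpretation is ours.

```python
import math

def l2_normalize(vec):
    """Scale a sparse vector (dict) to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else dict(vec)

def mixed_vector(ctx_vec, clk_vec):
    """Union of an entity's normalized context and click vectors,
    with dimensions tagged by their source space to avoid collisions."""
    mixed = {("ctx", k): v for k, v in l2_normalize(ctx_vec).items()}
    mixed.update({("clk", k): v for k, v in l2_normalize(clk_vec).items()})
    return mixed
```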

3.2 Clustering Algorithm

The clustering phase takes any of the feature spaces described above and groups entities by the similarity of their vectors. The clustering algorithm for this task must have two properties: (1) it must be highly scalable and efficient in high dimensions, because the number of queries and the dimensionality of the feature vectors are very large; (2) it must not require the number of clusters in advance.

Any clustering algorithm satisfying these two requirements could be used. In our experiments we use CBC (Clustering By Committee), a state-of-the-art clustering algorithm that has been shown to outperform K-means on many language tasks. We use a highly scalable MapReduce implementation of CBC to ensure robustness and efficient memory usage. A detailed introduction to CBC is omitted here.
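CBC itself is beyond the scope of this summary; as a simplified stand-in that still satisfies requirement (2) above, here is a greedy cosine-similarity grouping over the sparse feature vectors. This is not the paper's algorithm, only an illustration of clustering without a preset cluster count.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def greedy_cluster(vectors, threshold=0.3):
    """Greedily grow clusters: join the most similar existing centroid,
    or open a new cluster; the number of clusters is not fixed in advance."""
    clusters = []  # list of [centroid_dict, member_list] pairs
    for entity, vec in vectors.items():
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster[0])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is not None:
            best[1].append(entity)
            for k, v in vec.items():          # fold vector into the centroid
                best[0][k] = best[0].get(k, 0.0) + v
        else:
            clusters.append([dict(vec), [entity]])
    return [members for _, members in clusters]
```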

Chapter IV. Experimental Evaluation

We begin with a brief description of the datasets used.

Query log: 100 million queries sampled at random from anonymized queries collected by a search engine over the first three months of 2009, together with their frequencies. We split the dataset by month into JN, FB, and MR. This data is used to extract entities and to generate the context and click feature spaces.

Web documents: 500 million pages crawled by the search engine. This data is used to compute the Web-based features.

4.1 Entity Extraction

Evaluation methodology: We run two sets of experiments, one assessing accuracy and one assessing coverage.

For the accuracy experiment, we uniformly sample 400 entities at random per method and distribute them to two expert annotators, who must judge whether each entity is correct.

For the coverage experiment, we focus on 5 categories of entities that appear frequently in query logs: actors, athletes, cities, diseases, and movies. For each category, we create a representative gold set based on Wikipedia.

Systems compared: We use the MR dataset to compare the following entity extraction systems:

      • QL-BASE: a baseline that keeps all generated candidate entities.
      • QL-CONF: adds trust-score filtering.
      • QL-FULL: adds trust-score filtering and containment filtering.
      • QL-PASCA: an existing query-log extraction system (Pasca's method).
      • WEB: an open-domain, Web-based entity extraction method that first applies a POS tagger and a rule-based chunker, then selects noun chunks as candidate instances, and finally retains as output the candidates that occur more than 50 times and begin and end with a capitalized word.

4.2 Entity Clustering

The goal of this experiment is twofold: (1) to estimate the intrinsic quality of the clustering algorithm, and (2) to confirm that our starting assumptions are correct.

Evaluation methodology: Most existing evaluation criteria require a gold-standard dataset, which in our case is unavailable and difficult to construct, so we use a manual validation process instead. We first select a random sample of n entities from QL-FULL, drawn according to their frequency in the log. For each entity e in the sample, we extract a random list of k entities that were placed in the same cluster as e. In our experiments, n = 10 and k = 20. We then present each entity e and its k clustered entities to hired editors, who judge each pair (e, ei) as correct or incorrect: a pair is correct if the two entities are similar or related from a user's perspective. Agreement between the editors exceeded a threshold of 0.64. In addition, we asked the editors to label the relation between e and each ei.

Systems compared: we use the following methods:

    • CL-CTX: CBC over the query-log context feature space.
    • CL-CLK: CBC over the click feature space.
    • CL-HYB: CBC over the mixed feature space, combining CTX and CLK.
    • CL-WEB: a state-of-the-art open-domain system based on features from Web page text.

Experimental results

The results show that the click feature space is very effective, while the context feature space performs worse than both the click space and the Web page space.

Chapter V. Applications

In this chapter, we explore two practical applications of the proposed model: related-entity recommendation for news, and keyword generation for paid search.

5.1 Related entities for news

News sites often help users explore the news by showing a list of articles that may interest them, so that users can read further based on the interests expressed by the current article. The underlying problem is to identify the main concepts in a news article and to recommend related concepts not mentioned in the article. Several methods have been proposed to (a) effectively identify the main concepts in an article and (b) recommend related concepts. Our goal here is to verify whether our entity clusters successfully address (a) and (b) to users' satisfaction.

Dataset creation: We randomly selected 3 million articles from 2009 news data. For each method, we drew a sample of 50 news articles, each guaranteed to contain at least 2 entities from a single cluster. For each article, we recommend up to 10 entities belonging to the same cluster as the entities identified in the article.

Evaluation and metrics: We evaluate the generated related entities by accuracy: given an article and an associated recommended entity, two annotators judge whether the recommendation is relevant, i.e., whether a user interested in the entities of the article would likely be interested in the recommended entity. The kappa agreement over the 50 recommendations was 0.78. Accuracy is the number of relevant recommendations divided by the total number of recommendations.

Systems compared: CL-CTX, CL-CLK, CL-HYB, and the WEB method.

Conclusions: omitted.

5.2 Keyword generation for paid search

Paid search accounts for much of the annual revenue of search companies. In paid search, online advertisers bid on specific keywords (called bidterms) in auctions run on a search company's dedicated platform. Auction winners are allowed to have their ads shown on the search results page when the bidterm is queried.

Companies like Google and Yahoo have invested effort and money in improving their bidding platforms in order to attract more advertisers to the auctions. Bidterm recommendation is one example of these efforts: the advertiser types a seed keyword expressing the intent of the ad, and the tool returns a list of suggested keywords the advertiser can bid on.

Automatically generating bid suggestions from a seed has received much attention from search companies. Keyword suggestion techniques can be divided into 3 categories. Nearest-neighbor-search methods query a search engine with the seed and extract n-grams occurring near the seed in the result pages. Query-log methods typically look at past frequent queries containing the seed and suggest them; this is the approach most commonly used by the Google AdWords tool and the Yahoo Search Marketing tool. Meta-tag spidering methods query a search engine with the seed and extract the meta tags of the top-ranked pages as recommendations.

Existing keyword generation tools are highly accurate, but they all explore suggestions that contain the seed, and thus tend to miss other, less obvious suggestions. These non-obvious suggestions may be cheaper for advertisers while still being highly relevant.

5.2.1 Experiment Settings

The goal of this experiment is to estimate the quality of the recommendations produced by different methods for a set of popular seed bidterms.

Dataset construction: To build the seed collection, we use the Google sktool database, which provides a list of popular bidterms. We chose 3 topics: tourism, transportation, and e-commerce. For each topic, we randomly selected 5 seeds that also appear in QL-FULL.

Evaluation and metrics: we use accuracy and non-obviousness. Accuracy is judged by asking two experienced annotators whether a suggestion is relevant, i.e., whether an advertiser would be willing to bid on it. Non-obviousness is the simple count of recommendations that do not contain the seed itself, computed by simple string matching with simple stemming.
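A sketch of the non-obviousness count as described; the crude suffix-stripping stemmer is our own placeholder for "simple stemming".

```python
def crude_stem(word):
    """Very rough stemming: strip a few common suffixes (placeholder only)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def non_obvious_count(seed, suggestions):
    """Count suggestions that do not contain the (stemmed) seed."""
    seed_stems = {crude_stem(w) for w in seed.lower().split()}
    count = 0
    for s in suggestions:
        stems = {crude_stem(w) for w in s.lower().split()}
        if not seed_stems <= stems:        # seed words not all present
            count += 1
    return count

print(non_obvious_count("flight", ["cheap flights", "hotel deals", "airline tickets"]))
# 2  ("cheap flights" contains the stemmed seed; the other two do not)
```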

Systems compared: CL-CTX, CL-CLK, CL-HYB, and WEB, plus two state-of-the-art commercial systems, Google AdWords (GOO) and Yahoo Search Marketing Tool (YAH).

5.2.2 Experimental Conclusion

Omitted.
