Using a single SQL statement to find the records that match several tags at the same time

Source: Internet
Author: User

Table structure

Tag table: {tag_id, tag_name}  # tags (named tags in the query below)

News table: {news_id, title, ...}  # news

Newstags table: {tag_id, news_id}  # relation between news and tags

Explanation:

A news item can carry several tags, for example:

News A {tag1, tag2, tag3, tag4}

News B {tag1, tag6, tag7, tag8}

News C {tag8, tag9, tag10, tag1}

News ... {tag ..., ...}

Goal: find the records that carry both tag1 and tag8.


Optimized SQL for MySQL:

    SELECT n.title
    FROM (
        SELECT c.news_id
        FROM (
            SELECT tag_id
            FROM tags
            WHERE tag_name IN ('tag1', 'tag8')
        ) b
        LEFT JOIN newstags c ON (b.tag_id = c.tag_id)
        GROUP BY c.news_id
        HAVING COUNT(0) = 2
    ) a
    LEFT JOIN news n ON (a.news_id = n.news_id)

Principle:

This is a query nested three levels deep. It narrows the record set as early as possible, which keeps memory use low.

1, the innermost layer fetches the tag_id values that need to be compared

2, the second layer uses a LEFT JOIN to find the news_id values that own all of these tag_id values at the same time; the key step is to GROUP BY news_id and keep, in the HAVING clause, only the news_id values whose row count is 2 (that is, both tag records exist). This relies on each (tag_id, news_id) pair appearing only once in Newstags.

3, the outermost layer takes these news_id values (the final, smallest record set) and LEFT JOINs them with the news table to fetch the content of the news records
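The three layers can be tried out on the sample News A, B, and C data above. This is only a minimal sketch: SQLite's in-memory database stands in for MySQL so that the example is self-contained, and the lowercase table names, column types, and the added ORDER BY are assumptions that do not come from the original.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; the query is plain SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT UNIQUE);
    CREATE TABLE news (news_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE newstags (tag_id INTEGER, news_id INTEGER,
                           PRIMARY KEY (tag_id, news_id));
""")

sample = {
    "News A": ["tag1", "tag2", "tag3", "tag4"],
    "News B": ["tag1", "tag6", "tag7", "tag8"],
    "News C": ["tag8", "tag9", "tag10", "tag1"],
}
for title, names in sample.items():
    news_id = conn.execute(
        "INSERT INTO news (title) VALUES (?)", (title,)
    ).lastrowid
    for name in names:
        # Create the tag on first use, then link it to the news item.
        conn.execute("INSERT OR IGNORE INTO tags (tag_name) VALUES (?)", (name,))
        tag_id = conn.execute(
            "SELECT tag_id FROM tags WHERE tag_name = ?", (name,)
        ).fetchone()[0]
        conn.execute("INSERT INTO newstags VALUES (?, ?)", (tag_id, news_id))

query = """
SELECT n.title
FROM (
    SELECT c.news_id
    FROM (SELECT tag_id FROM tags WHERE tag_name IN ('tag1', 'tag8')) AS b
    LEFT JOIN newstags AS c ON b.tag_id = c.tag_id   -- layer 2: join
    GROUP BY c.news_id
    HAVING COUNT(0) = 2                              -- both tags present
) AS a
LEFT JOIN news AS n ON a.news_id = n.news_id         -- layer 3: fetch titles
ORDER BY n.title
"""
titles = [row[0] for row in conn.execute(query)]
print(titles)  # ['News B', 'News C']
```

News A is dropped because it carries only tag1, so its group has one row instead of two.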


At this point you have a template for matching several tag names. To intersect any number of tags, adjust two parameters according to how many tags you search for: the list in the innermost IN (...) and the count in the HAVING clause of the second layer.
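One possible way to fill in that template for any number of tags is sketched below. The helper names `build_all_tags_query` and `find_news_with_all_tags` are hypothetical, SQLite again stands in for MySQL, and the sketch uses inner joins instead of the LEFT JOIN in the original query so that a tag without any news cannot produce a NULL row; none of these choices come from the original.

```python
import sqlite3

def build_all_tags_query(tag_count):
    """Return the template with one placeholder per tag plus one for the count."""
    placeholders = ", ".join("?" for _ in range(tag_count))
    return f"""
        SELECT n.title
        FROM (
            SELECT c.news_id
            FROM (SELECT tag_id FROM tags
                  WHERE tag_name IN ({placeholders})) AS b
            JOIN newstags AS c ON b.tag_id = c.tag_id
            GROUP BY c.news_id
            HAVING COUNT(0) = ?
        ) AS a
        JOIN news AS n ON a.news_id = n.news_id
        ORDER BY n.title
    """

def find_news_with_all_tags(conn, tag_names):
    # Duplicate names would make the expected count unreachable, so drop them.
    names = sorted(set(tag_names))
    if not names:
        return []
    sql = build_all_tags_query(len(names))
    return [row[0] for row in conn.execute(sql, (*names, len(names)))]

# Small demo data set (a subset of the sample tags above).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT UNIQUE);
    CREATE TABLE news (news_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE newstags (tag_id INTEGER, news_id INTEGER,
                           PRIMARY KEY (tag_id, news_id));
    INSERT INTO tags VALUES (1, 'tag1'), (6, 'tag6'), (8, 'tag8'), (9, 'tag9');
    INSERT INTO news VALUES (1, 'News A'), (2, 'News B'), (3, 'News C');
    INSERT INTO newstags VALUES
        (1, 1), (1, 2), (6, 2), (8, 2), (8, 3), (9, 3), (1, 3);
""")

two_tags = find_news_with_all_tags(conn, ["tag1", "tag8"])
three_tags = find_news_with_all_tags(conn, ["tag1", "tag6", "tag8"])
unknown_tag = find_news_with_all_tags(conn, ["tag1", "no-such-tag"])
print(two_tags, three_tags, unknown_tag)
```

Passing the tag names as bound parameters rather than pasting them into the SQL string keeps the template safe to reuse with user-supplied tags. The count is taken from the number of distinct tags requested, so a tag that does not exist yields no matches.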


This algorithm is not optimal in general, but if you want to do the job with a single templated SQL statement, it is, in my humble opinion, about the best you can get.

Its disadvantage shows up when each tag corresponds to a large number of records and many tags are intersected: the LEFT JOIN in the second layer then consumes a lot of memory. The join has to produce the news_id set for every tag_id, so it first builds one large joined set, then groups that set, and only then counts.


If you want to push efficiency and memory consumption of the search to the limit, you can make roughly the following changes.

1, change the tag table by adding a reference-count field, for example: Tag table: {tag_id, tag_name, links} (links can be regarded as a hand-maintained index)

2, keeping this field up to date on every change is troublesome, so for efficiency the links field is usually refreshed by a scheduled full-table update (for a news table with frequent inserts or deletes, recount the references of all tags in the early morning every day and update the links field)

3, construct a dynamic multi-layer SQL statement (that is, generate an n-layer nested query according to the number of tags)
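The scheduled refresh in step 2 could look roughly like the sketch below. The function name `refresh_links`, the column type of links, and the use of SQLite in place of MySQL are assumptions; how the job is scheduled (cron or similar) is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT UNIQUE,
                       links INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE newstags (tag_id INTEGER, news_id INTEGER,
                           PRIMARY KEY (tag_id, news_id));
    INSERT INTO tags (tag_id, tag_name) VALUES
        (1, 'tag1'), (6, 'tag6'), (8, 'tag8'), (9, 'tag9'), (11, 'tag11');
    INSERT INTO newstags VALUES
        (1, 1), (1, 2), (6, 2), (8, 2), (8, 3), (9, 3), (1, 3);
""")

def refresh_links(conn):
    """Recount, for every tag, how many news items reference it."""
    conn.execute("""
        UPDATE tags
        SET links = (SELECT COUNT(*) FROM newstags
                     WHERE newstags.tag_id = tags.tag_id)
    """)
    conn.commit()

refresh_links(conn)
links = dict(conn.execute("SELECT tag_name, links FROM tags"))
print(links)
```

Between two refreshes the counts can be stale. That only affects the order in which tags are tried, not the correctness of the search result.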

Implementation principle (the SQL statement is not written out here)

1, look up the tag_id for each given tag_name and sort the results by links in ascending order

2, take the first tag_id (the one with the fewest links) and match it against Newstags.tag_id to get its news_id values (this is the smallest base record set); then repeat the step on this record set with the next tag_id, shrinking the set each time. Once all tags have been matched, what remains are the records that carry all of these tag_id values, found within the smallest possible range.

The core of the algorithm is to start from the smallest possible record set (so that later steps have as few rows as possible to compare) and then keep shrinking the record set until every tag has been matched.
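Since the original leaves the SQL unwritten, the following is only one way the two steps might be turned into a generated n-layer query. The function name `find_news_ids_rarest_first` is hypothetical, SQLite stands in for MySQL, and the links values are assumed to be already filled in. Note also that a query optimizer is free to evaluate nested subqueries in a different order, so the rarest-first nesting expresses the intent rather than guaranteeing the execution plan.

```python
import sqlite3

def find_news_ids_rarest_first(conn, tag_names):
    names = sorted(set(tag_names))
    if not names:
        return []
    marks = ", ".join("?" for _ in names)
    # Step 1: resolve tag names to ids, least-referenced tag first.
    rows = conn.execute(
        f"SELECT tag_id FROM tags WHERE tag_name IN ({marks}) "
        "ORDER BY links ASC",
        names,
    ).fetchall()
    if len(rows) != len(names):
        return []  # an unknown tag means no news item can match
    tag_ids = [row[0] for row in rows]

    # Step 2: the innermost layer selects the news of the rarest tag;
    # each further tag wraps one more layer around the previous query.
    sql = "SELECT news_id FROM newstags WHERE tag_id = ?"
    params = [tag_ids[0]]
    for tag_id in tag_ids[1:]:
        sql = ("SELECT news_id FROM newstags "
               f"WHERE tag_id = ? AND news_id IN ({sql})")
        params.insert(0, tag_id)  # outer placeholder comes first in the text
    return sorted(row[0] for row in conn.execute(sql, params))

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT UNIQUE,
                       links INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE newstags (tag_id INTEGER, news_id INTEGER,
                           PRIMARY KEY (tag_id, news_id));
    INSERT INTO tags VALUES (1, 'tag1', 3), (6, 'tag6', 1), (8, 'tag8', 2);
    INSERT INTO newstags VALUES
        (1, 1), (1, 2), (6, 2), (8, 2), (8, 3), (1, 3);
""")

both = find_news_ids_rarest_first(conn, ["tag1", "tag8"])
all_three = find_news_ids_rarest_first(conn, ["tag1", "tag6", "tag8"])
print(both, all_three)
```

The returned news_id values can then be joined with the news table, as in the outermost layer of the single-statement version.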
