GROUP BY means grouping by rules: a set of rules divides a dataset into several small groups, and the data in each group is then processed separately.
A GROUP BY query can only return the grouping fields themselves and the results of aggregate functions.
In other words, every field in the returned set must either appear after the GROUP BY clause (and thus serve as a grouping key) or be wrapped in an aggregate function, which collapses a group's multiple values into one. Example 1 analyzes this in detail.
Once GROUP BY and aggregate functions are understood, it also becomes clear why an expression such as count(distinct field) works.
Example 1.
fruitname   productplace   price   discount
Apple       China          $1.1    0.8
Apple       Japan          $2.1    0.9
Apple       USA            $2.5    0.9
Orange      China          $0.8    null
Banana      China          $3.1    null
Requirement: find out how many kinds of fruit each producing country has.
Group by:
Whenever a requirement reads "for each **", GROUP BY should come to mind. Group the dataset by country of origin (productplace), then count the records in each group:
select count(*) as fruit_count, productplace as country_of_origin from t_test_fruitinfo group by productplace
Return value analysis:
Think of the GROUP BY operation as the following process. First, the system obtains a result set from the SELECT statement, such as the detail table above of fruit name, producing country, and unit price.
Then, records with the same value of the grouping field are merged into a single record. At this point, the fields that do not appear after the GROUP BY clause may still hold several different values, but a group now occupies only one output row, and a single data cell cannot hold multiple values. These multi-valued columns must therefore be collapsed into a single value before they can be placed into the cell, and that is exactly the step an aggregate function performs.
That is why these functions are called aggregate functions.
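To make Example 1 concrete, here is a small runnable sketch using Python's standard sqlite3 module. The table and rows are reconstructed from the example above; the column types and the alias names are assumptions, and prices are stored as plain numbers rather than "$" strings.

```python
import sqlite3

# Build the sample fruit table from Example 1 in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_test_fruitinfo "
    "(fruitname TEXT, productplace TEXT, price REAL, discount REAL)"
)
conn.executemany(
    "INSERT INTO t_test_fruitinfo VALUES (?, ?, ?, ?)",
    [("Apple", "China", 1.1, 0.8),
     ("Apple", "Japan", 2.1, 0.9),
     ("Apple", "USA", 2.5, 0.9),
     ("Orange", "China", 0.8, None),
     ("Banana", "China", 3.1, None)],
)

# GROUP BY productplace: one output row per producing country,
# with COUNT(*) aggregating the rows merged into each group.
rows = conn.execute(
    "SELECT productplace, COUNT(*) AS fruit_count "
    "FROM t_test_fruitinfo GROUP BY productplace ORDER BY productplace"
).fetchall()
print(rows)  # [('China', 3), ('Japan', 1), ('USA', 1)]
```

Note that selecting a non-grouped, non-aggregated column such as fruitname here would be ambiguous: China's group contains Apple, Orange, and Banana, and one cell cannot hold all three.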
Example 2:
memberid   externalid
1          sina1001
1          qq2002
2          sina1002
Requirement: find the memberids that are bound to more than one externalid, i.e. the memberid values that repeat.
Group by: to decide whether a field has repeated values, first group by that field. If a group contains more than one record, the field value of that group must occur more than once. Executing group by b.memberid having count(1) > 1, for memberid 1 we clearly get count(1) = 2, so memberid 1 is returned.
select * from t_external_member a where a.memberid in (select b.memberid from t_external_member b group by b.memberid having count(1) > 1)
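A runnable sketch of Example 2, again with sqlite3 and the three rows from the table above (schema types are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_external_member (memberid INTEGER, externalid TEXT)")
conn.executemany(
    "INSERT INTO t_external_member VALUES (?, ?)",
    [(1, "sina1001"), (1, "qq2002"), (2, "sina1002")],
)

# The subquery keeps only memberids whose group has more than one record;
# the outer query then returns the full rows for those members.
rows = conn.execute(
    "SELECT * FROM t_external_member a "
    "WHERE a.memberid IN (SELECT b.memberid FROM t_external_member b "
    "GROUP BY b.memberid HAVING COUNT(1) > 1)"
).fetchall()
print(rows)  # memberid 1 has two externalids, so only its two rows are returned
```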
Example 3:
Scenario: in the table, memberid identifies a user and geoid identifies a city the user has visited. Because a city contains multiple scenic spots and one record is written per scenic spot, geoid has repeated values.
Requirement: count the number of distinct cities each user has visited.
Design: first divide the records into one group per user (all the scenic spots that user has visited). Within a group, count(distinct geoid) eliminates the repeated geoids and, being an aggregate function, yields a single value for each memberid.
Inefficient SQL: first use GROUP BY to remove duplicates, then count.
select a.memberid, count(a.geoid) as want_geo_count from
(
  select a.memberid, a.geoid from t_daodao_member_location a group by a.memberid, a.geoid order by a.memberid
) as a group by a.memberid
memberid   want_geo_count
47         1
52         2
60         3
63         1
The subquery groups by memberid + geoid, so identical geoids under the same memberid collapse into a single row, giving this intermediate set:
memberid   geoid
47         55229
52         45963
52         294212
60         60763
60         294217
60         528733
63         60763
The outer query then performs a second grouping of this intermediate set by memberid and computes the count for each memberid.
Better SQL: group by memberid once. Within each group, a count operation certainly returns a single value per memberid, and count(field) naturally generalizes to count(distinct field). Distinct thus replaces GROUP BY for deduplication, and one pass suffices.
select a.memberid, count(distinct a.geoid) as want_geo_count from t_daodao_member_location a group by a.memberid order by a.memberid
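A runnable sketch of the one-pass version with sqlite3. The rows match the intermediate table above, plus one deliberately duplicated (memberid, geoid) pair so that count(distinct ...) actually has something to deduplicate (that extra row is my addition):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_daodao_member_location (memberid INTEGER, geoid INTEGER)")
conn.executemany(
    "INSERT INTO t_daodao_member_location VALUES (?, ?)",
    [(47, 55229), (52, 45963), (52, 294212),
     (60, 60763), (60, 294217), (60, 528733),
     (60, 60763),  # duplicate: two scenic spots in the same city
     (63, 60763)],
)

# One pass: group by memberid and let COUNT(DISTINCT ...) do the deduplication.
rows = conn.execute(
    "SELECT a.memberid, COUNT(DISTINCT a.geoid) AS want_geo_count "
    "FROM t_daodao_member_location a GROUP BY a.memberid ORDER BY a.memberid"
).fetchall()
print(rows)  # [(47, 1), (52, 2), (60, 3), (63, 1)]
```

Member 60 has four records but only three distinct cities, so the duplicate is ignored without the extra subquery.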
Execution order of GROUP BY and the other keywords:
When a query contains where, group by, having, and order by clauses, the logical execution order (which differs from the writing order) is:
1. Execute the where clause to filter the full table, producing result set 1.
2. Apply group by to result set 1, producing result set 2.
3. Apply having to result set 2: the aggregate functions are evaluated once per group and groups that fail the condition are discarded, producing result set 3.
4. Evaluate the select list for each remaining group in result set 3; if there are N groups, it is evaluated N times, producing result set 4.
5. Apply order by to sort result set 4.
Example:
Write a complex query that must:
Show the names and average scores of the students whose average score exceeds 70, ordered from high to low. To raise the averages as much as possible, scores below 60 are excluded before averaging, and the scores of the student 'jr' are not counted at all.
Analysis:
1. The student name and average score must be displayed.
So determine Step 1: select s_name, avg(score) from student
2. Scores below 60 are excluded from the average, and the scores of 'jr' are not counted.
So determine Step 2: where score >= 60 and s_name != 'jr'
3. The average is a per-student average.
Students with the same name (the same student) take multiple subjects, so the records are grouped by name.
So determine Step 3: group by s_name
4. Only averages above 70 are shown.
So determine Step 4: having avg(score) >= 70
5. From high to low.
So determine Step 5: order by avg(score) desc
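The five steps assemble into one statement. Here is a runnable sketch with sqlite3; the student(s_name, score) schema and the sample scores are my assumptions, chosen so that each clause filters something:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (s_name TEXT, score REAL)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("alice", 80), ("alice", 90), ("alice", 50),  # 50 is dropped by WHERE
     ("bob", 65), ("bob", 60),                     # avg 62.5, dropped by HAVING
     ("jr", 95), ("jr", 99)],                      # 'jr' is excluded by name
)

rows = conn.execute(
    "SELECT s_name, AVG(score) AS avg_score FROM student "
    "WHERE score >= 60 AND s_name != 'jr' "
    "GROUP BY s_name "
    "HAVING AVG(score) >= 70 "
    "ORDER BY AVG(score) DESC"
).fetchall()
print(rows)  # [('alice', 85.0)]
```

Note the order: alice's 50 is removed by where before the average is taken (85 instead of 73.3), bob's whole group is removed by having, and jr never enters the grouping at all.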
--------------------------------------------
A GROUP BY association:
Grouping by multiple fields is very convenient when the data source is a database. Below is a written interview question from the Internet with a requirement similar to GROUP BY, except that the data source is a log file and SQL statements cannot be used; you need to design the data structures in the application instead:
1. When a user visits the website, a random string represents the user. The same user visiting other pages is represented by the same string as the user ID.
2. Each page of the website is represented by an English string called page_type_id.
3. The website access-log format is: <userid space page_type_id>.
4. Access path: find the three-page access path that occurs most often. A three-page access path is something like home -- product -- product detail, or product -- support -- FAQ.
Given such a log file, write a Java program to find the most frequent three-node path.
The following is an example of a log file:
123 home
234 product
456 product detail
123 product
456 product
123 product detail
......
Solution: use two hashmaps. M1: HashMap<userid, id1_id2_id3> holds the path each user is currently building; M2: HashMap<id1_id2_id3, count> counts how often each three-page path occurs.
In GROUP BY terms, M1 plays the role of group by userid (with id1_id2_id3 as the grouped value), and M2 plays the role of the aggregate function count(id1_id2_id3).
The question now is how the application builds a value such as id1_id2_id3 and how the count is accumulated.
The idea is as follows:
The log is read line by line and M1 <userid, page_type_id> is filled in sequentially, for example:
userid   page_type_id
123      home          (line 1)
456      product       (line 2)
123      home          (line 3)
After line 1, M1 holds ['123', 'home']. Line 3 shows user '123' refreshing the same page, so it can be skipped (the value already stored in M1 is unchanged). If the new page differs from the stored one, for example:
userid   page_type_id
123      home          (line 4)
123      product       (line 5)
then the path becomes home-product, i.e. M1 = ['123', 'home,product']. If line n brings user 123 to yet another page:
userid   page_type_id
123      detail        (line n)
then M1 becomes ['123', 'home,product,detail'], and M2 stores ['home,product,detail', 1] (count = 1).
From then on, whenever user '123' completes a path of three pages, the value taken from M1 is compared against the keys of M2 ('home,product,detail', ...): if the key already exists, its count is incremented; if not, a new key is added to M2 with count 1. The user's entry in M1 is then reset and the procedure repeats for the next path.
Two accumulations are involved: one counts the path length to decide when the third page has been reached; the other looks up the key in M2, adds 1 to it if found, and sets it to 1 if not. This is typical hashmap usage.
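The exercise asks for Java, but for consistency with the earlier examples here is a sketch of the same two-map idea in Python (the function name count_paths and the treatment of each completed triple as a fresh path are my choices; the original description is ambiguous on whether paths should instead slide by one page):

```python
from collections import defaultdict

def count_paths(log_lines):
    """Count three-page access paths and return the most frequent one."""
    m1 = {}                 # userid -> pages of the path being built (GROUP BY userid)
    m2 = defaultdict(int)   # "p1,p2,p3" -> count (the aggregate COUNT role)
    for line in log_lines:
        userid, page = line.split(None, 1)
        pages = m1.setdefault(userid, [])
        if pages and pages[-1] == page:
            continue        # same user refreshing the same page: skip
        pages.append(page)
        if len(pages) == 3:  # a full three-node path: count it, then start over
            m2[",".join(pages)] += 1
            m1[userid] = []
    return max(m2.items(), key=lambda kv: kv[1]) if m2 else None

log = ["123 home", "234 product", "456 product_detail",
       "123 product", "456 product", "123 product_detail",
       "123 home", "123 product", "123 product_detail"]
print(count_paths(log))  # ('home,product,product_detail', 2)
```

The defaultdict(int) plays the role of "get the key, add 1; on a miss, set 1" in a single line.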