This post covers some basic algorithms for data operations. The content is elementary and is meant as a refresher. It can also serve as a quick introduction to basic algorithms for those who, like me, are new to database and data mining technologies. I think basic algorithms matter not only for programming and performance optimization, but also for analyzing and solving problems. Enough preamble. Comments are welcome, and if you spot a mistake in the text, please point it out.
Join queries are often used when we query data.
select r,s from R join S on R.id=S.rid
How does this join work, and what does it cost? Let's analyze it, starting with the size of the result set. Here R ∩ S denotes the set of attributes that the two schemas have in common. If R ∩ S is empty, the join is a Cartesian product. If R ∩ S is a key of R, each tuple of S joins with at most one tuple of R, so the number of tuples in the result does not exceed the number of tuples in S. If R ∩ S is a key of neither R nor S, let R ∩ S = {A}, let Ns be the number of tuples in S, and let V(A, S) be the number of distinct values of attribute A in S. Then Ns / V(A, S) is the average number of tuples of S for a given value of A, and the result of joining R with S is estimated to contain Nr * Ns / V(A, S) tuples.
First, the most basic join method: the nested-loop join. To compute the theta join of R and S, consider the pseudocode:
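The estimate for the case where the shared attribute is a key of neither relation is simple arithmetic. Here is a minimal sketch in Python; the function name and the sample statistics are made up for illustration and do not come from any real catalog.

```python
def estimate_join_size(n_r: int, n_s: int, distinct_a_in_s: int) -> float:
    """Estimate the join size as Nr * Ns / V(A, S).

    Assumes the shared attribute A is a key of neither relation.
    """
    if distinct_a_in_s <= 0:
        raise ValueError("V(A, S) must be positive")
    # Ns / V(A, S): average number of S tuples per distinct value of A.
    avg_s_tuples_per_value = n_s / distinct_a_in_s
    # Each of the Nr tuples of R is assumed to match that many S tuples.
    return n_r * avg_s_tuples_per_value


# Hypothetical statistics: 10,000 tuples in R, 5,000 in S, 2,500 distinct A values in S.
print(estimate_join_size(10_000, 5_000, 2_500))  # 20000.0
```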
for each tuple tr in R do
begin
    for each tuple ts in S do
    begin
        test whether the pair (tr, ts) satisfies the join condition θ;
        if it does, add tr · ts to the result;
    end
end
The algorithm examines Nr * Ns pairs of tuples, because for each tuple of R we must perform a full scan of S. In the best case, memory can hold both relations. Each data block then needs to be read only once, so only Br + Bs block accesses are needed (Br and Bs denote the number of blocks containing tuples of R and S respectively). If the smaller relation fits entirely in memory, it should be used as the inner relation, since the inner relation then needs to be read only once. In the worst case, the buffer can hold only one block of each relation, which requires Nr * Bs + Br block accesses.
As you can see, when memory cannot hold both relations at the same time, this method is costly. Here is a small optimization:
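To make the loop structure and the two cost cases concrete, here is a small Python sketch. It keeps everything in memory, so the block-access function only evaluates the formulas above; the function names and sample data are my own illustration.

```python
def nested_loop_join(r, s, theta):
    """Join two lists of tuples under an arbitrary condition theta(tr, ts)."""
    result = []
    for tr in r:          # outer relation
        for ts in s:      # full scan of the inner relation for every outer tuple
            if theta(tr, ts):
                result.append(tr + ts)  # concatenate the two tuples
    return result


def nested_loop_block_accesses(n_r, b_r, b_s, both_fit_in_memory):
    """Block accesses according to the best-case and worst-case formulas."""
    if both_fit_in_memory:
        return b_r + b_s          # every block is read exactly once
    return n_r * b_s + b_r        # S is rescanned for each tuple of R


R = [(1, "a"), (2, "b")]
S = [(1, 10), (1, 11), (3, 12)]
joined = nested_loop_join(R, S, lambda tr, ts: tr[0] == ts[0])
print(joined)  # [(1, 'a', 1, 10), (1, 'a', 1, 11)]
print(nested_loop_block_accesses(5000, 100, 400, False))  # 2000100
```

Because theta is just a predicate, the same function works for conditions other than equality, for example `lambda tr, ts: tr[0] < ts[0]`.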
for each block Br of R do
begin
    for each block Bs of S do
    begin
        for each tuple tr in Br do
        begin
            for each tuple ts in Bs do
            begin
                test whether the pair (tr, ts) satisfies the join condition θ;
                if it does, add tr · ts to the result;
            end
        end
    end
end
This algorithm processes the relations block by block instead of tuple by tuple; it is the block nested-loop join. In the worst case, each block of the inner relation S is read once for each block of the outer relation, which requires Br * Bs + Br block accesses. In the block nested-loop join, the outer relation can be processed in units of the largest chunk that fits in memory, instead of single disk blocks. If there is an index on the join attribute of the inner relation, a more efficient index lookup can replace the file scan.
The two algorithms above can be used for any join. (If the join attribute in a natural join or equi-join is a key of the inner relation, the inner loop can terminate as soon as the first matching tuple is found.) For natural joins and equi-joins, there are more efficient algorithms:
Merge join
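The block-at-a-time loop can be sketched in Python as follows. Each relation is modeled as a list of blocks, and a counter stands in for disk reads, so the Br * Bs + Br figure can be checked directly. The helper `to_blocks` and the block size are assumptions made for the example.

```python
def to_blocks(rows, block_size):
    """Split a list of tuples into fixed-size blocks (hypothetical helper)."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def block_nested_loop_join(r_blocks, s_blocks, theta):
    """Return (result, block_reads) for a block nested-loop join."""
    result = []
    block_reads = 0
    for br in r_blocks:
        block_reads += 1              # read one block of the outer relation
        for bs in s_blocks:
            block_reads += 1          # read one block of the inner relation
            for tr in br:
                for ts in bs:
                    if theta(tr, ts):
                        result.append(tr + ts)
    return result, block_reads


R_rows = [(1,), (2,), (3,), (4,)]
S_rows = [(1, "x"), (2, "y"), (2, "z"), (5, "w"), (4, "q")]
r_blocks = to_blocks(R_rows, 2)   # Br = 2
s_blocks = to_blocks(S_rows, 2)   # Bs = 3
rows, reads = block_nested_loop_join(r_blocks, s_blocks, lambda tr, ts: tr[0] == ts[0])
print(len(rows), reads)  # 4 8
```

With Br = 2 and Bs = 3 the counter reports 2 * 3 + 2 = 8 reads, matching the worst-case formula.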
pr = address of the first tuple of R;
ps = address of the first tuple of S;
while (ps ≠ null and pr ≠ null) do
begin
    ts = the tuple pointed to by ps;
    Ss = {ts};
    point ps to the next tuple of S;
    done = false;
    while (not done and ps ≠ null) do
    begin
        ts' = the tuple pointed to by ps;
        if (ts'[joinattrs] = ts[joinattrs])
        then begin
            Ss = Ss ∪ {ts'};
            point ps to the next tuple of S;
        end
        else done = true;
    end
    tr = the tuple pointed to by pr;
    while (pr ≠ null and tr[joinattrs] < ts[joinattrs]) do
    begin
        point pr to the next tuple of R;
        tr = the tuple pointed to by pr;
    end
    while (pr ≠ null and tr[joinattrs] = ts[joinattrs]) do
    begin
        for each ts in Ss do
        begin
            add ts joined with tr to the result;
        end
        point pr to the next tuple of R;
        tr = the tuple pointed to by pr;
    end
end
The merge join algorithm assigns a pointer to each relation. Each pointer initially points to the first tuple of its relation and advances through the whole relation as the algorithm proceeds. All tuples of S with the same value on the join attributes are collected into Ss. Because both relations are sorted on the join attributes, each tuple needs to be read only once. Number of block accesses: Br + Bs.
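Here is a compact Python sketch of the same idea, using list indexes in place of the pointers pr and ps. It assumes both inputs are already sorted on the join key; the `key_r` and `key_s` parameters are my own addition for extracting that key.

```python
def merge_join(r, s, key_r, key_s):
    """Equi-join of two lists that are already sorted on their join keys."""
    result = []
    pr = ps = 0
    while pr < len(r) and ps < len(s):
        # Collect Ss: the run of S tuples sharing the current key value.
        k = key_s(s[ps])
        ss = [s[ps]]
        ps += 1
        while ps < len(s) and key_s(s[ps]) == k:
            ss.append(s[ps])
            ps += 1
        # Skip R tuples whose key is smaller than the current S key.
        while pr < len(r) and key_r(r[pr]) < k:
            pr += 1
        # Pair every R tuple with an equal key with every tuple in Ss.
        while pr < len(r) and key_r(r[pr]) == k:
            for ts in ss:
                result.append((r[pr], ts))
            pr += 1
    return result


R = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
S = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
pairs = merge_join(R, S, key_r=lambda t: t[0], key_s=lambda t: t[0])
print(len(pairs))  # 5
```

Neither index ever moves backwards, which is why each relation is scanned only once.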
Hash join
If a tuple of R and a tuple of S satisfy the join condition, they have the same value on the join attributes. If that value is mapped to i by the hash function h, the tuple of R must be in partition Hri and the tuple of S must be in partition Hsi. Therefore, the tuples in Hri need to be compared only with the tuples in Hsi, not with the tuples in any other partition of S.
for each tuple ts in S do
begin
    i = h(ts[joinattrs]);
    Hsi = Hsi ∪ {ts};
end
for each tuple tr in R do
begin
    i = h(tr[joinattrs]);
    Hri = Hri ∪ {tr};
end
for i = 0 to max do
begin
    read Hsi and build an in-memory hash index on it;
    for each tuple tr in Hri do
    begin
        probe the hash index on Hsi to locate all tuples ts with ts[joinattrs] = tr[joinattrs];
        for each matching tuple ts in Hsi do
        begin
            add tr joined with ts to the result;
        end
    end
end
The value of max should be large enough that, for every i, memory can hold the partition Hsi of the build input together with its hash index. It is best to use the smaller input relation as the build input.
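The partition, build and probe steps can be sketched in Python as below. This is an in-memory illustration only: Python's built-in `hash` stands in for h, a dictionary stands in for the in-memory hash index, and `n_partitions` plays the role of max. All names are assumptions for the example.

```python
from collections import defaultdict


def hash_join(r, s, key_r, key_s, n_partitions=4):
    """Equi-join of r and s, with s as the build input and r as the probe input."""
    # Partition phase: tuples with equal keys land in the same partition number.
    hs = [[] for _ in range(n_partitions)]
    hr = [[] for _ in range(n_partitions)]
    for ts in s:
        hs[hash(key_s(ts)) % n_partitions].append(ts)
    for tr in r:
        hr[hash(key_r(tr)) % n_partitions].append(tr)

    result = []
    for i in range(n_partitions):
        # Build phase: index partition Hsi on the join key.
        index = defaultdict(list)
        for ts in hs[i]:
            index[key_s(ts)].append(ts)
        # Probe phase: look up each tuple of Hri in the index on Hsi.
        for tr in hr[i]:
            for ts in index.get(key_r(tr), []):
                result.append((tr, ts))
    return result


R = [(1, "a"), (2, "b"), (3, "c")]
S = [(2, "x"), (3, "y"), (3, "z"), (9, "w")]
pairs = hash_join(R, S, key_r=lambda t: t[0], key_s=lambda t: t[0])
print(len(pairs))  # 3
```

The order of the output depends on how keys fall into partitions, so compare results as sets or sort them first.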
Recursive partitioning: if the value of max is greater than or equal to the number of page frames in memory, the relation cannot be partitioned in a single pass. In each pass, the input can be split into at most as many partitions as there are buffer pages available for output. The buckets generated by one pass are read and partitioned separately in the next pass, and each pass uses a hash function different from the one used in the previous pass. The splitting repeats until every partition of the build input fits in memory. When M > max + 1, where M is the number of memory blocks and max = Bs / M (Bs being the number of blocks of the build input), the relation does not need to be partitioned recursively.
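The condition M > max + 1 is easy to check numerically. The sketch below assumes max is taken as Bs / M rounded up, with Bs the number of blocks of the build input; the function names and the sample sizes are illustrative.

```python
import math


def partitions_needed(b_s: int, m: int) -> int:
    """max = Bs / M, rounded up so that each partition fits in M blocks."""
    return math.ceil(b_s / m)


def needs_recursive_partitioning(b_s: int, m: int) -> bool:
    """True unless M > max + 1, the single-pass condition stated above."""
    return not (m > partitions_needed(b_s, m) + 1)


print(needs_recursive_partitioning(400, 25))  # False: max = 16 and 25 > 17
print(needs_recursive_partitioning(400, 10))  # True:  max = 40 and 10 <= 41
```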
Overflow handling: hash-table overflow occurs in partition i of the build input S when the hash index on Hsi is larger than memory. Causes: many tuples share the same join-attribute value, or the hash function lacks randomness and uniformity.
Overflow resolution: if, for some i, Hsi turns out to be too large, we can partition it further with a different hash function. The matching partition Hri must then be split with the same function so that matching tuples still end up in corresponding partitions.
Overflow avoidance: first divide relation S into many small partitions, then combine some of them so that each combined partition still fits in memory.
If a large number of tuples share the same join-attribute value, further partitioning cannot separate them. In that case, instead of building an in-memory hash index and joining the partition with a nested-loop join, you can use other techniques such as block nested-loop join on those partitions.
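Overflow resolution can be sketched as a recursive split that uses a different hash function at each level. In the Python sketch below, salting Python's built-in `hash` with the level number stands in for "a different hash function", and `max_tuples`, `fanout` and `max_level` are hypothetical tuning parameters, not values from the text.

```python
def resolve_overflow(partition, key, max_tuples, level=1, fanout=4, max_level=8):
    """Split an oversized build partition into smaller pieces."""
    if len(partition) <= max_tuples or level > max_level:
        return [partition]
    if len({key(t) for t in partition}) == 1:
        # Every tuple has the same key: no hash function can split this piece.
        return [partition]
    buckets = [[] for _ in range(fanout)]
    for t in partition:
        # The level acts as a salt, giving a different hash function per pass.
        buckets[hash((level, key(t))) % fanout].append(t)
    pieces = []
    for bucket in buckets:
        if bucket:
            pieces.extend(
                resolve_overflow(bucket, key, max_tuples, level + 1, fanout, max_level)
            )
    return pieces


# 6 tuples with key 7 (a heavy duplicate) plus 10 tuples with distinct keys.
hs_i = [(7, n) for n in range(6)] + [(k, 0) for k in range(100, 110)]
pieces = resolve_overflow(hs_i, key=lambda t: t[0], max_tuples=4)
print(sum(len(p) for p in pieces))  # 16
```

Tuples with equal keys always stay in the same piece, so the six duplicates of key 7 remain together even though they exceed `max_tuples`. That is the situation where falling back to block nested-loop join for that piece makes sense.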
Cost:
Without recursive partitioning: 3(Br + Bs) + 2 * max. Partitioning reads and writes both relations, which takes 2(Br + Bs) block accesses; the build and probe phases read every partition once more, which takes another Br + Bs. Each of the max partitions may have a partially filled block that has to be written and read back, which adds at most 2 * max.
With recursive partitioning: 2(Br + Bs)⌈log_{M-1}(Bs) - 1⌉ + Br + Bs. Each pass reduces the size of each partition to 1/(M-1) of its previous size, and passes repeat until each partition occupies at most M blocks.
Complex joins
Nested-loop join and block nested-loop join can be used under any join condition. The other join techniques are more efficient than nested-loop join, but they can handle only simple join conditions such as natural joins or equi-joins.
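Both cost formulas can be evaluated with a few lines of Python. In this sketch the number of passes is computed with integer arithmetic rather than a floating-point logarithm, to avoid rounding problems at exact powers, and it is clamped at zero for inputs that already fit in memory. The clamp is my own assumption, as are the sample figures.

```python
def hash_join_cost(b_r: int, b_s: int, n_partitions: int) -> int:
    """3(Br + Bs) + 2 * max, the cost without recursive partitioning."""
    return 3 * (b_r + b_s) + 2 * n_partitions


def partitioning_passes(b_s: int, m: int) -> int:
    """ceil(log_{M-1}(Bs) - 1), computed with integers only."""
    if m < 3:
        raise ValueError("need at least 3 memory blocks")
    k, power = 0, 1
    while power < b_s:      # smallest k with (M-1)^k >= Bs
        power *= m - 1
        k += 1
    return max(k - 1, 0)    # clamp: small inputs need no partitioning pass


def recursive_hash_join_cost(b_r: int, b_s: int, m: int) -> int:
    """2(Br + Bs) * passes + Br + Bs."""
    return 2 * (b_r + b_s) * partitioning_passes(b_s, m) + b_r + b_s


print(hash_join_cost(2000, 1000, 20))            # 9040
print(partitioning_passes(1000, 11))             # 2
print(recursive_hash_join_cost(2000, 1000, 11))  # 15000
```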
For example:
The join of R and S under a compound condition θ1 ... θn can be computed from the individual joins of R and S under θ1, ..., θn.
For a join of three relations, besides using associativity to join them two at a time, you can build indexes on two of the relations and, for each tuple of the third, look up the matching tuples in both.