Classification Algorithms
In website construction, classification algorithms are widely used. When designing an electronic store, product classification should be involved; when designing a publishing system, topic or channel classification should be involved; when downloading software, software classification should be involved; and so on. Classification is a common problem.
1. classification algorithms are often represented by Tree Representation and traversal. So, what fields should I have if I use a table in the database to express tree classification?
2. How to quickly restore a tree from this table;
3. Determine whether a category is a subclass of another category;
4. How to find all products under a certain category;
5. How to generate the path of the category.
6. How to add a category;
These questions are not easily answered without limiting the number of classifications and the number of classifications per level. This article attempts to solve these problems.
Data Structure of classification
We know that the data structure of classification is actually a tree. In the data structure course, you may have learned Tree algorithms. Because we use a large number of databases during website construction, we will talk about the storage of trees in the database.
To simplify the problem, we assume that each node only needs to retain the name information. We need to number each node. There are many numbering methods. Automatic numbers are commonly used in databases. This is true in access, SQL Server, and Oracle. Assume that the number field is ID.
To indicate that a node id1 is the parent node of another node Id2, we need to keep another field in the database to indicate which node belongs to this category. Name this field fatherid. For example, in Id2, its fatherid is id1.
In this way, we get the definition of the data table of the category catalog:
Create Table [catalog] (
[ID] [int] not null,
[Name] [nvarchar] (50) not null,
[Fatherid] [int] not null
);
Conventions: we agree to use-1 as the parent code for the top layer of classification. -1. This is a virtual classification. It is not recorded in the database.
How to restore a tree
The biggest advantage of the catalog definition above is that it can be used to easily restore a tree-Classification Tree. To better display the algorithm, we first consider a simple question: how to display the next level of a category. We know that the SQL statement is very simple to query the next level of FID:
Select name from catalog where fatherid = FID
When displaying these categories, we simply use
To:
<%
Rem oconn --- database connection, enabled when getchildren is called
Rem FID ----- Number of the current category
Function getchildren (oconn, FID)
Strsql = "select ID, name from catalog where fatherid =" & FID
Set rscatalog = oconn. Execute (strsql)
%>
<%
Do while not rscatalog. EOF
%>
<% = Rscatalog ("name") %>
<%
Loop
%>
<%
Rscatalog. Close
End Function
%>
Now let's take a look at how to display all categories under FID. This requires recursive algorithms. You only need to call all IDs in the getchildren function: getchildren (oconn, catalog ("ID.
<%
Rem oconn --- database connection, enabled
Rem FID ----- Number of the current category
Function getchildren (oconn, FID)
Strsql = "Select name from catalog where fatherid =" & FID
Set rscatalog = oconn. Execute (strsql)
%>
<%
Do while not rscatalog. EOF
%>
<% = Rscatalog ("name") %>
<% = Getchildren (oconn, catalog ("ID") %>
<%
Loop
%>
<%
Rscatalog. Close
End Function
%>
After the modification, getchildren can complete the task of displaying all sub-categories of the FID category. To display all categories, you only need to call this method:
<%
Rem strconn -- string used to connect to the database. Modify the value as needed.
Set oconn = server. Createobject ("ADODB. Connection ")
Oconn. Open strconn
= Getchildren (oconn,-1)
Oconn. Close
%>
How to find all products under a certain category;
Now we can solve the fourth problem we mentioned above. Leave the third question for exercise. Assume that the data table of the product is defined as follows:
Create Table product (
[ID] [int] not null,
[Name] [nvchar] not null,
[Fatherid] [int] not null
);
Here, ID is the product number, name is the product name, And fatherid is the product category.
For the fourth question, it is easy to think of the method: First find all the sub-classes of this category FID, and then query all the products under all sub-classes. Implementing this algorithm is actually complicated. The code is roughly as follows:
<%
Function getallid (oconn, FID)
Dim strtemp
If FID =-1 then
Strtemp = ""
Else
Strtemp = ","
End if
Strsql = "Select name from catalog where fatherid =" & FID
Set rscatalog = oconn. Execute (strsql)
Do while not rscatalog. EOF
Strtemp = strtemp & rscatalog ("ID") & getallid (oconn, catalog ("ID") REM recursive call
Loop
Rscatalog. Close
Getallid = strtemp
End Function
Rem strconn -- string used to connect to the database. Modify the value as needed.
Set oconn = server. Createobject ("ADODB. Connection ")
Oconn. Open strconn
FID = request. querystring ("FID ")
Strsql = "select Top 100 * from product where fatherid in (" & getallid (oconn, FID )&")"
Set rsproduct = oconn. Execute (strsql)
%>
<%
Do while not rsproduct. EOF
%>
<% = Rsproduct ("name") %>
<%
Loop
%>
<% Rsproduct. Close
Oconn. Close
%>
This algorithm has many disadvantages. The following are some examples:
1. Because we need to query all categories under FID, when there are many categories, the algorithm will be very economic, and because we need to construct a large strsql, imagine if there are 1000 categories, this strsql will be very large, and whether it can be executed is a problem.
2. We know that using the in clause in SQL is very inefficient. This algorithm inevitably uses the in clause, which is inefficient.
I found that more than 80% of programmers love this algorithm and use it in many systems. Careful programmers will find that they write very slow programs, but cannot find the cause. They repeatedly check the SQL Execution efficiency and improve the machine's grade, but the increase in efficiency is very small.
The most fundamental problem lies in the algorithm itself. If the algorithm is fixed, there will be no more opportunities for optimization. Next we will introduce an algorithm, which is more than 10 times the efficiency of the above algorithm.
Classification Algorithm
The problem is that we used sequential encoding, which is the simplest encoding method. As you know, simplicity does not mean efficiency. In fact, coding science is a required course for programmers. Next, we design an encoding algorithm to make the ID of a classification contain the information of its parent class at the same time. An example of a level-5 classification is as follows:
In this example, 32 (4 + 7 + 7 + 7 + 7 + 7) integers are used for encoding. The first-level classification has 4 digits, which can express 16 types of classification. There are 7 sub-categories from Level 2 to level 5, which can express 128 sub-categories.
Obviously, if we get a classification encoded as 1092787200, we will know that because it is encoded
0100 0001001 0001010 0111000 0000000
So it is Level 4 classification. The binary code of its parent class is 0100 0001001 0001010 0000000 0000000, And the decimal number is 1092780032. We can also know that the parent class's parent class encoding is 0100 0001001 0000000 0000000 0000000, and the parent class's parent class encoding is 0100 0000000 0000000 0000000 0000000. (I am not so arrogant about J, but this is very important. Let's look back at the fifth question we mentioned above. Haha, isn't the classification path of category 1092787200 already obtained ?).
Now we will discuss the issue of class encoding in general cases. Set the class level to K and the number of bits in the I layer to Ni. The total number of BITs is n (N1 + N2 +... + NK ). The encoding format of any category is as follows:
2 ^ (N-(N1 + N2 +... + Ni) * j + parent class Encoding
Where, I indicates the I layer, and J indicates the J category of the current layer.
In this way, the encoding of any classification is divided into two parts, one of which is its layer encoding and the other is its parent class encoding.
According to the following formula, one of the K encodings is called the signature code: (because I can take K values, there are K)
2 ^ N-2 ^ (N-(N1 + N2 +... + Ni ))
For any given category ID, if we "phase" the ID and K signatures, the non-zero encoding is the encoding of all its parent classes!
Bit Encoding Algorithm
For catalog tables of any sequential encoding, we can design a single-bit encoding algorithm to convert all class encoding specifications into bit encoding. In specific implementation, we first create a temporary table:
Create tempcatalog (
[Oldid] [int] not null,
[Newid] [int] not null,
[Oldfatherid] [int] not null,
[Newfatherid] [int] not null
);
In this table, we reserve all the original class numbers oldid and its parent class numbers oldfatherid, and recalculate the corresponding numbers newid and newfatherid that meet the bit encoding requirements.
The procedure is as follows:
<%
Rem oconn --- database connection, enabled
Rem oldfather --- original parent class number
Rem newfather --- new parent class number
Rem N --- total number of digits of the Encoding
Rem Ni-array of digits of each encoding level
Rem level -- current level
Sub formatallid (oconn, oldfather, newfather, N, nm, ni byref, level)
Strsql = "select catalogid, fatherid from catalog where fatherid =" & oldfather
Set rscatalog = oconn. Execute (strsql)
J = 1
Do while not rscatalog. EOF
I = 2 ^ (n-nm) * j
If level then I = I + newfather
Oldcatalog = rscatalog ("catalogid ")
Newcatalog = I
Rem write to temporary table
Strsql = "insert into tempcatalog (oldcatalogid, newcatalogid, oldfatherid, newfatherid )"
Strsql = strsql & "values (" & oldcatalog & "," & newcatalog & "," & oldfather & "," & newfather &")"
Conn. Execute strsql
Rem recursively calls formatallid
Nm = Nm + Ni (LEVEL + 1)
Formatallid oconn, oldcatalog, newcatalog, N, nm, Ni, level + 1
Rscatalog. movenext
J = J + 1
Loop
Rscatalog. Close
End sub
%>
An example of calling this algorithm is as follows:
<%
Rem defines the encoding parameters, where N is the total number of digits, and Ni is the number of digits per level.
Dim N, Ni (5)
Ni (1) = 4
N = Ni (1)
For I = 2 to 5
Ni (I) = 7
N = N + Ni (I)
Next
Rem open the database and create a temporary table
Strsql = "create tempcatalog ([oldid] [int] not null, [newid] [int] not null, [oldfatherid] [int] not null, [newfatherid] [int] not null );"
Set conn = server. Createobject ("ADODB. Connection ")
Conn. open application ("strconn ")
Conn. Execute strsql
Rem call normalization routine
Formatallid Conn,-1,-1, n, Ni (1), Ni, 0
Rem ------------------------------------------------------------------------
Rem updates the category encoding of all related tables here as the new encoding.
Rem ------------------------------------------------------------------------
Rem closes the database
Strsql = "Drop table tempcatalog ;"
Conn. Execute strsql
Conn. Close
%>
Fourth Question
Now let's look back at the fourth question: how to get all the products under a certain category. Because of the bit encoding, the problem is now very simple. It is easy to estimate that the condition for a product to belong to a certain category is product. fatherid & (Pattern of catalog. ID) = catalog. Id. "&" Indicates the bit and algorithm. This is directly supported in SQL Server.
For example, the product category is 1092787200, and the current category is 1092780032. The feature value corresponding to the current category is 4294950912. Because 1092787200 & 4294950912 = 8537400, this product belongs to category 8537400.
We have provided a formula for calculating the signature. There are not many signatures, and they are easy to calculate. You can consider calculating them when the application_onstart time is triggered in global. Asa and storing them in the application ("mark") array.
Of course, with signatures, we can also get more efficient algorithms. We know that although we adopt bitwise encoding, it is actually a sequential encoding method. The class I encoding must be smaller than the class I + 1 encoding. Based on this feature, we can also get two signatures by FID, one of which is the level-level signature fid0, and the other is the level-level signature fid1. The FID of a product belongs to the following criteria:
Product. fatherid> fid0 and product. fatherid
The following program displays all products under category FID. Because the data table product has indexed the fatherid, the query speed is extremely fast:
<%
Rem oconn --- database connection, enabled
Rem FID --- current category
Rem fidmark --- an array of feature values, typically an application ("mark ")
Rem K --- Number of array elements, also the level of classification
Sub getallproduct (oconn, FID, fidmark byref, K)
Rem calculates the feature values fid0 and fid1 Based on FID.
For I = K to 1
If (FID and fidmark = FID) Then exit
Next
Strsql = "Select name from product where fatherid>" fidmark (I) & "and fatherid <" fidmark (I-1)
Set rsproduct = oconn. Execute (strsql) %>
<%
Do while not rsproduct. EOF %>
<% = Rsproduct ("name ")
Loop %>
<%
Rsproduct. Close
End sub
%>