A high efficiency classification algorithm is introduced

Source: Internet
Author: User
Tags definition connect table definition
Algorithm


In the website construction, the classification algorithm application is very widespread. When designing an electronic store, it involves the classification of goods, the classification of columns or channels when designing a publishing system, and the classification of software when designing software downloads; It can be said that classification is a very common problem.
I often interview some programmers, and I almost invariably ask them some questions about classification algorithms. Here are a few questions that I often ask. Do you think you can easily answer ^_^?
1, the classification algorithm often manifests as the tree representation and the traversal question. So, excuse me: If you use a table in the database to express the tree category, how many fields should there be?
2. How to quickly restore a tree from this table;
3, how to determine whether a classification is another subcategory of the subclass;
4, how to find a category of all products;
5, how to generate the path of the classification.
6, how to add new classification;
These questions are not easily answered without limiting the number of levels and categories of classification. This article tries to solve these problems.
Data structure of the classification
We know: The data structure of the classification is actually a tree. In the course of data structure, you may have studied the tree algorithm. Since we use a lot of database in the construction of the website, we will talk about the storage of tree in the database.
To simplify the problem, we assume that each node only needs to keep the name of this information. We need to number each node. There are many kinds of numbering methods. An automatic number is often used in a database. This is true in Access, SQL Server, and Oracle. Assume the Number field ID.
In order to indicate that a node ID1 is a parent node of another node ID2, we need to keep a field in the database to show which node this taxonomy belongs to. Name this field Fatherid. such as the ID2 here, its fatherid is ID1.
In this way, we get the data table definition for the classification catalog:
Create Table [Catalog] (
[ID] [int] not NULL,
[Name] [nvarchar] () not NULL,
[Fatherid] [INT] Not NULL
);
Agreement: We agreed to use-1 as the first layer of the classification of the Father code. The category is numbered-1. This is a virtual classification. It has no records in the database.
How to restore a tree
The biggest advantage of the catalog definition above is that it makes it easy to recover a tree-a classification tree. To show the algorithm more clearly, let's consider a simple question: How to display the next level of classification for a category. We know that to query the next level classification of a taxonomy FID, the SQL statement is very simple:
Select Name from Catalog where Fatherid=fid
When displaying these categories, we simply use <LI> to:

<%
REM oconn---Database connection, opened when calling GetChildren
REM FID-----Number of the current category

Function GetChildren (Oconn,fid)
strSQL = "Select id,name from Catalog where fatherid=" &fid
Set rscatalog = Oconn.execute (strSQL)
%>
<UL>
<%
Do as not rscatalog.eof
%>
<li><%=rscatalog ("Name")%>
<%
Loop
%>
</UL>
<%
Rscatalog.close
End Function
%>
Now let's look at how to show all the classifications under the FID. This requires a recursive algorithm. All we need to do is simply call all IDs in the GetChildren function: GetChildren (Oconn,catalog ("ID") is OK.
<%
REM oconn---Database connection, already open
REM FID-----Number of the current category

Function GetChildren (Oconn,fid)
strSQL = "Select Name from Catalog where fatherid=" &fid
Set rscatalog = Oconn.execute (strSQL)
%>
<UL>
<%
Do as not rscatalog.eof
%>
<li><%=rscatalog ("Name")%>
<%=getchildren (Oconn,catalog ("ID"))%>

<%
Loop
%>
</UL>
<%
Rscatalog.close
End Function
%>
The modified GetChildren can complete the task of displaying all subcategories of the FID classification. To show all the categories, just call it:
<%
REM strconn--the string to connect to the database, modify it as appropriate
set oconn = Server.CreateObject ("ADODB. Connection ")
oConn.Open strconn
=getchildren (oconn,-1)
Oconn.close
%>

How to find all products of a category;
Now to address the fourth question we raised earlier. The third question is left as an exercise. We assume that the data table for the product is defined as follows:
Create Table Product (
[ID] [int] not NULL,
[Name] [Nvchar] Not NULL,
[Fatherid] [INT] Not NULL
);
Where the ID is the product number, name is the product, and the Fatherid is the category to which the product belongs.
For the fourth question, it's easy to think of a way to find all the subclasses of this taxonomy, and then query all the products under all subclasses. Implementing this algorithm is actually very complex. The code is roughly as follows:
<%
Function Getallid (Oconn,fid)
Dim strtemp

If Fid=-1 Then
strtemp = ""
Else
strtemp = ","
End If

strSQL = "Select Name from Catalog where fatherid=" &fid
Set rscatalog = Oconn.execute (strSQL)
Do as not rscatalog.eof
Strtemp=strtemp&rscatalog ("id") &getallid (Oconn,catalog ("id")) REM recursive call
Loop
Rscatalog.close

Getallid = strtemp
End Function

REM strconn--the string to connect to the database, modify it as appropriate
set oconn = Server.CreateObject ("ADODB. Connection ")
oConn.Open strconn

FID = Request.QueryString ("FID")

strSQL = "SELECT top * from Product where Fatherid in (" &getallid (Oconn,fid) & ")"
Set Rsproduct=oconn.execute (strSQL)
%>
<ul><%
Do as not rsproduct.eof
%>
<li><%=rsproduct ("Name")%>
<%
Loop
%>
</UL>
<%rsproduct.close
Oconn.close
%>

This algorithm has many drawbacks. Try to list several of the following:
1, because we need to query the FID under all classifications, when the classification is very much, the algorithm will be very economic, and, because to construct a very large strsql, imagine if there are 1000 categories, this strSQL will be very large, whether the implementation is a problem.
2, we know that the efficiency of using in clauses in SQL is very low. This algorithm inevitably uses in clause, the efficiency is very low.

I find that more than 80% of programmers love such algorithms and use them heavily in many systems. Careful programmers will find that they write slow programs, but they can't find the cause. They repeatedly check the execution efficiency of SQL, improve the grade of the machine, but the efficiency of the increase is very small.
The most fundamental problem is the algorithm itself. The algorithm is fixed, the chance to be optimized is not much. We are going to introduce an algorithm that is more than 10 times times more efficient than the above algorithm.
Classification coding algorithm
The problem is that we used sequential coding, which is one of the simplest coding methods. As you know, simplicity does not mean efficiency. In fact, coding science is a compulsory course for programmers. Next, we design an encoding algorithm that includes information about its parent class in the numbering ID of the category. An example of a level five classification is as follows:

In this case, the 32 (4+7+7+7+7) bit integer is encoded, where the first level category has 4 bits and can express 16 categories. The second level to the fifth level category has 7 digits, which can express 128 subcategories.
Obviously, if we get a category encoded in 1092787200, we know that it is a fourth-level classification because it is encoded as
0100 0001001 0001010 0111000 0000000
. The binary encoding of its parent class is 0100 0001001 0001010 0000000 0000000, and the decimal number is 1092780032. In turn, we also know that the parent class encoding of its parent class is 0100 0001001 0000000 0000000 0000000, and that the parent class of its parent class is 0100 0000000 0000000 0000000 0000000. (I'm not too wordy about J, but it's important.) Look back at the fifth question we mentioned earlier. Haha, this does not already have the classification 1092787200 where the classification path? )。
Now we'll discuss the category encoding problem in general. The level of the class is K, the number of encoded digits in layer I is NI, then the total number of encoded digits is n (n1+n2+). +NK). We'll get the encoded form of any category as follows:
2^ (N-(N1+n2+...+ni)) *j + parent class encoding
wherein, I represents layer I and J represents the first J classification of the current layer.
This allows us to divide the encoding of any classification into two parts, part of which is its layer encoding and, in part, its parent class code.
The K-code defined by the following formula is what we call a signature: (because I can take k values, so there are k)
2^n-2^ (n (n1+n2+...+ni))
for any given class ID, if we get the ID and K signature "phase", the non 0 code, is the encoding of all its parent classes!

Bit coding algorithm
For any sequence-coded catalog table, we can design a bit-coded algorithm to normalize all class encodings to bits. In the concrete implementation, we first create a temporary table:
Create Tempcatalog (
[oldid] [INT] Not NULL,
[NewID] [INT] Not NULL,
[Oldfatherid] [INT] Not NULL,
[Newfatherid] [INT] Not NULL
);
In this table, we reserve all the original category number oldid and its parent class number Oldfatherid, as well as the corresponding number newid, newfatherid that are recalculated to meet the bit encoding requirements.
The procedure is as follows:
<%
REM oconn---Database connection, already open
REM Oldfather---The original parent class number
REM Newfather---The New parent class number
Total digits of REM N---encoding
REM ni--An array of encoded digits per level
REM level--Current Progression

Sub Formatallid (Oconn,oldfather,newfather,n,nm,ni byref,level)
strSQL = "Select Catalogid, Fatherid from Catalog where fatherid=" & Oldfather
Set Rscatalog=oconn.execute (strSQL)

j = 1
Do as not rscatalog.eof
i = 2 ^ (n-nm) * j
If level then i= i + newfather


Oldcatalog = Rscatalog ("Catalogid")
NewCatalog = i

REM Write temp table
strSQL = "Insert into Tempcatalog (Oldcatalogid, Newcatalogid, Oldfatherid, Newfatherid)"
strSQL = strSQL & "VALUES (" & Oldcatalog & "," & NewCatalog & "," & Oldfather & "," &am P Newfather & ")"

Conn.execute strSQL

REM Recursive call Formatallid
NM = NM + Ni (level+1)
Formatallid Oconn,oldcatalog, NewCatalog, N,nm,ni,level + 1

Rscatalog.movenext

j = j+1
Loop

Rscatalog.close
End Sub
%>

An example of calling this algorithm is as follows:
<%
REM defines the encoding parameters, where n is the total number of digits, and NI is the number of digits per level.
Dim N,ni (5)

Ni (1) = 4

N = Ni (1)

For i=2 to 5
Ni (i) = 7
n = n + Ni (i)
Next

REM Open database, create temporary table
strSQL = "Create tempcatalog" ([oldid] [int] NOT NULL, [NewID] [int] NOT NULL, [Oldfatherid] [int] NOT NULL, [Newfatherid] [INT] Not NULL);
Set Conn = Server.CreateObject ("ADODB. Connection ")
Conn.Open Application ("strconn")
Conn.execute strSQL

REM invokes normalization routines
Formatallid Conn,-1,-1,n,ni (1), ni,0

REM------------------------------------------------------------------------
REM updates the category encoding for all related tables here to the new encoding.
REM------------------------------------------------------------------------

REM Shutdown Database
strsql= "drop table tempcatalog;"
Conn.execute strSQL
Conn.close
%>

Fourth question
Now we look back at the fourth question: How to get all the products under a certain category. With the use of bit code, the problem is now simple. It is easy to extrapolate that a product is a category with a condition of product.fatherid& (catalog.id signature) =catalog.id. Where "&" represents the bit and algorithm. This is directly supported in SQL Server.
For example: the category to which the product belongs is: 1092787200, and the current category is 1092780032. The current category corresponds to the characteristic value of: 4294950912, because of 1092787200&4294950912=8537400, so this product belongs to category 8537400.
We have previously given formulas for calculating signatures. The signatures are not many and are easy to compute, and can be considered for application_onstart time triggers in Global.asa and stored in the application ("Mark") array.
of course, with signatures, we can also get more efficient algorithms. We know that although we use bit coding, it is actually a sequential coding method. It shows that the classification code of Grade I is certainly smaller than that of the i+1 classification. According to this characteristic, we can also get two signatures from FID, one of which is this level signature FID0, and one is the superior bit signature FID1. The sufficient and necessary condition for a product to be a classified FID is that the following program in
Product.fatherid>fid0 and PRODUCT.FATHERID<FID1
Displays all products under the FID. Because the datasheet product has already indexed Fatherid, the query is extremely fast:
<%
REM oconn---Database connection, and has opened the
REM FID---Current classification
REM Fidmark---Array of eigenvalues, Typically, application ("Mark")
REM k---array elements, and also the series of categories
Sub getallproduct (Oconn,fid,fidmark byref,k)
REM Computes the eigenvalues Fid0,fid1
for i=k to 1
if (FID and fidmark = FID) then exit
Next

strSQL = "Select Name from Product where fatherid>" Fidmark (i) & "and fatherid<" Fidmark (i-1)
Set Rsproduct=oconn.execute (strSQL)%>
<ul><%
Do as not rsproduct.eof%>
<li><%=rsproduct ("Name")
Loop%>
</ul><%
Rsproduct.close
End Sub
%>




Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.