An efficient Classification Algorithm

Source: Internet
Author: User
Category Algorithm Problems to be Solved
In website construction, classification algorithms are widely used. When designing an electronic store, the product category should be involved; when designing a publishing system, the topic or channel category should be involved; when downloading the design software Program Software classification, and so on. Classification is a common problem.

I often interview some programmers, and I have to ask them questions about classification algorithms without exception. The following are some questions that I often ask. Do you think you can easily answer? ^_^.

1. classification algorithms are often represented by Tree Representation and traversal. So, what fields should I have if I use a table in the database to express tree classification?
2. How to quickly restore a tree from this table;
3. Determine whether a category is a subclass of another category;
4. How to find all products under a certain category;
5. How to generate the path of the category.
6. How to add a category;

These questions are not easily answered without limiting the number of classifications and the number of classifications per level. This article attempts to solve these problems.

Data Structure of classification
We know that the data structure of classification is actually a tree. In the data structure course, you may have learned Tree algorithms. Because we use a large number of databases during website construction, we will talk about the storage of trees in the database.

To simplify the problem, we assume that each node only needs to retain the name information. We need to number each node. There are many numbering methods. Automatic numbers are commonly used in databases. This is true in access, SQL Server, and Oracle. Assume that the number field is ID.

To indicate that a node id1 is the parent node of another node Id2, we need to keep another field in the database to indicate which node belongs to this category. Name this field fatherid. For example, in Id2, its fatherid is id1.

In this way, we get the definition of the data table of the category catalog:

Create Table [catalog] (

[ID] [int] not null,

[Name] [nvarchar] (50) not null,

[Fatherid] [int] not null

);

Conventions: we agree to use-1 as the parent code for the top layer of classification. -1. This is a virtual classification. It is not recorded in the database.

How to restore a tree
The biggest advantage of the catalog definition above is that it can be used to easily restore a tree-Classification Tree. To better display the algorithm, we first consider a simple question: how to display the next level of a category. We know that the SQL statement is very simple to query the next level of FID:

Select name from catalog where fatherid = FID

When displaying these categories, we simply use <li>:

<%

Rem oconn --- database connection, enabled when getchildren is called

Rem FID ----- Number of the current category

Function getchildren (oconn, FID)

Strsql = "select ID, name from catalog where fatherid =" & FID

Set rscatalog = oconn. Execute (strsql)

%>

<Ul>

<%

Do while not rscatalog. EOF

%>

<Li> <% = rscatalog ("name") %>

<%

Loop

%>

</Ul>

<%

Rscatalog. Close

End Function

%>

Now let's take a look at how to display all categories under FID. This requires recursive algorithms. You only need to call all IDs in the getchildren function: getchildren (oconn, catalog ("ID.

<%

Rem oconn --- database connection, enabled

Rem FID ----- Number of the current category

Function getchildren (oconn, FID)

Strsql = "Select name from catalog where fatherid =" & FID

Set rscatalog = oconn. Execute (strsql)

%>

<Ul>

<%

Do while not rscatalog. EOF

%>

<Li> <% = rscatalog ("name") %>

<% = Getchildren (oconn, catalog ("ID") %>

<%

Loop

%>

</Ul>

<%

Rscatalog. Close

End Function

%>

After the modification, getchildren can complete the task of displaying all sub-categories of the FID category. To display all categories, you only need to call this method:

<%

Rem strconn -- string used to connect to the database. Modify the value as needed.

Set oconn = server. Createobject ("ADODB. Connection ")

Oconn. Open strconn

= Getchildren (oconn,-1)

Oconn. Close

%>

How to find all products under a certain category;
Now we can solve the fourth problem we mentioned above. Leave the third question for exercise. Assume that the data table of the product is defined as follows:

Create Table product (

[ID] [int] not null,

[Name] [nvchar] not null,

[Fatherid] [int] not null

);

Here, ID is the product number, name is the product name, And fatherid is the product category.

For the fourth question, it is easy to think of the method: First find all the sub-classes of this category FID, and then query all the products under all sub-classes. Implementing this algorithm is actually complicated.CodeIt is roughly as follows:

<%

Function getallid (oconn, FID)

Dim strtemp

If FID =-1 then

Strtemp = ""

Else

Strtemp = ","

End if

Strsql = "Select name from catalog where fatherid =" & FID

Set rscatalog = oconn. Execute (strsql)

Do while not rscatalog. EOF

Strtemp = strtemp & rscatalog ("ID") & getallid (oconn, catalog ("ID") REM recursive call

Loop

Rscatalog. Close

Getallid = strtemp

End Function

Rem strconn -- string used to connect to the database. Modify the value as needed.

Set oconn = server. Createobject ("ADODB. Connection ")

Oconn. Open strconn

FID = request. querystring ("FID ")

Strsql = "select Top 100 * from product where fatherid in (" & getallid (oconn, FID )&")"

Set rsproduct = oconn. Execute (strsql)

%>

<Ul> <%

Do while not rsproduct. EOF

%>

<Li> <% = rsproduct ("name") %>

<%

Loop

%>

</Ul>

<% Rsproduct. Close

Oconn. Close

%>

This algorithm has many disadvantages. The following are some examples:

1. Because we need to query all categories under FID, when there are many categories, the algorithm will be very economic, and because we need to construct a large strsql, imagine if there are 1000 categories, this strsql will be very large, and whether it can be executed is a problem.

2. We know that using the in clause in SQL is very inefficient. This algorithm inevitably uses the in clause, which is inefficient.

I found that more than 80% of programmers love this algorithm and use it in many systems. Careful programmers will find that they write very slow programs, but cannot find the cause. They repeatedly check the SQL Execution efficiency and improve the machine's grade, but the increase in efficiency is very small.

The most fundamental problem lies in the algorithm itself. If the algorithm is fixed, there will be no more opportunities for optimization. Next we will introduce an algorithm, which is more than 10 times the efficiency of the above algorithm.

Classification Algorithm
The problem is that we used sequential encoding, which is the simplest encoding method. As you know, simplicity does not mean efficiency. In fact, coding science is a required course for programmers. Next, we design an encoding algorithm to make the ID of a classification contain the information of its parent class at the same time. An example of a level-5 classification is as follows:

In this example, 32 (4 7 7 7 7) integers are used for encoding. The first-level classification has 4 digits, which can express 16 types of classification. There are 7 sub-categories from Level 2 to level 5, which can express 128 sub-categories.

Obviously, if we get a classification encoded as 1092787200, we will know that because it is encoded

0100 0001001 0001010 0111000 0000000

So it is Level 4 classification. The binary code of its parent class is 0100 0001001 0001010 0000000 0000000, And the decimal number is 1092780032. We can also know that the parent class's parent class encoding is 0100 0001001 0000000 0000000 0000000, and the parent class's parent class encoding is 0100 0000000 0000000 0000000 0000000. (I am not so arrogant about J, but this is very important. Let's look back at the fifth question we mentioned above. Haha, isn't the classification path of category 1092787200 already obtained ?).

Now we will discuss the issue of class encoding in general cases. Set the class level to K and the number of bits in the I layer to Ni. The total number of BITs is n (N1 N2 .. NK ). The encoding format of any category is as follows:

2 ^ (N-(N1 N2... Ni) * j parent class Encoding

Where, I indicates the I layer, and J indicates the J category of the current layer.

In this way, the encoding of any classification is divided into two parts, one of which is its layer encoding and the other is its parent class encoding.

According to the following formula, one of the K encodings is called the signature code: (because I can take K values, there are K)

2 ^ N-2 ^ (N-(N1 N2... Ni ))

For any given category ID, if we "phase" the ID and K signatures, the non-zero encoding is the encoding of all its parent classes!

Bit Encoding Algorithm
For catalog tables of any sequential encoding, we can design a single-bit encoding algorithm to convert all class encoding specifications into bit encoding. In specific implementation, we first create a temporary table:

Create tempcatalog (

[Oldid] [int] not null,

[Newid] [int] not null,

[Oldfatherid] [int] not null,

[Newfatherid] [int] not null

);

In this table, we reserve all the original class numbers oldid and its parent class numbers oldfatherid, and recalculate the corresponding numbers newid and newfatherid that meet the bit encoding requirements.

The procedure is as follows:

<%

Rem oconn --- database connection, enabled

Rem oldfather --- original parent class number

Rem newfather --- new parent class number

Rem N --- total number of digits of the Encoding

Rem Ni-array of digits of each encoding level

Rem level -- current level

Sub formatallid (oconn, oldfather, newfather, N, nm, ni byref, level)

Strsql = "select catalogid, fatherid from catalog where fatherid =" & oldfather

Set rscatalog = oconn. Execute (strsql)

J = 1

Do while not rscatalog. EOF

I = 2 ^ (n-nm) * j

If level then I = I newfather

Oldcatalog = rscatalog ("catalogid ")

Newcatalog = I

Rem write to temporary table

Strsql = "insert into tempcatalog (oldcatalogid, newcatalogid, oldfatherid, newfatherid )"

Strsql = strsql & "values (" & oldcatalog & "," & newcatalog & "," & oldfather & "," & newfather &")"

Conn. Execute strsql

Rem recursively calls formatallid

Nm = Nm Ni (level 1)

Formatallid oconn, oldcatalog, newcatalog, N, nm, Ni, level 1

Rscatalog. movenext

J = J 1

Loop

Rscatalog. Close

End sub

%>

An example of calling this algorithm is as follows:

<%

Rem defines the encoding parameters, where N is the total number of digits, and Ni is the number of digits per level.

Dim N, Ni (5)

Ni (1) = 4

N = Ni (1)

For I = 2 to 5

Ni (I) = 7

N = n Ni (I)

Next

Rem open the database and create a temporary table

Strsql = "create tempcatalog ([oldid] [int] not null, [newid] [int] not null, [oldfatherid] [int] not null, [newfatherid] [int] not null );"

Set conn = server. Createobject ("ADODB. Connection ")

Conn. open application ("strconn ")

Conn. Execute strsql

Rem call normalization routine

Formatallid Conn,-1,-1, n, Ni (1), Ni, 0

Rem ------------------------------------------------------------------------

Rem updates the category encoding of all related tables here as the new encoding.

Rem ------------------------------------------------------------------------

Rem closes the database

Strsql = "Drop table tempcatalog ;"

Conn. Execute strsql

Conn. Close

%>

Fourth Question
Now let's look back at the fourth question: how to get all the products under a certain category. Because of the bit encoding, the problem is now very simple. It is easy to estimate that the condition for a product to belong to a certain category is product. fatherid & (Pattern of catalog. ID) = catalog. Id. "&" Indicates the bit and algorithm. This is directly supported in SQL Server.

For example, the product category is 1092787200, and the current category is 1092780032. The feature value corresponding to the current category is 4294950912. Because 1092787200 & 4294950912 = 8537400, this product belongs to category 8537400.

We have provided a formula for calculating the signature. There are not many signatures, and they are easy to calculate. You can consider calculating them when the application_onstart time is triggered in global. Asa and storing them in the application ("mark") array.

Of course, with signatures, we can also get more efficient algorithms. We know that although we adopt bitwise encoding, it is actually a sequential encoding method. The class I encoding is definitely smaller than the class I 1 encoding. Based on this feature, we can also get two signatures by FID, one of which is the level-level signature fid0, and the other is the level-level signature fid1. The FID of a product belongs to the following criteria:

Product. fatherid> fid0 and product. fatherid <fid1

The following program displays all products under category FID. Because the data table product has indexed the fatherid, the query speed is extremely fast:

<%

Rem oconn --- database connection, enabled

Rem FID --- current category

Rem fidmark --- an array of feature values, typically an application ("mark ")

Rem K --- Number of array elements, also the level of classification

Sub getallproduct (oconn, FID, fidmark byref, K)

Rem calculates the feature values fid0 and fid1 Based on FID.

For I = K to 1

If (FID and fidmark = FID) Then exit

Next

Strsql = "Select name from product where fatherid>" fidmark (I) & "and fatherid <" fidmark (I-1)

Set rsproduct = oconn. Execute (strsql) %>

<Ul> <%

Do while not rsproduct. EOF %>

<Li> <% = rsproduct ("name ")

Loop %>

</Ul> <%

Rsproduct. Close

End sub

%>

For 5th and 6th questions, leave them for exercises. With the bitcode above, everything should be solved

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.