Design and Implementation of .NET-Based Word Segmentation Software V6.0 - Using a Database


 

I have been busy for a while, so today I will take some time to wrap up the unfinished word segmentation series.

 

The previous article covered using HashSet<T> as the data structure for storing the dictionary. Within my abilities, that was also the best solution I could find without using a database.

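For reference, a minimal sketch of what such an in-memory lookup looks like (this is an illustration only, not the code from the earlier articles; the class, file name, and loading details are assumptions):

using System.Collections.Generic;
using System.IO;

class HashSetDictionary
{
    // In-memory word set; lookups are O(1) on average.
    private readonly HashSet<string> words;

    public HashSetDictionary(string path)
    {
        // Assumed format: one word per line, as described for the txt dictionary below.
        words = new HashSet<string>(File.ReadAllLines(path));
    }

    public bool Contains(string word)
    {
        return words.Contains(word);
    }
}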
 

But what happens if we do use a database? Let's look at how this word segmentation software performs when the dictionary lives in one.

 

1. Create a database

In earlier versions, the word segmentation dictionary was saved directly as plain text in a txt file, one word per line. To use a database, all of it has to be moved into a table: I read the text document line by line in a loop and inserted each word into the table with an INSERT statement.

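A minimal sketch of that loading loop might look like the following (the file name "dictionary.txt", the connection string key, the column length, and the use of a parameterized INSERT are my assumptions; the actual loader may differ):

using System.Configuration;
using System.Data;
using System.Data.SqlClient;
using System.IO;

class DictionaryLoader
{
    static void Main()
    {
        // Connection string name and file path are assumptions for illustration.
        string connectionString = ConfigurationManager.ConnectionStrings["ConnectionString"].ToString();
        using (SqlConnection con = new SqlConnection(connectionString))
        {
            con.Open();
            // Read the dictionary text file line by line; each line holds one word.
            foreach (string line in File.ReadAllLines("dictionary.txt"))
            {
                string word = line.Trim();
                if (word.Length == 0) continue;

                // Insert the word into the Vocabulary table.
                using (SqlCommand cmd = new SqlCommand(
                    "INSERT INTO Vocabulary (item) VALUES (@item)", con))
                {
                    cmd.Parameters.Add("@item", SqlDbType.VarChar, 50).Value = word;
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }
}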
 

Then we ran a simple test without any optimization. First, we turned on the statistics options so that SQL Server reports the I/O cost and the parse, compile, and execution times:

 

SET STATISTICS IO ON
SET STATISTICS TIME ON

Let's look up the simple word "they":

SELECT * FROM Vocabulary WHERE item = 'they'

 

SQL Execution result:

  

 

 

Note the key figures: 871 logical reads, CPU time 62 ms, elapsed time 59 ms.

 

Then we changed the method in the program that determines whether a word exists in the dictionary:

 

/// <summary>
/// Updated: determine whether the dictionary contains a word
/// </summary>
/// <param name="str">the word to look up</param>
/// <returns>true if the word exists in the dictionary</returns>
bool isExist(string str)
{
    DBHelper db = new DBHelper();
    return Convert.ToInt32(db.ExecuteScalar("select count(*) from Vocabulary where item = '" + str + "'")) > 0;
}

At first I planned to test it with a 1000-word text, but I soon found that idea unrealistic. Why? Take a look at the segmentation result for a 100-word text:

 

 

That's right: segmenting a 100-word text takes 20+ seconds, which is an intolerable result.

 

2. Optimization Step 1 - Index creation

There are plenty of articles about indexing on the blog garden (cnblogs), from principles to worked examples, including some classics, so I won't repeat them here. Instead, I will focus on how indexes apply to this word segmentation software.

 

First, the item field, that is, the field that stores the words, is the one we query constantly, and it is the table's primary key, so it is a good candidate for a clustered index:

 

USE [Splitter]
GO
/****** Object: Index [PK_Vocabulary] Script Date: 05/06/2011 23:56:26 ******/
CREATE CLUSTERED INDEX [PK_Vocabulary] ON [dbo].[Vocabulary]
(
    [item] ASC
) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]

OK, the table now looks like this:

 

 

Does it look more pleasing to the eye now? (I added two extra fields, py and len, to support some special queries, for example finding the words whose pinyin abbreviation is "AB"; this is useful in the program.)

 

Query the word "they" with the same statement and view the result:

  

 

Note the key figures: 2 logical reads, CPU time 0 ms, elapsed time 11 ms. The cost has dropped significantly.

 

3. Optimization Step 2 - Using a fill factor

The fill factor percentage indicates how full the leaf-level index pages are when the index is first created. If it is not set explicitly, the default value is 0.

 

When an index is first created, SQL Server places its B-tree structure on contiguous physical pages, which gives optimal I/O performance because the index pages can be scanned with sequential I/O. When a page splits and a new page has to be inserted into the logical B-tree of the index, SQL Server must allocate a new 8 KB index page. That new page may end up elsewhere on the disk, breaking the physical continuity of the index pages; I/O goes from sequential to random, and performance can drop by half. Excessive page splitting can be remedied by rebuilding the index, which restores the physical order of the index pages. The leaf level of a clustered index has the same problem, and there it affects the data pages of the table itself.

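For illustration, rebuilding the index from the application could look roughly like this (a sketch only; the article itself re-creates the index with the scripts shown later, and the fill factor value here is just an example):

using System.Configuration;
using System.Data.SqlClient;

class IndexMaintenance
{
    static void Main()
    {
        // Connection string name is an assumption; the index and table names match
        // the PK_Vocabulary index on dbo.Vocabulary used in this article.
        string connectionString = ConfigurationManager.ConnectionStrings["ConnectionString"].ToString();
        using (SqlConnection con = new SqlConnection(connectionString))
        {
            con.Open();
            // Rebuilding restores the physical order of the index pages after heavy
            // page splitting; FILLFACTOR = 90 is only an example value.
            using (SqlCommand cmd = new SqlCommand(
                "ALTER INDEX [PK_Vocabulary] ON [dbo].[Vocabulary] REBUILD WITH (FILLFACTOR = 90)", con))
            {
                cmd.CommandTimeout = 0; // index rebuilds can take a while
                cmd.ExecuteNonQuery();
            }
        }
    }
}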
 

A 100% fill factor improves read performance, but it slows down writes and leads to frequent page splits, because the database engine must constantly move rows around to make room on full data pages.

 

A low fill factor helps inserts but hurts reads: if the fill factor is too small, more pages are needed to hold the same data, and more pages mean every query has to read more physical data pages, so read performance drops.

 

Best practice: use a 100% fill factor on tables with no modification activity, 70-90% for low-to-medium activity, and 50% or even lower for highly active tables.

 

 

USE [Splitter]
GO
/****** Object: Index [PK_Vocabulary] Script Date: 05/06/2011 23:56:26 ******/
CREATE CLUSTERED INDEX [PK_Vocabulary] ON [dbo].[Vocabulary]
(
    [item] ASC
) WITH (PAD_INDEX = ON,
        FILLFACTOR = 100,      -- fill factor 100%
        SORT_IN_TEMPDB = OFF,
        DROP_EXISTING = ON,    -- replace the index created in step 1
        IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]

 

 

The code above demonstrates creating a clustered index and specifying a fill factor of 100%.

 

 

Now check the effect of querying "they" again:

  

 

 

Note the key figures: 3 logical reads, CPU time 0 ms, elapsed time 1 ms.

 

We can see a further improvement over the index built with the default fill factor.

 

Next, let's look at the test result for the 1000-word text:

  

 

As you can see, the 1000-word text now takes about 1.7 seconds, a huge improvement over the unindexed version.

 

4. Optimization Step 3 - Using Stored Procedures

First, let's list the generally accepted advantages of stored procedures:

 

• Efficiency:

 

- A stored procedure is compiled when it is created and does not need to be recompiled on every execution, whereas an ordinary SQL statement is typically parsed and compiled each time it runs, so stored procedures can make SQL execution more efficient.

- Stored procedure code is stored directly in the database, so the client does not have to send large T-SQL statements over the wire, which significantly reduces network traffic.

• Security: stored procedures are called with parameterized SQL, which helps prevent common SQL injection attacks (a parameterized version of the earlier lookup is sketched after this list).

• Reusable and highly maintainable: updating stored procedures usually requires less time and effort than changing, testing, and re-deploying an assembly.

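As an aside on the security point: even without a stored procedure, the earlier string-concatenated lookup could be made safe with a parameterized ad-hoc query. A minimal sketch (not the article's code; the connection string name is an assumption):

using System;
using System.Configuration;
using System.Data;
using System.Data.SqlClient;

class VocabularyLookup
{
    // Parameterized version of the earlier lookup: the word is passed as a value,
    // so it can never be interpreted as part of the SQL text.
    public static bool IsExist(string str)
    {
        string connectionString = ConfigurationManager.ConnectionStrings["ConnectionString"].ToString();
        using (SqlConnection con = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(
            "SELECT COUNT(*) FROM Vocabulary WHERE item = @item", con))
        {
            cmd.Parameters.Add("@item", SqlDbType.VarChar, 50).Value = str;
            con.Open();
            return Convert.ToInt32(cmd.ExecuteScalar()) > 0;
        }
    }
}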
Since word segmentation issues a large number of repeated SQL statements, can a stored procedure improve execution efficiency? Let's try:

USE [Splitter]
GO
/****** Object: StoredProcedure [dbo].[IsExist] Script Date: 05/07/2011 01:34:56 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[IsExist]
(
    @item varchar(50),
    @result int OUTPUT
)
AS
BEGIN
    SET NOCOUNT ON;
    SELECT @result = count(*) FROM Vocabulary WHERE item = @item
END

And here is how it is called from the program:

 

bool isExist(string str)
{
    SqlConnection con = new SqlConnection(ConfigurationManager.ConnectionStrings["ConnectionString"].ToString());
    con.Open();
    SqlCommand cmd = new SqlCommand("IsExist", con);
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.Add(new SqlParameter("@item", SqlDbType.VarChar, 50));
    cmd.Parameters["@item"].Value = str;
    cmd.Parameters.Add(new SqlParameter("@result", SqlDbType.Int));
    cmd.Parameters["@result"].Direction = ParameterDirection.Output;
    // ExecuteNonQuery runs the procedure; the count comes back in the output parameter.
    cmd.ExecuteNonQuery();
    int result = Convert.ToInt32(cmd.Parameters["@result"].Value);
    con.Close();
    return result > 0;
}

I tested the 1000-word text again; the time improved slightly, to roughly 1.3 seconds, about 25% better than before.

 

5. Summary

So far, we have completed all the research on this word segmentation software; a summary is given below (V5.0 and V6.0 are combined in the previous article):

 

 

One last article still remains, which will describe the final effect of porting the software to a B/S (browser/server) architecture.

 

  
