My current project contains a table with 62 million rows and more than 500 columns. The data comes from outside SQL Server. The tables they arrive at have a primary key, and all the remaining columns are saved in varchar (50) format.
Problem Solving
Many dates are saved in the YYYYMMDD format, which is the first problem I have noticed. I changed these dates to integer columns. Then I found that many integers are saved as varchar, and I changed them to integers.
Next, many columns contain "Y", "N", or NULL. We are likely to convert them into rankings, but there is a major defect in doing so-you cannot index a ranking. Therefore, I changed them to char (1) NULL.
In addition, some columns contain zip codes. The first column is five digits, and the last column is four digits or NULL. I changed them to char (5) and char (4), both of which are NULL.
Another problem I encountered was that I could not understand the content in a considerable number of columns. I process them separately, for varchar (50 ). I want to determine the maximum length of the string saved in each varchar column, but it is impractical to check all the data in row 62 million. Therefore, I choose to generate a char or varchar list for all columns.
SELECTCOLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, IS_NULLABLE FROMINFORMATION_SCHEMA.COLUMNSWHERE TABLE_NAME ='tblArthurClean'AND DATA_TYPE IN('char','varchar')
|
Copy the first column in The result table and paste it into notepad (this is my favorite text editor because it has excellent search-replace functions .) To determine the maximum data in a specific column, use the following script:
SELECT Max(Len( ColumnName )) FROM TableName
|
In other words, I want to traverse the column list and generate the declaration I need. I concentrated the required content into some formatting rows and used a separate query to achieve this goal.
SELECT',Max(Len( ' + COLUMN_NAME + ' )) AS ' + COLUMN_NAME FROMINFORMATION_SCHEMA.COLUMNSWHERE TABLE_NAME = 'tblArthurClean'AND DATA_TYPE IN( 'char', 'varchar' )
|
List A is A short result set. Perform the following steps:
Max(Len( Record_Length_Indicator_or_Filler )) AS Record_Length_Indicator_or_Filler, Max(Len( Job_Number )) AS Job_Number, Max(Len( Personal_or_Firm_Name )) AS Personal_or_Firm_Name, Max(Len( Mailing_Address_Supplied_by_Customer
|
Paste the result set into a text editor.
Insert SELECT before the first line.
Add the FROM and table names at the end of the file.
I add the following content at the beginning and end of the file.
Then I delete the comma before the first line of the output result, save it as a query, load it, and run it again.
I am very satisfied with its performance on my server. It takes more than an hour to calculate the result. It is not too long to analyze a table with hundreds of rows and hundreds of columns.
The result is a result set of a row. I can use it as a guide for modifying the table structure and column name, and associate these results with the original table to make corresponding adjustments.
This article explains how to save a lot of input time. My motto is: "As long as SQL can be done, it should be done ."