My current project involves a table with 62 million rows and more than 500 columns. Because the data comes from outside SQL Server, it arrives in a staging table in which the primary key is identified and every remaining column is stored as varchar(50).
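To make that concrete, here is a rough sketch of how such a staging table might be declared. The table name and the named columns are taken from the listings later in this article; treating Job_Number as the primary key is my own assumption for illustration.
-- Sketch only: the real table has more than 500 columns,
-- all varchar(50) apart from the identified primary key.
CREATE TABLE tblArthurClean (
    Job_Number varchar(50) NOT NULL PRIMARY KEY,  -- assumed primary key
    Record_Length_Indicator_or_Filler varchar(50) NULL,
    Personal_or_Firm_Name varchar(50) NULL,
    Mailing_Address_Supplied_by_Customer varchar(50) NULL
    -- ...several hundred more varchar(50) columns...
)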
Problem solving
The first issue I noticed was that many dates were stored in YYYYMMDD format; I changed those columns to int. Then I found that many integer values were also stored as varchar, and I changed those columns to int as well.
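As a minimal sketch of one such conversion (the column name here is hypothetical, not one of the table's real columns), each change is a single ALTER COLUMN, and it succeeds only if every existing value is NULL or a valid integer:
-- Hypothetical column holding YYYYMMDD strings; the conversion
-- fails if any row contains a value that cannot be cast to int.
ALTER TABLE tblArthurClean
ALTER COLUMN Date_of_Last_Order int NULL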
Next, many columns contain "Y", "N", or NULL. The obvious move is to turn them into bit columns, but there is a significant drawback to doing so: you cannot index a bit column. So I changed them to char(1) NULL instead.
In addition, there is a pair of columns holding ZIP codes: the first contains five digits, and the second contains four digits or NULL (the ZIP+4 extension). I changed them to char(5) and char(4), both nullable.
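The remaining retypings follow the same pattern; again, the column names below are stand-ins for illustration, not the table's real names:
-- "Y", "N", or NULL flag
ALTER TABLE tblArthurClean
ALTER COLUMN Is_Active char(1) NULL

-- five-digit ZIP code
ALTER TABLE tblArthurClean
ALTER COLUMN Zip_Code char(5) NULL

-- four-digit ZIP extension or NULL
ALTER TABLE tblArthurClean
ALTER COLUMN Zip_Plus_Four char(4) NULL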
Another problem I ran into was that I could not make sense of the contents of a sizable portion of the columns, so I set those aside and left them as varchar(50) for the time being. I wanted to determine the maximum length of the strings actually stored in each varchar column, but examining all the data in 62 million rows by eye is impractical. So I decided to start by generating a list of all the char and varchar columns.
SELECT
COLUMN_NAME, DATA_TYPE,
CHARACTER_MAXIMUM_LENGTH, IS_NULLABLE
FROM
INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'tblArthurClean'
AND DATA_TYPE IN ('char', 'varchar')
I copied the first column of the result set and pasted it into Notepad (my favorite text editor, because of its find-and-replace features). To determine the longest value stored in a particular column, you can use the following script:
SELECT Max(Len( ColumnName ))
FROM TableName
Extending that idea, I wanted to walk the column list and generate all of the expressions I needed. With a little attention to formatting, I could do that with another query:
SELECT
',Max(Len( ' + COLUMN_NAME + ' ))
AS ' + COLUMN_NAME
FROM
INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'tblArthurClean'
AND DATA_TYPE IN ('char', 'varchar')
Listing A shows an abbreviated version of the result set:
,Max(Len( Record_Length_Indicator_or_Filler ))
  AS Record_Length_Indicator_or_Filler
,Max(Len( Job_Number )) AS Job_Number
,Max(Len( Personal_or_Firm_Name ))
  AS Personal_or_Firm_Name
,Max(Len( Mailing_Address_Supplied_by_Customer
From there, the remaining steps are simple:
Paste the result set into a text editor.
Insert a SELECT before the first line.
Add FROM and the table name at the end of the file.
So that I can see how long the query takes, I also add the following at the beginning and end of the file:
SELECT GetDate()
GO
Then I delete the comma at the start of the first generated line, save the file as a query, load it, and let it run.
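Here is roughly what the finished script looks like after those edits, abbreviated to the columns from Listing A; the SELECT GetDate() batches at the top and bottom simply bracket the run so the elapsed time is easy to read off:
SELECT GetDate()
GO
SELECT
 Max(Len( Record_Length_Indicator_or_Filler ))
   AS Record_Length_Indicator_or_Filler
,Max(Len( Job_Number )) AS Job_Number
,Max(Len( Personal_or_Firm_Name ))
   AS Personal_or_Firm_Name
-- ...one Max(Len()) expression per remaining char or varchar column...
FROM tblArthurClean
GO
SELECT GetDate()
GO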
I was quite pleased with its performance on my server: it took only a little more than an hour to produce the results, which is not bad for analyzing a 62 million-row, several-hundred-column table.
The result is a single-row result set giving the maximum data length in each char and varchar column. I can use it as a guide when revising the table structure and column definitions, relating the results back to the original table and adjusting accordingly.
The point of this article is that stepping back to think for a moment can save you a great deal of typing. My motto is: if SQL can do it, let SQL do it.