SAS Optimization Tips (3): Sorting

Source: Internet
Author: User

1: Prevent unnecessary sorting

Here are four ways to avoid an unnecessary sort.

1.1: BY-group processing with an index to avoid a sort

The BY statement does not use the index in the following cases:

The BY statement includes the DESCENDING or NOTSORTED option, or SAS detects that the data file is physically stored in sorted order on the BY variables.

Pros and cons of using an index for BY-group processing

Advantage: the data does not have to be sorted first, which is the point of this technique.

Disadvantages:

1: It is generally less efficient than sequentially reading a sorted data set, because processing BY groups typically means retrieving the entire file.
2: It requires storage space for the index.
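
As a minimal sketch of this technique, the step below builds a small unsorted data set, creates a simple index on the BY variable, and then uses a BY statement without running PROC SORT. The data set work.sales and its variables region and amount are hypothetical names chosen for illustration.

```sas
/* Hypothetical sample data, deliberately not sorted by region */
data work.sales;
   input region $ amount;
   datalines;
West 10
East 20
West 5
East 7
;
run;

/* Create a simple index on the BY variable */
proc datasets library=work nolist;
   modify sales;
   index create region;
quit;

/* No PROC SORT: SAS can use the index to return observations in region order */
data work.region_totals (keep=region total);
   set work.sales;
   by region;
   if first.region then total = 0;
   total + amount;              /* sum statement: total is retained across observations */
   if last.region then output;  /* one row per region */
run;
```

Without the index, the BY statement in the last step would stop with an error because the observations are not in region order.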


1.2: BY-group processing with the NOTSORTED or GROUPFORMAT option

BY variable(s) option(s);

The NOTSORTED option specifies that observations that have the same BY value are grouped together but are not necessarily sorted in alphabetical or numeric order.

The NOTSORTED option works best when observations that have the same BY value are stored together.

Precautions:

The NOTSORTED option turns off sequence checking. If your data is not grouped, using the NOTSORTED option can produce a large amount of output, because every change in the BY value starts a new BY group.

The NOTSORTED option cannot be used with the MERGE or UPDATE statements.

The GROUPFORMAT option

The GROUPFORMAT option uses the formatted values of a variable, instead of the internal values, to determine where a BY group begins and ends. In other words, observations are grouped by what the format displays rather than by the raw values stored in the data set.

BY Order_date GROUPFORMAT NOTSORTED;
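
A minimal sketch of the two options used together is shown below. It assumes a hypothetical data set work.orders in which the dates are grouped by month but the months are not in ascending order; the variable amount and the choice of the MONYY7. format are also assumptions made for illustration.

```sas
/* Hypothetical sample data: grouped by month (Jan, Jan, Mar, Feb) but not sorted */
data work.orders;
   input Order_date :date9. amount;
   format Order_date date9.;
   datalines;
05JAN2024 10
20JAN2024 20
03MAR2024 5
11FEB2024 7
;
run;

/* Total per month without sorting: BY groups are formed on the formatted value */
data work.monthly (keep=Order_date total);
   set work.orders;
   format Order_date monyy7.;          /* format that defines the groups */
   by Order_date groupformat notsorted;
   if first.Order_date then total = 0;
   total + amount;
   if last.Order_date then output;     /* one row per month, in input order */
run;
```

The two January dates differ internally, yet they fall into a single BY group because both display as JAN2024.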

1.3: The CLASS statement

A CLASS statement does not require the data to be sorted in advance, whereas a BY statement does. When a procedure supports CLASS, using it instead of BY avoids the sort.
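
For illustration, the sketch below computes grouped statistics with a CLASS statement on data that is not sorted by the grouping variable. It uses the SAS-supplied sample data set sashelp.class as a stand-in; the choice of procedure and variables is an assumption, not something the original notes specify.

```sas
/* sashelp.class is not sorted by sex, and no PROC SORT is needed */
proc means data=sashelp.class mean;
   class sex;     /* grouping without requiring sorted input */
   var height;
run;
```

Replacing the CLASS statement with BY sex would require the data to be sorted (or indexed) by sex first.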

1.4: The SORTEDBY= data set option

If you are working with input data that is already sorted, you can specify how the data is ordered by using the SORTEDBY= data set option.

Although the SORTEDBY= option does not sort a data set, it sets the value of the Sorted flag. It does not set the value of the Validated sort flag. (PROC SORT sets the Validated sort flag.)

data company.transactions (sortedby=invoice);

Here invoice is the column the data is already ordered by; the option tells SAS that the data set is in invoice order.
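
A small sketch of the option in use follows. The data set work.transactions and its values are hypothetical, and the input is entered already in invoice order, which is the assumption SORTEDBY= asserts.

```sas
/* Hypothetical data that arrives already ordered by invoice */
data work.transactions (sortedby=invoice);
   input invoice amount;
   datalines;
1001 50
1002 75
1003 20
;
run;

/* The sort information in the PROC CONTENTS output shows the sort flags */
proc contents data=work.transactions;
run;
```

Because SAS takes the assertion on trust, setting SORTEDBY= on data that is not actually in that order can lead to wrong results or errors in later BY processing.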

Space requirements for sorting

When data is sorted, SAS requires enough space in the data library for two copies of the data file that is being sorted (the original data set and the sorted copy), as well as additional workspace that the sort itself uses on disk.

2: Multi-threaded sorting

PROC SORT DATA=SAS-data-set THREADS | NOTHREADS;

How a multi-threaded sort works

When a threaded sort is used, the observations in the input data set are divided into equal temporary subsets, based on the number of processors allocated to the SORT procedure. Each subset is then sorted on a different processor. The sorted subsets are then interleaved to re-create the sorted version of the input data set.

Setting CPUCOUNT= to a value greater than the actual number of CPUs can reduce performance.

OPTIONS CPUCOUNT=n | ACTUAL;
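
The options above can be combined as in the following sketch, which uses the SAS-supplied sashelp.class data set purely as a placeholder; a data set this small gains nothing from threading, so this only shows the syntax.

```sas
/* Let SAS use the number of CPUs actually available, and allow threading */
options cpucount=actual threads;

/* Request a threaded sort explicitly on the procedure */
proc sort data=sashelp.class out=work.class_sorted threads;
   by age;
run;
```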

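3: Sorting large data sets

For large data sets, if there is not enough space to sort the whole file at once, the data can be split into chunks that are sorted separately and then recombined.

When recombining, note that if you split the data by observation number (with the FIRSTOBS= and OBS= options), you cannot simply concatenate the sorted pieces with PROC APPEND, because the pieces overlap in their BY values. They have to be interleaved with a SET statement and a BY statement.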

The five ways of splitting a data set are covered in the earlier Advanced notes; see them for details.
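
As a rough sketch of sorting in chunks, the steps below split a data set by observation number, sort each piece separately, and interleave the sorted pieces. The SAS-supplied sashelp.class data set (19 observations) stands in for a large file, and the split point of 10 is arbitrary.

```sas
/* Sort the first chunk: observations 1 through 10 */
proc sort data=sashelp.class (firstobs=1 obs=10) out=work.part1;
   by age;
run;

/* Sort the second chunk: observation 11 through the end */
proc sort data=sashelp.class (firstobs=11) out=work.part2;
   by age;
run;

/* Interleave the sorted chunks; SET with BY keeps the result in age order */
data work.class_sorted;
   set work.part1 work.part2;
   by age;
run;
```

PROC APPEND would simply stack part2 under part1, so the combined data set would not be in age order.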

Sorting with the TAGSORT option (multithreading is not supported)

PROC SORT DATA=SAS-data-set-name TAGSORT;

Principle: The TAGSORT option stores only the BY variables and the observation numbers in temporary files. The BY variables and the observation numbers are called tags. At the completion of the sorting process, PROC SORT uses the tags to retrieve records from the input data set in sorted order.

Compared with the first approach, TAGSORT takes more time but less space.

Compared with a regular sort, TAGSORT takes much more time and I/O if the data set is badly out of order.

If the data is already mostly in order, the time and I/O are only slightly higher.
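
A minimal usage sketch follows, again with sashelp.class as a placeholder. TAGSORT is intended for data sets whose observations are long relative to the BY variables, so this only illustrates where the option goes.

```sas
/* Only the BY variable (age) and observation numbers go to temporary files */
proc sort data=sashelp.class out=work.class_tagsorted tagsort;
   by age;
run;
```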


4: Efficient removal of duplicate values

4.1: Using the NODUPKEY option

With NODUPKEY, PROC SORT compares all BY-variable values for each observation to those for the previous observation that was written to the output data set. If they match, the observation is not written.

PROC SORT DATA=SAS-data-set-name NODUPKEY;

4.2: Using the NODUPRECS (NODUP) option

The NODUPRECS option compares all of the variable values for each observation to those for the previous observation that was written to the output data set.

PROC SORT DATA=SAS-data-set-name NODUPRECS;

Because NODUPRECS checks only consecutive observations, some nonconsecutive duplicate observations might remain in the output data set. You can remove all duplicates with this option by sorting on all variables, so that identical observations end up next to each other.
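
The difference between the two options can be sketched on a small hypothetical data set, work.old, with variables id and value. The BY _ALL_ form is one way to sort on all variables.

```sas
/* Hypothetical data: id 1 has two different records, id 2 has two identical records */
data work.old;
   input id value;
   datalines;
1 2
2 5
1 3
2 5
;
run;

/* NODUPKEY: keeps one observation per id, so work.by_key has 2 observations */
proc sort data=work.old out=work.by_key nodupkey;
   by id;
run;

/* NODUPRECS with all variables in the BY list: identical records become adjacent,
   so only the exact duplicate is dropped and work.by_rec has 3 observations */
proc sort data=work.old out=work.by_rec noduprecs;
   by _all_;
run;
```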

4.3: Using the EQUALS | NOEQUALS option

EQUALS maintains the relative order of observations within a BY group from the input data set in the output data set. NOEQUALS does not necessarily preserve this order in the output data set. NOEQUALS can save CPU time and memory resources.

To see the difference, take two observations that have the same BY value:

1 2

1 3

and run a step such as proc sort data=old out=new nodupkey equals|noequals; by ...; run;

With EQUALS, the first observation in input order, 1 2, is kept.

With NOEQUALS, the order within the BY group is not guaranteed, so 1 3 may be kept instead.
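
Written out as a runnable sketch, with two observations 1 2 and 1 3 and the hypothetical variable names id and value, the EQUALS case looks like this:

```sas
/* Two observations that share the same BY value (id = 1) */
data work.old;
   input id value;
   datalines;
1 2
1 3
;
run;

/* EQUALS preserves input order within the BY group,
   so NODUPKEY keeps the first observation (id=1, value=2) */
proc sort data=work.old out=work.new nodupkey equals;
   by id;
run;
```

Changing EQUALS to NOEQUALS lets PROC SORT keep either observation, so code that depends on which duplicate survives should specify EQUALS explicitly.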

5: Host sort utilities

Host sort utilities are third-party sort packages that are available in some operating environments. In some cases, using a host sort utility with PROC SORT might be more efficient than using the SAS sort utility with PROC SORT.

5.1: Using the SORTPGM= system option

The SORTPGM= system option tells SAS whether to use the SAS sort, to use the host sort, or to determine which sort utility is best for the data set.

5.2: Using the SORTCUTP= system option

The SORTCUTP= system option specifies the number of bytes above which the host sort utility is used instead of the SAS sort utility.

OPTIONS SORTCUTP=n | nK | nM | nG | MIN | MAX | hexX;

5.3: Using the SORTCUT= system option

Beginning with SAS 9, the SORTCUT= system option can be used to specify the number of observations above which the host sort utility is used instead of the SAS sort utility.

OPTIONS SORTCUT=n | nK | nM | nG | MIN | MAX | hexX;

5.4: Using the SORTNAME= system option

The SORTNAME= option specifies the host sort utility that will be used if the value of SORTPGM= is BEST or HOST.

OPTIONS SORTNAME=host-sort-utility-name;
OPTIONS SORTPGM=BEST SORTCUTP=10000 SORTNAME=SYNCSORT;
