A detailed analysis of the R language rank function

Source: Internet
Author: User

1. What is the rank function

The documentation for rank [1] can be translated roughly as "returns the rank (?) of each element of the original array (?) after sorting (?)". On the surface this does produce an ordering, but what "array", "sorting", and "rank" mean here is not clear.

2. Usage scenarios of the rank function

For example, in a 100-meter race the times of three runners A, B, and C were 6.8 s, 8.1 s, and 7.2 s. Ranking them with the rank function gives:

> rank(t <- c(6.8, 8.1, 7.2))
[1] 1 3 2

As another example, three people scored 74, 92, and 85 on a test. Applying the same method gives the places in reverse, because here the highest score should come first. Of course, we can run

> rank(-(s <- c(74, 92, 85)))
[1] 3 1 2

This achieves the goal, but it does not change the sorting mechanism of the rank function: smaller values always receive smaller ranks.
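The two cases above can be put side by side in a short sketch. The element names A, B, C are added here only for readability, and the last line shows an alternative to negating the data that is valid when there are no ties.

```r
# Data from the two examples above; the names are illustrative.
times  <- c(A = 6.8, B = 8.1, C = 7.2)   # smaller is better
scores <- c(A = 74,  B = 92,  C = 85)    # larger is better

rank(times)     # ascending: A = 1, B = 3, C = 2
rank(-scores)   # negate to rank from the top: A = 3, B = 1, C = 2

# Without ties, flipping the ascending ranks gives the same result.
flipped <- length(scores) + 1 - rank(scores)
flipped
```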

3. Sort types of the rank function

rank(x, na.last = TRUE,
     ties.method = c("average", "first", "random", "max", "min"))

> t <- c(4, NaN, 4, 7, 8, 2, NaN, 9, 9, 7, NaN, 5, 2, 2, 1)
> # label the corresponding elements at the same time
> names(t) <- letters[1:length(t)]

Calling rank on this vector with each ties.method in turn, we obtain:

Result
               a     b     c     d     e     f     g     h     i     j     k     l     m     n     o
Original       4   NaN     4     7     8     2   NaN     9     9     7   NaN     5     2     2     1
Average      5.5  13.0   5.5   8.5  10.0   3.0  14.0  11.5  11.5   8.5  15.0   7.0   3.0   3.0   1.0
First          5    13     6     8    10     2    14    11    12     9    15     7     3     4     1
Random (1)     6    13     5     9    10     2    14    11    12     8    15     7     3     4     1
Random (2)     5    13     6     8    10     2    14    11    12     9    15     7     4     3     1
Max            6    13     6     9    10     4    14    12    12     9    15     7     4     4     1
Min            5    13     5     8    10     2    14    11    11     8    15     7     2     2     1

We find that the ranks of the elements labeled b, g, and k (the missing values) never change, which suggests that ties.method only affects how the non-missing values are ordered.

It helps to look at the implementation of rank:
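The table can be regenerated with a few lines of R. This is a sketch, not the author's original script: the vector is called x here (an arbitrary name) so that it does not mask the transpose function t(), and set.seed(1) is an arbitrary choice that only fixes the "random" row.

```r
# Same data as above, stored as x instead of t.
x <- c(4, NaN, 4, 7, 8, 2, NaN, 9, 9, 7, NaN, 5, 2, 2, 1)
names(x) <- letters[seq_along(x)]

set.seed(1)  # only affects the "random" row
methods <- c("average", "first", "random", "max", "min")

# One row per ties.method, one column per element.
tab <- t(sapply(methods, function(m) rank(x, ties.method = m)))
print(rbind(original = x, tab))

# The missing values b, g, k get ranks 13, 14, 15 under every method.
tab[, is.na(x)]
```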

function (x, na.last = TRUE,
          ties.method = c("average", "first", "random", "max", "min"))
{
    nas <- is.na(x)   # logical vector as long as x, marking which positions are missing
    nm <- names(x)    # labels of the elements
    # The use of names() suggests the function was designed to rank a
    # one-dimensional array, i.e. a vector. If x is a matrix a result is
    # still returned, but nm no longer does its job and the result is meaningless.
    ties.method <- match.arg(ties.method)
    if (is.factor(x))
        x <- as.integer(x)   # if x is a factor, replace each element by the integer
                             # code of its level (its position in the level order); see [Note 1]
    x <- x[!nas]             # drop the missing values from x

    # "average", "min" and "max" use .Internal(rank(x, length(x), ties.method)); see [Note 2].
    # "first" uses sort.list(sort.list(x)); see [Note 3].
    # "random" uses sort.list(order(x, stats::runif(sum(!nas)))); see [Note 4].
    y <- switch(ties.method,
                average = , min = , max = .Internal(rank(x, length(x), ties.method)),
                first = sort.list(sort.list(x)),
                random = sort.list(order(x, stats::runif(sum(!nas)))))

    # The following fills in the ranks of the missing values.
    # na.last = "keep": missing values are not ranked and stay NA;
    # na.last = TRUE: missing values are ranked last;
    # na.last = FALSE: missing values are ranked first.
    if (!is.na(na.last) && any(nas)) {
        yy <- NA
        NAkeep <- (na.last == "keep")
        if (NAkeep || na.last) {
            yy[!nas] <- y
            if (!NAkeep)
                yy[nas] <- (length(y) + 1L):length(yy)
        }
        else {
            len <- sum(nas)
            yy[!nas] <- y + len
            yy[nas] <- seq_len(len)
        }
        y <- yy
        names(y) <- nm
    }
    else names(y) <- nm[!nas]
    y
}
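The handling of missing values in the code above can be re-expressed with user-level functions only. The helper rank_first_sketch below is hypothetical (it is not part of R) and covers only ties.method = "first"; it is a sketch for comparing against rank, not a replacement for it.

```r
# Hypothetical helper: mirrors the NA bookkeeping of rank() for "first".
rank_first_sketch <- function(x, na.last = TRUE) {
  nas <- is.na(x)
  y <- sort.list(sort.list(x[!nas]))        # ranks 1..k of the known values
  if (is.na(na.last)) {                     # na.last = NA: drop missing values
    names(y) <- names(x)[!nas]
    return(y)
  }
  out <- rep(NA_integer_, length(x))
  if (identical(na.last, "keep")) {
    out[!nas] <- y                          # missing values stay NA
  } else if (isTRUE(na.last)) {
    out[!nas] <- y
    out[nas] <- length(y) + seq_len(sum(nas))   # NAs take the last ranks
  } else {
    out[!nas] <- y + sum(nas)               # shift the known values up
    out[nas] <- seq_len(sum(nas))           # NAs take the first ranks
  }
  names(out) <- names(x)
  out
}

x <- c(a = 4, b = NaN, c = 4, d = 7, e = NA, f = 2)
rank_first_sketch(x, na.last = TRUE)
rank_first_sketch(x, na.last = FALSE)
rank_first_sketch(x, na.last = "keep")
```

Each call should agree with rank(x, na.last = ..., ties.method = "first") for the same argument.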

[Note 1] Converting a factor to integer

> f <- c('Ba', 'BA', 'b', 'A', 'A', 'b', 'Ba', 'Bac', NaN, NaN)
> fac <- factor(f)
> as.integer(fac)
 [1] 3 4 2 1 1 2 3 5 6 6

Thus: (1) the levels of a factor are ordered mechanically by string comparison, and each element is replaced by the position of its level in that order. (2) Any two missing values in the factor have the same status (size): inside the character vector NaN becomes the string "NaN", so both occurrences fall in the same level and receive the same code.

In real problems the order of the levels is set by people, so an ordered factor should be used to eliminate the interference of this mechanical conversion.

> quality <- c('Good', 'Soso', 'Good', 'Soso', 'Bad', 'Good', 'Bad')
> names(quality) <- c('Day1', 'Day2', 'Day3', 'Day4', 'Day5', 'Day6', 'Day7')
> q <- factor(quality, levels = c('Bad', 'Soso', 'Good'), labels = c('Bad', 'Soso', 'Good'), ordered = TRUE)
> rank(q)
Day1 Day2 Day3 Day4 Day5 Day6 Day7
 6.0  3.5  6.0  3.5  1.5  6.0  1.5
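To see what the explicit level order buys, the sketch below ranks the same data twice: once as a plain factor, whose levels are sorted as strings, and once with the levels given by hand. The variable names plain and by_hand are illustrative.

```r
quality <- c(Day1 = "Good", Day2 = "Soso", Day3 = "Good", Day4 = "Soso",
             Day5 = "Bad",  Day6 = "Good", Day7 = "Bad")

# Default factor: levels are sorted as strings (Bad, Good, Soso).
plain <- factor(quality)
levels(plain)
rank(plain)     # "Good" lands below "Soso", an accident of spelling

# Levels given explicitly in the intended order.
by_hand <- factor(quality, levels = c("Bad", "Soso", "Good"), ordered = TRUE)
rank(by_hand)   # "Good" now receives the largest ranks
```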

[Note 2] The "average", "max", and "min" methods

> t
  a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
  4 NaN   4   7   8   2 NaN   9   9   7 NaN   5   2   2   1
> rank(t, na.last = "keep", ties.method = "first")
 a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
 5 NA  6  8 10  2 NA 11 12  9 NA  7  3  4  1
> rank(t, na.last = "keep", ties.method = "average")
   a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
 5.5   NA  5.5  8.5 10.0  3.0   NA 11.5 11.5  8.5   NA  7.0  3.0  3.0  1.0

The "average" method can be understood as follows: first order the data so that every element has a unique, distinct position.

For example, f, m, and n have the same value but can be placed in positions 2, 3, and 4. Since f, m, and n belong to the same group, the average position of the group, 3, is taken as the rank of each of them, so that equal values receive equal ranks.

It is then not difficult to see that the "max" method gives every element of a group the largest position in the group (f, m, and n all get 4), which widens the gap to the next smaller value.

The "min" method gives every element of a group the smallest position in the group (f, m, and n all get 2), which widens the gap to the next larger value. This is the widely used "tied places" ranking.
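The description above, unique positions first and a group summary second, can be imitated with ave(). This is only a sketch of the idea, assuming the non-missing values of the example; rank itself computes these three methods in internal code.

```r
# Non-missing values of the running example.
x <- c(4, 4, 7, 8, 2, 9, 9, 7, 5, 2, 2, 1)

# Step 1: give every element a unique position (ties in order of appearance).
first <- as.numeric(sort.list(sort.list(x)))

# Step 2: summarise the positions within each group of equal values.
avg <- ave(first, x, FUN = mean)
mx  <- ave(first, x, FUN = max)
mn  <- ave(first, x, FUN = min)

rbind(x, first, avg, mx, mn)
```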

[Note 3] first = sort.list(sort.list(x))

The elements are ranked by value; elements of equal value are ranked in order of appearance, from first to last.

> x
 [1] 4 4 7 8 2 9 9 7 5 2 2 1
> sort.list(sort.list(x))
 [1]  5  6  8 10  2 11 12  9  7  3  4  1
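Why applying sort.list twice yields ranks can be seen by building the inverse permutation by hand. The names p and r below are illustrative.

```r
x <- c(4, 4, 7, 8, 2, 9, 9, 7, 5, 2, 2, 1)

# p[k] is the index of the k-th smallest element (ties keep their order).
p <- sort.list(x)

# Invert the permutation: the element at index p[k] receives rank k.
r <- integer(length(x))
r[p] <- seq_along(p)

r
identical(r, sort.list(p))   # the second sort.list performs the same inversion
```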

[Note 4] random = sort.list(order(x, stats::runif(sum(!nas))))

weight = stats::runif(sum(!nas)) generates a random number between 0 and 1 for each non-missing element; together these form the "weight" sequence.

sort.list(order(x, weight)) then decides the order of elements with the same value according to their random "weight".

We can try designing the weights by hand:

> x
 [1] 4 4 7 8 2 9 9 7 5 2 2 1
> weight <- c(0.45, 0.55, 0.1, 0.1, 0.1, 0.55, 0.45, 0.1, 0.1, 0.3, 0.1, 0.1)
> sort.list(order(x, weight))
 [1]  5  6  8 10  2 12 11  9  7  4  3  1

It is not difficult to see that a and c both have value 4, but w(a) = 0.45 < w(c) = 0.55; the smaller weight goes first, so a is ranked before c. For h and i the situation is reversed: w(h) = 0.55 > w(i) = 0.45, so i is ranked before h.

d and j have the same value and the same weight, so they are ranked in order of appearance.

f, m, and n all have value 2, with w(f) = w(n) = 0.1 < w(m) = 0.3, and the resulting order is f < n < m. So "weight" takes precedence over "position". When the sequence contains many elements with the same value, this to some degree overcomes the "earlier element first" rule and makes the result more random.

The above only illustrates the mechanism of random ranking. In actual use one can only be sure that smaller values rank before larger ones; the order among equal values cannot be predicted.
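The hand-made weights and a real random draw can be compared in a short sketch. The seed 42 is an arbitrary choice for reproducibility; the checks at the end state what holds for any draw, namely that each group of equal values occupies the same set of positions as under "min" and "max".

```r
x <- c(4, 4, 7, 8, 2, 9, 9, 7, 5, 2, 2, 1)

# Hand-made weights from the example above.
weight <- c(0.45, 0.55, 0.1, 0.1, 0.1, 0.55, 0.45, 0.1, 0.1, 0.3, 0.1, 0.1)
by_weight <- sort.list(order(x, weight))
by_weight

# A real random draw; the seed is arbitrary.
set.seed(42)
r <- rank(x, ties.method = "random")
r

# Within each tie group the smallest and largest positions are fixed;
# only the assignment inside the group is random.
lowest  <- ave(r, x, FUN = min)
highest <- ave(r, x, FUN = max)
all(lowest == rank(x, ties.method = "min"))
all(highest == rank(x, ties.method = "max"))
```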

4. Summary of the rank function

rank(x, na.last = TRUE,
     ties.method = c("average", "first", "random", "max", "min"))

(1) The rank function ranks a one-dimensional array, i.e. a vector x. If x is numeric, smaller values rank before larger ones; if x is a factor, refer to the ordered-factor design in [Note 1].

P.S. In practice a great deal of data is described by two-dimensional tables, for example a table whose rows stand for places and whose columns stand for times. To rank such data, first turn it into a one-dimensional vector, joining the row and column labels into a single string for each element; otherwise the result loses its meaning.

(2) rank divides the data into two kinds: determined values and missing values. The missing values can be ranked before the determined values (na.last = FALSE) or after them (na.last = TRUE), or be kept as NA without taking part in the ranking (na.last = "keep").
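One way to do the flattening described above is sketched below. The matrix, its place names, and its year labels are made up for illustration, and joining labels with "." is an arbitrary choice.

```r
# Hypothetical place-by-time table.
m <- matrix(c(3, 9, 5, 1, 7, 2), nrow = 2,
            dimnames = list(c("north", "south"),
                            c("y2019", "y2020", "y2021")))

# Flatten column by column and build one label per cell.
v <- as.vector(m)
names(v) <- as.vector(outer(rownames(m), colnames(m), paste, sep = "."))

v
rank(v)   # every rank can now be traced back to its place and time
```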

(3) "first" is the most basic method: smaller values rank before larger ones, and among equal elements the earlier one ranks before the later one.

"max" gives equal elements the largest position in their group.

"min" gives equal elements the smallest position in their group; this is the usual "tied places" ranking.

"average" gives equal elements the average position of their group, which may be fractional.

"random" puts equal elements in random order, which avoids "first come, first served": a random "weight" takes precedence over "position", increasing the degree of randomness.

[1] Returns the sample ranks of the values in a vector. Ties (i.e., equal values) and missing values can be handled in several ways.
