Removing duplicate values in R

Source: Internet
Author: User
I recently had to remove duplicate values from a batch of data such as:

    > data.set
       Ensembl.Gene.ID Gene.Biotype Chromosome.Name Gene.Start..bp. Gene.End..bp.
    1  ENSG00000236666    antisense              22        16274560      16278602
    2  ENSG00000236666    antisense              22        16274560      16278602
    3  ENSG00000234381   pseudogene              22        16333633      16342783
    4  ENSG00000234381   pseudogene              22        16333633      16342783
    5  ENSG00000234381   pseudogene              22        16333633      16342783
    6  ENSG00000234381   pseudogene              22        16333633      16342783
    7  ENSG00000234381   pseudogene              22        16333633      16342783
    8  ENSG00000234381   pseudogene              22        16333633      16342783
    9  ENSG00000234381   pseudogene              22        16333633      16342783
    10 ENSG00000224435   pseudogene              22        16345912      16355362
In this data the Ensembl.Gene.ID column holds only three distinct values; every other row is a duplicate. The goal is to rebuild the data so that each Ensembl.Gene.ID appears once, like this:

    > data.set2
       Ensembl.Gene.ID Gene.Biotype Chromosome.Name Gene.Start..bp. Gene.End..bp.
    1  ENSG00000236666    antisense              22        16274560      16278602
    3  ENSG00000234381   pseudogene              22        16333633      16342783
    10 ENSG00000224435   pseudogene              22        16345912      16355362
My first thought was to do this in Excel, but Excel has its limits and cannot handle large volumes of data, so I still wanted to process this batch in R. Searching online and in several of the main R books did not turn up an effective approach, for example: http://cos.name/cn/topic/7621. I did, however, arrive at a better solution: the duplicated function. duplicated works on a vector or a data frame and returns a logical vector of TRUE and FALSE values that marks, for each index, whether the value there repeats an earlier one.
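To make the behaviour of duplicated concrete, here is a minimal sketch on a small vector. The vector `ids` and the names `dup` and `first_seen` are made up for illustration and are not part of the original data.

```r
# Hypothetical toy vector, not taken from the article's data.
ids <- c("a", "a", "b", "c", "b")

# duplicated() flags each element that has already appeared earlier in the vector.
dup <- duplicated(ids)
print(dup)

# Negating the flags keeps only the first occurrence of each value.
first_seen <- ids[!dup]
print(first_seen)
```

The first occurrence of a value is always marked FALSE; only the later repeats are marked TRUE, which is why negating the result keeps exactly one copy of each value.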
Using the data.set shown at the start as the example, the solution goes as follows:

1. Build an index that marks which rows are duplicates:

    > index <- duplicated(data.set$Ensembl.Gene.ID)
    > index
     [1] FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
2. Generate the new data. At this point many R users could finish the job on their own, but note that the rows we want are the ones where index is FALSE, so we negate the index with !:

    > data.set2 <- data.set[!index, ]
    > data.set2
       Ensembl.Gene.ID Gene.Biotype Chromosome.Name Gene.Start..bp. Gene.End..bp.
    1  ENSG00000236666    antisense              22        16274560      16278602
    3  ENSG00000234381   pseudogene              22        16333633      16342783
    10 ENSG00000224435   pseudogene              22        16345912      16355362
And with that, the job is done.
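The two steps can be reproduced end to end with the sketch below. It rebuilds the ten example rows by hand. The column names used here (Gene.Biotype, Chromosome.Name, Gene.Start..bp., Gene.End..bp.) are an assumed reconstruction of the garbled headers in the listing, so adjust them to match your own data.

```r
# Rebuild the ten example rows; column names are an assumed reconstruction.
data.set <- data.frame(
  Ensembl.Gene.ID = c(rep("ENSG00000236666", 2),
                      rep("ENSG00000234381", 7),
                      "ENSG00000224435"),
  Gene.Biotype    = c(rep("antisense", 2), rep("pseudogene", 8)),
  Chromosome.Name = 22,  # single value, recycled to all ten rows
  Gene.Start..bp. = c(rep(16274560, 2), rep(16333633, 7), 16345912),
  Gene.End..bp.   = c(rep(16278602, 2), rep(16342783, 7), 16355362),
  stringsAsFactors = FALSE
)

# Step 1: TRUE marks a row whose gene ID already appeared in an earlier row.
index <- duplicated(data.set$Ensembl.Gene.ID)

# Step 2: keep only the rows that are not flagged as duplicates.
data.set2 <- data.set[!index, ]
print(data.set2)
```

The surviving rows keep their original row names (1, 3 and 10), which matches the output shown earlier. The two steps can also be written on one line as data.set[!duplicated(data.set$Ensembl.Gene.ID), ].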
