Using Python (Orange) to Predict Ethnicity from DNA Sequences

Source: Internet
Author: User

The Coursera course Web Intelligence and Big Data has finally released hw7, which asks for ethnicity predictions from a set of DNA sequences. The details are as follows:

Data analytics assignment (for hw7): predict the ethnicity of individuals from their genes

==============================================================
It is now possible to get the DNA sequence of an individual at a reasonable cost. An individual's genetic make-up determines a number of characteristics: eye color, propensity for certain diseases, response to treatment, and so on. In this problem, you are given a subset of genetic information for several individuals. For some of the individuals you are also told their ethnicity.
Your task is to figure out the ethnicity of the other individuals.
The information provided is as follows:
1. For each individual, the presence (1) or absence (0) of a genetic variation at a particular position on chromosome 6 is provided. In some cases, information for an individual at a particular position is not available, and this is represented by ? (missing).
2. Information is provided for approximately 204,000 positions. These are your features.
3. The training set has data for 139 individuals along with their ethnicity.
4. The test (prediction) set has data for 11 individuals. You have to predict the ethnicity for these individuals and enter your answers via hw7.

Data Sets
-----------

The training set is available here: genestrain.tab.zip (6.2 MB)

The test set is available here: genesblind.tab.zip (1.2 MB)

File Format
-----------


(Note: the data sets are .tab files in the tab-separated format that can be read into Orange.)

Both the training and test data files have a header line, which is a tab-separated line of column/feature names: for example, '6_10000005' indicates that the column describes the presence or absence of variations at position 10000005 on chromosome 6.

Entries in the second header line indicate the type of column (in this case all features are 'discrete').

Entries in the third header line indicate the nature of each column: '' (empty) for most columns, which contain a feature, and 'class' for the first column, as it contains the actual class labels (i.e., the ethnicities of the individuals in each row).

These header lines are followed by lines containing feature values (0, 1, or ?) for each genetic feature of an individual.
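The three-header-line layout described above can be read with nothing but the standard library. The following is only a rough sketch: the sample text, the column name `ethnicity`, and the helper `read_tab` are made up for illustration, and it assumes that feature columns leave the third header entry empty and that a blank or `?` label means "unknown".

```python
import csv
import io

# Tiny made-up sample in the three-header-line layout described above.
# Column names and values are illustrative, not taken from the real files.
SAMPLE = (
    "ethnicity\t6_10000005\t6_10000012\t6_10000020\n"
    "discrete\tdiscrete\tdiscrete\tdiscrete\n"
    "class\t\t\t\n"
    "CEU\t0\t1\t?\n"
    "YRI\t1\t?\t0\n"
)


def read_tab(stream):
    """Parse a three-header tab file into (names, types, flags, rows)."""
    reader = csv.reader(stream, delimiter="\t")
    names = next(reader)   # header line 1: column/feature names
    types = next(reader)   # header line 2: column types
    flags = next(reader)   # header line 3: 'class' marks the label column
    class_idx = flags.index("class")
    rows = []
    for record in reader:
        if not record:
            continue  # skip empty lines
        raw_label = record[class_idx]
        label = None if raw_label in ("", "?") else raw_label
        # '?' marks a missing value; everything else is 0 or 1.
        features = [
            None if value == "?" else int(value)
            for i, value in enumerate(record)
            if i != class_idx
        ]
        rows.append((label, features))
    return names, types, flags, rows


names, types, flags, rows = read_tab(io.StringIO(SAMPLE))
print(names[1:])
for label, features in rows:
    print(label, features)
```

For the real files, `io.StringIO(SAMPLE)` would be swapped for an opened `genestrain.tab` or `genesblind.tab`; with roughly 204,000 columns per row, expect this plain-Python approach to be slow.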


In the training set file the first column, which denotes the class label, is a three-letter code with one of the following values:

o CEU is Northern and Western European
o GIH is Gujarati Indian from Houston
o JPT is Japanese in Tokyo
o ASW is Americans of African ancestry
o YRI is Yoruba in Ibadan, Nigeria

In the test file the ethnicity column also exists but is blank.


======================================


For the purposes of your HW answer alone, each three-letter code is to be marked with a numeric value as indicated in the table below:

o CEU (Northern and Western European): 0
o GIH (Gujarati Indian from Houston): 1
o JPT (Japanese in Tokyo): 2
o ASW (Americans of African ancestry): 3
o YRI (Yoruba in Ibadan, Nigeria): 4

You must use the above numeric values to encode your answer. Note: these numeric values do not appear in the test or training data.

Task: for each of the individuals in the test file, predict their ethnicity as CEU, GIH, JPT, ASW or YRI and enter your answers in hw7 in exactly the order that the 11 individuals appear in the test file.
So, for example, if your prediction is CEU, GIH, JPT, ASW, YRI, CEU, GIH, JPT, ASW, YRI, CEU, you should enter your answer as 0 1 2 3 4 0 1 2 3 4 0 (i.e. numbers separated by a space: no commas, tabs or anything else, just a space between single-digit numbers).
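The encoding step is easy to get wrong by hand, so a small helper may be useful. This is a minimal sketch using the numeric values listed in the assignment; the names `CODE_FOR` and `encode_answer` are hypothetical and not part of the assignment or of Orange.

```python
# Numeric codes as given in the assignment text.
CODE_FOR = {"CEU": 0, "GIH": 1, "JPT": 2, "ASW": 3, "YRI": 4}


def encode_answer(predictions):
    """Turn a list of three-letter codes into the space-separated answer string."""
    # upper() makes the lookup tolerant of 'ceu' vs 'CEU'.
    return " ".join(str(CODE_FOR[p.upper()]) for p in predictions)


example = ["CEU", "GIH", "JPT", "ASW", "YRI",
           "CEU", "GIH", "JPT", "ASW", "YRI", "CEU"]
print(encode_answer(example))  # 0 1 2 3 4 0 1 2 3 4 0
```

An unknown code raises a `KeyError`, which is arguably preferable to silently submitting a malformed answer.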

However, many people on the discussion forum said they did not understand the problem, mainly because the instructor did not explain how to approach it and the video lectures gave no guidance. Fortunately, after downloading the data, it turns out that there is a training set and a prediction set, so presumably the approach is simply to train first and then predict.

The training set is a tab file laid out as follows:

The class column holds the label (139 rows, one per training record), and the remaining columns hold the DNA positions (about 200,000 of them, too many to list).

The prediction set is laid out as follows:

The first column, filled with question marks, is the value to be predicted. There are 11 rows in total.

After understanding the data, the next step is to work out how to train and predict. Someone on the discussion forum suggested the Orange library, which is Python-friendly and very convenient to use. An introduction is at http://orange.biolab.si/doc/ofb/c_basics.htm, and a more detailed tutorial is at http://orange.biolab.si/docs/latest/tutorial/rst/classification/

A naive Bayes classifier can solve this problem. The code is short:

# Description: Read data, build a naive Bayesian classifier and classify the test instances
# Category:    modelling
# Uses:        genestrain.tab
# Predict:     genesblind.tab
# Referenced:  c_basics.htm

import orange

# Load the training and prediction sets (the .tab extension is implied).
data = orange.ExampleTable("genestrain")
data2 = orange.ExampleTable("genesblind")

# Train a naive Bayes classifier on the labeled data.
classifier = orange.BayesLearner(data)

# Predict the class of each row in the prediction set.
for i, item in enumerate(data2):
    c = classifier(item)
    print("%d: %s" % (i, c))

The training data is used to build the classifier, and the classifier is then applied to each row of the prediction data. The output is easy to read. The only drawback is that running time and resource consumption grow quickly once the data set gets even slightly large: in this case the run needed about 1 GB of memory and took about 10 minutes.
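To make the train-then-predict idea concrete, here is a minimal pure-Python sketch of naive Bayes over 0/1 features. It is not Orange's implementation: the Laplace smoothing, the choice to skip missing values (`None`), and the toy rows are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows):
    """rows: list of (label, features); features hold 0, 1 or None (missing)."""
    class_counts = Counter(label for label, _ in rows)
    ones = defaultdict(Counter)  # ones[label][j]: times feature j was 1
    seen = defaultdict(Counter)  # seen[label][j]: times feature j was observed
    for label, features in rows:
        for j, value in enumerate(features):
            if value is None:
                continue  # missing values contribute nothing
            seen[label][j] += 1
            ones[label][j] += value
    return class_counts, ones, seen


def predict(model, features):
    """Return the label with the highest log-posterior for one feature vector."""
    class_counts, ones, seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for j, value in enumerate(features):
            if value is None:
                continue
            # Laplace-smoothed estimate of P(feature j = 1 | label)
            p_one = (ones[label][j] + 1) / (seen[label][j] + 2)
            score += math.log(p_one if value == 1 else 1 - p_one)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, invented for this example.
train = [
    ("CEU", [1, 1, 0, None]),
    ("CEU", [1, 0, 0, 0]),
    ("YRI", [0, 0, 1, 1]),
    ("YRI", [0, None, 1, 1]),
]
model = train_naive_bayes(train)
print(predict(model, [1, 1, 0, 0]))     # CEU
print(predict(model, [0, 0, None, 1]))  # YRI
```

Training touches every cell of the table once, and each prediction touches every feature once per class, which gives a feel for why around 204,000 columns make even this simple model noticeably expensive.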

The final output lists the predicted ethnicity for each of the 11 individuals, which can then be encoded and entered as the answer.

This is probably the last programming assignment of the course. Only an online final exam remains before the course ends.
