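The Coursera course Web Intelligence and Big Data has finally assigned HW7. This time the task is to make predictions for a set of DNA sequences. The full description is as follows: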
Data Analytics Assignment (for HW7): Predict the Ethnicity of Individuals from their Genes
============================================
It is now possible to get the DNA sequence of an individual at a reasonable cost. An individual's genetic make-up determines a number of characteristics - eye colour, propensity for certain diseases, response to treatment and so on. In this problem, you are given a subset of genetic information for several individuals. For some of the individuals you are also told their ethnicity.
Your task is to figure out the ethnicity of the other individuals.
The information provided is as follows:
1. For each individual the presence (1) or absence (0) of a genetic variation at a particular position on chromosome 6 is provided. In some cases, information for an individual at a particular position is not available and this is represented as ? (missing).
2. Information is provided for approximately 204000 positions. These are your features.
3. The training set has data for 139 individuals along with their ethnicity.
4. The test (prediction) set has data for 11 individuals. You have to predict the ethnicity for these individuals and enter your answers via HW7.
Data Sets
-----------
The training set is available here: genestrain.tab.zip (6.2 Mb)
The test set is available here: genesblind.tab.zip (1.2 Mb)
File Format
-----------
(Note: Data sets are .tab files in the tab-separated format that can be read into Orange):
Both the training and test data files have a header line which is a tab-separated line of column/feature names. For example, '6_10000005' indicates that the column describes the presence or absence of variations at position 10000005 on chromosome 6.
Entries in the second header line indicate the type of column (in this case all features are 'discrete').
Entries in the third header line indicate the nature of each column:
A ' ' for most columns that contain a feature, and 'class' for the first column as it contains the actual class labels (i.e., ethnicities of the individuals in each row).
These header lines are followed by lines containing feature values (0, 1, or ?) for each genetic feature of an individual.
In the training set file the first column, which denotes the class label, is a three-letter code with one of the following values:
o CEU is Northern and Western European
o GIH is Gujarati Indian from Houston
o JPT is Japanese in Tokyo
o ASW is Americans of African Ancestry
o YRI is Yoruba in Ibadan, Nigeria
In the test file the ethnicity column also exists but is blank.
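The file layout described above (three header lines, then one row per individual) can also be read without Orange. Below is a minimal sketch in plain Python, not Orange's own loader. The function name `read_tab`, the column name `ethnicity` and the tiny sample are made up for illustration, and a label that is blank or `?` is treated as unknown.

```python
import csv
import io

def read_tab(stream):
    """Parse an Orange-style .tab stream with three header lines.

    Returns (feature_names, labels, rows). Each row is a list of
    '0', '1' or '?' strings; an unknown label is returned as None.
    """
    reader = csv.reader(stream, delimiter="\t")
    names = next(reader)   # line 1: column names, e.g. '6_10000005'
    next(reader)           # line 2: column types ('discrete'), unused here
    flags = next(reader)   # line 3: 'class' marks the label column
    class_idx = flags.index("class")
    feature_names = [n for i, n in enumerate(names) if i != class_idx]
    labels, rows = [], []
    for record in reader:
        if not record:
            continue  # skip empty trailing lines
        label = record[class_idx].strip()
        labels.append(label if label not in ("", "?") else None)
        rows.append([v for i, v in enumerate(record) if i != class_idx])
    return feature_names, labels, rows

# Tiny made-up sample in the same layout (not real assignment data).
sample = (
    "ethnicity\t6_10000005\t6_10000006\n"
    "discrete\tdiscrete\tdiscrete\n"
    "class\t\t\n"
    "CEU\t0\t1\n"
    "YRI\t1\t?\n"
)
feature_names, labels, rows = read_tab(io.StringIO(sample))
print(feature_names, labels, rows)
```

For the real files you would pass an open file handle (for example `open("genestrain.tab", newline="")`) instead of the `io.StringIO` sample.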
=========================
For the purposes of your HW answer alone, each three letter code is to be marked with a NUMERIC VALUE as indicated in the table below:
o CEU is Northern and Western European - 0
o GIH is Gujarati Indian from Houston - 1
o JPT is Japanese in Tokyo - 2
o ASW is Americans of African Ancestry - 3
o YRI is Yoruba in Ibadan, Nigeria - 4
YOU MUST USE THE ABOVE NUMERIC VALUES TO ENCODE YOUR ANSWER. Note: This numeric value has no presence in the test or training data.
Task: For each of the individuals in the test file, predict their ethnicity as CEU, GIH, JPT, ASW or YRI and enter your answers in HW7 in exactly the order that the 11 individuals appear in the test file.
So, for example, if your prediction is CEU, GIH, JPT, ASW, YRI, CEU, GIH, JPT, ASW, YRI, CEU, you should enter your answer as 0 1 2 3 4 0 1 2 3 4 0 (i.e. numbers separated by a space - no commas, tabs or anything else, just a space between single-digit numbers).
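The encoding rule above is easy to get wrong by hand, so here is a small sketch of it in Python. The names `CODES` and `encode_answer` are made up for illustration. The mapping itself is the one given in the table above.

```python
# Numeric values from the assignment's table.
CODES = {"CEU": 0, "GIH": 1, "JPT": 2, "ASW": 3, "YRI": 4}

def encode_answer(predictions):
    """Turn a list of three-letter codes into the space-separated answer string."""
    return " ".join(str(CODES[p]) for p in predictions)

# The example prediction from the assignment text.
example = ["CEU", "GIH", "JPT", "ASW", "YRI",
           "CEU", "GIH", "JPT", "ASW", "YRI", "CEU"]
print(encode_answer(example))  # 0 1 2 3 4 0 1 2 3 4 0
```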
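However, many people on the discussion forum complained that the Indian instructor did not explain the problem clearly (mainly, he did not tell them how to approach it) and did not provide a walkthrough video either. Fortunately, after downloading the data I found it contains one training set and one prediction set, so presumably the only thing to do is train first and then predict.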
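The training set is a .tab file in the following format:
Each row is one individual, and the class column gives the ethnicity (there are 139 rows, i.e. 139 training examples). The remaining columns are DNA positions (about 200,000 of them; most columns are not shown).
The prediction set looks like this:
Here the first column, filled with question marks, is what has to be predicted: the ethnicity of 11 individuals in total.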
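With the data understood, the next step is to work out how to train and predict. Someone on the discussion forum suggested Orange, a Python-based library that is very convenient to use. The documentation is at http://orange.biolab.si/doc/ofb/c_basics.htm , and a more detailed tutorial is at http://orange.biolab.si/docs/latest/tutorial/rst/classification/
For this problem a naive Bayes classifier is enough. The code is short: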
    # Description: Read data, build naive Bayesian classifier and classify first few instances
    # Category:    modelling
    # Uses:        genestrain.tab
    # Predict:     genesblind.tab
    # Referenced:  c_basics.htm

    import orange

    data = orange.ExampleTable("genestrain")
    data2 = orange.ExampleTable("genesblind")
    classifier = orange.BayesLearner(data)

    i = 0
    for item in data2:
        c = classifier(item)
        print "%d: %s " % (i, c)
        i = i + 1
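As you can see, the training data is first used to train a classifier, and the classifier is then applied to each row of the prediction data to print a result. The idea is fairly clear. The only drawback is that once the data gets a little large, it is slow and resource-hungry: this problem took about 1 GB of memory and 10 minutes to run.
The final output is as follows:
That gives the ethnicities of the 11 individuals to be predicted. Fill in the answers and it is done.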
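To make the train-then-predict idea concrete without Orange, here is a minimal naive Bayes sketch in plain Python. It is my own illustration, not what `orange.BayesLearner` does internally. It assumes Laplace smoothing over the two values 0 and 1, and it simply skips `?` entries. The names `train_nb` and `predict_nb` and the toy data are made up.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-class feature values, ignoring '?'."""
    class_counts = Counter(labels)
    # value_counts[label][j][v] = rows of class `label` with value v at feature j
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for j, v in enumerate(row):
            if v != "?":
                value_counts[label][j][v] += 1
    return class_counts, value_counts

def predict_nb(model, row):
    """Return the class with the highest log-posterior for one row."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for j, v in enumerate(row):
            if v == "?":
                continue  # missing value: contributes nothing
            counts = value_counts[label][j]
            seen = counts["0"] + counts["1"]
            # Laplace smoothing over the two possible values (assumption)
            score += math.log((counts[v] + 1) / (seen + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data, made up for illustration only.
train_rows = [["1", "0", "1"], ["1", "0", "?"], ["0", "1", "0"], ["0", "1", "1"]]
train_labels = ["CEU", "CEU", "YRI", "YRI"]
model = train_nb(train_rows, train_labels)
predictions = [predict_nb(model, r) for r in [["1", "0", "?"], ["0", "?", "0"]]]
print(predictions)  # ['CEU', 'YRI']
```

Working in log space keeps the product of roughly 200,000 per-feature probabilities from underflowing to zero.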
This is probably the last programming assignment of the course. Only the online final exam is left, so let's get this course wrapped up.