Differences

This shows you the differences between two versions of the page.

data [2014/08/10 20:31]
ychen [DANN data]
data [2014/08/10 20:57]
ychen [DANN data]
Line 3: Line 3:
 The raw data can be found here [[http://krishna.gs.washington.edu/martin/download/cadd_training/]]. The real SNV, insertion and deletion samples sum up to 16,627,775. We randomly sample an equal number of simulated samples (SNV, insertion and deletion), combine them with the real data, and get a dataset of 33,255,550 samples.
  
-This dataset is transformed into svmlight format with the script impute2svmlight.py, which is provided by Dr. Martin Kircher (the author of the CADD paper), and the python package [[https://github.com/mblondel/svmlight-loader|svmlight-loader]] is needed
+This dataset is transformed into svmlight format with the script impute2svmlight.py, which is provided by Dr. Martin Kircher (the author of the CADD paper), and the python package [[https://github.com/mblondel/svmlight-loader|svmlight-loader]]. We roughly partition the dataset into 80% for training, 10% for validation and 10% for testing. The corresponding svmlight files are here:
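The page does not show how the 80/10/10 partition was produced, so the following is only a minimal sketch of one way such a split could be done. The function name `partition_indices`, the fixed seed, and the shuffle-then-cut approach are all assumptions for illustration, not the script actually used for the DANN data.

```python
import random


def partition_indices(n_samples, percents=(80, 10, 10), seed=0):
    """Shuffle sample indices and cut them into train/validation/test lists.

    Hypothetical helper: the percentages follow the rough 80/10/10 split
    described on this page; the seed and the method are assumptions.
    """
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")
    indices = list(range(n_samples))
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(indices)
    # Integer arithmetic avoids floating-point rounding at the cut points.
    n_train = n_samples * percents[0] // 100
    n_valid = n_samples * percents[1] // 100
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    # The test set takes whatever remains, so no sample is dropped.
    test = indices[n_train + n_valid:]
    return train, valid, test


if __name__ == "__main__":
    train, valid, test = partition_indices(1000)
    print(len(train), len(valid), len(test))
```

Each returned list holds row indices into the combined dataset; the rows selected by each list would then be written out as a separate svmlight file.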
  
  