--- data [2014/08/10 20:06]
ychen
+++ data [2014/08/10 20:57] (current)
ychen [DANN data]
@@ Line 1: / Line 1: @@
 ===== DANN data =====
-The raw data can be found here [[http://krishna.gs.washington.edu/martin/download/cadd_training/]]
+The raw data can be found here [[http://krishna.gs.washington.edu/martin/download/cadd_training/]]. The real SNV, insertion, and deletion samples total 16,627,775. We randomly sample an equal number of simulated samples (SNVs, insertions, and deletions) and combine them with the real data, giving a dataset of 33,255,550 samples.
+This dataset is transformed into svmlight format with the script impute2svmlight.py, provided by Dr. Martin Kircher (the author of the CADD paper), and the Python package [[https://github.com/mblondel/svmlight-loader|svmlight-loader]]. We partition the dataset into roughly 80% for training, 10% for validation, and 10% for testing. The svmlight files are here: