This shows you the differences between two versions of the page.
data [2014/08/10 20:03] ychen |
data [2014/08/10 20:57] (current) ychen [DANN data] |
||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ===== DANN data ===== | ||
+ | |||
+ | The raw data can be found here [[http://krishna.gs.washington.edu/martin/download/cadd_training/]]. The real SNV, insertion and deletion samples total 16,627,775. We randomly sample an equal number of simulated samples (SNV, insertion and deletion) and combine them with the real data, giving a dataset of 33,255,550 samples (2 × 16,627,775). | ||
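
The balanced sampling step above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: the function name `balanced_sample`, the 0/1 label convention, and the in-memory lists are all assumptions, and the sketch matches only the total count rather than sampling per variant type.

```python
import random


def balanced_sample(real, simulated, seed=0):
    """Pair every real sample with one randomly chosen simulated sample.

    Hypothetical helper: draws len(real) simulated samples without
    replacement, labels them, and shuffles the combined set.
    """
    if len(simulated) < len(real):
        raise ValueError("not enough simulated samples to match the real ones")

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    picked = rng.sample(simulated, len(real))

    # Assumed label convention for this sketch: 0 = real, 1 = simulated.
    combined = [(0, x) for x in real] + [(1, x) for x in picked]
    rng.shuffle(combined)
    return combined
```

With 16,627,775 real samples this yields 33,255,550 labelled samples. At that scale the data would probably be sampled by index or streamed from disk rather than held in Python lists.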
+ | |||
+ | This dataset is transformed into svmlight format with the script impute2svmlight.py, provided by Dr. Martin Kircher (author of the CADD paper), and the Python package [[https://github.com/mblondel/svmlight-loader|svmlight-loader]]. We roughly partition the dataset into 80% for training, 10% for validation and 10% for testing. The corresponding svmlight files are here: | ||
+ | |||
+ | |||
+ | |||
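
One way to get a rough 80/10/10 partition of a large svmlight file is to assign each line to a split at random while streaming through the file. The sketch below assumes that approach; the function name `split_svmlight`, the file paths, and the seed are hypothetical, and this is not necessarily how the files listed here were produced.

```python
import random


def split_svmlight(src_path, train_path, valid_path, test_path, seed=0):
    """Stream an svmlight file and route each line to train/valid/test.

    Each line goes to train with probability 0.8 and to valid or test
    with probability 0.1 each, so the split sizes are only approximate.
    Returns the number of lines written to each split.
    """
    rng = random.Random(seed)
    counts = {"train": 0, "valid": 0, "test": 0}

    with open(src_path) as src, \
            open(train_path, "w") as train, \
            open(valid_path, "w") as valid, \
            open(test_path, "w") as test:
        outputs = {"train": train, "valid": valid, "test": test}
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            r = rng.random()
            name = "train" if r < 0.8 else ("valid" if r < 0.9 else "test")
            outputs[name].write(line)
            counts[name] += 1

    return counts
```

Streaming keeps memory use constant regardless of file size, at the cost of split sizes that are only approximately 80/10/10.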
===== tree-hmm sample .bam's from chr19 ===== | ===== tree-hmm sample .bam's from chr19 ===== | ||