The raw data can be found here: http://krishna.gs.washington.edu/martin/download/cadd_training/. The real SNV, insertion, and deletion samples total 16,627,775. We randomly sample an equal number of simulated samples (SNVs, insertions, and deletions), combine them with the real data, and obtain a dataset of 33,255,550 samples.
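The balanced sampling described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: the function name `balance_with_simulated`, the toy records, the fixed seed, and the origin tags are all assumptions made for the example.

```python
import random


def balance_with_simulated(real, simulated, seed=0):
    """Draw len(real) simulated samples without replacement and
    combine them with the real samples (hypothetical helper)."""
    if len(simulated) < len(real):
        raise ValueError("not enough simulated samples to match the real set")
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    drawn = rng.sample(simulated, k=len(real))
    # Tag each record with its origin; the tag names are an assumption.
    combined = [(x, "real") for x in real] + [(x, "simulated") for x in drawn]
    rng.shuffle(combined)
    return combined


# Toy stand-ins for the variant records.
real = [f"real_{i}" for i in range(5)]
simulated = [f"sim_{i}" for i in range(20)]
dataset = balance_with_simulated(real, simulated)

# The same one-to-one pairing gives the dataset size reported above.
n_real = 16_627_775
total = 2 * n_real
```

With one simulated sample drawn per real sample, the combined set is exactly twice the size of the real set, which matches 2 × 16,627,775 = 33,255,550.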
This dataset is transformed into svmlight format with the script impute2svmlight.py, which was provided by Dr. Martin Kircher (an author of the CADD paper), and the Python package svmlight-loader. We roughly partition the dataset into 80% for training, 10% for validation, and 10% for testing. The corresponding svmlight files are here:
tree-hmm sample data from the ENCODE human project: http://cbcl.ics.uci.edu/public_data/tree-hmm-sample-data
Our ChIP-seq analysis of LRH-1 can be found at: http://cbcl.ics.uci.edu/public_data/LRH-1
Included here are raw and processed ChIP-seq data from:
Genome-wide analysis of SREBP-1 binding in mouse liver chromatin reveals a preference for promoter proximal binding to a new motif. PNAS 2009 106:13765-13769; Young-Kyo Seo, Hansook Kim Chong, Aniello M. Infante, Seung-Soon Im, Xiaohui Xie, and Timothy F. Osborne