I got the opportunity to work on a project aimed at building a predictive model for a disorder X, so I am sharing a few observations and procedures here. Due to some constraints, I won't be getting into the details.
Aim: Build a generic predictive model for disorder X that works across all ethnic groups. A predictive model works better when it is developed for a specific ethnic group. However, we were trying to build a generic model, which meant we might not get very good results.
Data set: The data was gathered from three different studies involving three ethnic groups. Every source was a case-control study. A case is a record with a positive outcome (here, for disorder X) and a control is a record with a negative outcome (for disorder X).
Time division: It was a year-long project and we spent around 80% of the time on data pre-processing (gathering, profiling, cleaning and formatting). Most of the heavy processing was also done in this part. The remaining 20% of the time was spent on building and evaluating different predictive models.
Size: The data set was pretty big. At some points in the process we had to deal with around 38 million data points. That requires appropriate infrastructure, so we used a high performance computing (HPC) cluster.
Parallel processing: We had to execute jobs in parallel to expedite the process. We divided the data pre-processing tasks by chromosome and then into chunks whenever possible. Still, a few jobs took days to complete on the cluster.
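As a rough illustration of the chromosome-then-chunk division, here is a minimal Python sketch. Everything in it is an assumption made for the example: the record layout (`chrom`, `pos`), the `make_jobs` and `run_jobs` helpers, and the trivial `count_snps` worker are hypothetical, and a thread pool stands in for the cluster scheduler we actually used.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def make_jobs(snps, chunk_size):
    """Group SNP records by chromosome, then cut each group into chunks."""
    by_chrom = defaultdict(list)
    for snp in snps:
        by_chrom[snp["chrom"]].append(snp)
    jobs = []
    for chrom in sorted(by_chrom):
        group = by_chrom[chrom]
        for start in range(0, len(group), chunk_size):
            jobs.append((chrom, group[start:start + chunk_size]))
    return jobs


def count_snps(job):
    """Placeholder worker: a real job would clean or format the chunk."""
    chrom, chunk = job
    return chrom, len(chunk)


def run_jobs(jobs, worker, max_workers=4):
    """Run independent jobs concurrently; results keep the job order."""
    # Threads keep the sketch simple. CPU-bound work would normally go to
    # separate processes or to separate cluster jobs instead.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, jobs))


# Made-up toy data: 10 SNP records spread over chromosomes 1 to 3.
snps = [{"chrom": 1 + i % 3, "pos": i} for i in range(10)]
jobs = make_jobs(snps, chunk_size=2)
results = run_jobs(jobs, count_snps)
```

Because each (chromosome, chunk) job is independent, the same job list could just as well be written out as one cluster submission per job.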
Removing noise: The data required some generic cleaning procedures, like removing duplicates and records with missing values. Then we had to implement a few specific cleaning procedures, like removing SNPs with very high or very low frequencies.
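The two kinds of cleaning can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout, the SNP names, the genotype coding (0, 1 or 2 copies of the counted allele, `None` for missing) and the 0.05 and 0.95 cut-offs are all invented for the example and are not the values used in the project.

```python
def clean_records(records):
    """Drop repeated sample IDs and records with any missing genotype."""
    seen = set()
    kept = []
    for rec in records:
        if rec["id"] in seen:
            continue  # duplicate of an earlier record
        seen.add(rec["id"])
        if any(g is None for g in rec["genotypes"].values()):
            continue  # missing value
        kept.append(rec)
    return kept


def allele_frequency(records, snp):
    """Frequency of the counted allele; each person carries two alleles."""
    copies = sum(rec["genotypes"][snp] for rec in records)
    return copies / (2 * len(records))


def filter_snps(records, low=0.05, high=0.95):
    """Keep SNPs whose allele frequency lies within [low, high]."""
    if not records:
        return []
    snps = list(records[0]["genotypes"])
    return [s for s in snps if low <= allele_frequency(records, s) <= high]


# Made-up toy records; the SNP names are placeholders.
records = [
    {"id": "a", "genotypes": {"snp1": 0, "snp2": 1, "snp3": 2}},
    {"id": "a", "genotypes": {"snp1": 0, "snp2": 1, "snp3": 2}},
    {"id": "b", "genotypes": {"snp1": 0, "snp2": None, "snp3": 2}},
    {"id": "c", "genotypes": {"snp1": 0, "snp2": 2, "snp3": 2}},
    {"id": "d", "genotypes": {"snp1": 0, "snp2": 0, "snp3": 2}},
]
cleaned = clean_records(records)
kept_snps = filter_snps(cleaned)
```

In this toy data, snp1 never carries the counted allele and snp3 always does, so only snp2 survives the frequency filter.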
Formatting data: After removing some noise, we tried to enrich the data using available tools and techniques. These tools use reference chips to add more SNPs between the existing SNPs. Formatting was needed both for the input to these tools and for their output.
Identifying signal: We identified 17 possible traits associated with disorder X. For example, BMI is a trait associated with CVD. For each trait we considered the top n SNPs (pieces of genetic information) from a GWAS meta-analysis study. The value of n varied from trait to trait. The SNPs associated with these 17 traits helped us identify the signal in our data sets.
Building risk scores: Based on the reference data of SNPs associated with the 17 traits, we created a risk score for each trait. First we created unweighted risk scores, then weighted risk scores, and finally we standardized them.
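The three steps might look like the following sketch for a single trait. The formulas are assumptions for illustration: the unweighted score simply counts risk alleles, the weighted score multiplies each count by a per-SNP weight (for instance an effect size taken from the reference data), and standardizing means converting to z-scores. The SNP names and weights are made up.

```python
from statistics import mean, stdev


def unweighted_score(genotype, snps):
    """Count the risk alleles a person carries across the trait's SNPs."""
    return sum(genotype[s] for s in snps)


def weighted_score(genotype, weights):
    """Sum the risk allele counts, each multiplied by its SNP weight."""
    return sum(w * genotype[s] for s, w in weights.items())


def standardize(scores):
    """Convert scores to z-scores: mean 0 and standard deviation 1."""
    mu = mean(scores)
    sd = stdev(scores)
    if sd == 0:
        raise ValueError("all scores are identical; cannot standardize")
    return [(x - mu) / sd for x in scores]


# Hypothetical weights for one trait, and three made-up people.
weights = {"snp1": 0.5, "snp2": 0.25}
people = [
    {"snp1": 0, "snp2": 0},
    {"snp1": 1, "snp2": 2},
    {"snp1": 2, "snp2": 2},
]
raw = [unweighted_score(g, weights) for g in people]
weighted = [weighted_score(g, weights) for g in people]
standardized = standardize(weighted)
```

Standardizing puts the scores of all traits on the same scale, so they can be compared with each other later on.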
Ranking risk scores: By running logistic regression, we derived a numerical value (the p-value) representing the association between each risk score and the outcome (case or control). Then we sorted the risk scores by p-value in ascending order. The lower the p-value, the stronger the evidence of association.
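To make the ranking step concrete, here is a small self-contained Python sketch. It is not the code from the project: it fits a one-predictor logistic regression by Newton-Raphson and uses a two-sided Wald test for the slope as the p-value, which is an assumption about how the p-value is obtained. The score names and data are made up, and the fit does not converge if a score separates cases from controls perfectly.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def logistic_p_value(x, y, max_iter=50, tol=1e-8):
    """Fit logit(P(y=1)) = b0 + b1*x; return the Wald p-value for b1."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        # Entries of the 2x2 information matrix [[a, b], [b, d]].
        a = sum(w)
        b = sum(wi * xi for wi, xi in zip(w, x))
        d = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = a * d - b * b
        # Newton step: inverse information matrix times gradient.
        step0 = (d * g0 - b * g1) / det
        step1 = (a * g1 - b * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            break
    # Standard error of b1 from the inverse information matrix.
    se_b1 = math.sqrt(a / det)
    z = b1 / se_b1
    return math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value


def rank_scores(scores, outcome):
    """Return (name, p-value) pairs sorted by ascending p-value."""
    p_values = {name: logistic_p_value(values, outcome)
                for name, values in scores.items()}
    return sorted(p_values.items(), key=lambda item: item[1])


# Made-up data: 5 controls (0) followed by 5 cases (1).
outcome = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
scores = {
    "weak": [0.1, -0.9, 1.2, -0.3, 0.6, -0.2, 0.9, -1.1, 0.4, 0.3],
    "strong": [-1.5, -1.0, -0.5, 0.6, -0.8, 0.5, 1.0, 1.5, -0.6, 0.8],
}
ranked = rank_scores(scores, outcome)
```

The first entry of `ranked` is the score with the lowest p-value, that is, the one most strongly associated with the outcome in this toy data.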
Data integration and train-test division: Records from all three studies were merged into one data set. Two studies were labelled as the training subset and the remaining study was used as the testing data set. All models were built on the training data and evaluated against the test data. The split came to 70% training data and 30% testing data.
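Splitting by study rather than at random takes only a couple of lines. In this sketch the `study` field, the study names and the record counts are invented so that the toy split also comes out at 70/30.

```python
def split_by_study(records, test_study):
    """Hold out one whole study for testing; train on all the others."""
    train = [r for r in records if r["study"] != test_study]
    test = [r for r in records if r["study"] == test_study]
    return train, test


# Made-up merged data set: 4 + 3 records for training, 3 for testing.
merged = (
    [{"study": "A", "id": i} for i in range(4)]
    + [{"study": "B", "id": i} for i in range(3)]
    + [{"study": "C", "id": i} for i in range(3)]
)
train, test = split_by_study(merged, test_study="C")
train_fraction = len(train) / len(merged)
```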
Building the model: We then started building the predictive model from the risk scores, adding them one by one. Of course, the most strongly associated risk score was added first, and so on. The intention was to optimize the number of inputs, as some inputs might add more noise than signal.
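The one-by-one procedure can be sketched as a simple forward selection loop. This is an illustration, not our exact code: `evaluate` is a hypothetical caller-supplied function that fits a model on the given risk scores and returns a performance value, `min_gain` is an assumed threshold, and the lookup table below just fakes a few performance values so the example can run.

```python
def forward_select(ranked_predictors, evaluate, min_gain=0.0):
    """Add predictors in ranked order; keep each only if it helps."""
    selected = []
    best = 0.5  # baseline: performance of a model that guesses at random
    history = []
    for name in ranked_predictors:
        candidate = selected + [name]
        value = evaluate(candidate)
        history.append((name, value))
        if value - best > min_gain:
            selected = candidate
            best = value
    return selected, best, history


# Made-up performance values standing in for real model fits.
fake_results = {
    ("s1",): 0.60,
    ("s1", "s2"): 0.59,  # s2 adds noise, so it is dropped
    ("s1", "s3"): 0.64,  # s3 adds signal, so it is kept
}


def fake_evaluate(predictors):
    return fake_results[tuple(predictors)]


selected, best, history = forward_select(["s1", "s2", "s3"], fake_evaluate)
```

Here `history` records the value obtained with each candidate, which makes it easy to see which predictors were tried and rejected.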
Evaluating the model: Every time we added a new predictor (risk score), we evaluated the model based on its AUC value. There are many other ways to evaluate predictive models besides AUC. Ideally, every additional predictor would improve predictive performance, but in practice that doesn't happen. So we did not want to add all the risk scores, only those that give additional information and increase predictive ability.
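For readers unfamiliar with AUC, it can be computed directly from its definition: the probability that a randomly chosen case gets a higher predicted score than a randomly chosen control, with ties counted as half. The sketch below is a plain pairwise version written for clarity, not speed; the labels and scores are made up.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise case/control comparisons."""
    cases = [s for l, s in zip(labels, scores) if l == 1]
    controls = [s for l, s in zip(labels, scores) if l == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5  # ties count as half a win
    return wins / (len(cases) * len(controls))


# Made-up predictions for two controls (0) and two cases (1).
labels = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.8]
value = auc(labels, predicted)
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect separation of cases and controls. Here three of the four case/control pairs are ordered correctly, so `value` is 0.75.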