Katja Ickstadt, University of Dortmund, Germany
Talk Title: “Variable Selection Methods for High-Dimensional Data using Cross Leverage Scores”
Abstract: In this talk, we present several variable selection methods for high-dimensional regression in the context of genetics. The methods are intended for investigating the association of single nucleotide polymorphisms (SNPs) and their interactions with health outcomes. Our approaches use cross leverage scores to select variables while maintaining interpretability. To handle large datasets, one approach divides the data into subsets of variables (batches). For each batch in turn, we determine its q most important variables (q is predefined), compare them with those retained from the previous batch, keep the combined q most important variables, and discard the rest. After all batches have been analyzed, we obtain the q most important variables of the whole dataset. In another approach, we use fast approximations of the cross leverage scores, applying random projections to reduce the number of variables. We evaluate our variable selection methods in simulation studies and on a real data example, with respect to how well they select the important SNPs and to their predictive validity in a subsequent analysis. Both approaches are effective, since we avoid complex and time-consuming computations of high-dimensional matrices.
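The batch-wise procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the speaker's implementation. The abstract does not define the score, so the function cross_leverage_scores below rests on an assumption: variables and response are stacked as rows of one matrix, and the score of a variable is the absolute off-diagonal entry of the resulting projection ("hat") matrix between that variable and the response. The names cross_leverage_scores, select_top_q, and batch_size are hypothetical, and scores from different batches are compared directly, which is also an assumption.

```python
import numpy as np


def cross_leverage_scores(X_batch, y):
    """Score each variable (column of X_batch) against the response y.

    Assumption (not stated in the abstract): rows of Z are the variables
    plus the response, H = U U^T is the projection onto the column space
    of Z, and the score of variable j is |H[j, response]|.
    """
    Z = np.vstack([X_batch.T, y[None, :]])            # shape (b + 1, n)
    U, s, _ = np.linalg.svd(Z, full_matrices=False)   # thin SVD
    U = U[:, s > 1e-10 * s.max()]                     # drop null directions
    return np.abs(U[:-1] @ U[-1])                     # one score per variable


def select_top_q(X, y, q, batch_size):
    """Keep the q highest-scoring variables while scanning X in batches."""
    best_idx = np.empty(0, dtype=int)
    best_score = np.empty(0)
    n_vars = X.shape[1]
    for start in range(0, n_vars, batch_size):
        idx = np.arange(start, min(start + batch_size, n_vars))
        score = cross_leverage_scores(X[:, idx], y)
        # Merge the stored candidates with the current batch, keep the q best.
        all_idx = np.concatenate([best_idx, idx])
        all_score = np.concatenate([best_score, score])
        keep = np.argsort(all_score)[::-1][:q]
        best_idx, best_score = all_idx[keep], all_score[keep]
    return best_idx, best_score
```

Only one batch of columns and q stored candidates are held at a time, so memory use depends on batch_size and q rather than on the total number of variables. Under the assumed score, each batch should contain more variables than there are observations; otherwise the projection matrix is the identity and all scores are zero.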