Outline
Background for genetic studies of complex traits
Recursive partitioning, trees, and forests
Challenges and solutions in genetic studies
A case study

Complex Traits
Diseases that do not follow a Mendelian inheritance pattern
Genetic factors, environmental factors, and gene-gene (G-G) and gene-environment (G-E) interactions
Interactions: effects that deviate from the additive effects of the individual factors

Successes in Genetic Studies of Complex Traits
Genetic variants have been identified for age-related macular degeneration, diabetes, inflammatory bowel disorders, etc.

SNPs and Complex Traits

SNPs and Haplotypes

Gold Mining

Regression approach

Classic Modeling vs Genomic Association Analysis
In classic statistical modeling, we tend to have an adequate sample size for estimating the parameters of interest. Often, we have hundreds or thousands of observations for inference on a few parameters, and we can try to settle on an “optimal” model.
In genomic studies, we have more and more (gene-based) variables, but the number of study subjects we can access remains the same. One model can no longer provide an adequate summary of the information.

Recursive Partitioning
A technique that identifies heterogeneity in the data and fits a simple model (such as a constant or a linear model) locally, which avoids pre-specifying a systematic component.

Leukemia Data
Source:
Contents (mRNA samples):
25 – acute myeloid leukemia (AML)
38 – B-cell acute lymphoblastic leukemia (B-ALL)
9 – T-cell acute lymphoblastic leukemia (T-ALL)
7,129 genes
Question: are the microarray data useful in classifying different types of leukemia?

3-D View (figure: AML, T-ALL, and B-ALL samples)
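The recursive partitioning idea above can be sketched in a few lines. This is a minimal illustration, not the software used in the talk: it assumes Gini impurity as the splitting criterion and fits a constant (the majority class) in each terminal node. The function names and the toy two-gene data set are hypothetical and are not the leukemia data.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature, threshold, weighted impurity) of the best binary split,
    or None if no feature takes more than one value."""
    n = len(labels)
    best = None
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best

def grow(rows, labels, depth=0, max_depth=2):
    """Recursively partition the data; a leaf is the majority class label."""
    split = None
    if depth < max_depth and len(set(labels)) > 1:
        split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    j, t, _ = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return (j, t,
            grow([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            grow([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def predict(tree, row):
    """Follow the splits down to a leaf and return its label."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

# Toy data (made up): expression of two hypothetical genes for six samples.
rows = [(1.0, 5.0), (1.2, 4.8), (3.0, 1.0), (3.2, 1.1), (3.1, 6.0), (2.9, 6.2)]
labels = ["AML", "AML", "B-ALL", "B-ALL", "T-ALL", "T-ALL"]
tree = grow(rows, labels)
```

Each internal node is stored as a tuple `(feature, threshold, left, right)`, so the tree structure can be read directly from the nested result.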

Node Splitting

Tree Structure

Forests
The goal is to identify a constellation of models that collectively help us understand the data. For example, in gene expression profiling, we can select and rank the genes whose expression shows great promise for classifying tumor cells.

Bagging (Bootstrap Aggregating)
(Figure: a random tree splitting samples into Cancer and Normal by High and Low values; repeating this gives a random forest.)

Deterministic Forest
(Figure: a tree.) For the highlighted daughter nodes, we choose the three best splits.

Challenge I: Memory Constraint
The number of SNPs makes it impossible to conduct a full genome-wide association study on standard desktop computers.
Data security requirements often do not allow the analysis to be done on computers with very large memory.
We need a simple but efficient memory management design.
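As a reminder of what bagging does, here is a minimal sketch under simplifying assumptions: each "tree" is a one-split stump on a single expression value, each stump is fit to a bootstrap sample, and the forest predicts by majority vote. The helper names and the toy Low/High data are hypothetical; the forests discussed in the talk use full trees on many variables.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a single-threshold rule: one label if x <= t, another label otherwise."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    _, t, left_label, right_label = best
    return lambda x: left_label if x <= t else right_label

def bagging(xs, ys, n_trees=25, seed=0):
    """Fit one stump per bootstrap sample and return a majority-vote predictor."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n subjects with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]

    return predict

# Toy data (made up): low expression in normal tissue, high in cancer.
xs = [0.1, 0.2, 0.3, 0.4, 1.6, 1.7, 1.8, 1.9]
ys = ["Normal"] * 4 + ["Cancer"] * 4
forest_predict = bagging(xs, ys)
```

Because each stump sees a different bootstrap sample, the thresholds vary from tree to tree, and the vote averages over that variation.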

How to Use Memory Efficiently
Genotype coding: 0 (AA), 1 (AB), 2 (BB), and 3 (missing)
(Figure: a sequence of genotype codes is compressed from bytes to bits and decompressed when needed.)

Willows

Willows GUI
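Since the four genotype codes fit in 2 bits, four genotypes can share one byte. The sketch below illustrates that compression and decompression step. The bit layout (first genotype in the lowest two bits) and the function names are assumptions made for illustration and are not necessarily what Willows does internally.

```python
def pack_genotypes(codes):
    """Pack genotype codes (0=AA, 1=AB, 2=BB, 3=missing) into bytes, four per byte."""
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        if not 0 <= code <= 3:
            raise ValueError(f"genotype code out of range: {code}")
        # Genotype i occupies bits 2*(i % 4) and 2*(i % 4) + 1 of byte i // 4.
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)

def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from the packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]

codes = [2, 0, 3, 1, 0, 1]
packed = pack_genotypes(codes)
restored = unpack_genotypes(packed, len(codes))
```

Stored one genotype per byte, the six codes above take 6 bytes; packed, they take 2. The number of genotypes has to be kept alongside the packed bytes, because padding bits in the last byte are indistinguishable from code 0.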

Willows Output

Challenge II: Haplotype Certainty
SNPs: directly observed; no uncertainty; less informative; tree approaches
Haplotypes: inferred from SNPs; uncertain; more informative; forest approaches

Forest Forming Scheme
Unphased data

Haplotype Frequency Estimation
Existing haplotype frequency estimation software outputs a set of haplotype pairs with corresponding frequencies for each subject in each region.
We used SNPHAP (Clayton 2006).

Unphased to Phased Data
One unphased data set expands to a large number of phased data sets.
In each region, an individual’s haplotype pair is randomly selected based on the estimated frequencies, to account for the uncertainty of the haplotypes.
Haplotypes with low frequencies (~5-10%) should have some representation.

Trees Based on Phased Data
A tree is grown for each phased data set.
A random forest is formed from the trees for all phased data sets.
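The expansion from one unphased data set to many phased data sets can be sketched as weighted random sampling. The input structure below (a dictionary of candidate haplotype pairs with estimated probabilities per subject and region) is a hypothetical stand-in for the output of a phasing program; it is not SNPHAP's actual file format.

```python
import random

def sample_phased_dataset(candidates, rng):
    """Draw one haplotype pair per subject and region, weighted by estimated probability.

    candidates: {subject: {region: [(haplotype_pair, probability), ...]}}
    """
    phased = {}
    for subject, regions in candidates.items():
        phased[subject] = {}
        for region, options in regions.items():
            pairs = [pair for pair, _ in options]
            weights = [prob for _, prob in options]
            phased[subject][region] = rng.choices(pairs, weights=weights, k=1)[0]
    return phased

def expand_to_phased(candidates, n_datasets, seed=0):
    """Expand one unphased data set into n_datasets phased data sets."""
    rng = random.Random(seed)
    return [sample_phased_dataset(candidates, rng) for _ in range(n_datasets)]

# Toy input (made up): subject s1 is ambiguous in region r1, s2 is not.
candidates = {
    "s1": {"r1": [(("ACG", "TCA"), 0.9), (("ACA", "TCG"), 0.1)]},
    "s2": {"r1": [(("ACG", "ACG"), 1.0)]},
}
phased_sets = expand_to_phased(candidates, n_datasets=200)
```

With enough phased data sets, a pair with estimated probability around 0.1 still shows up in some of them, which is how low-frequency haplotypes get some representation. A tree would then be grown on each element of `phased_sets`.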

Inference from the Forest

Significance Level
The distribution of the maximum haplotype importance under the null hypothesis is determined by permutation.
First, disease status is permuted among study subjects while keeping the genome intact for all individuals.
Then, each permuted data set is treated in the same way as the original data.

Simulation Studies (2 loci)
300 cases and 300 controls
Each region has 3 SNPs
12 interaction models from Knapp et al. (1994) and Becker et al. (2005)
2 additive models with background penetrance
3 scenarios:
Neither region is in LD with the disease allele
One of the regions is in LD (D’ = 0.5) with the disease allele
Both regions are in LD (D’ = 0.5) with the disease allele
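The permutation procedure for the significance level can be sketched as follows. The statistic here is a deliberately simple stand-in (the largest absolute difference in mean genotype between cases and controls), not the forest-based haplotype importance from the talk; any function with the same signature could be passed in instead. All names are hypothetical.

```python
import random

def toy_max_importance(status, genotypes):
    """Stand-in statistic: max over SNPs of |mean in cases - mean in controls|."""
    cases = [g for s, g in zip(status, genotypes) if s == 1]
    controls = [g for s, g in zip(status, genotypes) if s == 0]
    n_snps = len(genotypes[0])
    diffs = []
    for j in range(n_snps):
        mean_case = sum(g[j] for g in cases) / len(cases)
        mean_control = sum(g[j] for g in controls) / len(controls)
        diffs.append(abs(mean_case - mean_control))
    return max(diffs)

def permutation_p_value(status, genotypes, statistic, n_perm=200, seed=0):
    """Permute disease status, keep each subject's genotypes intact, and compare
    the permuted maximum statistic with the observed one."""
    rng = random.Random(seed)
    observed = statistic(status, genotypes)
    shuffled = list(status)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # the genotype rows are never reordered
        if statistic(shuffled, genotypes) >= observed:
            exceed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (exceed + 1) / (n_perm + 1)

# Toy data (made up): SNP 0 separates cases from controls, SNP 1 is constant.
status = [1] * 20 + [0] * 20
genotypes = [(2, 0)] * 20 + [(0, 0)] * 20
p_value = permutation_p_value(status, genotypes, toy_max_importance)
```

Taking the maximum over all markers inside the statistic, and recomputing it for every permutation, is what lets a single threshold account for the many markers being tested.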

Trees in Genetic Studies
Zhang and Bonney (2000)
Nelson et al. (2001)
Bastone et al. (2004)
Cook, Zee and Ridker (2004)
Foulkes, De Gruttola and Hertogs (2004)

References on Forests
Breiman L. Bagging predictors. Machine Learning, 24(2): 123-140, 1996.
Zhang HP. Classification trees for multiple binary responses. Journal of the American Statistical Association, 93: 180-193, 1998.
Zhang HP et al. Cell and tumor classification using gene expression data: construction of forests. Proceedings of the National Academy of Sciences USA, 100: 4168-4172, 2003.

Thank you!

