Introduction to SLCMA

Applications for DNA Methylation Analysis

Joshua A. Goode

June 19, 2025

Some Quick Notes


  • Slides Available
  • Please Stop Me
    • Questions
    • Audio Issues

Notes on Slide Navigation


Copying Code Blocks

All code blocks can be copied by clicking the clipboard icon in the upper right corner. If the icon is hidden, hovering your mouse cursor in the area should reveal it.

Scrolling Content

In some cases, content may extend beyond the limits of the slide. It may be necessary to scroll up/down or left/right.

Some Caveats


Let’s Keep it Simple, Silly

My goal is to focus largely on conceptual issues related to SLCMA and studies of DNA methylation. Although I briefly discuss the slcma R Package created by Dr. Andrew Smith, this is not meant to be a full tutorial.

I’m Just a Padawan

Dr. Andrew Smith is a Jedi Master. I learned everything I know about SLCMA from him. Some materials have been borrowed/adapted from his teaching. I am grateful and humbled by the opportunity to learn from him.

The Importance of Social Science


DNAm Research

  • Limited Focus in Biology
    • Current Exposure
    • Ever Exposed
  • Contribution of Social Science
    • Several large panel studies
    • Many years of data
    • A plethora of data types
    • Really interesting questions/theory

What We Need

  • Way to use our vast data to test hypotheses across the life course
    • Systematic to avoid false-positive results
    • Efficient to accommodate analysis of high-dimensional data
    • Easy to use
  • SLCMA can help!

What is SLCMA?


Structured
Life
Course
Modeling
Approach

Which life course hypothesis best fits our data?

SLCMA Hypotheses


Big 3

  • Sensitive Periods
  • Accumulation
  • Recency

Others

  • Mobility
  • Change
  • Always Exposed
  • Ever Exposed

Hypotheses (Big 3)


Sensitive Periods \(\left(SP \text{ at } t_j\right)\)

  • An exposure has its strongest effect on the outcome when it occurs at a specific developmental time point, due to heightened levels of plasticity or reprogramming
  • Just the exposure variable at each age
  • Can be continuous or binary

    \[SP_j = x_j\]

Hypotheses (Big 3)


Accumulation \(\left(Acc\right)\)

  • Every additional time point of exposure affects the outcome in a dose-response manner, independent of the exposure timing
  • Add up the exposure variable across all time points
  • Can be continuous or binary

    \[Acc = \sum_{j=1}^m{x_j}\]

Hypotheses (Big 3)


Recency \(\left(Rec\right)\)

  • More proximal exposures (closer in time to the measurement of the outcome) are more strongly linked to the outcome than are more distal exposures
  • Add up the products of each exposure variable multiplied by its age of observation
  • Can be continuous or binary

    \[Rec = \sum_{j=1}^m{\left(x_jt_j\right)}\]
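
The three encodings can be built directly from an exposure matrix. The following is a minimal sketch, assuming a hypothetical toy matrix `x` (one row per person, one column per time point) and a vector `ages` of observation ages; neither object comes from the slides.

```r
# hypothetical toy data: 4 people, binary exposure at 3 time points
x <- cbind(sp04 = c(0, 1, 0, 1),
           sp26 = c(0, 0, 1, 1),
           sp43 = c(0, 0, 1, 1))
ages <- c(4, 26, 43)  # age at each observation (t_j)

# sensitive periods: the exposure columns themselves (SP_j = x_j)
sp <- x

# accumulation: sum of the exposures across time points
acc <- rowSums(x)

# recency: sum of the exposures weighted by age of observation
rec <- as.vector(x %*% ages)

data.frame(sp, acc, rec)
```

Each sensitive period stays a separate column, while accumulation and recency each collapse the exposure history into a single variable.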

SLCMA Steps


  1. Fit a regression model for each single life course hypothesis of interest, as well as groups of compound hypotheses
  2. Measure the goodness-of-fit of each model and select the best one
  3. Calculate appropriate p-values for the selected model

Simulate Example Data


Variable   Description
y          Outcome
sp04       Binary Exposure (\(Age = 4\))
sp26       Binary Exposure (\(Age = 26\))
sp43       Binary Exposure (\(Age = 43\))
acc        Accumulation
rec        Recency

Simulate Example Data


# set seed
set.seed(1234)

# simulate exposure & outcome data
n <- c(141, 20, 88, 80, 40, 35, 317, 367)
y <- c(28.7, 27.5, 29.1, 27.8, 28.3, 27.1, 27.6, 26.0)
se <- c(0.5, 1.1, 0.7, 0.7, 0.8, 0.9, 0.3, 0.2)
sp04 <- rep(c(0, 1, 0, 0, 1, 1, 0, 1), times = n)
sp26 <- rep(c(0, 0, 1, 0, 1, 0, 1, 1), times = n)
sp43 <- rep(c(0, 0, 0, 1, 0, 1, 1, 1), times = n)
e <- lm(rnorm(sum(n)) ~ sp04 * sp26 * sp43)$residuals
y <- rep(y, n) + rep(se * sqrt(n), n) * e / sd(e)

# construct accumulation & recency measures
acc <- sp04 + sp26 + sp43
rec <- (sp04 * 4) + (sp26 * 26) + (sp43 * 43)

# create data frame
dats_bin <- data.frame(cbind(y, sp04, sp26, sp43, acc, rec))

# clean up
rm(n, y, se, sp04, sp26, sp43, acc, rec, e)

Step 1: Fit Models


Which single hypothesis best fits our data?

model_sp04 <- lm(y ~ sp04, data = dats_bin)
model_sp26 <- lm(y ~ sp26, data = dats_bin)
model_sp43 <- lm(y ~ sp43, data = dats_bin)
model_acc <- lm(y ~ acc, data = dats_bin)
model_rec <- lm(y ~ rec, data = dats_bin)

Step 1: Fit Models


Which single hypothesis best fits our data?

Hyp.   Coeff.   \(R^2\)
SP04   -1.737   0.026
SP26   -1.075   0.008
SP43   -1.820   0.023
Acc    -0.964   0.034
Rec    -0.032   0.025
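
A table like this one can be produced in a loop rather than by reading each model summary by hand. The following is a sketch only: the data frame `toy` is hypothetical simulated data (not `dats_bin`), so the numbers it produces will not match the slide.

```r
# hypothetical toy data (not the simulated data from the slides)
set.seed(1)
n <- 200
toy <- data.frame(sp04 = rbinom(n, 1, 0.5),
                  sp26 = rbinom(n, 1, 0.5),
                  sp43 = rbinom(n, 1, 0.5))
toy$acc <- toy$sp04 + toy$sp26 + toy$sp43
toy$rec <- 4 * toy$sp04 + 26 * toy$sp26 + 43 * toy$sp43
toy$y <- 28 - toy$acc + rnorm(n, sd = 5)

# fit one model per single hypothesis
hyps <- c("sp04", "sp26", "sp43", "acc", "rec")
fits <- lapply(hyps, function(h) lm(reformulate(h, response = "y"), data = toy))

# keep the slope and R-squared from each model
results <- data.frame(
  hyp = hyps,
  coeff = sapply(fits, function(f) unname(coef(f)[2])),
  r2 = sapply(fits, function(f) summary(f)$r.squared)
)

# order from best to worst fit
results[order(-results$r2), ]
```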

Step 1: Fit Models


Which compound hypothesis best fits our data?

model_acc_sp04 <- lm(y ~ acc + sp04, data = dats_bin)
model_acc_sp26 <- lm(y ~ acc + sp26, data = dats_bin)
model_acc_sp43 <- lm(y ~ acc + sp43, data = dats_bin)
model_acc_rec <- lm(y ~ acc + rec, data = dats_bin)

Step 1: Fit Models


Which compound hypothesis best fits our data?

Hyp.         Coeff.   \(R^2\)
Acc + SP04   -0.731   0.036
Acc + SP26   -1.389   0.039
Acc + SP43   -0.836   0.034
Acc + Rec    -1.179   0.034

Step 1: Fit Models


Which compound hypothesis best fits our data?

model_acc_sp26_sp04 <- lm(y ~ acc + sp26 + sp04, data = dats_bin)
model_acc_sp26_sp43 <- lm(y ~ acc + sp26 + sp43, data = dats_bin)
model_acc_sp26_rec <- lm(y ~ acc + sp26 + rec, data = dats_bin)

Step 1: Fit Models


Which compound hypothesis best fits our data?

Hyp.                Coeff.   \(R^2\)
Acc + SP26 + SP04   -1.380   0.039
Acc + SP26 + SP43   -1.396   0.039
Acc + SP26 + Rec    -1.397   0.039

Step 2: Compare Model Fit


Are compound hypotheses improving our model?

Model   Hypothesis          \(R^2\)
1       Acc                 0.034
2       Acc + SP26          0.039
3       Acc + SP26 + SP04   0.039
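
One way to judge whether the added terms are worth keeping is an elbow plot of \(R^2\) against the number of hypotheses in the model. The sketch below uses the three \(R^2\) values from the table above; the starting value of 0 for an intercept-only model and the variable names are additions for illustration.

```r
# R-squared values copied from the table, preceded by an intercept-only model
n_terms <- 0:3
r2 <- c(0, 0.034, 0.039, 0.039)

# gain in R-squared from each added term
gain <- diff(r2)
print(gain)

# elbow plot: look for the point where the curve flattens
plot(n_terms, r2, type = "b", xaxt = "n",
     xlab = "Number of hypotheses in model", ylab = "R-squared")
axis(1, at = n_terms)
```

The curve flattens after the first term, which is why the single-hypothesis accumulation model is the one carried into Step 3.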

Step 3: Calculate Correct p-Value


final_model <- summary(lm(y ~ acc, data = dats_bin))
print(final_model)

Call:
lm(formula = y ~ acc, data = dats_bin)

Residuals:
    Min      1Q  Median      3Q     Max 
-21.771  -3.318  -0.039   3.160  20.029 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  29.1837     0.3376  86.433  < 2e-16 ***
acc          -0.9642     0.1566  -6.158 1.04e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.215 on 1086 degrees of freedom
Multiple R-squared:  0.03373,   Adjusted R-squared:  0.03285 
F-statistic: 37.92 on 1 and 1086 DF,  p-value: 1.039e-09

Step 3: Calculate Correct p-Value


  • The \(p\)-value in our output \(\left(p = 1.04 \times 10^{-9}\right)\) is incorrect.
  • Assumes we only tested a single hypothesis when we actually tested five.
  • The easiest way to address this is with a Bonferroni correction. \[5\left(1.04 \times 10^{-9}\right) \approx 5.2 \times 10^{-9}\]
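
The same correction can be done with base R's `p.adjust()`. This is a small sketch: the \(p\)-value is copied from the `lm()` output, and the variable names are placeholders chosen for illustration.

```r
# p-value for acc from the lm() output, before rounding
p_naive <- 1.039e-09

# number of single hypotheses tested (SP04, SP26, SP43, Acc, Rec)
n_tests <- 5

# Bonferroni: multiply by the number of tests, capped at 1
p_bonf <- p.adjust(p_naive, method = "bonferroni", n = n_tests)
p_bonf
```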

slcma R Package


A Better Way

Please Cite the Package

slcma R Package


SLCMA Steps w/ Package

  1. Fit a regression model for each single life course hypothesis of interest, as well as groups of compound hypotheses
    • Manually: Fit a model for each hypothesis, as well as compound hypotheses
    • Package: Uses LARS to fit each model

slcma R Package


SLCMA Steps w/ Package

  2. Measure the goodness-of-fit of each model and select the best one
    • Manually: Create an elbow plot
    • Package: Generates an elbow plot with a single command

slcma R Package


SLCMA Steps w/ Package

  3. Calculate appropriate p-values for the selected model
    • Manually: Apply a Bonferroni correction
    • Package: Uses fixed LASSO inference or max-|t| test to correct p-values

Installing the Package


  • The slcma package is available on GitHub
  • Can be installed with the install_github() command from the remotes package

# install the package
remotes::install_github("thedunnlab/slcma")

# load the package
library(slcma)

Step 1: Fit Models


slcma_model <- slcma(y ~ sp04 + sp26 + sp43 +
                       Accumulation(sp04, sp26, sp43) +
                       Recency(weights = c(4, 26, 43), sp04, sp26, sp43),
                     data = dats_bin)

                                              Term                             Role
                                       (Intercept)       Adjusted for in all models
                                              sp04 Available for variable selection
                                              sp26 Available for variable selection
                                              sp43 Available for variable selection
                    Accumulation(sp04, sp26, sp43) Available for variable selection
 Recency(weights = c(4, 26, 43), sp04, sp26, sp43) Available for variable selection

Step 2: Compare Model Fit


summary(slcma_model)

Summary of LARS procedure
 Step              Variable selected Variable removed Variables R-squared
    0                                                         0     0.000
    1 Accumulation(sp04, sp26, sp43)                          1     0.022
    2                           sp04                          2     0.029
    3                           sp43                          3     0.039

Step 2: Compare Model Fit


plot(slcma_model)

Step 3: Calculate Correct p-Value


slcmaInfer(slcma_model, 1, method = "slcmaFLI")

Inference for model at Step 1 of LARS procedure

Number of selected variables: 1
R-squared from lasso fit: 0.022

Results from fixed lasso inference (selective inference):

Standard deviation of noise (specified or estimated) sigma = 5.210

Testing results at lambda = 18.578, with alpha = 0.050

                                 Coef P-value  CI.lo CI.up LoTailArea
Accumulation(sp04, sp26, sp43) -0.964       0 -1.272 -0.63      0.024
                               UpTailArea
Accumulation(sp04, sp26, sp43)      0.024

SLCMA for Methylation Data


  • Clocks/Surrogates
    • Just basic outcome variables
    • Easy to do
  • Differentially Methylated Probes (EWAS)
    • SLCMA for each probe
    • A lot of models
      • \(\text{5 Hypotheses} \times \text{850,000 Probes} \approx \text{4,250,000 Models}\)
      • Package is super helpful!
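
To give a sense of what "SLCMA for each probe" involves, here is a deliberately simplified sketch in base R. It only picks the single hypothesis with the highest \(R^2\) for each probe, as a stand-in for the first selection step, and it does not use the slcma package or compute corrected p-values. The data, the `meth` matrix, and the `best_hypothesis()` helper are all hypothetical.

```r
# hypothetical toy data: 100 people, 3 exposures, 20 "probes"
set.seed(1)
n <- 100
expo <- data.frame(sp04 = rbinom(n, 1, 0.5),
                   sp26 = rbinom(n, 1, 0.5),
                   sp43 = rbinom(n, 1, 0.5))
expo$acc <- rowSums(expo[, c("sp04", "sp26", "sp43")])
expo$rec <- 4 * expo$sp04 + 26 * expo$sp26 + 43 * expo$sp43
meth <- matrix(rnorm(n * 20), nrow = n,
               dimnames = list(NULL, paste0("probe", 1:20)))

# for one probe: fit each single hypothesis and return the best R-squared
best_hypothesis <- function(m, expo) {
  r2 <- sapply(names(expo), function(h) summary(lm(m ~ expo[[h]]))$r.squared)
  data.frame(hyp = names(which.max(r2)), r2 = max(r2))
}

# repeat across all probes (columns of the methylation matrix)
res <- do.call(rbind, lapply(seq_len(ncol(meth)), function(j) {
  cbind(probe = colnames(meth)[j], best_hypothesis(meth[, j], expo))
}))
head(res)
```

With 20 probes this is 100 models; at array scale the same loop is what produces millions of fits.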

Post-Selection Inference Methods


Naive Calculation
  • Inflated FWER
  • Biased p-Values
  • Fast Computation
Bonferroni Correction
  • FWER Controlled
  • Unbiased p-Values
  • Fast Computation
  • Overly Conservative
Fixed LASSO Inference
  • FWER Controlled
  • Unbiased p-Values
  • Slow Computation
Max-|t| Test
  • FWER Controlled
  • Unbiased p-Values
  • Slow Computation

Post-Selection Inference Methods


Fixed LASSO Inference
  • Uses selectiveInference Package
  • 2-Tail CIs; 1-Tail p-Values
  • Strange Warnings
Max-|t| Test
  • “Baked into” slcma Package
  • CIs Very Slow
  • No Compound Hypotheses

Post-Selection Inference Methods


When the true hypothesis is compound, the Max-|t| Test has greater power than Fixed LASSO Inference to select a single hypothesis.

Additional Considerations


Outcome Variables

  • Package currently works only for continuous outcomes

Time-Varying Covariates

  • SLCMA does not yet accommodate these
  • Research is ongoing
  • Stick with covariates measured before life course hypotheses

Additional Considerations


M-Values vs. Beta Values

  • M-Values have better statistical properties
  • Beta Values are easier to interpret
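
The two scales are related by a logit transformation, \(M = \log_2\left(\beta / (1 - \beta)\right)\), so it is possible to analyze on one scale and report on the other. A minimal sketch, with hypothetical helper names `beta_to_m()` and `m_to_beta()`:

```r
# convert Beta values (proportion methylated, strictly between 0 and 1) to M-values
beta_to_m <- function(beta) log2(beta / (1 - beta))

# convert M-values back to Beta values
m_to_beta <- function(m) 2^m / (2^m + 1)

beta <- c(0.1, 0.5, 0.9)
m <- beta_to_m(beta)
m

# the round trip recovers the original Beta values
m_to_beta(m)
```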

Parallel Processing

  • SLCMA to detect differentially methylated probes is computationally intensive
  • Parallel processing is key
  • Best approach will depend on your cluster
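
As one possible starting point, the base R parallel package can spread per-probe models across cores. This sketch is an assumption about workflow, not a recommendation from the slides: the toy data and the `fit_probe()` helper are hypothetical, a single `lm()` stands in for the full SLCMA of each probe, and cluster schedulers will usually call for a different setup.

```r
library(parallel)

# hypothetical toy data: 100 people, one accumulation score, 50 "probes"
set.seed(1)
n <- 100
acc <- rbinom(n, 3, 0.5)
meth <- matrix(rnorm(n * 50), nrow = n)

# per-probe work: a single lm() stands in for the full analysis of that probe
fit_probe <- function(j) summary(lm(meth[, j] ~ acc))$r.squared

# forking is unavailable on Windows, so fall back to one core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# run the per-probe function over all columns in parallel
r2 <- unlist(mclapply(seq_len(ncol(meth)), fit_probe, mc.cores = cores))
length(r2)
```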

Additional Resources


Thank You


  • Please feel free to reach out with questions.
  • If there is interest, I can put together a video tutorial that dives a bit more in depth.