MethylSurroGetR

Tools to Generate Predicted Values of DNAm Surrogates in R

Joshua A. Goode
Trey Smith

June 19, 2025

Some Quick Notes

Notes on Slide Navigation

Copying Code Blocks

All code blocks can be copied by clicking the clipboard icon in the upper right corner. If the icon is hidden, hovering your mouse cursor in the area should reveal it.

Scrolling Content

In some cases, content may extend beyond the limits of the slide. It may be necessary to scroll up/down or left/right.

🚧 Under Development 🚧

This package is still in development and not yet ready for general use.

Proceed with caution!

What is MethylSurroGetR?

A simple set of user-friendly functions for generating predicted values from existing DNA methylation surrogates

  • Data Management
  • Handling Missing Data
  • Generating Estimates

What is MethylSurroGetR?

  • Does not develop surrogates
  • Existing packages already address well-known clocks
  • Fills the gap for recently published and/or less well-known surrogates
  • Allows construction of MRSs/PEGs from EWAS results

Install & Load

# install.packages("remotes")
remotes::install_github("jagoode27/MethylSurroGetR")

library(MethylSurroGetR)

Load Example Methylation Data

data("beta_matrix_miss", package = "MethylSurroGetR")
print(beta_matrix_miss)
         samp1      samp2      samp3     samp4      samp5
cg01 0.1028646 0.32037324 0.48290240        NA 0.36948887
cg02 0.2875775         NA         NA 0.8830174         NA
cg04 0.4348927 0.18769112 0.89035022 0.1422943 0.98421920
cg05 0.9849570 0.78229430 0.91443819 0.5492847 0.15420230
cg07 0.8998250 0.24608773         NA 0.3279207 0.95450365
cg08 0.8895393 0.69280341 0.64050681 0.9942698 0.65570580
cg09        NA 0.09359499 0.60873498 0.9540912         NA
cg10 0.8864691         NA         NA 0.5854834 0.14190691
cg12 0.1750527         NA 0.14709469        NA 0.69000710
cg13 0.9630242 0.90229905 0.69070528 0.7954674 0.02461368
cg14 0.1306957         NA 0.93529980 0.6478935         NA
cg16 0.6531019 0.33282354 0.30122890 0.3198206 0.89139412
cg17 0.1428000 0.41454634 0.41372433 0.3688455 0.15244475
cg19 0.3435165 0.48861303 0.06072057 0.3077200         NA
cg20 0.6567581 0.95447383 0.94772694 0.2197676         NA

Load Example Surrogate Weights

data("wts_df", package = "MethylSurroGetR")
print(wts_df)
                wt_lin      wt_prb       wt_cnt
cg02      -0.009083377  0.16511519  0.050895032
cg03      -0.001155999 -0.40515934  0.025844226
cg06       0.005978497 -0.11603036  0.042036480
cg07      -0.007562015 -0.22561636 -0.099875045
cg08       0.001218960  0.31464004 -0.004936685
cg11      -0.005869372 -0.05148366 -0.055976223
cg13      -0.007449367  0.31006435 -0.024036692
cg15       0.005066157  0.31238951  0.022554201
cg17       0.007900907  0.29434232 -0.029640418
cg18      -0.002510744 -0.06016831 -0.077772915
Intercept  1.211000000  0.01900000  0.937000000

# Extract the linear-model weights as a named numeric vector
wts_vec_lin <- setNames(wts_df$wt_lin, rownames(wts_df))

Create methyl_surro Object

surro_set(methyl, weights, intercept = NULL)
  • methyl: Numeric matrix of methylation data
    • CpG sites as row names
    • sample IDs as column names
  • weights: Named numeric vector of surrogate weights.
  • intercept: Optional character string identifying the name of the intercept in the weights object
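
To make the alignment step concrete, here is a minimal base-R sketch of what building such an object might involve. It is not the package's implementation: align_to_weights() is a hypothetical helper, and the behavior (keep only weighted probes, append all-NA rows for probes absent from the data, set the intercept aside) is an assumption inferred from the printed example object in these slides.

```r
# Illustrative sketch only; align_to_weights() is a hypothetical helper,
# not a MethylSurroGetR function.
align_to_weights <- function(methyl, weights, intercept = NULL) {
  int_val <- NULL
  if (!is.null(intercept)) {
    int_val <- weights[intercept]                     # set the intercept aside
    weights <- weights[names(weights) != intercept]   # keep probe weights only
  }
  present <- intersect(rownames(methyl), names(weights))  # weighted probes with data
  absent  <- setdiff(names(weights), rownames(methyl))    # weighted probes without data
  # Rows of NA stand in for probes that are not in the methylation matrix
  filler <- matrix(NA_real_, nrow = length(absent), ncol = ncol(methyl),
                   dimnames = list(absent, colnames(methyl)))
  list(methyl    = rbind(methyl[present, , drop = FALSE], filler),
       weights   = weights,
       intercept = int_val)
}

# Toy data (made up for illustration)
toy_methyl <- matrix(c(0.2, 0.4,
                       0.6, 0.8,
                       0.1, 0.9),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("cg01", "cg02", "cg07"),
                                     c("samp1", "samp2")))
toy_weights <- c(cg02 = -0.01, cg03 = 0.02, cg07 = 0.03, Intercept = 1.2)

toy_obj <- align_to_weights(toy_methyl, toy_weights, intercept = "Intercept")
print(toy_obj$methyl)
```

In this toy case, cg01 is dropped because it carries no weight, and cg03 is added as an all-NA row because it has a weight but no data.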

Create methyl_surro Object

lin_surrogate <- surro_set(methyl = beta_matrix_miss,
                           weights = wts_vec_lin,
                           intercept = "Intercept")
print(lin_surrogate)
$methyl
         samp1     samp2     samp3     samp4      samp5
cg02 0.2875775        NA        NA 0.8830174         NA
cg07 0.8998250 0.2460877        NA 0.3279207 0.95450365
cg08 0.8895393 0.6928034 0.6405068 0.9942698 0.65570580
cg13 0.9630242 0.9022990 0.6907053 0.7954674 0.02461368
cg17 0.1428000 0.4145463 0.4137243 0.3688455 0.15244475
cg03        NA        NA        NA        NA         NA
cg06        NA        NA        NA        NA         NA
cg11        NA        NA        NA        NA         NA
cg15        NA        NA        NA        NA         NA
cg18        NA        NA        NA        NA         NA

$weights
        cg02         cg03         cg06         cg07         cg08         cg11 
-0.009083377 -0.001155999  0.005978497 -0.007562015  0.001218960 -0.005869372 
        cg13         cg15         cg17         cg18 
-0.007449367  0.005066157  0.007900907 -0.002510744 

$intercept
Intercept 
    1.211 

attr(,"class")
[1] "methyl_surro"

Two Types of Missing Values

  • Missing Observations
    • probes present in target data
    • missing for some samples
  • Missing Probes
    • probes not present in target data
    • missing for all samples
      • removed during QC
      • not on chip

Two Types of Missing Values

     samp1 samp2 samp3 samp4 samp5
cg02 0.288    NA    NA 0.883    NA
cg07 0.900 0.246    NA 0.328 0.955
cg08 0.890 0.693 0.641 0.994 0.656
cg13 0.963 0.902 0.691 0.795 0.025
cg17 0.143 0.415 0.414 0.369 0.152
cg03    NA    NA    NA    NA    NA
cg06    NA    NA    NA    NA    NA
cg11    NA    NA    NA    NA    NA
cg15    NA    NA    NA    NA    NA
cg18    NA    NA    NA    NA    NA
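
The two kinds of missingness can be told apart with the per-probe proportion of NA values. The sketch below uses only base R and the rounded values from the table on this slide; the variable names are illustrative, not part of the package.

```r
# Rounded example values from the slide; rows are probes, columns are samples
methyl <- matrix(
  c(0.288, NA,    NA,    0.883, NA,
    0.900, 0.246, NA,    0.328, 0.955,
    0.890, 0.693, 0.641, 0.994, 0.656,
    0.963, 0.902, 0.691, 0.795, 0.025,
    0.143, 0.415, 0.414, 0.369, 0.152,
    rep(NA, 25)),
  nrow = 10, byrow = TRUE,
  dimnames = list(c("cg02", "cg07", "cg08", "cg13", "cg17",
                    "cg03", "cg06", "cg11", "cg15", "cg18"),
                  paste0("samp", 1:5)))

# Proportion of samples missing for each probe
prop_missing <- rowMeans(is.na(methyl))

# Missing observations: some, but not all, samples are NA
missing_obs <- prop_missing[prop_missing > 0 & prop_missing < 1]

# Missing probes: every sample is NA
missing_probes <- names(prop_missing)[prop_missing == 1]

# Share of all cells in the matrix that are NA
overall_missing <- mean(is.na(methyl))

print(missing_obs)
print(missing_probes)
print(overall_missing)
```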

Check for Missing Values

methyl_miss(methyl_surro)
  • methyl_surro: methyl_surro object created with surro_set()

Check for Missing Values

missing <- methyl_miss(methyl_surro = lin_surrogate)
print(missing)
Missing Data Summary for methyl_surro Object
============================================

Total probes: 10
Total samples: 5
Complete probes: 3 (30.0%)
Probes with missing observations: 2 (20.0%)
Completely missing probes: 5 (50.0%)
Overall missing rate: 58.0%

Probes with partial missing data:
cg02 cg07 
 0.6  0.2 

Completely missing probes:
cg03, cg06, cg11, cg15, cg18

Impute Missing Observations

impute_obs(methyl_surro,
           method = c("mean", "median"),
           min_nonmiss_prop = 0)
  • methyl_surro: methyl_surro object
  • method: Character string indicating the imputation method
    • Current options are “mean” or “median”
    • Currently developing KNN and weighted KNN options
  • min_nonmiss_prop: Optional minimum proportion of non-missing data required in a probe for imputation to proceed
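
As a rough illustration of row-wise imputation with a non-missing threshold, here is a base-R sketch. It is not the package's code: impute_rows_sketch() is a hypothetical name, and how it treats completely missing probes and the threshold comparison are assumptions.

```r
# Illustrative sketch only; impute_rows_sketch() is hypothetical.
impute_rows_sketch <- function(methyl, method = c("mean", "median"),
                               min_nonmiss_prop = 0) {
  method <- match.arg(method)
  center <- if (method == "mean") mean else median
  for (i in seq_len(nrow(methyl))) {
    observed <- !is.na(methyl[i, ])
    # Skip complete probes, completely missing probes (nothing to average),
    # and probes whose observed proportion is below the threshold
    if (all(observed) || !any(observed) ||
        mean(observed) < min_nonmiss_prop) next
    methyl[i, !observed] <- center(methyl[i, observed])
  }
  methyl
}

# Three probes taken from the example data in these slides
toy <- rbind(cg02 = c(0.2875775, NA, NA, 0.8830174, NA),
             cg07 = c(0.8998250, 0.2460877, NA, 0.3279207, 0.95450365),
             cg03 = rep(NA_real_, 5))
colnames(toy) <- paste0("samp", 1:5)

imputed <- impute_rows_sketch(toy, method = "mean")
strict  <- impute_rows_sketch(toy, method = "mean", min_nonmiss_prop = 0.5)

print(imputed)
print(strict)
```

With the threshold at 0.5, cg02 (2 of 5 values observed) is left untouched while cg07 (4 of 5 observed) is still imputed.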

Impute Missing Observations

lin_surrogate <- impute_obs(methyl_surro = lin_surrogate,
                            method = "mean",
                            min_nonmiss_prop = 0)
print(lin_surrogate)
$methyl
         samp1     samp2     samp3     samp4      samp5
cg02 0.2875775 0.5852975 0.5852975 0.8830174 0.58529746
cg07 0.8998250 0.2460877 0.6070843 0.3279207 0.95450365
cg08 0.8895393 0.6928034 0.6405068 0.9942698 0.65570580
cg13 0.9630242 0.9022990 0.6907053 0.7954674 0.02461368
cg17 0.1428000 0.4145463 0.4137243 0.3688455 0.15244475
cg03        NA        NA        NA        NA         NA
cg06        NA        NA        NA        NA         NA
cg11        NA        NA        NA        NA         NA
cg15        NA        NA        NA        NA         NA
cg18        NA        NA        NA        NA         NA

$weights
        cg02         cg03         cg06         cg07         cg08         cg11 
-0.009083377 -0.001155999  0.005978497 -0.007562015  0.001218960 -0.005869372 
        cg13         cg15         cg17         cg18 
-0.007449367  0.005066157  0.007900907 -0.002510744 

$intercept
Intercept 
    1.211 

attr(,"class")
[1] "methyl_surro"

Fill Missing Probes

reference_fill(
  methyl_surro,
  reference,
  type = c("probes", "obs", "all")
)
  • methyl_surro: methyl_surro object
  • reference: Named numeric vector of methylation reference values
  • type: Character string to identify which probes to fill
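
The sketch below shows one way the "probes" case could work in base R: every completely missing probe that has a reference value is filled with that value for all samples. fill_missing_probes_sketch() is a hypothetical helper, not the package function, and the "obs" and "all" options are not covered here.

```r
# Illustrative sketch only; fill_missing_probes_sketch() is hypothetical.
fill_missing_probes_sketch <- function(methyl, reference) {
  # Probes with no observed value in any sample
  all_missing <- rowSums(!is.na(methyl)) == 0
  # Only probes that also appear in the reference can be filled
  fillable <- rownames(methyl)[all_missing &
                                 rownames(methyl) %in% names(reference)]
  for (probe in fillable) {
    methyl[probe, ] <- reference[[probe]]  # same value for every sample
  }
  methyl
}

# Toy data: cg99 is a made-up probe with no reference value
toy <- rbind(cg17 = c(0.1428000, 0.4145463),
             cg03 = c(NA, NA),
             cg99 = c(NA, NA))
colnames(toy) <- c("samp1", "samp2")
ref <- c(cg03 = 0.4948262, cg17 = 0.2984722)

filled <- fill_missing_probes_sketch(toy, ref)
print(filled)
```

Observed values for cg17 are left alone, cg03 takes its reference mean, and cg99 stays NA because the reference has no entry for it.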

Fill Missing Probes

data("ref_df", package = "MethylSurroGetR")
print(ref_df)
          mean    median
cg01 0.3992451 0.3694889
cg02 0.6616689 0.7883051
cg03 0.4948262 0.5281055
cg04 0.5278895 0.4348927
cg05 0.6770353 0.7822943
cg06 0.5526592 0.5726334
cg07 0.4940793 0.3279207
cg08 0.7745650 0.6928034
cg09 0.5281033 0.6087350
cg10 0.4982656 0.4667790
cg11 0.4566024 0.5440660
cg12 0.3856340 0.4045103
cg13 0.6752219 0.7954674
cg14 0.5866269 0.6192565
cg15 0.4004940 0.3181810
cg16 0.4996738 0.3328235
cg17 0.2984722 0.3688455
cg18 0.3923206 0.2659726
cg19 0.3747138 0.3435165
cg20 0.7031609 0.7370777

# Extract the reference means as a named numeric vector
ref_vec_mean <- setNames(ref_df$mean, rownames(ref_df))

Fill Missing Probes

lin_surrogate <- reference_fill(methyl_surro = lin_surrogate,
                                reference = ref_vec_mean,
                                type = "probes")
print(lin_surrogate)
$methyl
         samp1     samp2     samp3     samp4      samp5
cg02 0.2875775 0.5852975 0.5852975 0.8830174 0.58529746
cg07 0.8998250 0.2460877 0.6070843 0.3279207 0.95450365
cg08 0.8895393 0.6928034 0.6405068 0.9942698 0.65570580
cg13 0.9630242 0.9022990 0.6907053 0.7954674 0.02461368
cg17 0.1428000 0.4145463 0.4137243 0.3688455 0.15244475
cg03 0.4948262 0.4948262 0.4948262 0.4948262 0.49482616
cg06 0.5526592 0.5526592 0.5526592 0.5526592 0.55265924
cg11 0.4566024 0.4566024 0.4566024 0.4566024 0.45660238
cg15 0.4004940 0.4004940 0.4004940 0.4004940 0.40049405
cg18 0.3923206 0.3923206 0.3923206 0.3923206 0.39232059

$weights
        cg02         cg03         cg06         cg07         cg08         cg11 
-0.009083377 -0.001155999  0.005978497 -0.007562015  0.001218960 -0.005869372 
        cg13         cg15         cg17         cg18 
-0.007449367  0.005066157  0.007900907 -0.002510744 

$intercept
Intercept 
    1.211 

attr(,"class")
[1] "methyl_surro"

Estimate Surrogate Values

surro_calc(methyl_surro,
           transform = c("linear", "count", "probability"))
  • methyl_surro: methyl_surro object
  • transform: Character string specifying the transformation to apply
    • "linear": For surrogates estimated with Gaussian models
    • "count": For surrogates estimated with Poisson models
    • "probability": For surrogates estimated with binomial models
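
The underlying arithmetic can be sketched in base R as a weighted sum of methylation values plus the intercept, followed by a transformation. surro_calc_sketch() is a hypothetical stand-in, and mapping "count" to exp() and "probability" to the inverse logit is an assumption based on the usual link functions for Poisson and binomial models, not something the slides confirm. The input values are samp1 after imputation and reference filling, taken from the printed output in these slides.

```r
# Illustrative sketch only; surro_calc_sketch() is hypothetical.
surro_calc_sketch <- function(methyl, weights, intercept = 0,
                              transform = c("linear", "count", "probability")) {
  transform <- match.arg(transform)
  # Linear predictor: intercept + sum over probes of weight * methylation
  eta <- intercept + colSums(methyl[names(weights), , drop = FALSE] * weights)
  switch(transform,
         linear      = eta,          # identity
         count       = exp(eta),     # assumed inverse of a log link
         probability = plogis(eta))  # assumed inverse of a logit link
}

weights <- c(cg02 = -0.009083377, cg03 = -0.001155999, cg06 = 0.005978497,
             cg07 = -0.007562015, cg08 = 0.001218960, cg11 = -0.005869372,
             cg13 = -0.007449367, cg15 = 0.005066157, cg17 = 0.007900907,
             cg18 = -0.002510744)

# samp1 after mean imputation and reference filling (values from the slides)
samp1 <- matrix(c(0.2875775, 0.4948262, 0.5526592, 0.8998250, 0.8895393,
                  0.4566024, 0.9630242, 0.4004940, 0.1428000, 0.3923206),
                ncol = 1, dimnames = list(names(weights), "samp1"))

est_linear <- surro_calc_sketch(samp1, weights, intercept = 1.211,
                                transform = "linear")
print(est_linear)
```

For samp1 this gives approximately 1.1977, in line with the linear estimate printed on the next slide.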

Estimate Surrogate Values

estimates <- surro_calc(methyl_surro = lin_surrogate,
                        transform = "linear")
print(estimates)
   samp1    samp2    samp3    samp4    samp5 
1.197718 1.202317 1.201093 1.199796 1.201382 

Piping Commands

estimates <- beta_matrix_miss |>
  surro_set(weights = wts_vec_lin, intercept = "Intercept") |>
  impute_obs(method = "mean") |>
  reference_fill(reference = ref_vec_mean, type = "probes") |>
  surro_calc(transform = "linear")

print(estimates)
   samp1    samp2    samp3    samp4    samp5 
1.197718 1.202317 1.201093 1.199796 1.201382 

Thank You


  • Text-based version of this tutorial is available HERE.
  • Please feel free to reach out with any questions.