quant_summary()
The Importance of Checking Data
This section is currently under construction and will be completed soon.
Quant Summary Function
- All code blocks on this page can be copied by clicking the copy icon in the upper right corner.
- Note that some code and output blocks may scroll left and right.
- As with all content on my site, please feel free to reach out if you have any questions.
The Function
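quant_summmary <- function(df){
  # apply() works column by column; each column must yield 11 values
  summary <- apply(df, 2, function(var){
    if (sum(!is.na(var)) == 0) {
      # All values missing: 9 NAs (mean, sd, 7 quantiles) plus the two counts
      c(rep(NA, times = 9), sum(is.na(var)), sum(!is.na(var)))
    } else {
      c(mean(var, na.rm = TRUE), sd(var, na.rm = TRUE),
        quantile(var, c(0, 0.05, 0.25, 0.50, 0.75, 0.95, 1.00), na.rm = TRUE),
        sum(is.na(var)), sum(!is.na(var)))
    }
  })
  # Transpose so each variable is a row, then label the statistics
  summary <- data.frame(t(summary))
  colnames(summary) <- c("mean", "sd", "min", "q05", "q25", "q50", "q75", "q95", "max", "miss", "nonmiss")
  # Correlations rounded to 3 decimals; cor() defaults do not drop missing values
  corrs <- data.frame(round(cor(df), digits = 3))
  return(list(summary = summary, corrs = corrs))
}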
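To see why the function transposes the result of apply(), here is a minimal sketch of the same pattern with only two statistics. The object names (toy, stats_by_column, stats_by_row) are made up for illustration, and the built-in mtcars data is used only as a convenient example.

```r
# Illustrative only: a two-statistic version of the pattern used in the function
toy <- mtcars[, c("mpg", "wt")]

# apply(..., 2, f) calls f on each column and binds the results as columns,
# so the result has one row per statistic and one column per variable
stats_by_column <- apply(toy, 2, function(var) c(mean(var), sd(var)))
dim(stats_by_column)

# Transposing gives one row per variable, ready for column labels
stats_by_row <- data.frame(t(stats_by_column))
colnames(stats_by_row) <- c("mean", "sd")
stats_by_row
```

Note that apply() converts the data frame to a matrix first, so this pattern is best suited to data frames that contain only numeric columns.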
Examples
Basic Approach
We’re loading the dplyr package here for data management only; it is not required for the quant_summary() function.
Because the function is written for a data frame, we’re using the select() function in dplyr to select the variables we want.
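The two steps above could look something like the following. This is a sketch under two assumptions: that the example data is the built-in mtcars dataset (inferred from the variable names and values in the output on this page), and that the selected data is stored as subset_data (the name used in the Modified Approach code).

```r
# dplyr is used only for select(); the summary function itself is base R
library(dplyr)

# Assumption: the example data is the built-in mtcars dataset
subset_data <- mtcars |>
  select(mpg, hp, disp, wt)

# Quick look at what was kept
str(subset_data)
```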
We can run the function to create a new object. Although it’s incredibly unoriginal, we’re calling it data_summary.
Because our new object is stored as a list, we can access the summary statistics in the summary element.
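A sketch of the calls behind these two steps is below. So that the snippet runs on its own, it repeats the function from "The Function" section (keeping the three-m spelling used in the code there) and assumes the data is the built-in mtcars dataset stored as subset_data.

```r
library(dplyr)

# Copy of the function from "The Function" section, repeated only so this
# snippet is self-contained. The all-missing branch pads with 9 NAs here so
# that both branches return 11 values.
quant_summmary <- function(df){
  summary <- apply(df, 2, function(var){
    if (sum(!is.na(var)) == 0) {
      c(rep(NA, times = 9), sum(is.na(var)), sum(!is.na(var)))
    } else {
      c(mean(var, na.rm = TRUE), sd(var, na.rm = TRUE),
        quantile(var, c(0, 0.05, 0.25, 0.50, 0.75, 0.95, 1.00), na.rm = TRUE),
        sum(is.na(var)), sum(!is.na(var)))
    }
  })
  summary <- data.frame(t(summary))
  colnames(summary) <- c("mean", "sd", "min", "q05", "q25", "q50", "q75", "q95", "max", "miss", "nonmiss")
  corrs <- data.frame(round(cor(df), digits = 3))
  return(list(summary = summary, corrs = corrs))
}

# Assumption: mtcars is the source data, inferred from the output on this page
subset_data <- mtcars |>
  select(mpg, hp, disp, wt)

# Store the result; it is a list with two elements
data_summary <- quant_summmary(subset_data)
names(data_summary)

# Summary statistics, one row per variable
data_summary$summary
```

The same pattern with $corrs returns the correlation table, and either element can also be read straight from the call, for example quant_summmary(subset_data)$summary.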
mean sd min q05 q25 q50 q75 q95
mpg 20.09062 6.0269481 10.400 11.995 15.42500 19.200 22.80 31.30000
hp 146.68750 68.5628685 52.000 63.650 96.50000 123.000 180.00 253.55000
disp 230.72188 123.9386938 71.100 77.350 120.82500 196.300 326.00 449.00000
wt 3.21725 0.9784574 1.513 1.736 2.58125 3.325 3.61 5.29275
max miss nonmiss
mpg 33.900 0 32
hp 335.000 0 32
disp 472.000 0 32
wt 5.424 0 32
Similarly, we can access the correlation matrix in the corrs element.
mpg hp disp wt
mpg 1.000 -0.776 -0.848 -0.868
hp -0.776 1.000 0.791 0.659
disp -0.848 0.791 1.000 0.888
wt -0.868 0.659 0.888 1.000
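One thing to keep in mind when reading the correlations: the function calls cor(df) with its defaults, and base R's cor() defaults to use = "everything". A missing value in a variable therefore makes that variable's correlations with the other variables NA, even though the summary statistics are computed with na.rm = TRUE. The sketch below shows this on a small made-up data frame (toy_data is hypothetical) and shows use = "pairwise.complete.obs" as one possible alternative; whether that option is appropriate depends on the analysis.

```r
# Hypothetical toy data with one missing value in x
toy_data <- data.frame(
  x = c(1, 2, NA, 4, 5),
  y = c(2, 4, 6, 8, 10),
  z = c(5, 3, 4, 1, 2)
)

# Default behavior (use = "everything"): correlations between x and the
# other variables are NA because x contains a missing value
default_corrs <- round(cor(toy_data), digits = 3)

# One alternative: for each pair of variables, use the rows where both are observed
pairwise_corrs <- round(cor(toy_data, use = "pairwise.complete.obs"), digits = 3)

print(default_corrs)
print(pairwise_corrs)
```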
We could also access these elements without creating an object.
mean sd min q05 q25 q50 q75 q95
mpg 20.09062 6.0269481 10.400 11.995 15.42500 19.200 22.80 31.30000
hp 146.68750 68.5628685 52.000 63.650 96.50000 123.000 180.00 253.55000
disp 230.72188 123.9386938 71.100 77.350 120.82500 196.300 326.00 449.00000
wt 3.21725 0.9784574 1.513 1.736 2.58125 3.325 3.61 5.29275
max miss nonmiss
mpg 33.900 0 32
hp 335.000 0 32
disp 472.000 0 32
wt 5.424 0 32
mpg hp disp wt
mpg 1.000 -0.776 -0.848 -0.868
hp -0.776 1.000 0.791 0.659
disp -0.848 0.791 1.000 0.888
wt -0.868 0.659 0.888 1.000
Modified Approach
If we’re feeling fancy, we can combine the summary statistics and correlations into a single table. Here we’re using the rename_with() function in dplyr to rename the correlation columns as {var}_r.
cbind(
  quant_summmary(subset_data)$summary,
  quant_summmary(subset_data)$corrs |>
    rename_with(~ paste0(.x, "_r"))
)
mean sd min q05 q25 q50 q75 q95
mpg 20.09062 6.0269481 10.400 11.995 15.42500 19.200 22.80 31.30000
hp 146.68750 68.5628685 52.000 63.650 96.50000 123.000 180.00 253.55000
disp 230.72188 123.9386938 71.100 77.350 120.82500 196.300 326.00 449.00000
wt 3.21725 0.9784574 1.513 1.736 2.58125 3.325 3.61 5.29275
max miss nonmiss mpg_r hp_r disp_r wt_r
mpg 33.900 0 32 1.000 -0.776 -0.848 -0.868
hp 335.000 0 32 -0.776 1.000 0.791 0.659
disp 472.000 0 32 -0.848 0.791 1.000 0.888
wt 5.424 0 32 -0.868 0.659 0.888 1.000