Infers the genome build (GRCh37 or GRCh38) of one or more summary statistics files from the data, using the SNP (RSID), CHR and BP columns.
get_genome_builds(
sumstats_list,
header_only = TRUE,
sampled_snps = 10000,
names_from_paths = FALSE,
dbSNP = 155,
nThread = 1
)
sumstats_list: A named list of paths to summary statistics, or a named list of data.table objects.
header_only: Instead of reading in the entire sumstats file, only read in the first N rows, where N = sampled_snps. This should speed up cases where you have to read in sumstats from disk each time.
sampled_snps: Downsample the number of SNPs used when inferring genome build to save time.
names_from_paths: Infer the name of each item in sumstats_list from its respective file path. Only works if sumstats_list is a list of paths.
dbSNP: Version of dbSNP to be used (144 or 155). Default is 155.
nThread: Number of threads to use for parallel processes.
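The header_only and sampled_snps arguments limit how much of each file is read. The idea can be sketched in base R as follows, assuming a tab-delimited file. The temporary file, its made-up rs IDs and the names toy, tmp, n_rows and head_only are all hypothetical, and this is not the package's own reader:

```r
# Hypothetical toy sumstats table (made-up values) written to a temp file
toy <- data.frame(
    SNP = paste0("rs", 1:50),
    CHR = rep(1:5, each = 10),
    BP = seq(1000, by = 100, length.out = 50)
)
tmp <- tempfile(fileext = ".tsv")
write.table(toy, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Read only the first n_rows data rows rather than the whole file
n_rows <- 10
head_only <- read.table(tmp, header = TRUE, sep = "\t", nrows = n_rows)
nrow(head_only)
```

Reading a fixed number of leading rows keeps the cost roughly constant regardless of file size, which is why it helps most when the same large file is read from disk repeatedly.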
ref_genome: the inferred genome build of the data (GRCh37 or GRCh38).
Iterative version of get_genome_build.
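"Iterative" here means that each item of sumstats_list is handled on its own and the results are collected under the item names. A minimal sketch of that pattern follows. It assumes a hypothetical stand-in function infer_build() in place of the real per-file inference, hypothetical file paths, and a naming rule (file name with extensions stripped) that may differ from what names_from_paths actually does:

```r
# Hypothetical paths; the files need not exist for this sketch
paths <- c("/data/gwas/height.tsv.gz", "/data/gwas/bmi.txt")

# Assumption: name each item after its file name, dropping all extensions
item_names <- sub("\\..*$", "", basename(paths))
sumstats_list <- setNames(as.list(paths), item_names)

# Hypothetical stand-in for the per-file inference; returns no real result
infer_build <- function(path) NA_character_

# Apply the per-file step to every item, keeping the names
builds <- lapply(sumstats_list, infer_build)
names(builds)
```

Because lapply preserves names, each result can be traced back to the file it came from.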
# Get the path to the Educational Attainment (Okbay) sumstats file bundled with the package
eduAttainOkbayPth <- system.file("extdata", "eduAttainOkbay.txt",
    package = "MungeSumstats"
)
sumstats_list <- list(ss1 = eduAttainOkbayPth, ss2 = eduAttainOkbayPth)
## By default the call loads a reference genome that needs more than 2GB of memory,
## which is more than 32-bit Windows can handle, so skip the call on that platform
is_32bit_windows <-
    .Platform$OS.type == "windows" && .Platform$r_arch == "i386"
if (!is_32bit_windows) {
    # multiple sumstats can be passed at once to get all their genome builds:
    # ref_genomes <- get_genome_builds(sumstats_list = sumstats_list)
    # just passing first here for speed
    sumstats_list_quick <- list(ss1 = eduAttainOkbayPth)
    ref_genomes <- get_genome_builds(sumstats_list = sumstats_list_quick,
                                     dbSNP = 144)
}
#> Inferring genome build of 1 sumstats file(s).
#> Inferring genome build.
#> Reading in only the first 10000 rows of sumstats.
#> Importing tabular file: /__w/_temp/Library/MungeSumstats/extdata/eduAttainOkbay.txt
#> Checking for empty columns.
#> Standardising column headers.
#> First line of summary statistics file:
#> MarkerName CHR POS A1 A2 EAF Beta SE Pval
#> Loading SNPlocs data.
#> Loading reference genome data.
#> Preprocessing RSIDs.
#> Validating RSIDs of 93 SNPs using BSgenome::snpsById...
#> BSgenome::snpsById done in 12 seconds.
#> Loading SNPlocs data.
#> Loading reference genome data.
#> Preprocessing RSIDs.
#> Validating RSIDs of 93 SNPs using BSgenome::snpsById...
#> BSgenome::snpsById done in 42 seconds.
#> Inferred genome build: GRCH37
#> Time difference of 55.81949 secs
#> GRCH37: 1 file(s)
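The last line of the output reports how many files were assigned to each build. A similar tally can be made in base R. The sketch below uses a hand-written list, ref_genomes_demo, that mirrors the run above, on the assumption that the result is a named collection of build strings:

```r
# Hand-written stand-in mirroring the run above (assumed structure)
ref_genomes_demo <- list(ss1 = "GRCH37")

# Count how many files were assigned to each inferred build
build_counts <- table(unlist(ref_genomes_demo))
build_counts[["GRCH37"]]
```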