vignettes/OpenGWAS.Rmd
MungeSumstats now offers high-throughput query and import functionality for data from the MRC IEU Open GWAS Project.
#### Search for datasets ####
metagwas <- MungeSumstats::find_sumstats(traits = c("parkinson", "alzheimer"),
                                         min_sample_size = 1000)
head(metagwas, 3)
# Collect the dataset IDs, ordered from fewest to most SNPs
ids <- (dplyr::arrange(metagwas, nsnp))$id
## id trait group_name year author
## 1 ieu-a-298 Alzheimer's disease public 2013 Lambert
## 2 ieu-b-2 Alzheimer's disease public 2019 Kunkle BW
## 3 ieu-a-297 Alzheimer's disease public 2013 Lambert
## consortium
## 1 IGAP
## 2 Alzheimer Disease Genetics Consortium (ADGC), European Alzheimer's Disease Initiative (EADI), Cohorts for Heart and Aging Research in Genomic Epidemiology Consortium (CHARGE), Genetic and Environmental Risk in AD/Defining Genetic, Polygenic and Environmental Risk for Alzheimer's Disease Consortium (GERAD/PERADES),
## 3 IGAP
## sex population unit nsnp sample_size build
## 1 Males and Females European log odds 11633 74046 HG19/GRCh37
## 2 Males and Females European NA 10528610 63926 HG19/GRCh37
## 3 Males and Females European log odds 7055882 54162 HG19/GRCh37
## category subcategory ontology mr priority pmid sd
## 1 Disease Psychiatric / neurological NA 1 1 24162737 NA
## 2 Binary Psychiatric / neurological NA 1 0 30820047 NA
## 3 Disease Psychiatric / neurological NA 1 2 24162737 NA
## note ncase
## 1 Exposure only; Effect allele frequencies are missing; forward(+) strand 25580
## 2 NA 21982
## 3 Effect allele frequencies are missing; forward(+) strand 17008
## ncontrol N
## 1 48466 74046
## 2 41944 63926
## 3 37154 54162
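The ordering step used to build ids can be checked offline. The following is a minimal sketch, not part of the package: it rebuilds a toy version of the metadata from the three rows printed above (only the id, nsnp and sample_size columns) and sorts it with base R instead of dplyr. The object names toy_meta and toy_ids are hypothetical.

```r
# Toy metadata copied from the three rows printed above (subset of columns).
toy_meta <- data.frame(
  id = c("ieu-a-298", "ieu-b-2", "ieu-a-297"),
  nsnp = c(11633, 10528610, 7055882),
  sample_size = c(74046, 63926, 54162),
  stringsAsFactors = FALSE
)

# Base-R equivalent of dplyr::arrange(toy_meta, nsnp)$id:
# order rows by ascending SNP count, then pull out the IDs.
toy_ids <- toy_meta$id[order(toy_meta$nsnp)]
print(toy_ids)
```

Sorting by nsnp puts the smallest datasets first, which is convenient when testing an import on a few IDs before committing to the large ones.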
You can supply import_sumstats() with a list of as many Open GWAS IDs as you want, but we’ll just give one to save time.
datasets <- MungeSumstats::import_sumstats(ids = "ieu-a-298",
                                           ref_genome = "GRCH37")
By default, import_sumstats returns a named list where the names are the Open GWAS dataset IDs and the items are the respective paths to the formatted summary statistics.
print(datasets)
## $`ieu-a-298`
## [1] "/tmp/RtmpI3qdKJ/ieu-a-298.tsv.gz"
You can easily turn this into a data.frame as well.
results_df <- data.frame(id=names(datasets),
path=unlist(datasets))
print(results_df)
## id path
## ieu-a-298 ieu-a-298 /tmp/RtmpI3qdKJ/ieu-a-298.tsv.gz
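Because each list item is just a file path, the formatted summary statistics can be read back with any tabular reader. The sketch below is an illustration under stated assumptions rather than package behaviour: it writes a tiny stand-in file to a temporary directory (the columns SNP, CHR, BP and P and their values are placeholders made up for the demo, not real output), stores its path in a mock list shaped like the one above, and reads it with base R. All object names are hypothetical.

```r
# Create a tiny gzipped TSV that stands in for a formatted sumstats file.
# Column names and values are illustrative placeholders, not real output.
mock_path <- file.path(tempdir(), "mock-id.tsv.gz")
mock_table <- data.frame(SNP = c("rs1", "rs2"),
                         CHR = c(1L, 2L),
                         BP = c(100L, 200L),
                         P = c(0.5, 0.01))
con <- gzfile(mock_path, open = "wt")
write.table(mock_table, con, sep = "\t", quote = FALSE, row.names = FALSE)
close(con)

# Mock of the named list described above: names are IDs, items are paths.
mock_datasets <- list("mock-id" = mock_path)

# Read every file that exists on disk; the dataset IDs are kept as names.
existing <- Filter(file.exists, mock_datasets)
sumstats_list <- lapply(existing, function(p) {
  read.delim(p, stringsAsFactors = FALSE)  # gzip is detected automatically
})
str(sumstats_list[["mock-id"]])
```

For full-size files a faster reader such as data.table::fread() is likely a better fit; base R is used here only to keep the sketch dependency-free.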
Optional: Speed up with multi-threaded download via axel.
datasets <- MungeSumstats::import_sumstats(ids = ids,
                                           vcf_download = TRUE,
                                           download_method = "axel",
                                           nThread = max(2, future::availableCores() - 2))
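axel is an external command-line downloader, so it has to be installed separately. As a hedged sketch, one might pick the arguments before calling import_sumstats() along these lines: check for the binary with Sys.which(), and use parallel::detectCores() in place of future::availableCores() to avoid an extra dependency. The fallback value "download.file" is an assumption about the single-threaded option; check ?MungeSumstats::import_sumstats before relying on it. Both helper functions are hypothetical and not part of MungeSumstats.

```r
# Hypothetical helper: choose a download method based on what is installed.
# "download.file" as the fallback value is an assumption (see text above).
choose_download_method <- function() {
  if (nzchar(Sys.which("axel"))) "axel" else "download.file"
}

# Hypothetical helper: leave two cores free but never go below two threads,
# mirroring max(2, availableCores() - 2) in the call above.
choose_n_threads <- function(cores = parallel::detectCores()) {
  if (is.na(cores)) cores <- 2L  # detectCores() can return NA
  max(2L, as.integer(cores) - 2L)
}

method <- choose_download_method()
n_threads <- choose_n_threads()
message("download_method = ", method, "; nThread = ", n_threads)
```

The two values could then be passed as download_method and nThread, so the same script runs on machines with and without axel.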
See the Getting started vignette for more information on how to use MungeSumstats and its functionality.
utils::sessionInfo()
## R Under development (unstable) (2022-12-07 r83413)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MungeSumstats_1.7.10 BiocStyle_2.27.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 dplyr_1.0.10
## [3] blob_1.2.3 filelock_1.0.2
## [5] R.utils_2.12.2 Biostrings_2.67.0
## [7] bitops_1.0-7 fastmap_1.1.0
## [9] RCurl_1.98-1.9 BiocFileCache_2.7.1
## [11] VariantAnnotation_1.45.0 GenomicAlignments_1.35.0
## [13] XML_3.99-0.13 digest_0.6.31
## [15] lifecycle_1.0.3 ellipsis_0.3.2
## [17] KEGGREST_1.39.0 RSQLite_2.2.19
## [19] googleAuthR_2.0.0 magrittr_2.0.3
## [21] compiler_4.3.0 rlang_1.0.6
## [23] sass_0.4.4 progress_1.2.2
## [25] tools_4.3.0 utf8_1.2.2
## [27] yaml_2.3.6 data.table_1.14.6
## [29] rtracklayer_1.59.0 knitr_1.41
## [31] prettyunits_1.1.1 curl_4.3.3
## [33] bit_4.0.5 DelayedArray_0.25.0
## [35] xml2_1.3.3 BiocParallel_1.33.6
## [37] purrr_0.3.5 BiocGenerics_0.45.0
## [39] desc_1.4.2 R.oo_1.25.0
## [41] grid_4.3.0 stats4_4.3.0
## [43] fansi_1.0.3 biomaRt_2.55.0
## [45] SummarizedExperiment_1.29.1 cli_3.4.1
## [47] rmarkdown_2.18 crayon_1.5.2
## [49] generics_0.1.3 ragg_1.2.4
## [51] httr_1.4.4 rjson_0.2.21
## [53] DBI_1.1.3 cachem_1.0.6
## [55] stringr_1.5.0 zlibbioc_1.45.0
## [57] assertthat_0.2.1 parallel_4.3.0
## [59] AnnotationDbi_1.61.0 BiocManager_1.30.19
## [61] XVector_0.39.0 restfulr_0.0.15
## [63] matrixStats_0.63.0 vctrs_0.5.1
## [65] Matrix_1.5-3 jsonlite_1.8.4
## [67] bookdown_0.30 IRanges_2.33.0
## [69] hms_1.1.2 S4Vectors_0.37.3
## [71] bit64_4.0.5 systemfonts_1.0.4
## [73] GenomicFeatures_1.51.2 jquerylib_0.1.4
## [75] glue_1.6.2 pkgdown_2.0.6.9000
## [77] codetools_0.2-18 stringi_1.7.8
## [79] GenomeInfoDb_1.35.5 BiocIO_1.9.1
## [81] GenomicRanges_1.51.3 tibble_3.1.8
## [83] pillar_1.8.1 rappdirs_0.3.3
## [85] htmltools_0.5.4 GenomeInfoDbData_1.2.9
## [87] BSgenome_1.67.1 dbplyr_2.2.1
## [89] R6_2.5.1 textshaping_0.3.6
## [91] rprojroot_2.0.3 evaluate_0.18
## [93] Biobase_2.59.0 lattice_0.20-45
## [95] R.methodsS3_1.8.2 png_0.1-8
## [97] Rsamtools_2.15.0 gargle_1.2.1
## [99] memoise_2.0.1 bslib_0.4.1
## [101] Rcpp_1.0.9 xfun_0.35
## [103] fs_1.5.2 MatrixGenerics_1.11.0
## [105] pkgconfig_2.0.3