Read in a VCF file as a VCF or a data.table. Can optionally save the VCF/data.table as well.
read_vcf(
  path,
  as_datatable = TRUE,
  save_path = NULL,
  tabix_index = FALSE,
  samples = 1,
  which = NULL,
  use_params = TRUE,
  sampled_rows = 10000L,
  download = TRUE,
  vcf_dir = tempdir(),
  download_method = "download.file",
  force_new = FALSE,
  mt_thresh = 100000L,
  nThread = 1,
  verbose = TRUE
)
path
Path to local or remote VCF file.
as_datatable
Return the data as a data.table (default: TRUE) or a VCF (FALSE).
save_path
File path to save formatted data. Defaults to tempfile(fileext=".tsv.gz").
tabix_index
Index the formatted summary statistics with tabix for fast querying.
samples
Which samples to use:
1 : Only the first sample will be used (DEFAULT).
NULL : All samples will be used.
c("<sample_id1>","<sample_id2>",...) : Only user-selected samples will be used (case-insensitive).
which
A GRanges describing the sequences and ranges to be queried. Variants whose POS lies in the interval(s) [start, end] are returned. If which is not specified, all ranges are returned.
use_params
When TRUE (default), increases the speed of reading in the VCF by omitting columns that are empty (NAs only) based on the head of the VCF. NOTE that this requires the VCF to be sorted, bgzip-compressed, and tabix-indexed, which read_vcf will attempt to do.
sampled_rows
First N rows to sample when determining whether columns are empty. Set NULL to use the full sumstats_file.
download
Download the VCF (and its index file) to a temp folder before reading it into R. It is important to keep this TRUE when nThread>1 to avoid making too many queries to the remote file.
vcf_dir
Directory in which to save the original VCF downloaded from Open GWAS. WARNING: This is set to tempdir() by default. This means the raw (pre-formatted) VCFs will be deleted upon ending the R session. Change this to keep the raw VCF file on disk (e.g. vcf_dir="./raw_vcf").
download_method
"axel" (multi-threaded) or "download.file" (single-threaded).
force_new
If a formatted file with the same name as save_path exists, formatting will be skipped and this file will be imported instead (default). Set force_new=TRUE to override this.
mt_thresh
When the number of rows (variants) in the VCF is < mt_thresh, only use single-threading for reading in the VCF. This is because the overhead of parallelisation outweighs the speed benefits when VCFs are small.
nThread
Number of threads to use for parallel processes.
verbose
Print messages.
The VCF file in data.table format.
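
The mt_thresh rule described in the arguments can be pictured with a few lines of base R. This is only a sketch of the documented behaviour, not the package's internal code: choose_threads is a hypothetical helper introduced here for illustration, and the comparison is assumed to be a strict "less than", as the argument description states.

```r
# Hypothetical helper (not part of MungeSumstats): mirrors the documented rule
# that VCFs with fewer than mt_thresh variants are read single-threaded.
choose_threads <- function(n_variants, nThread = 1, mt_thresh = 100000L) {
  if (n_variants < mt_thresh) {
    # Small VCF: parallelisation overhead would outweigh any speed-up.
    1L
  } else {
    # Large VCF: use the number of threads the user asked for.
    as.integer(nThread)
  }
}

small <- choose_threads(101, nThread = 4)     # below the threshold
large <- choose_threads(250000, nThread = 4)  # above the threshold
print(c(small = small, large = large))
```

Under this reading, requesting nThread > 1 for a small file such as the bundled example VCF would still result in a single-threaded read.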
#### Local file ####
path <- system.file("extdata","ALSvcf.vcf", package="MungeSumstats")
sumstats_dt <- read_vcf(path = path)
#> Loading required namespace: GenomicFiles
#> Using local VCF.
#> bgzip-compressing VCF file.
#> Finding empty VCF columns based on first 10,000 rows.
#> Dropping 1 duplicate column(s).
#> 1 sample detected: EBI-a-GCST005647
#> Constructing ScanVcfParam object.
#> VCF contains: 39,630,630 variant(s) x 1 sample(s)
#> Reading VCF file: single-threaded
#> Converting VCF to data.table.
#> Expanding VCF first, so number of rows may increase.
#> Dropping 1 duplicate column(s).
#> Checking for empty columns.
#> Unlisting 3 columns.
#> Dropped 314 duplicate rows.
#> Time difference of 0.1 secs
#> VCF data.table contains: 101 rows x 11 columns.
#> Time difference of 0.6 secs
#> Renaming ID as SNP.
#> VCF file has -log10 P-values; these will be converted to unadjusted p-values in the 'P' column.
#> No INFO (SI) column detected.
#### Remote file ####
## Small GWAS (0.2Mb)
# path <- "https://gwas.mrcieu.ac.uk/files/ieu-a-298/ieu-a-298.vcf.gz"
# sumstats_dt2 <- read_vcf(path = path)
## Large GWAS (250Mb)
# path <- "https://gwas.mrcieu.ac.uk/files/ubm-a-2929/ubm-a-2929.vcf.gz"
# sumstats_dt3 <- read_vcf(path = path, nThread=11)
## Very large GWAS (500Mb)
# path <- "https://gwas.mrcieu.ac.uk/files/ieu-a-1124/ieu-a-1124.vcf.gz"
# sumstats_dt4 <- read_vcf(path = path, nThread=11)
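
The local-file example reports that -log10 P-values are converted to unadjusted p-values in the 'P' column. The arithmetic behind that message can be sketched in base R as follows; the values are made up for illustration, and the variable name lp (standing for the LP field of a GWAS-VCF) is an assumption rather than something taken from the output above.

```r
# Made-up -log10 P-values for illustration (assumed to correspond to the
# LP field of a GWAS-VCF; not read from any real file).
lp <- c(0, 1, 2, 7.30103)

# Invert the -log10 transform to recover unadjusted p-values.
p <- 10^(-lp)

print(data.frame(LP = lp, P = p))
```

Larger LP values correspond to smaller p-values, so an LP of about 7.3 sits near the conventional genome-wide significance threshold of 5e-8.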