R/format_sumstats.R
format_sumstats.Rd
Check that summary statistics from GWAS are in a homogeneous format
format_sumstats(
path,
ref_genome = NULL,
convert_ref_genome = NULL,
chain_source = "ensembl",
convert_small_p = TRUE,
convert_large_p = TRUE,
convert_neg_p = TRUE,
compute_z = FALSE,
force_new_z = FALSE,
compute_n = 0L,
convert_n_int = TRUE,
impute_beta = FALSE,
es_is_beta = TRUE,
impute_se = FALSE,
analysis_trait = NULL,
INFO_filter = 0.9,
FRQ_filter = 0,
pos_se = TRUE,
effect_columns_nonzero = FALSE,
N_std = 5,
N_dropNA = TRUE,
rmv_chr = c("X", "Y", "MT"),
rmv_chrPrefix = TRUE,
on_ref_genome = TRUE,
strand_ambig_filter = FALSE,
allele_flip_check = TRUE,
allele_flip_drop = TRUE,
allele_flip_z = TRUE,
allele_flip_frq = TRUE,
bi_allelic_filter = TRUE,
snp_ids_are_rs_ids = TRUE,
remove_multi_rs_snp = FALSE,
frq_is_maf = TRUE,
indels = TRUE,
dbSNP = 155,
check_dups = TRUE,
sort_coordinates = TRUE,
nThread = 1,
save_path = tempfile(fileext = ".tsv.gz"),
write_vcf = FALSE,
tabix_index = FALSE,
return_data = FALSE,
return_format = "data.table",
ldsc_format = FALSE,
save_format = NULL,
log_folder_ind = FALSE,
log_mungesumstats_msgs = FALSE,
log_folder = tempdir(),
imputation_ind = FALSE,
force_new = FALSE,
mapping_file = sumstatsColHeaders
)
Filepath for the summary statistics file to be formatted. A dataframe or datatable of the summary statistics file can also be passed directly to MungeSumstats using the path parameter.
name of the reference genome used for the GWAS ("GRCh37" or "GRCh38"). Argument is case-insensitive. Default is NULL which infers the reference genome from the data.
name of the reference genome to convert to ("GRCh37" or "GRCh38"). This will only occur if the current genome build does not match. Default is not to convert the genome build (NULL).
source of the chain file to use in liftover, if converting genome build ("ucsc" or "ensembl"). Note that the UCSC chain files require a license for commercial use. The Ensembl chain is used by default ("ensembl").
Binary, should non-negative p-values <= 5e-324 be converted to 0? Small p-values pass the R limit and can cause errors with LDSC/MAGMA and should be converted. Default is TRUE.
Binary, should p-values >1 be converted to 1? P-values >1 should not be possible and can cause errors with LDSC/MAGMA and should be converted. Default is TRUE.
Binary, should p-values <0 be converted to 0? Negative p-values should not be possible and can cause errors with LDSC/MAGMA and should be converted. Default is TRUE.
Whether to compute Z-score column. Default is FALSE. This can be computed from Beta and SE with (Beta/SE) or P (Z:=sign(BETA)*sqrt(stats::qchisq(P,1,lower=FALSE))). Note that imputing the Z-score from P for every SNP will not be perfectly correct and may result in a loss of power. This should only be done as a last resort. Use 'BETA' to impute by BETA/SE and 'P' to impute by SNP p-value.
When a "Z" column already exists, it will be used by
default. To override and compute a new Z-score column from P set
force_new_z=TRUE
.
Whether to impute N. Default of 0 won't impute, any other integer will be imputed as the N (sample size) for every SNP in the dataset. Note that imputing the sample size for every SNP is not correct and should only be done as a last resort. N can also be inputted with "ldsc", "sum", "giant" or "metal" by passing one of these for this field or a vector of multiple. Sum and an integer value creates an N column in the output whereas giant, metal or ldsc create an Neff or effective sample size. If multiples are passed, the formula used to derive it will be indicated.
Binary, if N (the number of samples) is not an integer, should this be rounded? Default is TRUE.
Binary, whether BETA should be imputed using other effect data if it isn't present in the sumstats. Note that this imputation is an approximation (for Z & SE approach) so could have an effect on downstream analysis. Use with caution. The different methods MungeSumstats will try and impute beta (in this order or priority) are:
log(OR) 2. Z x SE Default value is FALSE.
Binary, whether to map ES to BETA. We take BETA to be any BETA-like value (including Effect Size). If this is not the case for your sumstats, change this to FALSE. Default is TRUE.
Binary, whether the standard error should be imputed using other effect data if it isn't present in the sumstats. Note that this imputation is an approximation so could have an effect on downstream analysis. Use with caution. The different methods MungeSumstats will try and impute se (in this order or priority) are:
BETA / Z 2. abs(BETA/ qnorm(P/2)) Default is FALSE.
If multiple traits were studied, name of the trait for analysis from the GWAS. Default is NULL.
numeric The minimum value permissible of the imputation information score (if present in sumstats file). Default 0.9.
numeric The minimum value permissible of the frequency(FRQ) of the SNP (i.e. Allele Frequency (AF)) (if present in sumstats file). By default no filtering is done, i.e. value of 0.
Binary Should the standard Error (SE) column be checked to ensure it is greater than 0? Those that are, are removed (if present in sumstats file). Default TRUE.
Binary should the effect columns in the data BETA,OR (odds ratio),LOG_ODDS,SIGNED_SUMSTAT be checked to ensure no SNP=0. Those that do are removed(if present in sumstats file). Default FALSE.
numeric The number of standard deviations above the mean a SNP's N is needed to be removed. Default is 5.
Drop rows where N is missing.Default is TRUE.
vector or character The chromosomes on which the SNPs should be removed. Use NULL if no filtering necessary. Default is X, Y and mitochondrial.
Remove "chr" or "CHR" from chromosome names. Default is TRUE.
Binary Should a check take place that all SNPs are on the reference genome by SNP ID. Default is TRUE.
Binary Should SNPs with strand-ambiguous alleles be removed. Default is FALSE.
Binary Should the allele columns be checked against reference genome to infer if flipping is necessary. Default is TRUE.
Binary Should the SNPs for which neither their A1 or A2 base pair values match a reference genome be dropped. Default is TRUE.
Binary should the Z-score be flipped along with effect and FRQ columns like Beta? It is assumed to be calculated off the effect size not the P-value and so will be flipped i.e. default TRUE.
Binary should the frequency (FRQ) column be flipped along with effect and z-score columns like Beta? Default TRUE.
Binary Should non-biallelic SNPs be removed. Default is TRUE.
Binary Should the supplied SNP ID's be assumed to be RSIDs. If not, imputation using the SNP ID for other columns like base-pair position or chromosome will not be possible. If set to FALSE, the SNP RS ID will be imputed from the reference genome if possible. Default is TRUE.
Binary Sometimes summary statistics can have multiple RSIDs on one row (i.e. related to one SNP), for example "rs5772025_rs397784053". This can cause an error so by default, the first RS ID will be kept and the rest removed e.g."rs5772025". If you want to just remove these SNPs entirely, set it to TRUE. Default is FALSE.
Conventionally the FRQ column is intended to show the minor/effect allele frequency (MAF) but sometimes the major allele frequency can be inferred as the FRQ column. This logical variable indicates that the FRQ column should be renamed to MAJOR_ALLELE_FRQ if the frequency values appear to relate to the major allele i.e. >0.5. By default this mapping won't occur i.e. is TRUE.
Binary does your Sumstats file contain Indels? These don't exist in our reference file so they will be excluded from checks if this value is TRUE. Default is TRUE.
version of dbSNP to be used for imputation (144 or 155).
whether to check for duplicates - if formatting QTL datasets this should be set to FALSE otherwise keep as TRUE. Default is TRUE.
Whether to sort by coordinates of resulting sumstats
Number of threads to use for parallel processes.
File path to save formatted data. Defaults to
tempfile(fileext=".tsv.gz")
.
Whether to write as VCF (TRUE) or tabular file (FALSE).
Index the formatted summary statistics with tabix for fast querying.
Return data.table
, GRanges
or VRanges
directly to user. Otherwise, return the path to the save data. Default is
FALSE.
If return_data is TRUE. Object type to be returned ("data.table","vranges","granges").
DEPRECATED, do not use. Use save_format="LDSC" instead.
Output format of sumstats. Options are NULL - standardised output format from MungeSumstats, LDSC - output format compatible with LDSC and openGWAS - output compatible with openGWAS VCFs. Default is NULL.
Binary Should log files be stored containing all filtered out SNPs (separate file per filter). The data is outputted in the same format specified for the resulting sumstats file. The only exception to this rule is if output is vcf, then log file saved as .tsv.gz. Default is FALSE.
Binary Should a log be stored containing all messages and errors printed by MungeSumstats in a run. Default is FALSE
Filepath to the directory for the log files and the log of MungeSumstats messages to be stored. Default is a temporary directory. Note the name of the log files (log messages and log outputs) are now the same as the name of the file specified in the save path parameter with the extension '_log_msg.txt' and '_log_output.txt' respectively.
Binary Should a column be added for each imputation step to show what SNPs have imputed values for differing fields. This includes a field denoting SNP allele flipping (flipped). On the flipped value, this denoted whether the alelles where switched based on MungeSumstats initial choice of A1, A2 from the input column headers and thus may not align with what the creator intended.Note these columns will be in the formatted summary statistics returned. Default is FALSE.
If a formatted file of the same names as save_path
exists, formatting will be skipped and this file will be imported instead
(default). Set force_new=TRUE
to override this.
MungeSumstats has a pre-defined column-name mapping file which should cover the most common column headers and their interpretations. However, if a column header that is in youf file is missing of the mapping we give is incorrect you can supply your own mapping file. Must be a 2 column dataframe with column names "Uncorrected" and "Corrected". See data(sumstatsColHeaders) for default mapping and necessary format.
The address for the modified sumstats file or the actual data dependent on user choice. Also, if log files wanted by the user, the return in both above instances are a list.
# Pass path to Educational Attainment Okbay sumstat file to a temp directory
eduAttainOkbayPth <- system.file("extdata", "eduAttainOkbay.txt",
package = "MungeSumstats"
)
## Call uses reference genome as default with more than 2GB of memory,
## which is more than what 32-bit Windows can handle so remove certain checks
## Using dbSNP = 144 for speed as it's smaller but you should use 155 unless
## you know what you are doing and need 144
is_32bit_windows <-
.Platform$OS.type == "windows" && .Platform$r_arch == "i386"
if (!is_32bit_windows) {
reformatted <- format_sumstats(
path = eduAttainOkbayPth,
ref_genome = "GRCh37",
dbSNP = 144
)
} else {
reformatted <- format_sumstats(
path = eduAttainOkbayPth,
ref_genome = "GRCh37",
on_ref_genome = FALSE,
strand_ambig_filter = FALSE,
bi_allelic_filter = FALSE,
allele_flip_check = FALSE,
dbSNP=144
)
}
#>
#>
#> ******::NOTE::******
#> - Formatted results will be saved to `tempdir()` by default.
#> - This means all formatted summary stats will be deleted upon ending the R session.
#> - To keep formatted summary stats, change `save_path` ( e.g. `save_path=file.path('./formatted',basename(path))` ), or make sure to copy files elsewhere after processing ( e.g. `file.copy(save_path, './formatted/' )`.
#> ********************
#> Formatted summary statistics will be saved to ==> /tmp/RtmpmswcRa/file14803bfd5fc9.tsv.gz
#> Importing tabular file: /__w/_temp/Library/MungeSumstats/extdata/eduAttainOkbay.txt
#> Checking for empty columns.
#> Standardising column headers.
#> First line of summary statistics file:
#> MarkerName CHR POS A1 A2 EAF Beta SE Pval
#> Summary statistics report:
#> - 93 rows
#> - 93 unique variants
#> - 70 genome-wide significant variants (P<5e-8)
#> - 20 chromosomes
#> Checking for multi-GWAS.
#> Checking for multiple RSIDs on one row.
#> Checking SNP RSIDs.
#> Checking for merged allele column.
#> Checking A1 is uppercase
#> Checking A2 is uppercase
#> Ensuring all SNPs are on the reference genome.
#> Loading SNPlocs data.
#> Loading reference genome data.
#> Preprocessing RSIDs.
#> Validating RSIDs of 93 SNPs using BSgenome::snpsById...
#> Loading required package: BiocGenerics
#>
#> Attaching package: ‘BiocGenerics’
#> The following objects are masked from ‘package:stats’:
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from ‘package:base’:
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#>
#> Attaching package: ‘S4Vectors’
#> The following objects are masked from ‘package:base’:
#>
#> I, expand.grid, unname
#> BSgenome::snpsById done in 27 seconds.
#> Checking for correct direction of A1 (reference) and A2 (alternative allele).
#> There are 46 SNPs where A1 doesn't match the reference genome.
#> These will be flipped with their effect columns.
#> Checking for missing data.
#> Checking for duplicate columns.
#> Checking for duplicate SNPs from SNP ID.
#> Checking for SNPs with duplicated base-pair positions.
#> INFO column not available. Skipping INFO score filtering step.
#> Filtering SNPs, ensuring SE>0.
#> Ensuring all SNPs have N<5 std dev above mean.
#> Removing 'chr' prefix from CHR.
#> Making X/Y/MT CHR uppercase.
#> Checking for bi-allelic SNPs.
#> Warning: When method is an integer, must be >0.
#> 67 SNPs (72%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
#> The FRQ column was mapped from one of the following from the inputted summary statistics file:
#> FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
#> As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
#> set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
#> Sorting coordinates.
#> Writing in tabular format ==> /tmp/RtmpmswcRa/file14803bfd5fc9.tsv.gz
#> Summary statistics report:
#> - 93 rows (100% of original 93 rows)
#> - 93 unique variants
#> - 70 genome-wide significant variants (P<5e-8)
#> - 20 chromosomes
#> Done munging in 0.474 minutes.
#> Successfully finished preparing sumstats file, preview:
#> Reading header.
#> SNP CHR BP A1 A2 FRQ BETA SE P
#> 1: rs301800 1 8490603 T C 0.17910 0.019 0.003 1.794e-08
#> 2: rs11210860 1 43982527 G A 0.63060 -0.017 0.003 2.359e-10
#> 3: rs34305371 1 72733610 G A 0.91231 -0.035 0.005 3.762e-14
#> 4: rs2568955 1 72762169 T C 0.23690 -0.017 0.003 1.797e-08
#> Returning path to saved data.
# returned location has the updated summary statistics file