NEWS.md
* The `log_folder` parameter in `format_sumstats()` has been updated. It still points to the directory where the log files and the log of MungeSumstats messages are stored, and the default is still a temporary directory. However, the log files (log messages and log outputs) now take the name of the file specified in the `save_path` parameter, with the suffixes `_log_msg.txt` and `_log_output.txt` respectively.
* `data.table::fread()` leaves NAs blank instead of including a literal NA. That’s fine for CSVs and if the output is read in by `fread`, but it breaks other tools for TSVs and is hard to read. Updated that and added a message when the table is switched to uncompressed for indexing.
* `read_header`:
  - `n=NULL`.
* `seqminer` from all code (too buggy).
* `import_sumstats`:
  - `@inheritDotParams format_sumstats` for better documentation.
* `parse_logs`: Added new fields.
* `format_sumstats`: Added time report at the end (minutes taken total). Since this is a message, it is included in the logs, and is now parsed by `parse_logs` and put into the column “time”.
* `find_sumstats()`:
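The log naming described in the `log_folder` entry above can be sketched in base R. This is only an illustration: `log_paths()` is a hypothetical helper, not a MungeSumstats function, and it assumes that the file extension(s) of `save_path` are dropped before the suffixes are appended.

```r
# Hypothetical helper (not part of MungeSumstats): derive the two log file
# paths from save_path.
log_paths <- function(save_path, log_folder = tempdir()) {
  # Assumption of this sketch: keep only the part of the file name before
  # the first ".", so "gwas.tsv.gz" becomes "gwas".
  stem <- sub("\\..*$", "", basename(save_path))
  list(
    msg    = file.path(log_folder, paste0(stem, "_log_msg.txt")),
    output = file.path(log_folder, paste0(stem, "_log_output.txt"))
  )
}

paths <- log_paths("results/gwas.tsv.gz", log_folder = "logs")
print(paths$msg)     # "logs/gwas_log_msg.txt"
print(paths$output)  # "logs/gwas_log_output.txt"
```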
* `vcf2df`.
* `read_vcf` can now be parallelised: splits query into chunks, imports them, and (optionally) converts them to `data.table` before rbinding them back into one object.
  - `mt_thresh` to avoid using parallelisation when VCFs are small, due to the overhead outweighing the benefits in these cases.
* `tryCatch` to `downloader` with different `download.file` parameters that may work better on certain machines.
* `file.path` to specify URL in:
  - `get_chain_file`
  - `import_sumstats`
* `download_vcf` to pass URLs directly (without downloading the files) when `vcf_download=FALSE`.
* `download_vcf`:
* `load_ref_genome_data`:
* `read_vcf_genome`: more robust way to get genome build from VCF.
* `read_sumstats`: Speed up by using `remove_empty_cols(sampled_rows=)`, and only run for tabular files (`read_vcf` already does this internally).
* `select_vcf_field`: Got rid of “REF col doesn’t exists” warning by omitting `rowRanges`.
* `vignettes/MungeSumstats.Rmd` were surrounded by ticks.
* `vcf2df`: Accounted for scenarios where `writeVcf` accidentally converts geno data into redundant 3D matrices.
  - `data.table::rbindlist(fill=TRUE)` to bind chunks back together.
* `read_vcf` upgrades:
  - `infer_vcf_sample_ids`
  - `is_vcf_parsed`
  - `check_tab_delimited`
  - `read_vcf_data`
  - `remove_nonstandard_vcf_cols`
* `dt_to_granges` by merging functionality into `to_granges`.
  - `liftover` to accommodate the slight change.
* `is_tabix` (I had incorrectly made `path` all lowercase).
* `index_vcf` recognize all compressed vcf suffixes.
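The chunk-then-bind pattern mentioned for `read_vcf` and `vcf2df` above can be sketched as follows. This is a toy illustration under stated assumptions, not the package's implementation: `import_chunk()` is a hypothetical stand-in for the real VCF importer, and base `rbind` is used instead of `data.table::rbindlist(fill=TRUE)` so the example has no dependencies.

```r
# Toy stand-in for a VCF query: 10 variant positions to import.
positions <- 1:10

# Hypothetical importer (not a MungeSumstats function): returns one
# data.frame per chunk of positions.
import_chunk <- function(pos) {
  data.frame(SNP = paste0("rs", pos), BP = pos)
}

# Split the query into chunks of at most 4 positions each.
chunk_size <- 4
chunks <- split(positions, ceiling(seq_along(positions) / chunk_size))

# Import each chunk. A parallel apply (e.g. parallel::mclapply) could
# replace lapply when the input is large enough to justify the overhead.
parts <- lapply(chunks, import_chunk)

# Bind the chunks back into one object and drop the chunk-derived row names.
combined <- do.call(rbind, parts)
rownames(combined) <- NULL

print(nrow(combined))
```

With real data, chunks may not share identical columns, which is presumably why the notes mention `fill=TRUE` when binding.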
* `BiocParallel` registered threads back to 1 after `read_vcf_parallel` finishes, to avoid potential conflicts with downstream steps.
* `find_sumstats` output to keep track of search parameters.
* `import_sumstats`:
  - `save_path`) exists before downloading to save time.
  - `force_new` in addition to `force_new_vcf`.
* `MungeSumstats`.
* `read_vcf` to be more robust.
* `IRanges` to Imports.
* `stringr` (no longer used)
* `is_tabix` to check whether a file is already tabix-indexed.
* `read_sumstats`:
  - `samples` as an arg.
  - `GenomicFiles`.
* `read_sumstats`: now takes `samples` as an arg.
* `INFO_filter=` from ALS VCF examples in vignettes (no longer necessary now that INFO parsing has been corrected).
* `download_vcf` can now handle situations where `vcf_url=` is actually a local file (not remote).
* `check_info_score` step.
* `check_info_score`:
  - `log_files$info_filter` in these instances.
* `check_empty_cols` was accidentally dropping more columns than it should have.
* `write_sumstats` when indexing VCF.
* `read_sumstats` can read in any VCF files (local/remote, indexed/non-indexed).
* `test-vcf_formatting.R`
* `test-check_impute_se_beta`
  - `setkey` on SNP (now automatically renamed from ID by `read_vcf`).
* `test-read_sumstats`:
  - `read_sumstats`.
  - `vcf_ss` are dropped.
* `parse_logs`: Add lines to parsing subfunctions to allow handling of logs that don’t contain certain info (thus avoiding warnings when creating the final `data.table`).
* `check_pos_se`
* `check_signed_col`
* `Rsamtools::bgzip` does compression in Bioc 3.15. Switched to using `fread` + `readLines` in:
  - `read_header`
  - `read_sumstats`
* `read_header`: wasn’t reading in enough lines to get past the VCF header. Increase to `readLines(n=1000)`.
* `read_vcf`: Would sometimes induce duplicate rows. Now only unique rows are used (after sample and columns filtering).
* `liftover`
  - `GenomeInfoDb::mapGenomeBuilds` to standardise build names.
* `standardise_sumstats_column_headers_crossplatform`
  - `standardise_header` while keeping the original function name as an internal function (they call the same code).
* vignette - `liftover` tutorial
* `compute_nsize`
* `standardise_sumstats_column_headers_crossplatform`
* `formatted_example`
* `standardise_sumstats_column_headers_crossplatform`: Added arg `uppercase_unmapped` to allow users to specify whether they want to make the columns that could not be mapped to a standard name uppercase (default=`TRUE` for backcompatibility). Added arg `return_list` to specify whether to return a named list (default) or just the `data.table`.
* `formatted_example`: Added arg `formatted` to specify whether the file should have its colnames standardised. Added arg `sorted` to specify whether the data should be sorted by coordinates. Added arg `return_list` to specify whether to return a named list (default) or just the `data.table`.
* `.datatable.aware=TRUE` to `.zzz` as extra precaution.
* `vcf2df`: Documented arguments.
* `import_sumstats`: Create individual folders for each GWAS dataset, with a respective `logs` subfolder to avoid overwriting log files when processing multiple GWAS.
* `parse_logs`: New function to convert logs from one or more munged GWAS into a `data.table`.
* `list_sumstats`: New function to recursively search for local summary stats files previously munged with `MungeSumstats`.
* `inst/extdata/MungeSumstats_log_msg.txt` to test logs files.
* `list_sumstats` and `parse_logs`.
* `gh-pages` branch automatically by new GHA workflow.
* `convert_large_p` and `convert_neg_p`, respectively. These are both handled by the new internal function `check_range_p_val`, which also reports the number of SNPs found meeting these criteria to the console/logs.
* `check_small_p_val` records which SNPs were imputed in a more robust way, by recording which SNPs met the criteria before making the changes (as opposed to inferring this info from which columns are 0 after making the changes). This function now only handles non-negative p-values, so that rows with negative p-values can be recorded/reported separately in the `check_range_p_val` step.
* `check_small_p_val` now reports the number of SNPs <= 5e-324 to console/logs.
* `check_range_p_val` and `check_small_p_val`.
* `parse_logs` can now extract information reported by `check_range_p_val` and `check_small_p_val`.
* `logs_example` provides easy access to the log file stored in `inst/extdata`, and includes documentation on how it was created.
* `check_range_p_val` and `check_small_p_val` now use `#' @inheritParams format_sumstats` to improve consistency of documentation.
* `suppressWarnings` where possible.
* `validate_parameters` can now handle `ref_genome=NULL`
* `to_GRanges`/`to_GRanges` functions to all-lowercase functions (for consistency with other functions).
* `nThread=1` in `data.table` test functions.
* `get_genome_builds`
* `save_path` is in was actually created (as opposed to finding out at the very end of the pipeline).
* `read_header` and `read_sumstats` now both work with `.bgz` files.
* `format_sumstats(FRQ_filter)` added so SNPs can now be filtered by allele frequency
* `format_sumstats(frq_is_maf)` check added to infer if FRQ column values are minor/effect allele frequencies or not. `frq_is_maf` allows users to rename the FRQ column as `MAJOR_ALLELE_FRQ` if some values appear to be major allele frequencies
* `get_genome_builds()` can now be called to quickly get the genome build without running the whole reformatting.
* `format_sumstats(compute_n)` now has more methods to compute the effective sample size with “ldsc”, “sum”, “giant” or “metal”.
* `format_sumstats(convert_ref_genome)` now implemented; it can perform liftover to GRCh38 from GRCh37 and vice versa, enabling better cohesion between different studies’ summary statistics.
* `check_no_rs_snp` can now handle extra information after an RS ID. So if you have `rs1234:A:G`, that will be separated into two columns.
* `check_two_step_col` and `check_four_step_col`, the two checks for when multiple columns are in one, have been updated so that cases where not all SNPs have multiple columns, or some have more than the expected number, can now be handled.
* `FRQ` column have been added to the mapping file
* `check_multi_rs_snp` can now handle all punctuation with/without spaces. So if a row contains `rs1234,rs5678` or `rs1234, rs5678`, or uses any punctuation character other than `,`, these can be handled.
* `format_sumstats(path)` can now be passed a dataframe/datatable of the summary statistics directly as well as a path to their saved location.
* `A0/A1` corresponding to ref/alt can now be handled by the mapping file as well as `A1/A2` corresponding to ref/alt.
* `import_sumstats` reads GWAS sum stats directly from Open GWAS. Now parallelised and reports how long each dataset took to import/format in total.
* `find_sumstats` searches Open GWAS for datasets.
* `compute_z` computes Z-score from P.
* `compute_n` computes N for all SNPs from user-defined sample size.
* `format_sumstats(ldsc_format=TRUE)` ensures sum stats can be fed directly into LDSC without any additional munging.
* `read_sumstats`, `write_sumstats`, and `download_vcf` functions now exported.
* `format_sumstats(sort_coordinates=TRUE)` sorts results by their genomic coordinates.
* `format_sumstats(return_data=TRUE)` returns data directly to user. Can be returned in either `data.table` (default), `GRanges` or `VRanges` format using `format_sumstats(return_format="granges")`.
* `format_sumstats(N_dropNA=TRUE)` (default) drops rows where N is missing.
* `format_sumstats(snp_ids_are_rs_ids=TRUE)` (default): whether the inputted SNP IDs should be inferred as RS IDs or treated as some arbitrary ID.
* `format_sumstats(write_vcf=TRUE)` writes a tabix-indexed VCF file instead of tabular format.
* `format_sumstats(save_path=...)` lets users decide where their results are saved and what they’re named.
* `save_path` indicates it’s in `tempdir()`, message warns users that these files will be deleted when the R session ends.
* `format_sumstats` via `report_summary()`.
* `preview_sumstats()` messages improved.
* `format_sumstats(pos_se=TRUE,effect_columns_nonzero=TRUE)`
* `format_sumstats(log_folder_ind=TRUE,log_folder=tempdir())`
* `format_sumstats(imputation_ind=TRUE)`
* `data(sumstatsColHeaders)`. See `format_sumstats(mapping_file = mapping_file)`.
* `read_vcf` upgraded to account for more VCF formats.
* `check_n_num` now accounts for situations where N is a character vector and converts to numeric.
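The `check_no_rs_snp` entry above, where an ID such as `rs1234:A:G` is separated into two columns, can be illustrated with a small base R sketch. It shows the general idea only, not the package's code: the column name `SNP_INFO` and the rule of splitting at the first `:` are assumptions made for this example.

```r
ids <- c("rs1234:A:G", "rs5678", "rs42:T:C")

# Everything up to the first ":" is treated as the RS ID ...
snp <- sub(":.*$", "", ids)

# ... and whatever follows is kept as extra information (NA when absent).
extra <- ifelse(grepl(":", ids), sub("^[^:]*:", "", ids), NA_character_)

# Hypothetical output layout: one column for the ID, one for the remainder.
split_ids <- data.frame(SNP = snp, SNP_INFO = extra)
print(split_ids)
```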