NEWS.md
- The `log_folder` parameter in `format_sumstats()` has been updated. It still points to the directory where the log files and the log of MungeSumstats messages are stored, and the default is still a temporary directory. However, the log files (log messages and log outputs) now take the name of the file specified in the `save_path` parameter, with the suffixes `_log_msg.txt` and `_log_output.txt` respectively.
- `data.table::fread()` leaves NAs blank instead of including a literal NA. That’s fine for CSVs and if the output is read in by `fread`, but it breaks other tools for TSVs and is hard to read. Updated that and added a message when the table is switched to uncompressed for indexing.
- `read_header`:
  - `n=NULL`.
- Removed `seqminer` from all code (too buggy).
- `import_sumstats`:
  - `@inheritDotParams format_sumstats` for better documentation.
- `parse_logs`: Added new fields.
- `format_sumstats`: Added time report at the end (total minutes taken). Since this is a message, it is included in the logs, and is now parsed by `parse_logs` and put into the column “time”.
- `find_sumstats()`:
- `vcf2df`.
- `read_vcf` can now be parallelised: splits query into chunks, imports them, and (optionally) converts them to `data.table` before rbinding them back into one object.
- `mt_thresh` to avoid using parallelisation when VCFs are small, due to the overhead outweighing the benefits in these cases.
- `tryCatch` to `downloader` with different `download.file` parameters that may work better on certain machines.
- `file.path` to specify URL in:
  - `get_chain_file`
  - `import_sumstats`
- `download_vcf` to pass URLs directly (without downloading the files) when `vcf_download=FALSE`.
- `download_vcf`:
- `load_ref_genome_data`:
- `read_vcf_genome`: more robust way to get genome build from VCF.
- `read_sumstats`: Speed up by using `remove_empty_cols(sampled_rows=)`, and only run for tabular files (`read_vcf` already does this internally).
- `select_vcf_field`: Got rid of “REF col doesn’t exists” warning by omitting `rowRanges`.
- `vignettes/MungeSumstats.Rmd` were surrounded by ticks.
- `vcf2df`: Accounted for scenarios where `writeVcf` accidentally converts `geno` data into redundant 3D matrices.
- `data.table::rbindlist(fill=TRUE)` to bind chunks back together.
- `read_vcf` upgrades:
  - `infer_vcf_sample_ids`
  - `is_vcf_parsed`
  - `check_tab_delimited`
  - `read_vcf_data`
  - `remove_nonstandard_vcf_cols`
- `dt_to_granges` by merging functionality into `to_granges`.
- `liftover` to accommodate the slight change.
- `is_tabix` (I had incorrectly made `path` all lowercase).
- `index_vcf` recognize all compressed vcf suffixes.
- `BiocParallel` registered threads back to 1 after `read_vcf_parallel` finishes, to avoid potential conflicts with downstream steps.
- `find_sumstats` output to keep track of search parameters.
- `import_sumstats`:
  - (`save_path`) exists before downloading to save time.
  - `force_new` in addition to `force_new_vcf`.
- `MungeSumstats`.
- `read_vcf` to be more robust.
- `IRanges` to Imports.
- Removed `stringr` (no longer used).
- `is_tabix` to check whether a file is already tabix-indexed.
- `read_sumstats`:
  - `samples` as an arg.
- `GenomicFiles`.
- `read_sumstats`: now takes `samples` as an arg.
- Removed `INFO_filter=` from ALS VCF examples in vignettes (no longer necessary now that INFO parsing has been corrected).
- `download_vcf` can now handle situations where `vcf_url=` is actually a local file (not remote).
- `check_info_score` step.
- `check_info_score`:
  - `log_files$info_filter` in these instances.
- `check_empty_cols` was accidentally dropping more columns than it should have.
- `write_sumstats` when indexing VCF.
- `read_sumstats` can read in any VCF files (local/remote, indexed/non-indexed).
- `test-vcf_formatting.R`
- `test-check_impute_se_beta`
- `setkey` on SNP (now automatically renamed from ID by `read_vcf`).
- `test-read_sumstats`:
  - `read_sumstats`.
- `vcf_ss` are dropped.
- `parse_logs`: Add lines to parsing subfunctions to allow handling of logs that don’t contain certain info (thus avoiding warnings when creating the final `data.table`).
- `check_pos_se`
- `check_signed_col`
- `Rsamtools::bgzip` does compression in Bioc 3.15. Switched to using `fread + readLines` in:
  - `read_header`
  - `read_sumstats`
- `read_header`: wasn’t reading in enough lines to get past the VCF header. Increase to `readLines(n=1000)`.
- `read_vcf`: Would sometimes induce duplicate rows. Now only unique rows are used (after sample and columns filtering).
- `liftover`
- `GenomeInfoDb::mapGenomeBuilds` to standardise build names.
- `standardise_sumstats_column_headers_crossplatform`
  - `standardise_header` while keeping the original function name as an internal function (they call the same code).
- vignette - `liftover` tutorial
- `compute_nsize`
- `standardise_sumstats_column_headers_crossplatform`
- `formatted_example`
- `standardise_sumstats_column_headers_crossplatform`: Added arg `uppercase_unmapped` to allow users to specify whether they want to make the columns that could not be mapped to a standard name uppercase (default=`TRUE` for backwards compatibility). Added arg `return_list` to specify whether to return a named list (default) or just the `data.table`.
- `formatted_example`: Added arg `formatted` to specify whether the file should have its colnames standardised. Added arg `sorted` to specify whether the file should sort the data by coordinates. Added arg `return_list` to specify whether to return a named list (default) or just the `data.table`.
- `.datatable.aware=TRUE` to .zzz as extra precaution.
- `vcf2df`: Documented arguments.
- `import_sumstats`: Create individual folders for each GWAS dataset, with a respective `logs` subfolder to avoid overwriting log files when processing multiple GWAS.
- `parse_logs`: New function to convert logs from one or more munged GWAS into a `data.table`.
- `list_sumstats`: New function to recursively search for local summary stats files previously munged with `MungeSumstats`.
- `inst/extdata/MungeSumstats_log_msg.txt` to test logs files.
- `list_sumstats` and `parse_logs`.
- `gh-pages` branch automatically by new GHA workflow.
- `convert_large_p` and `convert_neg_p`, respectively. These are both handled by the new internal function `check_range_p_val`, which also reports the number of SNPs found meeting these criteria to the console/logs.
- `check_small_p_val` records which SNPs were imputed in a more robust way, by recording which SNPs met the criteria before making the changes (as opposed to inferring this info from which columns are 0 after making the changes). This function now only handles non-negative p-values, so that rows with negative p-values can be recorded/reported separately in the `check_range_p_val` step.
- `check_small_p_val` now reports the number of SNPs <= 5e-324 to console/logs.
- `check_range_p_val` and `check_small_p_val`.
- `parse_logs` can now extract information reported by `check_range_p_val` and `check_small_p_val`.
- `logs_example` provides easy access to log file stored in `inst/extdata`, and includes documentation on how it was created.
- `check_range_p_val` and `check_small_p_val` now use `#' @inheritParams format_sumstats` to improve consistency of documentation.
- `suppressWarnings` where possible.
- `validate_parameters` can now handle `ref_genome=NULL`
- `to_GRanges`/`to_GRanges` functions to all-lowercase functions (for consistency with other functions).
- `nThread=1` in `data.table` test functions.
- `get_genome_builds`
- `save_path` is in was actually created (as opposed to finding out at the very end of the pipeline).
- `read_header` and `read_sumstats` now both work with .bgz files.
- `format_sumstats(FRQ_filter)` added so SNPs can now be filtered by allele frequency.
- `format_sumstats(frq_is_maf)` check added to infer if FRQ column values are minor/effect allele frequencies or not. `frq_is_maf` allows users to rename the FRQ column as MAJOR_ALLELE_FRQ if some values appear to be major allele frequencies.
- `get_genome_builds()` can now be called to quickly get the genome build without running the whole reformatting.
- `format_sumstats(compute_n)` now has more methods to compute the effective sample size with “ldsc”, “sum”, “giant” or “metal”.
- `format_sumstats(convert_ref_genome)` now implemented, which can perform liftover to GRCh38 from GRCh37 and vice-versa, enabling better cohesion between different studies’ summary statistics.
- `check_no_rs_snp` can now handle extra information after an RS ID. So if you have `rs1234:A:G` that will be separated into two columns.
- `check_two_step_col` and `check_four_step_col`, the two checks for when multiple columns are in one, have been updated so if not all SNPs have multiple columns or some have more than the expected number, this can now be handled.
- `FRQ` column have been added to the mapping file.
- `check_multi_rs_snp` can now handle all punctuation with/without spaces. So if a row contains `rs1234,rs5678` or `rs1234, rs5678` or any other punctuation character other than `,` these can be handled.
- `format_sumstats(path)` can now be passed a dataframe/datatable of the summary statistics directly as well as a path to their saved location.
- `A0/A1` corresponding to ref/alt can now be handled by the mapping file as well as `A1/A2` corresponding to ref/alt.
- `import_sumstats` reads GWAS sum stats directly from Open GWAS. Now parallelised and reports how long each dataset took to import/format in total.
- `find_sumstats` searches Open GWAS for datasets.
- `compute_z` computes Z-score from P.
- `compute_n` computes N for all SNPs from user defined sample size.
- `format_sumstats(ldsc_format=TRUE)` ensures sum stats can be fed directly into LDSC without any additional munging.
- `read_sumstats`, `write_sumstats`, and `download_vcf` functions now exported.
- `format_sumstats(sort_coordinates=TRUE)` sorts results by their genomic coordinates.
- `format_sumstats(return_data=TRUE)` returns data directly to user. Can be returned in either `data.table` (default), `GRanges` or `VRanges` format using `format_sumstats(return_format="granges")`.
- `format_sumstats(N_dropNA=TRUE)` (default) drops rows where N is missing.
- `format_sumstats(snp_ids_are_rs_ids=TRUE)` (default) Should the SNP IDs inputted be inferred as RS IDs or some arbitrary ID.
- `format_sumstats(write_vcf=TRUE)` writes a tabix-indexed VCF file instead of tabular format.
- `format_sumstats(save_path=...)` lets users decide where their results are saved and what they’re named.
- `save_path` indicates it’s in `tempdir()`, message warns users that these files will be deleted when R session ends.
- `format_sumstats` via `report_summary()`.
- `preview_sumstats()` messages improved.
- `format_sumstats(pos_se=TRUE,effect_columns_nonzero=TRUE)`
- `format_sumstats(log_folder_ind=TRUE,log_folder=tempdir())`
- `format_sumstats(imputation_ind=TRUE)`
- `data(sumstatsColHeaders)`. See `format_sumstats(mapping_file = mapping_file)`.
- `read_vcf` upgraded to account for more VCF formats.
- `check_n_num` now accounts for situations where N is a character vector and converts to numeric.
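
As a rough illustration of the `check_n_num` entry above, the base R sketch below shows the kind of character-to-numeric coercion it describes. The `sumstats` table and its values are made up for illustration, and this is not the package's internal code, only a minimal example of the idea.

```r
# Hypothetical summary statistics where N was imported as character
# (for example, because the column was quoted in the source file).
sumstats <- data.frame(
  SNP = c("rs1", "rs2", "rs3"),
  N = c("1000", "2500", "1e4"),
  stringsAsFactors = FALSE
)

# Convert only when needed, so an already-numeric N column is left untouched.
if (is.character(sumstats$N)) {
  sumstats$N <- as.numeric(sumstats$N)
}

# Inspect the resulting column type.
str(sumstats$N)
```

Note that `as.numeric()` returns `NA` (with a warning) for strings that cannot be parsed as numbers, so a real check would probably also need to decide how to treat such rows.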