- Use vcfanno for faster, more generalized variant annotation and database creation.
- Fix bug in comp_het where candidates were not removed when either parent was homozygous at both sites (thanks Jessica Chong). (inheritance v0.1.0)
- x-linked-dominant with the following rules (thanks Jessica Chong):
  + mothers of affected males must be het (and affected)
  + at least 1 parent of affected females must be het (and affected)
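
The two x-linked-dominant rules can be read as a per-family check. The following is a minimal sketch of that reading, not GEMINI's implementation: the `Sample` record, the genotype constants, and the function name are hypothetical, and skipping samples whose parents are unknown is an assumption the changelog does not state.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical genotype labels used only for this illustration.
HOM_REF, HET, HOM_ALT = "hom_ref", "het", "hom_alt"


@dataclass
class Sample:
    name: str
    sex: str                      # "male" or "female"
    affected: bool
    genotype: str                 # one of HOM_REF, HET, HOM_ALT
    mom: Optional["Sample"] = None
    dad: Optional["Sample"] = None


def is_het_and_affected(sample: Sample) -> bool:
    return sample.genotype == HET and sample.affected


def passes_x_linked_dominant_rules(samples: List[Sample]) -> bool:
    """Apply the two parental rules to every affected sample in a family."""
    for s in samples:
        if not s.affected:
            continue
        if s.sex == "male":
            # Rule 1: the mother of an affected male must be het and affected.
            # Assumption: an unknown mother is not treated as a violation.
            if s.mom is not None and not is_het_and_affected(s.mom):
                return False
        else:
            # Rule 2: at least one known parent of an affected female
            # must be het and affected.
            parents = [p for p in (s.mom, s.dad) if p is not None]
            if parents and not any(is_het_and_affected(p) for p in parents):
                return False
    return True
```

A family in which an affected son has an unaffected, homozygous-reference mother fails the check, while the same son with a het, affected mother passes.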
- Store extra vep fields for each transcript in variant_impacts.
- Update geneimpacts module to handle new snpEff annotations of SVs (Brad Chapman).
- Update to dbsnp 147.
- Use SQLAlchemy for table definitions to support different RDBMS back-ends.
- Several optimizations to loading.
- X-linked recessive, dominant, and de novo tools.
- Raise exceptions rather than calling sys.exit() to facilitate use as a library (thanks @brainstorm).
- comp_het tool is as much as 20X faster for large cohorts.
- handle rare VEP annotation with '' or '?' as impact
- fix handling of multiple values for list/uniq_list in gemini_annotate
- handle snpEff with unknown impact (TODO: bump geneimpacts req to 0.1.0)
- fix builtin browser and add support for puzzle browser
- add --gt-filter-required to gemini gene_wise tool that must pass for each variant.
- fix bug in lof_sieve when transcript pos is not specified. (thanks @mmoisse)
- Update the Clinvar annotation file to the February 3, 2016 release.
- clinvar_gene_phenotype column, which can be used to limit candidates to those in the same gene as a gene with a known phenotype from ClinVar. Phenotypes are all lower case. Likely usage is:
  --filter "... and clinvar_gene_phenotype LIKE '%dysplasia%'"
  where the '%' wildcard is needed because the column is a '|'-delimited list of all diseases for that gene.
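
To see why the wildcard matters, the sketch below runs an exact match and a LIKE match against a '|'-delimited column in an in-memory SQLite database. The table layout and the rows are made up for illustration; only the column name and the LIKE pattern come from the entry above.

```python
import sqlite3

# Toy table standing in for the variants table; rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants "
    "(variant_id INTEGER, gene TEXT, clinvar_gene_phenotype TEXT)"
)
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?)",
    [
        (1, "GENE_A", "skeletal dysplasia|short stature"),
        (2, "GENE_B", "cardiomyopathy"),
        (3, "GENE_C", None),  # gene with no phenotype recorded
    ],
)


def matching_ids(where_clause):
    """Return the variant_ids selected by a WHERE clause, in order."""
    query = (
        "SELECT variant_id FROM variants WHERE "
        + where_clause
        + " ORDER BY variant_id"
    )
    return [row[0] for row in conn.execute(query)]


# An exact comparison never matches a multi-phenotype list.
exact = matching_ids("clinvar_gene_phenotype = 'dysplasia'")
# The wildcard form matches the phenotype anywhere in the list.
wild = matching_ids("clinvar_gene_phenotype LIKE '%dysplasia%'")
print(exact, wild)
```

The exact comparison selects nothing, while the wildcard form selects the one variant whose gene lists a dysplasia phenotype.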
- geno2mp_hpo_ct column, which will be > 0 if that variant is present in geno2mp (http://geno2mp.gs.washington.edu/Geno2MP/#/)
exac_num_hom_alt was requested. (Thanks to @davemcg for reporting.)
MED priority in the interest of reducing false negatives in the study of rare disease.
- Set missing AF to -1 (instead of NULL) for unknown for all ESP, 1KG, ExAC allele frequency columns.
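
One practical effect of using -1 instead of NULL shows up in rare-variant filters: in SQL, a comparison against NULL is never true, so a variant absent from a population database would be dropped by a filter such as `aaf < 0.01`, whereas -1 passes it. The sketch below shows this with SQLite; the table and the two column names are hypothetical stand-ins, not GEMINI's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# aaf_null stores unknown frequencies as NULL, aaf_minus1 stores them as -1.
conn.execute(
    "CREATE TABLE variants (variant_id INTEGER, aaf_null REAL, aaf_minus1 REAL)"
)
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?)",
    [
        (1, 0.20, 0.20),    # common variant
        (2, 0.001, 0.001),  # rare variant
        (3, None, -1.0),    # variant not seen in the population database
    ],
)


def ids(where_clause):
    """Return the variant_ids selected by a WHERE clause, in order."""
    query = (
        "SELECT variant_id FROM variants WHERE "
        + where_clause
        + " ORDER BY variant_id"
    )
    return [row[0] for row in conn.execute(query)]


rare_with_null = ids("aaf_null < 0.01")      # unseen variant is excluded
rare_with_minus1 = ids("aaf_minus1 < 0.01")  # unseen variant is kept
print(rare_with_null, rare_with_minus1)
```

With -1, a single `< threshold` condition keeps both rare and never-observed variants, which is usually what a rare-disease filter wants.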
- Don't use order by when not needed for built-in tools. Speeds up queries when min-kindreds is None or 1.
- Document the
- Fix gemini load with -t all when only VEP is present.
- Fix bug in gemini annotate where numeric operations on integers did not work for VCF.
- Fix bug in comp_hets tool where attempting to phase a "." genotype would result in an error. (Thanks to Aparna and Martina for reporting.)
- Fix bug in mendel_errors tool when specifying --families. Thanks to @davemcg for reporting.
- Improved installation and update via conda (via Brad Chapman!).
- Update support for SO term variant impact predictions via VEP and SnpEff and support newest snpEff version. If both snpEff and VEP annotations are present, use gemini load -t all to save all annotations; the most deleterious will be saved in the variants table.
- Add an is_splicing column.
- Add exac_num_het, exac_num_hom_alt and exac_num_chroms columns. (See #568).
- Fix issues with cyvcf2 handling of haploid calls (from X chromosome from GATK; thanks Athina for reporting)
- Fix handling of VEP extra fields (thanks @jsh58).
- Remove pygraph dependency (networkx is easier to install). Allow specifying custom edges to interactions tools.
- Fix some edge-cases in compound het tool. Thanks to Jamie Kwok for reporting.
- Allow requiring a minimum genotype quality in inheritance tools with the --min-gq option.
- Fix bug in cyvcf2 for parsing hemizygous (0/.) variants from decomposition.
- comp_het tool now has a priority 2 level for unphased singletons where both sites are HET. Candidates where both parents are HET, along with the affected, are now priority 3.
- Fix bcolz dependencies owing to issues with bcolz 0.11.0
- Change handling of missing values in PL/GL (--gt-pl-max) so that missing values are set to int32 max. (Thanks Karyn for reporting.)
- Fix distributed loading of VEP with extra columns (@chapmanb) [Regression since 0.17.0]
- Fix comp_het test and improve efficiency (thanks Bianca for reporting)
- Bug fix: populate eval dictionary with sample_info.
- switch to cyvcf2 to speed loading
- per-sample depths are calculated using AD (GATK) or AO+RO (Freebayes). This makes depth filters more conservative.
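
As a rough illustration of the depth calculation described above, the sketch below derives a per-sample depth from a dictionary of FORMAT fields. It is not the cyvcf2 code; the function name, the dictionary input, and the -1 fallback for samples with neither field are assumptions made for this example.

```python
def sample_depth(fmt):
    """Return depth from AD (GATK style) or AO + RO (FreeBayes style).

    `fmt` is a hypothetical dict of FORMAT fields for one sample,
    e.g. {"AD": [10, 5], "DP": 18} or {"RO": 7, "AO": [3, 2]}.
    """
    ad = fmt.get("AD")
    if ad is not None:
        # Sum the per-allele depths, skipping missing entries.
        return sum(d for d in ad if d is not None and d >= 0)

    ro, ao = fmt.get("RO"), fmt.get("AO")
    if ro is not None and ao is not None:
        # AO may hold one value per alternate allele.
        alt_counts = ao if isinstance(ao, (list, tuple)) else [ao]
        return ro + sum(a for a in alt_counts if a is not None)

    # Assumption: report -1 when no allele-level depth is available.
    return -1


print(sample_depth({"AD": [10, 5], "DP": 18}))
print(sample_depth({"RO": 7, "AO": [3, 2]}))
```

In the first call the allele depths sum to 15 even though DP is 18, which shows how a depth filter based on allele-supporting reads can be stricter than one based on DP.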
- extra VEP annotations are loaded with loading machinery, not as an extra step as before.
- add max_aaf_all column (https://github.com/arq5x/gemini/issues/520) as an aggregate of a number of population filters.
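
The aggregate can be pictured as follows, assuming (as the column name suggests, though this entry does not spell it out) that it is the maximum alternate allele frequency across the contributing populations. The function and the population keys are hypothetical, and unknown frequencies are taken to be -1 or None.

```python
def max_aaf_all(freqs):
    """Return the largest known allele frequency, or -1.0 if none is known.

    `freqs` maps a hypothetical population label to an allele frequency;
    -1 or None marks a population where the variant was not observed.
    """
    known = [f for f in freqs.values() if f is not None and f >= 0]
    return max(known) if known else -1.0


print(max_aaf_all({"esp": 0.01, "1kg": 0.2, "exac": -1}))
print(max_aaf_all({"esp": -1, "1kg": None}))
```

A single condition on such a column then stands in for one condition per population when filtering for variants that are rare everywhere.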
- use --families to limit queries before any work is done. Thanks to Bianca for reporting.
- No longer create bcolz indices by default. Users can create them with gemini bcolz_index.
- New genewise tool. See docs.
- gemini load: --skip-info-string has been replaced with --save-info-string and the INFO field is no longer saved by default.
- comp_hets: default to only showing confident (priority 1) candidates. Show all candidates with --max-priority 3.
- Fix bug in comp_het with reporting same pair multiple times.
- Handle UNKNOWN genotypes in
- Fix cyvcf dependency in requirements
- Only run tests that require bgzip/tabix/bedtools if they are available on PATH
- Limit ipython version to 3<version<4
- Hone rules for unphased and partially-phased compound hets.
- Remove --lenient argument for comp_hets and add --pattern-only to find compound_hets regardless of affection status.
- The --lenient argument to the autosomal_dominant tool has been relaxed to allow parents with unknown phenotypes.
- Re-decompose (with vt) data files for 1000 genomes and ExAC (thanks Julien and Xiaolin for reporting).
- Fix regression in loading when AAF is None
- Fix handling in mendelian error tool where all genotype likelihoods are low (thanks Bianca)
- Don’t phase de-novo’s (caused error in comp_het tool). (thanks Bianca)
- Fix regression in loading VEP with multicore (thanks Andrew)
- The built-in inheritance model tools (auto_rec, etc.) have been modified to be more restrictive in order to remove false positive candidates. The strictness can be reduced by using the
- Leverage bcolz indexing for the built-in inheritance model tools to dramatically improve speed.
- Support for multi-generational pedigrees for the built in inheritance model tools. (thanks to Jessica, Andrew, and jmcelwee for extensive discussion https://github.com/arq5x/gemini/issues/388)
- Leverage genotype likelihoods in tools other than mendel_errors as a means to filter variants.
- Automatically phase genotypes by transmission on the fly for the comp_hets tool.
- Further performance improvements for bcolz queries
- --affected-only option has been made the default and its opposing replacement named
- Fixed a reporting error for the inheritance tools (i.e., family_id was mis-specified in output).
- Annotate the variants table with impact even if there is not severe impact. Thanks to @mjsduncan for reporting.
- Reduce memory requirements when loading. Thanks to @mjsduncan for reporting.
- Fix regression in grabix. Thanks to Sven-Eric Shelhorn for reporting.
- Fix handling of samples with “-”. Thanks to Uma Paila for reporting.
- Use external index to speed genotype queries (this is created by default on load unless --no-bcolz is specified)
- Match on ref and alternate alleles (not just position) when annotating with VCF. Thanks Jeremy Goecks.
- Related to matching, we now load extra annotation, e.g. VEP as VCF and require ref and alt matching. Previously was done with bed overlap.
- Faster queries due to lazy loading of genotype columns.
- Read gt* columns from the database for better backward compatibility.
- Code cleanup. Thanks to Christian Brueffer.
- Standardized the output from the built-in tools into a common, BED+ format. Thanks to feedback from Jessica Chong and Daniel Gaston.
- Release of mendel_errors tool which also outputs the type of error and the probability (based on PL’s)
- Improvements to the load tool when running on large compute clusters using PBS, SGE, SLURM, etc. Also provide a workaround for NFS locking issues. Many thanks to Ben Weisburd in Daniel MacArthur's lab.
- Improve preprocess script to support varscan, platypus (https://gist.github.com/brentp/4db670df147cbd5a2b32)
- Performance improvements for many of the built-in tools (pre-compile evals)
- Bug fix for installation with sudo privileges.
- Major query speed improvements. For example, the following query goes from 43 seconds in version 0.12.2 to 11 seconds in 0.13.0. All queries involving gt_* fields should be substantially faster.
$ gemini query \
    -q "select chrom, start, (gts).(*) from variants" data/tmaster.db \
    --gt-filter "(gt_depths).(*).(>=20).(all)" > /dev/null
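
The --gt-filter wildcard in this command reads as "column, sample selection, per-sample test, aggregate": here, depth >= 20 in all samples. The sketch below mimics that reading in plain Python for illustration only; the function, its arguments, and the three aggregates it handles are assumptions for this example, not GEMINI's parser.

```python
import operator

# Comparison operators that a per-sample test might use.
OPS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
    "!=": operator.ne,
}


def wildcard_passes(values_by_sample, selected, op, threshold, aggregate):
    """Evaluate one wildcard-style filter for a single variant.

    values_by_sample: dict of sample name -> value (e.g. depth)
    selected:         sample names chosen by the wildcard ("*" means all)
    op, threshold:    the per-sample test, e.g. ">=", 20
    aggregate:        "all", "any", or "none"
    """
    compare = OPS[op]
    hits = [compare(values_by_sample[s], threshold) for s in selected]
    if aggregate == "all":
        return all(hits)
    if aggregate == "any":
        return any(hits)
    if aggregate == "none":
        return not any(hits)
    raise ValueError("unsupported aggregate: " + aggregate)


depths = {"s1": 25, "s2": 30, "s3": 12}
print(wildcard_passes(depths, list(depths), ">=", 20, "all"))
print(wildcard_passes(depths, ["s1", "s2"], ">=", 20, "all"))
```

With all three samples selected the variant fails because s3 has depth 12; restricting the selection to s1 and s2 lets it pass.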
- Speed improvements to load. The following went from 7 minutes 9 seconds to 6 minutes 21 seconds.
$ gemini load -t VEP -v data/v100K.vcf.gz data/tmaster.db --cores 4
- We added the gt_phred_ll_homref, gt_phred_ll_het, and gt_phred_ll_homalt columns to the database. These are the genotype likelihoods pulled from the GL or PL columns of the VCF, if available. They can all be queried and filtered in the same way as existing gt_* columns. In future releases, we are planning to use genotype likelihoods to assign likelihoods to de novo mutations, mendelian violations, and variants meeting other inheritance patterns.
- Fixed bugs related to splitting multiple alts (thanks to @jdh237)
- We are working to improve development and release testing. This is ongoing, but we now support gemini_install.py --version unstable so that users can try out the latest changes and help with testing before releases. gemini_update is still limited to master as the most recent version.
- Update cyvcf so it doesn’t error when AD tag is used for non-list data.
- Fix regression in cyvcf to handle Flags in info field. (Thanks to Jon for reporting)
- Improvements to install related to PYTHONHOME and other env variables (@chapmanb & @bw2)
Corrected a stale .c file in the cyvcf library. This is effectively a replacement for the 0.12.1 release.
- Support for input VCF files containing variants with multiple alternate alleles. Thanks to Brent Pedersen.
- Updated, decomposed, and normalized the ExAC, Clinvar, Cosmic, dbSNP, and ESP annotation files to properly support variants with multiple alternate alleles.
- Updated the logic for the clinvar significance column to retain all documented significances.
- Support for VCF annotation files in the annotate tool.
- Improved the speed of loading by 10-15%. Thanks to Brent Pedersen.
- Added --only-affected and --min-kindreds options to the compound heterozygotes tool.
- Added a --format vcf option to the query tool to output query results in VCF format.
- Added the --families option to the auto_*, de_novo, and comp_hets tools. Thanks to Mark Cowley and Tony Roscioli.
- Added the --only-affected option to the de_novo tool.
- Allow the --sample-filter to work with --format TPED. Thanks to Rory Kirchner.
- Add --format sampledetail option that provides a melted/tidy/flattened version of samples along with --showsample and includes information from samples table. Thanks to Brad Chapman.
- Add 'not' option to --in filtering. Thanks to Rory Kirchner.
- Fixed a bug in the de_novo tool that prevented proper function when families have affected and unaffected children. Thanks to Andrew Oler.
- Fixed a bug in cyvcf that falsely treated ‘.|.’ genotypes as homozygous alternate. Thanks to Xiao Xu.
- GEMINI now checks for and warns of old grabix index files. Thanks to Andrew Oler and Brent Pedersen.
- Fixed a bug that added newlines at the end of tab delimited PED files. Thanks to Brad Chapman.
- Integration of ExAC annotations (v0.2): http://exac.broadinstitute.org/
- New tools for cancer genome analysis. Many thanks to fantastic work from Colby Chiang.
  - gemini set_somatic
  - gemini actionable_mutations
  - gemini fusions
- Improved support for structural variants. New columns include:
- Updated the 1000 Genomes annotations to the Phase variant set.
- Added clinvar_causal_allele column.
- Fixed a bug in grabix that caused occasional duplicate and missed variants.
- Add fitCons (http://biorxiv.org/content/early/2014/09/11/006825) scores as an additional measure of potential function in variants of interest, supplementing existing CADD and dbNSFP approaches.
- Updated Clinvar, COSMIC, and dbSNP to their latest versions.
- Provide an --annotation-dir argument that specifies the path to the annotation databases, to overwrite configured data inputs. Thanks to Björn Grüning.
- Support reproducible versioned installs of GEMINI with Python dependencies. Enables Galaxy integration. Thanks to Björn Grüning.
- Support arbitrary annotations supplied to VEP, which translate into queryable columns in the main variant table.
- Improve the power of the genotype filter wildcard functionality.
- Suppress openpyxl/pandas warnings (thanks to @chapmanb)
- Fix unit tests to account for cases where a user has not downloaded the CADD or GERP annotation files. Thanks to Xialoin Zhu and Daniel Swensson for reporting this and to Uma Paila for correcting it.
- Added support for CADD scores via new
- Added support for genotype wildcards in query select statements. E.g., SELECT chrom, start, end, (gts).(phenotype==2) FROM variants. See http://gemini.readthedocs.org/en/latest/content/querying.html#selecting-sample-genotypes-based-on-wildcards.
- Added support for genotype wildcards in the --gt-filter. E.g., --gt-filter "(gt_types).(phenotype==2).(==HET)". See http://gemini.readthedocs.org/en/latest/content/querying.html#gt-filter-wildcard-filtering-on-genotype-columns.
- Added support for the VCF INFO field both in the API and as a column that can be SELECT’ed.
- Upgraded to the latest version of ClinVar.
- Standardized impacts to use Sequence Ontology (SO) terms.
- Automatically add indexes to custom, user-supplied annotation columns.
- Improvements to the installation script.
- Fixed bugs in the handling of ClinVar UTF8 encoded strings.
- Upgraded the gene_detailed tables to version 75 of Ensembl.
- Added support for the MPI Mouse Phenotype database via the mam_phenotype_id column in the
- Enhanced security.
- Corrected the ESP allele frequencies to report _alternate_ allele frequency instead of _minor_ allele frequency.
- VEP version support updated (73-75). Support for aa length and bio type in VEP.
- The lof_sieve tool support has been extended to VEP annotations.
- Added the entrez_id columns to the
- Added COSMIC mutation information via new cosmic_ids column.
- New annotation: experimentally validated human enhancers from VISTA.
- Installation improvements to enable isolated installations inside of virtual machines and containers without data. Allow data-only upgrades as part of
- Fix for gemini query error when
- Fixed a bug that caused --gt-filter to not be enforced from the query tool unless a GT* column was selected.
- Support for ref and alt allele depths provided by FreeBayes.
- Fixed undetected bug preventing the comp_hets tool from functioning.
- Added unit tests for the
- Addition of permutation testing to the C-alpha test via the
- Addition of the --passonly option during loading to filter out all variants with a filter flag set.
- Fixed bug with parallel loading using the extended sample table format.
- SLURM support added.
- Refactor of loading options to remove explosion of xxx-queue options. Now
- Refactor of Sample class to handle the expanded samples table.
- Addition of --carrier-summary-by-phenotype for summarizing the counts of carriers and non-carriers stratified by the given sample phenotype column.
- Added a --nonsynonymous option to the C-alpha test.
- gemini amend to edit an existing database. For now only handles updating the samples table.
- Fixed a bug that prevented variants that overlapped with multiple 1000G variants from having AAF info extracted from 1000G annotations. This is now corrected such that multiple overlaps with 1000G variants are tolerated, yet the logic ensures that the AAF info is extracted for the correct variant.
- Fixed installation issues for the GEMINI browser.
- --show-families option to gemini query.
- Moved --tped and --json options into the more generic --format option.
- Fixed bug in handling missing phenotypes in the sample table.
- Fixed --tped output formatting error.
- API change: GeminiQuery.run takes an optional list of predicates that a row must pass to be returned.
- --sample-filter option added to allow for restricting variants to samples that pass the given sample query.
- ethnicity removed as a default PED field.
- PED file format extended to allow for extra columns to be added to the samples table under the column named in the header.
- The autosomal_recessive and autosomal_dominant tools now warn, but allow for variants to be detected in the absence of known parent/child relationships.
- Corrected bug in de_novo tool that was undetected in 0.6.0. Unit tests have been added to head this off in the future. Thanks to Jessica Chong.
- Added the -d option (minimum sequence depth allowed for a genotype) to the autosomal_recessive and autosomal_dominant tools.
- New --tped option in the query tool for reporting variants in TPED format. Thanks to Rory Kirchner.
- New --tfam option in the dump tool for reporting sample info in TFAM format. Thanks to Rory Kirchner.
- Add the --min-kindreds option to the autosomal_dominant tools to restrict candidate variants/genes to those affecting at least --min-kindreds. Thanks to Jessica Chong.
- Addition of a new burden tool for gene or region based burden tests. First release supports the C-alpha test. Thanks to Rory Kirchner.
- Use of Continuum Analytics Anaconda python package for the automated installer. Thanks to Brad Chapman.
- Enhancements to the annotate tool allowing one to create new database columns from values in custom BED+ annotation files. Thanks to Jessica Chong and Graham Ritchie.
- Addition of the --json options to the
- Improvements to unit tests.
- Allow alternate sample delimiters in the query tool via the --sample-delim option. Thanks to Jessica Chong.
- Provide a REST-like interface to the gemini browser. In support of future visualization tools.
- Allow the query tool to report results in JSON format via the
- Various minor improvements and bug fixes.
1. Tolerate either -9 or 0 for unknown parent or affected status in PED files.
2. Refine the rules for inheritance and parental affected status for autosomal dominant inheritance models.
3. de_novo mutation tools have received the following improvements.
   - improved speed (especially when there are multiple families)
   - by default, all columns in the variant table are reported and no conditions are placed on the returned variants. That is, as long as the variant meets the inheritance model, it will be reported.
   - the addition of a --columns option allowing one to override the above default behavior and report a subset of columns.
   - the addition of a --filter option allowing one to override the above default behavior and filter reported variants based on specific criteria.
4. The default minimum aligned sequencing depth for each variant reported by the de_novo tool is 0. Greater stringency can be applied with the
- Added new
- Added a new --show-samples option to the query module to display samples with alternate allele genotypes.
- Improvements and bug fixes for installation.
- Improved speed for adding custom annotations.
- Added GERP conserved elements.
- Optional addition of GERP conservation scores at base-pair resolution.
- Move annotation files to Amazon S3.