Quick start

gemini is designed to allow researchers to explore genetic variation contained in a VCF file. The basic workflow for working with gemini is outlined below.

Importing VCF files into gemini.

Note

We now recommend splitting variants with multiple alternate alleles, then left-aligning and trimming all variants, before loading them into gemini.

See Step 1. split, left-align, and trim variants for a detailed explanation.
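As a rough illustration of what that preprocessing can look like, the sketch below pipes a VCF through the vt toolkit. It assumes vt is installed; the file names my.vcf, reference.fa, and my.norm.vcf and the helper normalize_vcf are placeholders invented for this example. Annotation with VEP or snpEff would then be run on the normalized file before loading.

```shell
#!/bin/sh
# Placeholder file names (assumptions); substitute your own.
VCF=my.vcf
REF=reference.fa
NORMVCF=my.norm.vcf

# Hypothetical helper: split multi-allelic records with `vt decompose -s`,
# then left-align and trim against the reference with `vt normalize -r`.
normalize_vcf() {
    vt decompose -s "$1" | vt normalize -r "$2" - > "$3"
}

# Run only when vt and both input files are actually present.
if command -v vt >/dev/null 2>&1 && [ -f "$VCF" ] && [ -f "$REF" ]; then
    normalize_vcf "$VCF" "$REF" "$NORMVCF"
    STATUS="normalized"
else
    STATUS="skipped"
fi
echo "$STATUS"
```

The guard keeps the script from failing on a machine where vt or the inputs are missing; the linked section remains the authoritative description of the recommended steps.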

Before we can use GEMINI to explore genetic variation, we must first load our VCF file into the GEMINI database framework. We expect you to have first annotated the functional consequence of each variant in your VCF using either VEP or snpEff (note that snpEff v3.0 or later is required to track the amino acid length of each impacted transcript). The loading step is done with the gemini load command. Below are two examples based on a VCF file that we creatively name my.vcf. The first example assumes that the VCF has been pre-annotated with VEP, and the second assumes snpEff.

# VEP-annotated VCF
$ gemini load -v my.vcf -t VEP my.db

# snpEff-annotated VCF
$ gemini load -v my.vcf -t snpEff my.db

Assuming you have a valid VCF file produced by standard variation discovery programs (e.g., GATK, FreeBayes, etc.), one loads the VCF into the gemini framework with the load submodule:

$ gemini load -v my.vcf my.db

In this step, gemini reads and loads the my.vcf file into a SQLite database named my.db, whose structure is described here. While loading the database, gemini computes many additional population genetics statistics that support downstream analyses. It also stores the genotypes for each sample at each variant in an efficient data structure that minimizes the database size.

Loading is by far the slowest aspect of GEMINI. Using multiple CPUs can greatly speed up this process.

$ gemini load -v my.vcf --cores 8 my.db
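The load options shown above can be combined in a small wrapper script. The following is a minimal sketch, not part of gemini itself: build_load_cmd is a hypothetical helper, and it assumes that -t and --cores can be passed together in one invocation. The command is printed first and executed only if gemini and the input file are available.

```shell
#!/bin/sh
# Hypothetical helper: assemble a `gemini load` command from a VCF path,
# an optional annotation type (VEP or snpEff), an optional core count,
# and a database name. Empty arguments leave the matching option out.
build_load_cmd() {
    vcf=$1; annot=$2; cores=$3; db=$4
    cmd="gemini load -v $vcf"
    if [ -n "$annot" ]; then cmd="$cmd -t $annot"; fi
    if [ -n "$cores" ]; then cmd="$cmd --cores $cores"; fi
    echo "$cmd $db"
}

LOAD_CMD=$(build_load_cmd my.vcf VEP 8 my.db)
echo "$LOAD_CMD"

# Execute only when gemini is installed and the VCF exists.
# Relies on word splitting, so paths must not contain spaces.
if command -v gemini >/dev/null 2>&1 && [ -f my.vcf ]; then
    $LOAD_CMD
fi
```

Printing the command before running it makes it easy to check which annotation type and core count will be used, which matters because loading is the slowest step.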

Querying the gemini database.

If you are familiar with SQL, gemini allows you to directly query the database in search of interesting variants via the -q option. For example, here is a query to identify all novel, loss-of-function variants in your database:

$ gemini query -q "select * from variants where is_lof = 1 and in_dbsnp = 0" my.db

Or, we can ask for all variants that substantially deviate from Hardy-Weinberg equilibrium:

$ gemini query -q "select * from variants where hwe < 0.01" my.db
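Because the -q argument is ordinary SQL, the two filters above can be combined and the output narrowed to a few columns. The sketch below is illustrative only: the column names chrom, start, end, ref, alt, and gene, the --header option, and the output file candidates.txt are assumptions for this example, so check them against the database schema and the query help before relying on them.

```shell
#!/bin/sh
DB=my.db

# Novel loss-of-function variants that also deviate from Hardy-Weinberg
# equilibrium. The selected column names are assumed, not taken from the text.
QUERY="select chrom, start, end, ref, alt, gene, hwe from variants where is_lof = 1 and in_dbsnp = 0 and hwe < 0.01"

echo "$QUERY"

# Run the query only when gemini is installed and the database exists.
if command -v gemini >/dev/null 2>&1 && [ -f "$DB" ]; then
    gemini query --header -q "$QUERY" "$DB" > candidates.txt
fi
```

Keeping the SQL in a variable makes it simple to reuse the same filter across several databases.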