Advances in high-throughput sequencing, together with its falling cost, give researchers access to large volumes of genetic data. Depending on the experimental goals, methods, and pipeline tools, the extent of data analysis required can vary widely. Some bioinformatics packages integrate easily into central analysis pipelines, while others are better suited to use as standalone tools.
At the most basic level, the sequencing instrument itself performs the initial data processing. This instrument-generated output includes the base calls, a quality score for every base call, and other machine-generated statistics about the sequencing run as a whole. FASTQ files are a typical representation of this data. In primary analysis, FastQC software reports on the quality of the sequencing reads and helps determine how to proceed with the data. Most researchers are interested in more than just “raw” sequences; once data quality has been checked, the FASTQ files are passed on to secondary analysis.
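To make the per-base quality information concrete, the following minimal sketch reads FASTQ records and converts their quality strings to numeric Phred scores. It assumes simple four-line records and the Phred+33 encoding; other encodings exist, so the offset should be checked for a given data set. The function names (`parse_fastq`, `phred_scores`, `mean_quality`) and the example read are illustrative only and are not part of FastQC or any other tool mentioned here.

```python
from statistics import mean


def parse_fastq(lines):
    """Yield (read_id, sequence, quality_string) tuples.

    Assumes four lines per record: '@' header, sequence, '+' separator,
    quality string. Wrapped (multi-line) sequences are not handled.
    """
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert a quality string to integer Phred scores (assumed Phred+33)."""
    return [ord(ch) - offset for ch in qual]


def mean_quality(qual, offset=33):
    """Average Phred score across one read."""
    return mean(phred_scores(qual, offset))


# Hypothetical single-read example, invented for illustration.
EXAMPLE = """@read1
GATTACA
+
IIIII##
"""

if __name__ == "__main__":
    for read_id, seq, qual in parse_fastq(EXAMPLE.splitlines()):
        print(read_id, seq, phred_scores(qual), round(mean_quality(qual), 2))
```

A read-level summary like this is only a rough stand-in for the per-position and per-run reports that a dedicated quality-control tool produces, but it shows what the quality line of a FASTQ record encodes.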
Experimental goals, sequencing depth, and sample type determine the kind of secondary analysis required. The DNA reads stored in FASTQ files must first be reassembled before they can yield biological insight about the samples. Depending on the sample type, quality-filtered reads are either assembled de novo or mapped to a well-characterized reference genome. For de novo assembly, read length plays an important role in determining the most appropriate software. In the more common case of mapping to a reference, the most familiar DNA aligners are BWA and Bowtie. Mapping statistics can be gathered with the SAMtools package (the flagstat or idxstats commands) and provide valuable information about how well the sample matches the reference. After alignment and collection of mapping statistics, variant detection is a potential next step. For mutation calling, the SAMtools mpileup command can be followed by bcftools to produce binary BCF files, which are then converted to VCF (Variant Call Format) files. Alternatively, the UnifiedGenotyper tool from GATK could be used. These tools can be run individually or linked together by an informatics specialist into a custom DNA-Seq analysis pipeline. Galaxy offers a user-friendly alternative: it is server-based and allows these tools to be linked into workflows with little effort. Once files are uploaded to the Galaxy server, a workflow can be started, and every tool in the workflow runs on the input provided. Galaxy also allows samples to be processed in parallel.
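As a rough illustration of how an informatics specialist might chain these tools, the sketch below assembles the command lines for one sample: index the reference, align with BWA, sort and index the alignments, gather mapping statistics, and call variants. It only builds and prints the commands and does not run them. The file names and the `build_pipeline` function are hypothetical, and the flags shown assume reasonably recent releases of BWA, SAMtools, and bcftools, in which mpileup is invoked through bcftools and VCF is written directly; older installations may need the separate BCF-to-VCF conversion described above.

```python
import shlex


def build_pipeline(reference, reads, sample):
    """Return shell command strings for a simple single-sample DNA-Seq run.

    All paths are quoted with shlex.quote so the strings are safe to paste
    into a shell. Nothing is executed here.
    """
    ref = shlex.quote(reference)
    fq = shlex.quote(reads)
    bam = shlex.quote(f"{sample}.sorted.bam")
    vcf = shlex.quote(f"{sample}.vcf")
    return [
        # Build the BWA index for the reference genome.
        f"bwa index {ref}",
        # Align reads and write a coordinate-sorted BAM file.
        f"bwa mem {ref} {fq} | samtools sort -o {bam} -",
        # Index the BAM file (needed by idxstats).
        f"samtools index {bam}",
        # Mapping statistics: overall counts and per-reference counts.
        f"samtools flagstat {bam}",
        f"samtools idxstats {bam}",
        # Pile up reads against the reference and call variants as VCF.
        f"bcftools mpileup -f {ref} {bam} | bcftools call -mv -Ov -o {vcf}",
    ]


if __name__ == "__main__":
    # Hypothetical inputs, used only to show the generated commands.
    for command in build_pipeline("ref.fa", "reads.fq", "sample1"):
        print(command)
```

Keeping the steps as data rather than running them immediately makes it easy to review the commands, log them, or hand them to a scheduler, which is similar in spirit to defining a workflow once and applying it to each uploaded sample.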
Once multiple samples have been processed to this point, researchers can continue with tertiary analysis by comparing samples with one another, or by comparing the relative frequencies of variants in the samples with population data from efforts such as HapMap and the 1000 Genomes Project. Interpreting the data is a very complex task, but it remains essential for determining the biological significance of the samples.
As illustrated, sequence analysis can be highly variable, and an abundance of tools exists to handle high-throughput data. Depending on the experimental design and requirements, an analysis pipeline can stop at any of these levels of analysis. As the NGS field continues to grow, best practices for these tools will become more standardized, making analysis capabilities available to more users.