
The Ultimate Guide to Whole Genome Resequencing: Basic and Advanced Facts (IV)

ivan chen

Data Analysis

  1. Primary data analysis

Primary data analysis mainly refers to the preliminary analysis results that the sequencing company normally returns. It mainly includes the following contents.

1.1 Sequencing quality report

Software such as FastQC and NGS QC Toolkit is generally used to check the quality of the raw sequencing data. The main items include Basic Statistics, Per base sequence quality, Per sequence quality scores, Kmer Content, etc. These results are rarely shown in the main figures or tables of a published paper; most are placed in the supplementary materials for reference. In addition, sequencing technology and sequencing companies are now relatively mature in genome sequencing, so this part only needs to confirm that the sequencing results are reliable, serving as a raw data quality indicator for the next step of analysis.
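As a small illustration of what the per-base and per-sequence quality items measure, the sketch below decodes a FASTQ quality string into Phred scores and converts a score into an error probability (Q = -10 * log10(p)). It assumes the common Phred+33 encoding; the function names are hypothetical and are not part of FastQC or any other tool.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Convert a Phred score Q into the base-call error probability."""
    return 10 ** (-q / 10)


def mean_quality(quality_string, offset=33):
    """Mean Phred score of one read, as used for a per-sequence quality summary."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores)
```

For example, the character `I` decodes to Q40 under Phred+33, which corresponds to an error probability of 1 in 10,000.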

Data output statistics: read length, raw read count, total number of bases and sequencing depth

Quality control: basic filtering parameters, clean read statistics and the total number of high-quality bases after filtering

Mapping statistics: total mapped reads, unmapped reads, mapping rate and sequencing coverage
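The statistics listed above reduce to a few simple ratios. The following is a minimal sketch under common definitions (mean depth as total bases divided by genome size, coverage as the fraction of positions with at least a minimum depth); the function names and the `min_depth` parameter are illustrative assumptions, not the output format of any sequencing provider.

```python
def mapping_rate(mapped_reads, total_reads):
    """Fraction of reads that aligned to the reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads / total_reads


def mean_depth(total_bases, genome_size):
    """Average sequencing depth: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def coverage_fraction(per_base_depth, min_depth=1):
    """Fraction of reference positions covered by at least min_depth reads."""
    covered = sum(1 for d in per_base_depth if d >= min_depth)
    return covered / len(per_base_depth)
```

For instance, 90 Gb of sequence over a 3 Gb genome gives a mean depth of 30x.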

1.2 Genetic variation detection (SNP, InDel, CNV and SV detection and statistical analysis of coding and non-coding regions)

SNP calling detects all polymorphic sites across the whole genome; the calls are then filtered by quality value, sequencing depth, repeatability and other factors to obtain a highly reliable SNP data set. To identify SNPs more comprehensively and accurately, the results of several SNP detection algorithms are usually integrated (generally GATK + Samtools): the SNPs identified by the different algorithms are compared, and those that are highly consistent are retained as the final SNP set. The detected variants are then annotated according to the reference genome information. (Frequently used software: GATK: https://software.broadinstitute.org/gatk/; FreeBayes; Samtools: https://sourceforge.net/projects/samtools/?source=navbar)

These highly consistent SNPs also have very high credibility. The SNP identification algorithms used in the analysis include methods based on Bayesian inference and genotype likelihood calculation, and linkage disequilibrium (LD) or genotype inference techniques can be used to further improve the accuracy of SNP identification (common genotype inference software includes Beagle, IMPUTE2, fastPHASE and PHASE).
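The consistency analysis across callers can be pictured as a set intersection. The sketch below assumes each caller's output has already been parsed into `(chrom, pos, ref, alt)` tuples; the function name and the `min_callers` parameter are hypothetical, and real pipelines normally also normalize variant representation before comparing.

```python
from collections import Counter


def consensus_snps(call_sets, min_callers=None):
    """Keep SNPs reported by at least min_callers callers.

    call_sets: list of iterables of (chrom, pos, ref, alt) tuples,
               one iterable per caller (assumed already parsed).
    min_callers: defaults to all callers, i.e. a strict intersection.
    """
    if min_callers is None:
        min_callers = len(call_sets)
    counts = Counter()
    for calls in call_sets:
        counts.update(set(calls))  # count each site once per caller
    return {site for site, n in counts.items() if n >= min_callers}


# Toy example with two callers
gatk_calls = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")]
samtools_calls = [("chr1", 100, "A", "G"), ("chr2", 42, "G", "A")]
shared = consensus_snps([gatk_calls, samtools_calls])
```

Requiring agreement between all callers trades sensitivity for reliability; lowering `min_callers` relaxes that trade-off.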

Statistical distribution of SNV allele frequencies across the genome

The distribution of the proportion of rare alleles is compared across SNV types. The SNV categories considered are mainly: (1) nonsense, (2) non-synonymous in chemical structure, (3) all non-synonymous, (4) conservative non-synonymous, (5) non-coding, (6) synonymous, and other types of SNV. In addition, for the conservation analysis, SNVs in non-coding regions are classified by conservation type and their distribution is examined.

The objects of analysis include newly predicted SNPs, indels and large deletions, and the fraction of exonic SNPs in each allele frequency category. "Newly predicted" refers to SNPs, indels and deletions that are absent from dbSNP (version 129), the deletion database dbVar (June 2010 version) and published genomic indel data. dbSNP contains SNPs and indels; dbVar contains deletions, duplications and mobile element insertions. Short indels and large deletions are also provided by dbRIP and by other genome studies (the J. C. Venter and Watson genomes, and the Yanhuang Project Asian genome).

Calculate the size distributions of SNPs, deletions and insertions, and the proportion of newly predicted SNPs, deletions and insertions relative to the entries already in the reference databases (dbSNP, which contains SNPs and indels; dbVar, which contains deletions, duplications and mobile element insertions; and the short indels and large deletions reported by genome studies such as the J. C. Venter and Watson genomes and the Yanhuang Project Asian genome). The size distribution can also reveal the characteristic positions of LINE and Alu elements.
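A rough sketch of the novel-variant proportion per allele frequency category is given below. It assumes variants are identified by a hashable key, that the set of known variants has already been loaded from a reference database, and that the bin edges (1%, 5%, 50%) are an arbitrary illustrative choice; none of these names come from the original analysis.

```python
from bisect import bisect_right


def novel_fraction_by_frequency(variants, known_ids, bin_edges=(0.01, 0.05, 0.5)):
    """Fraction of novel variants in each allele frequency bin.

    variants: iterable of (variant_id, allele_frequency) pairs.
    known_ids: set of variant ids present in the reference database.
    bin_edges: ascending cut points; yields len(bin_edges) + 1 bins.
    Returns one fraction per bin, or None for an empty bin.
    """
    totals = [0] * (len(bin_edges) + 1)
    novel = [0] * (len(bin_edges) + 1)
    for variant_id, af in variants:
        i = bisect_right(bin_edges, af)  # index of the bin containing af
        totals[i] += 1
        if variant_id not in known_ids:
            novel[i] += 1
    return [n / t if t else None for n, t in zip(novel, totals)]
```

With these edges, the first bin holds variants with frequency below 1%, where novel calls are usually expected to be most common.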

InDel detection and distribution in the genome:

During mapping, gap-tolerant alignment is performed so that credible short InDels can be detected. In the detection process, the gap is 15 bases in length, and each InDel call must be supported by at least 3 paired-end reads. In theory, an insertion or deletion of 150 bp can be detected.
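The read-support requirement can be expressed as a simple filter. This is only a sketch: the candidate records are assumed to be dictionaries with hypothetical `ref`, `alt` and `support` fields, where `support` is the number of paired-end reads backing the call, and the threshold of 3 is taken from the text above.

```python
def indel_length(ref, alt):
    """Signed length change: positive for insertions, negative for deletions."""
    return len(alt) - len(ref)


def filter_indels(candidates, min_support=3):
    """Keep InDel candidates supported by at least min_support paired-end reads."""
    return [c for c in candidates if c["support"] >= min_support]


# Toy candidates (field names are illustrative assumptions)
candidates = [
    {"chrom": "chr1", "pos": 500, "ref": "A", "alt": "ATG", "support": 5},
    {"chrom": "chr1", "pos": 900, "ref": "CTT", "alt": "C", "support": 2},
]
kept = filter_indels(candidates)
```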

CNV copy number variation and SV structural variation detection and distribution in the genome:

The main types of structural variation that can be detected are insertion, deletion, duplication, inversion, translocation, etc. Based on the alignment of the sequenced individual's reads against the reference genome sequence, structural variation is detected at the whole-genome level and the detected variants are annotated.

Copy number variation detection software: CNVnator

1.3 Variation type annotation (occurrence area statistics)

Common software includes SnpEff, ANNOVAR, etc.

1.4 Statistics of codon and amino acid changes

1.5 Statistics of base substitution type and ratio
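One common way to summarize base substitution types is to count each ref-to-alt change and report the transition/transversion (Ts/Tv) ratio. The sketch below assumes SNPs are supplied as `(ref, alt)` single-base pairs; the function name is hypothetical.

```python
from collections import Counter

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T)
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def substitution_spectrum(snps):
    """Count substitution types and compute the Ts/Tv ratio.

    snps: iterable of (ref, alt) single-base pairs.
    Returns (counts, transitions, transversions, ratio).
    """
    counts = Counter((ref.upper(), alt.upper()) for ref, alt in snps)
    ts = sum(n for pair, n in counts.items() if pair in TRANSITIONS)
    tv = sum(counts.values()) - ts
    ratio = ts / tv if tv else float("inf")
    return counts, ts, tv, ratio
```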

1.6 Statistics of variation distribution of each gene

1.7 Candidate site detection, statistics, annotation

1.8 Candidate gene GO, KEGG function annotation

Biological pathways, including metabolic pathways and signal transduction pathways, are important components of biological function. The various forms of mutation, including SNVs and SNPs, are mapped onto biological pathways and analyzed together to investigate how strongly, and in what pattern, functional mutations affect each pathway. Methods such as GSEA (with microarray expression profiling data), the KS test and the hypergeometric distribution test are used to rank pathways by their enrichment for mutated genes and so identify pathways with potential functional changes.
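Of the methods mentioned above, the hypergeometric test is the simplest to write down. The sketch below computes the one-sided p-value for seeing at least the observed number of mutated genes in a pathway, then ranks pathways by that p-value. The function names and the input layout are assumptions made for illustration, and no multiple-testing correction is applied here.

```python
from math import comb


def hypergeom_pvalue(population, pathway_size, mutated, overlap):
    """P(X >= overlap) when drawing `mutated` genes from `population` genes,
    of which `pathway_size` belong to the pathway."""
    upper = min(pathway_size, mutated)
    total = comb(population, mutated)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, mutated - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def rank_pathways(pathways, mutated_genes, population):
    """Rank pathways by enrichment p-value, most significant first.

    pathways: dict mapping pathway name -> set of gene names.
    mutated_genes: set of genes carrying candidate mutations.
    population: total number of annotated genes considered.
    """
    results = []
    for name, genes in pathways.items():
        overlap = len(genes & mutated_genes)
        p = hypergeom_pvalue(population, len(genes), len(mutated_genes), overlap)
        results.append((name, overlap, p))
    return sorted(results, key=lambda r: r[2])
```

In practice the resulting p-values would be adjusted for the number of pathways tested before any pathway is reported as enriched.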

To be continued in Part V…
