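  • NGS data analysis: transcriptome, epigenome, resequencing, metagenome, metagenomic shotgun analysis workflow (unfinished), metagenomic shotgun research strategies, metagenomic 16S analysis workflow; third-generation sequencing: introduction to third-generation sequencing, third-generation sequencing analysis tools (unfinished); pan-genome: introduction to pan-genomes; population...
  • An automated analysis workflow for genomic NGS data
  • For use and study alongside my blog: https://blog.csdn.net/abcba101/article/details/102810927
  • NGS summary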

    2019-04-12 18:36:14
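    Downloading and organizing raw SRA sequencing data, genome annotation files and reference genome data; installing and using the RNA-seq analysis tools hisat2 and stringtie; obtaining differentially expressed genes.
  • NGS principles and technology (Ion Proton)
  • NGS handbook

    2017-09-11 12:32:53
    Very useful for bioinformatics analysis and worth recommending, even more so if you know some R. I hope it helps people doing related research. I meant to offer it as a free download, but since I have no points myself, I will treat it as a small source of income.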
  • A comprehensive list of NGS data analysis software packages

    A comprehensive list of NGS data analysis software packages

    Integrated solutions

    *CLCbio Genomics Workbench-

    de novo and reference assembly of Sanger, Roche FLX, Illumina, Helicos, and SOLiD data. Commercial next-gen-seq software that extends the CLCbio Main Workbench software. Includes SNP detection, ChIP-seq, browser and other features. Commercial. Windows, Mac OS X and Linux.

    *Galaxy-

    Galaxy = interactive and reproducible genomics. A job

    webportal.

    *Genomatix-

    Integrated Solutions for Next Generation Sequencing data

    analysis.

    *JMP Genomics-

    Next gen visualization and statistics tool from SAS. They are working with NCGR to refine this tool and produce others.

    *NextGENe-

    de novo and reference assembly of Illumina, SOLiD and Roche FLX data. Uses a novel Condensation Assembly Tool approach where reads are joined via "anchors" into mini-contigs before assembly. Includes SNP detection, ChIP-seq, browser and other features. Commercial. Win or MacOS.

    *SeqMan

    Genome Analyser-

    Software for Next Generation sequence assembly of Illumina, Roche

    FLX and Sanger data integrating with Lasergene Sequence Analysis

    software for additional analysis and visualization capabilities.

    Can use a hybrid templated/de novo approach. Commercial. Win or Mac

    OS X.

    *SHORE-

    SHORE, for Short Read, is a mapping and analysis pipeline for short

    DNA sequences produced on a Illumina Genome Analyzer. A suite

    created by the 1001 Genomes project. Source for

    POSIX.

    *SlimSearch-

    Fledgling commercial product.

    Align/Assemble to a reference

    *BFAST-

    Blat-like Fast Accurate Search Tool. Written by Nils Homer, Stanley

    F. Nelson and Barry Merriman at UCLA.

    *Bowtie-

    Ultrafast, memory-efficient short read aligner. It aligns short DNA

    sequences (reads) to the human genome at a rate of 25 million reads

    per hour on a typical workstation with 2 gigabytes of memory. Uses

    a Burrows-Wheeler-Transformed (BWT) index.

    Written by Ben Langmead and Cole Trapnell. Linux, Windows, and Mac

    OS X.
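The BWT index mentioned for Bowtie can be illustrated with a toy sketch. This is not Bowtie's implementation: it builds the transform by sorting all rotations (fine only for tiny strings) and counts exact matches with a naive backward search. The function names are illustrative assumptions.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform of text, with '$' appended as end marker."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count exact occurrences of pattern using backward search on the BWT."""
    first_col = sorted(bwt_str)
    # c_table[ch]: number of characters in the text that sort before ch
    c_table = {}
    for i, ch in enumerate(first_col):
        c_table.setdefault(ch, i)

    def occ(ch: str, i: int) -> int:
        # Occurrences of ch in the first i characters of the BWT (naive scan)
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ(ch, lo)
        hi = c_table[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

Real aligners replace the naive `occ` scan with sampled occurrence tables so that each step takes constant time, and they add mismatch handling on top of the exact search.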

    *BWA-

    Heng Li's BWT Alignment program - a progression from Maq. BWA is a

    fast light-weighted tool that aligns short sequences to a sequence

    database, such as the human reference genome. By default, BWA finds

    an alignment within edit distance 2 to the query sequence. C++

    source.

    *ELAND-

    Efficient Large-Scale Alignment of Nucleotide Databases. Whole

    genome alignments to a reference genome. Written by Illumina author

    Anthony J. Cox for the Solexa 1G machine.

    *Exonerate-

    Various forms of pairwise alignment (including

    Smith-Waterman-Gotoh) of DNA/protein against a reference. Authors

    are Guy St C Slater and Ewan Birney from EMBL. C for

    POSIX.

    *GenomeMapper-

    GenomeMapper is a short read mapping tool designed for accurate

    read alignments. It quickly aligns millions of reads either with

    ungapped or gapped alignments. A tool created by the 1001 Genomes

    project. Source for POSIX.

    *GMAP-

    GMAP (Genomic Mapping and Alignment Program) for mRNA and EST

    Sequences. Developed by Thomas Wu and Colin Watanabe at Genentech.

    C/Perl for Unix.

    *gnumap-

    The Genomic Next-generation Universal MAPper (gnumap) is a program

    designed to accurately map sequence data obtained from

    next-generation sequencing machines (specifically that of

    Solexa/Illumina) back to a genome of any size. It seeks to align

    reads from nonunique repeats using statistics. From authors at

    Brigham Young University. C source/Unix.

    *MAQ-

    Mapping and Assembly with Qualities (renamed from MAPASS2).

    Particularly designed for Illumina with preliminary functions to

    handle ABI SOLiD data. Written by Heng Li from the Sanger Centre.

    Features extensive supporting tools for DIP/SNP detection, etc. C++

    source

    *MOSAIK-

    MOSAIK produces gapped alignments using the Smith-Waterman

    algorithm. Features a number of support tools. Support for Roche

    FLX, Illumina, SOLiD, and Helicos. Written by Michael Strömberg at

    Boston College. Win/Linux/MacOSX

    *MrFAST and

    MrsFAST-

    mrFAST & mrsFAST are designed to map short reads generated with

    the Illumina platform to reference genome assemblies; in a fast and

    memory-efficient manner. Robust to INDELs and MrsFAST has a

    bisulphite mode. Authors are from the University of Washington. C

    as source.

    *MUMmer-

    MUMmer is a modular system for the rapid whole genome alignment of

    finished or draft sequence. Released as a package providing an

    efficient suffix tree library, seed-and-extend alignment, SNP

    detection, repeat detection, and visualization tools. Version 3.0

    was developed by Stefan Kurtz, Adam Phillippy, Arthur L Delcher,

    Michael Smoot, Martin Shumway, Corina Antonescu and Steven L

    Salzberg - most of whom are at The Institute for Genomic Research

    in Maryland, USA. POSIX OS required.

    *Novocraft-

    Tools for reference alignment of paired-end and single-end Illumina

    reads. Uses a Needleman-Wunsch algorithm. Can support Bis-Seq.

    Commercial. Available free for evaluation, educational use and for

    use on open not-for-profit projects. Requires Linux or Mac OS

    X.

    *PASS-

    It supports Illumina, SOLiD and Roche-FLX data formats and allows

    the user to modulate very finely the sensitivity of the alignments.

    Spaced seed initial filter, then NW dynamic algorithm to a SW(like)

    local alignment. Authors are from CRIBI in Italy.

    Win/Linux.

    *RMAP-

    Assembles 20 - 64 bp Illumina reads to a FASTA reference genome. By

    Andrew D. Smith and Zhenyu Xuan at CSHL. (published in BMC

    Bioinformatics). POSIX OS required.

    *SeqMap-

    Supports up to 5 or more bp mismatches/INDELs. Highly tunable.

    Written by Hui Jiang from the Wong lab at Stanford. Builds

    available for most OS's.

    *SHRiMP-

    Assembles to a reference sequence. Developed with Applied

    Biosystem's colourspace genomic representation in mind. Authors are

    Michael Brudno and Stephen Rumble at the University of Toronto.

    POSIX.

    *Slider-

    An application for the Illumina Sequence Analyzer output that uses

    the probability files instead of the sequence files as an input for

    alignment to a reference sequence or a set of reference sequences.

    Authors are from BCGSC.

    *SOAP-

    SOAP (Short Oligonucleotide Alignment Program). A program for

    efficient gapped and ungapped alignment of short oligonucleotides

    onto reference sequences. The updated version uses a BWT. Can call

    SNPs and INDELs. Author is Ruiqiang Li at the Beijing Genomics

    Institute. C++, POSIX.

    *SSAHA-

    SSAHA (Sequence Search and Alignment by Hashing Algorithm) is a

    tool for rapidly finding near exact matches in DNA or protein

    databases using a hash table. Developed at the Sanger Centre by

    Zemin Ning, Anthony Cox and James Mullikin. C++ for

    Linux/Alpha.

    *SOCS-

    Aligns SOLiD data. SOCS is built on an iterative variation of the

    Rabin-Karp string search algorithm, which uses hashing to reduce

    the set of possible matches, drastically increasing search speed.

    Authors are Ondov B, Varadarajan A, Passalacqua KD and Bergman

    NH.

    *SWIFT-

    The SWIFT suite is a software collection for fast index-based

    sequence comparison. It contains: SWIFT — fast local alignment

    search, guaranteeing to find epsilon-matches between two sequences.

    SWIFT BALSAM — a very fast program to find semiglobal non-gapped

    alignments based on k-mer seeds. Authors are Kim Rasmussen (SWIFT)

    and Wolfgang Gerlach (SWIFT BALSAM)

    *SXOligoSearch-

    SXOligoSearch is a commercial platform offered by the Malaysian-based Synamatix.

    Will align Illumina reads against a range of Refseq RNA or NCBI

    genome builds for a number of organisms. Web Portal. OS

    independent.

    *Vmatch-

    A versatile software tool for efficiently solving large scale

    sequence matching tasks. Vmatch subsumes the software tool REPuter,

    but is much more general, with a very flexible user interface, and

    improved space and time requirements. Essentially a large string

    matching toolbox. POSIX.

    *Zoom-

    ZOOM (Zillions Of Oligos Mapped) is designed to map millions of

    short reads, generated by next-generation sequencing technology, back

    to the reference genomes, and carry out post-analysis. ZOOM is

    developed to be highly accurate, flexible, and user-friendly with

    speed being a critical priority. Commercial. Supports Illumina and

    SOLiD data.

    De novo Align/Assemble

    *ABySS-

    Assembly By Short Sequences. ABySS is a de novo sequence assembler

    that is designed for very short reads. The single-processor version

    is useful for assembling genomes up to 40-50 Mbases in size. The

    parallel version is implemented using MPI and is capable of

    assembling larger genomes. By Simpson JT and others at the Canada's

    Michael Smith Genome Sciences Centre. C++ as source.

    *ALLPATHS-

    ALLPATHS: De novo assembly of whole-genome shotgun microreads.

    ALLPATHS is a whole genome shotgun assembler that can generate high

    quality assemblies from short reads. Assemblies are presented in a

    graph form that retains ambiguities, such as those arising from

    polymorphism, thereby providing information that has been absent

    from previous genome assemblies. Broad

    Institute.

    *Edena-

    Edena (Exact DE Novo Assembler) is an assembler dedicated to

    process the millions of very short reads produced by the Illumina

    Genome Analyzer. Edena is based on the traditional overlap layout

    paradigm. By D. Hernandez, P. François, L. Farinelli, M. Osteras,

    and J. Schrenzel. Linux/Win.

    *EULER-SR-

    Short read de novo assembly.

    By Mark J. Chaisson and Pavel A. Pevzner from UCSD (published in

    Genome Research). Uses a de Bruijn graph

    approach.

    *MIRA2-

    MIRA (Mimicking Intelligent Read Assembly) is able to perform true

    hybrid de-novo assemblies using reads gathered through 454

    sequencing technology (GS20 or GS FLX). Compatible with 454, Solexa

    and Sanger data. Linux OS required.

    *SEQAN-

    A Consistency-based Consensus Algorithm for De Novo and

    Reference-guided Sequence Assembly of Short Reads. By Tobias Rausch

    and others. C++, Linux/Win.

    *SHARCGS-

    De novo assembly of short reads. Authors are Dohm JC, Lottaz C,

    Borodina T and Himmelbauer H. from the Max-Planck-Institute for

    Molecular Genetics.

    *SSAKE-

    The Short Sequence Assembly by K-mer search and 3' read Extension

    (SSAKE) is a genomics application for aggressively assembling

    millions of short nucleotide sequences by progressively searching

    for perfect 3'-most k-mers using a DNA prefix tree. Authors are

    René Warren, Granger Sutton, Steven Jones and Robert Holt from the

    Canada's Michael Smith Genome Sciences Centre.

    Perl/Linux.

    *SOAPdenovo-

    Part of the SOAP suite. See above.

    *VCAKE-

    De novo assembly of short reads with robust error correction. An

    improvement on early versions of SSAKE.

    *Velvet-

    Velvet is a de novo genomic assembler specially designed for short

    read sequencing technologies, such as Solexa or 454. Need about

    20-25X coverage and paired reads. Developed by Daniel Zerbino and

    Ewan Birney at the European Bioinformatics Institute

    (EMBL-EBI).

    SNP/Indel Discovery

    *ssahaSNP-

    ssahaSNP is a polymorphism detection tool. It detects homozygous

    SNPs and indels by aligning shotgun reads to the finished genome

    sequence. Highly repetitive elements are filtered out by ignoring

    those kmer words with high occurrence numbers. More tuned for ABI

    Sanger reads. Developers are Adam Spargo and Zemin Ning from the

    Sanger Centre. Compaq Alpha, Linux-64, Linux-32, Solaris and

    Mac

    *PolyBayesShort-

    A re-incarnation of the PolyBayes SNP discovery tool developed by

    Gabor Marth at Washington University. This version is specifically

    optimized for the analysis of large numbers (millions) of

    high-throughput next-generation sequencer reads, aligned to whole

    chromosomes of model organism or mammalian genomes. Developers at

    Boston College. Linux-64 and Linux-32.

    *PyroBayes-

    PyroBayes is a novel base caller for pyrosequences from the 454

    Life Sciences sequencing machines. It was designed to assign more

    accurate base quality estimates to the 454 pyrosequences.

    Developers at Boston College.

    Genome Annotation/Genome Browser/Alignment Viewer/Assembly

    Database

    *EagleView-

    An information-rich genome assembler viewer. EagleView can display

    a dozen different types of information including base quality and

    flowgram signal. Developers at Boston

    College.

    *LookSeq-

    LookSeq is a web-based application for alignment visualization,

    browsing and analysis of genome sequence data. LookSeq supports

    multiple sequencing technologies, alignment sources, and viewing

    modes; low or high-depth read pileups; and easy visualization of

    putative single nucleotide and structural variation. From the

    Sanger Centre.

    *MapView-

    MapView: visualization of short reads alignment on desktop

    computer. From the Evolutionary Genomics Lab at Sun-Yat Sen

    University, China. Linux.

    *SAM-

    Sequence Assembly Manager. Whole Genome Assembly (WGA) Management

    and Visualization Tool. It provides a generic platform for

    manipulating, analyzing and viewing WGA data, regardless of input

    type. Developers are Rene Warren, Yaron Butterfield, Asim Siddiqui

    and Steven Jones at Canada's Michael Smith Genome Sciences Centre.

    MySQL backend and Perl-CGI web-based frontend/Linux.

    *STADEN-

    Includes GAP4. GAP5 once completed will handle next-gen sequencing

    data. A partially implemented test version is available.

    *XMatchView-

    A visual tool for analyzing cross_match alignments. Developed by

    Rene Warren and Steven Jones at Canada's Michael Smith Genome

    Sciences Centre. Python/Win or Linux.

    Counting e.g. ChIP-Seq, Bis-Seq, CNV-Seq

    *BS-Seq-

    The source code and data for the "Shotgun Bisulphite Sequencing of

    the Arabidopsis Genome Reveals DNA Methylation Patterning" Nature

    paper by Cokus et al. (Steve

    Jacobsen's lab at UCLA). POSIX.

    *CHiPSeq-

    Program used by Johnson et al. (2007) in their Science

    publication

    *CNV-Seq-

    CNV-seq, a new method to detect copy number variation using

    high-throughput sequencing. Chao Xie and Martti T Tammi at the

    National University of Singapore. Perl/R.

    *FindPeaks-

    perform analysis of ChIP-Seq experiments. It uses a naive algorithm

    for identifying regions of high coverage, which represent Chromatin

    Immunoprecipitation enrichment of sequence fragments, indicating

    the location of a bound protein of interest. Original algorithm by

    Matthew Bainbridge, in collaboration with Gordon Robertson. Current

    code and implementation by Anthony Fejes. Authors are from the

    Canada's Michael Smith Genome Sciences Centre. JAVA/OS independent.

    Latest versions available as part of the Vancouver Short Read Analysis Package

    *MACS-

    Model-based Analysis for ChIP-Seq. MACS empirically models the

    length of the sequenced ChIP fragments, which tends to be shorter

    than sonication or library construction size estimates, and uses it

    to improve the spatial resolution of predicted binding sites. MACS

    also uses a dynamic Poisson distribution to effectively capture

    local biases in the genome sequence, allowing for more sensitive

    and robust prediction. Written by Yong Zhang and Tao Liu from

    Xiaole Shirley Liu's Lab.

    *PeakSeq-

    PeakSeq: Systematic Scoring of ChIP-Seq Experiments Relative to

    Controls. a two-pass approach for scoring ChIP-Seq data relative to

    controls. The first pass identifies putative binding sites and

    compensates for variation in the mappability of sequences across

    the genome. The second pass filters out sites that are not

    significantly enriched compared to the normalized input DNA and

    computes a precise enrichment and significance. By Rozowsky J et

    al. C/Perl.

    *QuEST-

    Quantitative Enrichment of Sequence Tags. Sidow and Myers Labs at

    Stanford. From the 2008 publication "Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data". (C++)

    *SISSRs-

    Site Identification from Short Sequence Reads. BED file input. Raja

    Jothi @ NIH. Perl.


    Alternate Base Calling

    *Rolexa-

    R-based framework for base calling of Solexa data.


    *Alta-cyclic-

    "a novel Illumina Genome-Analyzer (Solexa) base

    caller"

    Transcriptomics

    *ERANGE-

    Mapping and Quantifying Mammalian Transcriptomes by RNA-Seq.

    Supports Bowtie, BLAT and ELAND. From the Wold

    lab.

    *G-Mo.R-Se-

    G-Mo.R-Se is a method aimed at using RNA-Seq short reads to build

    de novo gene models. First, candidate exons are built directly from

    the positions of the reads mapped on the genome (without any ab

    initio assembly of the reads), and all the possible splice

    junctions between those exons are tested against unmapped reads.

    From CNS in France.

    *MapNext-

    MapNext: A software tool for spliced and unspliced alignments and

    SNP detection of short sequence reads. From the Evolutionary

    Genomics Lab at Sun-Yat Sen University,

    China.

    *QPalma-

    Optimal Spliced Alignments of Short Sequence Reads. Authors are

    Fabio De Bona, Stephan Ossowski, Korbinian Schneeberger, and Gunnar

    Rätsch. A paper is available.

    *RSAT-

    RSAT: RNA-Seq Analysis Tools. RNASAT is developed and maintained by

    Hui Jiang at Stanford University.

    *TopHat-

    TopHat is a fast splice junction mapper for RNA-Seq reads. It

    aligns RNA-Seq reads to mammalian-sized genomes using the ultra

    high-throughput short read aligner Bowtie, and then analyzes the

    mapping results to identify splice junctions between exons. TopHat

    is a collaborative effort between the University of Maryland and

    the University of California, Berkeley

    Reposted from: http://blog.163.com/luyiming_1986@126/blog/static/151141532201122494757719/

    NGS data preprocessing and analysis

    Commonly used tools

    Quality control: FastQC, Fastx-toolkit

    Alignment (aligner): BWA, Bowtie, TopHat, SOAP2

    Mapper: TopHat, Cufflinks

    Gene quantification: Cufflinks, Avadis NGS

    Quality improvement: Genome Analysis Toolkit (GATK)

    SNP: UnifiedGenotyper, glfMultiples, SAMtools, Avadis NGS

    CNV: CNVnator

    Indel: Pindel, Dindel, UnifiedGenotyper, Avadis NGS

    Mapping to a gene: Cufflinks, Rsamtools, GenomicFeatures

    Related data formats

    FASTQ: reads with per-base quality values

    SAM: a generic nucleotide alignment format

    BAM: the binary form of SAM

    VCF: variant call format

    Data processing workflow

    [Figure not preserved: data processing workflow]

    Reposted from: http://www.dxy.cn/bbs/thread/23163706#23163706

    http://boyun.sh.cn/bio/?p=1862

  • NGS fundamentals

    2020-02-27 14:23:32

    NGS fundamentals

    Basic NGS concepts

    (Every department that works with NGS should know these.)

    The FASTQ (FQ) data format

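    • Raw image data from high-throughput sequencers (such as Illumina HiSeq/MiSeq) are converted by CASAVA base calling into sequenced reads, referred to as raw data or raw reads. The results are stored in FASTQ (fq) files, which hold each read's sequence together with its per-base sequencing quality.
      Each read in a FASTQ file is described by four lines, for example: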
    @HWI-ST1276:71:C1162ACXX:1:1101:1208:2458 1:N:0:CGATGT
    NAAGAACACGTTCGGTCACCTCAGCACACTTGTGAATGTCATGGGATCCAT
    +
    #55???BBBBB?BA@DEEFFCFFHHFFCFFHHHHHHHFAE0ECFFD/AEHH
    
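    • Where:
      Line 1 starts with "@", followed by the Illumina sequence identifier and an optional description;
      Line 2 is the base sequence;
      Line 3 starts with "+", optionally followed by the Illumina sequence identifier again;
      Line 4 holds the sequencing quality of each base: the ASCII value of each character minus 33 is the quality score of the corresponding base on line 2.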
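The four-line layout and the "ASCII minus 33" rule can be sketched in a few lines. This is a minimal illustration that assumes Phred+33 encoding and an unwrapped record (exactly four lines); the function name and the short record in the test are hypothetical.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into (identifier, sequence, quality scores)."""
    header, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lines differ in length")
    # Phred+33: quality score = ASCII code of the character minus 33
    scores = [ord(ch) - 33 for ch in qual]
    return header[1:], seq, scores
```

For example, the quality characters `#`, `5`, `?` and `B` decode to 2, 20, 30 and 33.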

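    Raw data filtering

    • Raw reads include reads that carry adapters or are of low quality. To safeguard the downstream analysis, raw reads must be filtered into clean reads, and all subsequent analysis is based on the clean reads. The criteria below are Novogene's filtering conditions; they are fairly lenient and not a standard, and other providers adjust the thresholds, but all of them filter on these three items:
      • remove read pairs that contain adapter sequence;
      • remove a read pair when N bases make up more than 10% of the length of either read;
      • remove a read pair when low-quality bases (Q ≤ 5) make up more than 50% of the length of either read.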
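The three filtering rules can be written as a small predicate. This is a sketch under stated assumptions, not any provider's pipeline: adapter detection is reduced to an exact substring match, qualities are Phred+33, and the `ADAPTER` value and function names are placeholders chosen for illustration.

```python
ADAPTER = "AGATCGGAAGAGC"  # placeholder adapter prefix, an assumption for this sketch


def read_fails(seq, qual, adapter=ADAPTER, max_n_frac=0.10, low_q=5, max_low_frac=0.50):
    """Return True if a single read violates any of the three rules."""
    if adapter in seq:                                   # rule 1: adapter present
        return True
    if seq.count("N") > max_n_frac * len(seq):           # rule 2: more than 10% N
        return True
    low = sum(1 for ch in qual if ord(ch) - 33 <= low_q)
    return low > max_low_frac * len(seq)                 # rule 3: more than 50% bases with Q <= 5


def keep_pair(read1, read2):
    """Each read is a (sequence, quality string) tuple; drop the pair if either read fails."""
    return not (read_fails(*read1) or read_fails(*read2))
```

Production tools usually allow mismatches and partial overlaps when looking for adapters, so the substring test here is deliberately simplistic.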

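    Data quality statistics:

    • Raw Base (bp): raw data yield, the number of reads multiplied by the read length, in bp.

    • Clean Base (bp): usable data after filtering, the number of filtered reads multiplied by the read length, in bp.

    • Effective Rate (%): the ratio of clean data to raw data.

    • Error Rate (%): the base error rate.

    • GC Content (%): the number of G and C bases as a percentage of all bases.

    • adapter: the adapter used for sequencing on the instrument. Adapter sequences introduced during library preparation pair with the adapters fixed on the flow cell.

    • index: the sequencing tag used for pooled samples. Each sample carries a different tag so that its data can be separated and the sample identified.

    • Q20, Q30: the percentage of bases whose Phred score is above 20 or 30, where Phred = -10 × log10(e) and e is the error rate.

    • raw data/raw reads: the raw data as it comes off the sequencer.

    • clean data/clean reads: what remains after the raw data have been filtered to remove low-quality data. All downstream analysis is based on clean data.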
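The statistics above reduce to simple counting. The sketch below assumes Phred+33 quality strings and counts bases with a score of at least 20 or 30 (the usual convention; the text says "above"). Function and key names are illustrative.

```python
def phred_to_error(q):
    """Invert Phred = -10 * log10(e): error probability for a quality score."""
    return 10 ** (-q / 10)


def summarize(reads):
    """reads: list of (sequence, Phred+33 quality string). Returns percentages."""
    total = gc = q20 = q30 = 0
    for seq, qual in reads:
        total += len(seq)
        gc += seq.count("G") + seq.count("C")
        scores = [ord(ch) - 33 for ch in qual]
        q20 += sum(1 for s in scores if s >= 20)
        q30 += sum(1 for s in scores if s >= 30)
    return {
        "bases": total,
        "gc_percent": 100 * gc / total,
        "q20_percent": 100 * q20 / total,
        "q30_percent": 100 * q30 / total,
    }
```

By the formula, Q20 corresponds to an error probability of 1% and Q30 to 0.1%.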

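    Reference genome concepts:

    • Seq number: total number of sequences in the genome assembly.
    • Total length: total length of the assembly.
    • GC content: proportion of G and C bases.
    • Gap rate: proportion of N in the assembly.
    • N50 length: scaffold N50. With scaffolds sorted from longest to shortest, it is the length of the scaffold at which the running total first reaches 50% of the assembly length, so scaffolds of this length or longer hold half of the assembled bases.
    • N90 length: scaffold N90, defined the same way with a 90% threshold.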
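N50 and N90 are easy to get wrong, so a short worked sketch may help. It uses the common definition (sort lengths in descending order and accumulate until the chosen fraction of the total length is reached); the function name is an illustrative choice.

```python
def nx_length(lengths, fraction=0.5):
    """Return the Nx length (N50 for fraction=0.5, N90 for fraction=0.9)."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequences given")
    target = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return ordered[-1]
```

With scaffold lengths 80, 70, 50, 40, 30 and 20 (total 290), the running total passes 145 at the second scaffold, so N50 is 70; it passes 261 at the fifth scaffold, so N90 is 30.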

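    Alignment statistics:

    • Mapped reads: number of reads aligned to the reference (including both single-end and paired-end alignments).
    • Total reads: total number of reads in the usable sequencing data.
    • Mapping rate: the number of reads aligned to the reference genome divided by the number of usable reads.
    • Average depth: the total number of bases aligned to the reference genome divided by the genome size.
    • Coverage at least 1X: the percentage of reference sites covered by at least 1 base.
    • Coverage at least 4X: the percentage of reference sites covered by at least 4 bases.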
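These alignment metrics follow directly from read counts and a per-site depth vector. The sketch assumes the depth of every reference position is available as a plain list (for a real genome it would come from a depth-reporting tool); all names are illustrative.

```python
def alignment_stats(mapped_reads, total_reads, depth_per_site):
    """Compute mapping rate, average depth and coverage from per-site depths."""
    genome_size = len(depth_per_site)
    return {
        "mapping_rate": mapped_reads / total_reads,
        # aligned bases divided by genome size
        "average_depth": sum(depth_per_site) / genome_size,
        "coverage_1x": sum(1 for d in depth_per_site if d >= 1) / genome_size,
        "coverage_4x": sum(1 for d in depth_per_site if d >= 4) / genome_size,
    }
```

For a toy 10 bp reference with depths `[0, 1, 4, 5, 10, 0, 2, 8, 3, 7]`, the average depth is 4.0, 80% of sites are covered at least once and 50% at least four times.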

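    SNP concepts

    • A SNP (single nucleotide polymorphism) is a DNA sequence polymorphism at the genome level caused by variation of a single nucleotide, including single-base transitions and transversions.
      Main types:
    • Exonic: the variant lies in an exon;
      • missense: non-synonymous variant;
      • Stop gain: a variant that creates a stop codon in the gene;
      • Stop loss: a variant that removes the gene's stop codon;
      • synonymous: synonymous variant.
    • Intronic: the variant lies in an intron.
    • Splicing: the variant lies at a splice site (the 2 bp of the intron next to the exon/intron boundary).
    • Downstream: within 1 kb downstream of a gene.
    • Upstream/Downstream: within 1 kb upstream of one gene and also within 1 kb downstream of another gene.
    • Intergenic: the variant lies in an intergenic region.
    • ts: transitions.
    • tv: transversions.
    • ts/tv: the ratio of transitions to transversions.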
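The ts/tv distinction can be made concrete: a transition swaps purine for purine (A and G) or pyrimidine for pyrimidine (C and T), and every other single-base change is a transversion. A minimal sketch with illustrative function names:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(ref, alt):
    """Return 'ts' for a transition and 'tv' for a transversion."""
    if ref == alt:
        raise ValueError("ref and alt are identical")
    pair = {ref, alt}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "ts"
    return "tv"


def ts_tv_ratio(snps):
    """snps: iterable of (ref, alt) base pairs."""
    kinds = [classify_snp(ref, alt) for ref, alt in snps]
    return kinds.count("ts") / kinds.count("tv")
```

The ratio is undefined when a call set has no transversions, so real code would need to handle that case.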

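    NGS principles

    History of sequencing technology

    [Figure not preserved]

    Illumina sequencing principle

    • High-throughput sequencing, also called second-generation or next-generation sequencing (NGS), is defined relative to traditional Sanger (first-generation) sequencing. Compared with Sanger sequencing, NGS offers moderate read lengths at a moderate price and suits de novo sequencing, transcriptome sequencing, metagenomics and similar work.
    • Solexa sequencing relies on reversible-terminator chemistry. It is a sequencing-by-synthesis (SBS) method in which a single-molecule array undergoes bridge PCR on a small chip (the flow cell). Reversible blocking allows only one fluorescently labeled base to be incorporated per cycle; a laser then excites the fluorophore, and the emitted fluorescence is captured to read the base.
    • Bridge PCR principle
    • [Figure not preserved]
    • Library preparation and sequencing workflow
      DNA fragments go through end repair, A-tailing, adapter ligation, purification and PCR amplification to complete the library. Once the library is built, it is first quantified roughly with Qubit 2.0 and diluted to 1 ng/μl; the insert size is then checked on an Agilent 2100, and when it is as expected, the effective library concentration is measured accurately by qPCR (effective concentration > 2 nM) to ensure library quality. The finished library is sequenced on Illumina HiSeq PE150.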

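    NGS data demultiplexing

    • The raw output of the sequencer is BCL files, which are demultiplexed according to the index information added during library preparation. Sequencing providers do not hand over the BCL files unless the customer has bought a whole lane or a whole run.
    • Outsourced sequencing returns demultiplexed raw data and quality-controlled clean data; the filtering step from raw data to clean data is called quality control (QC).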

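    NGS data quality control

    • QC mainly filters out reads that are low quality, contain N, or contain adapters.
    • Main metrics to check:
      1. Effective data rate: usually required to be above 95%; normal projects today mostly reach 99%.
      2. Data volume: every sample should reach at least 95% of the agreed volume; check whether the contract specifies raw or clean data.
      3. Q20: usually > 90% (Illumina's official commitment is 85%).
      4. Q30: usually > 85% (Illumina's official commitment is 80%).
      5. GC content: normally stable, within 5% fluctuation; complex populations need special consideration.
      6. GC fluctuation (almost none for WGS; reduced-representation and panel data are judged separately).
      7. NT database alignment: no contamination should be present. Providers no longer supply this by default; when GC fluctuates strongly, ask the provider for it to rule out contamination.
    • Reference material: QC reports from two sequencing providers, for study (some reports contain obvious anomalies for readers to find).
    • Charts for the QC metrics above
    • [Figures not preserved]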

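    NGS alignment analysis

    Alignment software and the most important pipeline steps

    • Resequencing
      Required
    bwa index ref.fa                                              # build the genome index (file names are placeholders)
    bwa mem ref.fa sample_R1.fq.gz sample_R2.fq.gz > sample.sam   # align
    samtools sort -o sample.sorted.bam sample.sam                 # sort (gatk SortSam is an alternative)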
    

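    Optional

    samtools rmdup sample.sorted.bam sample.rmdup.bam   # remove duplicates (placeholder file names; gatk MarkDuplicates is an alternative)
    gatk remap # re-call; "remap" is the source's shorthand for this step, not a literal GATK tool name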
    

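    Alignment statistics [figure not preserved]

    • General requirements:
      • Mapping rate: for most non-anomalous samples it is above 90%, often above 99%.
      • Depth: meets the contract or the needs of downstream analysis.
      • Coverage reaches a reasonable level (above 85%).
      • Duplication rate below 20%. The report does not include it and it is not given to customers, but it can be computed and is an important internal evaluation metric.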

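    NGS variant calling

    Variant calling software

    • samtools
    • GATK
    • angsd
    • freebayes
    • The first two remain the mainstream tools.

    Variant annotation software

    • ANNOVAR (used mostly for human and animals)
    • snpEff (used mostly for plants)

    Filtering criteria

    • Per-individual filtering
      • Depth: filter at depth 4, or more strictly at 7 or 10
      • Quality: 20/30
    • Population filtering
      • Total depth, filtered according to the population
      • Quality: 20/30
      • Per-individual quality 5/10/20 and per-individual depth 4/7/10
      • miss (missing rate): 0.1/0.2/0.5~
      • maf (minor allele frequency): 0.01/0.05
    • These are for reference only; parameters still need adjusting case by case, but these items are generally filtered.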
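The population-level criteria (per-individual depth, missing rate, minor allele frequency) can be sketched for a single biallelic site. This is an illustration only: the data layout, the function name and the default thresholds (depth 4, miss 0.2, maf 0.05, picked from the ranges listed above) are assumptions, and genotypes below the depth cutoff are simply treated as missing.

```python
def site_passes(genotypes, min_dp=4, max_missing=0.2, min_maf=0.05):
    """genotypes: list of (alleles, depth); alleles is a tuple such as (0, 1), or None if uncalled."""
    if not genotypes:
        return False
    # Genotypes that are uncalled or too shallow count as missing
    called = [alleles for alleles, depth in genotypes
              if alleles is not None and depth >= min_dp]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called or missing_rate > max_missing:
        return False
    allele_list = [a for alleles in called for a in alleles]
    alt_freq = sum(1 for a in allele_list if a != 0) / len(allele_list)
    maf = min(alt_freq, 1 - alt_freq)
    return maf >= min_maf
```

In practice these filters are applied to VCF files with dedicated tools; the sketch only shows how the three thresholds interact.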

    Results

    [Figure not preserved]
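  • Identification of a novel pathogenic DFNA11 mutation using targeted capture combined with NGS. Chen Penghui, Yang Tao. Objective: to analyze the genetic cause of disease in a family with autosomal dominant non-syndromic late-onset hearing loss. Methods: medical and family history were taken from a hearing-loss patient from Jiangxi Province
  • First-generation sequencing, also called Sanger sequencing (many molecules, single clone), followed by second- and third-generation sequencing...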

     

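    I. The first true look

    First-generation sequencing: Sanger sequencing (many molecules, single clone)

    History: first-generation DNA sequencing (Sanger sequencing) was pioneered by Sanger and colleagues in 1975, and in 1977 it produced the first complete genome sequence, that of bacteriophage φX174, 5,375 bases long. After about 30 years of practice and continual improvement of the technology and sequencing strategies (such as mapping-based and shotgun approaches), the first human genome map, completed in 2001, was built on an improved Sanger method.

    Principle: a proportion of labeled ddNTP (ddATP, ddCTP, ddGTP or ddTTP) is added to each of four DNA synthesis reactions that contain dNTPs. After gel electrophoresis and autoradiography, the DNA sequence is read from the band positions. A ddNTP has no hydroxyl group at either the 2' or the 3' position, so it cannot form the next phosphodiester bond and terminates synthesis wherever it is incorporated.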
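Reading a sequence from band positions can be mimicked with a toy model. It idealizes the chemistry: every position of the synthesized strand yields exactly one terminated fragment in the lane of its base, and the gel separates fragments perfectly by length. The function names are illustrative.

```python
def sanger_lanes(strand):
    """Fragment lengths seen in each ddNTP lane for a synthesized strand."""
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(strand, start=1):
        # A fragment of this length ends in the dideoxy form of `base`
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the sequence from shortest to longest fragment across the four lanes."""
    base_at_length = {length: base
                      for base, lengths in lanes.items()
                      for length in lengths}
    return "".join(base_at_length[length] for length in sorted(base_at_length))
```

For the strand `GATTACA`, the A lane holds fragments of length 2, 5 and 7, and reading all lanes in order of length recovers the strand.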

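    II. Each era brings new talent

    Second-generation sequencing: NGS (many molecules, many clones)

    Background: Sanger sequencing has long reads and high accuracy, but its high cost and low throughput kept applications such as de novo sequencing and transcriptome sequencing from becoming widespread. Continued technical development produced the second generation of sequencing technologies, represented by Roche's 454, Illumina's Solexa and HiSeq, and ABI's SOLiD; Thermo Fisher's Ion Torrent joined them more recently.

    1. Illumina principle:

    Bridge PCR + four-color fluorescent reversible termination + laser scanning imaging

    Main steps:

    ① DNA library preparation: shear by sonication and add adapters

    ② Flow cell: captures the DNA fragments flowing through it

    ③ Bridge PCR amplification and denaturation: amplifies the signal

    ④ Sequencing: converts the sequenced bases into optical signals

    Strengths and weaknesses: because Illumina sequencing adds only one dNTP per cycle, it measures homopolymer length accurately; its main source of error is base substitution. Its short read length (200-500 bp) limits some applications.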

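    2. Roche 454

    Emulsion (water-in-oil) PCR + the four dNTPs flowed in turn + detection of light triggered by pyrophosphate release

    Main steps:

    ① DNA library preparation: shear by nebulization and add adapters

    ② Emulsion PCR: aqueous droplets in oil, each an independent PCR

    ③ Pyrosequencing: beads are loaded into wells and the pyrophosphate signal is converted into an optical signal

    Strengths and weaknesses: 454 has relatively long reads, about 400 bp on average. Its weakness is homopolymers: at a stretch such as poly(A), several T are incorporated in a single flow, so the length cannot be measured accurately. For this reason 454 introduces insertion and deletion errors during sequencing.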

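    3. Ion Torrent principle

    Emulsion (water-in-oil) PCR + the four dNTPs flowed in turn + pH detection by microelectrodes

    Main steps:

    ① DNA library preparation: shear by nebulization and add adapters

    ② Emulsion PCR: aqueous droplets in oil, each an independent PCR

    ③ Microelectrode pH detection: beads are loaded into wells and the pH is recorded

    Strengths and weaknesses: compared with 454, the main difference is in the sequencing step. Ion Torrent needs no expensive optical imaging equipment, so it is cheaper, smaller and simpler to operate, and a sequencing run takes 2-3.5 hours (excluding library preparation). Its weakness is limited chip throughput, so it is best suited to small genomes and exon validation.

    Summary: compared with first-generation sequencing, NGS greatly lowers cost while keeping high accuracy, and greatly shortens sequencing time, cutting a human genome from 3 years to under 1 week. Its reads are much shorter than first-generation reads, however, which left room for third-generation sequencing.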

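    III. A different path to fill the gap

    Third-generation sequencing: single-molecule sequencing

    Background: from the first to the second generation, read length fell from nearly 1000 bp to a few hundred bp while throughput and speed rose sharply. Third-generation sequencing aims to keep the speed and throughput of NGS while making up for its short reads. Its defining feature, compared with the first two generations, is single-molecule sequencing, with no PCR amplification during sequencing.

    1. Oxford Nanopore

    Nanopore + electrical current detection

    Principle: the technology uses a specially designed nanopore with a molecular adapter covalently bound inside it, and reads an electrical signal rather than an optical or pH signal. As DNA bases pass through the pore, the charge changes and briefly alters the current through the pore (each base changes the current by a different amount); sensitive electronics detect these changes and identify the bases.

    Strengths and weaknesses:

    ① Very long reads, around tens of kb and even 100 kb;

    ② The error rate is currently relatively high, and errors are random rather than concentrated at the read ends;

    ③ Data can be read in real time;

    ④ High throughput (a 30x human genome is expected to be achievable within a day);

    ⑤ The input DNA is not destroyed during sequencing;

    ⑥ Sample preparation is simple and cheap;

    ⑦ RNA can be sequenced directly.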

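    2. PacBio SMRT

    Zero-mode waveguide (ZMW) nanowells + fluorescently labeled dNTPs read in real time

    Principle: PacBio SMRT applies the sequencing-by-synthesis idea (four fluorescent colors label the four bases). The key to its very long reads is a DNA polymerase with long-lasting activity and high fidelity, with the SMRT chip (based on the ZMW principle) as the sequencing substrate.

    Strengths and weaknesses:

    ① SMRT sequencing is fast, about 10 dNTPs per second;

    ② The error rate is high, up to 15%, but errors are random and can be corrected effectively by sequencing multiple times (for example, analyzing 30X data with Sparc can bring the error rate to 0.5%);

    ③ The original DNA is not destroyed;

    ④ Reads can reach 10 kbp.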

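    3. Helicos Heliscope

    Single-molecule fluorescent reversible termination

    Principle: based on sequencing by synthesis. DNA is randomly sheared into small fragments, fluorescently labeled dNTPs are incorporated on each fragment, and sequencing proceeds through repeated rounds of synthesis, washing, imaging and quenching.

    Main steps:

    ① Preparation: shear DNA and add poly(A) + Cy3

    ② Sequencing: fluorescent reversible termination with dNTPs

    Characteristics:

    ① Read length is about 30-35 bp, with a data yield of 21-28 Gb per run; until sequencing finishes, the individual fragments progress at different rates;

    ② Homopolymer length can be inferred from the way homopolymer synthesis weakens the fluorescent signal;

    ③ Accuracy can be improved by sequencing a second time (the template is denatured and eluted directly).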

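    Summary:

    Strengths of third-generation sequencing:

    ① Reads are long, which lowers assembly cost and saves memory and compute time;

    ② By design it avoids the errors introduced by PCR amplification;

    ③ Extended applications: RNA sequences, methylated DNA sequences and so on.

    Weaknesses of third-generation sequencing:

    ① The error rate of a single read is high, so repeated sequencing is needed for correction (raising sequencing cost);

    ② It depends on the activity of the DNA polymerase;

    ③ Cost is higher (Illumina NGS costs 0.05-0.15 USD per million bases; third-generation sequencing costs 0.33-1.00 USD per million bases);

    ④ Bioinformatics software is less abundant and less data has accumulated.

    Source: 元码基因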
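  • NGS principles (Illumina)

    2019-10-12 18:27:44

    Third-generation sequencing is already commercial, but second-generation sequencing remains the mainstream, and Illumina's approach dominates. Below we look, from four angles, at how Illumina sequencing produces biological data.

    0. Basic principle

    Sequencing by synthesis with fluorescently labeled, reversibly terminating dNTPs.
    Three steps:

    • Sample preparation (Sample Prep)
    • Cluster generation
    • Sequencing

    1. Sample preparation (Sample Prep)

    DNA is first extracted from the sample, however it was obtained, and randomly sheared by sonication. Enzymes then blunt both ends, and Klenow enzyme adds a single A to the 3' end, which is used to ligate the adapters. Specific adapter sequences are added to the fragments for the later amplification and sequencing steps. The adapters are known sequences of roughly three kinds (the colors refer to a figure that is not preserved):

    • sequencing binding site (green)
    • index (red, yellow)
    • sequences complementary to the flow-cell primers (blue, purple)

    The collection of adapter-ligated DNA fragments is called the DNA library, which completes sample preparation.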

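    2. Cluster generation

    Cluster generation is the step in which the DNA fragments are amplified. It takes place in the flow cell, a glass slide with 8 lanes whose inner surfaces carry two kinds of DNA primer.

    First, the primers pair with the complementary adapter sequences of the library fragments, fixing the fragments to the lane surface.

    A polymerase synthesizes the complement of each hybridized fragment. NaOH is then added to denature the double strand, and the original template strand is washed away by the liquid in the flow cell.

    A neutral solution neutralizes the alkali, and the adapter at the free end of the remaining single-stranded copy binds a primer on the lane surface, forming a single-stranded bridge.

    Again with polymerase, the complementary strand is synthesized, producing a double-stranded bridge.

    Denaturation linearizes the DNA into two single-stranded copies.

    Each of these binds its matching primer in turn.

    The cycle repeats, forming millions of clusters at once; in the process every DNA fragment is clonally amplified.

    After bridge amplification, the reverse strands are cleaved and washed away, leaving only the forward strands. To keep the free 3' ends from binding again and re-forming single-stranded bridges, the 3' ends are blocked.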

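    3. Sequencing

    First, fluorescently labeled dNTPs and enzyme are added to the flow cell, and synthesis of the daughter strand starts from the primer. The azide group at the 3' end of each dNTP blocks further extension, so only one base is read per cycle. After one base has been incorporated, the flow cell is flushed to remove excess dNTPs and enzyme, and the microscope's laser scans the characteristic fluorescence.

    The emission wavelength together with the signal intensity determines the base call, and one base of every DNA fragment is read simultaneously, in a massively parallel process.

    A chemical reagent then cleaves the azide group and the fluorophore, and the flow cell is again supplied with labeled dNTPs and enzyme so that the next base is synthesized. Repeating this process completes the first read.

    Because a run has high throughput, it often contains more than one sample. To tell samples (and strands) apart, distinct indexes (barcodes) are built into the adapters during library construction.

    After the first read is complete, the newly copied strand is washed away.

    The index primer is introduced and hybridizes to the template; once the index has been read, the product is washed away. Comparing the index read with the known indexes labels each sequence with its sample, which simplifies later analysis.

    Paired-end sequencing is now standard: it increases the sequenced length and also offers a new approach to structural variant analysis. For paired-end sequencing, the 3' end of the template is first deprotected, the template folds over, and the index fragment is introduced.

    With polymerase, a double-stranded bridge forms.

    The bridge is then denatured back to single strands. This time the forward strand is cleaved and washed away, leaving only the reverse strand.

    Starting from the sequencing primer, the reverse strand is read through many cycles in the same way as the forward strand.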

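    Data analysis

    Sequencing produces millions of reads, which are sorted by sample using the indexes built in during sample preparation. Within each sample, reads with similar stretches of bases are clustered together. Forward and reverse reads are paired into contiguous sequences, and aligning these to a reference genome reconstructs the full sequence.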
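Sorting reads by index can be sketched as a nearest-index lookup. This is an illustration, not the vendor's demultiplexing software: it allows at most one mismatch, refuses ambiguous matches, and uses hypothetical sample names and index sequences (`CGATGT` is the index seen in the FASTQ header example earlier in this page).

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)


def assign_sample(index_read, sample_indexes, max_mismatches=1):
    """Return the sample whose index matches the index read, or None if none or several do."""
    hits = [name for name, idx in sample_indexes.items()
            if len(idx) == len(index_read)
            and hamming(idx, index_read) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical sample sheet for illustration
SAMPLES = {"sampleA": "CGATGT", "sampleB": "TTAGGC"}
```

Allowing one mismatch is only safe when every pair of indexes in the pool differs at three or more positions; otherwise a single error could make a read match two samples.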


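    Other sequencing technologies

    First-generation sequencing principle (Sanger method)

    NGS principle (Illumina)

    Third-generation sequencing principle (SMRT Sequencing)

    Third-generation sequencing principle (Nanopore)

    ChIP-seq sequencing principle

    BS-seq sequencing principle

    Hi-C sequencing principle

    Histone modification

    3D genome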
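  • When does Illumina sequencing read into the adapter sequence? NGS series: principles of high-throughput sequencing. A discussion of how duplication arises in NGS and what proportion it takes. First, what is duplication? NGS...
  • The importance of parental genotype verification for interpreting targeted NGS results. He Longxia, Yang Tao. Objective: to explore the importance of parental genotype verification for interpreting targeted NGS results. Methods: 21 probands with non-syndromic hearing loss and no family history were enrolled (the three common genes GJB2
  • This article explains the principle of Illumina's NGS technology in detail and is easy to follow; have a look if you are interested.
  • 2020 - Consensus discussion on standardizing the full workflow of clinical NGS testing for genetic diseases (3) - data analysis workflow.pdf
  • NGS assembly vs. third-generation assembly. 2016-07-29. Editor: 诺禾致源 (Novogene). Third-generation PacBio sequencing, with its long reads, no amplification and no GC bias, has become the new favorite for de novo assembly. However, PacBio sequencing is still costly and has a high error rate, requiring...
  • NGS technology matures year by year, making whole-genome sequencing ever cheaper, but analyzing the enormous and complex data it produces has not become any simpler. Instead, sequencing technology has developed in two extreme directions: one is large and comprehensive...
  • Third-generation platforms, represented by PacBio, have long reads (10-15 kb on average) and no GC bias, so they are widely used in genome assembly and similar work, but their very high error rate (15%-20%) greatly increases the complexity of assembly algorithms. As for assembly strategy...
  • I had never assembled a gene from paired-end NGS data before. Today a labmate ran into a problem during map-based cloning: one candidate gene looked very likely, and in the IGV browser this gene, between the wild-type and mutant material, had...
  • A contig is a consensus sequence found among the short reads from large-scale sequencing. The first step of assembly is to build contigs from the short-insert (paired-end) libraries. Then, using long-insert (mate-pair) libraries of different sizes, the isolated contigs are joined in order, which involves adjusting...
  • Original: http://www.biostars.org/p/53528/ This tutorial also serves as the supporting information for Lecture 9 of the course titled Analyzing High Throughput Sequencing Data offered at Penn State. Wi...
  • Chloroplast genome assembly from NGS data (personal experience). A while ago a teacher asked me about chloroplast genome assembly, and although I am no expert, I was glad to help. There were a few small mishaps along the way, which I will not go into. Here I take the chance to share the tools I commonly use...
  • Alignment algorithms for NGS

    2019-08-19 17:24:26
    [1] Master's thesis by Shang Jing, Soochow University: "Comparison and evaluation of algorithms in next-generation sequencing short-read alignment software", CNKI, kns.cnki.net   [2] Bilibili video by Meng Haowei: 20171026 - based on the BWT algorithm...
  • Summary of indel calling with NGS

    2017-12-06 14:45:08
    Summary of indel calling with NGS
  • NGS principles and workflow

    2021-06-19 14:37:05
    NGS has two important characteristics: 1. High throughput: it sequences hundreds of thousands to millions of DNA molecules in parallel in one run. 2. Short reads: as reads grow longer, the copies within a cluster fall out of step and sequencing quality drops, so NGS reads do not exceed 500 bp. Therefore the genome...
  • Designing a panel is quite simple: choose the regions to capture according to the goal of the experiment, then put those regions into a BED file. Take the two genes BRCA1/2 as an example. BED regions are usually designed on the CDS of the gene, because introns often contain many...
  • De novo genome assembly with NGS

    2018-09-07 18:05:25
    A contig is a consensus sequence found among the short reads from large-scale sequencing. The first step of assembly is to build contigs from the short-insert (paired-end) libraries. Then, using long-insert (mate-pair) libraries of different sizes, the isolated contigs are joined in order, which involves adjusting...
  • NGS principle: 1. Build the DNA library: sonication shears the DNA into small fragments, usually 200-500 bp, and different adapters are added to the two ends. 2. Flow cell: one flow cell has 8 channels carrying many adapters. 3. Bridge PCR amplification: each DNA fragment gathers into a cluster at its own position, each cluster...
  • Many de novo assembly tools matched to NGS technology have appeared, such as Velvet, ABySS, SOAPdenovo, VCAKE and SPAdes [4-6]. How to choose among them according to the properties of the sequences and the specific requirements, and how to assess their practicality, matters for the best assembly result and downstream...
  • Summary of NGS technologies

    2018-11-03 10:36:00
    Comparison of NGS technologies. Columns: company, platform, sequencing method, detection method, read length (bp), advantages, limitations. Roche/454: Genome Sequencer FLX System, pyrosequencing, optical, 230-...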
