
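RNA-seq
Purpose and uses of RNA-seq: it shows how the expression of all genes differs between the conditions being compared.
Examples: differences between normal and tumor tissue; changes in gene expression before and after drug treatment; and differences in gene expression between developmental stages or between tissues.
Of all the differences that can be measured, the most commonly assayed is the difference in expression level of all mRNAs.
RNA-seq can also detect structural differences in RNA: differences in mRNA splicing (alternative splicing), fusion genes, and single-base changes such as SNPs.
Sequencing method and steps: in total RNA extracted from human cells or tissue, about 95% is typically ribosomal RNA. Another 2% to 3% is mRNA, and the remaining 2% to 3% is long non-coding RNA, tRNA, microRNA and so on.
Ribosomal RNA is therefore removed first, before library construction and sequencing. For example, mRNA is captured by its poly(A) tail, fragmented in a magnesium-ion solution, reverse transcribed into cDNA, then built into a library, amplified and sequenced.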

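Volcano plot: for whole-transcriptome analysis, it shows the overall pattern of expression differences between samples in a single view.
The horizontal axis shows whether a gene's expression goes up or down (typically the log2 fold change); the vertical axis shows the confidence in that difference (typically -log10 of the p-value). Each point is the change in mRNA expression of one gene between the two samples.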
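The two axes can be made concrete with a small calculation. This is only a sketch: it assumes the common convention of plotting log2 fold change against -log10(p), and the pseudocount and the input numbers are illustrative, not taken from the post.

```python
import math

def volcano_point(mean_a, mean_b, p_value, pseudocount=1.0):
    """Return (log2 fold change, -log10 p) for one gene.

    mean_a / mean_b are the mean expression values in the two conditions.
    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    log2_fc = math.log2((mean_b + pseudocount) / (mean_a + pseudocount))
    neg_log10_p = -math.log10(p_value)
    return log2_fc, neg_log10_p

# Illustrative gene: expression rises from 7 to 31, p = 0.001
x, y = volcano_point(mean_a=7.0, mean_b=31.0, p_value=0.001)
print(x, y)
```

Points far to the left or right and high on the plot are the genes with large, well-supported expression changes.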
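Cluster analysis plot: it compares the genome-wide expression profiles of multiple samples to find similarities and relatedness among them.
In a cluster analysis plot, the horizontal axis is samples and the vertical axis is genes.
Applications: disease subtypes can be analyzed, and the expression tendencies of multiple genes in a specific disease can be used to find possible new diagnostic biomarkers.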
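The similarity that clustering relies on can be sketched with a plain Pearson correlation between sample profiles. This is a minimal illustration, not the method of any particular clustering tool; the sample names and expression values are made up.

```python
import math
from itertools import combinations

def pearson(u, v):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v)

def closest_pair(profiles):
    """Return the pair of sample names with the highest correlation."""
    return max(combinations(profiles, 2),
               key=lambda pair: pearson(profiles[pair[0]], profiles[pair[1]]))

# Hypothetical expression values for four genes in three samples
profiles = {
    "tumor_1": [8.1, 2.0, 5.5, 0.3],
    "tumor_2": [7.9, 2.2, 5.1, 0.4],
    "normal_1": [1.0, 6.5, 0.8, 4.9],
}
best = closest_pair(profiles)
print(best)
```

Hierarchical clustering repeats this kind of comparison, merging the most similar samples (or genes) step by step to order the rows and columns of the plot.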
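GO (Gene Ontology) analysis:
GO mainly describes three attributes of a gene:
First, the biological process the gene takes part in
Second, the molecular function of the gene product
Third, the location of the gene product within the cell (cellular component)
GO enrichment bar chart of differential genes: it gives a direct view of how many differentially expressed genes are enriched in each biological process, cellular component and molecular function category. The taller the bar, the more differentially expressed genes fall in that subcategory.
The directed acyclic graph is a graphical display of the GO enrichment analysis of differential genes. From top to bottom, the functional scope each node defines becomes narrower and more precise. Its branches represent containment relationships, and the darker the color of a node, the stronger the enrichment.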

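Pathway analysis: a pathway is a basic unit, or local sub-network, that carries out a particular biological function at the system level.
The scatter plot is a graphical display of the KEGG enrichment analysis results.
In the plot, the degree of KEGG enrichment is measured by the Rich factor, the Qvalue, and the number of genes enriched in the pathway.
The larger the rich factor, the greater the degree of enrichment. The q-value is the p-value after multiple-testing correction; the closer it is to 0, the more significant the enrichment. The larger the area of a point, the more genes are enriched.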
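Both quantities can be sketched in a few lines. The post does not define them precisely, so this sketch assumes the usual conventions: the rich factor is the number of differential genes in a pathway divided by all annotated genes in that pathway, and the q-value is a Benjamini-Hochberg adjusted p-value. All numbers are illustrative.

```python
def rich_factor(degs_in_pathway, genes_in_pathway):
    """Differential genes in the pathway / all annotated genes in the pathway."""
    return degs_in_pathway / genes_in_pathway

def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values

q = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(rich_factor(12, 48), q)
```

A pathway with a high rich factor but a large q-value is usually a small pathway where a few hits inflate the ratio, which is why the plot shows both.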
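RNA-seq can detect many kinds of structural variation in mRNA, that is, variation in the RNA sequence. This requires greater sequencing depth, because only fairly complete coverage gives enough confidence to call a new splice junction, a breakpoint, or a mutated base.
Structural variation analysis:
Alternative splicing: in a typical human tissue sample, high-throughput sequencing finds roughly 5,000 to 20,000 alternative splicing events.
Gene fusion: shown as a circular diagram in which arcs inside the circle connect the fused genes.
Point mutations (SNPs): shown as a bubble chart; the larger the bubble, the higher the mutation frequency, with bubbles arranged counterclockwise from largest to smallest.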

Reposted from: https://www.cnblogs.com/wangprince2017/p/9794500.html
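• RNA-Seq annotation and comparison. Olson_lab repository. Augustus_gene_predict_RNA_seq_data repository: augustus_beocat.txt - autoAugPred.pl - intron_filter.pl - shellForAug - KSU_bioinfo_lab. Cite as: Jennifer Shelton et al. ...
• RNA-seq, or transcriptome sequencing, is a technique in which the mRNA and non-coding RNA in a cell, or a subset of them, are extracted, sequenced with high-throughput sequencing and analyzed. The main aim of RNA-seq analysis is to measure the expression levels of the genes that the RNA corresponds to. The main steps of RNA-seq are: RNA isolation...
• A suite of Python programs for generating highly realistic simulated Illumina RNA-Seq reads. The distributions of read start positions, read errors and quality codes are all derived empirically from real RNA-Seq datasets. The suite includes Python scripts for preparing empirical read creation...
• RSCS: here we developed a computational pipeline, called RSCS, that integrates RNA-seq and small RNA-seq data; this strategy greatly improves the resolution and accuracy of transcriptome annotation in a variety of mammalian samples.
• The Trinity RNA-Seq Assembly project provides software for reconstructing full-length transcripts and alternatively spliced isoforms from Illumina RNA-Seq data.
• Implementation of the Gasch lab RNA-Seq pipeline. Input: a text file with RNA-Seq fastq files to be processed. Please use a dedicated directory to run the pipeline. Create your directory and copy your fastq files into it. By moving to...
• This RNA-seq workshop is intended to get you started on your own RNA-seq analysis and assumes you are already familiar with the basics of bash and R. We will use the NeSI HPC for some of the analysis, so make sure you have a NeSI account and can log in. Go through the workshop preparation document to make sure you are ready: workshop...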
• ## RNA-Seq

2013-10-10 22:08:17

RNA-seq (RNA Sequencing), also called "Whole Transcriptome Shotgun Sequencing" [1] ("WTSS"), is a technology that utilizes the capabilities of next-generation sequencing to reveal a snapshot of RNA presence and quantity from a genome at a given moment in time.[2]

Contents

1 Introduction
2 Methods
2.1 RNA 'Poly(A)' Library
2.2 Small RNA/Non-coding RNA sequencing
2.3 Direct RNA Sequencing
2.4 Transcriptome Assembly
2.5 Experimental Considerations
3 Analysis
3.1 Gene expression
3.2 Single nucleotide variation discovery
3.3 Post-transcriptional SNVs
3.4 Fusion gene detection
4 Application to Genomic Medicine
4.1 History
4.2 ENCODE and TCGA
5 External links
6 References

Introduction
The transcriptome of a cell is dynamic; unlike the static genome, it changes continually. Recent developments in Next-Generation Sequencing (NGS) allow for increased base coverage of a DNA sequence, as well as higher sample throughput. This facilitates sequencing of the RNA transcripts in a cell, providing the ability to look at alternatively spliced transcripts, post-transcriptional changes, gene fusions, mutations/SNPs and changes in gene expression.[3] In addition to mRNA transcripts, RNA-Seq can look at different populations of RNA, including total RNA, small RNA such as miRNA and tRNA, and ribosome profiling.[4] RNA-Seq can also be used to determine exon/intron boundaries and verify or amend previously annotated 5’ and 3’ gene boundaries. Ongoing RNA-Seq research includes observing cellular pathway alterations during infection,[5] and gene expression level changes in cancer studies.[6] Prior to NGS, transcriptomics and gene expression studies were done with expression microarrays, which contain thousands of DNA sequences that probe for a match in the target sequence, making available a profile of all transcripts being expressed. This was later done with Serial Analysis of Gene Expression (SAGE).
One deficiency with  microarrays that makes RNA-Seq more attractive has been limited coverage; such arrays target the identification of known common alleles that represent approximately 500,000 to 2,000,000 SNPs of the more than 10,000,000 in the genome.[7]  As such, libraries aren’t usually available to detect and evaluate rare allele variant transcripts,[8] and the arrays are only as good as the SNP databases they’re designed from, so they have limited application for research purposes.[9]  Many cancers for example are caused by rare <1% mutations and would go undetected. However, arrays still have a place for targeted identification of already known common allele variants, making them ideal for regulatory-body approved diagnostics such as cystic fibrosis.
Methods

RNA 'Poly(A)' Library

Creation of a sequence library can change from platform to platform in high throughput sequencing,[10] where each has several kits designed to build different types of libraries and adapting the resulting sequences to the specific requirements of their instruments. However, due to the nature of the template being analyzed, there are commonalities within each technology. Frequently, in mRNA analysis the 3' polyadenylated (poly(A)) tail is targeted in order to ensure that coding RNA is separated from noncoding RNA. This can be accomplished simply with poly (T) oligos covalently attached to a given substrate. Presently many studies utilize magnetic beads for this step.[1][11] The Protocol Online website[12] provides a list of several protocols relating to mRNA isolation.
Studies including portions of the transcriptome outside poly(A) RNAs have shown that when using poly(T) magnetic beads, the flow-through RNA (non-poly(A) RNA) can yield important noncoding RNA gene discovery which would have otherwise gone unnoticed.[1] Also, since ribosomal RNA represents over 90% of the RNA within a given cell, studies have shown that its removal via probe hybridization increases the capacity to retrieve data from the remaining portion of the transcriptome.
The next step is reverse transcription. Due to the 5' bias of randomly primed-reverse transcription as well as secondary structures influencing primer binding sites,[11] hydrolysis of RNA into 200-300 nucleotides prior to reverse transcription reduces both problems simultaneously. However, there are trade-offs with this method where although the overall body of the transcripts are efficiently converted to DNA, the 5' and 3' ends are less so. Depending on the aim of the study, researchers may choose to apply or ignore this step.
Once the cDNA is synthesized it can be further fragmented to reach the desired fragment length of the sequencing system.
Small RNA/Non-coding RNA sequencing
When sequencing RNA other than mRNA the library preparation is modified.  The cellular RNA is selected based on the desired size range.  For small RNA targets, such as  miRNA, the RNA is isolated through size selection.  This can be performed with a size exclusion gel, through size selection magnetic beads, or with a commercially developed kit.  Once isolated, linkers are added to the 3’ and 5’ end then purified.  The final step is cDNA generation through reverse transcription.

RNA-seq mapping of short reads in exon-exon junctions.

Direct RNA Sequencing
As converting RNA into cDNA using reverse transcriptase has been shown to introduce biases and artifacts that may interfere with both the proper characterization and quantification of transcripts,[13] single molecule Direct RNA Sequencing (DRS™) technology is currently under development by Helicos. DRS™ sequences RNA molecules directly in a massively-parallel manner without RNA conversion to cDNA or other biasing sample manipulations such as ligation and amplification.
Transcriptome Assembly

Two different assembly methods are used for producing a transcriptome from raw sequence reads: de-novo and genome-guided.
The first approach does not rely on the presence of a reference genome in order to reconstruct the nucleotide sequence. Because the reads are short, there cannot be the large overlaps between reads that would make it easy to reconstruct the original sequences, so de novo assembly may be difficult, though some software does exist (Velvet, Oases, and Trinity[14] to mention a few). The deep coverage also makes the computing power needed to track all the possible alignments prohibitive.[15] This deficit can be improved by using longer sequences obtained from the same sample with other techniques such as Sanger sequencing, and using the longer reads as a "skeleton" or a "template" to help assemble reads in difficult regions (e.g. regions with repetitive sequences).
An “easier” and relatively computationally cheaper approach is that of aligning the millions of reads to a "reference genome". There are many tools available for aligning genomic reads to a reference genome (sequence alignment tools); however, special attention is needed when aligning a transcriptome to a genome, mainly when dealing with genes that have intronic regions. Several software packages exist for short read alignment, and recently specialized algorithms for transcriptome alignment have been developed, e.g. Bowtie for RNA-seq short read alignment,[16] TopHat for aligning reads to a reference genome to discover splice sites,[17] Cufflinks to assemble the transcripts and compare/merge them with others,[18] or FANSe.[19] These tools can also be combined to form a comprehensive system.[20]
Although numerous solutions to the assembly quest have been proposed, there is still lots of room for improvement given the resulting variability of the approaches. A group from the Center for Computational Biology at the East China Normal University in Shanghai compared different de novo and genome-guided approaches for RNA-Seq assembly. They noted that, although most of the problems can be solved using graph theory approaches, there is still a consistent level of variability in all of them. Some algorithms outperformed the common standards for some species while still struggling for others. The authors suggest that the “most reliable” assembly could be then obtained by combining different approaches.[21] Interestingly, these results are consistent with NGS-genome data obtained in a recent contest called Assemblathon where 21 contestants analyzed sequencing data from three different vertebrates (fish, snake and bird) and handed in a total of 43 assemblies. Using a metric made of 100 different measures for each assembly, the reviewers concluded that 1) assembly quality can vary  a lot depending on which metric is used and 2) assemblies that scored well in one species didn’t really perform well in the other species.[22]
As discussed above, sequence libraries are created by extracting mRNA using its poly(A) tail, which is added to the mRNA molecule post-transcriptionally and thus splicing has taken place. Therefore, the created library and the short reads obtained cannot come from intronic sequences, so library reads spanning the junction of two or more exons will not align to the genome.
A possible method to work around this is to try to align the unaligned short reads using a proxy genome generated with known exonic sequences. This need not cover whole exons, only enough so that the short reads can match on both sides of the exon-exon junction with minimum overlap. Some experimental protocols allow the production of strand specific reads.[11]
Experimental Considerations
The information gathered when sequencing a sample's transcriptome in this way has many of the same limitations and advantages as other RNA expression analysis pipelines. The main pros and cons of this approach can be summarized as:
a) Tissue specificity: Gene expression is not uniform throughout an organism's cells, it is strongly dependent on the tissue type being measured; RNA-Seq, as any other sequencing technology that analyzes homogeneous samples, can provide a complete snapshot of all the transcripts being available at that precise moment in the cell. This approach is unlikely to be biased like an oligonucleotide microarray approach that instead analyzes a selected number of previously defined transcripts.
b) Time dependent: During a cell's lifetime and context, its gene expression levels change. As previously mentioned any single sequencing experiment will offer information regarding one point in time. Time course experiments are so far the only solution that would allow a complete overview of the circadian transcriptome so that researchers could obtain a precise description of the physiological changes happening over time. However, this approach is unfeasible for patient samples since it is quite improbable that biopsies will be collected serially in short time intervals. A possible work-around could be the use of urine, blood or saliva samples that won’t require any invasive procedure.
c) Coverage: coverage/depth can affect the mutations seen. Given that everything is expression-centric, an allele might not be detected, either because it is not in the genome, or because it is not being expressed. At the same time, RNA-seq can yield additional information rather than just the existence of a heterozygous gene as it can also help in estimating the expression of each allele. In association studies, genotypes are associated to disease and expression levels can also be associated with disease. Using RNA-seq, we can measure the relationship between these two associated variables, that is, in what relation are each of the alleles being expressed.
The depth of sequencing required for specific applications can be extrapolated from a pilot experiment.[23]
d) Subjectivity of the analysis: As described above, numerous attempts have been taken to uniformly analyze the data. However, the results can vary due to the multitude of algorithms and pipelines available. Most of the approaches are correct, but have to be tailored to the needs of the investigators in order to better capture the desired effect. This variability in methods, although in smaller scale, is still present in other RNA profiling approaches where reagents, personnel and techniques can lead to similar, although statistically different, results. Because of this, care must be taken when drawing conclusions from the sequencing experiment, as some information gathered might not be representative of the individual.
e) Data management: The main issue with NGS data is the volume of data produced. Microarray data occupy up to one thousand times less disk space than NGS data therefore requiring smaller storage units. The high capacity storage units required by RNA-Seq data are, however, directly proportional to the volume of information that goes with it. The payoff of “more complete” big scale datasets have to be evaluated prior to starting the experiment.
f) Downstream interpretation of the data: Different layers of interpretations have to be considered when analyzing RNA-Seq data. Biological, clinical and regulatory functions of the results are what allow clinicians and investigators to draw meaningful conclusions (i.e. the sequence of an RNA molecule presents, although identified with different read depths, might not perfectly mirror the initial DNA sequence). An example of this would be during SNV discovery as the mutations discovered are more precisely the mutations being expressed. Observing a homozygote location to a non-reference allele in an organism does not necessarily mean that this is the individual's genotype, it could just mean that the gene copy with the reference allele is not being expressed in that tissue and/or at the time snapshot the sample was acquired.
Analysis
Gene expression
The characterization of  gene expression in cells via measurement of mRNA levels has long been of interest to researchers, both in terms of which genes are expressed in what tissues, and at what levels. Even though it has been shown that due to other post transcriptional gene regulation events (such as  RNA interference) there is not necessarily always a strong correlation between the abundance of mRNA and the related proteins,[24] measuring mRNA concentration levels is still a useful tool in determining how the transcriptional machinery of the cell is affected in the presence of external signals (e.g. drug treatment), or how cells differ between a healthy state and a  diseased state.
Expression can be deduced via RNA-seq to the extent at which a sequence is retrieved. Transcriptome studies in yeast [25] show that in this experimental setting, a  fourfold coverage is required for  amplicons to be classified and characterized as an expressed gene. When the transcriptome is fragmented prior to cDNA synthesis, the number of reads corresponding to the particular exon normalized by its length in vivo yields gene expression levels which correlate with those obtained through qPCR.[23] This is frequently further normalized by the total number of mapped reads so that expression levels are expressed as Fragments Per Kilobase of transcript per Million mapped reads (FPKM).[18]
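The FPKM normalization described above amounts to a single formula: fragments mapped to a transcript, divided by the transcript length in kilobases and by the total mapped fragments in millions. The sketch below uses illustrative numbers only.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    Equivalent to fragments / (length_bp / 1e3) / (total_mapped / 1e6).
    """
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)

# Illustrative values: 500 fragments on a 2,000 bp transcript,
# in a library with 25 million mapped fragments.
value = fpkm(fragments=500, transcript_length_bp=2000,
             total_mapped_fragments=25_000_000)
print(value)
```

Dividing by length makes long and short transcripts comparable within a sample; dividing by library size makes the same transcript comparable between samples of different depth.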
The only way to be absolutely sure of the individual's mutations is to compare the transcriptome sequences to the germline DNA sequence. This enables the distinction of homozygous genes versus skewed expression of one of the alleles and it can also provide information about genes that were not expressed in the transcriptomic experiment. An R-based statistical package known as CummeRbund[26] can be used to generate expression comparison charts for visual analysis.
Single nucleotide variation discovery

Transcriptome single nucleotide variation has been analyzed in maize on the Roche 454 sequencing platform.[27] Directly from the transcriptome analysis, around 7000 single nucleotide polymorphisms (SNPs) were recognized. Following Sanger sequence validation, the researchers were able to conservatively obtain almost 5000 valid SNPs covering more than 2400 maize genes. RNA-seq is limited to transcribed regions however, since it will only discover sequence variations in exon regions.  This misses many subtle but important intron alleles that affect disease such as transcription regulators, leaving analysis to only large effectors.  While some correlation exists between exon to intron variation, only whole genome sequencing would be able to capture the source of all relevant SNPs.[28]
Post-transcriptional SNVs
Having the matching genomic and transcriptomic sequences of an individual can also help in detecting post-transcriptional edits,[10] where, if the individual is homozygous for a gene, but the gene's transcript has a different allele, then a post-transcriptional modification event is determined.
mRNA centric single nucleotide variants (SNVs) are generally not considered as a representative source of functional variation in cells, mainly due to the fact that these mutations disappear with the mRNA molecule, however the fact that efficient DNA correction mechanisms do not apply to RNA molecules can cause them to appear more often. This has been proposed as the source of certain prion diseases,[29] also known as TSE or  transmissible spongiform encephalopathies.

RNA-seq mapping of short reads over exon-exon junctions. Depending on where each end maps, the event can be defined as a trans or a cis event.

Fusion gene detection

Caused by different structural modifications in the genome, fusion genes have gained attention because of their relationship with cancer.[30] The ability of RNA-seq to analyze a sample's whole transcriptome in an unbiased fashion makes it an attractive tool to find these kinds of common events in cancer.[31]
The idea follows from the process of aligning the short transcriptomic reads to a reference genome. Most of the short reads will fall within one complete exon, and a smaller but still large set would be expected to map to known exon-exon junctions. The remaining unmapped short reads would then be further analyzed to determine whether they match an exon-exon junction where the exons come from different genes. This would be evidence of a possible fusion event, however, because of the length of the reads, this could prove to be very noisy. An alternative approach is to use pair-end reads, when a potentially large number of paired reads would map each end to a different exon, giving better coverage of these events (see figure). Nonetheless, the end result consists of multiple and potentially novel combinations of genes providing an ideal starting point for further validation.
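The paired-end idea can be sketched as a simple tally: count the read pairs whose two ends map to different genes and keep the gene pairs with enough support. This is a toy illustration, not the algorithm of any fusion caller; the function name, the `min_support` threshold and the input pairs are hypothetical.

```python
from collections import Counter

def fusion_candidates(pairs, min_support=2):
    """Count read pairs whose two ends map to different genes.

    pairs: iterable of (gene_of_end_1, gene_of_end_2); None means unmapped.
    Returns {(gene_a, gene_b): supporting_pairs} for pairs of genes with at
    least min_support supporting read pairs (gene names sorted within a key).
    """
    support = Counter()
    for gene_1, gene_2 in pairs:
        if gene_1 is None or gene_2 is None or gene_1 == gene_2:
            continue  # unmapped end, or both ends in the same gene
        support[tuple(sorted((gene_1, gene_2)))] += 1
    return {genes: n for genes, n in support.items() if n >= min_support}

# Hypothetical gene assignments for five read pairs
read_pairs = [("BCR", "ABL1"), ("ABL1", "BCR"), ("TP53", "TP53"),
              ("MYC", "GAPDH"), ("BCR", None)]
candidates = fusion_candidates(read_pairs)
print(candidates)
```

Requiring more than one supporting pair is one simple way to reduce the noise the text mentions; real candidates still need experimental validation.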
Application to Genomic Medicine
History
The past five years have seen a flourishing of NGS-based methods for genome analysis leading to the discovery of a number of new mutations and fusion transcripts in cancer. RNA-Seq data could help researchers interpreting the “personalized transcriptome” so that it will help understanding the transcriptomic changes happening therefore, ideally, identifying gene drivers for a disease. The feasibility of this approach is however dictated by the costs in term of money and time.
A basic search on PubMed reveals that the term RNA Seq, queried as “rna Seq OR RNA-Seq OR rna sequencing OR RNASeq” in order to capture the most common ways of phrasing it, gives 147,525 hits, demonstrating the exponentially increasing usage rate of this technology. A few examples will be considered to show that clinical applications of RNA-Seq have the potential to significantly affect patients' lives and, on the other hand, require a team of specialists (bioinformaticians, physicians/clinicians, basic researchers, technicians) to fully interpret the huge amount of data generated by this analysis.
As an example of excellent clinical applications, researchers at the Mayo Clinic used an RNA-Seq approach to identify differentially expressed transcripts between oral cancer and normal tissue samples. They also accurately evaluated the allelic imbalance (AI), ratio of the transcripts produced by the single alleles, within a subgroup of genes involved in cell differentiation, adhesion, cell motility and muscle contraction[32] identifying a unique transcriptomic and genomic signature in oral cancer patients. Novel insight on skin cancer (melanoma) also come from RNA-Seq of melanoma patients. This approach led to the identification of eleven novel gene fusion transcripts originated from previously unknown chromosomal rearrangements. Twelve novel chimeric transcripts were also reported, including seven of those that confirmed previously identified data in multiple melanoma samples.[33] Furthermore, this approach is not limited to cancer patients. RNA-Seq has been used to study other important chronic diseases such as Alzheimer (AD) and diabetes. In the former case, Twine and colleagues compared the transcriptome of different lobes of deceased AD’s patient’s brain with the brain of healthy individuals identifying a lower number of splice variants in AD’s patients and differential promoter usage of the APOE-001 and -002 isoforms in AD’s brains.[34] In the latter case, different groups showed the unicity of the beta-cells transcriptome in diabetic patients in terms of transcripts accumulation and differential promoter usage[35] and long non coding RNAs (lncRNAs) signature.[36]
ENCODE and TCGA
A lot of emphasis has been given to RNA-Seq data since the Encyclopedia of DNA Elements (ENCODE) and The Cancer Genome Atlas (TCGA) projects used this approach to characterize dozens of cell lines[37] and thousands of primary tumor samples,[38] respectively. The former aimed to identify genome-wide regulatory regions in different cohorts of cell lines, and transcriptomic data are paramount for understanding the downstream effect of those epigenetic and genetic regulatory layers. The latter project, instead, aimed to collect and analyze thousands of patient samples from 30 different tumor types in order to understand the underlying mechanisms of malignant transformation and progression. In this context RNA-Seq data provide a unique snapshot of the transcriptomic status of the disease and look at an unbiased population of transcripts, which allows the identification of novel transcripts, fusion transcripts and non-coding RNAs that could go undetected with different technologies.

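• RNA editing classifier: classifies A->I RNA editing events from RNA-seq variants
• RNA-seq: transcriptome data analysis and processing. I. Pipeline overview: quality assessment of the raw RNA-seq data (raw data); filtering the raw data and removing unreliable data (clean reads); mapping reads back to the genome and transcriptome (alignment); counting (count); differential gene analysis (Gene ...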
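RNA-seq: transcriptome data analysis and processing (Part 1)

Contents
I. Pipeline overview
II. Preparation
1. fastq sequencing files
2. Obtaining the annotation and genome files
III. Software installation
IV. Generating and reading quality reports
1. fastq quality report: Basic Statistics, Per base sequence quality, Per tile sequence quality, Per sequence quality scores, Per base sequence content, Sequence Length Distribution, Per sequence GC content, Adapter Content
2. multiqc quality report
V. Data processing
1. Using trim_galore
2. Quality analysis of the trimmed data
VI. Alignment
1. Obtaining index files
2. Alignment with hisat2: aligning with hisat2, format conversion with samtools
3. Quality assessment of the aligned BAM file
VII. Count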

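I. Pipeline overview
II. Preparation
Learn the principle of Illumina sequencing
The fastq files produced by sequencing
Preparing the annotation and genome files
1. fastq sequencing files
The Illumina data here are paired-end, so one sample yields two files, seq_1.fastq.gz and seq_2.fastq.gz, each holding the reads from one end. For Illumina sequencing, the short cDNA fragments are modified into the following form (see the watermark for the image source):
Sequences produced on Illumina instruments are processed and stored in the FASTQ format as *.fastq files. In a FASTQ file, each read occupies 4 lines.
Line 1: starts with @, followed by the read ID and other information.
Line 2: the read sequence.
Line 3: must start with "+", optionally followed by an ID and a description; if anything follows the "+", it must be identical to the content after "@" on line 1.
Line 4: the quality score of each base. The error probability P is converted on a logarithmic scale into scores, typically 0 to 40, which are stored as the ASCII characters from code 33 (!) to code 73 (I). Using one ASCII character per base keeps the file compact and keeps the quality string aligned one-to-one with the bases.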
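The four-line layout and the Phred+33 encoding can be checked with a short parser. This is a minimal sketch for a single record; the read name, sequence and quality string below are made up for illustration.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record and decode Phred+33 qualities.

    Returns (read_id_line, sequence, list_of_phred_scores).
    """
    header, sequence, plus, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Phred+33: quality score = ASCII code of the character minus 33
    scores = [ord(char) - 33 for char in quality]
    return header[1:], sequence, scores

# Made-up record for illustration
record = ["@read_1 example\n", "GATTACA\n", "+\n", "!I5?+II\n"]
name, seq, quals = parse_fastq_record(record)
print(name, seq, quals)
```

Here "!" (ASCII 33) decodes to Q0 and "I" (ASCII 73) to Q40, the two ends of the range described above.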
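2. Obtaining the annotation and genome files
How to obtain the genome: from the NCBI, UCSC or Ensembl websites, or by searching for "hg38 ftp UCSC". The human genome file hg38.fa.gz is about 938 MB. It can be downloaded directly from the website; the offline-download feature of a cloud drive can speed this up.
Choosing a genome: taking the genomes provided by Ensembl as an example, the genome used for alignment should be Homo_sapiens.GRCh38.dna.primary_assembly.fa. For the different versions of the Ensembl genome, see the README and the post "High-throughput sequencing data processing notes (0): how to choose a suitable reference genome and annotation file for NGS analysis".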
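III. Software installation
Installation method: software can be installed with apt-get, miniconda and similar tools. Because miniconda is convenient, conda is used to install the software below.
Software list:
Quality control: fastqc, multiqc, trimmomatic, cutadapt, trim-galore
Alignment: star, hisat2, tophat, bowtie2, bwa, subread
Counting: htseq, bedtools, salmon, featureCounts
Installing miniconda:
Go to the Tsinghua University open source mirror site, or search for "清华大学 conda" (the Tsinghua mirror is fast because its servers are in China), open the Anaconda page, then download and install Miniconda. On Windows, the environment variables need to be set after installation. On Linux:
Test the Miniconda installation: conda --version
Create an environment named rna: conda create -n rna python=2 (many of the tools depend on Python 2)
Leave the environment: conda deactivate
Configure conda by adding a mirror channel with the following command (updated 6 May 2019):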
conda config --add channels https://mirrors.cloud.tencent.com/anaconda/pkgs/free/

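Installing software: conda install <software> installs the tool together with its dependencies. Note that the tools above should be installed inside the rna environment. The command to activate the rna environment: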
source activate rna

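IV. Generating and reading quality reports
1. fastq quality report
Use the command fastqc -o <output dir> <seqfile1> <seqfile2> ... to generate quality reports. Each fastq file gets its own report describing the sequencing quality of this RNA-seq run. The report contains the following modules:
Basic Statistics
Per base sequence quality
X axis: base positions 1-100 in the reads.
Y axis: the quality score, Q = -10*log10(error P), so Q20 corresponds to a 1% error rate and Q30 to a 0.1% error rate.
Box plot: the red line is the median quality of all bases at that position; the yellow box is the 25%-75% range; the blue line is the mean.
Usual requirement: the 10% line of each box should be above Q=20 (the Q20 filter).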
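The relation Q = -10*log10(P) can be inverted to read error rates straight off the quality scores. The sketch below also computes the share of bases at or above Q20; the helper names and the example scores are illustrative.

```python
def error_probability(q):
    """Invert Q = -10 * log10(P): return the error probability for score q."""
    return 10 ** (-q / 10)

def fraction_at_least(scores, threshold=20):
    """Fraction of bases whose quality is at or above the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)

# Illustrative per-base scores for one short read
scores = [0, 40, 20, 30, 10, 40, 40]
print(error_probability(20), error_probability(30))
print(fraction_at_least(scores, threshold=20))
```

Q20 therefore means one expected error in 100 base calls, and Q30 one in 1,000.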
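Per tile sequence quality
X axis: as above. Y axis: the tile index. Purpose: to catch tiles whose sequencing quality was low because of uncontrollable factors during the run. Reading the plot: blue means high quality; lighter or warmer colors mean low quality, and low-quality tiles can be removed in later analysis.
Per sequence quality scores
Per base sequence content
X axis: base positions 1-100. Y axis: percentage of each base. Standard: in theory the proportions of A, T, C and G should differ little, so the four lines should run roughly parallel. If A and T, or C and G, differ by more than 10%, this module is flagged. This usually happens in the first few bases, where the instrument is still settling and the calls are slightly biased; if the first few bases deviate strongly, they can be trimmed off later.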
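The 10% rule can be sketched as a per-position check. This is an illustration of the idea rather than FastQC's own implementation; the function name and the toy reads are hypothetical.

```python
from collections import Counter

def base_content_warnings(reads, max_gap=10.0):
    """Return 1-based positions where %A vs %T or %G vs %C differ by more than max_gap."""
    flagged = []
    length = min(len(read) for read in reads)
    for position in range(length):
        counts = Counter(read[position] for read in reads)
        total = sum(counts.values())
        percent = {base: 100.0 * counts.get(base, 0) / total for base in "ACGT"}
        if (abs(percent["A"] - percent["T"]) > max_gap
                or abs(percent["G"] - percent["C"]) > max_gap):
            flagged.append(position + 1)
    return flagged

# Toy reads: position 1 is always A, the other positions are balanced
reads = ["AACG", "ATGC", "ATCG", "AAGC"]
flagged_positions = base_content_warnings(reads)
print(flagged_positions)
```

If only the first few positions are flagged, trimming those bases is usually enough.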
Sequence Length Distribution
Per sequence GC content
2. multiqc quality report
multiqc summarizes several fastqc reports into a single report file, which gives a clearer overview. Usage:
multiqc <analysis directory>


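V. Data processing
1. Using trim_galore
trim_galore [options] <filename>
--phred33       # quality encoding: ASCII code = Q + 33. This is the default.
--output_dir   # output directory; make sure the path exists and is accessible
--length        # length threshold; reads shorter than this are discarded. The read length here is 100 and I set 75, which feels a bit wasteful
<filename>  # for Illumina paired-end data, give both files together

The command used:
trim_galore --output_dir clean --paired --length 75 --quality 25 --stringency 5 seq_1.fastq.gz seq_2.fastq.gz

Processing takes some time and disk space, and produces the trimmed data.
2. Quality analysis of the trimmed data
Run a quality analysis on the filtered files to inspect the result of the filtering, again using fastqc and multiqc. The results are as follows: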

Quality report for ENCFF108UVC_val_1_fastqc

SUMMARISING RUN PARAMETERS
==========================
Input filename: ENCFF108UVC.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.5.0
Quality Phred score cutoff: 25
Quality encoding type selected: ASCII+33
Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 5 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 75 bp
Output file will be GZIP compressed

This is cutadapt 1.18 with Python 2.7.6
Command line parameters: -f fastq -e 0.1 -q 25 -O 5 -a AGATCGGAAGAGC ENCFF108UVC.fastq.gz
Processing reads on 1 core in single-end mode ...
=== Summary ===

Reads written (passing filters):    26,038,229 (100.0%)

Total basepairs processed: 2,603,822,900 bp
Quality-trimmed:              82,577,636 bp (3.2%)
Total written (filtered):  2,513,138,030 bp (96.5%)


The report gives the details of the processing.
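VI. Alignment
Overview: the trimmed fastq files are aligned to the genome or the transcriptome to determine where each read comes from. The two cases use different strategies: ungapped alignment to the transcriptome and gapped alignment to the genome.
Software: hisat2 and STAR both perform well for alignment. According to the literature, hisat2 produces fewer false positives but more false negatives, and it is faster. STAR gives better overall alignment quality and does well on both long and short reads. Because of its speed, hisat2 is used here. Before aligning, the index files must be obtained or built.
1. Obtaining index files
Each aligner builds its index in its own way, and the indexes are not interchangeable. Index files can be downloaded or built locally, but building takes time: about one hour (Mac: 2.6 GHz Intel Core i5 / 8 GB 1600 MHz DDR3). hisat2 genome indexes can be downloaded from http://ccb.jhu.edu/software/hisat2/index.shtml. The local index build followed the procedure by CSDN@ Richard_Jolin. An index consists of several files; their format and shared name prefix must be kept consistent.
2. Alignment with hisat2
Aligning with hisat2
Following the hisat2 manual, the command was built as:
hisat2 -p 6 -x <dir of index of genome> -1 seq_val_1.fq.gz -2 seq_val_2.fq.gz -S tem.hisat2.sam

Parameters:

-p # number of threads
-x # directory and prefix of the reference genome index files
-1 # file with one end of the paired-end reads
-2 # file with the other end
-S # output SAM file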
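Format conversion with samtools
SAM and BAM files: samtools is used for further processing of alignment results in SAM and BAM format. SAM files are large, so results are generally stored as BAM. Because BAM is a binary format, the file is much smaller, roughly 1/6 the size of the SAM file. Converting SAM to BAM with samtools:

samtools view -S seq.sam -b > seq.bam  # convert the format
samtools sort seq.bam -o seq_sorted.bam  # sort the BAM file
samtools index seq_sorted.bam  # index the sorted BAM file, producing a .bai file for fast random access


This completes an RNA-seq file aligned to the genome. seq_sorted.bam can be viewed with samtools or with the standalone IGV (Integrative Genomics Viewer). Loading seq_sorted.bam in IGV shows clearly how the reads map to the genome and how they relate to exons and introns.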
3. Quality assessment of the aligned BAM file
samtools flagstat: tallies the FLAG information of the alignments in a BAM file and prints a summary. Command:
samtools flagstat seq_sorted.bam > seq_sorted.flagstat

The result:

47335812 + 0 in total (QC-passed reads + QC-failed reads)
3734708 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
46714923 + 0 mapped (98.69% : N/A)
43601104 + 0 paired in sequencing
21800552 + 0 read1
21800552 + 0 read2
42216752 + 0 properly paired (96.82% : N/A)
42879780 + 0 with itself and mate mapped
100435 + 0 singletons (0.23% : N/A)
337412 + 0 with mate mapped to a different chr
308168 + 0 with mate mapped to a different chr (mapQ>=5)
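The percentages in this summary can be recomputed from the counts, which is a handy sanity check when comparing samples. The sketch below parses four lines copied from the output above; the helper function is hypothetical and only assumes the "N + M label" line layout shown there.

```python
import re

def flagstat_count(text, label):
    """Return the QC-passed count from the first line whose label starts with `label`."""
    for line in text.splitlines():
        match = re.match(r"\s*(\d+) \+ \d+ (.*)", line)
        if match and match.group(2).startswith(label):
            return int(match.group(1))
    raise KeyError(label)

report = """\
47335812 + 0 in total (QC-passed reads + QC-failed reads)
46714923 + 0 mapped (98.69% : N/A)
43601104 + 0 paired in sequencing
42216752 + 0 properly paired (96.82% : N/A)
"""

mapped_pct = 100 * flagstat_count(report, "mapped (") / flagstat_count(report, "in total")
proper_pct = 100 * (flagstat_count(report, "properly paired")
                    / flagstat_count(report, "paired in sequencing"))
print(round(mapped_pct, 2), round(proper_pct, 2))
```

Note the different denominators: the mapped rate is relative to all records (including secondary alignments), while the properly paired rate is relative to reads paired in sequencing.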

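VII. Count
Count how many RNA-seq reads (read pairs) align to each annotated feature of the genome. Counting tool: featureCounts. Command:
featureCounts -T 6 -t exon -g gene_id -a <gencode.gtf> -o seq_featurecount.txt <seq.bam>

Parameters:

-T # number of threads
-g # attribute in the annotation file used to group features into meta-features; default gene_id
-t # feature type to extract from the annotation file; default exon
-p # for paired-end data
-a # input GTF/GFF annotation file
-o # output file

The next step is building the expression matrix, which is analyzed in R.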
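Before moving to R, the count table can be inspected with a few lines of Python. This sketch assumes the usual featureCounts layout: a leading comment line starting with "#", then a tab-separated header with Geneid, Chr, Start, End, Strand and Length followed by one column per BAM file. The gene IDs and counts below are made up.

```python
import csv
import io

# Annotation columns assumed to precede the per-sample count columns
ANNOTATION_COLUMNS = {"Geneid", "Chr", "Start", "End", "Strand", "Length"}

def read_count_table(handle):
    """Return {gene_id: {sample_column: count}} from a featureCounts-style table."""
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    samples = [name for name in reader.fieldnames if name not in ANNOTATION_COLUMNS]
    return {row["Geneid"]: {sample: int(row[sample]) for sample in samples}
            for row in reader}

# Illustrative table contents, not real output
example = (
    "# Program:featureCounts (illustrative header line)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tseq_sorted.bam\n"
    "geneA\tchr1\t100\t900\t+\t801\t120\n"
    "geneB\tchr2\t50\t450\t-\t401\t0\n"
)
counts = read_count_table(io.StringIO(example))
print(counts["geneA"]["seq_sorted.bam"])
```

With several BAM files counted in one run, each gene maps to one count per sample, which is the genes-by-samples expression matrix the next step needs.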

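Let's encourage each other! Discussion, questions and criticism are all welcome.

Please also allow me a short plug: my notes are scattered across CSDN, Jianshu, Zhihu columns and elsewhere, which makes them hard to manage, so after some thought I recently set up a WeChat official account, where I can more conveniently publish content on bioinformatics and statistics. Follow it if you find it useful. The account:

My WeChat official account: 进击的大肠杆菌

I would like to build and run a high-quality WeChat discussion group on bioinformatics and statistics. If you want to join, add me on WeChat: veryqun. I will add you to the group, and you can also message me with questions.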
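• ENCODE RNA-seq pipeline. Overview: this is the ENCODE-DCC RNA-sequencing pipeline. Its scope is to align reads, generate signal tracks, and quantify genes and isoforms. Installation: installation instructions. How-to: how-to guide. Inputs: pipeline description. Outputs: pipeline description. References: references
• SARTools: Statistical Analysis of RNA-Seq Tools
• In benchmarks with standard RNA-Seq data, kallisto can quantify 30 million human bulk RNA-seq reads in less than 3 minutes on a Mac desktop computer, using only the read sequences and a transcriptome index that itself takes 10 minutes to build. Pseudoalignment of reads preserves...
• Grape provides an extensive pipeline for RNA-Seq analysis. It allows the creation of automated and integrated workflows to manage and analyze RNA-Seq data. It uses … as the execution backend. See … for more information. Within the IHEC consortium, Grape has been used for integrative RNA-seq analysis. Following the IHEC recommendations,...
• A non-exhaustive list of single-cell RNA-Seq (scRNA-Seq) papers so far. Papers in bold are the ones I am slightly biased toward covering. The papers include analysis methods, protocols, reviews and applications. Paper list: 2015, 2014. Reconstructing distal lung... using single-cell RNA-seq...
• RNA-Seq toy pipeline. A proof-of-concept RNA-Seq pipeline intended to show Nextflow scripting and reproducibility features. How to run: install Docker on your computer (learn more here). Install Nextflow (version 20.01.0 or later): curl -fsSL get.nextflow.io | ...
• An ever-growing collection of RNA-seq tools. RNA-seq related tools and genomic data analysis resources. For other programming and genomics-related notes, see … . Table of contents: pipelines. FastQC / MultiQC, TrimGalore, STAR (two-pass...
• This one-day online workshop takes an in-depth look at VDJ analysis using single-cell RNA-Seq data generated on the 10x platform. Participants should already be familiar with single-cell RNA-Seq analysis; the course will not cover its basics. You will need a computer with a reliable internet connection, with … installed and...
• RNA-seq. STAR-based ENCODE Long RNA-Seq processing pipeline. This pipeline is deprecated in favor of … . It does have code for running RAMPAGE / CAGE experiments on DNA Nexus, if that is what you are looking for
• A good introduction to NGS that covers RNA-seq
• TORNADO-seq protocol. Custom code for analyzing the targeted RNA-seq raw data available in GSE157167
• RASflow: an RNA-Seq analysis Snakemake workflow. RASflow is a modular, flexible and user-friendly RNA-Seq analysis workflow. It can be applied to both model and non-model organisms. It supports mapping RNA-Seq raw reads to both the genome and the transcriptome (which can be downloaded from public databases...
• RNA-seq in practice: introduces many mainstream tools and provides a research workflow and approach.
• clustifyr classifies cells and clusters in single-cell RNA sequencing experiments using reference bulk RNA-seq datasets, sorted microarray expression data, single-cell gene signatures, or lists of marker genes. Single-cell transcriptomes are difficult to annotate without knowledge of the underlying biology. Even with this knowledge,...
• A basic bulk RNA-seq pipeline in Snakemake. Contents. Description: this repository contains a demonstration of a basic bulk RNA-seq pipeline in two forms, Snakemake (in the workflow/ directory) and bash scripts (in the bash_workflow/ directory). Both workflows perform the same analysis,...
• Practical methods for RNA-seq data analysis, covering all aspects of RNA-seq data analysis.
• awesome-single-cell: a community-curated list of single-cell software packages and data resources, including RNA-seq and ATAC-seq
• RNA-Seq is an experimental technique for studying the transcriptome based on high-throughput sequencing, and it is becoming an important means of analyzing gene expression levels. Alternative splicing, which is widespread in eukaryotes, causes RNA-Seq reads to map to multiple splice isoforms of the reference sequence, and reads on the reference sequence...
• lcdb-wf: a collection of workflows and tools for common high-throughput sequencing analyses, along with associated infrastructure. See the documentation. Reference workflows ... RNA-seq workflow: DAG for the RNA-seq workflow. ChIP-seq workflow: DAG for the ChIP-seq workflow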

...