生物信息学实战

转录组入门演习

计算机资源的准备

系统环境及硬件配置：
- macOS High Sierra 10.13.4
- MacBool Air (13-inch, Early 2015)
- 处理器 1.6GHz intel Core i5
- 内存 4GB 1600MHz DDR3
- 存储 128G
- uname -a x86_64
安装Homebrew包管理，要求预先安装Xcode；
（App Store中Xcode要求macOS High Sierra 10.13.2以上）
1
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
用homebrew安装wget；（也可在rudix网站上下载wget安装包）
1
brew install wget
wget下载miniconda；
此处注意网上教程多在Linux或Windows虚拟Ubantu系统下操作，要注意下载MacOSX版文件！
1
wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh

bash安装miniconda；

1	bash Miniconda3-latest-MacOSX-x86_64.sh

输入yes接受条款；
回车确认安装位置；
输入yes确定PATH；
source完成最后安装；
此处注意MacOSX系统中不是.bashrc，应根据反馈结果修改为.bash_profile！
1
source ~/.bash_profile
查看是否安装成功；
1
conda
更新miniconda；
1
conda update conda

添加channels,即为conda增加软件源；

conda config - add channels r
conda config - add channels bioconda
conda config - add channels defaults
conda config – add channels conda-forge

查看添加的频道；
1
conda config – get channels

添加清华源镜像；

1
2
3

conda config - add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config - add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
conda config - add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/

显示已添加的源镜像地址；

1	conda config --set show_channel_urls yes

安装sratoolkit,fastqc,hisats,samtools,htseq-count,R,Rstudio软件；

conda install -c bioconda sra-tools
conda install -c bioconda fastqc
conda install -c bioconda hisat2
conda install -c bioconda samtools
conda install -c bioconda htseq
# 本机已安装过R和RStudio,故未重复下载
# conda install r
# conda install rstudio

读文章拿到测序数据

检索实战学习示例文章；
AKAP95 regulates splicing through scaffolding RNAs and RNA processing factors. Nat Commun 2016 Nov 8;7:13347. PMID: 27824034
阅读文章Methods部分，尤Bioinformatic analyses和Data availability部分；

Bioinformatic analyses
We obtained 35–55 million of paired 51 bp reads for each sample. The investigator was blinded to sample identities when performing bioinformatic analyses. All the reads were mapped to the human reference genome (GRCh37/hg19) using TopHat (v2.0.13) with a gene transfer file (GTF version GRCh37.70). Low quality mapped reads (MQ>30) were removed from the analysis. The mean insert sizes and the s.d.’s were calculated using Picard-tools (v1.126).
For mRNA-Seq, read count tables were generated using HTSeq (v0.6.0)53 and deferential expression (DE) analysis of genes were performed using DESeq (v3.0)54. Deferential exon usage was performed using DEXSeq (v3.1)29. Two independent biological repeats for control and KD were included in both DEseq and DEXseq. For 293 cells, the mRNA-seq results of the control samples include (1) those from the doxycycline-treated parental Flp-In T-REx 293 cells by us and (2) those from the doxycycline-treated control Flp-In T-REx 293 cells performed by another group unrelated to us (sample GSM1095127 in GSE44976)31, and the two AKAP95 KD samples were from the two different clones (miR#8 and miR#12) of Flp-In T-REx 293 cells stably expressing AKAP95 microRNAs after doxycycline treatment. For mouse ES cells, the two biological repeats for both control and shRNA#1-mediated Akap95 KD were independently generated by two different persons in our lab. The read per million normalized BigWig files were generated using BEDTools (v2.17.0)55 and bedGraphToBigWig tool (v4). All the downstream statistical analyses and generating plots were performed in R (v3.1.1). Gene ontology analysis was performed at DAVID.
For RIP-Seq, mapped files were passed to Model-based Analysis of ChIP-Seq (MACS) (v1.4.2 20120305)56 for peak calling and HOMER (vv3.12, 6-8-2012)57 was used for Motif finding. Bedgraph files generated by MACS were then normalized based on read per million and converted to BigWig files using bedGraphToBigWig. The input RNA-seq files were not subject to MACS. The genomic regions for annotation (intergenic regions, exons, UTRs, distal and proximal introns) were extracted from the GTF file used above. Peak annotations and coverage calculations for these genomic regions were performed using Bedtools and SAMtools (v0.1.18)58, respectively. Profile plots were generated by ngs.plot (v2.47)59.

Data availability
The RIP-seq an RNA-seq data have been deposited in the Gene Expression Omnibus database, with accession code GSE81916. All other data is available from the author upon reasonable request.

在GEO上查找GSE81916,查找得其对应的SRA号为SRP075747（以及BioProject号PRJNA323422）；

Overall design
Samples 1-8 are RNA-immunoprecipitation (RIP)-seq to determine AKAP95 binding to the transcriptome. Samples 9-15 are mRNA-seq to determine effect of AKAP95 knockdown in human 293 cells (9-11) or mouse ES cells (12-15).

阅读网页Overall design部分介绍得知，属于RNA-seq数据的是第9-15个样品；
在SRA上查找SRP075747（或PRJNA323422）,查找得其第9-15个样品，即GSM2177723到GSM2177729对应的SRA号为SRR3589956到SRR3589962；
注意理解GEO/SRA数据库的数据存放形式
安装axel来下载数据比wget及prefetch更快；（也可使用GEOquery下载）
（Aspera只有Windows系统下的client）
1
brew install axel

循环axel语句实现数据下载；
（2200+KB/s，用时约两小时）

for i in {56..62}  # {56..62}与`seq 57 62`等价
do
  axel ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP075/SRP075747/SRR35899$i/SRR35899$i.sra
done

了解fastq测序数据

循环sratoolkit（sra-tools）中fastq-dump语句把sra文件转换为fastq格式；
（用时约三个半小时）
注意这一步不显示百分数进度条！
1
2
3
4
5
fastq-dump -h # 查看帮助文件
for i in {56..62}
do
fastq-dump --gzip --split-3 -O /Users/xiapengfei/ -A SRR35899${i}.sra
done

循环fastqc语句测试测序文件的质量；(MultiQC可以批量显示质控结果）
（4线程，用时约25分钟）

for i in {56..62}
do
  fastqc -t 4 -O /Users/xiapengfei/ SRR35899$i.sra_1.fastq.gz SRR35899$i.sra_2.fastq.gz
done  # fastqc *.fastq在当前目录中无其他同类文件时等价，其他循环或可同理

_理解测序reads，GC含量，质量值，接头，index，fastqc的全部报告_

经验教训

了解参考基因组及基因注释

在UCSC中打开Downloads中Genome Data页面，点击Human选择Feb. 2009 (GRCh37/hg19)栏下Full data set，下拉找到chromFa.tar.gz文件；

chromFa.tar.gz - The assembly sequence in one file per chromosome.
Repeats from RepeatMasker and Tandem Repeats Finder (with period
of 12 or less) are shown in lower case; non-repeating sequence is
shown in upper case.

axel命令下载hg19参考基因组数据；
（400+KB/s，用时约35分钟）

axel http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
tar zxvf chromFa.tar.gz  # 解压该文件
cat *.fa > hg19.fa  # 将解压文件合并为一个文件
rm chr*.fa  # 删除无关数据

从gencode数据库中打开Data中Human栏内GRCh37-mapped Releases，点击GENCODE release栏下最新版的数据，下载基因注释文件；
（用时5分钟)

axel ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh37_mapping/gencode.v28lift37.annotation.gtf.gz  # GTF格式主要是用来描述基因的注释
axel ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh37_mapping/gencode.v28lift37.annotation.gff3.gz  # GFF文件是一种用来描述基因组特征的文件，现在我们所使用的大部分都是第三版（GFF3）
# gzip -d解压并删除压缩包
gzip -d gencode.v28lift37.annotation.gtf.gz
gzip -d gencode.v28lift37.annotation.gff3.gz

- 用IGV去查看感兴趣的基因的结构；

下载安装JAVA 8；
下载igv直接打开；
在菜单栏Genomes中选择Load Genome from File…导入hg19.fa文件；
在菜单栏Tools中打开Run igvtools…界面，在Command栏中选择Sort，在Input File栏中导入gtf文件后运行；
在菜单栏File中选择Load from File…导入sorted.gtf文件并index；
在工具栏输入查找的基因名词（如：TP53）或染色体定位（如chr17:7,569,720-7,592,868)查看可视化结构；

序列比对

从HISAT2的主页下载index文件；
（400+KB/s，用时约两个半+小时）

axel ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/hg19.tar.gz  # 下载人类的索引
axel ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/mm10.tar.gz  # 下载小鼠的索引
tar -zxvf *.tar.gz  # 解压压缩包
rm -rf *.tar.gz  # 删除压缩包

这次分析是人类mRNA-Seq测序的结果，但是这里其实只下载了3个人类的sra文件。一般而言RNA-Seq数据分析都至少要有2个重复，所以必须要有4个sra文件才行。实际上前面原文Bioinformatic analyses部分有说该研究control组中的一组数据用的是其他课题组的：

For 293 cells, the mRNA-seq results of the control samples include (1) those from the doxycycline-treated parental Flp-In T-REx 293 cells by us and (2) those from the doxycycline-treated control Flp-In T-REx 293 cells performed by another group unrelated to us (sample GSM1095127 in GSE44976)31, and the two AKAP95 KD samples were from the two different clones (miR#8 and miR#12) of Flp-In T-REx 293 cells stably expressing AKAP95 microRNAs after doxycycline treatment.

不同来源的RNA-Seq数据的构建文库不一样,结果之间比较需要考虑批次效应（batch effect） 的影响
IVT-seq reveals extreme bias in RNA-sequencing 证明了不同文库的RNA-Seq结果会存在很大差异

把fastq格式的reads比对上去得到sam文件(仅小鼠，下同)；
注意这一步也不显示百分数进度条！

# 比对人类的索引 9:55
# for i in {56..58}
# do
#   hisat2 -t -x /Users/xiapengfei/hg19/genome -1 SRR35899${i}.sra_1.fastq.gz -2 SRR35899${i}.sra_2.fastq.gz -S SRR35899${i}.sam
# done

# 比对小鼠的索引
for i in {59..62}
do
  hisat2 -t -x /Users/xiapengfei/hg19/genome -1 SRR35899${i}.sra_1.fastq.gz -2 SRR35899${i}.sra_2.fastq.gz -S SRR35899${i}.sam
done

用samtools把sam文件转为bam文件，并且排序(注意N和P两种排序区别)索引好；

for i in {59..62}
do 
  samtools view -S SRR35899${i}.sam -b > SRR35899${i}.bam  # 将比对后的sam格式文件转换成bam文件
  samtools sort SRR35899${i}.bam -o SRR35899${i}.sorted.bam  # 将所有的bam文件进行排序
  samtools index SRR35899${i}_sorted.bam # 将所有的排序文件建立索引
done

samtools sort默认按照染色体位置进行排序，-n参数根据read名进行排序。-t参数根据TAG进行排序

对bam文件进行质控；
可以使用RSeQC
RSeQC的安装，需要预先安装R，Python2.7，gcc，numpy
1
2
3
4
5
pip install RSeQC
for i in {56..62}
do
bam_stat.py -i SRR35899${i}_sorted.bam
done

另有Picard

conda install ucsc-gtftogenepred
gtfToGenePred -genePredExt -ignoreGroupsWithoutExons -geneNameAsName2 gencode.v28lift37.annotation.gtf gencode.v28lift37.annotation.gpd
axel http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz
gzip -d refFlat.txt.gz
java -jar ~/biosoft/picard_2.8.0/picard.jar CollectRnaSeqMetrics I= Homo_sapiens_Control_293_cell_sorted.bam O=Homo_sapiens_Control_293_cell_RNA_Metrics REF_FLAT= ~/rna_seq/data/reference/genome/hg19/refFlat.txt STRAND_SPECIFICITY=NONE

或Qualimap等工具

IGV载入参考序列，注释和BAM文件查看略；

reads计数

用htseq对比得到的bam文件进行count计数；

for i in $(seq 56 58)
do 
  htseq-count -f bam -r pos -s no SRR35899${i}.bam gencode.vM10.annotation.gtf SRR35899${i}.count
done

将count文件合并为表达矩阵；（59和61是control， 60和62是repeat）

options(stringsAsFactors = FALSE)
# 导入表格数据
control1 <- read.table("SRR3589959.count", sep="\t", col.names = c("gene_id","control"))
repeat1 <- read.table("SRR3589960.count", sep="\t", col.names = c("gene_id","akap951"))
control1 <- read.table("SRR3589961.count", sep="\t", col.names = c("gene_id","control"))
repeat2 <- read.table("SRR3589962.count", sep="\t", col.names = c("gene_id","akap952"))
# 按照gene_id合并
raw_count <- merge(merge(control1, control2, by="gene_id"), merge(repeat1, repeat2, by="gene_id"))
# 过滤删去5行
raw_count_filt <- raw_count[-48823:-48825, ]
raw_count_filter <- raw_count_filt[-1:-2, ]
# 因为EBI数据库无法直接搜索小数位，因此需要替换为整数形式
ENSEMBL <- gsub("\\.\\d*", "", raw_count_filter$gene_id)
# 将ENSEMBL重新添加到矩阵中
row.names(raw_count_filter) <- ENSEMBL
# 在EBI或UniProt数据库找到AKAP95的id，并从矩阵中找到访问，并赋值给AKAP95变量
> AKAP95 <- raw_count_filt[rownames(raw_count_filt)=="ENSG00000105127",]
# 查看AKAP95基因表达情况，结果明显比对照组高
> AKAP95
# 查看总体情况
summary(raw_count_filter)

另可使用perl

perl -lne 'if($ARGV=~/Homo_sapiens_(.*)_sorted.count/){print "$1\t$_"}' *|grep -v  Homo_sapiens>hg.count
# 先把所有文件进行合并
setwd("~/rna_seq/work/matrix")
hg <- read.csv(file = "hg.count",header = F,sep = "\t")
colnames(hg) <- c('sample','gene','count')
library(reshape2)
reads <- dcast(hg,formula = gene ~ sample)
write.table(reads,file = "hg_join.count",sep = "\t",quote = FALSE,row.names = FALSE)
summary(reads)
sample_mean <- group_by(hg,sample)%>% summarize(mean=mean(count))
gene_mean <- group_by(hg,gene)%>% summarize(mean=mean(count)) 
tmp <- full_join(hg,gene_mean,by='gene')

差异基因分析

参考：
青山屋主徐洲更 PANDA姐/沈梦圆

在R中安装DESeq2；（也可使用edgeR，limma等）

rm(list=ls())
source("https://bioconductor.org/biocLite.R") # 载入安装工具
biocLite("DESeq2")  # 安装包
library("DESeq2") # 加载包，测试是否安装成功

正式构建dds矩阵；

countData <- raw_count_filter
condition <- factor(c("control","control","KD","KD"))
colData <- data.frame(row.names=colnames(countData), condition)
dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), design= ~ condition )
head(dds)  #查看一下构建好的矩阵
dds <- DESeq(dds)   #对原始dds进行normalize
# 查看结果的名称，本次实验中是 "Intercept"，"condition_akap95_vs_control"
resultsNames(dds2)
# 将结果用results()函数来获取，赋值给res变量
res <- results(dds2)
# summary一下，看一下结果的概要信息
summary(res)

提取差异分析结果；

# 获取padj（p值经过多重校验校正后的值）小于0.05，表达倍数取以2为对数后大于1或者小于-1的差异表达基因。
> table(res$padj<0.05) #取P值小于0.05的结果
> res <- res[order(res$padj),]
> diff_gene_deseq2 <-subset(res,padj < 0.05 & (log2FoldChange > 1 | log2FoldChange < -1))
> diff_gene_deseq2 <- row.names(diff_gene_deseq2)
> resdata <-  merge(as.data.frame(res),as.data.frame(counts(dds2,normalize=TRUE)),by="row.names",sort=FALSE)
# 得到csv格式的差异表达分析结果
> write.csv(resdata,file= "control_vs_akap95.cvs",row.names = F)

差异基因结果注释/富集分析

参考：
徐洲更宇天沈梦圆

根据padj < 0.05 且Log2FoldChange的绝对值大于1的标准筛选显著差异表达基因集；
1
deseq2.sig <- subset(res, padj < 0.05 & abs(log2FoldChange) > 1)

安装下载注释数据包；（rog.HS.eg.db)

source("https://bioconductor.org/biocLite.R")
biocLite("clusterProfiler")
biocLite（"AnnotationHub")
library(AnnotationHub)
ah <- AnnotationHub()
org.hs <- ah[['AH53766']]

GO富集和GESA；
Y叔的clusterProfiler简单好用，结果正确，没有必要专门导出数据用GSEA软件做超几何分布检验分析了

ego <- enrichGO(
  gene = row.names(deseq2.sig),
  OrgDb = org.hs,
  keytype = "ENSEMBL",
  ont = "MF"
)
dotplot(ego,font.size=5)  # 气泡图
enrichMap(ego, vertex.label.cex=1.2, layout=igraph::layout.kamada.kawai)
plotGOgraph(ego)  # 网络图
gseaplot(gsemf, geneSetID="GO:0004871")  # 画GSEA标志性图

KEGG富集分析；

library(clusterProfiler)
gene_list <- mapIds(org.hs, keys = row.names(deseq2.sig),
  column = "ENTREZID", keytype = "ENSEMBL" )
kk <- enrichKEGG(gene_list, organism="hsa",
  keyType = "ncbi-geneid",
  pvalueCutoff=0.05, pAdjustMethod="BH",
  qvalueCutoff=0.1)
head(summary(kk))
dotplot(kk)  # 气泡图

致谢

感谢曾健明在其建立的生信技能树平台发起转录组入门实战项目；
感谢徐洲更、沈梦圆、青山屋主等在各平台上热心分享经验以资参考；
感谢所有论文作者和数据共享平台开放数据获取途径；
获益良多，相见恨晚！

参考文献

《RNA-seq数据分析实用方法》
（《RNA-seq Data Analysis A Practical Approach》）