上一期,我们已经初步了解了单细胞的质控内容【第2期. 单细胞测序:下游数据质控知多少】,本期我们将通过Seurat 对内置的PBMC数据进行标准化的全流程演示。在实操之前,我们还需要了解两点:①何为Seurat;②基于Seurat分析的单细胞数据格式是什么。
一、Seurat简介
Seurat是一个R包,被设计用于scRNA-seq数据的细胞质控和分析。Seurat旨在使用户能够识别和解释单细胞转录组数据中的异质性来源,同时提供整合不同类型的单细胞数据的函数。目前Seurat软件版本已更新到V5。Seurat可以进行单个样品的标准分析、多个样品合并分析、空间转录组分析、scRNA-seq 和scATAC-seq的联合分析。Seurat的官网:https://satijalab.org/seurat/。大家可以自行挖掘Seurat的更多介绍。
二、Seurat数据格式
当我们在R Studio中用Seurat导入单细胞数据后,可以得到SeuratObject,如下图:
数据通过seurat官网可获取。若想直接获得pbmc练习数据,可在公众号输入:scRNA-seq之PBMC练习数据。
1. 加载包
library(Seurat)
library(ggplot2)
library(patchwork)
2. 读取数据
pbmc.counts <- Read10X(data.dir = "Data")
dim(pbmc.counts)
pbmc <- CreateSeuratObject(counts = pbmc.counts,min.cells=3)
head(pbmc@meta.data)
3. 根据后缀添加样本信息,注意生成矩阵文件时的顺序
sample_info<-c("sample1","sample2")
Header<-colnames(pbmc)
project<-lapply(1:length(sample_info),function(x){
sample_name<-sample_info[x]
pattern<-paste0("-",x,"$")
Index<-grep(pattern,Header,value=F)
info<-c(rep(sample_name,length(Index)))
return(info)
})
project=unlist(project)
pbmc[['orig.ident']]<-project
4. 计算线粒体基因表达情况
pbmc[["percent.mito"]] <-PercentageFeatureSet(pbmc, pattern = "^MT-")
if(!dir.exists("Result")){dir.create("Result")}
6. 绘制细胞基因数及线粒体表达情况小提琴图
p<-VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mito"), ncol = 3,group.by = "orig.ident")
ggsave(file="Result/ngene_numi_pmito_vlnplot.png",dpi=300,height=6,width=8)
7. 细胞过滤
pbmc_filt <- subset(pbmc,subset = nFeature_RNA>500 & nFeature_RNA<2500 & nCount_RNA>-Inf & nCount_RNA<Inf & percent.mito<5)
pbmc_filt
8. 绘制过滤后细胞基因数及线粒体表达情况小提琴图
p<-VlnPlot(pbmc_filt, features = c("nFeature_RNA", "nCount_RNA", "percent.mito"), ncol = 3,group.by = "orig.ident")
ggsave(file="Result/filter_ngene_numi_pmito_vlnplot.png",dpi=300,height=6,width=8)
9. 数据标准化
pbmc_filt <- NormalizeData(pbmc_filt,normalization.method = "LogNormalize", scale.factor = 10000) ## 矫正测序深度(Feature counts for each cell are divided by the total counts for that cell and multiplied by the scale.factor. This is then natural-log transformed using log1p)
pbmc_filt <- FindVariableFeatures(object = pbmc_filt,selection.method = "vst",nfeatures = 2000) #寻找高变基因
top10 <- head(VariableFeatures(pbmc_filt), 10)
plot1 <- VariableFeaturePlot(pbmc_filt)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
p <- plot1 + plot2
ggsave(file="Result/VariableFeature.png",dpi=300,width=12,height=6)
all.genes <- rownames(pbmc_filt)
pbmc_filt <- ScaleData(object = pbmc_filt,features =all.genes) # 对高标基因进行归一化
10. PCA降维
pbmc_filt <- RunPCA(object = pbmc_filt)
pbmc_filt <- JackStraw(pbmc_filt, num.replicate = 100)
pbmc_filt <- ScoreJackStraw(pbmc_filt, dims = 1:20)
p<-JackStrawPlot(pbmc_filt, dims = 1:15)
ggsave(file="Result/JackStraw.png",dpi=300,height=6,width=8)
p<- ElbowPlot(pbmc_filt)
ggsave(file="Result/ElbowPlot.png",dpi=300,height=6,width=8)
一般选择标准差趋于平缓时的PC数量即可,如PC=10。
pbmc_filt <- RunUMAP(pbmc_filt, dims = 1:10)
umap1<-DimPlot(pbmc_filt,reduction = "umap",group.by ="orig.ident",pt.size=1.2)
umap2<-DimPlot(pbmc_filt,reduction = "umap",pt.size=1.2,label = TRUE)
p<-umap1 + umap2
ggsave(file="Result/umap.png",dpi=300,width=14,height=6)
期待已久~|R语言与组学互助交流群来啦!
(欢迎大家入群交流~
联系客服