现如今,由于二代测序的普及化、公共数据库的便利化,越来越多的科研工作者可以将不同的组学数据进行大数据的整合分析。今天,小编就为大家讲解如何进行QTL的分析流程。
首先,我们要理解什么是QTL。数量性状基因座(QuantitativeTrait Loci, QTL)是指染色体上一些能特定调控mRNA(eQTL)、甲基化水平(mQTL)的SNP位点,其mRNA、甲基化的表达水平量与数量性状成比例关系。
有很多文章都用到了QTL的分析,比如小编上一期讲解的“Modulation of long noncoding RNAs by risk SNPs underlying geneticpredispositions to prostate cancer”,用到了eQTL的分析;以及“Associationand cis-mQTL Analysis of Variants in CHRNA3-A5, CHRNA7, CHRNB2, and CHRNB4 inRelation to Nicotine Dependence in a Chinese Han Population”这篇文章中结合了表达和甲基化数据,做了meQTL的分析。他们做这些分析的目的,是结合DNA、RNA、甲基化的数据,探索一些和疾病显著关联的intron上的SNP的潜在生物学机制。比如在“Association andcis-mQTL Analysis of Variants in CHRNA3-A5, CHRNA7, CHRNB2, and CHRNB4 in Relationto Nicotine Dependence in a Chinese Han Population”文章中,作者发现rs3743075和尼古丁成瘾显著相关,并且,这个位点不仅是cis-eQTL,还可以调节临近位点的甲基化水平。那么,这个位点可能通过增加了附近的甲基化水平,抑制了基因的表达,从而导致了疾病的发生。
library(MatrixEQTL)
# Genotype filename
SNP_file_name =paste("/data/liuqiang/mQTL/data_set/wgs/mqtl/HHJ/rs1176744.txt",sep="\t");
snps_location_file_name= paste("/data/liuqiang/mQTL/data_set/wgs/mqtl/HHJ/rs1176744loc.txt",sep="\t");
# Gene expressionfile name
expression_file_name= paste("/data/liuqiang/mQTL/data_set/wgs/mqtl/HHJ/GE_chr11.txt",sep="\t");
gene_location_file_name=paste("/data/liuqiang/mQTL/data_set/wgs/mqtl/HHJ/geneloc_chr11.txt",sep="\t");
# Covariates filename
covariates_file_name= paste("/data/liuqiang/mQTL/data_set/wgs/mqtl/HHJ/Covariates.txt",sep="\t");
此处一共输入5个文件,分别是SNP文件(注意转换为dosage格式)、SNP位置信息、表达\甲基化文件、表达\甲基化位置信息、协变量文件,需要注意的是,所有矩阵的样本顺序要保持一致。
# Output file name
output_file_name_cis= tempfile();
output_file_name_tra= tempfile();
# Onlyassociations significant at this level will be saved
pvOutputThreshold_cis= 5e-2;
pvOutputThreshold_tra= 1e-20;
此处是设定大家对cis-eQTL和trans-eQTL阈值的设定;
# Error covariancematrix
# Set to numeric()for identity.
errorCovariance =numeric();
# errorCovariance= read.table("Sample_Data/errorCovariance.txt");
# Distance forlocal gene-SNP pairs
cisDist = 2.5e5;
这里是设定cis-eQTL的作用范围,即以targetSNP为中心,上下游250kb范围,查看该范围内哪些基因的表达水平和target SNP显著相关。
## Load genotypedata
snps = SlicedData$new();
snps$fileDelimiter= "\t"; # the TABcharacter
snps$fileOmitCharacters= "NA"; # denote missing values;
snps$fileSkipRows= 1; # one row of column labels
snps$fileSkipColumns= 1; # one column of row labels
snps$fileSliceSize= 1000; # read file in slices of2,000 rows
snps$LoadFile(SNP_file_name);
## Load geneexpression data
gene =SlicedData$new();
gene$fileDelimiter= "\t"; # the TABcharacter
gene$fileOmitCharacters= "NA"; # denote missing values;
gene$fileSkipRows= 1; # one row of column labels
gene$fileSkipColumns= 1; # one column of row labels
gene$fileSliceSize= 2000; # read file in slices of2,000 rows
gene$LoadFile(expression_file_name);
## Load covariates
cvrt =SlicedData$new();
cvrt$fileDelimiter= "\t"; # the TABcharacter
cvrt$fileOmitCharacters= "NA"; # denote missing values;
cvrt$fileSkipRows= 1; # one row of column labels
cvrt$fileSkipColumns= 1; # one column of row labels
if(length(covariates_file_name)>0){
cvrt$LoadFile(covariates_file_name);
}
## Run theanalysis
snpspos =read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE);
genepos =read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE);
me = Matrix_eQTL_main(
snps = snps,
gene = gene,
cvrt = cvrt,
output_file_name = output_file_name_tra,
pvOutputThreshold = pvOutputThreshold_tra,
useModel = useModel,
errorCovariance = errorCovariance,
verbose = TRUE,
output_file_name.cis = output_file_name_cis,
pvOutputThreshold.cis =pvOutputThreshold_cis,
snpspos = snpspos,
genepos = genepos,
cisDist = cisDist,
pvalue.hist = "qqplot",
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE);
unlink(output_file_name_tra);
unlink(output_file_name_cis);
## Results:
cat('Analysis donein: ', me$time.in.sec, ' seconds', '\n');
cat('Detectedlocal eQTLs:', '\n');
show(me$cis$eqtls)
cat('Detecteddistant eQTLs:', '\n');
show(me$trans$eqtls)
## Plot the Q-Qplot of local and distant p-values
plot(me)
write.csv(me$cis$eqtls,file='mQTL_250k.csv')
联系客服