深度学习在基因表达谱分析上取得重要进展
近日,一项刊登在国际知名杂志Bioinformatics上的研究论文中,来自加州大学尔湾分校和博德研究所的研究人员通过深度学习算法进行大规模基因表达预测,并在预测精度上获得了显著提升。


近日,一项刊登在国际知名杂志Bioinformatics上的研究论文中,来自加州大学尔湾分校和博德研究所的研究人员通过深度学习算法进行大规模基因表达预测,并在预测精度上获得了显著提升。

全基因组表达谱分析被广泛应用于描述细胞在不同生理病理条件下的活动状态,例如不同的癌组织细胞在各种给药条件下会产生截然不同的生理反应和表达谱。然而由于其相对昂贵的成本,目前只有少数资金充足的实验室能够进行大规模全基因组表达谱分析。

虽然人体全基因组含有约22000个基因,但是大量数据表明绝大部分基因的表达谱之间存在高度关联。基于此假设,博德研究所的研究人员开发出L1000芯片技术,能以十分低廉的价格(~5$/样本)测量约1000个“标杆”基因的表达谱。在此基础之上,研究人员便可以结合已有的全基因组表达谱数据,计算预测剩余的约21000个“目标”基因的表达谱。目前研究人员采用基于线性回归的计算模型进行预测,而大量实验表明基因表达谱之间存在广泛的非线性关联。因此目前的计算模型在预测精度上还受到一定限制。

来自加州大学尔湾分校的研究人员通过大规模多任务深度学习网络进行“目标”基因表达谱预测,在原有线性回归模型的基础上将预测精度提高了15.33%。进一步分析表明,在约21000个“目标”基因中,深度学习算法在99.97%的基因上获得了更加准确的预测精度。通过查看深度学习网络各层之间的权值,研究人员发现深度学习网络自动捕获了全基因表达谱之间的非线性关联,从而部分解释了深度学习网络相较于线性回归模型的优势。研究人员在文章的最后开放了GitHub源代码,并提供了对NIH LINCS计划约130万个样本表达谱预测结果的下载。


图.文中深度学习网络的计算框架

文章共同一作Yi Li表示,在过去几年中深度学习方法在传统AI问题上已经取得了长足进展,但是在计算生物和生物信息学上的应用,从2015年开始才在DNA序列分析上崭露头角。随着各种大规模生物数据集的公布(例如NIH LINCS计划)以及各种开源工具的普及(例如谷歌的深度学习开源框架TensorFlow),相信深度学习未来将在更多其他类型的生物数据分析上取得进展。

所有文章仅代表作者观点,不代表本站立场。如若转载请联系原作者。
查看更多
  • Gene expression inference with deep learning

    Motivation: Large-scale gene expression profiling has been widely used to characterize cellular states in response to various disease conditions, genetic perturbations, etc. Although the cost of whole-genome expression profiles has been dropping steadily, generating a compendium of expression profiling over thousands of samples is still very expensive. Recognizing that gene expressions are often highly correlated, researchers from the NIH LINCS program have developed a cost-effective strategy of profiling only ~1,000 carefully selected landmark genes and relying on computational methods to infer the expression of remaining target genes. However, the computational approach adopted by the LINCS program is currently based on linear regression, limiting its accuracy since it does not capture complex nonlinear relationship between expression of genes. Results: We present a deep learning method (abbreviated as D-GEX) to infer the expression of target genes from the expression of landmark genes. We used the microarray-based GEO dataset, consisting of 111K expression profiles, to train our model and compare its performance to those from other methods. In terms of mean absolute error averaged across all genes, deep learning significantly outperforms linear regression with 15.33% relative improvement. A gene-wise comparative analysis shows that deep learning achieves lower error than linear regression in 99.97% of the target genes. We also tested the performance of our learned model on an independent RNA-Seq-based GTEx dataset, which consists of 2,921 expression profiles. Deep learning still outperforms linear regression with 6.57% relative improvement, and achieves lower error in 81.31% of the target genes.

    展开 收起
发表评论 我在frontend\modules\comment\widgets\views\文件夹下面 test