bioconductor系列教程之一：分析基因芯片（下）数据分析

2011-10-31 22:09 · genesquared

教程数据下载 SAM算法 CyberT算法 Rank products(RP)算法

bioconductor系列教程之一分析基因芯片下（数据分析）

本节目录：

背景

教程数据下载

SAM算法

CyberT算法

Rank products(RP)算法

总结

提取差异表达的基因

从基因芯片当中提取生物学的信息需要合理的统计学方法。人们已经为优化传统统计学方法在基因芯片方面的应用做出了多年的努力。但是直到现在，最主要的努力依然还是依据实验设计的差别，用统计学方法提取出差异表达的基因，然后再转回使用实验的方法去验证这个结果。

在提取差异表达的基因时，人们总是会有这两种考虑，一是不可漏过一个，二是不能错杀过多（在英语里称为false discovery rate(FDR)错误发现率）。常见的手段是使用多种统计学方法来分析同样一个结果，尽可能多的得到差异表达的基因，而排除那些假的信号。然而学习和使用多种统计分析手段并不一定对于每一个生物学工作者都是非常容易的，这需要付出时间和努力。在这里，我们尽量多介绍几种常用的统计分析手段，并给出实践中人们常常使用的组合，来帮助你更好的分析自己的数据。

现在常用的分析手段主要有：significance analysis of microarrays(SAM)，CyberT和Rank products(RP)三种手段。其中CyberT是bioconductor当中最为常用的分析手段，因为它的算法完整地被limma库实现。但有研究指出，使用SAM和RP算法相结合可能是最佳的方案。其实任何一种算法都是有局限性的，我们需要从根本上对算法有所了解，然后才能有针对性地选择合适的算法。

SAM：Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001; 98:5116-21

CyberT: Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001; 17:509-19

RP: Breitling R, Armengaud P, Amtmann A, et al. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS lett 2004;573:83-92

教程数据下载

本教程所用的示例数据为estrogen，

下载地址：

https://www.bioconductor.org/help/course-materials/2005/BioC2005/labs/lab01/Data/estrogen.zip。

estrogen是一个2X2因素的基因芯片实验数据。细胞为MCF7癌细胞，芯片为HGU95av2。实验因子为estrogen存在与否，及其暴露时长（10或者48小时）。这个研究的目的就是为了调查哪些基因对estrogen做出应答，哪些是早期应答，哪些是晚期应答。期待早期应答与晚期应答的基因分布在一个代谢途径上，只是所处的位置不同。

解压estrogen压缩包至适当路径下，本例为4R/estrogen目录下。转换当前的工作路径至estrogen数据目录：

> workingdir <- getwd()

> workingdir <- paste(workingdir,"/4R/estrogen",sep="")

> setwd(workingdir)

##读入数据：

> library(affy)

Loading required package: Biobase

Welcome to Bioconductor

Vignettes contain introductory material. To view, type

'openVignette()'. To cite Bioconductor, see

'citation("Biobase")' and for packages 'citation(pkgname)'.

> pd <- read.AnnotatedDataFrame("estrogen.txt",header=TRUE,sep="",row.names="filename")

> pData(pd)

estrogen time.h

low10-1.cel absent 10

low10-2.cel absent 10

high10-1.cel present 10

high10-2.cel present 10

low48-1.cel absent 48

low48-2.cel absent 48

high48-1.cel present 48

high48-2.cel present 48

> Data<-ReadAffy(filenames=rownames(pData(pd)), verbose=TRUE)

1 reading low10-1.cel ...instantiating an AffyBatch (intensity a 409600x8 matrix)...done.

Reading in : low10-1.cel

Reading in : low10-2.cel

Reading in : high10-1.cel

Reading in : high10-2.cel

Reading in : low48-1.cel

Reading in : low48-2.cel

Reading in : high48-1.cel

Reading in : high48-2.cel

SAM算法

SAM算法过程基于数据的噪音分析。一般的，信噪比会随着信号的降低而减小。然而当人们针对每一个基因进行分析时，发现即使他们的信号在同一水平，每一个基因的表达噪音都不一样。所以无法笼统地给出一个具体地线性方程来分析全部基因，必须一个一个基因单独分析。假设有对照实验I和U两组，I和U实验各有平行实验w组，对于某个指定的基因i，定义相对差异值（relative difference(d(i))）为基因i在实验组I和U当中的平均值之差除以I和U的总体标准误差均值。因为当标准误差均值过低时，会产生一个较高的d(i)值，所以总体的标准误差均值增加了一个较正项。具体公式如下：

其中上划线表示平均值，?m和?n表示实验组I和U的基因i表达的样本标准差，n1和n2表示实验组I和U当中基因i参与统计的探针的个数。S0为较正项，它可以由计算机自动给出，也可以由人手动指定。S0的取值标准是：将所有的d(i)值排序后100等分，每一等分先求得中位数绝对离差，然后计算这些中位数绝对离差值的变异系数。取s0分别等于100等分当中的s(i)的最小值，依次计算。最终s0的值就是在这些计算中，使变异系数最小的s0。

我们来比较一下它和t-test的异同：

比较之下，我们就会发现两者十分的相同。左边是两组具有不同探针数和不同标准差的非平行实验t-test计算公式，右边就是具有不同探针数和不同标准差的非平行实验的SAM公式，我们可以看到，它们的最终差别，只在那个s0附加项上。由此我们可以知道，实际上SAM就是t-test的一个具体应用。

在计算得到所有的d(i)值之后，重排样品计算d(i)期待值dE(i)。然后用期待值dE(i)对实测值d(i)做图。如果期待值和实测值相同，那么作图点就应该落在斜率为1的直线上，如果不同，作图点就落在其外。差别越大，距离该直线就越远。给定一个D值（默认的话，可以设置为1.2），当期待值和实测值差别比D大时，就认为有显著差异。这就是SAM算法。

下面我们就来使用R来实践一下SAM算法。

##安装SAM库

> install.packages("samr")

--- Please select a CRAN mirror for use in this session ---

also installing the dependency ‘impute’

trying URL 'https://software.rc.fas.harvard.edu/mirrors/R/bin/windows/contrib/2.11/impute_1.24.0.zip'

Content type 'application/zip' length 1205991 bytes (1.2 Mb)

opened URL

downloaded 1.2 Mb

trying URL 'https://software.rc.fas.harvard.edu/mirrors/R/bin/windows/contrib/2.11/samr_1.28.zip'

Content type 'application/zip' length 80418 bytes (78 Kb)

opened URL

downloaded 78 Kb

package 'impute' successfully unpacked and MD5 sums checked

package 'samr' successfully unpacked and MD5 sums checked

The downloaded packages are in

C:Documents and Settingsjianhong ouLocal SettingsTempRtmpHrqW1sdownloaded_packages

> library(samr)

Loading required package: impute

##表达定量

> eset <- rma(Data)

Background correcting

Normalizing

Calculating Expression

> eset.e <- exprs(eset)

SAM分析。我们首先来了解一下samr函数。

> ?samr

我们注意到其参数resp.type有以下几种类型：”quantitative” 定量反应，其中data参数当中的y必须是数值型数组; “Two class unpaired” 比较两种不同条件下的结果，其中data参数当中的y必须是1或者2数组，即要么值属于其中一种条件，要么属于另外一种条件; “Survival” for censored survival outcome; “Multiclass” : 两组以上实验条件，其中data参数当中的y是自然数数组，指定实验条件编号; “One class” 单组实验，y是1的数组; “Two class paired” 比较两种不同条件下的结果，并且实验是成对进行的，其中data参数当中的y是-1，1，-2，2……-k,k这样成对出现的数组; “Two class unpaired timecourse”, “One class time course”, “Two class.paired timecourse” or “Pattern discovery”与时间相关的比较，y值必须是kTimet这种形式，其中k是组编号，t是时间，Time就是字符Time，起始的时间点要为kTimetStart格式，结束的点要为kTimetEnd格式。

Table 1 Resp.type取值范围

Resp.type 取值

Quantitative 实数，例：27.4 或者-45.34

Two class (unpaired) 整数1，2

Multiclass 整数1，2，3，……

Paired 整数-1，1，-2，2等等成对出现，通常认为负号代表未处理的对照组，正号代表实验组；-1总是与1配对，-2与2配对直至-k与k

Survival data (Time, status)这种成对的数字，比如(50,1),(120,0)。第一个数字代表时间，第二个数字代表状态（1=死亡，0=存活）

One class 整数1

在示例的数据中，实验设计为estrogen存在与不存在，及作用10及48小时的比较。因为时间点过少，不足以形成timecourse，所以我们用多组来比较吧。

> gp=rep(1:4,each=2)

> gnames <- geneNames(Data)

> eset.list <- list(x=eset.e,y=gp,geneid=rownames(eset),

+ genenames=gnames,logged2=TRUE)

> samr.results <- samr(eset.list,resp.type="Multiclass")

##这里的delta的值会决定多少个基因会被筛选出来，具体可以查看samr.plot

> delta=3

> delta.table <- samr.compute.delta.table(samr.results)

> siggenes.table <- samr.compute.siggenes.table(samr.results,delta,eset.list,delta.table)

> siggenes.table$genes.up[,c(2,3,7:10)]

Gene ID Gene Name contrast-1 contrast-2 contrast-3 contrast-4

[1,] "AFFX-CreX-5_at" "g12575" "9.21014811176126" "10.3515038361398" "-9.41932435187662" "-10.1423275960245"

[2,] "AFFX-CreX-3_at" "g12573" "8.98264861865223" "9.66903016961903" "-8.99653473206513" "-9.65514405620612"

[3,] "910_at" "g12472" "-6.85750787389543" "12.9988908367082" "-15.3626330236849" "9.22125006087216"

[4,] "39642_at" "g9735" "-8.0212852366582" "15.3183014288828" "-10.0580016782286" "2.76098548600395"

[5,] "39755_at" "g9848" "-12.3045709423508" "16.3712352189681" "-17.1047959884349" "13.0381317118176"

[6,] "37014_at" "g7082" "22.2003218120704" "7.33249374260306" "-10.0588847725655" "-19.473930782108"

[7,] "AFFX-BioDn-5_at" "g12571" "5.49431914564278" "6.67431873647949" "-5.669534469218" "-6.49910341290427"

[8,] "AFFX-BioB-M_at" "g12563" "5.34962010057122" "6.15040149122691" "-5.53386613126643" "-5.9661554605317"

[9,] "31798_at" "g1814" "-6.11222505998901" "4.14215938115351" "-5.60173170670374" "7.57179738553923"

[10,] "1854_at" "g953" "-4.12322667536586" "5.5815623907716" "-7.19401279780167" "5.73567708239594"

[11,] "AFFX-BioDn-3_at" "g12569" "5.09133552869473" "5.80231767084282" "-5.10531514892988" "-5.78833805060768"

[12,] "41400_at" "g11509" "-5.37996191445008" "6.23141345730527" "-5.89607519248564" "5.04462364963045"

[13,] "AFFX-BioC-3_at" "g12565" "4.071465172315" "5.18167810413931" "-3.74796424599689" "-5.5051790304574"

[14,] "1824_s_at" "g922" "-6.24191003295855" "6.98898512063714" "-6.04477835240415" "5.29770326472558"

[15,] "40117_at" "g10214" "-3.84825871416767" "6.30592786935558" "-5.75147510286634" "3.29380594767844"

[16,] "40533_at" "g10634" "-3.30762693194487" "7.66957812589722" "-9.34676358571819" "4.98481239176583"

[17,] "1197_at" "g217" "4.56929720724491" "4.99251188819152" "-1.79817495925681" "-7.76363413617962"

[18,] "38116_at" "g8195" "-1.24869164816467" "4.51935936246191" "-6.31110260282343" "3.04043488852619"

[19,] "38065_at" "g8143" "-0.0249573892821473" "4.05411252803589" "-6.86175205064385" "2.83259691189011"

[20,] "1034_at" "g39" "-2.32248795602941" "-7.45168570619329" "9.58099156277354" "0.193182099449159"

[21,] "981_at" "g12542" "-3.1434855784531" "6.92649861557433" "-5.86161576347402" "2.07860272635277"

[22,] "39560_at" "g9652" "8.27801365342635" "3.45352424149948" "-2.01363130987094" "-9.71790658505493"

[23,] "33730_at" "g3766" "-0.731056701987878" "-3.69477491990521" "8.33601900175345" "-3.91018737986039"

[24,] "41386_i_at" "g11494" "-2.74972045677182" "-6.88162517707788" "10.6755978977481" "-1.04425226389835"

[25,] "1592_at" "g642" "-2.79183864082533" "2.46757982225474" "-5.54788015270328" "5.87213897127386"

[26,] "AFFX-BioC-5_at" "g12567" "3.41736388133051" "4.47256519086439" "-3.8050664323192" "-4.0848626398757"

[27,] "41439_at" "g11548" "5.68328448618689" "-1.4337563858764" "2.55856303044392" "-6.80809113075442"

[28,] "2004_at" "g1115" "4.59531400647627" "3.29877207114915" "-3.30605917048059" "-4.58802690714483"

[29,] "39581_at" "g9673" "4.9535895829366" "2.25204612331752" "-3.22206862093494" "-3.98356708531919"

[30,] "1126_s_at" "g140" "-4.31270029891489" "7.93500886602537" "-3.93964450871243" "0.317335941601947"

[31,] "37043_at" "g7111" "3.93576313476139" "-4.22003323046959" "5.31060910697541" "-5.0263390112672"

[32,] "35995_at" "g6053" "-0.615910687710384" "4.09250710926335" "-5.28567284546007" "1.8090764239071"

[33,] "33252_at" "g3283" "-4.44224917563441" "5.31667344099975" "-4.30551601881807" "3.43109175345275"

[34,] "40071_at" "g10167" "3.91144735313568" "4.19512365735143" "-3.81109296880425" "-4.29547804168287"

[35,] "40898_at" "g11002" "2.22254650391891" "-5.95841411873872" "9.10182786868782" "-5.36596025386801"

##保存结果：

> write.table(siggenes.table$genes.up,file="data",sep="t")

SAM算法因为其提供有 Microsoft Excel插件，所以使用非常广泛，相反，其在R当中的应用却显得相对较少。有研究表明，SAM对FDR的控制并不是很好，这是它最主要的不足。

CyberT算法

CyberT模型将标准及信号强度的关系使用线性模型进一步强化，并推出了Bioconductor下完整的limma库，该库说明文件清晰，教程完善，故而使用率极高。CyberT模型使用的是基于贝叶斯估计调整后的t-test来确定差异表达的基因，准确率较高。并且有研究指出，它的计算结果要强于SAM算法。虽然它们思路很类似，但是CyberT和limma其实还是有一点点差别的。在t-test公式上，它们都是（区别于SAM的），其中代表倍数平均值，代表标准差的平均值，但是的计算方式不完全相同。不类CyberT，limma在计算标准差时考虑的是全部的基因，而非排序后相近的基因，所以，limma库不适合使用MAS5算法生成的数据，这就使得我们只能倾向于rma算法。但是也正因为如此，所以limma不象CyberT那样局限于两组数据，limma可以依照实验构建对比矩阵，计算多级复杂设计的实验。

使用limma来分析差异表达的基因，主要分几步走：

1. 读取数据

2. 预处理数据

3. 构建实验设计矩阵

4. 使用线性模型估计差异表达的倍数

5. 使用贝叶斯平滑标准差

6. 试用不同的参数来输出差异表达基因结果。

下面我们就开始示例：

##首先读入limma库

> library(limma)

> library(affy)

Loading required package: Biobase

Welcome to Bioconductor

Vignettes contain introductory material. To view, type

'openVignette()'. To cite Bioconductor, see

'citation("Biobase")' and for packages 'citation(pkgname)'.

##载入CDF文件库

> library(hgu95av2cdf)

> workingdir<-getwd()

> workingdir<-paste(workingdir,"/4R/estrogen",sep="")

> setwd(workingdir)

##变换一种方法来读取数据文件，readTargets也是非常实用的读取文件的办法。

> targets<-readTargets("estrogen.txt",sep="",row.names="filename")

> targets

filename estrogen time.h

low10-1.cel low10-1.cel absent 10

low10-2.cel low10-2.cel absent 10

high10-1.cel high10-1.cel present 10

high10-2.cel high10-2.cel present 10

low48-1.cel low48-1.cel absent 48

low48-2.cel low48-2.cel absent 48

high48-1.cel high48-1.cel present 48

high48-2.cel high48-2.cel present 48

> ab <- ReadAffy(filenames=targets$filename)

##使用rma算法来预处理数据

> eset <- rma(ab)

Background correcting

Normalizing

Calculating Expression

##构建实验设计矩阵，这一步非常关键。

##有很多办法都可以构建这个矩阵。

##在这组实验数据中，我们关心的是早期和晚期细胞对estrogen的应答，

##所以我们需要构建两组对比。

> treatments <- factor(c(rep(1:4,each=2)),labels=c("e10","E10","e48","E48"))

> treatments

[1] e10 e10 E10 E10 e48 e48 E48 E48

Levels: e10 E10 e48 E48

> contrasts(treatments) <- cbind(Time=c(0,0,1,1),E10=c(0,1,0,0),E48=c(0,0,0,1))

> treatments

[1] e10 e10 E10 E10 e48 e48 E48 E48

attr(,"contrasts")

Time E10 E48

e10 0 0 0

E10 0 1 0

e48 1 0 0

E48 1 0 1

Levels: e10 E10 e48 E48

> design<-model.matrix(~treatments)

> design

(Intercept) treatmentsTime treatmentsE10 treatmentsE48

1 1 0 0 0

2 1 0 0 0

3 1 0 1 0

4 1 0 1 0

5 1 1 0 0

6 1 1 0 0

7 1 1 0 1

8 1 1 0 1

attr(,"assign")

[1] 0 1 1 1

attr(,"contrasts")

attr(,"contrasts")$treatments

Time E10 E48

e10 0 0 0

E10 0 1 0

e48 1 0 0

E48 1 0 1

> colnames(design) <- c("Intercept","Time","E10","E48")

> design

Intercept Time E10 E48

1 1 0 0 0

2 1 0 0 0

3 1 0 1 0

4 1 0 1 0

5 1 1 0 0

6 1 1 0 0

7 1 1 0 1

8 1 1 0 1

attr(,"assign")

[1] 0 1 1 1

attr(,"contrasts")

attr(,"contrasts")$treatments

Time E10 E48

e10 0 0 0

E10 0 1 0

e48 1 0 0

E48 1 0 1

##拟合线性模型

> fit <- lmFit(eset, design)

> dim(fit)

[1] 12625 4

> cont.matrix <- cbind(E10=c(0,0,1,0),E48=c(0,0,0,1))

> cont.matrix

E10 E48

[1,] 0 0

[2,] 0 0

[3,] 1 0

[4,] 0 1

> fit2 <- contrasts.fit(fit,cont.matrix)

> dim(fit2)

[1] 12625 2

##贝叶斯平滑

> fit2 <- eBayes(fit2)

> dim(fit2)

[1] 12625 2

##选择测试方法，global或者nestedF。

> results.global <- decideTests(fit2, method="global")

> results.nestedF<- decideTests(fit2, method="nestedF")

##去除嵌入探针

> i <- grep("AFFX",featureNames(eset))

##获取p值的总结，用于p值背景去除。

> summary(fit2$F.p.value[i])

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.0001391 0.1726000 0.3562000 0.4207000 0.6825000 0.9925000

> results <- classifyTestsF(fit2, p.value=0.0001)

##获取差异表达基因的总结

> summary(results)

E10 E48

-1 40 76

0 12469 12410

1 116 139

> table(E10=results[,1],E48=results[,2])

E48

E10 -1 0 1

-1 29 11 0

0 47 12370 52

1 0 29 87

##绘制文氏图

> vennDiagram(results, include="up")

> x11()

> vennDiagram(results, include="down")

##生成差异表达基因的结果报表

> options(digits=3)

> topTable(fit2,coef="E10",n=20)

ID logFC AveExpr t P.Value adj.P.Val B

9735 39642_at 2.94 7.88 23.7 4.74e-09 3.13e-05 9.97

12472 910_at 3.11 9.66 23.6 4.96e-09 3.13e-05 9.94

1814 31798_at 2.80 12.12 16.4 1.03e-07 3.51e-04 7.98

11509 41400_at 2.38 10.04 16.2 1.11e-07 3.51e-04 7.92

10214 40117_at 2.56 9.68 15.7 1.47e-07 3.58e-04 7.71

953 1854_at 2.51 8.53 15.2 1.95e-07 3.58e-04 7.49

9848 39755_at 1.68 12.13 15.1 2.05e-07 3.58e-04 7.45

922 1824_s_at 1.91 9.24 14.9 2.27e-07 3.58e-04 7.37

140 1126_s_at 1.78 6.88 13.8 4.12e-07 5.78e-04 6.89

580 1536_at 2.66 5.94 13.3 5.80e-07 7.32e-04 6.61

12542 981_at 1.82 7.78 13.1 6.46e-07 7.42e-04 6.52

3283 33252_at 1.74 8.00 12.6 8.86e-07 9.20e-04 6.25

546 1505_at 2.40 8.76 12.5 9.48e-07 9.20e-04 6.19

4405 34363_at -1.75 5.55 -12.2 1.14e-06 1.03e-03 6.03

985 1884_s_at 2.80 9.03 12.1 1.26e-06 1.06e-03 5.95

6194 36134_at 2.49 8.28 11.8 1.50e-06 1.19e-03 5.79

7557 37485_at 1.61 6.67 11.4 1.99e-06 1.48e-03 5.55

1244 239_at 1.57 11.25 10.4 4.07e-06 2.66e-03 4.90

8195 38116_at 2.32 9.51 10.4 4.09e-06 2.66e-03 4.90

10634 40533_at 1.26 8.47 10.4 4.21e-06 2.66e-03 4.87

> topTable(fit2,coef="E48",n=20)

ID logFC AveExpr t P.Value adj.P.Val B

12472 910_at 3.86 9.66 29.2 8.27e-10 1.04e-05 11.61

1814 31798_at 3.60 12.12 21.0 1.28e-08 7.63e-05 9.89

953 1854_at 3.34 8.53 20.2 1.81e-08 7.63e-05 9.64

8195 38116_at 3.76 9.51 16.9 8.12e-08 2.51e-04 8.48

8143 38065_at 2.99 9.10 16.2 1.12e-07 2.51e-04 8.21

9848 39755_at 1.77 12.13 15.8 1.36e-07 2.51e-04 8.05

642 1592_at 2.30 8.31 15.8 1.39e-07 2.51e-04 8.03

11509 41400_at 2.24 10.04 15.3 1.81e-07 2.75e-04 7.81

3766 33730_at -2.04 8.57 -15.1 1.96e-07 2.75e-04 7.74

732 1651_at 2.97 10.50 14.8 2.39e-07 3.02e-04 7.57

8495 38414_at 2.02 9.46 14.6 2.66e-07 3.05e-04 7.48

1049 1943_at 2.19 7.60 14.0 3.72e-07 3.69e-04 7.18

10214 40117_at 2.28 9.68 14.0 3.80e-07 3.69e-04 7.17

10634 40533_at 1.64 8.47 13.5 4.94e-07 4.45e-04 6.93

9735 39642_at 1.61 7.88 13.0 6.71e-07 5.18e-04 6.65

4898 34851_at 1.96 9.96 12.8 7.51e-07 5.18e-04 6.55

922 1824_s_at 1.64 9.24 12.8 7.95e-07 5.18e-04 6.50

6053 35995_at 2.76 8.87 12.7 8.32e-07 5.18e-04 6.46

12455 893_at 1.54 10.95 12.7 8.43e-07 5.18e-04 6.45

10175 40079_at -2.41 8.23 -12.6 8.62e-07 5.18e-04 6.42

Rank products(RP)算法

RankProd使用的是非参数结果排序法来判断两组对照实验之间基因是否上调或者下调。RP就是测量中某基因的几何平均值。具体RP决定差异表达基因的步骤如下：

1. 生成长度为n的k排序的p排列

2. 计算在p排列中n个基因的RP值

3. 计数在排列中有多少基因的RP值是小于等于对照组的RP值的。设定该计数为c。

4. 通过公式计算 RP期待值。

5. 通过公式计算假阳性的可能性，依照pfp值来决定是否为差异表达。其中rank(g)是RP升序排列的n个基因中基因g的排序。

##安装RankProd库

> source("https://www.bioconductor.org/biocLite.R")

BioC_mirror = https://www.bioconductor.org

Change using chooseBioCmirror().

> biocLite("RankProd")

Using R version 2.11.1, biocinstall version 2.6.9.

Installing Bioconductor version 2.6 packages:

[1] "RankProd"

Please wait...

trying URL 'https://www.bioconductor.org/packages/2.6/bioc/bin/windows/contrib/2.11/RankProd_2.20.0.zip'

Content type 'application/zip' length 494131 bytes (482 Kb)

opened URL

downloaded 482 Kb

package 'RankProd' successfully unpacked and MD5 sums checked

The downloaded packages are in

C:Documents and Settingsjianhong ouLocal SettingsTempRtmpEhAHBydownloaded_packages

##载入RankProd库

> library(RankProd)

##载入数据

> workingdir<-getwd()

> workingdir<-paste(workingdir,"/4R/estrogen",sep="")

> setwd(workingdir)

> library(affy)

Loading required package: Biobase

Welcome to Bioconductor

Vignettes contain introductory material. To view, type

'openVignette()'. To cite Bioconductor, see

'citation("Biobase")' and for packages 'citation(pkgname)'.

> pd<-read.AnnotatedDataFrame("estrogen.txt",header=TRUE,sep="",row.names="filename")

> Data<-ReadAffy(filenames=rownames(pData(pd)))

> eset<-exprs(rma(Data))

Background correcting

Normalizing

Calculating Expression

> gnames<-geneNames(Data)

##对数据分组。

> eset.cl<-rep(c(0,1),each=4)

> eset.origin<-rep(rep(c(0,1),each=2),2)

> eset.origin

[1] 0 0 1 1 0 0 1 1

> eset.cl

[1] 0 0 0 0 1 1 1 1

##我们先研究48小时的对比结果

> eset.sub<-eset[,which(eset.origin==1)]

> eset.cl.sub<-eset.cl[which(eset.origin==1)]

> eset.origin.sub<-eset.origin[which(eset.origin==1)]

> RP.48h.out<-RPadvance(eset.sub,eset.cl.sub,eset.origin.sub,num.perm=100,

+ logged=TRUE, na.rm=FALSE, gene.names=gnames, plot=FALSE, rand=123)

The data is from 1 different origins

Rank Product analysis for two-class case

Starting 100 permutations...

Computing pfp...

##生成结果报表

> topGene(RP.48h.out,cutoff=0.05,method="pfp",logged=TRUE, + logbase=2,gene.names=gnames)

Table1: Genes called significant under class1 < class2

Table2: Genes called significant under class1 > class2

$Table1

gene.index RP/Rsum FC:(class1/class2) pfp P.value

220_r_at 1225 5.7485 0.2817 0.0000 0

1553_r_at 599 9.7335 0.2984 0.0050 0

1493_r_at 534 10.0311 0.3019 0.0033 0

472_at 12038 11.7790 0.3236 0.0075 0

37332_r_at 7403 12.4206 0.3197 0.0060 0

31364_i_at 1376 14.2262 0.3252 0.0067 0

1998_i_at 1108 15.8902 0.3456 0.0086 0

725_i_at 12289 17.9067 0.3391 0.0150 0

34938_i_at 4986 19.9081 0.3574 0.0144 0

37157_at 7226 22.7027 0.4136 0.0210 0

33412_at 3445 25.7021 0.4195 0.0255 0

34851_at 4898 27.2078 0.4212 0.0250 0

37001_at 7069 33.4480 0.4302 0.0415 0

40152_r_at 10249 33.9365 0.3789 0.0407 0

39969_at 10064 34.0465 0.4189 0.0387 0

33836_at 3873 36.5657 0.4445 0.0381 0

735_s_at 12299 37.2111 0.3971 0.0371 0

1677_at 760 37.4833 0.4120 0.0350 0

$Table2

gene.index RP/Rsum FC:(class1/class2) pfp P.value

AFFX-CreX-5_at 12575 1.0000 182.1879 0.0000 0e+00

AFFX-CreX-3_at 12573 2.0000 129.2972 0.0000 0e+00

AFFX-BioDn-5_at 12571 3.0000 21.2158 0.0000 0e+00

AFFX-BioC-3_at 12565 4.0000 16.9343 0.0000 0e+00

AFFX-BioB-M_at 12563 5.0000 13.8760 0.0020 0e+00

AFFX-BioB-5_at 12561 6.2357 10.6860 0.0017 0e+00

AFFX-BioB-3_at 12559 7.7069 9.2046 0.0014 0e+00

AFFX-BioDn-3_at 12569 9.3614 7.0893 0.0038 0e+00

1197_at 217 9.9749 5.9866 0.0033 0e+00

AFFX-BioC-5_at 12567 10.8006 5.4021 0.0050 0e+00

32215_i_at 2236 11.5700 4.9596 0.0055 0e+00

2004_at 1115 14.8054 4.1707 0.0100 0e+00

39581_at 9673 16.1006 4.1104 0.0108 0e+00

40071_at 10167 18.1884 3.7435 0.0136 0e+00

38827_at 8912 25.2488 3.4745 0.0213 0e+00

32755_at 2781 26.5948 3.2038 0.0200 0e+00

859_at 12421 27.2426 3.1781 0.0194 0e+00

35977_at 6035 28.3913 3.1493 0.0206 0e+00

1822_at 920 28.7090 3.2237 0.0200 0e+00

31792_at 1808 29.7571 3.0398 0.0210 0e+00

40522_at 10623 31.6223 3.0428 0.0233 0e+00

36274_at 6335 34.5524 3.2918 0.0250 0e+00

39642_at 9735 35.2116 2.9928 0.0252 0e+00

32034_at 2053 36.6339 2.9261 0.0267 1e-04

32216_r_at 2237 42.4453 2.9482 0.0352 1e-04

38368_at 8448 45.9109 2.6964 0.0412 1e-04

35934_at 5992 47.5850 2.9382 0.0430 1e-04

32043_at 2062 48.8154 2.7270 0.0454 1e-04

39966_at 10061 50.8840 2.6788 0.0479 1e-04

40631_at 10733 52.4594 2.6259 0.0497 1e-04

由上面的过程我们知道，对于RP来说，只能应对小于两组的实验分组，如果多组的话，得分别进行比对。

好了，我们已经了解了三种比较重要的提取差异表达基因的手段。当然，这只是一个比较简单的介绍，更多的工作需要自己去阅读说明文件来了解与掌握这些算法的细节。最后的工作就是对这些基因进行注释了。这个工作呢，affycoretools提供了许多工作，我们可以自己通过阅读说明文件去了解细节了。还有两个重要的链接：

GOstat:

https://gostat.wehi.edu.au

GSEA:

https://www.gseapack.org

关键词：