This study is to explain the effect of the transcription factor SP1 in colon cells. To elucidate this effect a colon cell line was used and a silencing of the transcription factor SP1 was obtained using RNAi techniques in vitro. A gene expression profile of the cells with SP1 silencing and without silencing was done after 72hrs in culture.
The expression profiles were quantified using Affymetrix GeneChip HGU133 PLUS 2. The files containing the data are as follow:
After estimating gene expression, visualise the data and describe the findings. Identify which genes are changing between conditions and define any potential pathway that the silencing of SP1 might have altered.
SP1 (specificity protein 1) is one of the first transcription factors to be isolated and characterised in mammalian cells, originally found to bind to the simian virus 40 promoter through three tandem Cys2His2 zinc-finger motifs (Dynan and Tjian 1983). This subsequently lead to the discovery of other highly similar transcription factors, such as the Krüppel (a Drosophila gap gene) (Kadonaga et al. 1987). This formed the Sp1-like/Krüppel-like factor (Sp1/KLF) protein family.
Initially SP1 was thought of a general transcription factor of ‘housekeeping’ genes, involved in numerous cell processes (e.g. proliferation, growth, death and metabolism) (Black et al. 2001). It has since been shown not to only maintain basal transcription, but to the transcriptional regulation of a large number of genes (Beishline and Azizkhan-Clifford 2015); estimates suggesting that there are a minimum of 12,000 SP binding sites within the human genome (Oleaga et al. 2012). As well as being involved in essential cellular functions, SP1 contributes to the regulation of a number genes that contribute to the ‘hallmarks of cancer’, by both activating and suppressing tumour suppressers and essential oncogenes (Beishline and Azizkhan-Clifford 2015). SP1 also involved in inflammation, where the use of non-steroidal inflammatory drugs to decrease inflammation has been associated with decreases in SP1 levels (Abdelrahim and Safe 2005). Moreover, SP1 has been implicated in genetic instability, regulating centriole function and chromosomal stability (Atrinidis et al. 2010).
SP1 therefore, has a wide ranging regulatory affect on multiple genes throughout the human genome. Through RNAi silencing techniques and micro-array analysis, this study aims to highlight and define any potential pathways in colon cells where SP1 may be acting.
Low level analysis
High level analysis
Step 3: Analysis and diagnostic plotting of gene expression data using different methods and normalisation techniques. Inlcuding mas5, RMA and puma.
Step 4: Basic use of limma and puma for Differential Expression Analysis
Functional and pathway analysis
options(jupyter.plot_mimetypes ='image/png')
setwd("~/Autumn2016/ProjectB/data_projectB")
getwd()
library(affy)
sp1_filenames <- c("M72-1.CEL", "M72-2.CEL", "S72-1.CEL", "S72-2.CEL")
affybatch.sp1 <- ReadAffy(filenames=sp1_filenames)
phenoData(affybatch.sp1)
pData(affybatch.sp1) <- data.frame(
"sample"=c("control","control","silenced","silenced"), row.names=rownames(pData(affybatch.sp1)))
pData(affybatch.sp1)
With single point statistics:
Building the raw values into an AffyBatch to create an expression set (eset) with both Mas5 and RMA techniques.
eset_rma<-affy::rma(affybatch.sp1)
show(eset_rma)
eset_mas5<-mas5(affybatch.sp1)
show(eset_mas5)
Extracting the expression sets and then checking that the two different methods of normalisation (Mas5 and RMA) have returned different values
e_rma<-exprs(eset_rma)
e_mas5<-exprs(eset_mas5)
identical(eset_mas5,eset_rma)
identical(e_mas5,e_rma)
ls()
First step of visualising the processed data through density estimates
density(e_rma)
density(e_mas5)
boxplot(e_mas5, ylab="Gene Expression levels", xlab="Sample Names")
Due to the large value of outliers in the Mas5 plot, the Mas5 processed data needs to be normalised. RMA has this normalisation step built in, so does not require it.
e_mas5<-log2(e_mas5)
density(e_mas5)
Visualisation and comparison between both single point statistical methods, through a boxplot and histogram
library(limma)
par(mfrow=c(2,2))
boxplot(e_rma, main="RMA analysis", ylab="Gene Expression levels", xlab="Sample Names")
boxplot(e_mas5, main="Mas5 analysis", ylab="Gene Expression levels", xlab="Sample Names")
plotMA(e_rma, main="RMA analysis")
abline(h=0, col='red')
plotMA(e_mas5, main="Mas5 analysis")
abline(h=0, col='red')
A comparison between the two single point statistical methods indicates that:
Boxplots - Mas5 boxplot shows that medians are aligned and the data looks normalised, there are negative outliers though. Whereas the RMA box plot shows a similar trend, with medians being aligned and normalised data, yet there are no negative outliers.
MA-plots - An underlying assumption of gene expression analysis is that most of the genes would not see any change in their expression. This expression is shown on the Y-axis, and since Log(1) is 0, you would expect that the majority of points are located at or near to 0. A visual analysis of each of the MA-plots shows that this is more the case with RMA analysis, where more of the genes do not show any change in their expression, in comparison with the Mas5 MA-plot. The RMA MA-plot also shows fewer, but some outliers, indicating that some genes are differentially expressed so that the data has not been normalised too much. Therefore is still sensitive to gene expression.
Between these two models, RMA is the better normalisation model
With a probabilistic approach:
library(puma)
eset_puma<-mmgmos(affybatch.sp1)
Visualisation between RMA and PUMA to determine which method of normalisation to take further to differential analysis
save(eset_puma, file="eset_puma.rda")
e_puma<-exprs(eset_puma)
Norm_eset_puma<-pumaNormalize(eset_puma)
Norm_e_puma<-exprs(Norm_eset_puma)
par(mfrow=c(3,2))
boxplot(Norm_e_puma, main="PUMA Boxplot", xlab="Sample Names", ylab="Expression")
boxplot(e_rma, main="RMA Boxplot", ylab="Gene Expression levels", xlab="Sample Names")
plotMA(Norm_e_puma, main="PUMA MA-Plot")
abline(h=0, col='red')
plotMA(e_rma, main="RMA MA-Plot")
abline(h=0, col='red')
plot(density(Norm_e_puma), main="PUMA Density Plot", xlab="bins", ylab= "Density" )
plot(density(e_rma), main="RMA Density Plot", xlab="bins", ylab="Density")
Between these 2 models, RMA again seems the better model due to the lack of negative outliers and MA-plot distrobution, showing more gene expression nearer to Y=0.
As, in previous analysis between both the PUMA and RMA contained replicates, before a decision can be made on which to take further replicates must be combined.
eset_puma_comb <- pumaComb(eset_puma)
show(eset_puma_comb)
pData(eset_puma_comb)
dim(eset_puma_comb)
pumaDERes <- pumaDE(eset_puma_comb)
write.reslts(pumaDERes, file="pumaDERes")
puma_comb<-exprs(eset_puma_comb)
for (i in 1:2) {
temp<-puma_comb[,i]
temp[temp<0] <- 0
puma_comb[,i] <- temp
}
fc<- puma_comb[,1] - puma_comb[,2]
Combining replicates using RMA
comb_rma<-matrix(0,nrow=dim(e_rma)[1], ncol=2)
colnames(comb_rma)<-c("Control","Silenced")
comb_rma[,1]<- (e_rma[,1]+e_rma[,2])/2
comb_rma[,2]<- (e_rma[,3]+e_rma[,4])/2
fc_rma<- comb_rma[,1] - comb_rma[,2]
par(mfrow=c(1,2))
plotMA(eset_puma_comb, main="PUMA Combined MA-plot")
abline(h=0, col="red")
plotMA(comb_rma, main="RMA Combined MA-plot")
abline(h=0, col="red")
With combined replicates I have chosen the RMA expression set to analyse further with the Limma package.
show(eset_rma)
*Using Limma to identify the top 100 differentially expressed genes. By building a design contrast matrix, fitting a linear model and then calculating p-values with an empirical Bayes test.
library(limma)
group<-factor(c("Control", "Silenced"))
design<-createDesignMatrix(eset_rma)
colnames(design)<-c("Control", 'Silenced')
contrastmatrix<-makeContrasts(Control-Silenced,levels=design)
fit<-lmFit(eset_rma,design)
fit2<-contrasts.fit(fit,contrasts=contrastmatrix)
fit3<-eBayes(fit2)
topDEGenes<-topTable(fit3, coef=1, adjust="BH",n=100)
topDEGenes
Histogram for the contrast of interest: Control vs Silenced to ensure histogram is distributed as expected
hist(fit3$p.value[,1], main="Histogram of p-vales for Control vs Silenced", xlab="p-value")
Annotating the top differentially expressed genes
library(hgu133plus2.db)
library(annotate)
geneProbes<-as.character(rownames(topDEGenes))
annotated_list<-select(hgu133plus2.db, geneProbes,c("SYMBOL","GENENAME"))
annotated_list
pca_rma <- prcomp(t(e_rma))
plot(pca_rma$x, xlab="Component 1", ylab="Component 2",
pch=unclass(as.factor(pData(eset_rma)[,1])),
col=unclass(as.factor(pData(eset_rma)[,1])),
main="Standard PCA")
groups<-paste(eset_rma$sample, sep =" ")
legend(40, 0, groups,pch=unclass(as.factor(pData(eset_rma)[,1]))
, col=unclass(as.factor(pData(eset_rma)[,1])))
PCA is an algorithm which reduces the dimensionality of the data, while retaining variation. It achieves this by identifiying directions which have the maximal variation in the data, known as principle components. Therefore the sample can be represented by a few numbers instead of the 54,675 genes per sample. By plotting these samples it is possible to visually assess similarities and differences between samples and determine if they should be grouped.
library(gplots)
geneProbes<-rownames(topDEGenes)
annotated_list<-select(hgu133plus2.db, geneProbes,"SYMBOL")
indx<-match(annotated_list$PROBEID,rownames(topDEGenes))
tID<-rownames(topDEGenes)[indx]
ind<-1
j<-1
for (i in 1: length(tID)) {
ind[j]<-which(rownames(eset_rma)==tID[i],arr.ind=TRUE)
j<-j+1
}
# ind is the vector with all the indexes
topExpr<-e_rma[ind,]
rownames(topExpr)<-annotated_list$SYMBOL
heatmap.2(topExpr, col=redgreen(75), scale="row",
key=TRUE, symkey=FALSE, density.info="none", trace="none", cexRow=0.4, cexCol=0.8)
Exporting gene names into a tab seperated file for use in PANTHER.
write.table(rownames(topExpr), file="Export DE genes", sep="\t", row.names = F, col.names=F, quote=FALSE)
PANTHER (protein analysis through evolutionary relationships) enables biologists to analyse the results from large genome wide studies, such as sequencing, proteomics or gene expression experiments (Mi et al. 2013). By classifying proteins and their genes in various ways. PANTHER was chosen over DAVID(the database for annotation, visualisation and integrated discovering) as it integrates more updated GO (gene ontology) curation data (Mi et al. 2013). By exporting the top differentially expressed genes into PANTHER both functional and pathway analysis can be performed on the effects of SP1 silencing in vitro on colon cells after 72hrs in culture.
First, analysis of differentially expressed genes with PANTHER GO-Slim Molecular function, which classifies proteins by function or with directly interacting proteins at a biochemical level (see Figure 1). Not surprisingly, due to SP1s role in genetic regulation and instability (Atrinidis et al. 2010), 41.5% of function hits fall into a binding category. SP1 itself falls into this category as a nucleic acid DNA binding protein. Other structurally similar SP binding proteins expression are also altered in SP1 silenced colon cells, such as SP4, potentially indicating compensation between closely regulated genes. Another large proportion of DE expressed genes have catalytic activity. It should be noted that a coding gene can produce a protein that is a binding protein that has catalytic activity, such as rotein N-lysine methyltransferase (METTL20). METTL20 is a mitochondrial lysine methyltransferase implicated in both DNA and RNA methyltransferase (Malecki et al. 2015).
Figure 1. PANTHER GO-Slim Molecular Function Pie Chart.
Using PANTHER biological process analysis of top DE genes reinforces the role of SP1 as a general transcription factor (Black et al. 2001). As affected genes by SP1 silencing ranged from ‘housekeeping’ cellular processes to immune system process.
Figure 2. PANTHER Biological Processes Pie Chart. Showing the range of different biological processes affected by SP1 silencing in colon cells.
Another advantage to biologists of processing results through PANTHER, is that interactions between genes can be identified, using the literature knowledge as a base; allowing pathway analysis. It is unsurprisingly that due to SP1s wide ranging role in the cell that many pathways will be affected and therefore show differential expression; mimicking the trend established by PANTHER biological process analysis.
Figure 3. PANTHER Pathway Pie Chart. Similar to Figure 1, showing the variety of pathways in which SP1 silencing in colon cells effects.
Interesting, the Cholecystokinin (CCK) signalling pathway, is affected by the silencing of SP1 in colon cells (as shown in grey as 9% in Figure 3). CCK is secreted from endocrine cells found in the small intestine, key for the activation of intestinal feedback control of gastrointestinal function (Dockray 2006). Two differentially expressed genes are found in this pathway: SP1 and Plasminogen activator inhibitor 1 (PAI1). SP1 is a downstream promoter of Mitogen-activated protein kinase (MAPK)1/3 (Cramer et al. 2008). PAI1 is also a downstream signalling factor of MAPK1/3 (Norsett et al. 2011). The silencing of SP1 may have caused an change in the expression of PAI1 to compensate and regulate gastrin expression in order to maintain the CCK signalling pathway. Further research is required into the interactions and potential compensation and redundancy of PAI1 and SP1.
Abdelrahim, M. and Safe, S. 2005. Cyclooxygenase-2 inhibitors decrease vascular endothelial growth factor expression in colon cancer cells by enhanced degradation of Sp1 and Sp4 proteins. Mol Pharmacol. 68(2), 317-29.
Astrinidis, A. et al. 2010. The transcription factor SP1 regulates centriole function and chromosomal stability through a functional interaction with the mammalian target of rapamycin/raptor complex. Genes Chromosomes Cancer. 49(3), 282-97.
Beishline, K. and Azizkhan-Clifford, J. 2015. Sp1 and the 'hallmarks of cancer’. The FEBS Journal. 282(2), 224-58.
Black, A. R. et al. 2001. Sp1 and krüppel-like factor family of transcription factors in cell growth regulation and cancer. J Cell Physiol. 188(2), 143-60.
Cramer, T. et al. 2008. Gastrin transactivates the chromogranin A gene through MEK-1/ERK- and PKC-dependent phosphorylation of Sp1 and CREB. Cell Signal. 20(1), 60-72.
Dockray, G. J. 2006. Gastrointestinal hormones: Cholecystokinin, somatostatin and ghrelin. In: Johnson LR, editor. Physiology of the Gastrointestinal Tract. Vol. 1. Elsevier Academic Press. 91–120.
Dynan, W. S. and Tjian, R. 1983. The promoter-specific transcription factor Sp1 binds to upstream sequences in the SV40 early promoter. Cell. 35(1), 79-87.
Kadonaga, J. T. et al. 1987. Isolation of cDNA encoding transcription factor Sp1 and functional analysis of the DNA binding domain. Cell. 51(6), 1079-90.
Małecki, J. et al. 2015. Human METTL20 Is a Mitochondrial Lysine Methyltransferase That Targets the β Subunit of Electron Transfer Flavoprotein (ETFβ) and Modulates Its Activity. The Journal of Biological Chemistry. 290, 423-434.
Mi, H. et al. 2013. Large-scale gene function analysis with the PANTHER classification system. Nature Protocols. 8, 1551-1566.
Nørsett, K. G. et al. 2010. Gastrin stimulates expression of plasminogen activator inhibitor-1 in gastric epithelial cells. Am J Physiol Gastrointest Liver Physiol. 301(3), 446-53.
Oleaga, C. et al. 2012. Identification of novel Sp1 targets involved in proliferation and cancer by functional genomics. Biochem Pharamcol. 84(12), 1581-91.