Project B - Tom Emberson

Contents

  1. Outline
  2. Background
  3. Pipeline
  4. Low level analysis
  5. High level analysis
  6. Functional and pathway analysis
  7. References

Outline

This study is to explain the effect of the transcription factor SP1 in colon cells. To elucidate this effect a colon cell line was used and a silencing of the transcription factor SP1 was obtained using RNAi techniques in vitro. A gene expression profile of the cells with SP1 silencing and without silencing was done after 72hrs in culture.

The expression profiles were quantified using Affymetrix GeneChip HGU133 PLUS 2. The files containing the data are as follow:

  • M72-1.CEL control at 72hrs in culture - sample 1
  • M72-2.CEL control at 72hrs in culture - sample 2
  • S72-1.CEL SP1 silenced at 72hrs in culture - sample 1
  • S72-2.CEL SP1 silenced at 72hrs in culture - sample 2

After estimating gene expression, visualise the data and describe the findings. Identify which genes are changing between conditions and define any potential pathway that the silencing of SP1 might have altered.

Background

SP1 (specificity protein 1) is one of the first transcription factors to be isolated and characterised in mammalian cells, originally found to bind to the simian virus 40 promoter through three tandem Cys2His2 zinc-finger motifs (Dynan and Tjian 1983). This subsequently lead to the discovery of other highly similar transcription factors, such as the Krüppel (a Drosophila gap gene) (Kadonaga et al. 1987). This formed the Sp1-like/Krüppel-like factor (Sp1/KLF) protein family.

Initially SP1 was thought of a general transcription factor of ‘housekeeping’ genes, involved in numerous cell processes (e.g. proliferation, growth, death and metabolism) (Black et al. 2001). It has since been shown not to only maintain basal transcription, but to the transcriptional regulation of a large number of genes (Beishline and Azizkhan-Clifford 2015); estimates suggesting that there are a minimum of 12,000 SP binding sites within the human genome (Oleaga et al. 2012). As well as being involved in essential cellular functions, SP1 contributes to the regulation of a number genes that contribute to the ‘hallmarks of cancer’, by both activating and suppressing tumour suppressers and essential oncogenes (Beishline and Azizkhan-Clifford 2015). SP1 also involved in inflammation, where the use of non-steroidal inflammatory drugs to decrease inflammation has been associated with decreases in SP1 levels (Abdelrahim and Safe 2005). Moreover, SP1 has been implicated in genetic instability, regulating centriole function and chromosomal stability (Atrinidis et al. 2010).

SP1 therefore, has a wide ranging regulatory affect on multiple genes throughout the human genome. Through RNAi silencing techniques and micro-array analysis, this study aims to highlight and define any potential pathways in colon cells where SP1 may be acting.

Pipeline

Low level analysis

  • Step1: Load packages with data from Bioconductor and/or access it from file in the data directory
  • Step 2: Arrange the data in an affybatch using Bioconductor commands. Annotate the PhenoData

High level analysis

  • Step 3: Analysis and diagnostic plotting of gene expression data using different methods and normalisation techniques. Inlcuding mas5, RMA and puma.

  • Step 4: Basic use of limma and puma for Differential Expression Analysis

  • Step 5: Visualisation of the data with PCA
  • Step 6: Hierarchical clustering of DE (Differentially Expressed) genes

Functional and pathway analysis

  • Step 7: Functional analysis of DE targets using PANTHER or DAVID
  • Step 8: Pathway analysis using PANTHER

Low level analysis

  • Step1: Load packages with data from Bioconductor and/or access it from file in the data directory
In [3]:
options(jupyter.plot_mimetypes ='image/png')
In [4]:
setwd("~/Autumn2016/ProjectB/data_projectB")
getwd()
'/projects/6fbec1e0-5b7e-48e8-a754-9e2059beba0e/Autumn2016/ProjectB/data_projectB'
  • Step 2: Arrange the data in an affybatch using Bioconductor commands. Annotate the PhenoData
In [5]:
library(affy)
sp1_filenames <- c("M72-1.CEL", "M72-2.CEL", "S72-1.CEL", "S72-2.CEL")
affybatch.sp1 <- ReadAffy(filenames=sp1_filenames)
Attaching package: ‘affy’

The following objects are masked from ‘package:oligo’:

    intensity, MAplot, mm, mm<-, mmindex, pm, pm<-, pmindex,
    probeNames, rma

The following object is masked from ‘package:oligoClasses’:

    list.celfiles

In [6]:
phenoData(affybatch.sp1)

pData(affybatch.sp1) <- data.frame(
"sample"=c("control","control","silenced","silenced"), row.names=rownames(pData(affybatch.sp1)))
pData(affybatch.sp1)
An object of class 'AnnotatedDataFrame'
  sampleNames: M72-1.CEL M72-2.CEL S72-1.CEL S72-2.CEL
  varLabels: sample
  varMetadata: labelDescription
sample
M72-1.CELcontrol
M72-2.CELcontrol
S72-1.CELsilenced
S72-2.CELsilenced

High Level Analysis

  • Step 3: Analysis and diagnostic plotting of gene expression data using different methods and normalisation techniques. Inlcuding mas5, RMA and puma.

With single point statistics:

  • MAS 5.0 expression measure.
  • RMA (robust multi-array avergae expression measure)

Building the raw values into an AffyBatch to create an expression set (eset) with both Mas5 and RMA techniques.

In [7]:
eset_rma<-affy::rma(affybatch.sp1)
show(eset_rma)
Warning message:
“replacing previous import ‘AnnotationDbi::tail’ by ‘utils::tail’ when loading ‘hgu133plus2cdf’”Warning message:
“replacing previous import ‘AnnotationDbi::head’ by ‘utils::head’ when loading ‘hgu133plus2cdf’”
Background correcting
Normalizing
Calculating Expression
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 4 samples 
  element names: exprs 
protocolData
  sampleNames: M72-1.CEL M72-2.CEL S72-1.CEL S72-2.CEL
  varLabels: ScanDate
  varMetadata: labelDescription
phenoData
  sampleNames: M72-1.CEL M72-2.CEL S72-1.CEL S72-2.CEL
  varLabels: sample
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: hgu133plus2 
In [8]:
eset_mas5<-mas5(affybatch.sp1)
show(eset_mas5)
background correction: mas 
PM/MM correction : mas 
expression values: mas 
background correcting...done.
54675 ids to be processed
|                    |
|####################|
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 4 samples 
  element names: exprs, se.exprs 
protocolData
  sampleNames: M72-1.CEL M72-2.CEL S72-1.CEL S72-2.CEL
  varLabels: ScanDate
  varMetadata: labelDescription
phenoData
  sampleNames: M72-1.CEL M72-2.CEL S72-1.CEL S72-2.CEL
  varLabels: sample
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: hgu133plus2 

Extracting the expression sets and then checking that the two different methods of normalisation (Mas5 and RMA) have returned different values

In [9]:
e_rma<-exprs(eset_rma) 
e_mas5<-exprs(eset_mas5)
In [10]:
identical(eset_mas5,eset_rma)
identical(e_mas5,e_rma)
FALSE
FALSE
In [11]:
ls()
  1. 'affybatch.sp1'
  2. 'e_mas5'
  3. 'e_rma'
  4. 'eset_mas5'
  5. 'eset_rma'
  6. 'sp1_filenames'

First step of visualising the processed data through density estimates

In [12]:
density(e_rma)
density(e_mas5)
Call:
	density.default(x = e_rma)

Data: e_rma (218700 obs.);	Bandwidth 'bw' = 0.1574

       x                y            
 Min.   : 1.953   Min.   :0.0000003  
 1st Qu.: 5.047   1st Qu.:0.0076393  
 Median : 8.140   Median :0.0546753  
 Mean   : 8.140   Mean   :0.0807386  
 3rd Qu.:11.234   3rd Qu.:0.1463547  
 Max.   :14.327   Max.   :0.2239915  
Call:
	density.default(x = e_mas5)

Data: e_mas5 (218700 obs.);	Bandwidth 'bw' = 23.54

       x                  y            
 Min.   :  -70.51   Min.   :0.000e+00  
 1st Qu.:12249.81   1st Qu.:1.840e-07  
 Median :24570.13   Median :5.890e-07  
 Mean   :24570.13   Mean   :3.324e-05  
 3rd Qu.:36890.45   3rd Qu.:1.862e-06  
 Max.   :49210.77   Max.   :6.346e-03  
In [261]:
boxplot(e_mas5, ylab="Gene Expression levels", xlab="Sample Names")

Due to the large value of outliers in the Mas5 plot, the Mas5 processed data needs to be normalised. RMA has this normalisation step built in, so does not require it.

In [14]:
e_mas5<-log2(e_mas5)
density(e_mas5)
Call:
	density.default(x = e_mas5)

Data: e_mas5 (218700 obs.);	Bandwidth 'bw' = 0.2155

       x                y            
 Min.   :-3.760   Min.   :1.100e-07  
 1st Qu.: 1.238   1st Qu.:3.401e-03  
 Median : 6.236   Median :3.115e-02  
 Mean   : 6.236   Mean   :4.997e-02  
 3rd Qu.:11.233   3rd Qu.:9.059e-02  
 Max.   :16.231   Max.   :1.566e-01  

Visualisation and comparison between both single point statistical methods, through a boxplot and histogram

In [273]:
library(limma)
par(mfrow=c(2,2))
boxplot(e_rma, main="RMA analysis", ylab="Gene Expression levels", xlab="Sample Names")
boxplot(e_mas5, main="Mas5 analysis", ylab="Gene Expression levels", xlab="Sample Names")
plotMA(e_rma, main="RMA analysis")
abline(h=0, col='red')
plotMA(e_mas5, main="Mas5 analysis")
abline(h=0, col='red')

A comparison between the two single point statistical methods indicates that:

  • Boxplots - Mas5 boxplot shows that medians are aligned and the data looks normalised, there are negative outliers though. Whereas the RMA box plot shows a similar trend, with medians being aligned and normalised data, yet there are no negative outliers.

  • MA-plots - An underlying assumption of gene expression analysis is that most of the genes would not see any change in their expression. This expression is shown on the Y-axis, and since Log(1) is 0, you would expect that the majority of points are located at or near to 0. A visual analysis of each of the MA-plots shows that this is more the case with RMA analysis, where more of the genes do not show any change in their expression, in comparison with the Mas5 MA-plot. The RMA MA-plot also shows fewer, but some outliers, indicating that some genes are differentially expressed so that the data has not been normalised too much. Therefore is still sensitive to gene expression.

Between these two models, RMA is the better normalisation model

With a probabilistic approach:

  • PUMA(Propagating Uncertainty in Microarray Analysis)
In [18]:
library(puma)
eset_puma<-mmgmos(affybatch.sp1)
Model optimising ..............................................................................................................
Expression values calculating ..............................................................................................................
Done.

Visualisation between RMA and PUMA to determine which method of normalisation to take further to differential analysis

In [272]:
save(eset_puma, file="eset_puma.rda")
e_puma<-exprs(eset_puma) 
Norm_eset_puma<-pumaNormalize(eset_puma)
Norm_e_puma<-exprs(Norm_eset_puma)

par(mfrow=c(3,2))
boxplot(Norm_e_puma, main="PUMA Boxplot", xlab="Sample Names", ylab="Expression")
boxplot(e_rma, main="RMA Boxplot", ylab="Gene Expression levels", xlab="Sample Names")
plotMA(Norm_e_puma, main="PUMA MA-Plot")
abline(h=0, col='red')
plotMA(e_rma, main="RMA MA-Plot")
abline(h=0, col='red')
plot(density(Norm_e_puma), main="PUMA Density Plot", xlab="bins", ylab= "Density" ) 
plot(density(e_rma), main="RMA Density Plot", xlab="bins", ylab="Density")

Between these 2 models, RMA again seems the better model due to the lack of negative outliers and MA-plot distrobution, showing more gene expression nearer to Y=0.

  • Step 4: Basic use of limma and puma for Differential Expression Analysis

As, in previous analysis between both the PUMA and RMA contained replicates, before a decision can be made on which to take further replicates must be combined.

In [51]:
eset_puma_comb <- pumaComb(eset_puma)
Calculating expected completion time
pumaComb expected completion time is 117 minutes 
.......20%.......40%.......60%.......80%......100%
..................................................
In [61]:
show(eset_puma_comb)
pData(eset_puma_comb)
dim(eset_puma_comb)
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 2 samples 
  element names: exprs, se.exprs 
protocolData: none
phenoData
  sampleNames: control silenced
  varLabels: sample
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation:  
sample
controlcontrol
silencedsilenced
Features
54675
Samples
2
In [54]:
pumaDERes <- pumaDE(eset_puma_comb)
write.reslts(pumaDERes, file="pumaDERes")
In [63]:
puma_comb<-exprs(eset_puma_comb)


for (i in 1:2) {
temp<-puma_comb[,i]
temp[temp<0] <- 0
puma_comb[,i] <- temp
}
In [65]:
fc<- puma_comb[,1] - puma_comb[,2]

Combining replicates using RMA

In [116]:
comb_rma<-matrix(0,nrow=dim(e_rma)[1], ncol=2)
colnames(comb_rma)<-c("Control","Silenced")
comb_rma[,1]<- (e_rma[,1]+e_rma[,2])/2
comb_rma[,2]<- (e_rma[,3]+e_rma[,4])/2
fc_rma<- comb_rma[,1] - comb_rma[,2]
In [277]:
par(mfrow=c(1,2))
plotMA(eset_puma_comb, main="PUMA Combined MA-plot")
abline(h=0, col="red")
plotMA(comb_rma, main="RMA Combined MA-plot")
abline(h=0, col="red")

With combined replicates I have chosen the RMA expression set to analyse further with the Limma package.

In [216]:
show(eset_rma)
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 4 samples 
  element names: exprs 
protocolData
  sampleNames: M72-1.CEL M72-2.CEL S72-1.CEL S72-2.CEL
  varLabels: ScanDate
  varMetadata: labelDescription
phenoData
  sampleNames: M72-1.CEL M72-2.CEL S72-1.CEL S72-2.CEL
  varLabels: sample
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: hgu133plus2 

*Using Limma to identify the top 100 differentially expressed genes. By building a design contrast matrix, fitting a linear model and then calculating p-values with an empirical Bayes test.

In [288]:
library(limma)

group<-factor(c("Control", "Silenced"))
design<-createDesignMatrix(eset_rma)
colnames(design)<-c("Control", 'Silenced')

contrastmatrix<-makeContrasts(Control-Silenced,levels=design)

fit<-lmFit(eset_rma,design)
fit2<-contrasts.fit(fit,contrasts=contrastmatrix)
fit3<-eBayes(fit2)

topDEGenes<-topTable(fit3, coef=1, adjust="BH",n=100)
topDEGenes
Warning message:
“Zero sample variances detected, have been offset”
logFCAveExprtP.Valueadj.P.ValB
226258_at-1.942002 7.313580 -18.72097 6.557700e-060.1222632 3.012662
201467_s_at 2.139865 8.843529 18.22541 7.509345e-060.1222632 2.964737
210519_s_at 1.867496 10.215530 17.54132 9.108621e-060.1222632 2.893344
213208_at 1.782435 6.100312 17.46136 9.320966e-060.1222632 2.884574
217967_s_at 2.052130 6.365426 16.73597 1.154354e-050.1222632 2.800612
213338_at-1.626308 8.585961 -16.07445 1.414183e-050.1222632 2.716523
217028_at-1.823895 5.500121 -15.75313 1.565327e-050.1222632 2.672829
209301_at 1.945263 8.759054 14.10205 2.727934e-050.1601248 2.413975
201468_s_at 1.713713 9.582006 14.02486 2.803917e-050.1601248 2.400285
230195_at 1.493897 5.364139 13.67736 3.178804e-050.1601248 2.336674
226723_at-1.527071 8.009045 -13.64088 3.221532e-050.1601248 2.329801
230720_at-1.364208 7.630830 -12.78820 4.445919e-050.1762727 2.157855
219175_s_at-1.197733 7.887768 -12.57038 4.843047e-050.1762727 2.110198
230264_s_at-1.256816 8.616979 -12.29306 5.411508e-050.1762727 2.047129
206027_at-1.173978 6.185573 -12.05160 5.972117e-050.1762727 1.989930
201397_at 1.328398 9.055163 11.94880 6.231621e-050.1762727 1.964905
217225_x_at 1.253934 9.916543 11.84057 6.519447e-050.1762727 1.938113
1558930_at-1.157106 7.726814 -11.31535 8.162581e-050.1762727 1.801273
203031_s_at 1.309068 7.278333 11.31072 8.179107e-050.1762727 1.800016
203299_s_at-1.136211 6.931924 -11.27863 8.294895e-050.1762727 1.791263
202839_s_at 2.339297 7.330519 11.21023 8.548169e-050.1762727 1.772460
219641_at-1.077283 6.228653 -11.12501 8.876736e-050.1762727 1.748735
223095_at 1.094504 7.815418 11.11967 8.897815e-050.1762727 1.747238
226281_at-1.507434 5.946018 -11.08185 9.048870e-050.1762727 1.736592
202305_s_at 1.128694 6.491732 11.01440 9.325869e-050.1762727 1.717444
206204_at 1.316707 5.530978 10.95615 9.573232e-050.1762727 1.700735
225556_at 1.768642 8.761699 10.94025 9.642101e-050.1762727 1.696147
242005_at-1.186479 5.130984 -10.93852 9.649601e-050.1762727 1.695649
224760_at 2.345907 7.660555 10.83227 1.012568e-040.1762727 1.664666
38892_at 2.211509 6.017416 10.82630 1.015324e-040.1762727 1.662909
71 1.5535106 7.830416 8.986053 0.00025272940.1865253 1.0274080
72 0.9142418 9.922481 8.985152 0.00025285240.1865253 1.0270450
73 0.9395649 9.274893 8.947936 0.00025799650.1865253 1.0119973
74 1.2396798 5.111373 8.930520 0.00026044640.1865253 1.0049215
75 1.3067467 5.035421 8.882880 0.00026728990.1865253 0.9854552
76-1.4691290 6.460288 -8.823529 0.00027611610.1865253 0.9609738
77 1.0755932 7.456638 8.808274 0.00027844020.1865253 0.9546398
78 1.5147335 7.573312 8.724744 0.00029158460.1865253 0.9196536
79-0.9466420 7.117004 -8.654411 0.00030322530.1865253 0.8897910
80-1.1060382 6.296930 -8.600670 0.00031249200.1865253 0.8667209
81-0.9357985 7.984127 -8.579091 0.00031630690.1865253 0.8573953
82 1.0386010 8.341667 8.546118 0.00032224340.1865253 0.8430764
83-1.3692382 6.169334 -8.516713 0.00032764910.1865253 0.8302355
84 1.0456659 8.937607 8.513430 0.00032825920.1865253 0.8287978
85 0.9120852 7.087588 8.444925 0.00034130350.1865253 0.7986028
86 2.4939348 7.413611 8.439367 0.00034238840.1865253 0.7961370
87-0.8621251 8.344828 -8.391888 0.00035182510.1865253 0.7749713
88-0.8131880 9.090789 -8.337040 0.00036311230.1865253 0.7502955
89 0.9262487 5.518531 8.315612 0.00036763810.1865253 0.7405892
90-0.8408450 9.638550 -8.298443 0.00037131250.1865253 0.7327851
91 2.3158683 8.490485 8.281028 0.00037508420.1865253 0.7248446
92-0.8425898 9.785481 -8.249997 0.00038191760.1865253 0.7106340
93-1.0590385 8.622003 -8.186791 0.00039629730.1865253 0.6814432
94-0.7779508 8.680400 -8.167198 0.00040088420.1865253 0.6723269
95 0.7990407 7.371832 8.165250 0.00040134360.1865253 0.6714189
96 0.8844053 5.813451 8.152342 0.00040440400.1865253 0.6653933
97-0.8498385 7.334997 -8.139900 0.00040738000.1865253 0.6595721
98-1.1761256 7.346127 -8.125531 0.00041084940.1865253 0.6528326
99-0.7856796 5.827899 -8.120763 0.00041200820.1865253 0.6505927
100 1.9190023 6.878257 8.120762 0.00041200840.1865253 0.6505923

Histogram for the contrast of interest: Control vs Silenced to ensure histogram is distributed as expected

In [293]:
hist(fit3$p.value[,1], main="Histogram of p-vales for Control vs Silenced", xlab="p-value")

Annotating the top differentially expressed genes

In [240]:
library(hgu133plus2.db)
library(annotate)

geneProbes<-as.character(rownames(topDEGenes))
annotated_list<-select(hgu133plus2.db, geneProbes,c("SYMBOL","GENENAME"))
annotated_list
'select()' returned 1:many mapping between keys and columns
PROBEIDSYMBOLGENENAME
226258_at METTL20 methyltransferase like 20
226258_at AMN1 antagonist of mitotic exit network 1 homolog
201467_s_at NQO1 NAD(P)H dehydrogenase, quinone 1
210519_s_at NQO1 NAD(P)H dehydrogenase, quinone 1
213208_at GLTSCR1L GLTSCR1-like
217967_s_at FAM129A family with sequence similarity 129, member A
213338_at TMEM158 transmembrane protein 158 (gene/pseudogene)
217028_at CXCR4 chemokine (C-X-C motif) receptor 4
209301_at CA2 carbonic anhydrase II
201468_s_at NQO1 NAD(P)H dehydrogenase, quinone 1
230195_at LINC01405 long intergenic non-protein coding RNA 1405
226723_at SVBP small vasohibin binding protein
230720_at RNF182 ring finger protein 182
219175_s_at SLC41A3 solute carrier family 41, member 3
230264_s_at AP1S2 adaptor-related protein complex 1, sigma 2 subunit
206027_at S100A3 S100 calcium binding protein A3
201397_at PHGDH phosphoglycerate dehydrogenase
217225_x_at NOMO2 NODAL modulator 2
217225_x_at NOMO1 NODAL modulator 1
217225_x_at NOMO3 NODAL modulator 3
217225_x_at LOC101060373 uncharacterized LOC101060373
1558930_at LINC00460 long intergenic non-protein coding RNA 460
203031_s_at UROS uroporphyrinogen III synthase
203299_s_at AP1S2 adaptor-related protein complex 1, sigma 2 subunit
202839_s_at NDUFB7 NADH dehydrogenase (ubiquinone) 1 beta subcomplex, 7, 18kDa
219641_at DET1 de-etiolated homolog 1 (Arabidopsis)
223095_at MARVELD1 MARVEL domain containing 1
226281_at DNER delta/notch-like EGF repeat containing
202305_s_at FEZ2 fasciculation and elongation protein zeta 2 (zygin II)
206204_at GRB14 growth factor receptor-bound protein 14
224931_at SLC41A3 solute carrier family 41, member 3
229091_s_at CCNJ cyclin J
226181_at TUBE1 tubulin, epsilon 1
219987_at ERVMER34-1 endogenous retrovirus group MER34, member 1
235223_at RRP36 ribosomal RNA processing 36
204971_at CSTA cystatin A (stefin A)
224407_s_at STK26 serine/threonine protein kinase 26
217731_s_at ITM2B integral membrane protein 2B
228355_s_at NDUFAF2 NADH dehydrogenase (ubiquinone) complex I, assembly factor 2
200028_s_at STARD7 StAR-related lipid transfer (START) domain containing 7
234915_s_at DENR density-regulated protein
222738_at WWC2 WW and C2 domain containing 2
222738_at CLDN22 claudin 22
201141_at GPNMB glycoprotein (transmembrane) nmb
235148_at KRTCAP3 keratinocyte associated protein 3
225553_at CNIH1 cornichon family AMPA receptor auxiliary protein 1
238057_at USP45 ubiquitin specific peptidase 45
204224_s_at GCH1 GTP cyclohydrolase 1
222453_at CYBRD1 cytochrome b reductase 1
219347_at NUDT15 nudix (nucleoside diphosphate linked moiety X)-type motif 15
205187_at SMAD5 SMAD family member 5
1566557_at BAIAP2-AS1 BAIAP2 antisense RNA 1 (head to head)
203372_s_at SOCS2 suppressor of cytokine signaling 2
223073_at HIATL1 hippocampus abundant transcript-like 1
206385_s_at ANK3 ankyrin 3, node of Ranvier (ankyrin G)
236265_at SP4 Sp4 transcription factor
219470_x_at CCNJ cyclin J
218499_at STK26 serine/threonine protein kinase 26
219545_at KCTD14 potassium channel tetramerization domain containing 14
219545_at NDUFC2-KCTD14 NDUFC2-KCTD14 readthrough
  • Step 5: Visualisation of the data with principle component analysis (PCA)
In [297]:
pca_rma <- prcomp(t(e_rma))


plot(pca_rma$x, xlab="Component 1", ylab="Component 2", 
     pch=unclass(as.factor(pData(eset_rma)[,1])), 
     col=unclass(as.factor(pData(eset_rma)[,1])),
     main="Standard PCA")

groups<-paste(eset_rma$sample, sep =" ")

legend(40, 0, groups,pch=unclass(as.factor(pData(eset_rma)[,1]))
, col=unclass(as.factor(pData(eset_rma)[,1])))

PCA is an algorithm which reduces the dimensionality of the data, while retaining variation. It achieves this by identifiying directions which have the maximal variation in the data, known as principle components. Therefore the sample can be represented by a few numbers instead of the 54,675 genes per sample. By plotting these samples it is possible to visually assess similarities and differences between samples and determine if they should be grouped.

  • Step 6: Hierarchical clustering of DE (Differentially Expressed) genes
In [258]:
library(gplots)

geneProbes<-rownames(topDEGenes)
annotated_list<-select(hgu133plus2.db, geneProbes,"SYMBOL")
indx<-match(annotated_list$PROBEID,rownames(topDEGenes))

tID<-rownames(topDEGenes)[indx]
ind<-1
j<-1
for (i in 1: length(tID)) {
	ind[j]<-which(rownames(eset_rma)==tID[i],arr.ind=TRUE)
	j<-j+1
}

# ind is the vector with all the indexes
topExpr<-e_rma[ind,]
rownames(topExpr)<-annotated_list$SYMBOL

heatmap.2(topExpr, col=redgreen(75), scale="row",
key=TRUE, symkey=FALSE, density.info="none", trace="none", cexRow=0.4, cexCol=0.8)
'select()' returned 1:many mapping between keys and columns

Functional and Pathway Analysis

  • Step 7: Functional analysis of DE targets using PANTHER or DAVID

Exporting gene names into a tab seperated file for use in PANTHER.

In [257]:
write.table(rownames(topExpr), file="Export DE genes", sep="\t", row.names = F, col.names=F, quote=FALSE)

PANTHER (protein analysis through evolutionary relationships) enables biologists to analyse the results from large genome wide studies, such as sequencing, proteomics or gene expression experiments (Mi et al. 2013). By classifying proteins and their genes in various ways. PANTHER was chosen over DAVID(the database for annotation, visualisation and integrated discovering) as it integrates more updated GO (gene ontology) curation data (Mi et al. 2013). By exporting the top differentially expressed genes into PANTHER both functional and pathway analysis can be performed on the effects of SP1 silencing in vitro on colon cells after 72hrs in culture.

First, analysis of differentially expressed genes with PANTHER GO-Slim Molecular function, which classifies proteins by function or with directly interacting proteins at a biochemical level (see Figure 1). Not surprisingly, due to SP1s role in genetic regulation and instability (Atrinidis et al. 2010), 41.5% of function hits fall into a binding category. SP1 itself falls into this category as a nucleic acid DNA binding protein. Other structurally similar SP binding proteins expression are also altered in SP1 silenced colon cells, such as SP4, potentially indicating compensation between closely regulated genes. Another large proportion of DE expressed genes have catalytic activity. It should be noted that a coding gene can produce a protein that is a binding protein that has catalytic activity, such as rotein N-lysine methyltransferase (METTL20). METTL20 is a mitochondrial lysine methyltransferase implicated in both DNA and RNA methyltransferase (Malecki et al. 2015).

Figure 1. PANTHER GO-Slim Molecular Function Pie Chart.

Using PANTHER biological process analysis of top DE genes reinforces the role of SP1 as a general transcription factor (Black et al. 2001). As affected genes by SP1 silencing ranged from ‘housekeeping’ cellular processes to immune system process.

Figure 2. PANTHER Biological Processes Pie Chart. Showing the range of different biological processes affected by SP1 silencing in colon cells.

  • Step 8: Pathway analysis using PANTHER

Another advantage to biologists of processing results through PANTHER, is that interactions between genes can be identified, using the literature knowledge as a base; allowing pathway analysis. It is unsurprisingly that due to SP1s wide ranging role in the cell that many pathways will be affected and therefore show differential expression; mimicking the trend established by PANTHER biological process analysis.

Figure 3. PANTHER Pathway Pie Chart. Similar to Figure 1, showing the variety of pathways in which SP1 silencing in colon cells effects.

Interesting, the Cholecystokinin (CCK) signalling pathway, is affected by the silencing of SP1 in colon cells (as shown in grey as 9% in Figure 3). CCK is secreted from endocrine cells found in the small intestine, key for the activation of intestinal feedback control of gastrointestinal function (Dockray 2006). Two differentially expressed genes are found in this pathway: SP1 and Plasminogen activator inhibitor 1 (PAI1). SP1 is a downstream promoter of Mitogen-activated protein kinase (MAPK)1/3 (Cramer et al. 2008). PAI1 is also a downstream signalling factor of MAPK1/3 (Norsett et al. 2011). The silencing of SP1 may have caused an change in the expression of PAI1 to compensate and regulate gastrin expression in order to maintain the CCK signalling pathway. Further research is required into the interactions and potential compensation and redundancy of PAI1 and SP1.

References

Abdelrahim, M. and Safe, S. 2005. Cyclooxygenase-2 inhibitors decrease vascular endothelial growth factor expression in colon cancer cells by enhanced degradation of Sp1 and Sp4 proteins. Mol Pharmacol. 68(2), 317-29.

Astrinidis, A. et al. 2010. The transcription factor SP1 regulates centriole function and chromosomal stability through a functional interaction with the mammalian target of rapamycin/raptor complex. Genes Chromosomes Cancer. 49(3), 282-97.

Beishline, K. and Azizkhan-Clifford, J. 2015. Sp1 and the 'hallmarks of cancer’. The FEBS Journal. 282(2), 224-58.

Black, A. R. et al. 2001. Sp1 and krüppel-like factor family of transcription factors in cell growth regulation and cancer. J Cell Physiol. 188(2), 143-60.

Cramer, T. et al. 2008. Gastrin transactivates the chromogranin A gene through MEK-1/ERK- and PKC-dependent phosphorylation of Sp1 and CREB. Cell Signal. 20(1), 60-72.

Dockray, G. J. 2006. Gastrointestinal hormones: Cholecystokinin, somatostatin and ghrelin. In: Johnson LR, editor. Physiology of the Gastrointestinal Tract. Vol. 1. Elsevier Academic Press. 91–120.

Dynan, W. S. and Tjian, R. 1983. The promoter-specific transcription factor Sp1 binds to upstream sequences in the SV40 early promoter. Cell. 35(1), 79-87.

Kadonaga, J. T. et al. 1987. Isolation of cDNA encoding transcription factor Sp1 and functional analysis of the DNA binding domain. Cell. 51(6), 1079-90.

Małecki, J. et al. 2015. Human METTL20 Is a Mitochondrial Lysine Methyltransferase That Targets the β Subunit of Electron Transfer Flavoprotein (ETFβ) and Modulates Its Activity. The Journal of Biological Chemistry. 290, 423-434.

Mi, H. et al. 2013. Large-scale gene function analysis with the PANTHER classification system. Nature Protocols. 8, 1551-1566.

Nørsett, K. G. et al. 2010. Gastrin stimulates expression of plasminogen activator inhibitor-1 in gastric epithelial cells. Am J Physiol Gastrointest Liver Physiol. 301(3), 446-53.

Oleaga, C. et al. 2012. Identification of novel Sp1 targets involved in proliferation and cancer by functional genomics. Biochem Pharamcol. 84(12), 1581-91.

In [ ]: