# Computational Enzyme Design: A Practical Exercise

Pablo Carbonell, [SYNBIOCHEM Centre](http://synbiochem.co.uk//), Manchester Institute of Biotechnology, University of Manchester

## 1. Objectives

In this exercise, you will learn about bioinformatics tools and databases that can guide you through an *in silico* protein design. New tools become available continuously, so it is advisable to always check for the latest developments. For instance, take a look [here](http://www.ncbi.nlm.nih.gov/pubmed?term=%22computational%20protein%20design%22%20OR%20%22computational%20enzyme%20design%22) at some recent publications on the subject.

For this practical exercise, our focus will be on a particular type of protein design problem: **improving the catalytic efficiency of an upstream enzyme** that produces endogenous precursors needed for the production of some target compound in *E. coli*. This problem typically appears when we want to compensate for the metabolic drain of a precursor essential for growth, a drain caused by the insertion of the heterologous pathway producing the target compound.

In particular, the **objective** of this practical exercise is for you to learn computational techniques that will allow you:

1. To locate functional regions in a protein structure.
2. To identify hot-spots for a specific protein interaction.
3. To predict the best mutations for improving the desired protein activity.

## 2. Protocol

1. Create a model of the transition state of the desired substrate bound to the enzyme
2. Position the protein side-chain functional groups important for the substrate-enzyme interaction
3. Set up a methodology for remodeling the protein backbone when a new side chain is introduced
4. Implement a search algorithm and an energy (score) function within the desired constraints

(*Adapted from [Murphy et al, PNAS, 2009](http://www.pnas.org/content/106/23/9215.long)*).

**Note**: we will provide below some illustrative results for a guided example.

Recommended software to install: molecular visualization ([PyMOL](http://www.pymol.org/), [CHIMERA](http://www.cgl.ucsf.edu/chimera/), [Jmol](http://jmol.sourceforge.net/), etc.); format interconversion ([Open Babel](http://openbabel.org/)).

**Note**: check available [public Galaxy servers](https://wiki.galaxyproject.org/PublicGalaxyServers) for community-contributed workflows for protein prediction/design.

## 3. Choosing the enzyme

>>**Study 1**. Select a structure template for the enzyme.
>>
>>	1. Go to the RetroPath server and select an enzymatic step producing one of the precursors of the biosynthetic pathway of interest.
>>	2. Build a structure template, which will be used during this study to map the predictions.
>>	3. Is there a structure of the transition state of the substrate bound to the enzyme?
>>

A selected list of enzymes producing precursors for the targets that you have been studying in the metabolic engineering course can be found on [this page](https://cloud.sagemath.com/projects/f90610cf-edf9-4eed-a3d6-39505611bf76/files/files/targets.html).

**Note**: [PDB](http://www.pdb.org/) structures are available for all of these enzymes. However, the list is based on annotations that have not been curated, so it is important to check that the structures actually correspond to enzymes producing the precursors. This information can be checked in [KEGG](http://www.genome.jp/kegg/).

If there is no information about the structure, you can search for a template in several databases:

* [CSA](http://www.ebi.ac.uk/thornton-srv/databases/CSA/): Catalytic Site Atlas (link **C** in RetroPath) in order to get all possible templates.
* Links to 3D structures in [UniProt](http://www.uniprot.org) (link **U** in RetroPath).
* [PDBsum](http://www.ebi.ac.uk/pdbsum/) provides comprehensive information about protein structures and their interactions with ligands (you can go from CSA).

If no structure is available but you still want to analyze this case, you will need to build a homology model:

* [SWISSMODEL](http://swissmodel.expasy.org/): a fully automated protein structure homology-modeling server.
* [CPHmodels 3.0 Server](http://www.cbs.dtu.dk/services/CPHmodels/): a protein homology modeling server.
* [BioInfoBank Meta Server](http://meta.bioinfo.pl/submit_wizard.pl): provides access to various fold recognition, function prediction and local structure prediction methods.
* [ESyPred](http://www.fundp.ac.be/sciences/biologie/urbm/bioinfo/esypred/): an automated homology modeling program.
* [Geno3D](http://geno3d-pbil.ibcp.fr/cgi-bin/geno3d_automat.pl?page=/GENO3D/geno3d_home.html): automated modeler of protein 3D structure
* [HHPred](http://toolkit.tuebingen.mpg.de/hhpred): predictor based on HMMs.

> #### Guided example: tyrosine aminotransferase
> We have selected the enzyme tyrosine aminotransferase in human (EC 2.6.1.5, PDB id [3dyd](http://www.pdb.org/pdb/explore/explore.do?structureId=3dyd)):
> <img src="http://www.pdb.org/pdb/images/3dyd_bio_r_500.jpg" width="210px" title="3dyd"></img>
> ... and its interaction with the substrate tyrosine (PubChem id [6057](http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=6057&loc=ec_rcs)):
> <img src="http://pubchem.ncbi.nlm.nih.gov/image/img3d.cgi?cid=6057" width="210px" title="6057"></img>


## 4. Predicting enzyme active site and locating hot-spots

The residues that play a functional role in a protein can be predicted at different levels, depending on the amount of information available. We will first make predictions based solely on the enzyme sequence or structure, without considering the ligand or the exact location of the catalytic site.

The main criterion here is that the location of protein interfaces is generally related to protein features that can be measured directly in the protein structure, such as:
* Sequence conservation
* Amino acid enrichment 
* Secondary structure
* Solvent accessibility
* Side chain flexibility
* Side-chain conformational entropy

### From single chain structures

>>**Study 2**. Find protein functional regions based on descriptors scores.
>>
>> 1. Select at least 4 scores from the single-chain algorithms listed below in order to get a residue-wise score of your protein template.
>> 2. Map the highest ranked residues into the 3D structure and visualize the resulting hot regions of the protein.
>> 3. How many interfaces does the protein contain?
>>

First of all, we are going to score residues in the protein based on its sequence and/or structural profile.

* [ConSurf](http://consurf.tau.ac.il/): Sequence conservation.
* [SCORECONS](http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/valdar/scorecons_server.pl): Score conservation based on multiple sequence alignment.
* [SHARP2](http://www.bioinformatics.sussex.ac.uk/SHARP2/sharp2.html): Hydrophobic patches.
* [PredictProtein](http://www.predictprotein.org/): protein prediction suite.
* [WHAT IF](http://swift.cmbi.ru.nl/whatif/): Molecular modelling package specialized in working with proteins and the molecules in their environment, such as water, ligands and nucleic acids.
* [cons-PPISP](http://pipe.scs.fsu.edu/ppisp.html): a method using PSI-Blast sequence profile and solvent accessibility as input to a neural network.
* [Promate](http://bioportal.weizmann.ac.il/promate): a naive Bayesian method based on properties, such as secondary structure, atom distribution, amino-acid pairing and sequence conservation.
* [PINUP](http://sparks.informatics.iupui.edu/PINUP/): a method based on an empirical scoring function consisting of a side-chain energy term, a term proportional to solvent accessible area, and a term accounting for sequence conservation.
* [PPI-Pred](http://bioinformatics.leeds.ac.uk/ppi-pred): a support vector machine method taking six properties (including surface shape and electrostatic potential) as input.
* [SPPIDER](http://sppider.cchmc.org/): a neural-network method that includes predicted solvent accessibility as input.
* [Meta-PPISP](http://pipe.scs.fsu.edu/meta-ppisp.html): a meta web server that is built on raw scores from cons-PPISP, Promate and PINUP through linear regression.
* [CASTp](http://cast.engr.uic.edu/): Pockets & cavities
* [SurfNet](http://www.biochem.ucl.ac.uk/~roman/surfnet/surfnet.html): Surfaces and void regions.
* [EvolutionaryTrace](http://mammoth.bcm.tmc.edu/ETserver.html): Evolutionary Trace Server.
* [ZEBRA](http://biokinet.belozersky.msu.ru/zebra): Identifies amino acids responsible for functional diversity based on structural information and physicochemical conservation.
* [CAVER](http://www.caver.cz/): Software tool for protein analysis and visualization to identify tunnels and channels in protein structures.
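
Once several residue-wise scores have been collected, they need to be put on a common scale before the highest-ranked residues can be mapped onto the structure. The following is a minimal sketch of one way to do this, assuming each tool's output has already been converted by hand into a `{residue number: score}` dictionary in which higher values mean "more likely functional". The function names and all numeric values are made up for illustration and do not come from any of the servers listed above.

```python
from statistics import mean


def min_max_normalize(scores):
    """Rescale a {residue: score} dictionary to the 0-1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {res: 0.0 for res in scores}
    return {res: (s - lo) / (hi - lo) for res, s in scores.items()}


def consensus_ranking(score_tables, top_n=5):
    """Average the normalized scores and return the top-ranked residues.

    Only residues scored by every descriptor are considered.
    """
    normalized = [min_max_normalize(table) for table in score_tables]
    shared = set.intersection(*(set(table) for table in normalized))
    combined = {res: mean(table[res] for table in normalized) for res in shared}
    # Sort by decreasing consensus score, breaking ties by residue number
    return sorted(combined, key=lambda res: (-combined[res], res))[:top_n]


# Made-up example values for three descriptors (higher = more likely functional)
conservation = {278: 0.4, 279: 0.7, 280: 0.9, 281: 0.2}
accessibility = {278: 10.0, 279: 35.0, 280: 60.0, 281: 5.0}
pocket = {278: 0.0, 279: 1.0, 280: 1.0, 281: 0.0}

print(consensus_ranking([conservation, accessibility, pocket], top_n=2))
```

A plain average is only one possible choice. Meta-predictors such as Meta-PPISP combine raw scores through a fitted regression instead.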


### Docking

>>**Study 3**. Build a model of the enzyme-substrate transition state.
>> 1. Use a docking tool in order to model the interaction of the substrate with the enzyme.
>> 2. Locate those residues that are interacting (less than 8 angstrom) with the ligand and visualize the resulting active region of the enzyme.
>> 3. Map the protein active region into the 3D structure and visualize it in the structure of the complex.
>> 4. Is there an overlap with the previous identified hot regions?
>>

Web servers:
 
* **Recommended**: [SwissDock](http://www.swissdock.ch/): a web service to predict the molecular interactions that may occur between a target protein and a small molecule.
* [1-Click Docking](https://mcule.com/apps/1-click-docking/): Docking predicts the binding orientation and affinity of a ligand to a target.
* [ParDOCK](http://www.scfbio-iitd.res.in/dock/pardock.jsp): Automated server for protein ligand docking.
* [PatchDock](http://bioinfo3d.cs.tau.ac.il/PatchDock/): An automatic server for molecular docking.

Software (needs installation):

* [HADDOCK](http://www.nmr.chem.uu.nl/haddock/): Protein-ligand docking
* [Rosetta](http://www.rosettacommons.org/)

* Check out many more tools at [Click2Drug](http://www.click2drug.org/directory_StructureBasedScreening.html)

> #### Guided example: docking tyrosine aminotransferase
> We submit the PDB and the ligand to [SwissDock](http://www.swissdock.ch/). Since [we know](http://en.wikipedia.org/wiki/Tyrosine_aminotransferase) that the active site is around Lys280 and the prosthetic group PLP (pyridoxal phosphate), the docking can be restricted to the region around this pocket:
> [<img src="http://previews.figshare.com/2355756/preview_2355756.jpg" width="320px" title="ensemble"></img>](http://dx.doi.org/10.6084/m9.figshare.1572205)
> From the ensemble of docked ligands, we select as our transition state model the one with the lowest predicted energy:

> [<img src="http://previews.figshare.com/2355757/preview_2355757.jpg" width="320px" title="ligand"></img>](http://dx.doi.org/10.6084/m9.figshare.1572206)
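
The pose selection described in the guided example can be sketched in a few lines. This assumes the docking results have been summarized by hand into a list of records, each holding a predicted energy and, optionally, a distance to the known active site. The field names `energy` and `dist_to_site`, the cutoff, and all values are hypothetical and are not SwissDock output.

```python
def best_pose(poses, max_dist_to_site=None):
    """Return the pose with the lowest (most favorable) predicted energy.

    If max_dist_to_site is given, only poses within that distance
    (in angstrom) of the active site are considered.
    """
    candidates = poses
    if max_dist_to_site is not None:
        candidates = [p for p in poses if p["dist_to_site"] <= max_dist_to_site]
    if not candidates:
        raise ValueError("no pose satisfies the distance filter")
    return min(candidates, key=lambda p: p["energy"])


# Made-up ensemble of docked poses (energies in kcal/mol, distances in angstrom)
poses = [
    {"cluster": 0, "energy": -7.1, "dist_to_site": 4.2},
    {"cluster": 1, "energy": -8.3, "dist_to_site": 21.5},
    {"cluster": 2, "energy": -7.9, "dist_to_site": 3.8},
]

print(best_pose(poses)["cluster"])
print(best_pose(poses, max_dist_to_site=8.0)["cluster"])
```

The distance filter matters when the docking was run over the whole protein, because the globally lowest-energy pose may sit in a pocket unrelated to catalysis.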


### Computing the residues in the interface

These are some tools that can compute the interface between the protein and the small ligand:

* **Recommended**: [Q-SiteFinder](http://www.modelling.leeds.ac.uk/qsitefinder/): Ligand binding site.
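
The interface can also be computed directly from the coordinates of the docked complex. The sketch below reads the fixed-column `ATOM` and `HETATM` records of a PDB file and lists the protein residues with at least one atom closer than a cutoff to any ligand atom, using the 8 angstrom cutoff from Study 3 as the default. It assumes the ligand is stored as `HETATM` records under the residue name `LIG`, which is a placeholder. The helper `pdb_line` and the three-atom toy complex exist only to make the example self-contained.

```python
import math


def parse_atoms(pdb_text):
    """Split PDB text into protein (ATOM) and hetero (HETATM) atom lists."""
    protein, hetero = [], []
    for line in pdb_text.splitlines():
        record = line[0:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        atom = {
            "resname": line[17:20].strip(),
            "chain": line[21],
            "resseq": int(line[22:26]),
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        }
        (protein if record == "ATOM" else hetero).append(atom)
    return protein, hetero


def interface_residues(pdb_text, ligand_resname, cutoff=8.0):
    """Return sorted (chain, residue name, residue number) tuples near the ligand."""
    protein, hetero = parse_atoms(pdb_text)
    ligand = [a for a in hetero if a["resname"] == ligand_resname]
    close = set()
    for atom in protein:
        if any(math.dist(atom["xyz"], lig["xyz"]) < cutoff for lig in ligand):
            close.add((atom["chain"], atom["resname"], atom["resseq"]))
    return sorted(close, key=lambda r: (r[0], r[2]))


def pdb_line(record, serial, name, resname, chain, resseq, x, y, z):
    """Format one toy coordinate line with PDB column positions (example only)."""
    return (f"{record:<6}{serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}"
            f"    {x:8.3f}{y:8.3f}{z:8.3f}")


# Toy complex: two protein atoms and one ligand atom (made-up coordinates)
toy_complex = "\n".join([
    pdb_line("ATOM", 1, "NZ", "LYS", "A", 280, 0.0, 0.0, 0.0),
    pdb_line("ATOM", 2, "CA", "GLY", "A", 100, 20.0, 0.0, 0.0),
    pdb_line("HETATM", 3, "C1", "LIG", "A", 500, 3.0, 0.0, 0.0),
])

print(interface_residues(toy_complex, "LIG"))
```

For a real structure, the text would come from reading the complex file, for example `open("complex.pdb").read()`, where the file name is again a placeholder.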

> #### Guided example: substrate interface
> We visualize the residues at the interface between the substrate and the enzyme in the model of the transition state:
> [<img src="http://previews.figshare.com/2355755/preview_2355755.jpg" width="210px" title="ligand"></img>](http://dx.doi.org/10.6084/m9.figshare.1572204)


### Prediction of hot-spots in the complex structure

>> **Study 4**. Identify enzyme hot-spots.
>> 1. Use at least one of the online tools to predict protein hot-spots.
>> 2. Map the protein hot-spots, compare their location with the protein interface, and visualize the 3D structure of the complex.
>>

Hot-spots are those residues in the protein that contribute most to the binding free energy. Experimentally, they are usually determined by alanine scanning.

* [KFC Server](http://kfc.mitchell-lab.org): Protein Interface Hot Spot Prediction.
* [Robetta Alanine Scanning](http://robetta.bakerlab.org/): In silico alanine scanning.
* [FoldX](http://foldx.crg.es/): Energy-based hot spot prediction tool.
* [PredictProtein](http://www.predictprotein.org/): Protein prediction suite (ISIS).
* [DrugScorePPI](http://mbilab.uni-frankfurt.de/dsppi/): A knowledge-based scoring function for computational alanine-scanning in protein-protein interfaces. 
* [DrugScore](http://pc1664.pharmazie.uni-marburg.de/drugscore/): Scores protein-ligand complexes of interest and visualizes the per-atom score contributions.
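
Whichever tool is used, an alanine scan is usually summarized as a change in binding free energy, $\Delta\Delta G = \Delta G_{\text{Ala}} - \Delta G_{\text{wt}}$, where a positive value means that the alanine substitution weakens binding. A minimal sketch of the classification step follows. The 2.0 kcal/mol cutoff is a common convention but is treated here as an adjustable assumption, and the residue labels and energies are made up.

```python
def hot_spots(dg_wild_type, dg_alanine, threshold=2.0):
    """Return {residue: ddG} for residues whose alanine mutant loses
    at least `threshold` kcal/mol of binding free energy.

    dg_wild_type: binding free energy of the wild-type complex
    dg_alanine:   {residue label: binding free energy of its alanine mutant}
    """
    selected = {}
    for residue, dg_mutant in dg_alanine.items():
        ddg = dg_mutant - dg_wild_type  # positive = binding weakened
        if ddg >= threshold:
            selected[residue] = round(ddg, 2)
    return selected


# Made-up binding free energies in kcal/mol
dg_wt = -8.0
dg_ala = {"K280": -5.1, "Y100": -7.6, "R300": -6.0}

print(hot_spots(dg_wt, dg_ala))
```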

## 5. Experimental verification from databases and literature

>>**Study 5**. Search for experimentally verified residues.
>> 1. Use the experimental databases and literature search in [Pubmed](http://www.ncbi.nlm.nih.gov/pubmed) in order to find at least one cluster of hot-spots that are functionally related to the interaction.
>>

At this point, it is important to check how close our predicted hot-spots and active regions are to experimentally verified protein residues. Several databases contain experimental data such as point mutations, including alanine scanning data. Furthermore, it is worth checking experimental protein interaction databases to learn more about the interacting partners of the protein and to see how they relate to the protein active regions.
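
One simple way to quantify how close the predictions are to the experimental data is to compare the two residue sets directly. The sketch below reports the overlap together with precision (the fraction of predicted residues that are verified) and recall (the fraction of verified residues that were predicted). The residue numbers are made up, and in practice the verified set would be collected by hand from the databases and literature.

```python
def compare_residue_sets(predicted, verified):
    """Return (overlap, precision, recall) for two collections of residues."""
    predicted, verified = set(predicted), set(verified)
    overlap = predicted & verified
    precision = len(overlap) / len(predicted) if predicted else 0.0
    recall = len(overlap) / len(verified) if verified else 0.0
    return sorted(overlap), precision, recall


# Made-up residue numbers
predicted_hot_spots = [100, 280, 300, 310]
verified_residues = [280, 300, 350]

overlap, precision, recall = compare_residue_sets(predicted_hot_spots, verified_residues)
print(overlap, precision, round(recall, 2))
```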

### Database of experimental affinities and interface residues

* [ASEdb](http://nic.ucsf.edu/asedb/): Alanine scanning database.
* [CSA](http://www.ebi.ac.uk/thornton-srv/databases/CSA/): Catalytic Site Atlas.
* [ProTherm](http://gibk26.bse.kyutech.ac.jp/jouhou/Protherm/protherm.html): Thermodynamic parameters for wild-type and mutant proteins.
* [WikiBID](http://tsailab.tamu.edu/wikiBID/index.php/Main_Page): Binding Interface Wiki.
* [BRENDA](http://www.brenda-enzymes.org/): Enzyme information database.
* [Ligand Protein DataBase (LPDB)](http://lpdb.chem.lsa.umich.edu/): Collection of curated ligand-protein complexes, with 3D structures and experimental binding free energies. Maintained by the University of Michigan.
* [AffinDB](http://pc1664.pharmazie.uni-marburg.de/affinity/): Freely accessible database of affinities for protein-ligand complexes from the PDB.
* [Protein Ligand Database (PLD)](http://chemistry.st-andrews.ac.uk/staff/jbom/group/material.html): Collection of protein-ligand complexes extracted from the PDB along with biomolecular data, including binding energies, Tanimoto ligand similarity scores and protein sequence similarities of protein-ligand complexes.
* [SCORPIO](http://scorpio.biophysics.ismb.lon.ac.uk/scorpio.html): Free online repository of protein-ligand complexes which have been structurally resolved and thermodynamically characterised.
* [BAPPL complexes set](http://www.scfbio-iitd.res.in/software/drugdesign/proteinliganddataset.htm): 161 protein-ligand complexes with experimental and estimated binding free energies calculated with the BAPPL server.
* [DNA Drug complex dataset](http://www.scfbio-iitd.res.in/software/drugdesign/dnadrugdataset.jsp): Dataset of DNA-drug complexes consisting of 16 minimized crystal structures and 34 model-built structures, along with experimental affinities, used to validate PreDDICTA.
* [Binding Database](http://www.bindingdb.org/bind/index.jsp): Public, web-accessible database of measured binding affinities, focusing chiefly on the interactions of protein considered to be drug-targets with small, drug-like molecules. Maintained by the Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute.


### Database of protein interactions

* [DIP](http://dip.doe-mbi.ucla.edu/dip/Main.cgi): Database of Interacting Proteins.
* [IntAct](http://www.ebi.ac.uk/intact/main.xhtml): Database system and analysis tools for protein interaction data. 
* [BOND](http://bond.unleashedinformatics.com/Action?): The Biomolecular Interaction Network Database (BIND).
* [MINT](http://mint.bio.uniroma2.it/mint/Welcome.do): The Molecular Interactions Database.
* [BioGRID](http://thebiogrid.org/): Biological General Repository for Interaction Datasets.

	
## 6. Enzyme Design

>>**Study 6**. Combinatorial library of a ranked ensemble of variants.
>> 1. Use the [RosettaDesign](http://rosettadesign.med.unc.edu/) webserver in order to perform a *saturation mutagenesis* focused library for the hot-spot positions.
>> 2. List the best predicted substitutions for the hot-spots.
>> 3. Sketch an in-silico protocol for building a combinatorial library including simultaneous mutations of the hot-spots.
>>

Finally, once the desired positions to mutate have been selected, we perform in silico mutations in order to rank the variants and build a combinatorial library.
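
The size of such a library grows quickly, so it helps to enumerate it explicitly before submitting anything to a design server. The sketch below first lists all single-point substitutions for a set of hot-spot positions (saturation mutagenesis) and then builds every combination of a shortlist of candidate residues per position. The positions, wild-type residues and shortlists are made up, and the scoring of each variant is left to the design tool.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter code


def saturation_library(wild_type):
    """List all single-point variants, e.g. 'K280R', for {position: residue}."""
    return [
        f"{wt}{pos}{aa}"
        for pos, wt in sorted(wild_type.items())
        for aa in AMINO_ACIDS
        if aa != wt
    ]


def combinatorial_library(candidates):
    """Combine shortlisted residues at every position.

    candidates: {position: string of allowed one-letter residues}
    Returns variant labels such as '100F/280R'.
    """
    positions = sorted(candidates)
    return [
        "/".join(f"{pos}{aa}" for pos, aa in zip(positions, combo))
        for combo in product(*(candidates[pos] for pos in positions))
    ]


# Made-up hot-spot positions and shortlists (wild-type residue kept in each shortlist)
singles = saturation_library({100: "Y", 280: "K"})
combos = combinatorial_library({100: "YF", 280: "KRH"})

print(len(singles), singles[:3])
print(len(combos), combos)
```

With n positions and k candidates at each, the combinatorial library has k^n members, which is why the single-point ranking is normally used to prune the candidates first.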

### Webserver interface

* [RosettaDesign](http://rosettadesign.med.unc.edu/): Rosetta design can be used to identify sequences compatible with a given protein backbone. Some of Rosetta design's successes include the design of a novel protein fold, redesign of an existing protein for greater stability, increased binding affinity between two proteins, and the design of novel enzymes.
* [Rosie](http://rosie.rosettacommons.org/): Rosetta Online Server that Includes Everyone.

### Other webservers of interest

* [Rosetta VIP](http://rosie.rosettacommons.org/vip): Automated selection of stabilizing mutations in designed and natural proteins.
* [PoPMuSiC](http://dezyme.com/): tool for the computer-aided design of mutant proteins with controlled stability properties.
* [Protein WISDOM](http://atlas.princeton.edu/proteinwisdom/): Workbench for in silico de novo design of biomolecules.
* [EvoDesign](http://zhanglab.ccmb.med.umich.edu/EvoDesign/): EvoDesign is an evolutionary profile based approach to de novo protein design.

### Stand-alone only

* [IPRO](http://maranas.che.psu.edu/IPRO.htm): Iterative Protein Redesign and Optimization. IPRO redesigns proteins to increase or give specificity to native or novel substrates and cofactors. This is done by repeatedly randomly perturbing the backbones of the proteins around specified design positions, identifying the lowest energy combination of rotamers, and determining if the new design has a lower binding energy than previous ones. The iterative nature of this process allows IPRO to make additive mutations to the protein sequence that collectively improve the specificity towards the desired substrates and/or cofactors. 
* [EGAD](http://egad.ucsd.edu/EGAD_manual/index.html): A Genetic Algorithm for protein Design. A free, open-source software package for protein design and prediction of mutation effects on protein folding stabilities and binding affinities. EGAD can also consider multiple structures simultaneously for designing specific binding proteins or locking proteins into specific conformational states. In addition to natural protein residues, EGAD can also consider free-moving ligands with or without rotatable bonds. EGAD can be used with single or multiple processors.
* [SHARPEN](http://koko.che.caltech.edu/): Systematic Hierarchical Algorithms for Rotamers and Proteins on an Extended Network. SHARPEN offers a variety of combinatorial optimization methods (e.g. Monte Carlo, Simulated Annealing) and can score proteins using the Rosetta all-atom force field or molecular mechanics force fields (OPLSaa).
* [Abalone](http://www.biomolecular-modeling.com/Abalone/index.html): Software for protein modelling and visualisation.
* [Janus](https://sites.google.com/site/mdtoneylab/research/janus-prorgam-download): prediction and ranking of mutations required for functional interconversion of enzymes.
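
Several of the tools above share the same basic loop: propose a change at a design position, score it, and keep it only if the score improves. The following toy sketch illustrates that accept-if-lower loop on a sequence alone. It is not the implementation of IPRO or any other package listed here, since it omits backbone perturbation and rotamer optimization entirely. The energy function is a deliberately artificial placeholder (the number of mismatches to an arbitrary target), and the sequences are made up.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_energy(sequence, target):
    """Placeholder score: number of positions that differ from `target`."""
    return sum(a != b for a, b in zip(sequence, target))


def iterative_redesign(start, design_positions, energy, n_iter=500, seed=0):
    """Greedy search: mutate one design position at a time and keep the
    mutation only if it lowers the energy."""
    rng = random.Random(seed)
    best = list(start)
    best_energy = energy(best)
    for _ in range(n_iter):
        trial = best.copy()
        position = rng.choice(design_positions)
        trial[position] = rng.choice(AMINO_ACIDS)
        trial_energy = energy(trial)
        if trial_energy < best_energy:
            best, best_energy = trial, trial_energy
    return "".join(best), best_energy


# Made-up sequences; only indices 3 and 7 (0-based) are allowed to change
start = "MKTAYIAKQR"
target = "MKTWYIAHQR"
designed, final_energy = iterative_redesign(
    start, [3, 7], lambda seq: toy_energy(seq, target)
)
print(designed, final_energy)
```

Replacing the strict acceptance test with a temperature-dependent probability would turn this into the Monte Carlo or simulated annealing search mentioned for SHARPEN.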