SynBioMTS

An automated model test system for synthetic biology models of gene expression and regulation.

synbiomts uses a database of over 16000 unique characterized genetic systems to run Python-wrapped sequence-function models, quantify model accuracy, accept or reject proposed mechanistic hypotheses, and identify sources of model error. This package is easily modifiable to expand the genetic system database, calculate additional statistical test metrics, and test new and improved gene expression models implemented in nearly any programming language.

Getting Started

Dependencies

The Python packages used are listed below. You can install the first three together as part of the SciPy Stack.

- pandas - Database management
- scipy - Statistics calculations
- numpy - General purpose numerical computing
- scikit-learn - Machine learning

Optional

- WebLogo - We've wrapped the WebLogo Python API so that you can easily generate weblogos for your sequence datasets.
- ViennaRNA - A C library for the prediction and comparison of RNA secondary structures. ViennaRNA is wrapped with /models/PyVRNA.py for use in modeling and machine learning analysis.

Installing

Install with the following:

    git clone https://github.com/reisalex/SynBioMTS
    cd SynBioMTS
    sudo python setup.py install

The model test system can then be imported in Python:

    import synbiomts

Usage

To use the provided genetic system database, navigate to synbiomts/examples/RBS and run the database initialization module (initdb.py):

    cd synbiomts/examples/RBS
    python initdb.py

To use the model test system:

1. Wrap the model with a Python function.
2. Create a models Container object and pass the wrapped functions with the add method.
3. Specify the functional form between the model predictor and the system function with the setform method.
4. Create a ModelTest object and pass the models Container.
5. Run model calculations and statistics with run.
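Step 1 can also cover models that are not written in Python. The sketch below is one possible way to do that and is not part of synbiomts: it launches an external program with subprocess, parses JSON from its standard output, and returns the result as a dictionary. The stand-in program (FAKE_MODEL), the JSON exchange format, and the output keys length and gc are all assumptions made for illustration.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for an external model executable. A real model would be
# its own program (C, MATLAB, etc.); here a one-line Python script plays that role.
FAKE_MODEL = (
    "import json, sys; seq = sys.argv[1]; "
    "print(json.dumps({'length': len(seq), "
    "'gc': (seq.count('G') + seq.count('C')) / len(seq)}))"
)


def external_model(sequence):
    """Run the external program on one sequence and return its outputs as a dict."""
    completed = subprocess.run(
        [sys.executable, "-c", FAKE_MODEL, sequence],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output as str
        check=True,           # raise CalledProcessError if the program fails
    )
    # Assumption: the program prints exactly one JSON object on stdout.
    return json.loads(completed.stdout)


if __name__ == "__main__":
    print(external_model("ATGGCC"))
```

Because external_model takes a sequence and returns a dictionary, it has the same shape as the wrapped functions described in the steps above.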
    import synbiomts

    # Wrap the model with a function
    import RBS_Calculator_v2

    def RBSCalcv2(sequence,temperature):
        rRNA = 'ACCTCCTTA'
        model = RBS_Calculator_v2.RBS_Calculator(mRNA=sequence,rRNA=rRNA)
        model.temp = temperature
        model.run()
        RBS = model.output() # simplified for the example

        # Results should be returned as a dictionary
        # The keys will become labels in the resulting pandas dataframe
        results = {
            'TIR': RBS.tir,
            'dG_total': RBS.dG_total,
            'dG_mRNA_rRNA': RBS.dG_mRNA_rRNA,
            'dG_mRNA': RBS.dG_mRNA
        }
        return results

    if __name__ == "__main__":

        # create models Container object
        models = synbiomts.interface.Container()

        # add the model(s)
        models.add(RBSCalcv2)

        # specify the form of each model
        # RBS_Calculator is a thermodynamic model where: Protein ~ K*exp(-0.45*dG_total)
        models.setform(['RBSCalcv2'],x='dG_total',y='PROT.MEAN',yScale='ln',a1=-0.45)

        # create test system object
        testsystem = synbiomts.analyze.ModelTest(models,'geneticsystems.db',add_data=True,verbose=True)

        # run model predictions and statistics calculations
        testsystem.run()

        # if you want to shelve the model calculations
        # testsystem.run(calcsFilename='savedcalcs.db')

When you add models to the Container object, you can specify arguments of the wrapped function. This is useful when you want to vary a parameter and test which value is most accurate:

    for s in range(0,16):
        name = "RBSCalcv2-s={}".format(s)
        models.add(RBSCalcv2,optimal_spacing=s)

You can specify filters to run predictions on a subset of genetic systems with shared properties:

    filters = {
        'ORGANISM': ['Escherichia coli'],
        'DATASET' : ['Beck_PLoS_2016', 'Salis_NBT_2009', 'Tian_NAR_2015']
    }
    testsystem = synbiomts.analyze.ModelTest(models,'geneticsystems.db',filters)

The model test system uses multiprocessing to run model predictions. By default, the number of processes equals the number of available CPUs.
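As a rough illustration of that default, the sketch below spreads a toy model over a list of sequences with Python's multiprocessing.Pool, using every available CPU unless told otherwise. It is not the synbiomts implementation; toy_model, predict_all, and the workers argument are hypothetical names chosen for this example.

```python
import multiprocessing


def toy_model(sequence):
    """Hypothetical stand-in for a wrapped model: returns a dict of outputs."""
    return {"length": len(sequence)}


def predict_all(sequences, workers=None):
    """Apply toy_model to every sequence, in parallel unless workers == 1."""
    if workers is None:
        workers = multiprocessing.cpu_count()  # default: all available CPUs
    if workers == 1:
        # Serial path: no pool is created, which keeps debugging simple.
        return [toy_model(s) for s in sequences]
    with multiprocessing.Pool(processes=workers) as pool:
        # Pool.map returns results in the same order as the inputs.
        return pool.map(toy_model, sequences)


if __name__ == "__main__":
    print(predict_all(["ATG", "ATGAAA"], workers=1))
```

Running with a single worker skips the pool entirely, which tends to make tracebacks from a wrapped model easier to read.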
You can specify the number of processes, either to force a single process or to request a particular number:

    testsystem = synbiomts.analyze.ModelTest(models,'geneticsystems.db',nprocesses=1)

The run method calculates both model predictions and statistics. If you only want to run model predictions, you can use predict:

    testsystem.predict()

See /examples for more detailed examples.

Statistics

By default, the model test system will run statistics assuming the model predictor and the system function share a linear relationship. Specifically, analyze.statistics() calls a custom function linear_complete from the stats module to compute the following:

- Fitted slope and y-intercept with outliers removed (via MAD method)
- Relative model error (Apparent Value/Predicted Value)
- Pearson & Spearman correlation coefficients
- One-sided model error cumulative distribution function
- Kullback-Leibler divergence

If you want to run further statistics, you can import the stats module:

    from synbiomts import stats

You can always add additional stats functions as needed.

Exporting

Export to Excel is as simple as:

    testsystem.run()
    testsystem.to_excel('filename')

By default, pandas exports with the labels (columns) alphabetized. The model test system overrides the default export order if you specify the labels. See /examples/labels for the ones I use:

    testsystem.run()
    with open("labels/labels1.txt","r") as f:
        predictLabels = [x.strip('\n') for x in f.readlines()]
    with open("labels/labels_stats.txt","r") as f:
        statsLabels = [x.strip('\n') for x in f.readlines()]
    testsystem.to_excel('filename',predictLabels,statsLabels)

Acknowledgements

Thanks to Howard M Salis (Penn State), Iman Farasat (Merck), Amin Espah Borujeni (MIT), Tian Tian (JBEI), Daniel Goodman (Harvard), Sri Kosuri (UCLA), Robert Egbert (Berkeley), Mark Mimee (MIT), and Heather Beck (Vienna) for providing high quality characterization data.
A special thanks to Daniel Goodman for discussion on Flow-seq and for providing additional information on the 2013 Flow-seq datasets.

If you use synbiomts, please cite:

Alexander C. Reis and Howard M. Salis. An automated model test system for systematic development and improvement of gene expression models. In Preparation (2017).