Nucleic Acids Res. 2015 Apr 20;43(Web Server issue):W589–W598. doi: 10.1093/nar/gkv350

The BioMart community portal: an innovative alternative to large, centralized data repositories

Damian Smedley ¹, Syed Haider ², Steffen Durinck ³, Luca Pandini ⁴, Paolo Provero ⁴,⁵, James Allen ⁶, Olivier Arnaiz ⁷, Mohammad Hamza Awedh ⁸, Richard Baldock ⁹, Giulia Barbiera ⁴, Philippe Bardou ¹⁰, Tim Beck ¹¹, Andrew Blake ¹², Merideth Bonierbale ¹³, Anthony J Brookes ¹¹, Gabriele Bucci ⁴, Iwan Buetti ⁴, Sarah Burge ⁶, Cédric Cabau ¹⁰, Joseph W Carlson ¹⁴, Claude Chelala ¹⁵, Charalambos Chrysostomou ¹¹, Davide Cittaro ⁴, Olivier Collin ¹⁶, Raul Cordova ¹³, Rosalind J Cutts ¹⁵, Erik Dassi ¹⁷, Alex Di Genova ¹⁸, Anis Djari ¹⁹, Anthony Esposito ²⁰, Heather Estrella ²⁰, Eduardo Eyras ²¹,²², Julio Fernandez-Banet ²⁰, Simon Forbes ¹, Robert C Free ¹¹, Takatomo Fujisawa ²³, Emanuela Gadaleta ¹⁵, Jose M Garcia-Manteiga ⁴, David Goodstein ¹⁴, Kristian Gray ²⁴, José Afonso Guerra-Assunção ¹⁵, Bernard Haggarty ⁹, Dong-Jin Han ²⁵,²⁶, Byung Woo Han ²⁷,²⁸, Todd Harris ²⁹, Jayson Harshbarger ³⁰, Robert K Hastings ¹¹, Richard D Hayes ¹⁴, Claire Hoede ¹⁹, Shen Hu ³¹, Zhi-Liang Hu ³², Lucie Hutchins ³³, Zhengyan Kan ²⁰, Hideya Kawaji ³⁰,³⁴, Aminah Keliet ³⁵, Arnaud Kerhornou ⁶, Sunghoon Kim ²⁵,²⁶, Rhoda Kinsella ⁶, Christophe Klopp ¹⁹, Lei Kong ³⁶, Daniel Lawson ³⁷, Dejan Lazarevic ⁴, Ji-Hyun Lee ²⁵,²⁷,²⁸, Thomas Letellier ³⁵, Chuan-Yun Li ³⁸, Pietro Lio ³⁹, Chu-Jun Liu ³⁸, Jie Luo ⁶, Alejandro Maass ¹⁸,⁴⁰, Jerome Mariette ¹⁹, Thomas Maurel ⁶, Stefania Merella ⁴, Azza Mostafa Mohamed ⁴¹, Francois Moreews ¹⁰, Ibounyamine Nabihoudine ¹⁹, Nelson Ndegwa ⁴², Céline Noirot ¹⁹, Cristian Perez-Llamas ²², Michael Primig ⁴³, Alessandro Quattrone ¹⁷, Hadi Quesneville ³⁵, Davide Rambaldi ⁴, James Reecy ³², Michela Riba ⁴, Steven Rosanoff ⁶, Amna Ali Saddiq ⁴⁴, Elisa Salas ¹³, Olivier Sallou ¹⁶, Rebecca Shepherd ¹, Reinhard Simon ¹³, Linda Sperling ⁷, William Spooner ⁴⁵,⁴⁶, Daniel M Staines ⁶, Delphine Steinbach ³⁵, Kevin Stone ³³, Elia Stupka ⁴, Jon W Teague ¹, Abu Z Dayem Ullah ¹⁵, Jun Wang ³⁶, Doreen Ware ⁴⁵, Marie Wong-Erasmus ⁴⁷, Ken Youens-Clark ⁴⁵, Amonida Zadissa ⁶, Shi-Jian Zhang ³⁸, Arek Kasprzyk ⁴,⁴⁸,*

¹Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK

²The Weatherall Institute Of Molecular Medicine, University of Oxford, Oxford, OX3 9DS, UK

³Genentech, Inc. 1 DNA Way South San Francisco, CA 94080, USA

⁴Center for Translational Genomics and Bioinformatics San Raffaele Scientific Institute, Via Olgettina 58, 20132 Milan, Italy

⁵Dept of Molecular Biotechnology and Health Sciences University of Turin, Italy

⁶European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK

⁷Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Université Paris Sud, 1 avenue de la terrasse, 91198 Gif sur Yvette, France

⁸Department of Electrical and Computer Engineering, Faculty of Engineering, King Abdulaziz University, Jeddah, Saudi Arabia

⁹MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, Western General Hospital, Edinburgh, EH4 2XU, UK

¹⁰Sigenae, INRA, Castanet-Tolosan, France

¹¹Department of Genetics, University of Leicester, University Road, Leicester, LE1 7RH, UK

¹²MRC Harwell, Harwell Science and Innovation Campus, Oxfordshire, OX11 0RD, UK

¹³International Potato Center (CIP), Lima, 1558, Peru

¹⁴Department of Energy, Joint Genome Institute, Walnut Creek, USA

¹⁵Centre for Molecular Oncology, Barts Cancer Institute, Queen Mary University of London, Charterhouse Square, London EC1M 6BQ, UK

¹⁶IRISA-INRIA, Campus de Beaulieu 35042 Rennes, France

¹⁷Laboratory of Translational Genomics, Centre for Integrative Biology, University of Trento, Trento, Italy

¹⁸Center for Mathematical Modeling and Center for Genome Regulation, University of Chile, Beauchef 851, 7th floor, Chile

¹⁹Plate-forme bio-informatique Genotoul, Mathématiques et Informatique Appliquées de Toulouse, INRA, Castanet-Tolosan, France

²⁰Oncology Computational Biology, Pfizer, La Jolla, USA

²¹Catalan Institute for Research and Advanced Studies (ICREA), Passeig Lluis Companys 23, E-08010 Barcelona, Spain

²²Universitat Pompeu Fabra, Dr Aiguader 88 E-08003 Barcelona, Spain

²³Kazusa DNA Research Institute, Chiba, 292–0818, Japan

²⁴HUGO Gene Nomenclature Committee (HGNC), European Bioinformatics Institute (EMBL-EBI) Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK

²⁵Medicinal Bioconvergence Research Center, College of Pharmacy, Seoul National University, Seoul 151–742, Republic of Korea

²⁶Department of Molecular Medicine and Biopharmaceutical Sciences, Seoul National University, Seoul 151–742, Republic of Korea

²⁷Research Institute of Pharmaceutical Sciences, College of Pharmacy, Seoul National University, Seoul 151–742, Republic of Korea

²⁸Information Center for Bio-pharmacological Network, Seoul National University, Suwon 443–270, Republic of Korea

²⁹Ontario Institute for Cancer Research, Toronto, M5G 0A3, Canada

³⁰RIKEN Center for Life Science Technologies (CLST), Division of Genomic Technologies (DGT), Kanagawa, 230–0045, Japan

³¹School of Dentistry and Dental Research Institute, University of California Los Angeles (UCLA), Los Angeles, CA 90095–1668, USA

³²Iowa State University, USA

³³Mouse Genomic Informatics Group, The Jackson Laboratory, Bar Harbor, ME 04609, USA

³⁴RIKEN Preventive Medicine and Diagnosis Innovation Program, Saitama 351–0198, Japan

³⁵INRA URGI Centre de Versailles, bâtiment 18 Route de Saint Cyr 78026 Versailles, France

³⁶Center for Bioinformatics, State Key Laboratory of Protein and Plant Gene Research, College of Life Sciences, Peking University, Beijing, 100871, P.R. China

³⁷VectorBase, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK

³⁸Institute of Molecular Medicine, Peking University, Beijing, China

³⁹Computer Laboratory, University of Cambridge, Cambridge, CB3 0FD, UK

⁴⁰Department of Mathematical Engineering, University of Chile, Av. Beauchef 851, 5th floor, Santiago, Chile

⁴¹Department of Biochemistry, Faculty of Science for Girls, King Abdulaziz University, Jeddah, Saudi Arabia

⁴²Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, PO Box 281, 17177 Stockholm, Sweden

⁴³Inserm U1085 IRSET, University of Rennes 1, 35042 Rennes, France

⁴⁴Department of Biological Sciences, Faculty of Science for Girls, King Abdulaziz University, Jeddah, Saudi Arabia

⁴⁵Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA

⁴⁶Eagle Genomics Ltd., Babraham Research Campus, Cambridge, CB22 3AT, UK

⁴⁷Human Longevity, Inc. 10835 Road to the Cure 140 San Diego, CA 92121, USA

⁴⁸Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia

*To whom correspondence should be addressed. Tel: +39 02 26439139; Fax: +39 02 2643 4153; Email: Arek.Kasprzyk@gmail.com

PMCID: PMC4489294 PMID: 25897122

Abstract

The BioMart Community Portal (www.biomart.org) is a community-driven effort to provide a unified interface to biomedical databases that are distributed worldwide. The portal provides access to numerous database projects supported by 30 scientific organizations. It includes over 800 different biological datasets spanning genomics, proteomics, model organisms, cancer data, ontology information and more. All resources available through the portal are independently administered and funded by their host organizations. The BioMart data federation technology provides a unified interface to all the available data. The latest version of the portal comes with many new databases that have been created by our ever-growing community. It also comes with better support and extensibility for data analysis and visualization tools. A new addition to our toolbox, the enrichment analysis tool, is now accessible through graphical and web service interfaces. The BioMart Community Portal averages over one million requests per day. Building on this level of service and the wealth of information that has become available, the BioMart Community Portal has introduced a new, more scalable and cheaper alternative to the large data stores maintained by specialized organizations.

INTRODUCTION

The methods of data generation and processing that are utilized in biomedical sciences have radically changed in recent years. With the advancement of new high-throughput technologies, data have grown in terms of quantity as well as complexity. However, the significance of the information that is hidden in the newly generated experimental data can only be deciphered by linking it to other types of biological data that have been accumulated previously. As a result there are already numerous bioinformatics resources and new ones are constantly being created. Typically, each resource comes with its own query interface. This poses a problem for the scientists who want to utilize such resources in their research. Even the simplest task such as compiling results from a few existing resources is challenging due to the lack of a complete, up to date catalogue of already existing resources and the necessity of constantly learning how to navigate new query interfaces. A different challenge is faced by collaborating groups of scientists who independently generate or maintain their own data. Such collaborations are seriously hampered by the lack of a simple data management solution that would make it possible to connect their disparate, geographically distributed data sources and present them in a uniform way to other scientists. The BioMart project has been set up to address these challenges.

SOFTWARE

BioMart is an open source data management system, which is based on a data federation model (1). Under this model, each data source is managed, updated and released independently by its host organization, while the BioMart software provides a unified view of these sources that are distributed worldwide. The data sources are presented to the user through a unified set of graphical and programmatic interfaces so that they appear to be a single integrated database. To navigate this database and compile a query, the user does not have to learn the underlying structure of each data source but instead uses a set of simple abstractions: datasets, filters and attributes. Once the user's input is provided, the software distributes parts of the query to individual data sources, collects the data and presents the user with the unified result set.
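The dataset, filter and attribute abstractions and the distribute-and-collect step described above can be illustrated with a toy sketch. It is not the BioMart implementation: the `Query` and `InMemorySource` classes, the `federated_query` function, the shared `gene_id` key and the records `G1` and `G2` are all hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """One dataset, filters that restrict rows, attributes to return."""
    dataset: str
    filters: dict = field(default_factory=dict)
    attributes: list = field(default_factory=list)


class InMemorySource:
    """Hypothetical stand-in for one independently hosted data source."""

    def __init__(self, rows):
        self.rows = rows  # list of dicts, one per record

    def run(self, filters, attributes):
        # Apply only the filters this source has a column for, then
        # return only the requested attributes that this source holds.
        hits = [r for r in self.rows
                if all(r[k] == v for k, v in filters.items() if k in r)]
        return [{a: r[a] for a in attributes if a in r} for r in hits]


def federated_query(query, sources, key):
    """Send the query to every source and merge partial rows on a shared key."""
    merged = {}
    for source in sources:
        for row in source.run(query.filters, [key] + query.attributes):
            merged.setdefault(row[key], {}).update(row)
    # Keep only records for which every requested attribute was found.
    return [row for _, row in sorted(merged.items())
            if all(a in row for a in query.attributes)]


# Two made-up sources that share a key but hold different attributes.
genes = InMemorySource([{"gene_id": "G1", "chromosome": "10"},
                        {"gene_id": "G2", "chromosome": "7"}])
phenotypes = InMemorySource([{"gene_id": "G1", "phenotype": "P1"},
                             {"gene_id": "G2", "phenotype": "P2"}])

query = Query(dataset="example", filters={"chromosome": "10"},
              attributes=["chromosome", "phenotype"])
result = federated_query(query, [genes, phenotypes], key="gene_id")
print(result)
```

The caller only names filters and attributes; which source holds which column stays hidden behind the federation layer, which is the property the text describes.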

The BioMart software is data agnostic and its applications are not limited to biological data. It is cross-platform and supports many popular relational database management systems, including MySQL, Oracle and PostgreSQL. It also supports many third party packages such as Taverna (2), Galaxy (3), Cytoscape (4) and biomaRt (5), which is part of the Bioconductor (6) library.

The BioMart project currently maintains two independent code bases: one written in Java and one written in Perl. For more information about the architecture and capabilities of each of the packages please refer to previous publications (1,7). The latest version of the Java based BioMart software has been significantly enhanced with new additions to the existing collection of graphical user interfaces (GUIs). It has also been re-engineered to provide better support and extensibility for data analysis and visualization tools. The first of the BioMart tools based on this new framework has already been implemented and is accessible from the BioMart Community Portal.

The BioMart project adheres to the open source philosophy that promotes collaboration and code reuse. Two good examples of how this philosophy benefits the scientific community are provided by two independent research groups. The INRA group based in Toulouse, France has recently released a software package called RNAbrowse (RNA-Seq De Novo Assembly Results Browser) (8). The Pfizer group based in La Jolla, USA has just announced the release of OASIS: A Web-based Platform for Exploratory Analysis of Cancer Genome and Transcriptome data (www.oasis-genomics.org). Both of these software packages are based on the BioMart software.

DATA

The BioMart community consists of a wide spectrum of different research groups that use the BioMart technology to provide access to their databases. It currently comprises 30 scientific organizations supporting 38 database projects that contain over 800 different biological datasets spanning genomics, proteomics, model organisms, cancer data, ontology information and more. The BioMart community is constantly growing and since the last publication (9), 11 new database projects have become available. As new BioMart databases become available locally they also become gradually integrated into the BioMart Community Portal. The main function of the portal is to provide a convenient single point of access to all available data that is distributed worldwide (Figure 1). All BioMart databases that are included in the portal are independently administered and funded. Table 1 provides a detailed list of all BioMart community resources as of March 2015.

Figure 1. BioMart community databases and their host countries.

Table 1. BioMart community databases and their host organizations.

Database	Description	Host	Reference
Animal Genome databases^a,b	Agriculturally important livestock genomes	Iowa State University, US	NA
Atlas of UTR Regulatory Activity (AURA)^a	Meta-database centred on mapping post-transcriptional (PTR) interactions of trans-factors with human and mouse untranslated regions (UTRs) of mRNAs	University of Trento, Italy	(36)
BCCTB Bioinformatics Portal^a	Portal for mining omics data on breast cancer from published literature and experimental datasets	Breast Cancer Campaign/Barts Cancer Institute UK	(37)
Cildb	Database for eukaryotic cilia and centriolar structures, integrating orthology relationships for 44 species with high-throughput studies and OMIM	Centre National de la Recherche Scientifique (CNRS), France	(38)
COSMIC	Somatic mutation information relating to human cancers	Wellcome Trust Sanger Institute (WTSI), UK	(39)
DAPPER^a	Mass spec identified protein interaction networks in Drosophila cell cycle regulation	Department of Genetics, University of Cambridge, Cambridge, UK	NA
EMAGE	In situ gene expression data in the mouse embryo	Medical Research Council, Human Genetics Unit (MRC HGU), UK	(40)
Ensembl	Genome databases for vertebrates and other eukaryotic species	Wellcome Trust Sanger Institute (WTSI), UK	(41)
Ensembl Genomes	Ensembl Fungi, Metazoa, Plants and Protists	European Bioinformatics Institute (EBI), UK	(41)
Eurexpress	Transcriptome atlas database for mouse embryo	Medical Research Council, Human Genetics Unit (MRC HGU), UK	(42)
EuroPhenome	Mouse phenotyping data	Harwell Science and Innovation Campus (MRC Harwell), UK	(15)
FANTOM5^a	The FANTOM5 project mapped a promoter level expression atlas in human and mouse. The FANTOM5 BioMart instance provides the set of promoters along with annotation.	RIKEN Center for Life Science Technologies (CLST), Japan	(16)
GermOnLine	Cross-species microarray expression database focusing on germline development, meiosis, and gametogenesis as well as the mitotic cell cycle	Institut national de la santé et de la recherche médicale (Inserm), France	(17)
GnpIS^a	Genetic and Genomic Information System (GnpIS)	Institut National de la Recherche Agronomique (INRA), Unité de Recherche en Génomique-Info (URGI), France	(18)
Gramene	Agriculturally important grass genomes	Cold Spring Harbor Laboratory (CSHL), US	(43)
GWAS Central^a	GWAS Central provides a comprehensive curated collection of summary level findings from genetic association studies	University of Leicester, UK	(19)
HapMap	Multi-country effort to identify and catalog genetic similarities and differences in human beings	National Center for Biotechnology Information (NCBI), US	(20)
HGNC	Repository of human gene nomenclature and associated resources	European Bioinformatics Institute (EBI), UK	(21)
i-Pharm^a	PharmDB-K is an integrated bio-pharmacological network database for TKM (Traditional Korean Medicine)	Information Center for Bio-pharmacological Network (i-Pharm), South Korea	(22)
InterPro	Integrated database of predictive protein ‘signatures’ used for the classification and automatic annotation of proteins and genomes	European Bioinformatics Institute (EBI), UK	(44)
KazusaMart	Cyanobase, rhizobia, and plant genome databases	Kazusa DNA Research Institute (Kazusa), Japan	NA
MGI	Mouse genome features, locations, alleles, and orthologs	Jackson Laboratory, US	(23)
Pancreatic Expression Database	Results from published literature	Barts Cancer Institute UK	(24)
ParameciumDB	Paramecium genome database	Centre National de la Recherche Scientifique (CNRS), France	(25)
Phytozome	Comparative genomics of green plants	Joint Genome Institute (JGI)/Center for Integrative Genomics (CIG), US	(26)
Potato Database	Potato and sweetpotato phenotypic and genomic information	International Potato Center (CIP), Peru	NA
PRIDE	Repository for protein and peptide identifications	European Bioinformatics Institute (EBI), UK	(45)
Regulatory Genomics Group^a	Predictive Models of Gene Regulation from High-Throughput Epigenomics Data	Universitat Pompeu Fabra (UPF), Spain	(27)
Rfam^a	The Rfam database is a collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models (CMs).	Wellcome Trust Sanger Institute (WTSI), UK	(28)
RhesusBase^a	A knowledgebase for the monkey research community	Peking University, China	(29)
Rice-Map	Rice (japonica and indica) genome annotation database	Peking University, China	(30)
SalmonDB	Genomic information for Atlantic salmon, rainbow trout, and related species	Center for Mathematical Modeling and Center for Genome Regulation (CMM), Chile	(31)
sigReannot	Aquaculture and farm animal species microarray probes re-annotation	INRA - French National Institute of Agricultural Research, France	(46)
UniProt	Protein sequence and functional information	European Bioinformatics Institute (EBI), UK	(32)
VectorBase	Genome information for invertebrate vectors of human pathogens	University of Notre Dame, US	(33)
VEGA	Manual annotation of vertebrate genome sequences	Wellcome Trust Sanger Institute (WTSI), UK	(34)
WormBase	C. elegans and related nematode genomic information	Cold Spring Harbor Laboratory (CSHL), US	(35)


^aDenotes new databases that have become available since last publication (9).

^bDenotes new databases that are not yet integrated into the portal.

PORTAL

The current version of the BioMart Community Portal operates two different instances of the web server: one implemented in Perl and the other in Java. Both servers support complex database searches and although they use different types of GUIs, they share the same navigation and query compilation logic based on selection of datasets, filters and attributes (9,10). The Java version of the portal also includes a section for specialized tools, which consists of the following: Sequence retrieval, ID Converter and Enrichment Analysis. Sequence retrieval allows easy querying of sequences while the ID Converter tool allows users to enter or upload a list of identifiers in any format (currently supported by Ensembl), and retrieve the same list converted to any other supported format. The enrichment tool supports enrichment analysis of genes in all species included in the current Ensembl release. For each of those species a broad range of gene identifiers is available. Furthermore, the tool supports cross species analysis using Ensembl homology data. For instance, it is possible to perform a one step enrichment analysis against a human disease dataset using experimental data from any of the species for which human homology data is available. Finally, the enrichment tool facilitates analysis of BED files containing genomic features such as Copy Number Variations or Differentially Methylated Regions. The output is provided in tabular and network graphic format (Figure 2).

Figure 2. The network graphic output of the BioMart enrichment tool. The Gene Ontology (GO) enrichment analysis was performed using a BED file containing human data.

The enrichment tool is also accessible through web services (Java version only). The programmatic access complies with the standard BioMart interface: dataset, filter and attribute.

WEB SERVICE

The BioMart Community Portal handles queries from several interfaces such as:

Perl API
Java API
Web interfaces
URL based access
RESTful web service
SPARQL

For a more detailed description of all the interfaces please refer to earlier publications (1,7). In the section below we describe and compare the REST-based web service implemented in Perl and its counterpart implemented in Java. It is worth noting that the web service maintains the same query interface in both the Perl and Java implementations. For example, the web service query (Figure 3A) can be run against the Java-based server as follows:

curl --data-urlencode query@query.xml http://central.biomart.org/martservice/results

or against its Perl-based counterpart as follows:

curl --data-urlencode query@query.xml http://www.biomart.org/biomart/martservice

By default, the query sets the processor attribute to ‘TSV’, requesting tab-delimited results (Figure 3B). Alternatively, setting the processor to ‘JSON’ returns JSON-formatted results (Figure 3C), which are readily consumable by third-party web-based clients and save the overhead of parsing and format translation. Please note that the JSON format is only available in the Java version.
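For readers who prefer to script the request rather than call curl, the following Python sketch assembles a query document and prepares the equivalent POST. The query XML of Figure 3A is not reproduced in this text, so apart from the `processor` attribute mentioned above, the element and attribute names used here (`Query`, `Dataset`, `Filter`, `Attribute`, `name`, `value`, `config`) are assumptions modelled on the dataset, filter and attribute abstractions. `example_filter` and `example_attribute` are placeholders, not real BioMart names. The request is only constructed, not sent.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def build_query_xml(dataset, config, filters, attributes, processor="TSV"):
    """Serialize a dataset/filter/attribute selection as a query document.

    Element and attribute names other than 'processor' are assumptions.
    """
    query = ET.Element("Query", {"processor": processor})
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "config": config})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


def build_request(url, query_xml):
    """Prepare the form-encoded POST that
    `curl --data-urlencode query@query.xml URL` would send."""
    body = urllib.parse.urlencode({"query": query_xml}).encode("ascii")
    return urllib.request.Request(url, data=body)


# Dataset and config names are taken from the metadata URLs in this article;
# the filter and attribute names are placeholders.
query_xml = build_query_xml(
    dataset="btaurus_snp",
    config="snp_config",
    filters={"example_filter": "example_value"},
    attributes=["example_attribute"],
    processor="JSON",  # per the text, JSON is available on the Java server only
)
request = build_request("http://central.biomart.org/martservice/results", query_xml)
print(request.get_method(), request.full_url)
```

Passing the prepared object to `urllib.request.urlopen` would submit it. Switching `processor` back to `"TSV"` requests the tab-delimited form instead.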

A simple way to compile a web service query for later programmatic use is to use one of the web GUIs and generate the query XML using the REST/SOAP button. After following the steps outlined by the GUI and clicking the ‘results’ button, the user needs to click the REST/SOAP button, save the query and run it as described above. Alternatively, a user can take advantage of the programmatic access to all the metadata defining marts, datasets, filters and attributes. Access to the metadata served by the Java and Perl BioMart servers is provided through the following web service requests:

Java (central.biomart.org)

registry information:
http://central.biomart.org/martservice/portal

available marts:
http://central.biomart.org/martservice/marts

datasets available for a config:
http://central.biomart.org/martservice/datasets?config=snp_config

attributes available for a dataset:
http://central.biomart.org/martservice/attributes?datasets=btaurus_snp&config=snp_config

filters available for a dataset:
http://central.biomart.org/martservice/filters?datasets=btaurus_snp&config=snp_config

Perl (www.biomart.org)

registry information:
http://www.biomart.org/biomart/martservice?type=registry

datasets available for a mart:
http://www.biomart.org/biomart/martservice?type=datasets&mart=ensembl

attributes available for a dataset:
http://www.biomart.org/biomart/martservice?type=attributes&dataset=oanatinus_gene_ensembl

filters available for a dataset:
http://www.biomart.org/biomart/martservice?type=filters&dataset=oanatinus_gene_ensembl

configuration for a dataset:
http://www.biomart.org/biomart/martservice?type=configuration&dataset=oanatinus_gene_ensembl
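The two URL schemes listed above differ in one respect: the Java server selects the metadata resource with a path segment, whereas the Perl server selects it with a `type` query parameter. A small sketch that composes both kinds of URL follows. The helper names `java_metadata_url` and `perl_metadata_url` are hypothetical; the base URLs and parameter names are those shown in the list.

```python
from urllib.parse import urlencode

JAVA_BASE = "http://central.biomart.org/martservice"
PERL_BASE = "http://www.biomart.org/biomart/martservice"


def java_metadata_url(resource, **params):
    """Java server: the resource is a path segment such as 'marts' or 'filters'."""
    url = f"{JAVA_BASE}/{resource}"
    return f"{url}?{urlencode(params)}" if params else url


def perl_metadata_url(kind, **params):
    """Perl server: the resource is selected with the 'type' query parameter."""
    return f"{PERL_BASE}?{urlencode({'type': kind, **params})}"


print(java_metadata_url("attributes", datasets="btaurus_snp", config="snp_config"))
print(perl_metadata_url("attributes", dataset="oanatinus_gene_ensembl"))
```

Note that the Java requests take a plural `datasets` parameter together with a `config`, while the Perl requests take a singular `dataset`.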

Please note that the granularity between mart and dataset has been improved in the Java version through the introduction of multiple dataset configs. This allows end users to browse various views of the same dataset, which are presented through the portal using either a different GUI or different subsets of the data.

QUERY EXAMPLES

Given the coverage of the current BioMart datasets, many relevant biological questions can be answered. For example, a researcher who has detected potentially pathogenic variants in FGFR2 (ENSG00000066468) from exome sequencing of patients may want to know whether the same variants have been described previously and whether they were associated with the same or similar diseases. To answer this, integrated data from Ensembl can be queried as shown in Table 2 to display all known variants annotated within FGFR2 that are predicted as pathogenic by SIFT (11) and PolyPhen (12). The genomic position outputs can be compared to the researcher's variants and the phenotype data used to assess candidacy for their cases. For example, the first batch of results shows a C->G variant at position 121520160 on chromosome 10 that is associated with Apert syndrome (OMIM:176943).

Table 2. Query to display phenotypic consequence for known, pathogenic variants in FGFR2.

Component	Selection
Database	Ensembl 78 Short Variations (WTSI, UK)
Dataset	Homo sapiens Short Variation (SNPs and indels) (GRCh38)
Filters	Ensembl Gene ID(s): ENSG00000066468; SIFT Prediction: deleterious; PolyPhen Prediction: probably damaging
Attributes	Chromosome name; Chromosome position start (bp); Chromosome position end (bp); Strand; Variant Alleles; Ensembl Gene ID; Consequence to transcript; Associated variation names; Study External Reference; Source name; Associated gene with phenotype; Phenotype description
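The comparison of the returned genomic positions with a researcher's own variants can be sketched as a simple keyed lookup over tab-delimited results. The column headers used below (`chromosome`, `position`, `alleles`, `phenotype`) and the `C/G` allele notation are assumptions made for illustration and are not the headers returned by the portal. The single known variant restates the Apert syndrome example from the text, and the second entry in `own` is a made-up position that is not expected to match.

```python
import csv
import io

# Illustrative result text. The column headers are assumptions; the single
# data row restates the Apert syndrome variant quoted in the text, with the
# C->G change written as "C/G".
RESULT_TSV = (
    "chromosome\tposition\talleles\tphenotype\n"
    "10\t121520160\tC/G\tApert syndrome\n"
)


def index_known_variants(tsv_text):
    """Index tab-delimited result rows by (chromosome, position, alleles)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    index = {}
    for row in reader:
        key = (row["chromosome"], int(row["position"]), row["alleles"])
        index.setdefault(key, []).append(row["phenotype"])
    return index


def match_variants(own_variants, known):
    """Return the researcher's variants that were previously described."""
    return {v: known[v] for v in own_variants if v in known}


# The researcher's calls: the first matches the known variant, the second
# is a made-up position included to show a non-match.
own = [("10", 121520160, "C/G"), ("10", 121500000, "A/T")]
matches = match_variants(own, index_known_variants(RESULT_TSV))
for variant, phenotypes in matches.items():
    print(variant, phenotypes)
```

In practice the strand attribute would also need to be taken into account before comparing alleles, since the same change can be reported on either strand.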


Another common use case for BioMart is analysing a list of genes to establish whether they are associated with particular protein functions, pathways or diseases more often than would be expected by chance (enrichment analysis). For example, a researcher may have discovered that AURKA, AURKB, AURKC, PLK1, CDK1 and CDK4 are differentially expressed in their experiment and used BioMart's enrichment tool with its default settings to analyse these genes. The results show that these genes are enriched for involvement in the cell cycle, kinase activity and mitotic nuclear division, amongst others. Many other real usage examples are documented in our previous paper (10) and the BioMart special issue in Database: the journal of biological databases and biocuration (www.oxfordjournals.org/our_journals/databa/biomart_virtual_issue.html).
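To make "more often than would be expected by chance" concrete, one widely used formulation of enrichment is the one-sided hypergeometric test, sketched below. The text does not state which statistical test the BioMart enrichment tool applies, so this should be read as a generic illustration rather than a description of the tool. The background and annotation counts in the example are made-up numbers.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """One-sided hypergeometric tail probability P(X >= hits).

    population: number of genes in the background
    annotated:  background genes carrying the annotation of interest
    sample:     size of the submitted gene list
    hits:       genes in the list that carry the annotation
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(comb(annotated, i) * comb(population - annotated, sample - i)
               for i in range(hits, upper + 1))
    return tail / total


# Six submitted genes, as in the example above. The background size (20000)
# and the number of annotated background genes (600) are hypothetical.
p = enrichment_p_value(population=20000, annotated=600, sample=6, hits=6)
print(f"p = {p:.3g}")
```

A small p-value indicates that the overlap between the gene list and the annotation is unlikely under random sampling. When many annotation terms are tested at once, a multiple-testing correction is normally applied as well.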

CONCLUSIONS

Since its conception as a data-mining interface for the Human Genome Project (13) BioMart has rapidly grown to become an international collaboration involving a large number of different groups and organizations both in academia and in industry (14). It has been successfully applied to many different types of data including genomics, proteomics, model organisms, cancer data, etc., proving that its generic data model is widely applicable (15–53). BioMart has also provided a first successful solution for the unprecedented data management needs of the International Cancer Genome Consortium proving that the federated model scales well with the amounts of data generated by Next Generation Sequencing (48).

There are a number of important factors that contributed to BioMart's success and its adoption by many different types of projects around the world as their data management platform. BioMart's ability to quickly deploy a website hosting any type of data, its user-friendly GUI, its several programmatic interfaces and its support for third party tools have proved attractive to data managers who were in need of a rapid and reliable solution for their user community. BioMart has also proven to be a platform of choice for many smaller organizations that lack the necessary resources to embark on the development of their own data management solution. As a result, more and more database projects have become accessible through the BioMart interface. The arrival of these new resources, coupled with the data federation technology provided by the BioMart software, has galvanized the creation of the BioMart Community Portal. The federated model has proven to be very cost-effective since all development and maintenance of individual databases is left to the individual data providers. It has also proven to be very scalable, as the internet and database traffic is handled by the local BioMart servers. As a result, the BioMart Community Portal service has grown impressively not only in terms of available data but also in the level of service. The BioMart Community Portal now averages over one million requests per day. Building on this level of service and the wealth of information that has become accessible through the BioMart interface, the BioMart Community Portal has effectively introduced a new, more scalable and much more cost-effective alternative to the large data stores maintained by specialized organizations.

Acknowledgments

We are grateful to the following organizations for providing support for the BioMart project: European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK; Ontario Institute for Cancer Research, Toronto, Canada; San Raffaele Scientific Institute, Milan, Italy and King Abdulaziz University, Jeddah, Saudi Arabia.

FUNDING

The BioMart Community Portal is a collaborative, community effort and as such it is the product of the efforts of dozens of different groups and organizations. The individual data sources that the portal comprises are funded separately and independently. In particular: Wellcome Trust [077012/Z/05/Z to COSMIC mart]; Spanish Government [BIO2011–23920 and CSD2009–00080 to BioMart database of the Regulatory Genomics group at Pompeu Fabra University]; Sandra Ibarra Foundation for Cancer [FSI2013]; Breast Cancer Campaign Tissue Bank [09TBBAR to BCCTB bioinformatics portal]; Office of Science of the U.S. Department of Energy [DE-AC02–05CH11231 to Phytozome]; Global Frontier Project (to i-Pharm research) funded by the Ministry of Science, ICT and Future Planning through the National Research Foundation of Korea (NRF-2013M3A6A4043695); Agence Nationale de la Recherche [ANR-10-BLAN-1122, ANR-12-BSV6–0017–03, ANR-14-CE10–0005–03 to ParameciumDB and cilDB]; Centre National de la Recherche Scientifique; Center for Genome Regulation [SalmonDB; Fondap-1509007 to A.M. and A.D.G.]; Center for Mathematical Modelling [Basal-PFB 03 to A.M. and A.D.G.]; Wellcome Trust (WT095908 and WT098051 to R.K., T.M. and A.Z.); European Molecular Biology Laboratory; Japanese Ministry of Education, Culture, Sports, Science and Technology [FANTOM5 BioMart; for RIKEN OSC and RIKEN PMI to Yoshihide Hayashizaki, and for RIKEN CLST]; Deanship of Scientific Research (DSR) King Abdulaziz University (96–130–35-HiCi to M.H.A., A.M.M., A.A.S. and A.K.). Funding for open access charge: King Abdulaziz University.

Conflict of interest statement. None declared.

REFERENCES

1. Zhang J., Haider S., Baran J., Cros A., Guberman J.M., Hsu J., Liang Y., Yao L., Kasprzyk A. BioMart: a data federation framework for large collaborative projects. Database. 2011:bar038. doi: 10.1093/database/bar038.
2. Hull D., Wolstencroft K., Stevens R., Goble C., Pocock M.R., Li P., Oinn T. Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 2006;34:W729–W732. doi: 10.1093/nar/gkl320.
3. Giardine B., Riemer C., Hardison R.C., Burhans R., Elnitski L., Shah P., Zhang Y., Blankenberg D., Albert I., Taylor J., et al. Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 2005;15:1451–1455. doi: 10.1101/gr.4086505.
4. Cline M.S., Smoot M., Cerami E., Kuchinsky A., Landys N., Workman C., Christmas R., Avila-Campilo I., Creech M., Gross B., et al. Integration of biological networks and gene expression data using Cytoscape. Nat. Protoc. 2007;2:2366–2382. doi: 10.1038/nprot.2007.324.
5. Durinck S., Moreau Y., Kasprzyk A., Davis S., De Moor B., Brazma A., Huber W. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics. 2005;21:3439–3440. doi: 10.1093/bioinformatics/bti525.
6. Reimers M., Carey V.J. Bioconductor: an open source framework for bioinformatics and computational biology. Methods Enzymol. 2006;411:119–134. doi: 10.1016/S0076-6879(06)11008-3.
7. Haider S., Ballester B., Smedley D., Zhang J., Rice P., Kasprzyk A. BioMart Central Portal–unified access to biological data. Nucleic Acids Res. 2009;37:W23–W27. doi: 10.1093/nar/gkp265.
8. Mariette J., Noirot C., Nabihoudine I., Bardou P., Hoede C., Djari A., Cabau C., Klopp C. RNAbrowse: RNA-Seq de novo assembly results browser. PLoS One. 2014;9:e96821. doi: 10.1371/journal.pone.0096821.
9. Guberman J.M., Ai J., Arnaiz O., Baran J., Blake A., Baldock R., Chelala C., Croft D., Cros A., Cutts R.J., et al. BioMart Central Portal: an open database network for the biological community. Database. 2011:bar041. doi: 10.1093/database/bar041.
10. Smedley D., Haider S., Ballester B., Holland R., London D., Thorisson G., Kasprzyk A. BioMart–biological queries made easy. BMC Genomics. 2009;10:22. doi: 10.1186/1471-2164-10-22.
11. Ng P.C., Henikoff S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31:3812–3814. doi: 10.1093/nar/gkg509.
12. Adzhubei I.A., Schmidt S., Peshkin L., Ramensky V.E., Gerasimova A., Bork P., Kondrashov A.S., Sunyaev S.R. A method and server for predicting damaging missense mutations. Nat. Methods. 2010;7:248–249. doi: 10.1038/nmeth0410-248.
13. Kasprzyk A., Keefe D., Smedley D., London D., Spooner W., Melsopp C., Hammond M., Rocca-Serra P., Cox T., Birney E. EnsMart: a generic system for fast and flexible access to biological data. Genome Res. 2004;14:160–169. doi: 10.1101/gr.1645104.
14. Kasprzyk A. BioMart: driving a paradigm change in biological data management. Database. 2011:bar049. doi: 10.1093/database/bar049.
15. Mallon A.M., Iyer V., Melvin D., Morgan H., Parkinson H., Brown S.D., Flicek P., Skarnes W.C. Accessing data from the International Mouse Phenotyping Consortium: state of the art and future plans. Mamm. Genome. 2012;23:641–652. doi: 10.1007/s00335-012-9428-9.
16. Lizio M., Harshbarger J., Shimoji H., Severin J., Kasukawa T., Sahin S., Abugessaisa I., Fukuda S., Hori F., Ishikawa-Kato S., et al. Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome Biol. 2015;16:22. doi: 10.1186/s13059-014-0560-6.
17. Lardenois A., Gattiker A., Collin O., Chalmel F., Primig M. GermOnline 4.0 is a genomics gateway for germline development, meiosis and the mitotic cell cycle. Database. 2010:baq030. doi: 10.1093/database/baq030.
18. Steinbach D., Alaux M., Amselem J., Choisne N., Durand S., Flores R., Keliet A.O., Kimmel E., Lapalu N., Luyten I., et al. GnpIS: an information system to integrate genetic and genomic data from plants and fungi. Database. 2013:bat058. doi: 10.1093/database/bat058.
19. Beck T., Hastings R.K., Gollapudi S., Free R.C., Brookes A.J. GWAS Central: a comprehensive resource for the comparison and interrogation of genome-wide association studies. Eur. J. Hum. Genet. 2014;22:949–952. doi: 10.1038/ejhg.2013.274.
20. International HapMap Consortium. The International HapMap Project. Nature. 2003;426:789–796. doi: 10.1038/nature02168.
21. Povey S., Lovering R., Bruford E., Wright M., Lush M., Wain H. The HUGO Gene Nomenclature Committee (HGNC). Hum. Genet. 2001;109:678–680. doi: 10.1007/s00439-001-0615-0.
22. Lee H.S., Bae T., Lee J.H., Kim D.G., Oh Y.S., Jang Y., Kim J.T., Lee J.J., Innocenti A., Supuran C.T., et al. Rational drug repositioning guided by an integrated pharmacological network of protein, disease and drug. BMC Syst. Biol. 2012;6:80. doi: 10.1186/1752-0509-6-80.
23. Shaw D.R. Searching the Mouse Genome Informatics (MGI) resources for information on mouse biology from genotype to phenotype. Curr. Protoc. Bioinformatics. 2009. doi: 10.1002/0471250953.bi0107s25.
24. Dayem Ullah A.Z., Cutts R.J., Ghetia M., Gadaleta E., Hahn S.A., Crnogorac-Jurcevic T., Lemoine N.R., Chelala C. The pancreatic expression database: recent extensions and updates. Nucleic Acids Res. 2014;42:D944–D949. doi: 10.1093/nar/gkt959.
25. Arnaiz O., Sperling L. ParameciumDB in 2011: new tools and new data for functional and comparative genomics of the model ciliate Paramecium tetraurelia. Nucleic Acids Res. 2011;39:D632–D636. doi: 10.1093/nar/gkq918.
26. Goodstein D.M., Shu S., Howson R., Neupane R., Hayes R.D., Fazo J., Mitros T., Dirks W., Hellsten U., Putnam N., et al. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 2012;40:D1178–D1186. doi: 10.1093/nar/gkr944.
27. Althammer S., Pages A., Eyras E. Predictive models of gene regulation from high-throughput epigenomics data. Comp. Funct. Genomics. 2012;2012:284786. doi: 10.1155/2012/284786.
28. Burge S.W., Daub J., Eberhardt R., Tate J., Barquist L., Nawrocki E.P., Eddy S.R., Gardner P.P., Bateman A. Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 2013;41:D226–D232. doi: 10.1093/nar/gks1005.
29. Zhang S.J., Liu C.J., Shi M., Kong L., Chen J.Y., Zhou W.Z., Zhu X., Yu P., Wang J., Yang X., et al. RhesusBase: a knowledgebase for the monkey research community. Nucleic Acids Res. 2013;41:D892–D905. doi: 10.1093/nar/gks835.
30. Wang J., Kong L., Zhao S., Zhang H., Tang L., Li Z., Gu X., Luo J., Gao G. Rice-Map: a new-generation rice genome browser. BMC Genomics. 2011;12:165. doi: 10.1186/1471-2164-12-165.
31. Di Genova A., Aravena A., Zapata L., Gonzalez M., Maass A., Iturra P. SalmonDB: a bioinformatics resource for Salmo salar and Oncorhynchus mykiss. Database. 2011:bar050. doi: 10.1093/database/bar050.
32. UniProt Consortium. Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2014;42:D191–D198. doi: 10.1093/nar/gkt1140.
33. Megy K., Emrich S.J., Lawson D., Campbell D., Dialynas E., Hughes D.S., Koscielny G., Louis C., Maccallum R.M., Redmond S.N., et al. VectorBase: improvements to a bioinformatics resource for invertebrate vector genomics. Nucleic Acids Res. 2012;40:D729–D734. doi: 10.1093/nar/gkr1089.
34. Harrow J.L., Steward C.A., Frankish A., Gilbert J.G., Gonzalez J.M., Loveland J.E., Mudge J., Sheppard D., Thomas M., Trevanion S., et al. The Vertebrate Genome Annotation browser 10 years on. Nucleic Acids Res. 2014;42:D771–D779. doi: 10.1093/nar/gkt1241.
35. Harris T.W., Baran J., Bieri T., Cabunoc A., Chan J., Chen W.J., Davis P., Done J., Grove C., Howe K., et al. WormBase 2014: new views of curated biology. Nucleic Acids Res. 2014;42:D789–D793. doi: 10.1093/nar/gkt1063.
36. Dassi E., Re A., Leo S., Tebaldi T., Pasini L., Peroni D., Quattrone A. AURA 2: Empowering discovery of post-transcriptional networks. Translation. 2014;2:e27738. doi: 10.4161/trla.27738.
37. Cutts R.J., Guerra-Assuncao J.A., Gadaleta E., Dayem Ullah A.Z., Chelala C. BCCTBbp: the Breast Cancer Campaign Tissue Bank bioinformatics portal. Nucleic Acids Res. 2015;43:D831–D836. doi: 10.1093/nar/gku984.
38. Arnaiz O., Cohen J., Tassin A.M., Koll F. Remodeling Cildb, a popular database for cilia and links for ciliopathies. Cilia. 2014;3:9. doi: 10.1186/2046-2530-3-9.
39. Shepherd R., Forbes S.A., Beare D., Bamford S., Cole C.G., Ward S., Bindal N., Gunasekaran P., Jia M., Kok C.Y., et al. Data mining using the Catalogue of Somatic Mutations in Cancer BioMart. Database. 2011;2011:bar018. doi: 10.1093/database/bar018.
40. Stevenson P., Richardson L., Venkataraman S., Yang Y., Baldock R. The BioMart interface to the eMouseAtlas gene expression database EMAGE. Database. 2011;2011:bar029. doi: 10.1093/database/bar029.
41. Kinsella R.J., Kahari A., Haider S., Zamora J., Proctor G., Spudich G., Almeida-King J., Staines D., Derwent P., Kerhornou A., et al. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database. 2011;2011:bar030. doi: 10.1093/database/bar030.
42. Diez-Roux G., Banfi S., Sultan M., Geffers L., Anand S., Rozado D., Magen A., Canidio E., Pagani M., Peluso I., et al. A high-resolution anatomical atlas of the transcriptome in the mouse embryo. PLoS Biol. 2011;9:e1000582. doi: 10.1371/journal.pbio.1000582.
43. Spooner W., Youens-Clark K., Staines D., Ware D. GrameneMart: the BioMart data portal for the Gramene project. Database. 2012;2012:bar056. doi: 10.1093/database/bar056.
44. Jones P., Binns D., McMenamin C., McAnulla C., Hunter S. The InterPro BioMart: federated query and web service access to the InterPro Resource. Database. 2011;2011:bar033. doi: 10.1093/database/bar033.
45. Ndegwa N., Cote R.G., Ovelleiro D., D'Eustachio P., Hermjakob H., Vizcaino J.A., Croft D. Critical amino acid residues in proteins: a BioMart integration of Reactome protein annotations with PRIDE mass spectrometry data and COSMIC somatic mutations. Database. 2011;2011:bar047. doi: 10.1093/database/bar047.
46. Moreews F., Rauffet G., Dehais P., Klopp C. SigReannot-mart: a query environment for expression microarray probe re-annotations. Database. 2011;2011:bar025. doi: 10.1093/database/bar025.
47. Cutts R.J., Gadaleta E., Lemoine N.R., Chelala C. Using BioMart as a framework to manage and query pancreatic cancer data. Database. 2011;2011:bar024. doi: 10.1093/database/bar024.
48. Zhang J., Baran J., Cros A., Guberman J.M., Haider S., Hsu J., Liang Y., Rivkin E., Wang J., Whitty B., et al. International Cancer Genome Consortium Data Portal–a one-stop shop for cancer genomics data. Database. 2011;2011:bar026. doi: 10.1093/database/bar026.
49. Oakley D.J., Iyer V., Skarnes W.C., Smedley D. BioMart as an integration solution for the International Knockout Mouse Consortium. Database. 2011;2011:bar028. doi: 10.1093/database/bar028.
50. Croft D., O'Kelly G., Wu G., Haw R., Gillespie M., Matthews L., Caudy M., Garapati P., Gopinath G., Jassal B., et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011;39:D691–D697. doi: 10.1093/nar/gkq1018.
51. Perez-Llamas C., Gundem G., Lopez-Bigas N. Integrative cancer genomics (IntOGen) in Biomart. Database. 2011;2011:bar039. doi: 10.1093/database/bar039.
52. Koscielny G., Yaikhom G., Iyer V., Meehan T.F., Morgan H., Atienza-Herrero J., Blake A., Chen C.K., Easty R., Di Fenza A., et al. The International Mouse Phenotyping Consortium Web Portal, a unified point of access for knockout mice and related phenotyping data. Nucleic Acids Res. 2014;42:D802–D809. doi: 10.1093/nar/gkt977.
53. Wilkinson P., Sengerova J., Matteoni R., Chen C.K., Soulat G., Ureta-Vidal A., Fessele S., Hagn M., Massimi M., Pickford K., et al. EMMA–mouse mutant resources for the international scientific community. Nucleic Acids Res. 2010;38:D570–D576. doi: 10.1093/nar/gkp799.
