The value of society’s investment in science is strongly dependent on the ability of future scientists to build on previous results. As the pace of scientific productivity has accelerated through genomics and other Big Data technologies, scientists are increasingly dependent on computational tools to find information relevant to their own research. My group is working on projects to make published data more reusable through biocuration, which is the association of biological data with annotations that can be used in computer-based data mining and analyses.
The work involves using controlled vocabularies, such as ontologies, to provide both consistent terminology and a structured data format for the capture of biological information. An ontology consists of a controlled vocabulary of defined terms with unique identifiers and precise relationships to each other.
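To make that definition concrete, here is a minimal sketch of an ontology as a data structure. The term identifiers, names, and definitions are hypothetical placeholders invented for illustration, not real Gene Ontology or OMP terms, and the only relationship modeled is an assumed "is_a" link from a term to its more general parent.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One ontology term: a unique ID, a name, a definition, and is_a parents."""
    term_id: str
    name: str
    definition: str
    is_a: list[str] = field(default_factory=list)


# Hypothetical three-term ontology; the IDs and wording are illustrative only.
ONTOLOGY = {
    "EX:0000001": Term("EX:0000001", "microbial phenotype",
                       "An observable characteristic of a microorganism."),
    "EX:0000002": Term("EX:0000002", "sporulation phenotype",
                       "A phenotype relating to spore formation.",
                       is_a=["EX:0000001"]),
    "EX:0000003": Term("EX:0000003", "endospore formation phenotype",
                       "A phenotype relating to formation of an endospore.",
                       is_a=["EX:0000002"]),
}


def ancestors(term_id: str, ontology: dict[str, Term]) -> set[str]:
    """Return the IDs of every term reachable by following is_a links upward."""
    seen: set[str] = set()
    stack = list(ontology[term_id].is_a)
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(ontology[parent].is_a)
    return seen


print(sorted(ancestors("EX:0000003", ONTOLOGY)))
```

Because each term carries explicit relationships, an annotation made to a specific term can also be found by a query for any of its more general ancestors.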
We use the Gene Ontology, a widely adopted standard, to make functional annotations of the gene products of bacteria and their viruses (bacteriophages). In addition, with collaborators at the Institute for Genome Sciences at the University of Maryland School of Medicine, we have developed a new ontology for capturing phenotype information about all microorganisms: the Ontology for Microbial Phenotypes (OMP).
Phenotypes are the observable characteristics of an organism that result from the expression of a particular genotype in a particular environment. For example, eye color, number of seeds per pod, and coat color are phenotypic traits that can be observed in flies, legumes, and cats, respectively. Phenotype-genotype associations in well-studied model organisms can be a powerful tool for predicting biological function in less well-studied organisms. However, using known phenotypic information from one organism to predict possible phenotypes in others requires that the information be stored in a consistent, computable format that supports data integration and mining.
Until recently, phenotypic information has largely been captured as free-text descriptions in primary research papers. The ambiguities of natural language confound attempts to retrieve information across sources. For example, “serotype” and “serovar” refer to the same phenotype, but a simple text-based computer query for either word alone would miss records that use the other. A single term may also be ambiguous: “sporulation” describes the general process of spore formation, so a search for “sporulation” alone would return results both for sporulation to survive adverse conditions (such as endospore formation in Gram-positive bacteria) and for sporulation for the purpose of reproduction (such as that found in the actinomycetes).
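The synonym problem described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the two records, the term identifier EX:0000010, and the synonym table are hypothetical and do not come from OMP or any real database.

```python
# Free-text records as they might appear in papers: (source, description).
FREE_TEXT = [
    ("paper A", "strains were grouped by serotype"),
    ("paper B", "the serovar of each isolate was determined"),
]

# Hypothetical synonym table: both words resolve to one term identifier.
SYNONYM_TO_ID = {
    "serotype": "EX:0000010",
    "serovar": "EX:0000010",
}

# The same records after curation, annotated with the shared term ID.
ANNOTATED = [
    ("paper A", "EX:0000010"),
    ("paper B", "EX:0000010"),
]


def text_search(word, records):
    """Naive substring search over free-text descriptions."""
    return [source for source, text in records if word.lower() in text.lower()]


def term_search(word, synonyms, annotations):
    """Resolve the word to a term ID, then match annotations by that ID."""
    term_id = synonyms.get(word.lower())
    if term_id is None:
        return []
    return [source for source, annotated_id in annotations if annotated_id == term_id]


print(text_search("serotype", FREE_TEXT))
print(term_search("serotype", SYNONYM_TO_ID, ANNOTATED))
```

The substring search finds only the record that happens to use the queried word, while the identifier-based search retrieves both. The ambiguity of “sporulation” could be handled in the same spirit by giving each meaning its own term identifier.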
Phenotype and anatomy ontologies are currently in common use for many eukaryotic organisms, including fungi. However, none of the existing ontologies is suitable for comprehensively capturing phenotypes of Bacteria, Archaea, or their phages, or for making comparisons across microbial species. This was the impetus for developing OMP. The OMP ontology and annotations of bacterial and fungal phenotypes can be accessed via a wiki-based ontology browser: microbialphenotypes.org. OMP releases can be downloaded from GitHub.