ONJava.com -- The Independent Source for Enterprise Java

Java APIs for Bioinformatics

by Stephen Montgomery
03/10/2004

Samuel Johnson said, "Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information on it." This is likely the most common experience among computational biologists. Most of us, having previously relied on either a purely computational or a purely biological sciences background, are now confronted with the problem of learning how to speak fluently in both. Adapting to a new problem space is part of the trade, so this is not an uncommon task for programmers. But in a fast-paced industry, where developers frequently drive requirements, it can be a challenge.

Computational biologists need to be aware of available resources and consistently practice code sharing and reuse. To aid in this, several collaborative initiatives are attempting to provide easy-to-use, open source interfaces for managing life sciences data. However, the design and maintenance of these APIs present their own challenges. In this article, we will explore some of the Java-based bioinformatics APIs that are being developed to reduce the amount of time developers spend (re)implementing common tasks and complex data structures. We will also address some of the challenges faced by API-level developers in the life sciences.

The Tools of the Trade

Bioinformatics-based APIs are usually presented to aspiring bioinformaticians as resources for acquiring or manipulating data. For instance, the exposure may begin when a student is asked to programmatically find the number of genes on chromosome 1. The student will likely connect to one of the many worldwide data stores using a SQL-derived API and request all the genes for this chromosome. Alternatively, the exposure may begin when the student is asked to transfer data between two common file formats. Either way, the expectation is that APIs are useful for acquiring data and performing simple formatting tasks. While this expectation is not inaccurate, the tools of the trade include APIs that not only aid in performing common tasks but also successfully embody rich biological knowledge. In this section, we will explore several bioinformatics-based APIs that aid in acquiring, processing, and defining biological data, each in its own way. (This list is not exhaustive, and I encourage developers to mention their projects in the talkback section.)

BioJava

BioJava is "an open source project dedicated to providing Java tools for processing biological data." BioJava's goal is to create an API that automates common bioinformatics tasks while providing a foundation for bioinformatics-based software projects. It contains basic tools to transfer protein or DNA-sequence data between common file formats, attach annotation to sequence data, and perform simple statistics. It also contains advanced tools that, among other things, allow developers to generate random sequences based on distributions, construct dynamic programming algorithms, and view simple biological data in a GUI. BioJava comes complete with the aptly named BioJava in Anger resource, which aids frustrated developers in quickly achieving results through several short coding vignettes. BioJava is a good solution for when a developer wants to move sequence and annotation data through a high-throughput analysis process.

Example 1. Using BioJava to change from EMBL to GFF format

This code sample acts as an entry point to working with BioJava. Here we are changing an EMBL file describing the Huntington Disease protein into a GFF-formatted file.

//REQUIRED IMPORTS
import java.io.*;
import java.util.*;
import org.biojava.bio.*;
import org.biojava.bio.seq.*;
import org.biojava.bio.seq.io.*;
import org.biojava.bio.program.gff.*;

//EXAMPLE 1
try {
   // Read every sequence entry from the EMBL file
   BufferedReader br = new BufferedReader(new FileReader("hd.embl"));
   SequenceIterator seqs = SeqIOTools.readEmbl(br);
   SequencesAsGFF sgff = new SequencesAsGFF();
   // Keep a handle on the PrintWriter so it can be flushed and closed
   PrintWriter out = new PrintWriter(new FileWriter("hd.gff"));
   GFFWriter gffwriter = new GFFWriter(out);
   while (seqs.hasNext()) {
      Sequence seq = seqs.nextSequence();
      // Write the features of this sequence as GFF records
      sgff.processSequence(seq, gffwriter);
   }
   out.close();
   br.close();
}
catch (BioException ex) {
   System.err.println("Not EMBL or wrong alphabet");
   ex.printStackTrace();
}
catch (NoSuchElementException ex) {
   System.err.println("No EMBL sequences in file");
   ex.printStackTrace();
}
catch (FileNotFoundException ex) {
   System.err.println("File not found");
   ex.printStackTrace();
}
catch (IOException ie) {
   System.err.println("Output error");
   ie.printStackTrace();
}

Figure 1. The HD GFF file imported into Sockeye. The length of the repeat in the first exon is related to the severity of Huntington disease in an afflicted individual.
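
For readers unfamiliar with the target format, a GFF record is a single tab-delimited line with nine columns: sequence name, source, feature type, start, end, score, strand, frame, and attributes. The sketch below builds such a line in plain Java, without BioJava. The class name GffLineSketch, the helper gffLine, and the coordinates are illustrative assumptions, not values taken from hd.embl.

```java
import java.util.StringJoiner;

// Hypothetical helper class for illustration; it is not part of BioJava.
final class GffLineSketch {

    // Builds one tab-delimited GFF record; "." marks an undefined score or frame.
    static String gffLine(String seqName, String source, String feature,
                          int start, int end, char strand, String attributes) {
        if (start < 1 || end < start) {
            throw new IllegalArgumentException(
                "GFF coordinates are 1-based with start <= end");
        }
        StringJoiner line = new StringJoiner("\t");
        line.add(seqName).add(source).add(feature)
            .add(Integer.toString(start)).add(Integer.toString(end))
            .add(".")                       // score: not defined here
            .add(String.valueOf(strand))    // '+', '-' or '.'
            .add(".")                       // frame: not defined here
            .add(attributes);
        return line.toString();
    }

    public static void main(String[] args) {
        // Made-up exon coordinates, for illustration only
        System.out.println(gffLine("HD", "EMBL", "exon", 1, 100, '+', "gene \"HD\""));
    }
}
```

Because every record is one line with fixed columns, GFF output of this kind is easy to load into viewers or to filter with ordinary text tools.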

caBio

caBio (Cancer Bioinformatics Infrastructure Objects) is one component of the National Cancer Institute's Center for Bioinformatics (NCICB) caCORE research management system. The caBio API contains implementations of various biomedical objects to facilitate consistent data representation and data integration projects. Furthermore, caBio supports SOAP and HTTP-XML interfaces to either the NCICB's hosted data or a user's own caBio server. caBio's clear implementation of these interfaces allows for the easy querying and traversing of biological data. The 262-page technical document (version 2), which includes several examples of caCORE's diverse functionality, allows users to quickly get up to speed with the caBio API. caBio is a great solution for bioinformatics developers who want the benefits of defined biomedical objects, coupled with search criteria objects capable of cross-platform data exchange and retrieval.

Example 2. Using caBio to analyze the HD gene

This example is a derivative of one of the many examples that can be found on the NCICB's caBio site. It shows how you can easily use caBio to obtain gene information and traverse associated relationships.

//REQUIRED IMPORTS
import gov.nih.nci.caBIO.bean.*;

//EXAMPLE 2
Gene myGene = new Gene();
GeneSearchCriteria criteria = new GeneSearchCriteria();
criteria.setSymbol("HD");

SearchResult result = myGene.search(criteria);
if (result != null) {
   Gene[] genes = (Gene[]) result.getResultSet();
   for (int i = 0; i < genes.length; i++) {
      System.out.println("\nInformation regarding Gene Name " 
                         + genes[i].getName());
      System.out.println("\nInformation regarding Gene Title "
                         + genes[i].getTitle());
      System.out.println("\nInformation regarding Gene Organism " 
                         + genes[i].getOrganismAbbreviation());
      System.out.println("\tOMIM Id: " + genes[i].getOMIMId());
      System.out.println("\tUnigene Cluster Id: " 
                         + genes[i].getClusterId());
      System.out.println("\tLocusLink Id: " 
                         + genes[i].getLocusLinkId());

      Protein[] proteins = genes[i].getProteins();
      if (proteins.length > 0) {
         System.out.println("\nAssociated Protein :");
         for (int k = 0; k < proteins.length; k++) {
            ProteinHomolog[] pHomologs = 
                    proteins[k].getProteinHomologs();
            if (pHomologs.length > 0) {
               System.out.println("\nAssociated Protein Homologs:");
               for (int l = 0; l < pHomologs.length; l++) {
                  Taxon homologTaxon = pHomologs[l].getTaxon();
                  System.out.println("\tProtein Homolog Taxon:" + 
                       homologTaxon.getScientificName());
                  System.out.println("\tProtein Homolog Alignment Length:"
                       + pHomologs[l].getAlignmentLength());
                  System.out.println("\tProtein Homolog Percentage Similarity:"
                       + pHomologs[l].getSimilarityPercentage() + "%");
               }
            }
         }
      }
   }
}

ENSJ

ENSJ is the Java implementation of the EnsEMBL driver and data adaptors. ENSJ allows a developer to access sequence or annotation information stored in the EnsEMBL database (see "Java for Bioinformatics" for more information about EnsEMBL and ENSJ). Recently, a new prototype API, called MartJ, has been developed that allows developers to access EnsEMBL's Mart database (EnsEMBL Mart is a database focussed on fast and flexible multi-organism data-mining). While meaningful data adaptors significantly aid in aggregating data and traversing relationships, most SQL-based APIs still require users to open and manage connections to remote MySQL servers (or host a local mirror). The advantage of this approach is that a developer can customize data requests and easily verify their accuracy and completeness; the disadvantage is that schema changes between database releases cause version incompatibilities. It is likely that genome databases will evolve toward web-service-based approaches for common data requests. I will talk a bit more about an alternate solution in the discussion of LSIDs below.

Example 3. Counting the number of genes on the first million bases of chromosome 1 using ENSJ

This code example counts the genes on both the forward and reverse strands (DNA is double-stranded). Also see ENSJ Examples or "Java for Bioinformatics" for further examples.

//REQUIRED IMPORTS
import java.util.ArrayList;
import java.util.Properties;
import org.ensembl.datamodel.AssemblyLocation;
import org.ensembl.datamodel.Location;
import org.ensembl.driver.AdaptorException;
import org.ensembl.driver.ConfigurationException;
import org.ensembl.driver.Driver;
import org.ensembl.driver.DriverManager;
import org.ensembl.driver.GeneAdaptor;

//EXAMPLE 3
try {
   Properties configProps = new Properties();
   configProps.setProperty("ensembl_driver",
       "org.ensembl.driver.plugin.standard.MySQLDriver");
   configProps.setProperty("host", "ensembldb.ensembl.org");
   configProps.setProperty("user", "anonymous");
   configProps.setProperty("database", "homo_sapiens_core_18_34");

   Driver driver = DriverManager.load(configProps);

   try {
      GeneAdaptor ga = driver.getGeneAdaptor();
      Location location_for = new AssemblyLocation("1", 1, 1000000, 1);
      ArrayList genes_for = (ArrayList) ga.fetch(location_for);
      System.out.println(genes_for.size());
      Location location_rev = new AssemblyLocation("1", 1, 1000000, -1);
      ArrayList genes_rev = (ArrayList) ga.fetch(location_rev);
      System.out.println(genes_rev.size());
   }
   catch (AdaptorException ae) {
      ae.printStackTrace();
   }
}
catch (ConfigurationException ce) {
   ce.printStackTrace();
}
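
What Example 3 asks of the database can be pictured with a small in-memory model: each gene has a start, an end, and a strand, and a query location selects the genes that overlap it on the requested strand. The sketch below is plain Java with hypothetical types (GeneRecord, countOverlapping) and made-up coordinates. It does not use ENSJ and makes no claim about how ENSJ evaluates a Location internally.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory model for illustration; it is not part of ENSJ.
final class StrandCountSketch {

    // A gene reduced to the fields needed for a location query.
    static final class GeneRecord {
        final String name;
        final int start;
        final int end;
        final int strand;   // +1 for forward, -1 for reverse

        GeneRecord(String name, int start, int end, int strand) {
            this.name = name;
            this.start = start;
            this.end = end;
            this.strand = strand;
        }
    }

    // Counts the genes on the given strand that overlap [start, end].
    static int countOverlapping(List<GeneRecord> genes, int start, int end, int strand) {
        int count = 0;
        for (GeneRecord g : genes) {
            boolean overlaps = g.start <= end && g.end >= start;
            if (overlaps && g.strand == strand) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up genes; names and coordinates are illustrative only
        List<GeneRecord> genes = Arrays.asList(
            new GeneRecord("geneA", 1000, 5000, 1),
            new GeneRecord("geneB", 900000, 1100000, -1),
            new GeneRecord("geneC", 2000000, 2100000, 1),
            new GeneRecord("geneD", 50000, 60000, -1));

        System.out.println("forward: " + countOverlapping(genes, 1, 1000000, 1));
        System.out.println("reverse: " + countOverlapping(genes, 1, 1000000, -1));
    }
}
```

Note that the sketch counts a gene that only partly overlaps the region (geneB here); whether a real query should include such genes is a design decision worth checking against the API in use.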

PAL

The Phylogenetic Analysis Library (PAL) is a Java API dedicated to the subset of bioinformatics analysis that pertains to the evolutionary development of genomes (DNA and protein sequence). Version 1.5 of PAL comprises approximately 250 classes in 18 packages with functionality related to the IO of sequence alignments, distance matrices, and phylogenetic trees. Some of the more advanced features include phylogenetic tree manipulation, amino acid substitution modelling, and several tree-construction methods. PAL is potentially a unique example of a bioinformatics-based Java API that aims to provide a deep level of functionality for its specific niche. Future work has been planned to bridge this API with the BioJava package.

Example 4. Using PAL to turn an alignment into a phylogenetic tree

Prior to this code example, I have aligned (using CLUSTALW) the protein sequence of a particular gene (CFTR) in human, mouse, and rat. In the code sample, we will load the aligned sequences and attempt to calculate the evolutionary distance between the sequences. Once calculated, we display the phylogenetic tree as is and with the midpoint rooted. The end result is a graphical display of the evolutionary distances between these sequences.

//REQUIRED IMPORTS
import java.io.*;
import pal.datatype.*;
import pal.alignment.*;
import pal.distance.*;
import pal.gui.*;
import pal.substmodel.*;
import pal.tree.*;
import pal.treesearch.*;
import java.awt.*;
import javax.swing.*;

//EXAMPLE 4
try {
   FileReader fr = new FileReader("cftr.aln");
   // Data type of the aligned sequences; it must match the contents of cftr.aln
   DataType dt = DataTypeTool.getNucleotides();
   Alignment cftr_aln = 
      AlignmentReaders.readPhylipClustalAlignment(fr, dt);

   SubstitutionModel sm = SubstitutionTool.createJC69Model();
   DistanceMatrix dm = 
      DistanceTool.constructEvolutionaryDistances(cftr_aln, sm);

   Tree t = TreeTool.createNeighbourJoiningTree(dm);
   Tree rooted = TreeTool.getMidPointRooted(t);

   TreeComponent tc = new TreeComponent(t, "CFTR Tree");
   tc.setSize(new Dimension(400,400));
   tc.setMode(TreeComponent.CIRCULAR_COLOR);

   TreeComponent tc2 = new TreeComponent(rooted, 
                       "CFTR Tree (Mid-point Rooted)");
   tc2.setSize(new Dimension(400,400));
   tc2.setMode(TreeComponent.CIRCULAR_COLOR);

   JFrame jf = new JFrame("CFTR Tree");
   jf.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
   jf.setSize(new Dimension(810,410));

   jf.getContentPane().setLayout(new BorderLayout());

   // Place the two 400x400 trees side by side in the 810x410 frame
   JSplitPane jsp = new JSplitPane(JSplitPane.HORIZONTAL_SPLIT, tc, tc2);

   jf.getContentPane().add(jsp);
   jf.setVisible(true);
}
catch (FileNotFoundException fnfe) {
   System.err.println("cftr.aln file not found");
}
catch (IOException ioe) {
   System.err.println("IO Exception");
   ioe.printStackTrace();
}
catch (AlignmentParseException ape) {
   System.err.println("Parsing Exception");
   ape.printStackTrace();
}

Figure 2. The phylogenetic trees generated in Example 4
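
The substitution model created in Example 4 is the Jukes-Cantor (JC69) model. In its simplest pairwise form, it corrects the observed proportion p of differing sites for multiple substitutions at the same site: d = -(3/4) ln(1 - (4/3) p). The sketch below computes that value for two aligned nucleotide strings in plain Java. It is a simplified illustration, not PAL's implementation: the class and method names are hypothetical, the sequences are made up, and alignment columns containing a gap are simply skipped.

```java
// Hypothetical helper class for illustration; it is not part of PAL.
final class Jc69Sketch {

    // Proportion of compared (gap-free) alignment columns at which the two sequences differ.
    static double proportionDifferent(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("aligned sequences must have equal length");
        }
        int compared = 0;
        int different = 0;
        for (int i = 0; i < a.length(); i++) {
            char x = Character.toUpperCase(a.charAt(i));
            char y = Character.toUpperCase(b.charAt(i));
            if (x == '-' || y == '-') {
                continue;   // skip columns with a gap in either sequence
            }
            compared++;
            if (x != y) {
                different++;
            }
        }
        if (compared == 0) {
            throw new IllegalArgumentException("no gap-free columns to compare");
        }
        return (double) different / compared;
    }

    // Jukes-Cantor distance for an observed proportion of differing sites p.
    static double jc69Distance(double p) {
        if (p < 0.0 || p >= 0.75) {
            throw new IllegalArgumentException("p must be in [0, 0.75)");
        }
        if (p == 0.0) {
            return 0.0;     // identical sequences
        }
        return -0.75 * Math.log(1.0 - (4.0 / 3.0) * p);
    }

    public static void main(String[] args) {
        // Made-up aligned fragments, for illustration only
        String human = "ACGTACGTAC";
        String mouse = "ACGTACGTTT";
        double p = proportionDifferent(human, mouse);
        System.out.println("p = " + p + ", JC69 distance = " + jc69Distance(p));
    }
}
```

The corrected distance is always at least as large as p, and it grows without bound as p approaches 0.75, the level of difference expected between two unrelated nucleotide sequences under this model.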

Other APIs

KDOM: The Knowledge Discovery Object Model (KDOM) is a bioinformatics-based API designed to represent and manage biological knowledge during application development. The goal of KDOM is to create a framework for managing the acquisition and implementation of biological objects. Developers can further define relationships between biological objects to develop a knowledge ontology that is persistent through the system. KDOM also facilitates context-dependent display of biological objects. For instance, a gene can be displayed differently in the context of a chromosome or an exon.

LSID Resolution Protocol Project: The Life Science Identifier (LSID) project is an I3C URN specification that is being implemented by IBM (for more information on what URNs are, see Daniel Zambonini's article "Diagramming the XML family"). The concept boils down to creating a worldwide unique ID for life sciences data that includes the information required to resolve this ID. Specifically, from the LSID, a client can use DNS to resolve an Authority that will in turn allow the client to retrieve a web services description of the methods available for that particular LSID. IBM has further demonstrated the capabilities of this technology by allowing users to navigate biological data held in different data stores (BioFerret) or to launch specific bioinformatics applications by simply clicking on LSIDs within an email application or web browser (LSID LaunchPad). The LSID project is being developed for Java and Perl as freely downloadable open source software. Basic LSID handling capabilities are also appearing in BioJava under org.biojava.utils.lsid.

MAGE-stk: The development of APIs that couple biological and computational knowledge to formally describe complex biological data types significantly reduces the number of conflicting formats and the time required to access and meaningfully analyze biological data. MAGE-stk is an example of a bioinformatics-based API that provides a Java representation of the information required to describe a particular experiment (in this case, a microarray experiment). The heterogeneous interchange of experimental information for microarray studies led to the development of a standard (MIAME) that defines the minimum information required to acceptably describe an individual experiment. MAGE-stk facilitates the loading of MAGE-ML, an XML-based format based on the MIAME standard. This is an excellent example of a bioinformatics resource that does little more than accurately describe microarray studies for end-user systems. While that may not seem impressive, the Java component of MAGE-stk comprises more than 300 classes, making it one of the heavyweights when compared to the other APIs mentioned here.
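
Returning to the LSID entry above: an LSID is a URN of the form urn:lsid:authority:namespace:object, optionally followed by a revision, where the authority is typically a DNS name. The sketch below splits such an identifier into its parts in plain Java. It illustrates the syntax only. The Lsid class here is hypothetical (it is not the org.biojava.utils.lsid class), the example identifier is illustrative, and the DNS and web-service resolution steps are not attempted.

```java
// Hypothetical value class for illustration; it is not the BioJava or IBM implementation.
final class Lsid {
    final String authority;
    final String namespace;
    final String object;
    final String revision;   // null when the identifier carries no revision

    private Lsid(String authority, String namespace, String object, String revision) {
        this.authority = authority;
        this.namespace = namespace;
        this.object = object;
        this.revision = revision;
    }

    // Parses urn:lsid:authority:namespace:object[:revision].
    static Lsid parse(String urn) {
        String[] parts = urn.split(":", -1);
        if (parts.length < 5 || parts.length > 6
                || !parts[0].equalsIgnoreCase("urn")
                || !parts[1].equalsIgnoreCase("lsid")) {
            throw new IllegalArgumentException("not an LSID: " + urn);
        }
        for (int i = 2; i < 5; i++) {
            if (parts[i].isEmpty()) {
                throw new IllegalArgumentException("empty LSID component in: " + urn);
            }
        }
        String revision = (parts.length == 6 && !parts[5].isEmpty()) ? parts[5] : null;
        return new Lsid(parts[2], parts[3], parts[4], revision);
    }

    public static void main(String[] args) {
        // Illustrative identifier; the authority is the part a client would resolve via DNS
        Lsid id = Lsid.parse("urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434");
        System.out.println("authority=" + id.authority
            + " namespace=" + id.namespace
            + " object=" + id.object);
    }
}
```

Keeping the authority as a separate field reflects the resolution idea described above: it is the only part of the identifier a client needs in order to find out where to ask about the rest.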

The Challenges of Bioinformatics-Based API Development

Progress is being measured by an organization's capability to synthesize large quantities of data to identify a meaningful picture of biological function. The effort to compartmentalize bioinformatics code into discrete APIs is a fundamental component of standardizing knowledge and creating reliable bioinformatics systems. User adoption and adherence to code-reuse strategies will significantly aid in the construction and delivery of complex analysis systems to researchers with varying computational proficiencies. Community efforts must be undertaken to increase the visibility and usability of bioinformatics APIs among developers. To do this, we must identify the challenges that prevent the adoption of APIs that have undergone significant design or peer review.

The majority of the existing Java APIs are the products of dedicated individuals or relatively small groups of developers. In many cases, there is little documentation or guarantee that the software will work out of the box. The measure of success has been related to the proportion of the bioinformatics community that utilizes the API. Unfortunately, most novice programmers are overwhelmed when they take their first steps into an object-oriented environment, and abandon or side-step the effort. Furthermore, senior programmers, who are more likely to use these APIs, frequently don't; they are either not aware of them or the short-term tradeoff between time and the challenge of doing it themselves is not that appealing. Both reactions are understandable, as it is frequently unclear where to go to find these APIs or how to start using one once identified (even if it is bug-free). I offer some suggestions to aid in the evolution of bioinformatics APIs in general.

  1. Answer the question "How do users find solved problems?"

  2. Implement better solutions to address how we respond and deliver information (e.g., very few mailing lists provide timeout responses such as, "I think no one knows.").

  3. Address how the community becomes aware of new APIs or new versions of existing APIs.

The Future

The development of open source Java APIs allows bioinformaticians to rapidly aggregate, synthesize, and dynamically display biological data. The tenets of code reuse and intellectual manageability are driving the development of open source bioinformatics APIs. These APIs have been and continue to be developed to reduce the amount of time developers spend (re)implementing common tasks and complex data structures. However, it is unclear how many potential contributors are unaware of these APIs or have nowhere to publicly ask for assistance. Collectively requiring standards for software delivery, or assured maintenance mechanisms for continued community development, may assist future API development in the life sciences field. I encourage discussion on this below.

Stephen Montgomery is currently a genetics graduate student at Canada's Michael Smith Genome Sciences Centre in Vancouver, BC.

