J. Craig Venter Institute - 2006 JBoss Innovation Award Winner - Clustering

March 3, 2008

J. Craig Venter Institute, submitted by Pete Davies, Indresh Singh, Tom Dolafi, Chris Lemieux, Sean Murphy, Adam Resnick, Angelo Trivelli, Bryan Yu, and Saul Kravitz

Customer: J. Craig Venter Institute

Industry: Life Sciences
Geography: North America
Country: United States


Solution:

Selected for its use of JBoss messaging and clustering to provide the stability and scalability needed to process more than 40 million traces in batch. A 2-node cluster supports over 100 DNA sequencers and scales to 8 nodes to process large collections of traces. The deployment also saves the not-for-profit genomic research center over $500,000 per year in licensing and maintenance costs.

Background:

J. Craig Venter Institute is a not-for-profit research institute dedicated to the advancement of the science of genomics; the understanding of its implications for society; and the communication of those results to the scientific community, the public, and policymakers. Founded by J. Craig Venter, Ph.D., the Institute is home to approximately 200 staff and scientists with expertise in human and evolutionary biology, genetics, genomic and environmental policy research.

Business Challenge:

The J. Craig Venter Institute’s Joint Technology Center (JTC) opened in June 2003. The facility was designed to be one of the world’s leading DNA sequencing organizations, providing DNA sequencing and resequencing services for the Venter Institute and collaborators worldwide. The JTC executes 150-200 projects a year, 45-50 concurrently, and has a capacity of 80 million sequence reads (lanes) per year. The facility was designed to scale to 320 million lanes per year. Producing high quality data from a 24×7 high throughput DNA sequencing facility servicing many concurrent projects requires a comprehensive and robust Laboratory Information Management System (LIMS).

The challenge was to build a LIMS that supports the JTC’s evolving lab processes and provides the performance, scalability, and robustness required by the JTC’s high-capacity operation, along with materials tracking, workflow, and process control. In addition, the LIMS was required to support integration with hundreds of laboratory instruments, computational pipelines, and analysis tools, as well as real-time quality control reports and data delivery to collaborators and public data repositories. We evaluated COTS LIMS tools, but all fell well short of the JTC’s LIMS requirements. The primary shortcomings were performance at the transaction and data volumes contemplated, and the ability to integrate with analysis pipelines and instrumentation.

Members of our team had been using JBoss since 2000. Once we made our strategic decision to invest heavily in J2EE, JBoss was our first choice. The main considerations were its feature set, quality, cost, and clear development roadmap.

Solution:

To track all the lab processes and the data, we designed and implemented a suite of LIMS applications, providing bar-coded tracking of users, reagents, and material transfers throughout the facility. The LIMS is built on an Oracle 10g database, with a rich stored procedure layer. Business logic is implemented in EJBs, and most user interaction is via JSPs. The first version of the LIMS was deployed in April 2004.

Extensive integration of laboratory robotics has been facilitated by the use of J2EE technology provided by JBoss. Our ~110 DNA sequencers each host a JBoss instance, supporting user interactions via JSPs, data transfer and remote monitoring via HTTP, and interaction with the LIMS via JMS and EJBs. We have integrated the fluid handling robots used to set up experiments in a similar manner. Symbol wireless PDAs have been integrated to track steps where lab staff mobility was critical.

DNA sequence data flows from the sequencers to a pipeline where it is reduced, analyzed, and loaded into the LIMS database. The pipeline is built using both JMS and EJB technology and is integrated with both our in-house Oracle database and a Sybase database from one of our key customers. The robustness of this pipeline in the face of unstable DB connections, enabled by the J2EE features embodied in JBoss, has produced dramatic increases in customer satisfaction as well as greatly reduced production IT overhead.

Suites of LIMS applications are deployed on three 2-node JBoss clusters. For on-demand batch processing, we have an 8-node JBoss cluster, which can easily process 40 million traces within days.

We are leveraging JBossMQ with HAJMS to process traces on a JBoss cluster using a fork-and-join approach, which parallelizes trace processing for high performance. When the TraceProcessing MDB receives a trace processing request, it forks the job into many smaller jobs by sending messages to another JMS queue. These messages are received by a second MDB, and the smaller jobs are then executed on all clustered servers in parallel. The TraceProcessing MDB waits for completion messages from all of the MDBs that are processing the smaller jobs. After receiving the completion messages for all of the jobs, the TraceProcessing MDB aggregates the results and sends a message to another queue, which is processed by a Loading MDB that persists the data. This is a very simple, stable, and highly scalable approach to processing millions of traces. When we need to add more computing power to our JTrace server farm, we simply add new JBoss nodes to the existing JBoss cluster.
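The fork-and-join flow described above can be illustrated with a minimal, in-process sketch. This is not the Institute's code and it does not use JMS or MDBs: it assumes plain java.util.concurrent queues as stand-ins for the JMS queues and a thread pool as a stand-in for the clustered worker MDBs. The names ForkJoinSketch, SubJob, Completion, and processBatch are hypothetical, and the "processing" step merely counts traces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class ForkJoinSketch {

    // Hypothetical message type: one slice of a larger trace-processing request.
    static final class SubJob {
        final int id;
        final List<String> traces;
        SubJob(int id, List<String> traces) { this.id = id; this.traces = traces; }
    }

    // Hypothetical completion message sent back by a worker.
    static final class Completion {
        final int subJobId;
        final int processed;
        Completion(int subJobId, int processed) {
            this.subJobId = subJobId;
            this.processed = processed;
        }
    }

    // Forks the batch into sub-jobs of at most chunkSize traces, lets `workers`
    // threads consume them, waits for one completion per sub-job, and returns
    // the total number of traces reported as processed.
    static int processBatch(List<String> traces, int chunkSize, int workers)
            throws InterruptedException {
        // Stand-ins for the two JMS queues (assumption: in-memory, single JVM).
        BlockingQueue<SubJob> subJobQueue = new LinkedBlockingQueue<>();
        BlockingQueue<Completion> completionQueue = new LinkedBlockingQueue<>();

        // Fork: split the request and enqueue one message per sub-job.
        int subJobCount = 0;
        for (int start = 0; start < traces.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, traces.size());
            subJobQueue.put(new SubJob(subJobCount++,
                    new ArrayList<>(traces.subList(start, end))));
        }

        // Workers: each thread plays the role of a worker on one cluster node.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.execute(() -> {
                SubJob job;
                // All sub-jobs were enqueued above, so an empty queue means done.
                while ((job = subJobQueue.poll()) != null) {
                    // Placeholder for real trace processing.
                    completionQueue.add(new Completion(job.id, job.traces.size()));
                }
            });
        }

        // Join: block until every sub-job has reported, then aggregate.
        int total = 0;
        for (int received = 0; received < subJobCount; received++) {
            total += completionQueue.take().processed;
        }
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> traces = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            traces.add("trace-" + i);
        }
        System.out.println(processBatch(traces, 64, 8)); // prints 1000
    }
}
```

In this sketch, raising the worker count plays the role of adding nodes to the cluster: the fork and join logic stays the same, and only the number of consumers draining the sub-job queue changes.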

Building on JBoss has dramatically reduced the development time for our LIMS and increased its stability. The ability to deploy Java application servers to all of our DNA sequencers has greatly improved the cost effectiveness of our LIMS integration with the sequencers. JBoss JMS and Clustering provided horizontal scalability for our application; we set up an 8-node JBoss cluster to process large bundles (40 million) of traces in batch.


DNA Sequencers: 100 ABI 3730xl and 6 ABI 3100 DNA Sequencers
Integration with ABI 9700 Thermalcycler
Beckman Biomek® FX Laboratory Automation Workstation
Hudson Controls Laboratory Automation Stackers and Print and Apply Stations
Zebra Barcode Printers and Scanners
Application servers run under Linux on HP BL20p blade servers
DB Server runs on a two node Linux cluster (HP DL580s).
Symbol PPT-8846 Wireless Pocket-PC based PDAs with scanner
3 Oracle 10g RAC clusters with more than 1 terabyte of data
JBoss 3.2.6 Java Application Servers
Panscopic Scope Server Reporting Tool (www.panscopic.com)
Bugzero Issue Tracking System (www.websina.com)
DiskXTender from Legato
Celera Assembler
Custom Bioinformatics Software developed at the Venter Institute

Yes, we used the advanced training provided by JBoss and found it very good. The training added a lot of value for our employees.

