Pistoia Alliance - Sequence Services Project

Pistoia Alliance: Many of the activities carried out by researchers in bio-science organizations are similar. For example, determining a gene sequence, identifying a signal transduction pathway, searching a chemical repository and keeping abreast of the scientific literature are all remarkably common. In 2008, a group of senior pharmaceutical industry R&D IT directors met in Pistoia, Italy, to discuss this phenomenon. They speculated that much of the effort replicated across their respective R&D IT organizations might be avoided if they shared thinking about, and defined best practices for, these pre-competitive research activities.

Such a common effort would be worthwhile. Currently, much energy is expended by each company configuring its own workflows, and by each technology or information supplier configuring its own interfaces, all to achieve the same ends. In the future, researchers could redirect their efforts and expenditure from mundane tasks to innovative ones, and suppliers could refocus their development budgets on the function of their products rather than just their form.

The Pistoia Alliance [1] was therefore formed to streamline precompetitive domains in the life science R&D workflow by specifying common standards, vocabularies, and processes, and by piloting short-term projects that explore innovative and transformational information-based solutions. It does this through a virtual organization that engages leading life-science companies, technology vendors, and academic and commercial information suppliers. By assembling and aggregating common use cases, identifying specific, high-value areas of opportunity, and exploiting contemporary technologies and service delivery models, the organization serves as a hub for envisioning information-based solutions that will drive innovation and productivity in the precompetitive domains of life science R&D.

The Sequence Services Challenge

No bioscience company gains a competitive advantage by maintaining the latest versions of the many informatics databases and software tools within its own firewall.

Maintaining even a core set of sequence databases recalls what the Red Queen told Alice: "it takes all the running you can do, to keep in the same place". Perpetually re-evaluating all the new and proliferating ones is a task as frustrating as that given to King Sisyphus.

The objective of this project is to define standards for the provision of secure access to pre-competitive databases and software tools, and to invite external suppliers to provide those services to multiple consumers who would effectively share the cost of maintenance. Such an approach frees each bioscience company to concentrate its investments where it feels it has a competitive advantage.

Informatics is about analyzing data that has often taken months of scientists' effort and much money to generate. Delivering informatics services means balancing the differing needs of the user community. There are power users, attracted by and demanding of new features, but the majority are occasional users who want stability. For power users these applications are an integral part of their daily work. Occasional users, by contrast, need reassurance that the applications will still be available the next time they need to analyze data. Both groups want a "reliable brand", but they define that concept differently. Users in both camps require secure access in order to analyze that hard-won, competitive-value data.

The reluctance to perform analyses outside the company firewall is deeply ingrained. It is encouraged by a long-established culture of protecting intellectual assets and compounded by the public perception that the internet is insecure. As long as it remains technically possible to intercept a stream of sequence-data queries, map these onto the genome and hence determine what a given company is researching, many in the pharmaceutical industry will be unwilling to accept the risk. When stripped of its biological context, this scenario may seem ludicrous, but a simple risk analysis would place it squarely in the "Low risk, High impact" quadrant (that is, unlikely to occur but damaging if it does). A risk of this level requires someone to take a corporate decision to permit the activity.
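
The inference step in the scenario above is simple, which is part of why the risk is taken seriously. The following is a minimal sketch, assuming the eavesdropper has already aligned each intercepted query to a gene. The function name `infer_focus`, the gene symbols and the sample log are hypothetical illustrations, not real traffic or a real tool.

```python
from collections import Counter


def infer_focus(mapped_queries, top_n=3):
    """Rank genes by how often intercepted queries mapped onto them.

    mapped_queries: iterable of gene symbols, one per intercepted query.
    The alignment that produced these symbols is assumed to be done already.
    """
    counts = Counter(mapped_queries)
    return counts.most_common(top_n)


# Hypothetical query log: each entry is the gene a query aligned to.
log = ["GENE_A", "GENE_B", "GENE_A", "GENE_C", "GENE_A", "GENE_B"]
print(infer_focus(log, top_n=2))  # [('GENE_A', 3), ('GENE_B', 2)]
```

Even this crude frequency count points to the gene receiving most attention, so the queries themselves, and not only the results, need to be protected in transit.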

One must conclude that separating sequence information from its context is not necessarily sufficient to change the deeply ingrained pharmaceutical industry culture of using traditional methods to protect intellectual assets. A significant challenge for an externally managed sequence service is to provide flexibility comparable with that offered by in-house systems. Inevitably, systems already in place within the company firewall will include some niche, highly appreciated applications. Migrating to an external service might mean losing the ability to fulfill user requests rapidly.

The challenge for this Pistoia Alliance Sequence Services project is to define a service that can provide security and the flexibility to integrate new applications rapidly, without an over-burdensome, expensive change-management process.

Phase 1 of the Project

Security is a primary concern for cloud-based services [2] that would contain sensitive data (such as people's genomes). Furthermore, such a service must be quick and easy to use. Clearly, the functionality needs to be provided at a lower cost of ownership than would be incurred by developing and expanding in-house corporate IT systems and infrastructure. The first phase of Sequence Services therefore focused on these non-functional, but absolutely vital, business requirements. Two simple yet valuable functional tasks were identified for Phase 1 (shown in Fig 1):

  • Hosting valuable open-source software, namely Ensembl and PlasMapper
  • The ability to add private data and control access to it, e.g. company-internal gene alias lists
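
The second task can be pictured with a minimal sketch, assuming an in-memory store and a simple per-company list of permitted readers. The class `PrivateAliasStore`, its methods and the sample names are hypothetical. The actual access-control mechanism was left to each supplier and is not described here.

```python
class PrivateAliasStore:
    """In-memory store of per-company gene alias lists with read checks."""

    def __init__(self):
        self._aliases = {}  # company -> {internal alias: public gene symbol}
        self._readers = {}  # company -> set of user ids allowed to read

    def add_aliases(self, company, aliases, readers):
        """Add aliases for a company and grant read access to the given users."""
        self._aliases.setdefault(company, {}).update(aliases)
        self._readers.setdefault(company, set()).update(readers)

    def resolve(self, user, company, alias):
        """Return the public symbol for an alias, or None if it is unknown."""
        if user not in self._readers.get(company, set()):
            raise PermissionError(f"{user} may not read {company} aliases")
        return self._aliases[company].get(alias)


# Hypothetical usage: one company, one authorized user.
store = PrivateAliasStore()
store.add_aliases("acme", {"target-17": "GENE_A"}, readers={"alice"})
print(store.resolve("alice", "acme", "target-17"))  # GENE_A
```

A call such as `store.resolve("bob", "acme", "target-17")` raises `PermissionError`, because that user was never granted access. A production service would delegate the identity check to the supplier's own authentication system rather than trust a user id passed in by the caller.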

The requirements for the project, including the need to deliver proof-of-concept installations, were published and distributed to qualified vendors or consortia. The decision on how these services would be delivered securely was left to each supplier. This approach leverages the extensive experience of the supplier community to deliver high-quality solutions. Once the proof-of-concept installations have been deployed, the Pistoia Alliance project team will test each of them for, inter alia, performance and security, the latter by commissioning an ethical-hacker attack against each service.

The choice of open-source functional software in Phase 1 avoided possible licensing complications and made it easier for the project team to commission multiple Phase 1 pilot installations. The European Bioinformatics Institute (EBI), which is a member of the organization, provides the Ensembl code under an open-source license jointly with the Wellcome Trust Sanger Institute. The EBI also makes all of the underlying Ensembl data available free for all uses and has experience supporting a mirror of Ensembl on the Amazon Cloud.

Conclusion

The value of the multiple "proof-of-concept" approach is that the same sets of functional and non-functional requirements are delivered on different platforms (e.g. Amazon Cloud, Microsoft Cloud, etc.) and with different authentication and security models (e.g. OpenAM, Azure, etc.). This will allow comparison of performance, scalability, ease of use and total cost of service. The Sequence Services project team will then be in an ideal position to make informed decisions on the platform and architecture for the more ambitious phases of the project that are being planned.

Notes
[1] The Pistoia Alliance is a not-for-profit organization formally incorporated in the State of Delaware, USA.
[2] For a definition of cloud computing see http://csrc.nist.gov/groups/SNS/cloud-computing/cloud-def-v15.doc

 
