Laboratory Information Technology
Mass Spectrometry on the Move
Mass spectrometry (MS) is finding new users every day. As experience and insight grow, new types of applications emerge rapidly. One persistent problem is the multitude of data and file formats, so this article focuses on formats and related analytical topics.
Mass spectrometry software is very complex. Many software suites are proprietary and limited to specific types of applications. Some standards cover individual steps in the chain from equipment configuration and data acquisition through transfer, storage, analysis and result presentation (on screen or as a printed document).
Mass spectrometry comprises analytical techniques that identify and quantify biological and chemical compounds by the molecular fingerprint they produce. Data analysis includes preparing and normalizing the data, calibrating spectra and searching for statistically meaningful peaks in the samples.
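The normalization and peak search mentioned above can be illustrated with a minimal sketch. It assumes a spectrum is a list of (m/z, intensity) pairs sorted by m/z, scales intensities to the base peak (100), and reports simple local maxima above a threshold. The function names and the 5 % threshold are hypothetical choices for illustration; real software adds baseline correction, smoothing and statistical tests.

```python
from typing import List, Tuple

Spectrum = List[Tuple[float, float]]  # (m/z, intensity) pairs, sorted by m/z


def normalize_to_base_peak(spectrum: Spectrum) -> Spectrum:
    """Scale intensities so that the most intense point equals 100."""
    if not spectrum:
        raise ValueError("spectrum is empty")
    base = max(intensity for _, intensity in spectrum)
    if base <= 0:
        raise ValueError("spectrum has no positive intensity")
    return [(mz, 100.0 * intensity / base) for mz, intensity in spectrum]


def pick_peaks(spectrum: Spectrum, min_relative_intensity: float = 5.0) -> Spectrum:
    """Return local maxima whose normalized intensity reaches the threshold.

    The threshold is an illustrative assumption, not a recommended value.
    """
    normalized = normalize_to_base_peak(spectrum)
    peaks = []
    for i in range(1, len(normalized) - 1):
        left = normalized[i - 1][1]
        here = normalized[i][1]
        right = normalized[i + 1][1]
        if here > left and here >= right and here >= min_relative_intensity:
            peaks.append(normalized[i])
    return peaks


# Invented example data: two clear peaks on a low background.
example = [
    (100.0, 2.0), (100.1, 40.0), (100.2, 3.0), (100.3, 1.0),
    (100.4, 80.0), (100.5, 4.0), (100.6, 2.5), (100.7, 2.0),
]
peaks = pick_peaks(example)
for mz, rel in peaks:
    print(f"m/z {mz:.1f}: {rel:.1f} % of base peak")
```

Normalizing before thresholding makes the cut-off independent of the absolute signal level, which differs between instruments and runs.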
There are three basic parts in a mass spectrometer:
- ion source: transforms the molecules in a sample into ionized fragments
- mass analyzer: electric and magnetic fields sort the ions by their mass-to-charge ratio
- detector: measures the abundance of each ion fragment type
These parts are realized in a great variety of devices and analytical techniques, each with its own set of requirements, data formats and problems.
There are many different data formats, some proprietary and others still awaiting standardization. Analytical chemistry, and mass spectrometry in particular, depends critically on optimal instrumentation and a correct interpretation of the raw data. Raw data should therefore be compared with known spectra from libraries of reference spectra, and the data must be processed correctly. Robust algorithms have to be tested thoroughly to eliminate potential flaws. Each type of mass spectrometer is optimized for specific methods of ionization and detection, and the field suffers from incompatible data formats and a lack of infrastructure for handling mass spectrometry data. As a result, researchers have to use the software tools bundled with each mass spectrometer, which complicates database construction and makes it difficult to compare results from different mass spectrometers.
The proprietary nature of these software tools also restricts users who want to adapt them to their individual needs.
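The comparison of raw data with reference spectra can be sketched as follows. This is only an illustration under stated assumptions: peaks are binned to nominal m/z values and scored with cosine similarity, which is one common scoring approach and not necessarily what any particular vendor or library search tool uses. The library entries and their names (compound_A, compound_B) are invented.

```python
import math
from typing import Dict, List, Tuple

Peaks = List[Tuple[float, float]]  # (m/z, intensity) pairs


def bin_spectrum(peaks: Peaks, bin_width: float = 1.0) -> Dict[int, float]:
    """Sum intensities into m/z bins of the given width."""
    binned: Dict[int, float] = {}
    for mz, intensity in peaks:
        key = int(round(mz / bin_width))
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Cosine of the angle between two binned spectra (0 = no overlap, 1 = identical shape)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_library_match(query: Peaks, library: Dict[str, Peaks]) -> Tuple[str, float]:
    """Return the name and score of the best-scoring reference spectrum."""
    binned_query = bin_spectrum(query)
    scores = {
        name: cosine_similarity(binned_query, bin_spectrum(reference))
        for name, reference in library.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical reference library and query spectrum (invented values).
library = {
    "compound_A": [(43.0, 100.0), (58.0, 30.0), (71.0, 10.0)],
    "compound_B": [(77.0, 100.0), (51.0, 40.0), (105.0, 60.0)],
}
query = [(43.1, 95.0), (58.0, 35.0), (71.2, 8.0)]
name, score = best_library_match(query, library)
print(name, round(score, 3))
```

A scheme like this only works across instruments if every spectrum can first be read into a common representation, which is where the format problem discussed in this article arises.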
Data/File Format Standards
Instrumentation and computing equipment are developing very fast. As a result, the hardware and software environment used to retrieve data for analysis will often differ from the one used to create it. To preserve the content of the original data, a data storage format must be:
- based on open data format standards
- readable for 20 to 30 years because of regulatory compliance and patent protection
- usable through changes in computer hardware/software, operating systems and storage media
- able to precisely represent the original data to meet regulatory requirements
- able to recognize data from a variety of instruments, extensible for new types and backwards compatible
Many existing data standards are specific to certain instrumentation. Some organizations have opted not to use the existing data storage standards at all, but instead store graphical representations of a final report. The content of the archived file is then restricted to the data included in the original report, and the graphics are of no use if a regulatory inspector needs to view associated information that was not incorporated in the report.
The public XML standard has features that seem to make it ideal as the basis of a file format for long-term storage of and access to instrument data. XML has been developed entirely in public by the World Wide Web Consortium. It is not actually a file format, but a standardized, application-independent way of representing data using plain ASCII text. In contrast to unstructured ASCII documents, XML has an inherent system of hierarchical tags and attributes that describe the relationships between pieces of data. This tagging makes it possible to extend and combine XML data structures without reformatting the information. XML addresses all of the critical requirements listed above:
- ASCII storage mechanism, which is human readable, now and in the future
- ASCII files are easy to migrate to any operating system, hardware/software platform and storage media
- based on a public domain standard controlled by a completely independent body
- describes and shares complex data structures via the use of public-domain XSDs
- can encapsulate binary information to maintain numerical accuracy using standard ASCII characters
- designed to be extensible while backwards compatible
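The idea of encapsulating binary information in ASCII characters can be sketched with Python's standard library. The example packs peak values as little-endian 64-bit floats, base64-encodes them and stores them in a small XML document. The element and attribute names (spectrum, array, precision, byteOrder) are hypothetical and chosen for illustration; they do not follow the mzData, mzXML or mzML schemas.

```python
import base64
import struct
import xml.etree.ElementTree as ET
from typing import List, Tuple


def encode_floats(values: List[float]) -> str:
    """Pack values as little-endian 64-bit floats and base64-encode them."""
    raw = struct.pack("<%dd" % len(values), *values)
    return base64.b64encode(raw).decode("ascii")


def decode_floats(text: str) -> List[float]:
    """Reverse encode_floats: base64 text back to a list of floats."""
    raw = base64.b64decode(text)
    return list(struct.unpack("<%dd" % (len(raw) // 8), raw))


def spectrum_to_xml(spectrum_id: str, mz: List[float], intensity: List[float]) -> str:
    """Serialize one spectrum to an XML string (hypothetical layout)."""
    root = ET.Element("spectrum", {"id": spectrum_id, "peakCount": str(len(mz))})
    for name, values in (("mz", mz), ("intensity", intensity)):
        array = ET.SubElement(
            root, "array", {"name": name, "precision": "64", "byteOrder": "little"}
        )
        array.text = encode_floats(values)
    return ET.tostring(root, encoding="unicode")


def xml_to_spectrum(xml_text: str) -> Tuple[str, List[float], List[float]]:
    """Parse the XML produced by spectrum_to_xml."""
    root = ET.fromstring(xml_text)
    arrays = {
        node.get("name"): decode_floats(node.text or "")
        for node in root.findall("array")
    }
    return root.get("id"), arrays["mz"], arrays["intensity"]


mz_values = [445.120025, 519.139, 0.1]
intensities = [1200.0, 350.5, 17.25]
xml_text = spectrum_to_xml("scan=1", mz_values, intensities)
restored = xml_to_spectrum(xml_text)
print(xml_text)
```

Because the floats are stored bit for bit, the values read back are identical to the originals, which a decimal text representation with a fixed number of digits would not guarantee.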
One drawback of XML data representation is that the data set size can be larger than a proprietary binary format containing the same information by a factor of 2 to 3. Today's low cost of storage media and lossless compression technologies make this a minor issue.
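The effect of lossless compression on text-based data can be checked with a few lines of standard-library Python. The XML content below is invented and highly regular, so the ratio it prints says nothing about real instrument files; the sketch only shows that the round trip is exact and how the sizes can be compared.

```python
import gzip


def compress_text(text: str) -> bytes:
    """Losslessly compress text with gzip."""
    return gzip.compress(text.encode("utf-8"))


def decompress_text(blob: bytes) -> str:
    """Restore the original text from a gzip blob."""
    return gzip.decompress(blob).decode("utf-8")


# Invented, repetitive XML-like content standing in for an exported data set.
xml_text = (
    "<spectrumList>"
    + "".join(
        '<peak mz="%.4f" intensity="%d"/>' % (100 + i * 0.5, i % 97)
        for i in range(2000)
    )
    + "</spectrumList>"
)

original_size = len(xml_text.encode("utf-8"))
blob = compress_text(xml_text)
print("original bytes:  ", original_size)
print("compressed bytes:", len(blob))
print("ratio: %.2f" % (len(blob) / original_size))
```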
PSI-MSS Working Group
The PSI-MSS working group defines data formats and controlled vocabulary terms that facilitate data exchange and archiving in the field of proteomics mass spectrometry. Examples are:
- mzData standard
- mzML format
- TraML format
These formats are described below, together with related tools and formats, in alphabetical order.
The Automated Mass Spectral Deconvolution and Identification System (AMDIS) is a method for automating GC/MS analysis and for finding components in complex mixtures that would otherwise be missed. Productivity increases significantly because the first-pass analysis of the data takes minutes (without direct intervention) rather than hours (with full involvement of the analyst). AMDIS can also provide statistically valid confidence measures for the analysis.
Cactus is an open source problem-solving environment with a modular structure that supports parallel computation across different architectures and collaborative code development between different groups. Cactus originated in the numerical relativity community. It is now primarily developed and maintained at the Max Planck Institute for Gravitational Physics (Albert Einstein Institute) in Potsdam, Germany.
The name Cactus comes from the design of a central core (the "flesh"), which connects to application modules (the "thorns"). Some thorns can be used for scientific or engineering applications, while other thorns from a standard computational toolkit provide a range of capabilities such as parallel I/O, data distribution or checkpointing. Proprietary data types are replaced with Cactus variables that maintain their properties independent of hardware architecture or software environment. The user can change program parameters on the fly by activating or deactivating individual code modules, without recompiling the program. Applications developed on standard workstations or laptops can run seamlessly on clusters or supercomputers; in July 2008, for example, Cactus ran on an IBM BlueGene/P with 131,072 cores at the Argonne Leadership Computing Facility (ALCF).
A Mascot data file is an ASCII file containing peak list information and search parameters. Mascot accepts peak list formats from a wide range of instrument data systems and recognizes them automatically. The Mascot Generic Format (MGF) lists each MS/MS dataset as pairs of mass and intensity values.
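A minimal reader for the mass and intensity pairs of an MGF file might look like the sketch below. It assumes the usual layout in which each MS/MS dataset is delimited by BEGIN IONS and END IONS, with KEY=value parameter lines such as TITLE, PEPMASS and CHARGE followed by whitespace-separated peak lines. Parameters outside these blocks are ignored, and treating lines that start with "#" as comments is an assumption of this sketch. The sample data is invented.

```python
from typing import Dict, List


def parse_mgf(text: str) -> List[Dict]:
    """Parse MGF text into a list of {'params': {...}, 'peaks': [(mz, intensity), ...]}."""
    spectra: List[Dict] = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Parameter line inside a dataset, e.g. PEPMASS=498.27
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                # Peak line: mass and intensity (any further columns are ignored)
                fields = line.split()
                if len(fields) >= 2:
                    current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra


sample = """\
BEGIN IONS
TITLE=example spectrum 1
PEPMASS=498.27
CHARGE=2+
112.087 1520.0
175.119 830.5
END IONS
"""

spectra = parse_mgf(sample)
for spectrum in spectra:
    print(spectrum["params"]["TITLE"], len(spectrum["peaks"]), "peaks")
```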
The Mass Spectrum I/O Project (MSIOP) supports storage and analysis of mass spectrometer data from multiple manufacturers across various platforms. It is one attempt to provide open source software that converts data files from various mass spectrometer manufacturers into mzXML. It builds on the Cactus framework and the C programming language for cross-platform portability and infrastructure support, so it can be deployed without modification in a wide variety of hardware and software environments, including heterogeneous grids.
The mzData standard for the capture of mass spectrometry output data was developed by the PSI. It unites a large number of current formats into a single format but is not a substitute for the raw file formats of the instrument vendors. mzData is stable at version 1.05 and is to be merged into mzML.
The mzML format merges the mzData and mzXML formats. The PSI, with participation by the Institute for Systems Biology (ISB), developed mzML by combining the best ideas from each of the two formats. mzML 1.1.0 was released on June 1, 2009.
The mzXML format was developed at the Seattle Proteome Center at the Institute for Systems Biology. It is an XML-based file format for the storage of proteomics mass spectrometric data, designed to be flexible enough to be used by all mass spectrometry researchers and extensible for future requirements.
The TraML format is intended for the exchange and transmission of transition lists for selected reaction monitoring (SRM) experiments. As of mid-2009, the specification is still in early draft form.
There is clearly a need for a small suite of coherent standards to make results accurate, vendor-independent and directly comparable. A single standard is not possible because of the many parts involved and the extreme variety of applications. The standards and drafts from PSI/HUPO, OpenMS and NIST are steps in this direction.