Reading in Base64-encoded binary data from mzML 1.0 (including functional Java code)

Bryan Smith, Thursday June 4 2009.

Please send any corrections or questions to bryanesmith at gmail dot com.

Very brief introduction

mzML is the standard output file format for mass spectrometer data, developed by HUPO PSI. The following is a portion of a sample mzML file containing m/z data, peaklist1.mzml (available for download from Tranche):

    <binaryDataArrayList count="2">
      <binaryDataArray arrayLength="127" encodedLength="688">
        <cvParam cvRef="MS" accession="MS:1000514" name="m/z array" value=""/>
        <cvParam cvRef="MS" accession="MS:1000521" name="32-bit float" value=""/>
        <cvParam cvRef="MS" accession="MS:1000574" name="zlib compression" value=""/>
        <binary>eNoNxltIkwEYBuAOREZFhrCudGFbbraTU+ZmueH3fptpm5gmmpkIU+fcSDtJByoMo/ETKjYEgzTaFrm20RoSRlSWiO2iiKCILhy2iUhWhAWRrXqunrJvvxpI4kmS4kqQNNc3UdPFEXIPGanbqyGhJErXDvZQf00fDdY9oqHqz+StdtMdq0DB/esp1CKnSP0zut/YR9M1kzTTME+z9RKK23vpXZWeEkfMNO/sp6QjTAuSo7TQ8YoWHS9oKTdKS52HaLVjDmt1aaxTnMOGggAytkeQoTyLzeqH2FY0gyzZMLKL7chRP4G4YAp7dF1QmeJQF6Wg0d5ASVkj9hafQkWxAZUqA6xKEdr0FrSXTsJlXMHpcheu3iqBRzqNgZ3PMXB7BYOSpxget+LmbiVGfcfgz4vBf/c87vlNCHkLEQq68FjhwZTsI2bzHZgN2/Ay1IW4zIP4mA3JaDaWYz580TnxI3YSP/XD+B31YVX7Fn/0I0gX2ZCe2MhbEk2cKe7gzJZLnDXWzKJRE4vmxLwjcYClARfLUl9ZnlfKinYpa3x9bFSsYWPbJza1LrE5qWLL/1sCr9mSWuTyXX/Z5n/Atd9zuHm8kJuTb9ieX8HO5Qh3Siv5uCPFZyT7uHeim3vfz/PlE14WtjpZyI2wIA+zYKhlwaxl4fAHFlrTLLjrWLjQ8w8C8tNw</binary>
      </binaryDataArray>
      <binaryDataArray arrayLength="127" encodedLength="692">
        <cvParam cvRef="MS" accession="MS:1000515" name="intensity array" value=""/>
        <cvParam cvRef="MS" accession="MS:1000521" name="32-bit float" value=""/>
        <cvParam cvRef="MS" accession="MS:1000574" name="zlib compression" value=""/>
        <binary>eNoB/AED/kLX3s5CgSL1QxClWUOi/BxCp81HQ22XLEMs4SVCdeQoRFDsGkNw279EudCWQ5lriUKri1RCkPU/QphSEURBx/lEWwtyQveDFEK6fypCeicSQ8lqVkKeKnNCrQ5tQ+nNcEMJ7hhET9WnQniCtkQjinpEmOmtQo8d00RcbD1CxUWnQq80h0LlJktCsYQwQtI5y0P1JzZCcSDPQz6bWULF7ItCg6osQ27kHkOvh5BCmR+QQ5C8HkTBitNC5eBtQpDO9kKP7gNC4vyAQpvgQkLepG1DvsKaQwiOMUJyDBVC+9OyQyUHAkNYhTVDYV+gQqJjo0MweaRCsWW/QqWb30P34q5C5CxMQyStxkKDyJZCuKusQ5QW3kNvJAdC3VbzQ0HsLkKoiutCpLx2RBtL10K51zRCrIJxQ6lfykL+/zZC/PyuQxko4UQgqWlC2lcBQpVTzEOQtQxCrf5KQzGqE0LANl5CumaKQrqg3kQr9DRDOmDGQyR44UK1aL1C6/R5Qq3A40KA/LRC9LqhQs/HmkNJo1NDO0CdQpE9ZUUeRXpEi3jdQ4fUuUKX3EJDBJ1bQoLqAEJSlOBCvbngQwEDz0LAvUhCmKiiQuBa8EJSezhCl69pQoeXJULUuqZCgh8EQkiWDEKj1SRCml1sQwUqfkNEGhJCdjEQQvLBYkKHaQifb+ek</binary>
      </binaryDataArray>
    </binaryDataArrayList>

This is not an introduction to mzML; the purpose of this document is to show how to read the Base64-encoded binary data, like the two <binary> elements in the example above. (Someone asked me how this is done at the ASMS 2009 conference after viewing our Babel Fish poster presentation.) However, I want to note a few things quickly:

  1. This <binaryDataArrayList> states that it holds two binary arrays (count="2").
  2. The <cvParam> elements of both <binaryDataArray> elements carry the information needed to decode them: the values are 32-bit floats (MS:1000521), and the data is zlib compressed (MS:1000574).
  3. The first <binaryDataArray> is an array of m/z floats, and the second <binaryDataArray> is an array of intensities. Both arrays have the same length (arrayLength="127").
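The three observations above can be collected programmatically. The following is only a sketch, assuming the JDK's built-in StAX parser (javax.xml.stream) is acceptable; the class names BinaryArrayInfo and BinaryArrayReader are hypothetical and not part of any mzML library.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical holder for one <binaryDataArray>: its cvParam accessions and Base64 text. */
final class BinaryArrayInfo {
    final List<String> accessions = new ArrayList<>();
    String base64 = "";
}

final class BinaryArrayReader {
    /** Collects every <binaryDataArray> found in the given XML text. */
    static List<BinaryArrayInfo> read(String xml) throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<BinaryArrayInfo> result = new ArrayList<>();
        BinaryArrayInfo current = null;
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = r.getLocalName();
                if (name.equals("binaryDataArray")) {
                    current = new BinaryArrayInfo();
                    result.add(current);
                } else if (current != null && name.equals("cvParam")) {
                    // Only cvParams nested inside a binaryDataArray are recorded
                    current.accessions.add(r.getAttributeValue(null, "accession"));
                } else if (current != null && name.equals("binary")) {
                    // getElementText() reads the text up to the closing </binary> tag
                    current.base64 = r.getElementText().trim();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && r.getLocalName().equals("binaryDataArray")) {
                current = null;
            }
        }
        r.close();
        return result;
    }
}
```

Each returned entry then tells you which array it is (m/z or intensity), which number type it holds, and whether it is compressed, before any decoding starts.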

Given this information, you can reconstruct the spectrum: the m/z value at each index pairs with the intensity at the same index.
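As an illustration of how the two decoded arrays combine, here is a minimal sketch that pairs them by index. The Peak and PeakPairing names are hypothetical, chosen only for this example.

```java
/** Hypothetical value type: one m/z value with its intensity. */
final class Peak {
    final float mz;
    final float intensity;

    Peak(float mz, float intensity) {
        this.mz = mz;
        this.intensity = intensity;
    }
}

final class PeakPairing {
    /** Pairs the m/z value at each index with the intensity at the same index. */
    static Peak[] pair(float[] mz, float[] intensity) {
        if (mz.length != intensity.length) {
            throw new IllegalArgumentException(
                    "Array lengths differ: " + mz.length + " vs " + intensity.length);
        }
        Peak[] peaks = new Peak[mz.length];
        for (int i = 0; i < mz.length; i++) {
            peaks[i] = new Peak(mz[i], intensity[i]);
        }
        return peaks;
    }
}
```

The length check mirrors the arrayLength attributes in the XML: if the two decoded arrays differ in length, something went wrong earlier in the decoding.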

The rest of this document focuses on zlib-compressed spectral data stored as 32-bit floats. Note, however, that mzML can hold other binary data, including 16-bit floats, 64-bit floats and integers. (For more information, see the mzML specification documentation.)
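If you need to handle the other number types, one option is to branch on the cvParam accession. The sketch below is an assumption-laden illustration: MS:1000521 and MS:1000574 appear in the sample above, while the other accessions are recalled from the PSI-MS controlled vocabulary and should be verified against the CV file you use. The BinaryEncoding class is hypothetical.

```java
import java.util.List;

final class BinaryEncoding {
    /** Returns the size in bytes of one value, or -1 if the accession is not recognized. */
    static int bytesPerValue(String accession) {
        switch (accession) {
            case "MS:1000521": return 4; // 32-bit float (seen in the sample above)
            case "MS:1000523": return 8; // 64-bit float (assumed accession, verify)
            case "MS:1000519": return 4; // 32-bit integer (assumed accession, verify)
            case "MS:1000522": return 8; // 64-bit integer (assumed accession, verify)
            default: return -1;
        }
    }

    /** True if the list of accessions includes the zlib compression term from the sample. */
    static boolean isZlibCompressed(List<String> accessions) {
        return accessions.contains("MS:1000574");
    }

    /** Number of bytes the decoded (and decompressed) array should contain. */
    static int expectedByteLength(int arrayLength, String typeAccession) {
        int width = bytesPerValue(typeAccession);
        if (width < 0) {
            throw new IllegalArgumentException("Unrecognized type: " + typeAccession);
        }
        return arrayLength * width;
    }
}
```

For the sample above, arrayLength="127" with 32-bit floats gives 127 × 4 = 508 bytes after decompression, which is a useful sanity check on the decoded data.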


Steps

This section has four steps to take a Base64-encoded string representing an array of 32-bit floats that is zlib compressed and return the float array. After I list the steps with the associated code, I present the entire functional code that you can include as a utility method.

If you are not working in Java, an equivalent Base64 decoding utility should be available for your programming language.

Step 1: Decode bytes from Base64

To include binary data in an XML file, it must be encoded as plain text. The first step is therefore to recover the bytes from the Base64 encoding. The only supporting JAR we need, xercesImpl.jar, provides the Base64 utility we'll use (org.apache.xerces.impl.dv.util.Base64).

    /**
     * STEP 1: Decode bytes from Base64. These are still compressed.
     */
    byte[] binArray = Base64.decode(input);
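On Java 8 or later, the JDK ships its own decoder, so the Xerces JAR is optional for this step. The following sketch uses java.util.Base64 instead of the Xerces class used in this document; the Base64Step wrapper is a hypothetical name.

```java
import java.util.Base64;

final class Base64Step {
    /** Decodes Base64 text taken from a <binary> element into raw bytes. */
    static byte[] decode(String base64Text) {
        // The MIME decoder skips line breaks and other characters outside the
        // Base64 alphabet, which helps when the XML text has been wrapped.
        return Base64.getMimeDecoder().decode(base64Text);
    }
}
```

Decoding the first eight characters of the intensity array above ("eNoB/AED") this way yields the bytes 78 DA 01 FC 01 03 in hexadecimal; the leading 78 DA is a common zlib header.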

Step 2: Convert bytes to little-endian

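    /**
     * STEP 2: Make little-endian
     */
    {
        ByteBuffer bbuf = ByteBuffer.allocate(binArray.length);
        bbuf.put(binArray);
        binArray = bbuf.order(ByteOrder.LITTLE_ENDIAN).array();
    }

Be aware that ByteBuffer.order() only changes how later multi-byte get and put calls on the buffer interpret the bytes. It does not rearrange the backing array, so binArray holds exactly the same bytes after this block as before it. The byte order that actually determines the float values is the one applied in Step 4.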
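To see what the byte order setting does and does not do, here is a small sketch. The ByteOrderDemo class is hypothetical; the four sample bytes (42 D7 DE CE in hexadecimal) were hand-decoded from the start of the intensity array above, after its zlib and block headers.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

final class ByteOrderDemo {
    /** Reads one float from four bytes, interpreting them in the given byte order. */
    static float readFloat(byte[] fourBytes, ByteOrder order) {
        return ByteBuffer.wrap(fourBytes).order(order).getFloat();
    }

    /** Repeats the Step 2 block and reports whether the bytes come out unchanged. */
    static boolean orderLeavesArrayUntouched(byte[] data) {
        byte[] before = data.clone();
        ByteBuffer bbuf = ByteBuffer.allocate(data.length);
        bbuf.put(data);
        byte[] after = bbuf.order(ByteOrder.LITTLE_ENDIAN).array();
        return Arrays.equals(before, after);
    }
}
```

Read as big-endian, the sample bytes give roughly 107.9, a plausible intensity; read as little-endian they give a very large negative number. Checking a few values like this is a quick way to find out which order a given file uses.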

Step 3: If compressed, decompress bytes (from zlib)

    /**
     * STEP 3: Decompress from zlib. Note the data might not be compressed.
     * Check the associated cvParam elements.
     */
    byte[] decompressedData = null;
    {
        Inflater decompressor = new Inflater();
        decompressor.setInput(binArray);

        // Create an expandable byte array to hold the decompressed data
        ByteArrayOutputStream bos = new ByteArrayOutputStream(binArray.length);
        try {
            // Decompress the data
            byte[] buf = new byte[1024];
            while (!decompressor.finished()) {
                int count = decompressor.inflate(buf);
                // Stop instead of looping forever if the input is truncated
                if (count == 0 && (decompressor.needsInput() || decompressor.needsDictionary())) {
                    throw new DataFormatException("Incomplete or unsupported zlib data");
                }
                bos.write(buf, 0, count);
            }
        } finally {
            // Release the native memory held by the Inflater
            decompressor.end();
        }
        decompressedData = bos.toByteArray();
    }
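Since the data might not be compressed, it can help to wrap this step in a helper that takes the compression flag as a parameter. This is a sketch only; the ZlibStep class and the zlibCompressed parameter are hypothetical names, and the flag is assumed to come from the cvParam elements.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

final class ZlibStep {
    /** Inflates zlib data, or returns the input unchanged when it is not compressed. */
    static byte[] inflateIfCompressed(byte[] data, boolean zlibCompressed)
            throws DataFormatException {
        if (!zlibCompressed) {
            return data;
        }
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            ByteArrayOutputStream out = new ByteArrayOutputStream(Math.max(32, data.length * 2));
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                int count = inflater.inflate(buf);
                // No output and nothing left to read means the stream is incomplete
                if (count == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new DataFormatException("Truncated or unsupported zlib stream");
                }
                out.write(buf, 0, count);
            }
            return out.toByteArray();
        } finally {
            inflater.end(); // frees the native resources held by the Inflater
        }
    }
}
```

The explicit check on a zero return value is what keeps a truncated <binary> element from turning into an endless loop.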

Step 4: Convert byte array to float array

    /**
     * STEP 4: Read floats from IEEE 754 floating-point "single format" representations
     */
    final int totalFloats = decompressedData.length / 4;
    float[] floatValues = new float[totalFloats];

    // Iterate until each complete four-byte float has been parsed
    for (int floatIndex = 0; floatIndex < totalFloats; floatIndex++) {
        int nextFloatPosition = floatIndex * 4;

        // Mask each byte with 0xFF so it is treated as an unsigned value (0 to 255)
        int b1 = decompressedData[nextFloatPosition + 0] & 0xFF;
        int b2 = decompressedData[nextFloatPosition + 1] & 0xFF;
        int b3 = decompressedData[nextFloatPosition + 2] & 0xFF;
        int b4 = decompressedData[nextFloatPosition + 3] & 0xFF;

        // Build the four-byte "single format" representation, first byte most significant
        int intBits = (b4 << 0) | (b3 << 8) | (b2 << 16) | (b1 << 24);

        floatValues[floatIndex] = Float.intBitsToFloat(intBits);
    }

Note that this assembles each float with its first byte as the most significant one (big-endian), which is the order that yields sensible values for the sample arrays above. The mzML schema documentation describes binary data as little-endian, so files written by other tools may need the opposite order (b1 in the lowest bits, b4 in the highest). Check a few decoded values for plausibility before trusting either order.
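The same conversion can be written more compactly with java.nio, which also makes the byte order an explicit parameter. This is an alternative sketch, not the method used in this document; FloatStep is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class FloatStep {
    /** Converts raw bytes to floats, reading each group of four bytes in the given order. */
    static float[] toFloats(byte[] data, ByteOrder order) {
        if (data.length % 4 != 0) {
            throw new IllegalArgumentException(
                    "Length " + data.length + " is not a multiple of 4");
        }
        float[] values = new float[data.length / 4];
        // The float view applies the byte order to every element in one bulk read
        ByteBuffer.wrap(data).order(order).asFloatBuffer().get(values);
        return values;
    }
}
```

Passing ByteOrder.BIG_ENDIAN reproduces the bit arithmetic shown in Step 4; passing ByteOrder.LITTLE_ENDIAN gives the opposite interpretation without touching the rest of the code.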

Complete Java utility method for converting Base64-encoded string to float array

    // Imports needed by the enclosing class:
    import java.io.ByteArrayOutputStream;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.util.zip.DataFormatException;
    import java.util.zip.Inflater;
    import org.apache.xerces.impl.dv.util.Base64;

    private static float[] getFloatArrayFromCompressedBase64String(String input) throws Exception {

        /**
         * STEP 1: Decode bytes from Base64. These are still compressed.
         */
        byte[] binArray = Base64.decode(input);

        /**
         * STEP 2: Make little-endian. Note that order() only affects later
         * multi-byte reads on the buffer; the bytes in binArray are unchanged.
         */
        {
            ByteBuffer bbuf = ByteBuffer.allocate(binArray.length);
            bbuf.put(binArray);
            binArray = bbuf.order(ByteOrder.LITTLE_ENDIAN).array();
        }

        /**
         * STEP 3: Decompress from zlib. Note the data might not be compressed.
         * Check the associated cvParam elements.
         */
        byte[] decompressedData = null;
        {
            Inflater decompressor = new Inflater();
            decompressor.setInput(binArray);

            // Create an expandable byte array to hold the decompressed data
            ByteArrayOutputStream bos = new ByteArrayOutputStream(binArray.length);
            try {
                // Decompress the data
                byte[] buf = new byte[1024];
                while (!decompressor.finished()) {
                    int count = decompressor.inflate(buf);
                    // Stop instead of looping forever if the input is truncated
                    if (count == 0 && (decompressor.needsInput() || decompressor.needsDictionary())) {
                        throw new DataFormatException("Incomplete or unsupported zlib data");
                    }
                    bos.write(buf, 0, count);
                }
            } finally {
                // Release the native memory held by the Inflater
                decompressor.end();
            }
            decompressedData = bos.toByteArray();
        }

        /**
         * STEP 4: Read floats from IEEE 754 floating-point "single format" representations
         */
        final int totalFloats = decompressedData.length / 4;
        float[] floatValues = new float[totalFloats];

        // Iterate until each complete four-byte float has been parsed
        for (int floatIndex = 0; floatIndex < totalFloats; floatIndex++) {
            int nextFloatPosition = floatIndex * 4;

            // Mask each byte with 0xFF so it is treated as an unsigned value (0 to 255)
            int b1 = decompressedData[nextFloatPosition + 0] & 0xFF;
            int b2 = decompressedData[nextFloatPosition + 1] & 0xFF;
            int b3 = decompressedData[nextFloatPosition + 2] & 0xFF;
            int b4 = decompressedData[nextFloatPosition + 3] & 0xFF;

            // Build the four-byte "single format" representation, first byte most significant
            int intBits = (b4 << 0) | (b3 << 8) | (b2 << 16) | (b1 << 24);

            floatValues[floatIndex] = Float.intBitsToFloat(intBits);
        }

        return floatValues;
    }
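A quick way to gain confidence in a decoder is a round trip: encode a few known floats the same way, then decode them again. The sketch below does this with JDK classes only (Java 8 or later), streaming through InflaterInputStream and DataInputStream rather than the Xerces-based method above. The RoundTripCheck class and its method names are hypothetical, and the arrayLength parameter is assumed to come from the arrayLength attribute in the XML.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

final class RoundTripCheck {
    /** Builds test input: floats written most-significant byte first, zlib compressed, Base64 encoded. */
    static String encode(float[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new DeflaterOutputStream(bytes))) {
            for (float v : values) {
                out.writeFloat(v);
            }
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    /** Decodes Base64 text holding arrayLength zlib-compressed floats. */
    static float[] decode(String base64, int arrayLength) throws IOException {
        byte[] packed = Base64.getMimeDecoder().decode(base64);
        float[] values = new float[arrayLength];
        try (DataInputStream in = new DataInputStream(
                new InflaterInputStream(new ByteArrayInputStream(packed)))) {
            for (int i = 0; i < arrayLength; i++) {
                // readFloat() reads four bytes, most significant first,
                // and throws EOFException if the data ends early
                values[i] = in.readFloat();
            }
        }
        return values;
    }
}
```

Calling decode(encode(values), values.length) should return the original values bit for bit. The same encoded string can also be fed to the utility method above as an independent check.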

Links