Open standard · HUPO-PSI

The modern mass spectrometry data format

Compact, fast, and cloud-native — a Parquet-in-ZIP successor to mzML. Losslessly smaller, and randomly addressable straight over HTTP.

Governed by HUPO-PSI · Reference implementations across seven languages
[Figure: MS1 profile spectrum from 12_80.mzpeak, with a highlighted point at m/z 478.9213 (intensity 1.42e8); horizontal axis: m/z]
0.1–0.6×
the size of the equivalent mzML — losslessly
1 spectrum
streamed from a multi-gigabyte file, no download
7 langs
reference SDKs — Rust, Python, R, C#, Java, TypeScript, C++
100%
CV-governed by the PSI-MS controlled vocabulary
Why mzPeak

Built for today's data volumes

Everything mzML can describe, stored in a layout built for terabyte archives, cloud workflows, and AI pipelines.

Compact

Columnar Parquet compression typically shrinks a file to a fraction of the equivalent mzML, without losing a single data point.

0.18–0.57× · lossless

Cloud-native

Designed for S3-compatible object storage and data lakes. Random access fetches only the bytes you need — so you pay for less egress and move less data.

S3 · data lake
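A minimal sketch of the random-access idea, not the reference implementation. It assumes members are stored uncompressed at the ZIP level, which this page does not state, and the member name `spectra_data.parquet` and the helper `member_byte_range` are hypothetical. The sketch works on an in-memory toy archive. Against object storage, a reader would first fetch the archive's central directory with one small range request, then put the two computed offsets into an HTTP `Range` header.

```python
import io
import struct
import zipfile


def member_byte_range(archive_bytes: bytes, name: str) -> tuple[int, int]:
    """Return (start, end_exclusive) of a member's raw data inside the archive."""
    # The central directory tells us where the member's local header begins.
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        info = zf.getinfo(name)
    # A local file header has 30 fixed bytes, followed by the file name
    # and an extra field whose lengths sit at byte offsets 26..29.
    header = archive_bytes[info.header_offset:info.header_offset + 30]
    name_len, extra_len = struct.unpack("<HH", header[26:30])
    start = info.header_offset + 30 + name_len + extra_len
    return start, start + info.compress_size


# Build a toy archive; ZIP_STORED keeps member bytes unmodified (assumption).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("mzpeak_index.json", b'{"members": []}')
    zf.writestr("spectra_data.parquet", b"PAR1" + bytes(range(64)) + b"PAR1")
archive = buf.getvalue()

start, end = member_byte_range(archive, "spectra_data.parquet")
payload = archive[start:end]              # what a ranged GET would return
range_header = f"bytes={start}-{end - 1}"  # HTTP ranges are inclusive
print(range_header, len(payload))
```

Because the slice is the untouched Parquet member, a Parquet reader could apply the same trick one level down and request only the footer and the column chunks it needs.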

Fast

Open a 3 GB file and read any spectrum near-instantly — under 1s locally, under 2s from remote S3. Ion images and XICs extract in seconds, in the browser.

3 GB in < 2s

AI-ready

Built on Apache Parquet, the native format of modern AI/ML stacks. Read directly by pandas, Polars, Spark & Arrow — no custom parser — and stream columns straight from the data lake into training.

pandas · Polars · Spark

Open & interoperable

ZIP-of-Parquet is language-independent, and the semantics are anchored in the PSI-MS controlled vocabulary — with a versioned conformance profile and validator.

PSI-MS · validated

Extensible

First-class extensions for MS imaging (per-pixel spatial data) and sample metadata (SDRF / ISA) — the format grows by extension, never by incompatible forks.

MSI · SDRF / ISA

Secure

Parquet's modular encryption can protect individual columns or files with AES-GCM, leaving the rest of the archive readable. How sensitive index fields and post-quantum-safe schemes are handled is an open design question.

AES-GCM · per-column

Backwards-compatible

Metadata aligns with the HUPO-PSI mzML standard, so mzML ↔ mzPeak conversion is lossless. The round trip has been tested on hundreds of datasets, and ProteoWizard support is in preparation.

mzML ↔ mzPeak
The idea in one line

A ZIP archive of Parquet tables, plus a small JSON index

Columnar, compressed, randomly addressable, and self-describing. Everything a reader needs to find one spectrum — without parsing the whole run — lives in the manifest.

New kinds of data attach through documented entity-type and data-kind mechanisms, so optical images, sample metadata, and provenance all ride along in the same container.

run.mzpeak · ZIP container
*_data.parquet
The signal — m/z + intensity (and ion-mobility) as sorted, compressed columns.
metadata
Instrument, software, samples, run description, and CV declarations.
mzpeak_index.json
The manifest — what members exist and how to find them.
Other members
Optional embedded artifacts — optical images, SDRF / ISA sample metadata, provenance.
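The layout above can be illustrated with a toy archive, assuming a manifest schema that is purely hypothetical: the keys `members`, `name`, and `kind`, and the member names other than `mzpeak_index.json`, are placeholders and not taken from the specification. The sketch shows the access pattern only: read the small JSON index first, then decide which members to open.

```python
import io
import json
import zipfile


def build_toy_archive() -> bytes:
    """Create an in-memory ZIP that mimics the container layout."""
    manifest = {
        # Hypothetical schema; the real mzpeak_index.json fields may differ.
        "members": [
            {"name": "spectra_data.parquet", "kind": "data"},
            {"name": "spectra_metadata.parquet", "kind": "metadata"},
        ]
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("mzpeak_index.json", json.dumps(manifest))
        for member in manifest["members"]:
            # Placeholder bytes stand in for real Parquet tables.
            zf.writestr(member["name"], b"PAR1placeholderPAR1")
    return buf.getvalue()


def members_of_kind(archive: bytes, kind: str) -> list[str]:
    """Read only the manifest and return member names of the requested kind."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        manifest = json.loads(zf.read("mzpeak_index.json"))
        present = set(zf.namelist())
    return [
        m["name"]
        for m in manifest["members"]
        if m["kind"] == kind and m["name"] in present
    ]


archive = build_toy_archive()
data_members = members_of_kind(archive, "data")
print(data_members)
```

In a real reader, the names returned here would be handed to a Parquet library such as Arrow rather than treated as opaque bytes.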
Performance · measured

A fraction of the size — losslessly

Real datasets across seven instrument classes. Each bar is the mzPeak file size relative to the equivalent mzML — on average about 0.37×, and as small as 0.18×.

Thermo LTQ FT Ultra FT-ICR · 30.2 → 5.5 MB · 0.18×
SCIEX ZenoTOF 7600 · 89.8 → 50.9 MB · 0.57×
Thermo LTQ XL ion trap · 173.5 → 55.6 MB · 0.32×
Thermo Orbitrap Velos · 429.2 → 101.5 MB · 0.24×
Thermo Fusion Lumos · 588.6 → 156.5 MB · 0.27×
Bruker timsTOF Pro · 1386.5 → 677.2 MB · 0.49×
Thermo Orbitrap Astral · 6118.4 → 3359.4 MB · 0.55×
mzPeak ÷ mzML file size · lower is better · public benchmark corpus, peak type preserved on conversion. Source: mzPeak white paper, J. Proteome Res. 2025.
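The per-file ratios and the quoted average can be re-derived from the sizes listed above. This is a small arithmetic check using only the seven mzML and mzPeak sizes on this page; the variable names are illustrative.

```python
# (mzML size, mzPeak size) in MB, as listed for each instrument.
sizes_mb = {
    "Thermo LTQ FT Ultra FT-ICR": (30.2, 5.5),
    "SCIEX ZenoTOF 7600": (89.8, 50.9),
    "Thermo LTQ XL ion trap": (173.5, 55.6),
    "Thermo Orbitrap Velos": (429.2, 101.5),
    "Thermo Fusion Lumos": (588.6, 156.5),
    "Bruker timsTOF Pro": (1386.5, 677.2),
    "Thermo Orbitrap Astral": (6118.4, 3359.4),
}

# Ratio = mzPeak size divided by mzML size; lower is better.
ratios = {name: mzpeak / mzml for name, (mzml, mzpeak) in sizes_mb.items()}
mean_ratio = sum(ratios.values()) / len(ratios)
smallest = min(ratios, key=ratios.get)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
print(f"mean: {mean_ratio:.2f}x, smallest: {smallest}")
```

The mean here is an unweighted average over files, so the small FT-ICR file counts as much as the multi-gigabyte Astral run.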
Across the full corpus · three data families

The same pattern, across real datasets

The bars above are individual files; these plots summarise the full public mzML2mzPeak corpus across three kinds of data. Each tracks file size along the conversion chain — vendor raw at 100% — so the trend is easy to read: mzML often grows past the raw file, while mzPeak consistently shrinks it.

Compression overview for general MS data: raw 100%, mzML ~181%, mzPeak ~50%.
General MS data

LC-/GC-MS across six instrument vendors. mzML typically inflates past the vendor raw file; mzPeak lands at about half of it.

Compression overview for imaging MS: raw 100%, mzPeak ~35%.
Imaging MS (MSI)

imzML imaging runs with per-pixel coordinates and embedded optical images — the whole image at roughly a third of the source.

Compression overview for study-design datasets: raw 100%, mzML ~194%, mzPeak ~45%.
Study-design embedding

Studies that carry their SDRF / ISA sample annotation alongside the data. The annotation is kept in the archive, and the result is still near 45% of the vendor raw size.

Open & community-governed

Build on the format

mzPeak is developed as an open community effort under HUPO-PSI. The specification is language-independent — start from the spec, validate against the conformance profile, and ship.