Mass spectrometry, also called mass spec, is an analytical technique that is used to measure the mass-to-charge ratio of ions. The results are presented as a mass spectrum, a plot of intensity as a function of the mass-to-charge ratio.
from Wikipedia
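As a rough illustration of the definition above, the following minimal sketch computes the mass-to-charge ratio of a protonated ion and rescales a toy spectrum to its most intense peak. The function names (`mz_protonated`, `normalize_to_base_peak`) and the toy peak list are hypothetical and not taken from any package in this list; the example assumes positive-mode ions of the form [M+zH]^z+ only.

```python
# Minimal sketch: m/z of a protonated ion and a toy mass spectrum.
# All names here are illustrative assumptions, not an API from a listed tool.

PROTON_MASS = 1.007276466  # mass of a proton in Da


def mz_protonated(neutral_mass: float, charge: int = 1) -> float:
    """Return m/z for an [M+zH]z+ ion given the neutral monoisotopic mass."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


def normalize_to_base_peak(spectrum):
    """Rescale intensities so that the most intense peak equals 100."""
    base = max(intensity for _, intensity in spectrum)
    return [(mz, 100.0 * intensity / base) for mz, intensity in spectrum]


# Caffeine, C8H10N4O2, monoisotopic mass of about 194.0804 Da.
caffeine_mass = 194.0803755
caffeine_mz = mz_protonated(caffeine_mass)  # about 195.0877 for [M+H]+

# A mass spectrum is a list of (m/z, intensity) pairs; these peaks are made up.
toy_spectrum = [(50.0, 20.0), (100.0, 80.0), (150.0, 40.0)]
relative_spectrum = normalize_to_base_peak(toy_spectrum)
```

Plotting `relative_spectrum` as vertical sticks, with m/z on the x-axis and relative intensity on the y-axis, gives the usual picture of a mass spectrum.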
This list is continuously updated with notable machine-learning papers and code related to small-molecule mass spectrometry. Please note that awesome lists are curations of the best, not of everything. Contributions are always welcome!
- Databases
- Papers
- Survey/Review papers
- Discussions in databases
- Discussions in pre-train models
- Small molecular representation learning
- Mass spectrometry-related properties prediction
- Mass spectra representation learning and matching
- Chemical formula prediction from mass spectra
- Mass spectra peak annotation/assignment
- Machine learning in small molecules chromatography
- Related awesome lists
Database | No. of Compounds | Note |
---|---|---|
OC20 & OC22 | 1.3 million molecular relaxations (from 260 million DFT calculations) | The Open Catalyst Project focuses on using AI to find new renewable energy storage catalysts |
QM9 | 134,000 | Stable small organic molecules composed of CHONF with computed geometric, energetic, electronic, and thermodynamic properties |
GEOM | 37 million conformations for 450,000+ molecules | Generated using advanced sampling and semi-empirical density functional theory (DFT) |
MD17 & MD22 | 7 biomolecular systems (42-370 atoms each) | Molecular dynamics trajectories sampled at 400-500 K with 1 fs resolution, energy and forces calculated using PBE+MBD theory |
PCQM4Mv2 | Not specified (derived from PubChemQC) | Focuses on predicting DFT-calculated HOMO-LUMO energy gaps of molecules using 2D graphs |
MoleculeNet | 700,000+ | Benchmark for testing machine learning methods on molecular properties, integrated into the DeepChem package |
Database | No. of Compounds | Note |
---|---|---|
MassSpecGym | 19K molecules (231K MS/MS spectra) | Benchmark for discovery and identification of molecules with high-quality MS/MS spectra |
NIST23 | 399,267 small molecules (2,374,064 MS/MS spectra) | Collection of MS/MS spectra and search software |
MoNA | 2,061,612 mass spectral records | Contains experimental and in-silico libraries, as well as user contributions |
GNPS | Not specified | Web-based mass spectrometry ecosystem for community-wide organization and sharing of MS/MS data |
HMDB 5.0 | 220,945 metabolite entries | Human Metabolome Database containing metabolites present in the human body and their experimental MS/MS spectra |
Database | Compounds | Note |
---|---|---|
METLIN-SMRT | 80,038 small molecules | Experimentally acquired reverse-phase chromatography retention time dataset |
RepoRT | 8,809 unique compounds (88,325 retention time entries) | Contains 373 datasets measured on 49 different chromatographic columns using various eluents, flow rates, and temperatures |
HSM3 | 75 compounds (43,329 total retention measurements) | Refined hydrophobic subtraction model for predicting reversed-phase liquid chromatography selectivity, based on 13 RP stationary phases with measurements at multiple mobile phase compositions |
Database | Compounds | Note |
---|---|---|
AllCCS | 1.6 million small molecules (5,000+ experimental CCS records, ~12 million calculated CCS values) | Collection of experimental and calculated collision cross section values |
AllCCS2 | 4,326 compounds (10,384 records, 7,713 unified CCS values added) | Expanded version of AllCCS with newly available experimental CCS data and standardized values with confidence scores |
METLIN-CCS | 27,000+ molecular standards across 79 chemical classes | Database of collision cross section values derived from ion mobility spectrometry data |
CCSBase | Not specified | Integrated platform with comprehensive database of CCS measurements from various sources and a machine learning prediction model [website] |
- [Annu. Rev. Anal. Chem. 2025] Hong, Yuhui, et al. Machine Learning in Small-Molecule Mass Spectrometry
- [J. Am. Soc. Mass Spectrom. 2024] Nguyen, Julia, et al. Advancing the Prediction of MS/MS Spectra Using Machine Learning
- [IJCAI 2023] Xia, Jun, et al. A Systematic Survey of Chemical Pre-trained Models
- [Metabolomics 2022] Bittremieux, Wout, Mingxun Wang, and Pieter C. Dorrestein. The critical role that spectral libraries play in capturing the metabolomics community knowledge
- [TrAC 2021] Debus, Bruno, et al. Deep learning in analytical chemistry
- [J. Cheminform. 2013] Scheubert, Kerstin, et al. Computational mass spectrometry for small molecules
- [Nat. Commun. 2025] Kretschmer, Fleming, et al. Coverage bias in small molecule machine learning
- [Anal. Chem. 2024] Hoang, Corey, et al. Tandem Mass Spectrometry across Platforms
- [bioRxiv 2024] Kretschmer, Fleming, et al. Small molecule machine learning: All models are wrong, some may not even be useful
- [JCIM 2023] Zhang, Ziqiao, et al. Can Pre-trained Models Really Learn Better Molecular Representations for AI-aided Drug Discovery?
- [NeurIPS 2022] Sun, Ruoxi, et al. Does GNN Pretraining Help Molecular Representation?
Molecular representation learning models are categorized by the information they encode: point-based (or quantum-based) methods, graph-based methods, and sequence-based methods. Because graph-based methods are so numerous, they are further divided into self-supervised and supervised approaches. Note that point-based (or quantum-based) methods differ from graph-based methods in whether bonds (i.e., edges) are included in the encoding.
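To make the distinction between point-based and graph-based encodings concrete, here is a minimal sketch using water as the example molecule. The class names (`PointCloudMolecule`, `GraphMolecule`), the approximate coordinates, and the cutoff values are illustrative assumptions; none of them is taken from the papers listed below.

```python
# Minimal sketch: the same molecule encoded without bonds (point-based)
# and with bonds as explicit edges (graph-based). Names are hypothetical.
from dataclasses import dataclass
from math import dist


@dataclass
class PointCloudMolecule:
    """Point-based encoding: atom types and 3D positions, no bonds."""
    atomic_numbers: list
    positions: list  # one (x, y, z) tuple per atom, in angstroms

    def neighbors(self, cutoff: float):
        """Atom pairs closer than `cutoff`, derived from geometry alone."""
        n = len(self.positions)
        return [
            (i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if dist(self.positions[i], self.positions[j]) < cutoff
        ]


@dataclass
class GraphMolecule:
    """Graph-based encoding: atoms as nodes, chemical bonds as edges."""
    atomic_numbers: list
    bonds: list  # (atom index i, atom index j, bond order)


# Water: O at the origin and two H atoms (approximate geometry).
water_points = PointCloudMolecule(
    atomic_numbers=[8, 1, 1],
    positions=[(0.0, 0.0, 0.0), (0.9572, 0.0, 0.0), (-0.2400, 0.9266, 0.0)],
)
water_graph = GraphMolecule(
    atomic_numbers=[8, 1, 1],
    bonds=[(0, 1, 1.0), (0, 2, 1.0)],
)

close_pairs = water_points.neighbors(cutoff=1.2)  # only the two O-H pairs
all_pairs = water_points.neighbors(cutoff=2.0)    # also includes the H-H pair
```

In the point-based form, which atoms interact depends on the chosen distance cutoff, whereas the graph-based form fixes the edges to the chemical bonds regardless of geometry.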
Point-based (or quantum-based) methods
- [ICLR 2023] Zhou, Gengmo, et al. Uni-mol: A universal 3d molecular representation learning framework [code]
- [PMLR 2021] Schütt, Kristof, et al. Equivariant message passing for the prediction of tensorial properties and molecular spectra [code]
- [NeurIPS 2017] Schütt, Kristof, et al. Schnet: A continuous-filter convolutional neural network for modeling quantum interactions [code]
Self-Supervised Learning
- [Brief. Bioinformatics 2024] Zhen, Wang, et al. BatmanNet: bi-branch masked graph transformer autoencoder for molecular representation [code]
- [Bioinformatics 2023] [3DGCL] Moon, Kisung, et al. 3D graph contrastive learning for molecular property prediction [code]
- [ICLR 2023] [Mole-BERT] Xia, Jun, et al. Mole-bert: Rethinking pre-training graph neural networks for molecules [code]
- [ICLR 2023 (spotlight)] [GNS TAT] Zaidi, Sheheryar, et al. Pre-training via denoising for molecular property prediction [code]
- [ICLR 2023] [GeoSSL-DDM] Liu, Shengchao, et al. Molecular geometry pretraining with SE(3)-invariant denoising distance matching [code]
- [ICLR 2022] [GraphMVP] Liu, Shengchao, et al. Pre-training molecular graph representation with 3d geometry [code]
- [NeurIPS 2021] [MGSSL] Zhang, Zaixi, et al. Motif-based graph self-supervised learning for molecular property prediction [code]
- [NeurIPS 2020] [GROVER] Rong, Yu, et al. Self-supervised graph transformer on large-scale molecular data [code]
- [ICLR 2020] [InfoGraph] Sun, Fan-Yun, et al. Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization [code]
Supervised Learning
- [AAAI 2023] [Molformer] Wu, Fang, et al. Molformer: Motif-based transformer on 3d heterogeneous molecular graphs [code]
- [NeurIPS 2022] [ComENet] Wang, Limei, et al. ComENet: Towards Complete and Efficient Message Passing for 3D Molecular Graphs [code (implemented in DIG library)]
- [ICLR 2022] [GNS+Noisy Nodes] Godwin, Jonathan, et al. Simple GNN regularisation for 3D molecular property prediction & beyond [codes]
- [ICLR 2022] [MolR] Wang, Hongwei, et al. Chemical-reaction-aware molecule representation learning [code]
- [ICLR 2022] [SphereNet] Liu, Yi, et al. Spherical message passing for 3d graph networks [code (implemented in DIG library)]
- [Nat. Mach. Intell. 2022] [GEM] Fang, Xiaomin, et al. Geometry-enhanced molecular representation learning for property prediction [code]
- [NeurIPS 2021] [GemNet] Gasteiger, Johannes, et al. Gemnet: Universal directional graph neural networks for molecules [code]
- [NeurIPS 2020] [DimeNet++] Klicpera, Johannes, et al. Fast and uncertainty-aware directional message passing for non-equilibrium molecules [code]
- [ICLR 2020] [DimeNet] Gasteiger, Johannes, et al. Directional message passing for molecular graphs [code]
- [Chem. Mater. 2019] [MEGNet] Chen, Chi, et al. Graph networks as a universal machine learning framework for molecules and crystals [preprint] [code]
- [PMLR 2017] Gilmer, Justin, et al. Neural message passing for quantum chemistry [code]
- [NeurIPS 2015] [Neural FPs] Duvenaud, David K., et al. Convolutional networks on graphs for learning molecular fingerprints [code]
Other Related Works
- [NeurIPS 2020] You, Yuning, et al. Graph contrastive learning with augmentations [code]
- [ICLR 2020] Hu, Weihua, et al. Strategies for pre-training graph neural networks [code]
Sequence-based methods
- [Patterns 2022] [SELFIES] Krenn, Mario, et al. SELFIES and the future of molecular string representations [code]
- [Nat. Mach. Intell. 2022] [MolFormer] Ross, Jerret, et al. Large-scale chemical language representations capture molecular structure and properties [code]
- [Chem. Sci. 2022] [R-SMILES] Zhong, Zipeng, et al. Root-aligned SMILES: a tight representation for chemical reaction prediction [code]
- [BCB 2019] [SMILES-BERT] Wang, Sheng, et al. SMILES-BERT: large scale unsupervised pre-training for molecular property prediction [code]
Tandem mass spectra prediction
- [Anal. Chem. 2024] [PPGB_MS2] Zheng, Fujian, et al. Predicting Tandem Mass Spectra of Small Molecules Using Graph Embedding of Precursor-Product Ion Pair Graph [code]
- [Anal. Chem. 2023] Wang, Fei, et al. Deep Learning-Enabled MS/MS Spectrum Prediction Facilitates Automated Identification Of Novel Psychoactive Substances [code]
- [Nat. Mach. Intell. 2023] Goldman, Samuel, et al. Annotating metabolite mass spectra with domain-inspired chemical formula transformers [code]
- [Nat. Mach. Intell. 2024] Young, Adamo, et al. Tandem mass spectrum prediction for small molecules using graph transformers [code]
- [NeurIPS 2023] Goldman, Samuel, et al. Prefix-tree decoding for predicting mass spectra from molecules [code]
- [Bioinformatics 2023] Hong, Yuhui, et al. 3DMolMS: prediction of tandem mass spectra from 3D molecular conformations [code]
- [Anal. Chem. 2021] Wang, Fei, et al. CFM-ID 4.0: more accurate ESI-MS/MS spectral prediction and compound identification [code]
- [ACS Cent. Sci. 2019] Wei, Jennifer N., et al. Rapid prediction of electron–ionization mass spectrometry using neural networks [code]
Retention time prediction
- [Bioinformatics 2024] [RT-Transformer] Xue, Jun, et al. RT-Transformer: Retention time prediction for metabolite annotation to assist in metabolite identification [code]
- [J. Chromatogr. A 2023] [DeepGCN-RT] Kang, Qiyue, et al. Deep graph convolutional network for small-molecule retention time prediction [code]
- [Anal. Chem. 2021] [GNN-RT] Yang, Qiong, et al. Prediction of liquid chromatographic retention time with graph neural networks to assist in small molecule identification [code]
- [Anal. Chem. 2020] [Retip] Bonini, Paolo, et al. Retip: retention time prediction for compound annotation in untargeted metabolomics [code]
- [Nat. Commun. 2019] Domingo-Almenara, Xavier, et al. The METLIN small molecule dataset for machine learning-based retention time prediction [code]
Collision cross section prediction
- [Anal. Chem. 2024] de Cripan, et al. Predicting the Predicted: A Comparison of Machine Learning-Based Collision Cross-Section Prediction Models for Small Molecules
- [Anal. Chem. 2022] [AllCCS2] Zhang, Haosong, et al. AllCCS2: Curation of Ion Mobility Collision Cross-Section Atlas for Small Molecules Using Comprehensive Molecular Representations [code]
- [Anal. Chem. 2022] [CCSP 2.0] Rainey, Markace A., et al. CCS Predictor 2.0: An open-source jupyter notebook tool for filtering out false positives in metabolomics [code]
- [Nat. Commun. 2020] [AllCCS] Zhou, Zhiwei, et al. Ion mobility collision cross-section atlas for known and unknown metabolite annotation in untargeted metabolomics [code]
- [Anal. Chem. 2019] [DeepCCS] Plante, Pier-Luc, et al. Predicting ion mobility collision cross-sections using a deep neural network: DeepCCS [code]
- [Anal. Chem. 2023] [CLERMS] Guo, Hao, et al. Contrastive learning-based embedder for the representation of tandem mass spectra [code]
- [Nat. Commun. 2023] [FastEI] Yang, Qiong, et al. Ultra-fast and accurate electron ionization mass spectrum matching for compound identification with million-scale in-silico library [code]
- [PLoS Comput. Biol. 2021] Huber, Florian, et al. Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships [code]
- [J. Cheminform. 2021] Huber, Florian, et al. MS2DeepScore: a novel deep learning similarity measure to compare tandem mass spectra [code]
- [Anal. Chem. 2019] [DeepMASS] Ji, Hongchao, et al. Deep MS/MS-aided structural-similarity scoring for unknown metabolite identification [code]
- [JCIM 2023] Goldman, Samuel, et al. MIST-CF: Chemical formula inference from tandem mass spectra [code]
- [Nat. Methods 2023] [BUDDY] Xing, Shipei, et al. BUDDY: molecular formula discovery via bottom-up MS/MS interrogation [code1] [code2]
- [Nat. Methods 2019] [SIRIUS 4] Dührkop, Kai, et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information [code]
- [J. Cheminform. 2016] Ruttkies, Christoph, et al. MetFrag relaunched: incorporating strategies beyond in silico fragmentation [website]
- [Anal. Chem. 2014] Ma, Yan, et al. MS2Analyzer: A software for small molecule substructure annotations from accurate tandem mass spectra [website]
- [Nucleic Acids Res. 2014] Allen, Felicity, et al. CFM-ID: a web server for annotation, spectrum prediction and metabolite identification from tandem mass spectra [website] CFM-ID is designed for three tasks: spectrum prediction, peak assignment, and compound identification.
Mass spectrometry is often coupled with chromatographic techniques, such as GC-MS (gas chromatography-mass spectrometry) or LC-MS (liquid chromatography-mass spectrometry). In these combined techniques, the chromatographic method separates the compounds, and then the mass spectrometer analyzes each separated compound for identification and quantification.
- [Anal. Chem. 2024] [3DMolCSP] Hong, Yuhui, et al. Enhanced Structure-Based Prediction of Chiral Stationary Phases for Chromatographic Enantioseparation from 3D Molecular Conformations [code]
- [Nat. Commun. 2023] [QGeoGNN] Xu, Hao, et al. Retention time prediction for chromatographic enantioseparation by quantile geometry-enhanced graph neural network [code]
- [J. Sep. Sci. 2018] Piras, Patrick, et al. Modeling and predicting chiral stationary phase enantioselectivity: An efficient random forest classifier using an optimally balanced training dataset and an aggregation strategy
- [J. Chromatogr. A 2016] Sheridan, Robert, et al. Toward structure-based predictive tools for the selection of chiral stationary phases for the chromatographic separation of enantiomers
- Awesome Mass Spectra Libraries: This repository contains the latest libraries for mass spectral data and related properties.
- Awesome Small Molecule Machine Learning: This repository focuses on machine learning topics related to small molecules.
- Awesome Cheminformatics: This repository concentrates on computer-based methods in chemistry.
- Awesome Python Chemistry: This repository is dedicated to Python-based frameworks, libraries, software, and resources in the field of Chemistry.
- Awesome DeepBio & deeplearning-biology: These repositories focus on deep learning methods in biology.
- awesome-pretrain-on-molecules