Awesome papers and codes list of small molecule mass spectrometry-related machine learning methods

JosieHong/awesome-smallmol-massspec-ml

Awesome Machine Learning in Small Molecules Mass Spectrometry

Mass spectrometry, also called mass spec, is an analytical technique that is used to measure the mass-to-charge ratio of ions. The results are presented as a mass spectrum, a plot of intensity as a function of the mass-to-charge ratio.

from Wikipedia
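To make the definition above concrete, here is a minimal sketch of a mass spectrum held as a list of (m/z, intensity) pairs and rescaled so that the most intense peak (the base peak) equals 100. The function name `normalize_to_base_peak` and the peak values are assumptions made up for illustration; they do not come from any database or tool in this list.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def normalize_to_base_peak(peaks: List[Peak]) -> List[Peak]:
    """Sort peaks by m/z and scale intensities so the base peak equals 100."""
    if not peaks:
        return []
    base = max(intensity for _, intensity in peaks)
    if base <= 0:
        raise ValueError("spectrum has no positive intensity")
    return [(mz, 100.0 * intensity / base) for mz, intensity in sorted(peaks)]


# Hypothetical peaks, invented for illustration only.
spectrum = [(85.03, 250.0), (163.06, 1000.0), (127.04, 500.0)]
normalized = normalize_to_base_peak(spectrum)

for mz, intensity in normalized:
    print(f"m/z {mz:8.2f}  relative intensity {intensity:6.1f}")
```

Relative intensities like these are a common way to compare spectra acquired at different absolute signal levels.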

This list is continually updated with notable machine-learning papers and code related to small-molecule mass spectrometry. Please note that awesome lists are curations of the best, not everything. Contributions are always welcome!

Databases

Molecular properties:

| Database | No. of Compounds | Note |
| --- | --- | --- |
| OC20 & OC22 | 1.3 million molecular relaxations (from 260 million DFT calculations) | The Open Catalyst Project focuses on using AI to find new renewable energy storage catalysts |
| QM9 | 134,000 | Stable small organic molecules composed of CHONF with computed geometric, energetic, electronic, and thermodynamic properties |
| GEOM | 37 million conformations for 450,000+ molecules | Generated using advanced sampling and semi-empirical density functional theory (DFT) |
| MD17 & MD22 | 7 biomolecular systems (42-370 atoms each) | Molecular dynamics trajectories sampled at 400-500 K with 1 fs resolution, energy and forces calculated using PBE+MBD theory |
| PCQM4Mv2 | Not specified (derived from PubChemQC) | Focuses on predicting DFT-calculated HOMO-LUMO energy gaps of molecules using 2D graphs |
| MoleculeNet | 700,000+ | Benchmark for testing machine learning methods on molecular properties, integrated into the DeepChem package |

MS/MS:

| Database | No. of Compounds | Note |
| --- | --- | --- |
| MassSpecGym | 19K molecules (231K MS/MS spectra) | Benchmark for discovery and identification of molecules with high-quality MS/MS spectra |
| NIST23 | 399,267 small molecules (2,374,064 MS/MS spectra) | Collection of MS/MS spectra and search software |
| MoNA | 2,061,612 mass spectral records | Contains experimental and in-silico libraries, as well as user contributions |
| GNPS | Not specified | Web-based mass spectrometry ecosystem for community-wide organization and sharing of MS/MS data |
| HMDB 5.0 | 220,945 metabolite entries | Human Metabolome Database containing metabolites present in the human body and their experimental MS/MS spectra |

Retention time:

| Database | Compounds | Note |
| --- | --- | --- |
| METLIN-SMRT | 80,038 small molecules | Experimentally acquired reverse-phase chromatography retention time dataset |
| RepoRT | 8,809 unique compounds (88,325 retention time entries) | Contains 373 datasets measured on 49 different chromatographic columns using various eluents, flow rates, and temperatures |
| HSM3 | 75 compounds (43,329 total retention measurements) | Refined hydrophobic subtraction model for predicting reversed-phase liquid chromatography selectivity, based on 13 RP stationary phases with measurements at multiple mobile phase compositions |

Collision cross section:

| Database | Compounds | Note |
| --- | --- | --- |
| AllCCS | 1.6 million small molecules (5,000+ experimental CCS records, ~12 million calculated CCS values) | Collection of experimental and calculated collision cross section values |
| AllCCS2 | 4,326 compounds (10,384 records, 7,713 unified CCS values added) | Expanded version of AllCCS with newly available experimental CCS data and standardized values with confidence scores |
| METLIN-CCS | 27,000+ molecular standards across 79 chemical classes | Database of collision cross section values derived from ion mobility spectrometry data |
| CCSBase | Not specified | Integrated platform with comprehensive database of CCS measurements from various sources and a machine learning prediction model |

Papers

Survey/Review papers

Discussions in database

Discussions in pre-train models

Small molecular representation learning

Based on the information the model encodes, molecular representation learning models are categorized as point-based (or quantum-based), graph-based, and sequence-based methods. Because graph-based methods are so numerous, they are further divided into self-supervised and supervised approaches. The difference between point-based (or quantum-based) methods and graph-based methods is whether bonds (i.e., edges) are included in the encoding.
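The three categories can be illustrated with a small sketch that encodes the heavy atoms of ethanol in each style. The class names `PointCloudMolecule` and `GraphMolecule` are hypothetical, and the 3D coordinates are rough placeholder values, not an optimized geometry; real models in the papers below use far richer features.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PointCloudMolecule:
    """Point-based view: atoms and their 3D positions, with no explicit bonds."""
    elements: List[str]
    coordinates: List[Tuple[float, float, float]]


@dataclass
class GraphMolecule:
    """Graph-based view: atoms are nodes and bonds are edges."""
    elements: List[str]
    bonds: List[Tuple[int, int]]  # pairs of atom indices

    def neighbors(self, atom_index: int) -> List[int]:
        """Return the indices of atoms bonded to the given atom."""
        found = []
        for a, b in self.bonds:
            if a == atom_index:
                found.append(b)
            elif b == atom_index:
                found.append(a)
        return sorted(found)


# Ethanol heavy atoms (C, C, O); coordinates are placeholders for illustration.
ethanol_points = PointCloudMolecule(
    elements=["C", "C", "O"],
    coordinates=[(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.3, 0.0)],
)
ethanol_graph = GraphMolecule(elements=["C", "C", "O"], bonds=[(0, 1), (1, 2)])
ethanol_sequence = "CCO"  # sequence-based view: the SMILES string for ethanol

print(ethanol_graph.neighbors(1))
```

Only the graph view stores bonds explicitly; a point-based model has to infer connectivity, if it needs it, from the atomic positions.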

Point-based (or quantum-based) methods

Graph-based methods

Self-Supervised Learning:

Supervised Learning:

Other Related Works

Sequence-based methods

Mass spectrometry-related properties prediction

Tandem mass spectra prediction

Retention time prediction

Collision cross section prediction

Mass spectra representation learning and matching

Chemical formula prediction from mass spectra

Mass spectra peak annotation/assignment

Machine learning in small molecules chromatography

Mass spectrometry is often coupled with chromatographic techniques, such as GC-MS (gas chromatography-mass spectrometry) or LC-MS (liquid chromatography-mass spectrometry). In these combined techniques, the chromatographic method separates the compounds, and then the mass spectrometer analyzes each separated compound for identification and quantification.
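As a rough sketch of what such coupled data can look like, an LC-MS run may be modeled as a series of scans, each pairing a retention time with the peaks recorded at that moment. The example below sums the signal near one m/z value across scans (commonly called an extracted ion chromatogram) and reports the retention time of its maximum. The function names, the tolerance, and all values are assumptions for illustration and are not taken from any tool in this list.

```python
from typing import List, Tuple

Peak = Tuple[float, float]       # (m/z, intensity)
Scan = Tuple[float, List[Peak]]  # (retention time in minutes, peaks)


def extracted_ion_chromatogram(
    scans: List[Scan], target_mz: float, tolerance: float = 0.01
) -> List[Tuple[float, float]]:
    """Sum the intensity within +/- tolerance of target_mz for each scan."""
    trace = []
    for rt, peaks in sorted(scans, key=lambda scan: scan[0]):
        total = sum(i for mz, i in peaks if abs(mz - target_mz) <= tolerance)
        trace.append((rt, total))
    return trace


def apex_retention_time(trace: List[Tuple[float, float]]) -> float:
    """Return the retention time at which the trace is most intense."""
    if not trace:
        raise ValueError("empty chromatogram")
    return max(trace, key=lambda point: point[1])[0]


# Hypothetical run with three scans, invented for illustration only.
run = [
    (1.0, [(181.07, 10.0), (203.05, 5.0)]),
    (1.1, [(181.07, 80.0)]),
    (1.2, [(181.07, 20.0), (150.00, 40.0)]),
]
trace = extracted_ion_chromatogram(run, target_mz=181.07)
print(trace, apex_retention_time(trace))
```

The apex retention time is the kind of quantity that the retention time datasets and prediction models listed above aim to capture.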

Related awesome lists
