7.4. Metabolomics

Provides functionality to interact with metabolomics datasets and search MINE databases for metabolomics hits.

class minedatabase.metabolomics.MetabolomicsDataset(name: str, adducts: Optional[List[str]] = None, known_peaks: Optional[List[Peak]] = None, unknown_peaks: Optional[List[Peak]] = None, native_set: Set[str] = {}, ppm: bool = False, tolerance: float = 0.001, halogens: bool = False, verbose: bool = False)

A class containing all the information for a metabolomics data set.

annotate_peaks(db: MINE, core_db: MINE) → None

This function iterates through the unknown peaks in the dataset and searches the database for compounds that match a peak m/z given the adducts permitted. Statistics on the annotated data set are printed.

Parameters

db (MINE): MINE database.
core_db (MINE): Core database containing spectra info.

check_product_of_native(cpd_ids: List[str], db: MINE) → List[str]: Filters list of compound IDs to just those associated with compounds produced from a native hit in the model (i.e. in native set).

enumerate_possible_masses(tolerance: float) → None

Generate all possible masses from unknown peaks and list of adducts. Saves these mass ranges to self.possible_ranges.

Parameters

tolerance (float): Mass tolerance in Daltons.
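As a rough illustration of the idea, the sketch below derives a neutral-mass window for each (peak, adduct) pair. It does not call minedatabase: the helper name possible_mass_ranges, the relation m/z = mass × multiplier + ion mass, and the example adduct values are assumptions based on the (adduct name, mass multiplier, ion mass) tuple layout described for find_db_hits.

```python
from typing import List, Tuple

# (adduct name, mass multiplier, ion mass), as described for find_db_hits
Adduct = Tuple[str, float, float]


def possible_mass_ranges(
    mzs: List[float], adducts: List[Adduct], tolerance: float
) -> List[Tuple[float, float]]:
    """Return sorted (low, high) neutral-mass windows for every peak/adduct pair."""
    ranges = []
    for mz in mzs:
        for _name, multiplier, ion_mass in adducts:
            # Assumed relation: m/z = mass * multiplier + ion_mass
            mass = (mz - ion_mass) / multiplier
            ranges.append((mass - tolerance, mass + tolerance))
    return sorted(ranges)


# Illustrative values only: a protonated adduct and one unknown peak.
example_adducts = [("[M+H]+", 1.0, 1.007276)]
example_ranges = possible_mass_ranges([181.070665], example_adducts, tolerance=0.001)
```

Sorting the windows up front makes it cheap to test whether a candidate compound mass falls inside any of them.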

find_db_hits(peak: Peak, db: MINE, core_db: MINE, adducts: List[Tuple[str, float, float]]) → None

This function searches the database for matches of a peak given adducts and updates the peak object with that information.

Parameters

peak (Peak): Peak object to query against MINE compound database.
db (MINE): MINE database to query.
adducts (List[Tuple[str, float, float]]): List of adducts. Each adduct contains three values in a tuple: (adduct name, mass multiplier, ion mass).

get_rt(peak_id: str) → Optional[float]

Return retention time for peak with given ID. If not found, returns None.

Parameters

peak_id (str): ID of peak as listed in dataset.

Returns

rt (float, optional): Retention time of peak with given ID, None if not found.

class minedatabase.metabolomics.Peak(name: str, r_time: float, mz: float, charge: str, inchi_key: str = None, ms2: List[float, float] = None)

Peak object which contains peak metadata as well as mass, retention time, spectra, and any MINE database hits.

Parameters

name (str): Name or ID of the peak.
r_time (float): Retention time of the peak.
mz (float): Mass-to-charge ratio (m/z) of the peak.
charge (str): Charge of the peak, “+” or “-”.
inchi_key (str, optional): InChI key of the peak, if already identified, by default None.
ms2 (List[float], optional): MS2 spectra m/z values for this peak, by default None.

Attributes

isomers (List[Dict]): List of compound documents in JSON (dict) format.
formulas (Set[str]): All the unique compound formulas from compounds found for this peak.
total_hits (int): Number of compound hits for this peak.
native_hit (bool): Whether this peak matches a compound provided in the native set.

score_isomers(metric: ~typing.Callable[[list, list], float] = <function dot_product>, energy_level: int = 20, tolerance: float = 0.005) → None

Scores and sorts isomers based on mass spectra data.

Calculates the cosine similarity score between the provided ms2 peak list and pre-calculated CFM-spectra and sorts the isomer list according to this metric.

Parameters

metric (function, optional): The scoring metric to use for the spectra. Function must accept 2 lists of (mz, intensity) tuples and return a score, by default dot_product.
energy_level (int, optional): The fragmentation energy level to use. May be 10, 20, or 40, by default 20.
tolerance (float, optional): The precision to use for matching m/z in mDa, by default 0.005.

Raises

ValueError: Empty ms2 peak.
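The score-then-sort pattern described above can be sketched without the library. Everything named here is hypothetical: rank_isomers, the toy shared_peaks metric, and the "predicted_spectrum" and "Spectral_score" keys are illustrative only and are not claimed to match the fields minedatabase uses.

```python
from typing import Callable, Dict, List, Tuple

Spectrum = List[Tuple[float, float]]  # (mz, intensity) pairs


def shared_peaks(x: Spectrum, y: Spectrum) -> float:
    """Toy metric: number of m/z values (rounded to 0.1) present in both spectra."""
    return float(len({round(mz, 1) for mz, _ in x} & {round(mz, 1) for mz, _ in y}))


def rank_isomers(
    ms2: Spectrum, isomers: List[Dict], metric: Callable[[list, list], float]
) -> None:
    """Score each isomer against ms2 and sort the list in place, best first."""
    if not ms2:
        raise ValueError("Empty ms2 peak list")
    for isomer in isomers:
        # "predicted_spectrum" is a hypothetical key holding a predicted spectrum.
        isomer["Spectral_score"] = metric(ms2, isomer["predicted_spectrum"])
    isomers.sort(key=lambda doc: doc["Spectral_score"], reverse=True)
```

Any callable that takes two peak lists and returns a float can be passed as metric, which mirrors the way the metric parameter is described.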

class minedatabase.metabolomics.Struct(**entries): Converts key-value pairs into object-attribute pairs.
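A common way to turn keyword arguments into attributes is to update the instance dictionary. The class below is a minimal sketch of that pattern, not necessarily the library's implementation; it is named StructSketch to avoid confusion with the real class.

```python
class StructSketch:
    """Expose keyword arguments as attributes (illustrative stand-in for Struct)."""

    def __init__(self, **entries):
        # Each key becomes an attribute name, each value its attribute value.
        self.__dict__.update(entries)


peak_info = StructSketch(name="peak_1", mz=181.0707, charge="+")
```

After construction, peak_info.mz reads more naturally than peak_info["mz"] would on a plain dict.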

minedatabase.metabolomics.dot_product(x: List[tuple], y: List[tuple], epsilon: float = 0.01) → float

Calculate the dot product of two spectra, allowing for some variability in mass-to-charge ratios.

Parameters

x (List[tuple]): First spectra m/z values.
y (List[tuple]): Second spectra m/z values.
epsilon (float, optional): Mass tolerance in Daltons, by default 0.01.

Returns

dot_prod (float): Dot product of x and y.
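To make "allowing for some variability in mass-to-charge ratios" concrete, here is one way a tolerance-aware, normalized dot product (cosine similarity) could be written. It is a standalone sketch under the assumption that each spectrum is a list of (mz, intensity) tuples; the function name tolerant_cosine and the greedy nearest-peak matching are choices made for this example and may differ from what dot_product does internally.

```python
import math
from typing import List, Tuple

Spectrum = List[Tuple[float, float]]  # (mz, intensity) pairs


def tolerant_cosine(x: Spectrum, y: Spectrum, epsilon: float = 0.01) -> float:
    """Cosine similarity where peaks match if their m/z differ by at most epsilon."""
    if not x or not y:
        return 0.0
    used = set()  # indices of y peaks that are already matched
    shared = 0.0
    for mz_x, int_x in x:
        best = None
        for j, (mz_y, _) in enumerate(y):
            if j in used or abs(mz_x - mz_y) > epsilon:
                continue
            # Keep the closest unmatched peak within tolerance.
            if best is None or abs(mz_x - mz_y) < abs(mz_x - y[best][0]):
                best = j
        if best is not None:
            used.add(best)
            shared += int_x * y[best][1]
    norm = math.sqrt(sum(i * i for _, i in x) * sum(i * i for _, i in y))
    return shared / norm if norm else 0.0
```

Each peak in y is used at most once, so two nearby peaks in x cannot both claim the same partner.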

minedatabase.metabolomics.get_KEGG_comps(db: MINE, core_db: MINE, kegg_db: Database, model_ids: List[str]) → set

Get MINE IDs from KEGG MINE database for compounds in model(s).

Parameters

db (MINE): MINE Mongo database.
kegg_db (pymongo.database.Database): Mongo database with annotated organism metabolomes from KEGG.
model_ids (List[str]): List of organism identifiers from KEGG.

Returns

set: MINE IDs of compounds that are linked to a KEGG ID in at least one of the organisms in model_ids.

minedatabase.metabolomics.jaccard(x: List[tuple], y: List[tuple], epsilon: float = 0.01) → float

Calculate the Jaccard Index of two spectra, allowing for some variability in mass-to-charge ratios.

Parameters

x (List[tuple]): First spectra m/z values.
y (List[tuple]): Second spectra m/z values.
epsilon (float, optional): Mass tolerance in Daltons, by default 0.01.

Returns

jaccard_index (float): Jaccard Index of x and y.
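The Jaccard index is the size of the intersection divided by the size of the union. With a mass tolerance, "intersection" becomes the number of peak pairs whose m/z values lie within epsilon of each other. The sketch below shows this under the assumption that spectra are lists of (mz, intensity) tuples; tolerant_jaccard and its first-match pairing are illustrative and not taken from the library.

```python
from typing import List, Tuple

Spectrum = List[Tuple[float, float]]  # (mz, intensity) pairs


def tolerant_jaccard(x: Spectrum, y: Spectrum, epsilon: float = 0.01) -> float:
    """Jaccard index where peaks count as shared if their m/z differ by <= epsilon."""
    used = set()  # indices of y peaks that are already paired
    matches = 0
    for mz_x, _ in x:
        for j, (mz_y, _) in enumerate(y):
            if j not in used and abs(mz_x - mz_y) <= epsilon:
                used.add(j)
                matches += 1
                break
    union = len(x) + len(y) - matches
    return matches / union if union else 0.0
```

Unlike the dot product, this measure ignores intensities and only asks which peaks are present.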

minedatabase.metabolomics.ms2_search(db: MINE, core_db: MINE, keggdb: Database, text: str, text_type: str, ms_params) → List

Search for compounds matching MS2 spectra.

Parameters

db (MINE): Contains compound documents to search.
core_db (MINE): Contains extra info (including spectra) for compounds in db.
keggdb (pymongo.database.Database): Contains models with associated compound documents.
text (str): Text as in metabolomics datafile for specific peak.
text_type (str): Type of metabolomics datafile (mgf, mzXML, and msp are supported). For plain text input, set text_type to “form”; m/z values are then assumed to be separated by newlines.
ms_params (dict): Dictionary with the following keys:

“tolerance”: float specifying tolerance for m/z, in mDa by default. Can be specified in ppm if the “ppm” key’s value is set to True.
“charge”: bool (1 for positive, 0 for negative).
“energy_level”: int specifying fragmentation energy level to use. May be 10, 20, or 40.
“scoring_function”: str describing which scoring function to use. Can be either “jaccard” or “dot product”.
“adducts”: list of adducts to use. If not specified, uses all adducts.
“models”: List of model _ids. If supplied, score compounds higher if present in model.
“ppm”: bool specifying whether “tolerance” is in mDa or ppm. Default value for ppm is False (so tolerance is in mDa by default).
“kovats”: length 2 tuple specifying min and max kovats retention index to filter compounds (e.g. (500, 1000)).
“logp”: length 2 tuple specifying min and max logp to filter compounds (e.g. (-1, 2)).
“halogens”: bool specifying whether to filter out compounds containing F, Cl, or Br. Filtered out if set to True. False by default.

Returns

ms_adduct_output (list): Compound JSON documents matching ms2 search query.
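For orientation, an ms_params dictionary for an MS2 search might look like the following. The keys are the ones documented above; all values, including the adduct name "[M+H]+" and the model code "eco", are illustrative assumptions rather than recommended settings, and the final call is shown only as a comment because it needs live database connections.

```python
# Example options for an MS2 search (values are illustrative only).
ms_params = {
    "tolerance": 5.0,                   # m/z tolerance, in mDa because "ppm" is False
    "ppm": False,
    "charge": True,                     # positive mode
    "energy_level": 20,                 # one of 10, 20, 40
    "scoring_function": "dot product",  # or "jaccard"
    "adducts": ["[M+H]+"],              # assumed adduct name; omit to use all adducts
    "models": ["eco"],                  # score compounds in this model higher
    "kovats": (500, 1000),              # min and max Kovats retention index
    "logp": (-1, 2),                    # min and max logP
    "halogens": True,                   # filter out compounds containing F, Cl, or Br
}

# With open database handles, the documented call would be:
# hits = ms2_search(db, core_db, keggdb, text, "mgf", ms_params)
```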

minedatabase.metabolomics.ms_adduct_search(db: MINE, core_db: MINE, keggdb: Database, text: str, text_type: str, ms_params) → List

Search for compound-adducts matching precursor mass.

Parameters

db (MINE): Contains compound documents to search.
core_db (MINE): Contains extra info (including spectra) for compounds in db.
keggdb (pymongo.database.Database): Contains models with associated compound documents.
text (str): Text as in metabolomics datafile for specific peak.
text_type (str): Type of metabolomics datafile (mgf, mzXML, and msp are supported). For plain text input, set text_type to “form”; m/z values are then assumed to be separated by newlines.
ms_params (dict): Dictionary with the following keys:

“tolerance”: float specifying tolerance for m/z, in mDa by default. Can be specified in ppm if the “ppm” key’s value is set to True.
“adducts”: list of adducts to use. If not specified, uses all adducts.
“models”: List of model _ids. If supplied, score compounds higher if present in model. [“eco”] by default (E. coli).
“ppm”: bool specifying whether “tolerance” is in mDa or ppm. Default value for ppm is False (so tolerance is in mDa by default).
“kovats”: length 2 tuple specifying min and max kovats retention index to filter compounds (e.g. (500, 1000)).
“logp”: length 2 tuple specifying min and max logp to filter compounds (e.g. (-1, 2)).
“halogens”: bool specifying whether to filter out compounds containing F, Cl, or Br. Filtered out if set to True. False by default.

Returns

ms_adduct_output (list): Compound JSON documents matching ms adduct query.
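Because “tolerance” may be given in mDa or in ppm, it can help to see both expressed as an absolute window in Daltons. The helper below is a hypothetical sketch of the standard unit arithmetic (1 mDa = 0.001 Da; 1 ppm of an m/z value = m/z × 10⁻⁶); whether the library performs the conversion in exactly this form is an assumption.

```python
def tolerance_in_daltons(mz: float, tolerance: float, ppm: bool = False) -> float:
    """Convert a tolerance given in mDa (ppm=False) or ppm (ppm=True) to Daltons."""
    if ppm:
        # Parts per million scale with the measured m/z.
        return mz * tolerance * 1e-6
    # Millidaltons are an absolute width, independent of m/z.
    return tolerance * 1e-3


window_mda = tolerance_in_daltons(500.0, 5.0)            # 5 mDa
window_ppm = tolerance_in_daltons(500.0, 10.0, ppm=True)  # 10 ppm at m/z 500
```

At m/z 500 the two settings above describe roughly the same window, but a ppm tolerance narrows for lighter ions and widens for heavier ones.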

minedatabase.metabolomics.read_adduct_names(filepath: str) → List[str]

Read adduct names from text file at specified path into a list.

Parameters

filepath (str): Path to adduct text file.

Returns

adducts (list): Names of adducts in text file.

Notes

Not used in this codebase but used by MINE-Server to validate adduct input.

minedatabase.metabolomics.read_mgf(input_string: str, charge: bool, ms2_delim='\t') → List[Peak]

Parse mgf metabolomics data file.

Parameters

input_string (str): Metabolomics input data file.
charge (bool): True if positive, False if negative.
ms2_delim (str): Delimiter for whitespace between intensity and m/z value. Usually tab but can also be a space in some MGF files. Tab by default.

Returns

peaks (List[Peak]): A list of Peak objects.
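To show what parsing an MGF string involves, and why the MS2 delimiter matters, here is a small standalone parser. It is a sketch only: it returns plain dictionaries rather than Peak objects, the name parse_mgf_sketch is hypothetical, and it handles just the basic BEGIN IONS / KEY=VALUE / peak line / END IONS layout.

```python
from typing import Dict, List


def parse_mgf_sketch(text: str, ms2_delim: str = "\t") -> List[Dict]:
    """Parse MGF text into dicts with 'meta' (header fields) and 'ms2' (peak list)."""
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"meta": {}, "ms2": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is None:
            continue  # ignore anything outside an ion block
        elif "=" in line and not line[0].isdigit():
            key, value = line.split("=", 1)
            current["meta"][key] = value
        else:
            # Peak line: m/z and intensity separated by ms2_delim.
            mz, intensity = line.split(ms2_delim)[:2]
            current["ms2"].append((float(mz), float(intensity)))
    return spectra


sample = (
    "BEGIN IONS\nTITLE=peak_1\nPEPMASS=181.0707\nRTINSECONDS=60.5\n"
    "CHARGE=1+\n100.0\t50.0\n163.06\t999.0\nEND IONS\n"
)
parsed = parse_mgf_sketch(sample)
```

For files that separate m/z and intensity with a space, pass ms2_delim=" " in the same way the ms2_delim parameter is described above.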

minedatabase.metabolomics.read_msp(input_string: str, charge: bool) → List[Peak]

Parse msp metabolomics data file.

Parameters

input_string (str): Metabolomics input data file.
charge (bool): True if positive, False if negative.

Returns

peaks (List[Peak]): A list of Peak objects.

minedatabase.metabolomics.read_mzxml(input_string: str, charge: bool) → List[Peak]

Parse mzXML metabolomics data file.

Parameters

input_string (str): Metabolomics input data file.
charge (bool): True if positive, False if negative.

Returns

List[Peak]: A list of Peak objects.

minedatabase.metabolomics.score_compounds(compounds: list, model_id: str = None, core_db: pymongo.database = None, mine_db: pymongo.database = None, kegg_db: pymongo.database = None, parent_frac: float = 0.75, reaction_frac: float = 0.25, get_native: bool = False) → List[dict]

This function validates compounds against a metabolic model, returning only the compounds which pass.

Parameters

db (Mongo DB): Should contain a “models” collection with compound and reaction IDs listed.
core_db (Mongo DB): Core MINE database.
compounds (list): Each element is a dict describing that compound. Should have an ‘_id’ field.
model_id (str): KEGG organism code (e.g. ‘hsa’).
parent_frac (float, optional): Weighting for compounds derived from compounds in the provided model. 0.75 by default.
reaction_frac (float, optional): Weighting for compounds derived from known compounds not in the model. 0.25 by default.

Returns

compounds (List[dict]): Modified version of input compounds list, where each compound now has a ‘Likelihood_score’ key and value between 0 and 1.

minedatabase.metabolomics.spectra_download(db: MINE, mongo_id: Optional[str] = None) → str

Download one or more spectra for compounds matching a given query.

Parameters

db (MINE): Contains compound documents to search.
mongo_query (str, optional, default None): A valid Mongo query as a literal string. If None, all compound spectra are returned.
parent_filter (str, optional, default None): If set to a metabolic model’s Mongo _id, only get spectra for compounds in or derived from that metabolic model.
putative (bool, optional, default True): If False, only find known compounds (i.e. in Generation 0). Otherwise, finds both known and predicted compounds.

Returns

spectral_library (str): Text of all matching spectra, including headers and peak lists.