7.4. Metabolomics

Provides functionality to interact with metabolomics datasets and search MINE databases for metabolomics hits.

class minedatabase.metabolomics.MetabolomicsDataset(name: str, adducts: Optional[List[str]] = None, known_peaks: Optional[List[Peak]] = None, unknown_peaks: Optional[List[Peak]] = None, native_set: Set[str] = {}, ppm: bool = False, tolerance: float = 0.001, halogens: bool = False, verbose: bool = False)

A class containing all the information for a metabolomics data set.

annotate_peaks(db: MINE, core_db: MINE) -> None

This function iterates through the unknown peaks in the dataset and searches the database for compounds that match a peak m/z given the adducts permitted. Statistics on the annotated data set are printed.

Parameters
db : MINE

MINE database.

core_db : MINE

Core database containing spectra info.

check_product_of_native(cpd_ids: List[str], db: MINE) -> List[str]

Filters list of compound IDs to just those associated with compounds produced from a native hit in the model (i.e. in native set).

enumerate_possible_masses(tolerance: float) -> None

Generate all possible masses from unknown peaks and list of adducts. Saves these mass ranges to self.possible_ranges.

Parameters
tolerance : float

Mass tolerance in Daltons.
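
The mass enumeration described above can be sketched as follows. This is a minimal illustration, not the library's implementation: it assumes adducts are (adduct name, mass multiplier, ion mass) tuples and that the neutral mass is recovered as (m/z - ion mass) / multiplier. The function name possible_mass_ranges and the adduct values are hypothetical.

```python
from typing import List, Tuple

# Assumed adduct layout: (adduct name, mass multiplier, ion mass)
Adduct = Tuple[str, float, float]


def possible_mass_ranges(
    mz_values: List[float], adducts: List[Adduct], tolerance: float
) -> List[Tuple[float, float]]:
    """Return a (low, high) neutral-mass window for every peak/adduct pair."""
    ranges = []
    for mz in mz_values:
        for _name, multiplier, ion_mass in adducts:
            # Assumption: m/z = multiplier * neutral_mass + ion_mass
            neutral_mass = (mz - ion_mass) / multiplier
            ranges.append((neutral_mass - tolerance, neutral_mass + tolerance))
    return ranges


# Illustrative adducts with approximate ion masses (hypothetical values)
adducts = [("[M+H]+", 1.0, 1.007276), ("[M+Na]+", 1.0, 22.989218)]
ranges = possible_mass_ranges([181.070665], adducts, tolerance=0.001)
```

Each unknown peak yields one window per adduct, so the number of ranges grows as peaks times adducts.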

find_db_hits(peak: Peak, db: MINE, core_db: MINE, adducts: List[Tuple[str, float, float]]) -> None

This function searches the database for matches of a peak given adducts and updates the peak object with that information.

Parameters
peak : Peak

Peak object to query against MINE compound database.

db : MINE

MINE database to query.

core_db : MINE

Core database containing spectra info.

adducts : List[Tuple[str, float, float]]

List of adducts. Each adduct is a tuple of three values: (adduct name, mass multiplier, ion mass).

get_rt(peak_id: str) -> Optional[float]

Return retention time for peak with given ID. If not found, returns None.

Parameters
peak_id : str

ID of peak as listed in dataset.

Returns
rt : float, optional

Retention time of peak with given ID, None if not found.

class minedatabase.metabolomics.Peak(name: str, r_time: float, mz: float, charge: str, inchi_key: str = None, ms2: List[float, float] = None)

Peak object which contains peak metadata as well as mass, retention time, spectra, and any MINE database hits.

Parameters
name : str

Name or ID of the peak.

r_time : float

Retention time of the peak.

mz : float

Mass-to-charge ratio (m/z) of the peak.

charge : str

Charge of the peak, “+” or “-”.

inchi_key : str, optional

InChI key of the peak, if already identified, by default None.

ms2 : List[float], optional

MS2 spectra m/z values for this peak, by default None.

Attributes
isomers : List[Dict]

List of compound documents in JSON (dict) format.

formulas : Set[str]

All the unique compound formulas from compounds found for this peak.

total_hits : int

Number of compound hits for this peak.

native_hit : bool

Whether this peak matches a compound provided in the native set.

score_isomers(metric: Callable[[list, list], float] = dot_product, energy_level: int = 20, tolerance: float = 0.005) -> None

Scores and sorts isomers based on mass spectra data.

Calculates the cosine similarity score between the provided ms2 peak list and pre-calculated CFM spectra, then sorts the isomer list according to this metric.

Parameters
metric : function, optional

The scoring metric to use for the spectra. The function must accept two lists of (mz, intensity) tuples and return a score. By default dot_product.

energy_level : int, optional

The fragmentation energy level to use. May be 10, 20, or 40, by default 20.

tolerance : float, optional

The precision to use for matching m/z in mDa, by default 0.005.

Raises
ValueError

Empty ms2 peak.
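
The score-then-sort behaviour can be sketched as below. This is an illustration under stated assumptions, not the method's actual code: the helper names rank_isomers and shared_fraction, and the "spectrum" and "score" keys on each isomer dict, are hypothetical. Any callable that takes two lists of (mz, intensity) tuples and returns a float can serve as the metric.

```python
from typing import Callable, Dict, List, Tuple

Spectrum = List[Tuple[float, float]]  # (mz, intensity) pairs


def shared_fraction(x: Spectrum, y: Spectrum, tol: float = 0.005) -> float:
    """Toy metric: fraction of peaks in x with an m/z match in y."""
    hits = sum(1 for mz, _ in x if any(abs(mz - mz2) <= tol for mz2, _ in y))
    return hits / len(x)


def rank_isomers(
    observed: Spectrum,
    isomers: List[Dict],
    metric: Callable[[Spectrum, Spectrum], float],
) -> List[Dict]:
    """Score each isomer against the observed spectrum, best first."""
    if not observed:
        raise ValueError("Empty ms2 peak list")
    scored = [
        {**iso, "score": metric(observed, iso["spectrum"])} for iso in isomers
    ]
    return sorted(scored, key=lambda iso: iso["score"], reverse=True)


observed = [(85.03, 100.0), (127.04, 50.0)]
isomers = [
    {"_id": "A", "spectrum": [(85.03, 80.0)]},
    {"_id": "B", "spectrum": [(85.031, 90.0), (127.041, 40.0)]},
]
ranked = rank_isomers(observed, isomers, shared_fraction)
```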

class minedatabase.metabolomics.Struct(**entries)

Convert key-value pairs into object-attribute pairs.

minedatabase.metabolomics.dot_product(x: List[tuple], y: List[tuple], epsilon: float = 0.01) -> float

Calculate the dot product of two spectra, allowing for some variability in mass-to-charge ratios.

Parameters
x : List[tuple]

First spectra m/z values.

y : List[tuple]

Second spectra m/z values.

epsilon : float, optional

Mass tolerance in Daltons, by default 0.01.

Returns
dot_prod : float

Dot product of x and y.
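
A tolerance-aware, normalized dot product could look like the sketch below. It is one possible approach rather than the library's implementation: it assumes each spectrum is a list of (mz, intensity) tuples, pairs each peak in x with the closest unused peak in y within epsilon, and normalizes by the intensity norms of both spectra. The name cosine_score is hypothetical.

```python
import math
from typing import List, Tuple

Spectrum = List[Tuple[float, float]]  # (mz, intensity) pairs


def cosine_score(x: Spectrum, y: Spectrum, epsilon: float = 0.01) -> float:
    """Normalized dot product of matched peak intensities."""
    if not x or not y:
        return 0.0
    used = set()  # indices of y peaks that are already paired
    numerator = 0.0
    for mz_x, int_x in x:
        best = None  # (m/z difference, index in y)
        for j, (mz_y, _) in enumerate(y):
            if j in used:
                continue
            diff = abs(mz_x - mz_y)
            if diff <= epsilon and (best is None or diff < best[0]):
                best = (diff, j)
        if best is not None:
            used.add(best[1])
            numerator += int_x * y[best[1]][1]
    norm_x = math.sqrt(sum(i * i for _, i in x))
    norm_y = math.sqrt(sum(i * i for _, i in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return numerator / (norm_x * norm_y)


reference = [(100.0, 3.0), (200.0, 4.0)]
shifted = [(100.005, 3.0), (200.005, 4.0)]  # same peaks, 5 mDa offset
unrelated = [(300.0, 1.0)]
```

Unmatched peaks add nothing to the numerator but still count in the norms, so extra peaks in either spectrum lower the score.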

minedatabase.metabolomics.get_KEGG_comps(db: MINE, core_db: MINE, kegg_db: Database, model_ids: List[str]) -> set

Get MINE IDs from KEGG MINE database for compounds in model(s).

Parameters
db : MINE

MINE Mongo database.

kegg_db : pymongo.database.Database

Mongo database with annotated organism metabolomes from KEGG.

model_ids : List[str]

List of organism identifiers from KEGG.

Returns
set

MINE IDs of compounds that are linked to a KEGG ID in at least one of the organisms in model_ids.

minedatabase.metabolomics.jaccard(x: List[tuple], y: List[tuple], epsilon: float = 0.01) -> float

Calculate the Jaccard Index of two spectra, allowing for some variability in mass-to-charge ratios.

Parameters
x : List[tuple]

First spectra m/z values.

y : List[tuple]

Second spectra m/z values.

epsilon : float, optional

Mass tolerance in Daltons, by default 0.01.

Returns
jaccard_index : float

Jaccard Index of x and y.
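
For comparison, a tolerance-aware Jaccard index can be sketched as matched peaks divided by the size of the union. This is a minimal illustration, assuming (mz, intensity) tuples and a one-to-one pairing of peaks within epsilon. The name jaccard_score is hypothetical and the matching strategy is an assumption.

```python
from typing import List, Tuple

Spectrum = List[Tuple[float, float]]  # (mz, intensity) pairs


def jaccard_score(x: Spectrum, y: Spectrum, epsilon: float = 0.01) -> float:
    """Matched peaks / (peaks in x + peaks in y - matched peaks)."""
    unmatched_y = [mz for mz, _ in y]
    matched = 0
    for mz_x, _ in x:
        for k, mz_y in enumerate(unmatched_y):
            if abs(mz_x - mz_y) <= epsilon:
                matched += 1
                del unmatched_y[k]  # each y peak may be matched only once
                break
    union = len(x) + len(y) - matched
    return matched / union if union else 0.0


spec_a = [(100.0, 1.0), (200.0, 1.0), (300.0, 1.0)]
spec_b = [(100.004, 1.0), (200.0, 1.0), (400.0, 1.0)]
score = jaccard_score(spec_a, spec_b)
```

Unlike the dot product, this score ignores intensities and depends only on which m/z values are shared.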

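minedatabase.metabolomics.ms2_search(db: MINE, core_db: MINE, keggdb: pymongo.database.Database, text: str, text_type: str, ms_params: dict) -> list

Search for compounds matching MS2 spectra.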

Parameters
db : MINE

Contains compound documents to search.

core_db : MINE

Contains extra info (including spectra) for compounds in db.

keggdb : pymongo.database.Database

Contains models with associated compound documents.

text : str

Text as in metabolomics datafile for specific peak.

text_type : str

Type of metabolomics datafile (mgf, mzXML, and msp are supported). For plain text input, set text_type to “form”; m/z values are then assumed to be separated by newlines.

ms_params : dict

“tolerance”: float specifying tolerance for m/z, in mDa by default. Can specify in ppm if “ppm” key’s value is set to True.

“charge”: bool (1 for positive, 0 for negative).

“energy_level”: int specifying fragmentation energy level to use. May be 10, 20, or 40.

“scoring_function”: str describing which scoring function to use. Can be either “jaccard” or “dot product”.

“adducts”: list of adducts to use. If not specified, uses all adducts.

“models”: List of model _ids. If supplied, score compounds higher if present in model.

“ppm”: bool specifying whether “tolerance” is in mDa or ppm. Default value for ppm is False (so tolerance is in mDa by default).

“kovats”: length 2 tuple specifying min and max kovats retention index to filter compounds (e.g. (500, 1000)).

“logp”: length 2 tuple specifying min and max logp to filter compounds (e.g. (-1, 2)).

“halogens”: bool specifying whether to filter out compounds containing F, Cl, or Br. Filtered out if set to True. False by default.

Returns
ms_adduct_output : list

Compound JSON documents matching ms2 search query.
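
As an illustration of the ms_params keys listed above, the sketch below builds an example dictionary and shows how a tolerance given in mDa or ppm might be converted to Daltons for a given m/z. The values are illustrative only, the adduct name string is an assumption about the expected format, and tolerance_in_daltons is a hypothetical helper, not part of the module.

```python
# Example search parameters; every value here is illustrative.
ms_params = {
    "tolerance": 5.0,          # interpreted as ppm because "ppm" is True
    "ppm": True,
    "charge": True,            # positive mode
    "energy_level": 20,        # one of 10, 20, or 40
    "scoring_function": "dot product",
    "adducts": ["[M+H]+"],     # assumed adduct name format
    "models": ["eco"],
    "kovats": (500, 1000),
    "logp": (-1, 2),
    "halogens": True,          # filter out F, Cl, Br compounds
}


def tolerance_in_daltons(mz: float, tolerance: float, ppm: bool = False) -> float:
    """Convert a tolerance in ppm or mDa into Daltons (hypothetical helper)."""
    if ppm:
        return mz * tolerance * 1e-6  # parts per million of the m/z value
    return tolerance * 1e-3  # mDa to Da


window = tolerance_in_daltons(500.0, ms_params["tolerance"], ms_params["ppm"])
```

A ppm tolerance scales with m/z, whereas an mDa tolerance gives the same absolute window for every peak.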

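minedatabase.metabolomics.ms_adduct_search(db: MINE, core_db: MINE, keggdb: pymongo.database.Database, text: str, text_type: str, ms_params: dict) -> list

Search for compound-adducts matching precursor mass.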

Parameters
db : MINE

Contains compound documents to search.

core_db : MINE

Contains extra info (including spectra) for compounds in db.

keggdb : pymongo.database.Database

Contains models with associated compound documents.

text : str

Text as in metabolomics datafile for specific peak.

text_type : str

Type of metabolomics datafile (mgf, mzXML, and msp are supported). For plain text input, set text_type to “form”; m/z values are then assumed to be separated by newlines.

ms_params : dict

“tolerance”: float specifying tolerance for m/z, in mDa by default. Can specify in ppm if “ppm” key’s value is set to True.

“adducts”: list of adducts to use. If not specified, uses all adducts.

“models”: List of model _ids. If supplied, score compounds higher if present in model. [“eco”] by default (E. coli).

“ppm”: bool specifying whether “tolerance” is in mDa or ppm. Default value for ppm is False (so tolerance is in mDa by default).

“kovats”: length 2 tuple specifying min and max kovats retention index to filter compounds (e.g. (500, 1000)).

“logp”: length 2 tuple specifying min and max logp to filter compounds (e.g. (-1, 2)).

“halogens”: bool specifying whether to filter out compounds containing F, Cl, or Br. Filtered out if set to True. False by default.

Returns
ms_adduct_output : list

Compound JSON documents matching ms adduct query.
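
The “logp” and “halogens” filters can be sketched as a post-filter over compound documents. This is an assumption about how such filtering might work, not the library's code: the document keys "Formula" and "logP" and the helper names contains_halogen and apply_filters are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

HALOGENS = {"F", "Cl", "Br"}


def contains_halogen(formula: str) -> bool:
    """True if the formula contains F, Cl, or Br as an element symbol."""
    # Split into element symbols so that "Fe" is not mistaken for "F".
    elements = set(re.findall(r"[A-Z][a-z]?", formula))
    return bool(elements & HALOGENS)


def apply_filters(
    compounds: List[Dict],
    logp: Optional[Tuple[float, float]] = None,
    halogens: bool = False,
) -> List[Dict]:
    """Drop compounds outside the logP window or, optionally, with halogens."""
    kept = []
    for cpd in compounds:
        if halogens and contains_halogen(cpd["Formula"]):
            continue
        if logp is not None and not (logp[0] <= cpd["logP"] <= logp[1]):
            continue
        kept.append(cpd)
    return kept


compounds = [
    {"_id": "c1", "Formula": "C6H12O6", "logP": -2.9},
    {"_id": "c2", "Formula": "C6H5Cl", "logP": 2.8},
    {"_id": "c3", "Formula": "C7H6O2", "logP": 1.9},
]
filtered = apply_filters(compounds, logp=(-1, 2), halogens=True)
```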

minedatabase.metabolomics.read_adduct_names(filepath: str) -> List[str]

Read adduct names from text file at specified path into a list.

Parameters
filepath : str

Path to adduct text file.

Returns
adducts : list

Names of adducts in text file.

Notes

Not used in this codebase but used by MINE-Server to validate adduct input.

minedatabase.metabolomics.read_mgf(input_string: str, charge: bool, ms2_delim='\t') -> List[Peak]

Parse mgf metabolomics data file.

Parameters
input_string : str

Metabolomics input data file.

charge : bool

True if positive, False if negative.

ms2_delim : str

Delimiter for whitespace between intensity and m/z value. Usually tab but can also be a space in some MGF files. Tab by default.

Returns
peaks : List[Peak]

A list of Peak objects.
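
To show the kind of input read_mgf handles, here is a minimal sketch of an MGF parser. It is not the library's parser: it returns plain dicts instead of Peak objects, and it assumes the common layout of BEGIN IONS / END IONS blocks with KEY=value headers followed by one "m/z, delimiter, intensity" pair per line. The name parse_mgf is hypothetical.

```python
from typing import Dict, List


def parse_mgf(text: str, ms2_delim: str = "\t") -> List[Dict]:
    """Parse MGF text into one dict per BEGIN IONS / END IONS block."""
    records: List[Dict] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"ms2": []}
        elif line == "END IONS":
            if current is not None:
                records.append(current)
            current = None
        elif current is None:
            continue  # ignore anything outside an ion block
        elif "=" in line and not line[0].isdigit():
            # Header line such as TITLE=..., PEPMASS=..., RTINSECONDS=...
            key, value = line.split("=", 1)
            current[key.strip().upper()] = value.strip()
        else:
            # Peak line: m/z and intensity separated by ms2_delim
            parts = [p for p in line.split(ms2_delim) if p]
            current["ms2"].append((float(parts[0]), float(parts[1])))
    return records


sample = (
    "BEGIN IONS\n"
    "TITLE=peak1\n"
    "PEPMASS=181.0707\n"
    "RTINSECONDS=65.2\n"
    "85.03\t100.0\n"
    "127.04\t50.0\n"
    "END IONS\n"
)
records = parse_mgf(sample)
```

Passing ms2_delim=" " handles files that separate m/z and intensity with a space instead of a tab.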

minedatabase.metabolomics.read_msp(input_string: str, charge: bool) -> List[Peak]

Parse msp metabolomics data file.

Parameters
input_string : str

Metabolomics input data file.

charge : bool

True if positive, False if negative.

Returns
peaks : List[Peak]

A list of Peak objects.

minedatabase.metabolomics.read_mzxml(input_string: str, charge: bool) -> List[Peak]

Parse mzXML metabolomics data file.

Parameters
input_string : str

Metabolomics input data file.

charge : bool

True if positive, False if negative.

Returns
List[Peak]

A list of Peak objects.

minedatabase.metabolomics.score_compounds(compounds: list, model_id: str = None, core_db: pymongo.database = None, mine_db: pymongo.database = None, kegg_db: pymongo.database = None, parent_frac: float = 0.75, reaction_frac: float = 0.25, get_native: bool = False) -> List[dict]

This function validates compounds against a metabolic model, returning only the compounds which pass.

Parameters
db : Mongo DB

Should contain a “models” collection with compound and reaction IDs listed.

core_db : Mongo DB

Core MINE database.

compounds : list

Each element is a dict describing that compound. Should have an ‘_id’ field.

model_id : str

KEGG organism code (e.g. ‘hsa’).

parent_frac : float, optional

Weighting for compounds derived from compounds in the provided model. 0.75 by default.

reaction_frac : float, optional

Weighting for compounds derived from known compounds not in the model. 0.25 by default.

Returns
compounds : List[dict]

Modified version of input compounds list, where each compound now has a ‘Likelihood_score’ key and value between 0 and 1.
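
One plausible reading of the two weights is sketched below. This is an assumption for illustration and may differ from what score_compounds actually does: compounds found in the model score 1.0, compounds derived from model compounds score parent_frac, and all others score reaction_frac. The helper likelihood_scores and its set arguments are hypothetical.

```python
from typing import Dict, List, Set


def likelihood_scores(
    compounds: List[Dict],
    model_cpds: Set[str],
    derived_from_model: Set[str],
    parent_frac: float = 0.75,
    reaction_frac: float = 0.25,
) -> List[Dict]:
    """Attach a 'Likelihood_score' between 0 and 1 to each compound dict."""
    scored = []
    for cpd in compounds:
        cpd_id = cpd["_id"]
        if cpd_id in model_cpds:
            score = 1.0  # assumption: compounds in the model score highest
        elif cpd_id in derived_from_model:
            score = parent_frac
        else:
            score = reaction_frac
        scored.append({**cpd, "Likelihood_score": score})
    return scored


hits = [{"_id": "C1"}, {"_id": "C2"}, {"_id": "C3"}]
scored = likelihood_scores(hits, model_cpds={"C1"}, derived_from_model={"C2"})
```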

minedatabase.metabolomics.spectra_download(db: MINE, mongo_id: Optional[str] = None) -> str

Download one or more spectra for compounds matching a given query.

Parameters
db : MINE

Contains compound documents to search.

mongo_query : str, optional (default: None)

A valid Mongo query as a literal string. If None, all compound spectra are returned.

parent_filter : str, optional (default: None)

If set to a metabolic model’s Mongo _id, only get spectra for compounds in or derived from that metabolic model.

putative : bool, optional (default: True)

If False, only find known compounds (i.e. in Generation 0). Otherwise, finds both known and predicted compounds.

Returns
spectral_library : str

Text of all matching spectra, including headers and peak lists.