7.4. Metabolomics
Provides functionality to interact with metabolomics datasets and search MINE databases for metabolomics hits.
- class minedatabase.metabolomics.MetabolomicsDataset(name: str, adducts: Optional[List[str]] = None, known_peaks: Optional[List[Peak]] = None, unknown_peaks: Optional[List[Peak]] = None, native_set: Set[str] = {}, ppm: bool = False, tolerance: float = 0.001, halogens: bool = False, verbose: bool = False)
A class containing all the information for a metabolomics data set.
- annotate_peaks(db: MINE, core_db: MINE) -> None
This function iterates through the unknown peaks in the dataset and searches the database for compounds that match a peak m/z given the adducts permitted. Statistics on the annotated data set are printed.
- Parameters
- db : MINE
MINE database.
- core_db : MINE
Core database containing spectra info.
- check_product_of_native(cpd_ids: List[str], db: MINE) -> List[str]
Filters list of compound IDs to just those associated with compounds produced from a native hit in the model (i.e. in native set).
- enumerate_possible_masses(tolerance: float) -> None
Generate all possible masses from unknown peaks and list of adducts. Saves these mass ranges to self.possible_ranges.
- Parameters
- tolerance : float
Mass tolerance in Daltons.
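As a rough illustration of the enumeration step, the sketch below derives a neutral-mass window for every peak/adduct pair. It assumes the adduct tuple layout described under find_db_hits, (adduct name, mass multiplier, ion mass), and it assumes that the observed m/z equals mass × multiplier + ion mass. The helper name `possible_mass_ranges` and the example adduct values are illustrative and are not part of minedatabase.

```python
from typing import List, Tuple

Adduct = Tuple[str, float, float]  # (name, mass multiplier, ion mass)


def possible_mass_ranges(
    peaks: List[Tuple[str, float]],  # (peak name, observed m/z)
    adducts: List[Adduct],
    tolerance: float,  # in Daltons
) -> List[Tuple[float, float, str, str]]:
    """Return (low, high, peak name, adduct name) for each peak/adduct pair."""
    ranges = []
    for peak_name, mz in peaks:
        for adduct_name, multiplier, ion_mass in adducts:
            # Assumed relation: mz = mass * multiplier + ion_mass
            mass = (mz - ion_mass) / multiplier
            ranges.append(
                (mass - tolerance, mass + tolerance, peak_name, adduct_name)
            )
    return ranges


# Illustrative values only: a protonated and a sodiated adduct.
example_adducts = [("[M+H]+", 1.0, 1.007276), ("[M+Na]+", 1.0, 22.989218)]
example_ranges = possible_mass_ranges(
    [("peak1", 181.070665)], example_adducts, tolerance=0.001
)
```

Each window can then be compared against compound masses in the database; how the library stores or queries these ranges beyond self.possible_ranges is not shown here.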
- find_db_hits(peak: Peak, db: MINE, core_db: MINE, adducts: List[Tuple[str, float, float]]) -> None
This function searches the database for matches of a peak given adducts and updates the peak object with that information.
- Parameters
- peak : Peak
Peak object to query against MINE compound database.
- db : MINE
MINE database to query.
- adducts : List[Tuple[str, float, float]]
List of adducts. Each adduct contains three values in a tuple: (adduct name, mass multiplier, ion mass).
- get_rt(peak_id: str) -> Optional[float]
Return retention time for peak with given ID. If not found, returns None.
- Parameters
- peak_id : str
ID of peak as listed in dataset.
- Returns
- rt : float, optional
Retention time of peak with given ID, None if not found.
- class minedatabase.metabolomics.Peak(name: str, r_time: float, mz: float, charge: str, inchi_key: str = None, ms2: List[float, float] = None)
Peak object which contains peak metadata as well as mass, retention time, spectra, and any MINE database hits.
- Parameters
- name : str
Name or ID of the peak.
- r_time : float
Retention time of the peak.
- mz : float
Mass-to-charge ratio (m/z) of the peak.
- charge : str
Charge of the peak, “+” or “-”.
- inchi_key : str, optional
InChI key of the peak, if already identified, by default None.
- ms2 : List[float], optional
MS2 spectra m/z values for this peak, by default None.
- Attributes
- isomers : List[Dict]
List of compound documents in JSON (dict) format.
- formulas : Set[str]
All the unique compound formulas from compounds found for this peak.
- total_hits : int
Number of compound hits for this peak.
- native_hit : bool
Whether this peak matches a compound provided in the native set.
- score_isomers(metric: Callable[[list, list], float] = dot_product, energy_level: int = 20, tolerance: float = 0.005) -> None
Scores and sorts isomers based on mass spectra data.
Calculates the cosine similarity score between the provided ms2 peak list and pre-calculated CFM spectra, then sorts the isomer list according to this metric.
- Parameters
- metric : function, optional
The scoring metric to use for the spectra. Function must accept 2 lists of (mz, intensity) tuples and return a score, by default dot_product.
- energy_level : int, optional
The fragmentation energy level to use. May be 10, 20, or 40, by default 20.
- tolerance : float, optional
The precision to use for matching m/z in mDa, by default 0.005.
- Raises
- ValueError
Empty ms2 peak.
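The score-then-sort pattern described for score_isomers can be sketched as follows. This is not the library's implementation: the metric `shared_peak_fraction` is a toy stand-in, and the dictionary keys `predicted_spectrum` and `score` are hypothetical names chosen for this example only.

```python
from typing import Callable, Dict, List, Tuple

Spectrum = List[Tuple[float, float]]  # (m/z, intensity) pairs


def shared_peak_fraction(x: Spectrum, y: Spectrum, tol: float = 0.005) -> float:
    """Toy metric: fraction of peaks in x with a partner in y within tol."""
    if not x:
        raise ValueError("Empty ms2 peak list")
    matched = sum(
        1 for mz_x, _ in x if any(abs(mz_x - mz_y) <= tol for mz_y, _ in y)
    )
    return matched / len(x)


def rank_isomers(
    ms2: Spectrum,
    isomers: List[Dict],
    metric: Callable[[Spectrum, Spectrum], float] = shared_peak_fraction,
) -> List[Dict]:
    """Attach a score to each candidate and return them sorted best-first."""
    for isomer in isomers:
        # "predicted_spectrum" and "score" are hypothetical keys for this sketch.
        isomer["score"] = metric(ms2, isomer["predicted_spectrum"])
    return sorted(isomers, key=lambda doc: doc["score"], reverse=True)


observed = [(50.0, 10.0), (75.0, 100.0)]
candidates = [
    {"_id": "A", "predicted_spectrum": [(50.001, 5.0)]},
    {"_id": "B", "predicted_spectrum": [(50.0, 5.0), (75.002, 80.0)]},
]
ranked = rank_isomers(observed, candidates)
```

Passing the metric as a callable mirrors the `metric` parameter above, so a different scoring function can be swapped in without changing the ranking code.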
- class minedatabase.metabolomics.Struct(**entries)
Convert key-value pairs into object-attribute pairs.
- minedatabase.metabolomics.dot_product(x: List[tuple], y: List[tuple], epsilon: float = 0.01) -> float
Calculate the dot product of two spectra, allowing for some variability in mass-to-charge ratios.
- Parameters
- x : List[tuple]
First spectra m/z values.
- y : List[tuple]
Second spectra m/z values.
- epsilon : float, optional
Mass tolerance in Daltons, by default 0.01.
- Returns
- dot_prod : float
Dot product of x and y.
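A tolerance-aware dot product of this kind can be sketched as below. This is a minimal illustration under stated assumptions, not the code in minedatabase: spectra are taken to be lists of (m/z, intensity) tuples, peaks are paired greedily in m/z order, and the result is normalised by the norms of all intensities (the library may normalise differently). The name `tolerant_cosine` is hypothetical.

```python
import math
from typing import List, Tuple

Spectrum = List[Tuple[float, float]]  # (m/z, intensity) pairs


def tolerant_cosine(x: Spectrum, y: Spectrum, epsilon: float = 0.01) -> float:
    """Normalised dot product over peaks whose m/z differ by <= epsilon."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    dot = 0.0
    # Two-pointer walk over both spectra in ascending m/z order.
    while i < len(xs) and j < len(ys):
        diff = xs[i][0] - ys[j][0]
        if abs(diff) <= epsilon:
            dot += xs[i][1] * ys[j][1]
            i += 1
            j += 1
        elif diff < 0:
            i += 1
        else:
            j += 1
    norm_x = math.sqrt(sum(intensity ** 2 for _, intensity in xs))
    norm_y = math.sqrt(sum(intensity ** 2 for _, intensity in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


reference = [(100.0, 3.0), (200.0, 4.0)]
shifted = [(100.005, 3.0), (200.5, 4.0)]  # only the first peak is within 0.01
```

With these inputs, only the peak near m/z 100 contributes to the numerator, so the score drops well below the value obtained when a spectrum is compared with itself.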
- minedatabase.metabolomics.get_KEGG_comps(db: MINE, core_db: MINE, kegg_db: Database, model_ids: List[str]) -> set
Get MINE IDs from KEGG MINE database for compounds in model(s).
- Parameters
- db : MINE
MINE Mongo database.
- kegg_db : pymongo.database.Database
Mongo database with annotated organism metabolomes from KEGG.
- model_ids : List[str]
List of organism identifiers from KEGG.
- Returns
- set
MINE IDs of compounds that are linked to a KEGG ID in at least one of the organisms in model_ids.
- minedatabase.metabolomics.jaccard(x: List[tuple], y: List[tuple], epsilon: float = 0.01) -> float
Calculate the Jaccard Index of two spectra, allowing for some variability in mass-to-charge ratios.
- Parameters
- x : List[tuple]
First spectra m/z values.
- y : List[tuple]
Second spectra m/z values.
- epsilon : float, optional
Mass tolerance in Daltons, by default 0.01.
- Returns
- jaccard_index : float
Jaccard Index of x and y.
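For comparison with the dot product, a tolerance-aware Jaccard index could look like the sketch below. It assumes spectra are lists of (m/z, intensity) tuples, ignores intensities, and counts a peak as shared when its m/z lies within epsilon of an unmatched peak in the other spectrum. The function name `tolerant_jaccard` is hypothetical and the matching strategy is an assumption, not the library's documented behaviour.

```python
from typing import List, Tuple

Spectrum = List[Tuple[float, float]]  # (m/z, intensity) pairs


def tolerant_jaccard(x: Spectrum, y: Spectrum, epsilon: float = 0.01) -> float:
    """Matched peaks divided by the size of the union of peaks."""
    xs = sorted(mz for mz, _ in x)
    ys = sorted(mz for mz, _ in y)
    i = j = matched = 0
    # Two-pointer walk; each peak can be matched at most once.
    while i < len(xs) and j < len(ys):
        diff = xs[i] - ys[j]
        if abs(diff) <= epsilon:
            matched += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1
        else:
            j += 1
    union = len(xs) + len(ys) - matched
    return matched / union if union else 0.0


spectrum_a = [(100.0, 1.0), (200.0, 1.0), (300.0, 1.0)]
spectrum_b = [(100.004, 1.0), (250.0, 1.0)]
```

Here one peak is shared out of four distinct peaks overall, so the index is 1/4.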
- minedatabase.metabolomics.ms2_search(db: MINE, core_db: MINE, keggdb: Database, text: str, text_type: str, ms_params) -> List
Search for compounds matching MS2 spectra.
- Parameters
- db : MINE
Contains compound documents to search.
- core_db : MINE
Contains extra info (including spectra) for compounds in db.
- keggdb : pymongo.database.Database
Contains models with associated compound documents.
- text : str
Text as in metabolomics datafile for specific peak.
- text_type : str
Type of metabolomics datafile (mgf, mzXML, and msp are supported). For plain text, set text_type to “form”; m/z values are then assumed to be separated by newlines.
- ms_params : dict
- “tolerance”: float specifying tolerance for m/z, in mDa by default. Can be specified in ppm if the “ppm” key is set to True.
- “charge”: bool (1 for positive, 0 for negative).
- “energy_level”: int specifying fragmentation energy level to use. May be 10, 20, or 40.
- “scoring_function”: str describing which scoring function to use. Can be either “jaccard” or “dot product”.
- “adducts”: list of adducts to use. If not specified, uses all adducts.
- “models”: list of model _ids. If supplied, compounds present in a model are scored higher.
- “ppm”: bool specifying whether “tolerance” is in mDa or ppm. False by default (so tolerance is in mDa).
- “kovats”: length 2 tuple specifying min and max Kovats retention index to filter compounds (e.g. (500, 1000)).
- “logp”: length 2 tuple specifying min and max logP to filter compounds (e.g. (-1, 2)).
- “halogens”: bool specifying whether to filter out compounds containing F, Cl, or Br. Filtered out if set to True. False by default.
- Returns
- ms_adduct_output : list
Compound JSON documents matching ms2 search query.
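To make the ms_params keys concrete, the following is one possible parameter dictionary for a positive-mode MS2 search. All values are illustrative; in particular the adduct name "[M+H]+" and the model code "eco" are assumptions about what a given MINE database contains.

```python
# Hypothetical parameter set for an MS2 search in positive mode.
ms2_params = {
    "tolerance": 5.0,  # m/z tolerance; read as mDa because "ppm" is False
    "ppm": False,
    "charge": True,  # True (1) for positive mode, False (0) for negative
    "energy_level": 20,  # must be 10, 20, or 40
    "scoring_function": "dot product",  # or "jaccard"
    "adducts": ["[M+H]+"],  # assumed adduct name; omit the key to use all adducts
    "models": ["eco"],  # assumed model _id (KEGG organism code)
    "kovats": (500, 1000),  # (min, max) Kovats retention index
    "logp": (-1, 2),  # (min, max) logP
    "halogens": False,  # True filters out compounds containing F, Cl, or Br
}
```

A dictionary like this would be passed as the ms_params argument, alongside the db, core_db, and keggdb handles and the peak text.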
- minedatabase.metabolomics.ms_adduct_search(db: MINE, core_db: MINE, keggdb: Database, text: str, text_type: str, ms_params) -> List
Search for compound-adducts matching precursor mass.
- Parameters
- db : MINE
Contains compound documents to search.
- core_db : MINE
Contains extra info (including spectra) for compounds in db.
- keggdb : pymongo.database.Database
Contains models with associated compound documents.
- text : str
Text as in metabolomics datafile for specific peak.
- text_type : str
Type of metabolomics datafile (mgf, mzXML, and msp are supported). For plain text, set text_type to “form”; m/z values are then assumed to be separated by newlines.
- ms_params : dict
- “tolerance”: float specifying tolerance for m/z, in mDa by default. Can be specified in ppm if the “ppm” key is set to True.
- “adducts”: list of adducts to use. If not specified, uses all adducts.
- “models”: list of model _ids. If supplied, compounds present in a model are scored higher. [“eco”] by default (E. coli).
- “ppm”: bool specifying whether “tolerance” is in mDa or ppm. False by default (so tolerance is in mDa).
- “kovats”: length 2 tuple specifying min and max Kovats retention index to filter compounds (e.g. (500, 1000)).
- “logp”: length 2 tuple specifying min and max logP to filter compounds (e.g. (-1, 2)).
- “halogens”: bool specifying whether to filter out compounds containing F, Cl, or Br. Filtered out if set to True. False by default.
- Returns
- ms_adduct_output : list
Compound JSON documents matching ms adduct query.
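The difference between the mDa and ppm tolerance modes comes down to simple arithmetic, sketched below. The helper names are hypothetical; the conversion factors (1 Da = 1000 mDa, and 1 ppm = one millionth of the mass) are standard, but exactly where the library applies them is not shown in this reference.

```python
from typing import Tuple


def tolerance_in_daltons(mass: float, tolerance: float, ppm: bool = False) -> float:
    """Convert a tolerance given in mDa (default) or ppm to Daltons."""
    if ppm:
        return mass * tolerance / 1e6  # parts per million of the mass
    return tolerance / 1000.0  # mDa -> Da


def mass_window(mass: float, tolerance: float, ppm: bool = False) -> Tuple[float, float]:
    """Return the (low, high) mass range implied by the tolerance."""
    delta = tolerance_in_daltons(mass, tolerance, ppm)
    return mass - delta, mass + delta


# For a mass of 500 Da, 10 ppm and 5 mDa give the same half-width of 0.005 Da.
ppm_delta = tolerance_in_daltons(500.0, 10.0, ppm=True)
mda_delta = tolerance_in_daltons(500.0, 5.0)
```

A ppm tolerance scales with mass, so the window widens for heavier compounds, whereas an mDa tolerance is constant across the mass range.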
- minedatabase.metabolomics.read_adduct_names(filepath: str) -> List[str]
Read adduct names from text file at specified path into a list.
- Parameters
- filepath : str
Path to adduct text file.
- Returns
- adducts : list
Names of adducts in text file.
Notes
Not used in this codebase but used by MINE-Server to validate adduct input.
- minedatabase.metabolomics.read_mgf(input_string: str, charge: bool, ms2_delim='\t') -> List[Peak]
Parse mgf metabolomics data file.
- Parameters
- input_string : str
Metabolomics input data file.
- charge : bool
True if positive, False if negative.
- ms2_delim : str
Delimiter for whitespace between intensity and m/z value. Usually tab but can also be a space in some MGF files. Tab by default.
- Returns
- peaks : List[Peak]
A list of Peak objects.
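To show what the ms2_delim parameter is for, here is a minimal sketch of splitting MGF text into per-spectrum records. It is not the library's parser and returns plain dictionaries rather than Peak objects. The function name `parse_mgf_blocks` is hypothetical, and the sample uses common MGF header keys (TITLE, PEPMASS, RTINSECONDS, CHARGE), which a given file may or may not contain.

```python
from typing import Dict, List


def parse_mgf_blocks(text: str, ms2_delim: str = "\t") -> List[Dict]:
    """Split MGF text into one dict per BEGIN IONS / END IONS block."""
    blocks: List[Dict] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"ms2": []}
        elif line == "END IONS":
            if current is not None:
                blocks.append(current)
            current = None
        elif current is None:
            continue  # ignore anything outside an ion block
        elif "=" in line and not line[0].isdigit():
            # Header line such as TITLE=... or PEPMASS=...
            key, value = line.split("=", 1)
            current[key] = value
        else:
            # Fragment line: m/z and intensity separated by ms2_delim
            mz, intensity = line.split(ms2_delim)[:2]
            current["ms2"].append((float(mz), float(intensity)))
    return blocks


example_mgf = (
    "BEGIN IONS\n"
    "TITLE=peak1\n"
    "PEPMASS=181.0707\n"
    "RTINSECONDS=42.5\n"
    "CHARGE=1+\n"
    "53.0\t12.0\n"
    "89.1\t100.0\n"
    "END IONS\n"
)
parsed = parse_mgf_blocks(example_mgf)
```

For files that separate m/z and intensity with a space, the same sketch would be called with ms2_delim=" ".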
- minedatabase.metabolomics.read_msp(input_string: str, charge: bool) -> List[Peak]
Parse msp metabolomics data file.
- Parameters
- input_string : str
Metabolomics input data file.
- charge : bool
True if positive, False if negative.
- Returns
- peaks : List[Peak]
A list of Peak objects.
- minedatabase.metabolomics.read_mzxml(input_string: str, charge: bool) -> List[Peak]
Parse mzXML metabolomics data file.
- Parameters
- input_string : str
Metabolomics input data file.
- charge : bool
True if positive, False if negative.
- Returns
- List[Peak]
A list of Peak objects.
- minedatabase.metabolomics.score_compounds(compounds: list, model_id: str = None, core_db: pymongo.database = None, mine_db: pymongo.database = None, kegg_db: pymongo.database = None, parent_frac: float = 0.75, reaction_frac: float = 0.25, get_native: bool = False) -> List[dict]
This function validates compounds against a metabolic model, returning only the compounds which pass.
- Parameters
- db : Mongo DB
Should contain a “models” collection with compound and reaction IDs listed.
- core_db : Mongo DB
Core MINE database.
- compounds : list
Each element is a dict describing that compound. Should have an ‘_id’ field.
- model_id : str
KEGG organism code (e.g. ‘hsa’).
- parent_frac : float, optional
Weighting for compounds derived from compounds in the provided model. 0.75 by default.
- reaction_frac : float, optional
Weighting for compounds derived from known compounds not in the model. 0.25 by default.
- Returns
- compounds : List[dict]
Modified version of input compounds list, where each compound now has a ‘Likelihood_score’ key and value between 0 and 1.
- minedatabase.metabolomics.spectra_download(db: MINE, mongo_id: Optional[str] = None) -> str
Download one or more spectra for compounds matching a given query.
- Parameters
- db : MINE
Contains compound documents to search.
- mongo_query : str, optional (default: None)
A valid Mongo query as a literal string. If None, all compound spectra are returned.
- parent_filter : str, optional (default: None)
If set to a metabolic model’s Mongo _id, only get spectra for compounds in or derived from that metabolic model.
- putative : bool, optional (default: True)
If False, only find known compounds (i.e. in Generation 0). Otherwise, finds both known and predicted compounds.
- Returns
- spectral_library : str
Text of all matching spectra, including headers and peak lists.