00 Compound Profile
Jupyter notebook from the ENIGMA Carbon Census 1 project.
NB00: Compound Profile (ENIGMA Carbon Census 1)
Load the source spreadsheet, isolate the 83 selected compounds, and profile their chemical-class composition, source (groundwater vs necromass), and physicochemical properties (MW, LogP). Outputs a clean compounds table for downstream identity resolution (NB01) and figures for the report.
In [1]:
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from pathlib import Path
DATA = Path('../data'); FIG = Path('../figures')
DATA.mkdir(exist_ok=True); FIG.mkdir(exist_ok=True)
pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 160)
In [2]:
raw = pd.read_excel('../user_data/Carbon_Census_Compound_Selections.xlsx')
print('raw shape:', raw.shape)
print('columns:', list(raw.columns))
raw shape: (92, 21)
columns: ['#', 'Include?', 'Unique ID', 'Source', 'Name', 'NPC Pathway', 'NPC Superclass', 'NPC Class', 'Orig. Class', 'In House?', 'Price/mg', 'Price ($)', 'amount per container', 'Primary Supplier', 'Link', 'MW (g/mol)', 'LogP', 'H2O Sol. (mM)', 'H2O Sol. (source)', 'DMSO Sol. note', 'DMSO flag']
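Select the 83 compounds
A compound is selected when its `Include?` value is either "Yes" or "Yes, expensive". The remaining 9 of the 92 rows have `Include?` = NaN; these are candidates that were considered but excluded.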
In [3]:
SEL = {'Yes', 'Yes, expensive'}
df = raw[raw['Include?'].isin(SEL)].copy().reset_index(drop=True)
assert len(df) == 83, f'expected 83 selected, got {len(df)}'
df['expensive'] = df['Include?'].eq('Yes, expensive')
df['source_short'] = df['Source'].map({
    'Carbon Census SSO Groundwater': 'groundwater',
    'Necromass': 'necromass'})
print('selected:', len(df))
print(df['source_short'].value_counts())
print('expensive:', int(df['expensive'].sum()))
selected: 83
source_short
groundwater    59
necromass      24
Name: count, dtype: int64
expensive: 11
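The `isin` filter silently drops any row whose `Include?` value is not an exact match, so a label with a stray space or different capitalization would be lost without warning. A minimal sketch of a check for that, run on a toy frame (the rows and the `' yes '` label are invented for illustration and are not taken from the spreadsheet):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the spreadsheet; values are illustrative only.
toy = pd.DataFrame({
    'Include?': ['Yes', 'Yes, expensive', np.nan, ' yes ', 'Yes'],
    'Name': ['a', 'b', 'c', 'd', 'e'],
})

SEL = {'Yes', 'Yes, expensive'}
selected = toy['Include?'].isin(SEL)

# Non-blank labels that did not match SEL: these would be dropped silently.
unexpected = toy.loc[~selected & toy['Include?'].notna(), 'Include?'].tolist()

print('selected:', int(selected.sum()))
print('unexpected labels:', unexpected)
```

On the real data an empty `unexpected` list would confirm that every unselected row is a blank, which is what the text above states.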
Chemical-class composition (NP Classifier ontology)
In [4]:
for col in ['NPC Pathway', 'NPC Superclass', 'NPC Class']:
    print(f'\n=== {col} ===')
    print(df[col].value_counts(dropna=False).to_string())
=== NPC Pathway ===
NPC Pathway
Alkaloids                                   26
Shikimates and Phenylpropanoids             25
Terpenoids                                  17
Fatty acids                                 10
Polyketides                                  2
Amino acids and Peptides                     2
Alkaloids, Shikimates and Phenylpropanoids   1

=== NPC Superclass ===
NPC Superclass
Fatty Acids and Conjugates                         9
Phenolic acids (C6-C1)                             8
Monoterpenoids                                     7
Tryptophan alkaloids                               6
Phenylpropanoids (C6-C3)                           6
NaN                                                5
Tryptophan alkaloids, Anthranilic acid alkaloids   4
Coumarins                                          4
Sesquiterpenoids                                   4
Flavonoids                                         3
Tyrosine alkaloids                                 3
Anthranilic acid alkaloids                         3
Chromanes                                          2
Diterpenoids                                       2
Apocarotenoids                                     2
Tryptophan alkaloids|Anthranilic acid alkaloids    2
Sesquiterpenoids|Monoterpenoids                    2
Nicotinic acid alkaloids                           2
Amino acids and derivatives                        1
Organoheterocyclic compounds                       1
Histidine alkaloids                                1
Lysine alkaloids                                   1
Pseudoalkaloids                                    1
Pseudoalkaloids, Ornithine alkaloids               1
Tryptophan alkaloids, Phenylpropanoids (C6-C3)     1
Small peptides                                     1
Glycerophospholipids                               1

=== NPC Class ===
NPC Class
Quinoline alkaloids                                       7
NaN                                                       7
Simple phenolic acids                                     6
Dicarboxylic acids                                        5
Cinnamic acids and derivatives                            5
Simple indole alkaloids                                   4
Simple coumarins                                          3
Menthane monoterpenoids, Monocyclic monoterpenoids        3
Chromones                                                 2
Chalcones                                                 2
Shikimic acids and derivatives, Simple phenolic acids     2
Labdane diterpenoids                                      2
Phenylethylamines                                         2
Hydroxy fatty acids                                       2
Pinane monoterpenoids                                     2
Branched fatty acids|Unsaturated fatty acids              2
Acyclic monoterpenoids|Farnesane sesquiterpenoids         2
Pyridine alkaloids                                        2
Carboline alkaloids                                       1
Isocoumarins                                              1
Betaines                                                  1
Benzimidazoles                                            1
Acridone alkaloids                                        1
Imidazole alkaloids                                       1
Piperidine alkaloids                                      1
Purine alkaloids                                          1
Acetate-derived alkaloids, Polyamines                     1
Cinnamic acids and derivatives, Simple indole alkaloids   1
Aminoacids                                                1
Flavanones                                                1
Apocarotenoids (β-)                                       1
Apocarotenoids(ε-)                                        1
Menthane monoterpenoids                                   1
Eremophilane sesquiterpenoids                             1
Phenazine alkaloids                                       1
Camphane monoterpenoids                                   1
Bisabolane sesquiterpenoids                               1
Longifolane sesquiterpenoids                              1
Caryophyllane sesquiterpenoids                            1
Isoquinoline alkaloids|Tetrahydroisoquinoline alkaloids   1
Anthranillic acid derivatives                             1
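The counts above treat each multi-label string as its own category, and the same pair of labels appears with two different separators (for example `Tryptophan alkaloids, Anthranilic acid alkaloids` and `Tryptophan alkaloids|Anthranilic acid alkaloids`). If per-label counts are wanted, one option is to split on both separators and explode. The sketch below runs on a toy frame and assumes that no single label contains a comma or a pipe, which has not been verified against the full ontology:

```python
import pandas as pd

# Toy rows mimicking the two separator styles seen in the output above.
toy = pd.DataFrame({
    'compound_id': ['c1', 'c2', 'c3', 'c4'],
    'NPC Superclass': [
        'Tryptophan alkaloids, Anthranilic acid alkaloids',
        'Tryptophan alkaloids|Anthranilic acid alkaloids',
        'Coumarins',
        None,
    ],
})

# Split on "," or "|" with optional surrounding whitespace.
# Assumption: no individual label contains either character.
labels = toy['NPC Superclass'].str.split(r'\s*[|,]\s*', regex=True)

# One row per (compound, label); missing annotations stay as a single NaN row.
long = toy.assign(label=labels).explode('label')
counts = long['label'].value_counts()
print(counts.to_string())
```

Because a compound with two labels contributes two rows, the per-label counts no longer sum to the number of compounds.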
In [5]:
# class x source crosstab (NPC Pathway)
ct = pd.crosstab(df['NPC Pathway'].fillna('(none)'), df['source_short'])
ct['total'] = ct.sum(axis=1)
ct = ct.sort_values('total', ascending=False)
print(ct.to_string())
source_short                                groundwater  necromass  total
NPC Pathway
Alkaloids                                            17          9     26
Shikimates and Phenylpropanoids                      20          5     25
Terpenoids                                            9          8     17
Fatty acids                                           8          2     10
Amino acids and Peptides                              2          0      2
Polyketides                                           2          0      2
Alkaloids, Shikimates and Phenylpropanoids            1          0      1
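Raw counts make it hard to compare pathways of different sizes: in the table above, necromass accounts for 8 of 17 terpenoids (about 47%) but only 5 of 25 shikimates and phenylpropanoids (20%). A minimal sketch of the same crosstab expressed as row fractions, using `normalize='index'` on toy data (the toy rows are illustrative and do not reproduce the real counts):

```python
import pandas as pd

# Toy data; the pathway and source names follow the notebook, the rows do not.
toy = pd.DataFrame({
    'NPC Pathway': ['Alkaloids', 'Alkaloids', 'Alkaloids',
                    'Terpenoids', 'Terpenoids', 'Fatty acids'],
    'source_short': ['groundwater', 'groundwater', 'necromass',
                     'groundwater', 'necromass', 'groundwater'],
})

# normalize='index' divides each row by its row total, so every row sums to 1.
frac = pd.crosstab(toy['NPC Pathway'], toy['source_short'], normalize='index')
print(frac.round(2).to_string())
```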
Physicochemical properties
In [6]:
for col in ['MW (g/mol)', 'LogP', 'H2O Sol. (mM)']:
    s = pd.to_numeric(df[col], errors='coerce')
    print(f'{col:16s} n={s.notna().sum():3d} min={s.min():.2f} '
          f'med={s.median():.2f} max={s.max():.2f}')
MW (g/mol)       n= 83 min=102.18 med=175.18 max=308.50
LogP             n= 82 min=-2.30 med=2.00 max=6.40
H2O Sol. (mM)    n= 23 min=0.20 med=5.60 max=75812.80
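Two things stand out in this summary: water solubility is available for only 23 of 83 compounds, and its values run from 0.20 to 75812.80 mM, more than five orders of magnitude. A minimal sketch of how coverage could be reported per column and how the spread could be expressed on a log10 scale, again on toy data (the names and the `'n/a'` entry are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame; 'n/a' stands in for any non-numeric cell that to_numeric coerces to NaN.
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'LogP': [3.6, 'n/a', 2.0, 1.2],
    'H2O Sol. (mM)': [0.2, None, 5.6, 75812.8],
})

coverage, missing = {}, {}
for col in ['LogP', 'H2O Sol. (mM)']:
    s = pd.to_numeric(toy[col], errors='coerce')
    coverage[col] = float(s.notna().mean())            # fraction of rows with a value
    missing[col] = toy.loc[s.isna(), 'name'].tolist()  # which compounds lack it
    print(f'{col:16s} coverage={coverage[col]:.0%} missing={missing[col]}')

# Spread of solubility in orders of magnitude (NaN entries are skipped by min/max).
log_sol = np.log10(pd.to_numeric(toy['H2O Sol. (mM)'], errors='coerce'))
span = float(log_sol.max() - log_sol.min())
print(f'solubility spans {span:.1f} orders of magnitude')
```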
Figures
In [7]:
# Fig 1: NPC Pathway composition, stacked by source
ctp = pd.crosstab(df['NPC Pathway'].fillna('(none)'), df['source_short'])
ctp = ctp.loc[ctp.sum(axis=1).sort_values(ascending=False).index]
fig, ax = plt.subplots(figsize=(8, 4.5))
ctp.plot(kind='barh', stacked=True, ax=ax,
         color={'groundwater': '#3b7dd8', 'necromass': '#d98c3b'})
ax.set_xlabel('compounds'); ax.set_ylabel('NPC Pathway')
ax.set_title('Carbon Census: chemical-class composition by source (n=83)')
ax.invert_yaxis(); ax.legend(title='source')
fig.tight_layout(); fig.savefig(FIG / '00_class_composition.png', dpi=150)
print('saved 00_class_composition.png'); plt.close(fig)
saved 00_class_composition.png
In [8]:
# Fig 2: MW vs LogP, colored by source
mw = pd.to_numeric(df['MW (g/mol)'], errors='coerce')
logp = pd.to_numeric(df['LogP'], errors='coerce')
fig, ax = plt.subplots(figsize=(7, 5))
for src, c in [('groundwater', '#3b7dd8'), ('necromass', '#d98c3b')]:
    m = df['source_short'].eq(src)
    ax.scatter(mw[m], logp[m], c=c, label=src, alpha=0.75, edgecolor='k', linewidth=0.3)
ax.set_xlabel('MW (g/mol)'); ax.set_ylabel('LogP')
ax.set_title('Physicochemical space of the 83 compounds')
ax.axhline(0, color='grey', lw=0.5, ls='--'); ax.legend()
fig.tight_layout(); fig.savefig(FIG / '00_mw_logp.png', dpi=150)
print('saved 00_mw_logp.png'); plt.close(fig)
saved 00_mw_logp.png
Save clean compounds table for NB01
In [9]:
keep = ['Unique ID', 'Name', 'source_short', 'expensive',
        'NPC Pathway', 'NPC Superclass', 'NPC Class', 'Orig. Class',
        'MW (g/mol)', 'LogP', 'H2O Sol. (mM)']
clean = df[keep].rename(columns={
    'Unique ID': 'compound_id', 'Name': 'name',
    'NPC Pathway': 'npc_pathway', 'NPC Superclass': 'npc_superclass',
    'NPC Class': 'npc_class', 'Orig. Class': 'orig_class',
    'MW (g/mol)': 'mw', 'H2O Sol. (mM)': 'h2o_sol_mM'})
clean.to_csv(DATA / 'compounds_selected.tsv', sep='\t', index=False)
print('wrote data/compounds_selected.tsv', clean.shape)
clean.head(10)
wrote data/compounds_selected.tsv (83, 11)
Out[9]:
| | compound_id | name | source_short | expensive | npc_pathway | npc_superclass | npc_class | orig_class | mw | LogP | h2o_sol_mM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cc1_29 | harman | groundwater | True | Alkaloids | Tryptophan alkaloids | Carboline alkaloids | Alkaloid | 182.22 | 3.6 | NaN |
| 1 | Cc1_102 | 1_prop_2_en_1_yl__1H_indole_3_carboxylic_acid | groundwater | True | Alkaloids | Tryptophan alkaloids | Simple indole alkaloids | Indole | 201.22 | NaN | NaN |
| 2 | Cc1_22 | 2-hydroxy-4,7,8-trimethylquinoline | groundwater | True | Alkaloids | Tryptophan alkaloids, Anthranilic acid alkaloids | Quinoline alkaloids | Heterocyclic | 187.24 | 2.0 | NaN |
| 3 | Cc1_46 | 3,6-dimethylchromone | groundwater | True | Polyketides | Chromanes | Chromones | Chromone | 174.20 | 2.3 | NaN |
| 4 | Cc1_39 | 7-hydroxy-4,8-dimethylchromen-2-one | groundwater | True | Shikimates and Phenylpropanoids | Coumarins | Simple coumarins | Coumarin | 190.19 | 1.8 | NaN |
| 5 | Cc1_71 | Fraxetin | groundwater | True | Shikimates and Phenylpropanoids | Coumarins | Simple coumarins | Coumarin | 208.17 | 1.2 | NaN |
| 6 | Cc1_34 | 2',5'-dihydroxychalcone | groundwater | True | Shikimates and Phenylpropanoids | Flavonoids | Chalcones | Flavonoid | 240.25 | 3.5 | NaN |
| 7 | Cc1_60 | Phthalic acid mono-2-ethylhexyl ester | groundwater | True | Shikimates and Phenylpropanoids | Phenolic acids (C6-C1) | Shikimic acids and derivatives, Simple phenoli... | Unknown | 278.34 | 4.0 | 3.6 |
| 8 | Cc1_106 | 2-Isopropyl-5-methylcyclohexanamine | groundwater | True | Terpenoids | Monoterpenoids | Menthane monoterpenoids, Monocyclic monoterpen... | Alicyclic amine | 155.28 | 2.8 | NaN |
| 9 | CECREIRZLPLYDM-UHFFFAOYSA-N | Manool | necromass | True | Terpenoids | Diterpenoids | Labdane diterpenoids | NaN | 290.50 | 5.7 | NaN |
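Since NB01 consumes this file, it can be worth confirming that the TSV reads back with the same shape, unique identifiers, and the expected dtypes. A minimal round-trip sketch, written against a temporary directory and a toy table so it does not touch `../data` (the identifiers `X1` and `X2` and their values are hypothetical):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy table with a subset of the clean columns; values are illustrative.
toy = pd.DataFrame({
    'compound_id': ['X1', 'X2'],
    'name': ['compound one', 'compound two'],
    'expensive': [True, False],
    'mw': [182.22, 201.22],
    'LogP': [3.6, None],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'compounds_selected.tsv'
    toy.to_csv(path, sep='\t', index=False)
    back = pd.read_csv(path, sep='\t')

same_shape = back.shape == toy.shape
ids_unique = bool(back['compound_id'].is_unique)
print('shape preserved:', same_shape)
print('compound_id unique:', ids_unique)
print(back.dtypes.to_string())
```

In this sketch the boolean `expensive` column and the missing `LogP` value both survive the round trip; a uniqueness check on `compound_id` is the part most relevant to a downstream join.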