01 Metal Experiment Inventory
Jupyter notebook from the Gene-Resolution Metal Cross-Resistance Across Diverse Bacteria project.
NB01: Metal Experiment Inventory & Fitness Extraction¶
Project: metal_cross_resistance
Purpose: Classify all Fitness Browser experiments by metal type, build organism × metal coverage matrix, extract per-organism gene × metal fitness matrices.
Environment: Requires BERDL JupyterHub (Spark SQL)
Outputs:
data/metal_experiments.csv— All metal experiments with metal classificationdata/organism_metal_coverage.csv— Organism × metal experiment count matrixdata/gene_metal_fitness/— Per-organism gene × metal mean fitness matrices
import os
import pandas as pd
import numpy as np
from pathlib import Path
# Spark session
try:
spark = get_spark_session()
except NameError:
from berdl_notebook_utils.setup_spark_session import get_spark_session
spark = get_spark_session()
DATA_DIR = Path('../data')
DATA_DIR.mkdir(exist_ok=True)
(DATA_DIR / 'gene_metal_fitness').mkdir(exist_ok=True)
print('Spark session ready.')
Spark session ready.
1. Classify All FB Experiments by Metal Type¶
Metal experiments use inconsistent naming (e.g., "Cobalt chloride hexahydrate", "Sodium molybdate", "copper (II) chloride dihydrate"). We classify by keyword matching on condition_1.
# Get all experiments
all_exps = spark.sql("""
SELECT orgId, expName, expGroup, condition_1
FROM kescience_fitnessbrowser.experiment
""").toPandas()
print(f'Total FB experiments: {len(all_exps):,}')
print(f'Organisms: {all_exps.orgId.nunique()}')
Total FB experiments: 7,552 Organisms: 48
def classify_metal(condition):
"""Classify a condition string as a metal type. Returns None if not a metal experiment."""
c = condition.lower() if isinstance(condition, str) else ''
# Order matters: check specific compounds before generic keywords
metal_patterns = [
('Co', ['cobalt', 'cocl']),
('Ni', ['nickel', 'nicl']),
('Cu', ['copper', 'cucl', 'cuso4', 'copper (ii) sulfate', 'copper (ii) chloride']),
('Zn', ['zinc', 'znso', 'zncl']),
('Mn', ['manganese', 'mncl']),
('Fe', ['iron', 'fecl', 'ferr']),
('Al', ['aluminum', 'alcl']),
('Cr', ['chromate', 'chromium', 'crcl']),
('U', ['uranyl', 'uranium', 'uo2']),
('Mo', ['molybdate', 'molybdenum', 'na2moo']),
('W', ['tungstate', 'tungsten', 'na2wo']),
('Se', ['selenate', 'selenite', 'selenium', 'na2seo']),
('Hg', ['mercury', 'hgcl']),
('Cd', ['cadmium', 'cdcl']),
('Ga', ['gallium']),
('Ag', ['silver']),
]
for symbol, keywords in metal_patterns:
for kw in keywords:
if kw in c:
return symbol
return None
all_exps['metal'] = all_exps['condition_1'].apply(classify_metal)
metal_exps = all_exps[all_exps['metal'].notna()].copy()
print(f'Metal experiments: {len(metal_exps)}')
print(f'Organisms with metal data: {metal_exps.orgId.nunique()}')
print(f'Metals detected: {sorted(metal_exps.metal.unique())}')
print()
print('Experiments per metal:')
print(metal_exps.groupby('metal').size().sort_values(ascending=False).to_string())
Metal experiments: 452 Organisms with metal data: 37 Metals detected: ['Al', 'Cd', 'Co', 'Cr', 'Cu', 'Fe', 'Hg', 'Mn', 'Mo', 'Ni', 'Se', 'U', 'W', 'Zn'] Experiments per metal: metal Co 95 Ni 86 Fe 70 Cu 58 Al 50 Zn 49 U 9 Mo 8 Mn 6 Cr 5 W 5 Hg 5 Cd 3 Se 3
# Save metal experiment classification
metal_exps.to_csv(DATA_DIR / 'metal_experiments.csv', index=False)
print(f'Saved {len(metal_exps)} metal experiments to data/metal_experiments.csv')
Saved 452 metal experiments to data/metal_experiments.csv
2. Organism × Metal Coverage Matrix¶
# Build coverage matrix
coverage = metal_exps.groupby(['orgId', 'metal']).size().reset_index(name='n_exps')
coverage_matrix = coverage.pivot_table(index='orgId', columns='metal', values='n_exps', fill_value=0)
# Count metals per organism
metals_per_org = (coverage_matrix > 0).sum(axis=1).sort_values(ascending=False)
print('=== Organism × Metal Coverage ===')
print(f'Organisms with >=3 metals: {(metals_per_org >= 3).sum()}')
print(f'Organisms with >=5 metals: {(metals_per_org >= 5).sum()}')
print()
print('Top organisms by metal coverage:')
for org in metals_per_org.head(15).index:
metals = [col for col in coverage_matrix.columns if coverage_matrix.loc[org, col] > 0]
total = coverage_matrix.loc[org].sum()
print(f' {org}: {len(metals)} metals, {int(total)} experiments — {", ".join(metals)}')
# Save
coverage_matrix.to_csv(DATA_DIR / 'organism_metal_coverage.csv')
print(f'\nSaved coverage matrix to data/organism_metal_coverage.csv')
=== Organism × Metal Coverage === Organisms with >=3 metals: 30 Organisms with >=5 metals: 12 Top organisms by metal coverage: DvH: 12 metals, 68 experiments — Al, Co, Cr, Cu, Hg, Mn, Mo, Ni, Se, U, W, Zn psRCH2: 8 metals, 41 experiments — Al, Cd, Co, Cr, Cu, Ni, U, Zn Korea: 5 metals, 6 experiments — Al, Co, Cu, Ni, Zn Dino: 5 metals, 15 experiments — Al, Co, Cu, Ni, Zn Cola: 5 metals, 16 experiments — Al, Co, Cu, Ni, Zn Pedo557: 5 metals, 14 experiments — Al, Co, Cu, Ni, Zn acidovorax_3H11: 5 metals, 14 experiments — Al, Co, Cu, Ni, Zn SB2B: 5 metals, 8 experiments — Al, Co, Cu, Ni, Zn MR1: 5 metals, 10 experiments — Al, Co, Cu, Ni, Zn Marino: 5 metals, 21 experiments — Al, Co, Cu, Ni, Zn Phaeo: 5 metals, 16 experiments — Al, Co, Cu, Ni, Zn PV4: 5 metals, 11 experiments — Al, Co, Cu, Ni, Zn Caulo: 4 metals, 12 experiments — Al, Co, Cu, Ni pseudo6_N2E2: 4 metals, 6 experiments — Al, Co, Cu, Ni pseudo5_N2C3_1: 4 metals, 6 experiments — Al, Co, Cu, Ni Saved coverage matrix to data/organism_metal_coverage.csv
3. Extract Per-Organism Gene × Metal Fitness Matrices¶
For each organism with ≥3 metals, query genefitness for all metal experiments and compute mean fitness per gene × metal combination.
Critical pitfalls addressed:
- All FB columns are strings → CAST fit to FLOAT
- locusId types must be consistent → convert to str
# Select organisms with >=3 metals
target_orgs = metals_per_org[metals_per_org >= 3].index.tolist()
print(f'Extracting fitness data for {len(target_orgs)} organisms with >=3 metals')
# Build expName → metal lookup
exp_metal_map = metal_exps.set_index('expName')['metal'].to_dict()
extraction_summary = []
for i, org in enumerate(target_orgs):
outfile = DATA_DIR / 'gene_metal_fitness' / f'{org}_metal_fitness.csv'
# Skip if already extracted
if outfile.exists():
cached = pd.read_csv(outfile, index_col=0)
extraction_summary.append({
'orgId': org, 'n_genes': len(cached),
'n_metals': len(cached.columns), 'status': 'cached'
})
print(f' [{i+1}/{len(target_orgs)}] {org}: cached ({len(cached)} genes × {len(cached.columns)} metals)')
continue
# Get metal experiment names for this organism
org_metal_exps = metal_exps[metal_exps['orgId'] == org]
exp_names = org_metal_exps['expName'].tolist()
if len(exp_names) == 0:
print(f' [{i+1}/{len(target_orgs)}] {org}: SKIPPED (no metal experiments)')
continue
# Query genefitness — CAST fit to FLOAT
exp_list = "', '".join(exp_names)
query = f"""
SELECT locusId, expName, CAST(fit AS FLOAT) as fit
FROM kescience_fitnessbrowser.genefitness
WHERE orgId = '{org}'
AND expName IN ('{exp_list}')
"""
try:
gf = spark.sql(query).toPandas()
except Exception as e:
print(f' [{i+1}/{len(target_orgs)}] {org}: ERROR — {e}')
continue
if len(gf) == 0:
print(f' [{i+1}/{len(target_orgs)}] {org}: SKIPPED (no fitness data returned)')
continue
# Ensure locusId is string (critical pitfall from plan review)
gf['locusId'] = gf['locusId'].astype(str)
gf['fit'] = gf['fit'].astype(float)
# Map expName to metal
gf['metal'] = gf['expName'].map(exp_metal_map)
gf = gf[gf['metal'].notna()]
# Average fitness per gene × metal (across replicate experiments)
gene_metal = gf.groupby(['locusId', 'metal'])['fit'].mean().reset_index()
# Pivot to gene × metal matrix
matrix = gene_metal.pivot_table(index='locusId', columns='metal', values='fit')
# Save
matrix.to_csv(outfile)
extraction_summary.append({
'orgId': org, 'n_genes': len(matrix),
'n_metals': len(matrix.columns), 'status': 'extracted',
'n_fitness_records': len(gf)
})
print(f' [{i+1}/{len(target_orgs)}] {org}: {len(matrix)} genes × {len(matrix.columns)} metals '
f'({len(gf):,} fitness records)')
Extracting fitness data for 30 organisms with >=3 metals
[1/30] DvH: 2741 genes × 13 metals (186,388 fitness records)
[2/30] psRCH2: 3349 genes × 8 metals (137,309 fitness records)
[3/30] Korea: 3393 genes × 5 metals (20,358 fitness records)
[4/30] Dino: 3187 genes × 6 metals (47,805 fitness records)
[5/30] Cola: 3954 genes × 6 metals (63,264 fitness records)
[6/30] Pedo557: 4423 genes × 5 metals (61,922 fitness records)
[7/30] acidovorax_3H11: 3935 genes × 5 metals (55,090 fitness records)
[8/30] SB2B: 3121 genes × 5 metals (24,968 fitness records)
[9/30] MR1: 3782 genes × 5 metals (37,820 fitness records)
[10/30] Marino: 3650 genes × 6 metals (76,650 fitness records)
[11/30] Phaeo: 3099 genes × 5 metals (49,584 fitness records)
[12/30] PV4: 3009 genes × 5 metals (33,099 fitness records)
[13/30] Caulo: 3312 genes × 4 metals (39,744 fitness records)
[14/30] pseudo6_N2E2: 5133 genes × 4 metals (30,798 fitness records)
[15/30] pseudo5_N2C3_1: 5193 genes × 4 metals (31,158 fitness records)
[16/30] Btheta: 4055 genes × 7 metals (105,430 fitness records)
[17/30] Smeli: 5133 genes × 4 metals (30,798 fitness records)
[18/30] azobra: 4833 genes × 2 metals (24,165 fitness records)
[19/30] PS: 2561 genes × 4 metals (20,488 fitness records)
[20/30] Keio: 3789 genes × 4 metals (15,156 fitness records)
[21/30] Cup4G11: 6384 genes × 3 metals (57,456 fitness records)
[22/30] pseudo13_GW456_L13: 4350 genes × 4 metals (21,750 fitness records)
[23/30] ANA3: 3668 genes × 3 metals (11,004 fitness records)
[24/30] BFirm: 5428 genes × 2 metals (21,712 fitness records)
[25/30] Ponti: 3685 genes × 4 metals (40,535 fitness records)
[26/30] pseudo3_N2E3: 5028 genes × 3 metals (40,224 fitness records)
[27/30] WCS417: 4419 genes × 3 metals (13,257 fitness records)
[28/30] Koxy: 4608 genes × 3 metals (36,864 fitness records)
[29/30] Kang: 2003 genes × 4 metals (10,015 fitness records)
[30/30] pseudo1_N1B4: 4336 genes × 4 metals (17,344 fitness records)
# Summary
summary_df = pd.DataFrame(extraction_summary)
summary_df.to_csv(DATA_DIR / 'extraction_summary.csv', index=False)
print('\n=== Extraction Summary ===')
print(f'Organisms extracted: {len(summary_df)}')
print(f'Total genes across all organisms: {summary_df.n_genes.sum():,}')
print(f'Mean genes per organism: {summary_df.n_genes.mean():.0f}')
print(f'Mean metals per organism: {summary_df.n_metals.mean():.1f}')
print()
print(summary_df.sort_values('n_metals', ascending=False).to_string(index=False))
=== Extraction Summary ===
Organisms extracted: 30
Total genes across all organisms: 119,561
Mean genes per organism: 3985
Mean metals per organism: 4.7
orgId n_genes n_metals status n_fitness_records
DvH 2741 13 extracted 186388
psRCH2 3349 8 extracted 137309
Btheta 4055 7 extracted 105430
Dino 3187 6 extracted 47805
Marino 3650 6 extracted 76650
Cola 3954 6 extracted 63264
Pedo557 4423 5 extracted 61922
acidovorax_3H11 3935 5 extracted 55090
SB2B 3121 5 extracted 24968
MR1 3782 5 extracted 37820
Phaeo 3099 5 extracted 49584
PV4 3009 5 extracted 33099
Korea 3393 5 extracted 20358
Caulo 3312 4 extracted 39744
pseudo6_N2E2 5133 4 extracted 30798
pseudo5_N2C3_1 5193 4 extracted 31158
Smeli 5133 4 extracted 30798
PS 2561 4 extracted 20488
pseudo13_GW456_L13 4350 4 extracted 21750
Keio 3789 4 extracted 15156
pseudo1_N1B4 4336 4 extracted 17344
Ponti 3685 4 extracted 40535
Kang 2003 4 extracted 10015
Cup4G11 6384 3 extracted 57456
Koxy 4608 3 extracted 36864
ANA3 3668 3 extracted 11004
WCS417 4419 3 extracted 13257
pseudo3_N2E3 5028 3 extracted 40224
azobra 4833 2 extracted 24165
BFirm 5428 2 extracted 21712
4. Also Extract Non-Metal Experiment Counts¶
We need non-metal experiment counts per organism to classify genes into the three-tier architecture (NB04): general stress vs metal-shared vs metal-specific. Save the total experiment counts now.
# Count non-metal experiments per organism
non_metal_exps = all_exps[all_exps['metal'].isna()]
non_metal_counts = non_metal_exps.groupby('orgId').size().reset_index(name='n_nonmetal_exps')
metal_counts = metal_exps.groupby('orgId').size().reset_index(name='n_metal_exps')
total_counts = all_exps.groupby('orgId').size().reset_index(name='n_total_exps')
org_exp_counts = total_counts.merge(metal_counts, on='orgId', how='left').merge(
non_metal_counts, on='orgId', how='left'
).fillna(0)
# Filter to target organisms
org_exp_counts = org_exp_counts[org_exp_counts['orgId'].isin(target_orgs)]
org_exp_counts.to_csv(DATA_DIR / 'organism_experiment_counts.csv', index=False)
print('Experiment counts for target organisms:')
print(org_exp_counts.sort_values('n_total_exps', ascending=False).to_string(index=False))
Experiment counts for target organisms:
orgId n_total_exps n_metal_exps n_nonmetal_exps
DvH 757 68.0 689
Btheta 519 26.0 493
psRCH2 350 41.0 309
Phaeo 274 16.0 258
Marino 255 21.0 234
pseudo3_N2E3 211 8.0 203
Koxy 208 8.0 200
Cola 202 16.0 186
WCS417 201 3.0 198
Caulo 198 12.0 186
SB2B 190 8.0 182
pseudo6_N2E2 188 6.0 182
Dino 186 15.0 171
pseudo5_N2C3_1 184 6.0 178
Pedo557 177 14.0 163
MR1 176 10.0 166
Keio 168 4.0 164
Korea 162 6.0 156
PV4 160 11.0 149
pseudo1_N1B4 147 4.0 143
acidovorax_3H11 140 14.0 126
BFirm 113 4.0 109
Kang 108 5.0 103
ANA3 107 3.0 104
Cup4G11 106 9.0 97
pseudo13_GW456_L13 106 5.0 101
Ponti 104 11.0 93
azobra 91 5.0 86
Smeli 90 6.0 84
PS 79 8.0 71
5. Quick Sanity Check: DvH Cross-Metal Correlations¶
As a sanity check, compute the metal-metal correlation matrix for DvH (our best-covered organism) and verify it aligns with the counter_ion_effects findings.
# Load DvH matrix
dvh_file = DATA_DIR / 'gene_metal_fitness' / 'DvH_metal_fitness.csv'
if dvh_file.exists():
dvh = pd.read_csv(dvh_file, index_col=0)
print(f'DvH: {dvh.shape[0]} genes × {dvh.shape[1]} metals')
print(f'Metals: {list(dvh.columns)}')
print()
# Compute pairwise Pearson correlations between metals
# Drop genes with NaN in any metal (complete cases)
dvh_complete = dvh.dropna()
print(f'Complete cases (genes with data for all {dvh.shape[1]} metals): {len(dvh_complete)}')
corr = dvh.corr(method='pearson')
print('\nDvH metal-metal correlation matrix (Pearson r):')
print(corr.round(3).to_string())
# Highlight strongest pairs
pairs = []
for i, m1 in enumerate(corr.columns):
for j, m2 in enumerate(corr.columns):
if i < j:
pairs.append((m1, m2, corr.loc[m1, m2]))
pairs.sort(key=lambda x: abs(x[2]), reverse=True)
print('\nTop 10 most correlated metal pairs (DvH):')
for m1, m2, r in pairs[:10]:
print(f' {m1}-{m2}: r = {r:.3f}')
else:
print('DvH fitness file not found — run extraction first')
DvH: 2741 genes × 13 metals
Metals: ['Al', 'Co', 'Cr', 'Cu', 'Fe', 'Hg', 'Mn', 'Mo', 'Ni', 'Se', 'U', 'W', 'Zn']
Complete cases (genes with data for all 13 metals): 2741
DvH metal-metal correlation matrix (Pearson r):
Al Co Cr Cu Fe Hg Mn Mo Ni Se U W Zn
Al 1.000 0.587 0.665 0.475 0.720 0.512 0.626 0.168 0.554 0.627 0.394 0.543 0.711
Co 0.587 1.000 0.501 0.659 0.572 0.311 0.784 0.191 0.323 0.507 0.238 0.477 0.502
Cr 0.665 0.501 1.000 0.448 0.745 0.474 0.559 0.215 0.419 0.623 0.307 0.416 0.427
Cu 0.475 0.659 0.448 1.000 0.473 0.285 0.608 0.570 0.341 0.444 0.194 0.387 0.355
Fe 0.720 0.572 0.745 0.473 1.000 0.500 0.598 0.207 0.444 0.694 0.330 0.431 0.475
Hg 0.512 0.311 0.474 0.285 0.500 1.000 0.333 0.097 0.327 0.469 0.697 0.269 0.327
Mn 0.626 0.784 0.559 0.608 0.598 0.333 1.000 0.244 0.346 0.552 0.259 0.484 0.515
Mo 0.168 0.191 0.215 0.570 0.207 0.097 0.244 1.000 0.123 0.194 0.058 0.222 0.105
Ni 0.554 0.323 0.419 0.341 0.444 0.327 0.346 0.123 1.000 0.424 0.337 0.342 0.358
Se 0.627 0.507 0.623 0.444 0.694 0.469 0.552 0.194 0.424 1.000 0.366 0.523 0.470
U 0.394 0.238 0.307 0.194 0.330 0.697 0.259 0.058 0.337 0.366 1.000 0.287 0.248
W 0.543 0.477 0.416 0.387 0.431 0.269 0.484 0.222 0.342 0.523 0.287 1.000 0.594
Zn 0.711 0.502 0.427 0.355 0.475 0.327 0.515 0.105 0.358 0.470 0.248 0.594 1.000
Top 10 most correlated metal pairs (DvH):
Co-Mn: r = 0.784
Cr-Fe: r = 0.745
Al-Fe: r = 0.720
Al-Zn: r = 0.711
Hg-U: r = 0.697
Fe-Se: r = 0.694
Al-Cr: r = 0.665
Co-Cu: r = 0.659
Al-Se: r = 0.627
Al-Mn: r = 0.626
# Visualize DvH cross-resistance matrix
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
if dvh_file.exists():
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdBu_r', center=0,
vmin=-1, vmax=1, square=True, ax=ax,
linewidths=0.5, linecolor='white')
ax.set_title('DvH Metal Cross-Resistance Matrix\n(Gene-level Pearson correlation of fitness profiles)',
fontsize=12)
plt.tight_layout()
fig.savefig('../figures/dvh_cross_resistance_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()
print('Saved to figures/dvh_cross_resistance_heatmap.png')
Saved to figures/dvh_cross_resistance_heatmap.png
Summary¶
This notebook classified FB experiments by metal type, built the organism × metal coverage matrix, and extracted per-organism gene × metal fitness matrices for all organisms with ≥3 metals.
Next: NB02 will compute per-organism cross-resistance matrices (metal × metal Pearson correlations) for all organisms.