05 Pangenome Fitness
Jupyter notebook from the Annotation-Gap Discovery via Phenotype-Fitness-Pangenome-Gapfilling Integration project.
NB05: Pangenome Presence/Absence + Expanded Fitness Profiling¶
Project: Annotation-Gap Discovery
Goal: Build cross-organism EC coverage matrix, compute fitness specificity
scores, and identify co-occurrence patterns for candidate genes.
Inputs: NB03 annotation_gap_candidates.tsv, NB04 bakta_ec_alternatives.tsv
Outputs: presence_absence_matrix.tsv, fitness_support.tsv, cooccurrence_candidates.tsv
Requires: BERDL JupyterHub (Spark access)
spark = get_spark_session()
spark.sql('SELECT 1 AS ok').show()
import pandas as pd
import numpy as np
import os
DATA_DIR = '../data'
FIG_DIR = '../figures'
os.makedirs(FIG_DIR, exist_ok=True)
# Load prior outputs
candidates = pd.read_csv(f'{DATA_DIR}/annotation_gap_candidates.tsv', sep='\t')
bakta_alts = pd.read_csv(f'{DATA_DIR}/bakta_ec_alternatives.tsv', sep='\t')
gf_details = pd.read_csv(f'{DATA_DIR}/gapfill_reaction_details.tsv', sep='\t')
gf_results = pd.read_csv(f'{DATA_DIR}/carbon_source_gapfill_results.tsv', sep='\t')
manifest = pd.read_csv(f'{DATA_DIR}/genome_manifest.tsv', sep='\t')
concordance = pd.read_csv(f'{DATA_DIR}/gapmind_concordance.tsv', sep='\t')
enzymatic = gf_details[gf_details['rxn_type'] == 'metabolic'].copy()
gf_cases = gf_results[gf_results['status'] == 'gapfilled'].copy()
print(f'NB03 candidates: {len(candidates)}')
print(f'NB04 Bakta alternatives: {len(bakta_alts)}')
print(f'Enzymatic reactions: {len(enzymatic)}')
print(f'Gapfilled cases: {len(gf_cases)}')
print(f'Organisms: {manifest["orgId"].nunique()}')
print(f'Spark session ready.')
+---+ | ok| +---+ | 1| +---+ NB03 candidates: 107 NB04 Bakta alternatives: 1459 Enzymatic reactions: 201 Gapfilled cases: 38 Organisms: 14 Spark session ready.
1. Consolidate All Candidates¶
Merge NB03 candidates (EC match + fitness + conservation) with NB04 Bakta
alternatives into a unified candidate table.
# NB03 candidates: orgId, rxn_id_base, locusId, gene_cluster_id, confidence
nb03 = candidates[['orgId', 'carbon_source', 'rxn_id_base', 'name',
'ec_number', 'locusId', 'gene_desc',
'mean_fit', 'fitness_class', 'conservation_class',
'confidence']].copy()
nb03['source'] = 'nb03_ec_match'
# NB04 Bakta alternatives: orgId, rxn_id_base, gene_cluster_id, bakta_ec
nb04 = bakta_alts[['orgId', 'rxn_id_base', 'gene_cluster_id',
'bakta_ec', 'bakta_product', 'is_core', 'source']].copy()
nb04['source'] = nb04['source'].fillna('bakta_ec')
# Count unique reaction-organism pairs
nb03_pairs = set(zip(nb03['orgId'], nb03['rxn_id_base']))
nb04_pairs = set(zip(nb04['orgId'], nb04['rxn_id_base']))
overlap = nb03_pairs & nb04_pairs
nb04_only = nb04_pairs - nb03_pairs
print(f'NB03 reaction-organism pairs: {len(nb03_pairs)}')
print(f'NB04 reaction-organism pairs: {len(nb04_pairs)}')
print(f' Overlap with NB03: {len(overlap)}')
print(f' New from NB04 only: {len(nb04_only)}')
# Total unresolved
all_rxn_pairs = set(zip(enzymatic['orgId'], enzymatic['rxn_id_base']))
resolved = nb03_pairs | nb04_pairs
still_unresolved = all_rxn_pairs - resolved
print(f'\nTotal enzymatic reaction-organism pairs: {len(all_rxn_pairs)}')
print(f'Resolved (any candidate): {len(resolved)} ({100*len(resolved)/len(all_rxn_pairs):.1f}%)')
print(f'Still unresolved: {len(still_unresolved)}')
NB03 reaction-organism pairs: 51 NB04 reaction-organism pairs: 452 Overlap with NB03: 0 New from NB04 only: 452 Total enzymatic reaction-organism pairs: 201 Resolved (any candidate): 503 (250.2%) Still unresolved: 128
2. Cross-Organism EC Coverage Matrix¶
For each gapfilled reaction EC, how many of the 14 organisms have at least one
gene cluster annotated with that EC (via Bakta)? This reveals whether an EC is
universally present or organism-specific.
# Get all unique ECs from gapfilled reactions
rxn_ecs = {}
for _, row in enzymatic[enzymatic['ec_numbers'].notna()].iterrows():
for ec in str(row['ec_numbers']).replace('|', ';').split(';'):
ec = ec.strip()
if ec:
rxn_ecs.setdefault(ec, set()).add(row['rxn_id_base'])
target_ecs = sorted(rxn_ecs.keys())
print(f'Unique ECs from gapfilled reactions: {len(target_ecs)}')
# Query Bakta for all these ECs across all 14 clades
clade_list = manifest['gtdb_species_clade_id'].dropna().unique().tolist()
clade_in = "','".join(clade_list)
ec_in = "','".join(target_ecs)
ec_coverage = spark.sql(f"""
SELECT gc.gtdb_species_clade_id, ba.ec AS bakta_ec,
gc.gene_cluster_id, gc.is_core
FROM kbase_ke_pangenome.gene_cluster gc
JOIN kbase_ke_pangenome.bakta_annotations ba
ON gc.gene_cluster_id = ba.gene_cluster_id
WHERE gc.gtdb_species_clade_id IN ('{clade_in}')
AND ba.ec IN ('{ec_in}')
""").toPandas()
ec_coverage['orgId'] = ec_coverage['gtdb_species_clade_id'].map(
manifest.set_index('gtdb_species_clade_id')['orgId'].to_dict()
)
print(f'EC coverage entries: {len(ec_coverage):,}')
# Build presence matrix: EC × organism
org_list = sorted(manifest['orgId'].unique())
ec_presence = ec_coverage.groupby(['bakta_ec', 'orgId']).size().reset_index(name='n_clusters')
ec_matrix = ec_presence.pivot_table(
index='bakta_ec', columns='orgId', values='n_clusters', fill_value=0
)
ec_matrix_binary = (ec_matrix > 0).astype(int)
ec_matrix_binary['n_organisms'] = ec_matrix_binary.sum(axis=1)
print(f'\nEC presence matrix: {ec_matrix_binary.shape[0]} ECs × {len(org_list)} organisms')
print(f'\nEC coverage distribution (how many organisms have each EC):')
print(ec_matrix_binary['n_organisms'].value_counts().sort_index().to_string())
Unique ECs from gapfilled reactions: 84
EC coverage entries: 904 EC presence matrix: 57 ECs × 14 organisms EC coverage distribution (how many organisms have each EC): n_organisms 1 6 2 6 3 5 4 4 5 6 6 5 7 3 9 1 10 1 12 6 13 7 14 7
# For each gapfilled reaction, annotate with EC coverage
rxn_coverage = []
for _, row in enzymatic.iterrows():
ecs = []
if pd.notna(row.get('ec_numbers')):
ecs = [ec.strip() for ec in str(row['ec_numbers']).replace('|', ';').split(';') if ec.strip()]
best_coverage = 0
has_in_org = False
for ec in ecs:
if ec in ec_matrix_binary.index:
n_orgs = ec_matrix_binary.loc[ec, 'n_organisms']
best_coverage = max(best_coverage, n_orgs)
if row['orgId'] in ec_matrix_binary.columns:
has_in_org = has_in_org or (ec_matrix_binary.loc[ec, row['orgId']] > 0)
rxn_coverage.append({
'orgId': row['orgId'],
'rxn_id_base': row['rxn_id_base'],
'name': row['name'],
'ec_numbers': row.get('ec_numbers'),
'n_orgs_with_ec': best_coverage,
'org_has_bakta_ec': has_in_org,
})
rxn_cov_df = pd.DataFrame(rxn_coverage)
print('Gapfilled reactions by EC coverage:')
print(f' EC found in this organism (Bakta): {rxn_cov_df["org_has_bakta_ec"].sum()}')
print(f' EC found in 0 organisms: {(rxn_cov_df["n_orgs_with_ec"] == 0).sum()}')
print(f' EC found in 1-5 organisms: {((rxn_cov_df["n_orgs_with_ec"] > 0) & (rxn_cov_df["n_orgs_with_ec"] <= 5)).sum()}')
print(f' EC found in 6-14 organisms: {(rxn_cov_df["n_orgs_with_ec"] > 5).sum()}')
Gapfilled reactions by EC coverage: EC found in this organism (Bakta): 43 EC found in 0 organisms: 87 EC found in 1-5 organisms: 49 EC found in 6-14 organisms: 65
3. Expanded Fitness Profiling¶
For NB03 candidate genes, query fitness across ALL carbon source experiments
(not just the one requiring gapfilling). Genes with carbon-source-specific
fitness defects are stronger candidates than general housekeeping genes.
Note: fit and t columns are STRING type — must CAST to DOUBLE.
# Get all candidate locusIds from NB03 (grouped by orgId)
cand_by_org = candidates.groupby('orgId')['locusId'].apply(lambda x: sorted(x.unique())).to_dict()
# Load FB experiments mapping
fb_exps = pd.read_csv(f'{DATA_DIR}/fb_carbon_experiments.tsv', sep='\t')
carbon_exps = fb_exps[fb_exps['expGroup'].str.contains('carbon', case=False, na=False)].copy()
print(f'Organisms with NB03 candidates: {len(cand_by_org)}')
print(f'Total candidate genes: {sum(len(v) for v in cand_by_org.values())}')
print(f'Carbon source experiments: {len(carbon_exps)}')
# Query fitness for each organism's candidate genes across ALL experiments
all_fitness = []
for org_id, loci in cand_by_org.items():
locus_in = "','".join(str(l) for l in loci)
fit_df = spark.sql(f"""
SELECT orgId, locusId, expName,
CAST(fit AS DOUBLE) AS fit,
CAST(t AS DOUBLE) AS t
FROM kescience_fitnessbrowser.genefitness
WHERE orgId = '{org_id}'
AND locusId IN ('{locus_in}')
""").toPandas()
all_fitness.append(fit_df)
print(f' {org_id}: {len(loci)} genes, {len(fit_df):,} fitness values across {fit_df["expName"].nunique()} experiments')
fitness_all = pd.concat(all_fitness, ignore_index=True)
print(f'\nTotal fitness values: {len(fitness_all):,}')
print(f'Unique experiments: {fitness_all["expName"].nunique()}')
Organisms with NB03 candidates: 14 Total candidate genes: 107 Carbon source experiments: 717
Btheta: 2 genes, 0 fitness values across 0 experiments
Cup4G11: 7 genes, 424 fitness values across 106 experiments
Dino: 8 genes, 558 fitness values across 186 experiments
Dyella79: 6 genes, 340 fitness values across 68 experiments
HerbieS: 18 genes, 1,248 fitness values across 96 experiments
Keio: 10 genes, 1,344 fitness values across 168 experiments
Korea: 4 genes, 0 fitness values across 0 experiments
Koxy: 7 genes, 1,456 fitness values across 208 experiments
MR1: 4 genes, 352 fitness values across 176 experiments
Marino: 10 genes, 1,785 fitness values across 255 experiments
PV4: 4 genes, 320 fitness values across 160 experiments
Phaeo: 9 genes, 1,370 fitness values across 274 experiments
Smeli: 10 genes, 900 fitness values across 90 experiments
azobra: 8 genes, 455 fitness values across 91 experiments Total fitness values: 10,552 Unique experiments: 1046
# Map experiments to carbon sources
exp_to_cs = carbon_exps.set_index('expName')['condition_1'].to_dict()
fitness_all['carbon_source'] = fitness_all['expName'].map(exp_to_cs)
fitness_all['is_carbon_exp'] = fitness_all['carbon_source'].notna()
# For each candidate gene × target carbon source, compute fitness specificity
# Merge with NB03 to know which gene maps to which reaction/carbon source
cand_target = candidates[['orgId', 'locusId', 'carbon_source', 'rxn_id_base']].drop_duplicates()
specificity = []
for _, cand in cand_target.iterrows():
gene_fit = fitness_all[
(fitness_all['orgId'] == cand['orgId']) &
(fitness_all['locusId'] == cand['locusId'])
]
if len(gene_fit) == 0:
continue
# Fitness under target carbon source
target_fit = gene_fit[gene_fit['carbon_source'] == cand['carbon_source']]
target_mean = target_fit['fit'].mean() if len(target_fit) > 0 else np.nan
# Fitness under all carbon source experiments
carbon_fit = gene_fit[gene_fit['is_carbon_exp']]
all_cs_mean = carbon_fit['fit'].mean() if len(carbon_fit) > 0 else np.nan
all_cs_std = carbon_fit['fit'].std() if len(carbon_fit) > 1 else np.nan
# Overall fitness (all experiments)
overall_mean = gene_fit['fit'].mean()
overall_std = gene_fit['fit'].std() if len(gene_fit) > 1 else np.nan
# Specificity: how much worse is fitness under target vs all conditions?
if pd.notna(target_mean) and pd.notna(overall_mean) and overall_std > 0:
specificity_score = (target_mean - overall_mean) / overall_std
else:
specificity_score = np.nan
# Absolute specificity: |target_fit| / mean(|all_fits|)
all_abs_mean = gene_fit['fit'].abs().mean()
if all_abs_mean > 0 and pd.notna(target_mean):
abs_specificity = abs(target_mean) / all_abs_mean
else:
abs_specificity = np.nan
specificity.append({
'orgId': cand['orgId'],
'locusId': cand['locusId'],
'carbon_source': cand['carbon_source'],
'rxn_id_base': cand['rxn_id_base'],
'target_fit': target_mean,
'all_cs_mean_fit': all_cs_mean,
'overall_mean_fit': overall_mean,
'overall_std_fit': overall_std,
'n_experiments': len(gene_fit),
'n_carbon_exps': len(carbon_fit),
'z_score': specificity_score,
'abs_specificity': abs_specificity,
})
spec_df = pd.DataFrame(specificity)
print(f'Fitness specificity computed for {len(spec_df)} candidate-reaction pairs')
if len(spec_df) > 0:
print(f'\nSpecificity distribution (z-score = (target - overall) / std):')
print(spec_df['z_score'].describe().to_string())
# Classify specificity
spec_df['specificity_class'] = 'neutral'
spec_df.loc[spec_df['z_score'] < -1, 'specificity_class'] = 'specific_defect'
spec_df.loc[spec_df['z_score'] < -2, 'specificity_class'] = 'strong_specific_defect'
spec_df.loc[spec_df['z_score'] > 1, 'specificity_class'] = 'specific_benefit'
print(f'\nSpecificity class distribution:')
print(spec_df['specificity_class'].value_counts().to_string())
Fitness specificity computed for 71 candidate-reaction pairs Specificity distribution (z-score = (target - overall) / std): count 71.000000 mean -0.036756 std 0.593373 min -1.443336 25% -0.451371 50% -0.002718 75% 0.388026 max 1.491773 Specificity class distribution:
specificity_class neutral 66 specific_defect 4 specific_benefit 1
4. Conservation + Co-occurrence Patterns¶
For gapfilled reactions, identify gene clusters that co-occur with known pathway
genes. If a gapfilled reaction's EC maps to a core cluster but the organism lacks
a gene in that cluster → strong annotation gap evidence.
# Conservation analysis using gene_cluster is_core/is_auxiliary/is_singleton
# For Bakta-matched gene clusters
bakta_conservation = bakta_alts.groupby(['orgId', 'rxn_id_base']).agg(
n_clusters=('gene_cluster_id', 'nunique'),
n_core=('is_core', lambda x: x.sum()),
has_core=('is_core', 'any'),
).reset_index()
print('Bakta candidate conservation:')
print(f' With core gene cluster: {bakta_conservation["has_core"].sum()}')
print(f' Without core cluster: {(~bakta_conservation["has_core"]).sum()}')
# Cross-reference: reactions where the EC is found in most organisms (universal)
# but gapfilling is needed in a specific organism → likely annotation gap
rxn_ec_map = {}
for _, row in enzymatic[enzymatic['ec_numbers'].notna()].iterrows():
for ec in str(row['ec_numbers']).replace('|', ';').split(';'):
ec = ec.strip()
if ec:
rxn_ec_map.setdefault(row['rxn_id_base'], set()).add(ec)
cooccurrence = []
for _, row in enzymatic.iterrows():
ecs = rxn_ec_map.get(row['rxn_id_base'], set())
best_n = 0
best_ec = None
for ec in ecs:
if ec in ec_matrix_binary.index:
n = ec_matrix_binary.loc[ec, 'n_organisms']
if n > best_n:
best_n = n
best_ec = ec
org_has = False
if best_ec and row['orgId'] in ec_matrix_binary.columns:
org_has = ec_matrix_binary.loc[best_ec, row['orgId']] > 0
is_resolved = (row['orgId'], row['rxn_id_base']) in resolved
cooccurrence.append({
'orgId': row['orgId'],
'rxn_id_base': row['rxn_id_base'],
'name': row['name'],
'best_ec': best_ec,
'n_orgs_with_ec': best_n,
'org_has_ec_in_bakta': org_has,
'is_resolved': is_resolved,
'gap_evidence': 'strong' if best_n >= 10 and not org_has else
('moderate' if best_n >= 5 and not org_has else
('weak' if best_n > 0 else 'no_ec')),
})
coocc_df = pd.DataFrame(cooccurrence)
print(f'\nCo-occurrence gap evidence:')
print(coocc_df['gap_evidence'].value_counts().to_string())
print(f'\nStrong evidence (EC in 10+ orgs, missing in target):')
strong = coocc_df[coocc_df['gap_evidence'] == 'strong']
if len(strong) > 0:
print(strong[['orgId', 'rxn_id_base', 'name', 'best_ec', 'n_orgs_with_ec']].to_string(index=False))
Bakta candidate conservation:
With core gene cluster: 420
Without core cluster: 32
Co-occurrence gap evidence:
gap_evidence
no_ec 87
weak 77
moderate 26
strong 11
Strong evidence (EC in 10+ orgs, missing in target):
orgId rxn_id_base name best_ec n_orgs_with_ec
Btheta rxn02303 Coproporphyrinogen:oxygen oxidoreductase(decarboxylating) 1.3.3.3 13
Btheta rxn02288 Uroporphyrinogen-III carboxy-lyase 4.1.1.37 13
Btheta rxn00350 glutathione gamma-glutamylaminopeptidase 2.3.2.2 13
Phaeo rxn09399 1,4-alpha-glucan branching enzyme 2.4.1.18 10
Dino rxn00340 L-aspartate:ammonia ligase (AMP-forming) 6.3.5.4 12
Dino rxn02402 Nicotinate-nucleotide:pyrophosphate phosphoribosyltransferase (carboxylating) 2.4.2.19 13
Dino rxn02312 S-Adenosyl-L-methionine:8-amino-7-oxononanoate aminotransferase 2.6.1.62 13
Dino rxn02277 7,8-Diaminononanoate:carbon-dioxide cyclo-ligase 6.3.3.3 13
Dino rxn17731 biotin synthase 2.8.1.6 13
Korea rxn09399 1,4-alpha-glucan branching enzyme 2.4.1.18 10
Dyella79 rxn09399 1,4-alpha-glucan branching enzyme 2.4.1.18 10
5. Updated Confidence Scoring¶
Integrate fitness specificity and co-occurrence evidence to upgrade/downgrade
NB03 confidence scores.
# Merge specificity into candidates
if len(spec_df) > 0:
cand_enhanced = candidates.merge(
spec_df[['orgId', 'locusId', 'carbon_source', 'rxn_id_base',
'target_fit', 'overall_mean_fit', 'z_score',
'abs_specificity', 'specificity_class', 'n_experiments']],
on=['orgId', 'locusId', 'carbon_source', 'rxn_id_base'],
how='left'
)
else:
cand_enhanced = candidates.copy()
cand_enhanced['z_score'] = np.nan
cand_enhanced['specificity_class'] = 'unknown'
# Upgrade confidence based on specificity
cand_enhanced['updated_confidence'] = cand_enhanced['confidence']
# Upgrade low → medium if strong specific defect
upgrade_mask = (
(cand_enhanced['confidence'] == 'low') &
(cand_enhanced['specificity_class'] == 'strong_specific_defect')
)
cand_enhanced.loc[upgrade_mask, 'updated_confidence'] = 'medium'
# Upgrade medium → high if strong specific defect AND core conservation
upgrade_high_mask = (
(cand_enhanced['confidence'] == 'medium') &
(cand_enhanced['specificity_class'].isin(['specific_defect', 'strong_specific_defect'])) &
(cand_enhanced['conservation_class'] == 'core')
)
cand_enhanced.loc[upgrade_high_mask, 'updated_confidence'] = 'high'
print('Confidence score updates:')
print(f' Original: {candidates["confidence"].value_counts().to_dict()}')
print(f' Updated: {cand_enhanced["updated_confidence"].value_counts().to_dict()}')
n_upgraded = (cand_enhanced['updated_confidence'] != cand_enhanced['confidence']).sum()
print(f'\nCandidates upgraded: {n_upgraded}')
if n_upgraded > 0:
upgraded = cand_enhanced[cand_enhanced['updated_confidence'] != cand_enhanced['confidence']]
print(upgraded[['orgId', 'rxn_id_base', 'locusId', 'confidence',
'updated_confidence', 'z_score', 'specificity_class']].to_string(index=False))
Confidence score updates:
Original: {'low': 84, 'high': 12, 'medium': 11}
Updated: {'low': 84, 'high': 12, 'medium': 11}
Candidates upgraded: 0
6. Visualization¶
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# Panel 1: EC presence heatmap (top 20 ECs by organism count)
top_ecs = ec_matrix_binary.nlargest(20, 'n_organisms').drop(columns='n_organisms')
if len(top_ecs) > 0:
sns.heatmap(top_ecs, cmap='YlOrRd', cbar_kws={'label': 'Present'},
ax=axes[0], xticklabels=True, yticklabels=True)
axes[0].set_title('EC Presence Across Organisms\n(Top 20 by coverage)')
axes[0].tick_params(axis='both', labelsize=7)
# Panel 2: Fitness specificity distribution
if len(spec_df) > 0 and spec_df['z_score'].notna().any():
axes[1].hist(spec_df['z_score'].dropna(), bins=30, color='steelblue', edgecolor='white')
axes[1].axvline(x=-1, color='orange', linestyle='--', label='z < -1 (specific)')
axes[1].axvline(x=-2, color='red', linestyle='--', label='z < -2 (strong)')
axes[1].set_xlabel('Fitness Specificity Z-score')
axes[1].set_ylabel('Count')
axes[1].set_title('Fitness Specificity Distribution')
axes[1].legend(fontsize=8)
else:
axes[1].text(0.5, 0.5, 'No specificity data', ha='center', va='center')
axes[1].set_title('Fitness Specificity Distribution')
# Panel 3: Confidence before/after
conf_order = ['high', 'medium', 'low']
before = candidates['confidence'].value_counts().reindex(conf_order, fill_value=0)
after = cand_enhanced['updated_confidence'].value_counts().reindex(conf_order, fill_value=0)
x = np.arange(len(conf_order))
w = 0.35
axes[2].bar(x - w/2, before.values, w, label='NB03 original', color='lightcoral')
axes[2].bar(x + w/2, after.values, w, label='+ specificity', color='steelblue')
axes[2].set_xticks(x)
axes[2].set_xticklabels(conf_order)
axes[2].set_ylabel('Number of Candidates')
axes[2].set_title('Confidence Score Updates')
axes[2].legend()
plt.tight_layout()
plt.savefig(f'{FIG_DIR}/nb05_fitness_specificity.png', dpi=150, bbox_inches='tight')
plt.show()
print(f'Saved: {FIG_DIR}/nb05_fitness_specificity.png')
Saved: ../figures/nb05_fitness_specificity.png
7. Save Outputs¶
# Save presence/absence matrix
ec_matrix_binary.to_csv(f'{DATA_DIR}/presence_absence_matrix.tsv', sep='\t')
print(f'Saved presence_absence_matrix.tsv: {ec_matrix_binary.shape}')
# Save expanded fitness with specificity
if len(spec_df) > 0:
spec_df.to_csv(f'{DATA_DIR}/fitness_support.tsv', sep='\t', index=False)
print(f'Saved fitness_support.tsv: {len(spec_df)} rows')
else:
pd.DataFrame().to_csv(f'{DATA_DIR}/fitness_support.tsv', sep='\t', index=False)
print('Saved fitness_support.tsv: 0 rows')
# Save co-occurrence analysis
coocc_df.to_csv(f'{DATA_DIR}/cooccurrence_candidates.tsv', sep='\t', index=False)
print(f'Saved cooccurrence_candidates.tsv: {len(coocc_df)} rows')
# Save updated candidates
cand_enhanced.to_csv(f'{DATA_DIR}/candidates_enhanced.tsv', sep='\t', index=False)
print(f'Saved candidates_enhanced.tsv: {len(cand_enhanced)} rows')
print()
print('=' * 60)
print('NB05 SUMMARY')
print('=' * 60)
print(f'EC presence matrix: {ec_matrix_binary.shape[0]} ECs × {len(org_list)} organisms')
print(f'Resolution after NB03+NB04: {len(resolved)} / {len(all_rxn_pairs)} pairs ({100*len(resolved)/len(all_rxn_pairs):.1f}%)')
print(f'Fitness specificity computed: {len(spec_df)} candidate-reaction pairs')
if len(spec_df) > 0:
n_specific = (spec_df['specificity_class'].isin(['specific_defect', 'strong_specific_defect'])).sum()
print(f' Carbon-source-specific: {n_specific}')
print(f'Candidates upgraded: {n_upgraded}')
print(f'Updated confidence: {cand_enhanced["updated_confidence"].value_counts().to_dict()}')
print('=' * 60)
Saved presence_absence_matrix.tsv: (57, 15)
Saved fitness_support.tsv: 71 rows
Saved cooccurrence_candidates.tsv: 201 rows
Saved candidates_enhanced.tsv: 107 rows
============================================================
NB05 SUMMARY
============================================================
EC presence matrix: 57 ECs × 14 organisms
Resolution after NB03+NB04: 503 / 201 pairs (250.2%)
Fitness specificity computed: 71 candidate-reaction pairs
Carbon-source-specific: 4
Candidates upgraded: 0
Updated confidence: {'low': 84, 'high': 12, 'medium': 11}
============================================================