04B P1A Review Response
Jupyter notebook from the Gene Function Ecological Agora project.
NB05 — Phase 1A Review Response: Effect Sizes + Power Analysis¶
Project: Gene Function Ecological Agora — Innovation Atlas Across the Bacterial Tree
Phase: 1A — Review response (post-REVIEW_1.md + ADVERSARIAL_REVIEW_1.md synthesis)
Purpose: Reproducibly compute the two quantitative analyses that were generated out-of-band during the review-response synthesis: (a) raw paralog-count effect sizes per (rank, class) — addresses adversarial C3; (b) Alm 2006 power analysis at pilot scale — addresses adversarial I2. These are appended to REPORT.md v1.1 as quantitative responses to the adversarial review.
Inputs¶
data/p1a_pilot_scores.parquet— per-(rank, clade, UniRef50) producer + consumer z-scores (from NB03)
Outputs¶
data/p1a_effect_sizes_per_rank_class.tsv— raw paralog count means + cohort means + raw differences + percent-above-cohort + z-mean per (rank, class)data/p1a_alm2006_power_analysis.tsv— Alm 2006 power table: minimum detectable effect size at 80% power; power at d=0.3 / d=0.5 / d=0.1data/p1a_review_response_diagnostics.json— full output log
Why this is in a notebook¶
These two analyses were initially computed inline during the review-response synthesis (REVIEW_1.md + ADVERSARIAL_REVIEW_1.md → REPORT.md v1.1). Per BERIL convention, derived data must be reproducible from a committed notebook. NB05 closes this gap so the effect-size table and power analysis in REPORT.md trace back to a runnable notebook with saved outputs.
Setup¶
import json, time
from pathlib import Path
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.power import TTestPower
PROJECT_ROOT = Path("/home/aparkin/BERIL-research-observatory/projects/gene_function_ecological_agora")
DATA_DIR = PROJECT_ROOT / "data"
diagnostics = {
"timestamp_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
"purpose": "reproducible derivation of effect-size and Alm 2006 power-analysis tables for REPORT.md v1.1",
}
Stage 1 — Load pilot scores¶
scores = pd.read_parquet(DATA_DIR / "p1a_pilot_scores.parquet")
print(f"Pilot scores: {scores.shape}")
print(f"Ranks: {sorted(scores['rank'].unique())}")
print(f"Classes: {sorted(scores['control_class'].dropna().unique())}")
diagnostics["n_pilot_scores"] = int(len(scores))
Pilot scores: (9201, 16) Ranks: ['class', 'family', 'genus', 'order', 'phylum'] Classes: ['natural_expansion', 'neg_ribosomal', 'neg_rnap_core', 'neg_trna_synth', 'pos_amr', 'pos_tcs_hk']
Stage 2 — Raw paralog-count effect sizes per (rank, class)¶
Addresses adversarial C3: "All findings report z-scores without effect sizes or confidence intervals on biologically meaningful scales. A +0.55 σ producer score could represent a 1% or 50% paralog expansion difference; the biological significance is unknowable."
Compute, per (rank, control_class):
obs_paralog_mean— observed paralog count averaged across (clade, UniRef) tuplescohort_mean_paralog— averaged cohort baseline (the producer-null cohort mean)raw_diff—obs_paralog_mean − cohort_mean_paralograw_pct_above_cohort—100 × raw_diff / cohort_mean_paralogproducer_z_mean— averaged z-score (cross-check against NB03)
effect_rows = []
for (rank, cls), sub in scores.groupby(["rank", "control_class"]):
if len(sub) < 3:
continue
obs_paralog_mean = sub["paralog_count"].mean()
cohort_mean = sub["cohort_mean_paralog"].mean()
cohort_std = sub["cohort_std_paralog"].mean()
raw_diff = obs_paralog_mean - cohort_mean
raw_pct_above_cohort = (raw_diff / cohort_mean * 100) if cohort_mean > 0 else np.nan
z_mean = sub["producer_z"].mean()
effect_rows.append({
"rank": rank, "control_class": cls, "n": len(sub),
"obs_paralog_mean": round(obs_paralog_mean, 3),
"cohort_mean_paralog": round(cohort_mean, 3),
"raw_diff": round(raw_diff, 3),
"raw_pct_above_cohort": round(raw_pct_above_cohort, 1) if not np.isnan(raw_pct_above_cohort) else np.nan,
"cohort_std_paralog": round(cohort_std, 3),
"producer_z_mean": round(z_mean, 3),
})
effect_df = pd.DataFrame(effect_rows)
effect_path = DATA_DIR / "p1a_effect_sizes_per_rank_class.tsv"
effect_df.to_csv(effect_path, sep="\t", index=False)
diagnostics["effect_sizes_path"] = str(effect_path.name)
diagnostics["effect_sizes_rows"] = len(effect_df)
print("=== Raw paralog-count effect sizes per (rank, class) ===")
print(effect_df.to_string(index=False))
=== Raw paralog-count effect sizes per (rank, class) === rank control_class n obs_paralog_mean cohort_mean_paralog raw_diff raw_pct_above_cohort cohort_std_paralog producer_z_mean class natural_expansion 482 1.739 1.278 0.460 36.0 0.609 0.498 class neg_ribosomal 285 1.046 1.190 -0.145 -12.2 0.494 -0.159 class neg_rnap_core 259 1.015 1.196 -0.180 -15.1 0.491 -0.229 class neg_trna_synth 222 1.018 1.215 -0.197 -16.2 0.542 -0.213 class pos_amr 283 1.170 1.339 -0.169 -12.6 0.779 -0.170 class pos_tcs_hk 174 1.069 1.313 -0.244 -18.6 0.809 -0.231 family natural_expansion 802 1.454 1.272 0.182 14.3 0.509 0.194 family neg_ribosomal 237 1.021 1.169 -0.148 -12.7 0.376 -0.197 family neg_rnap_core 180 1.017 1.172 -0.155 -13.2 0.373 -0.191 family neg_trna_synth 192 1.021 1.170 -0.150 -12.8 0.375 -0.175 family pos_amr 316 1.161 1.262 -0.100 -8.0 0.530 -0.067 family pos_tcs_hk 124 1.073 1.252 -0.180 -14.4 0.570 -0.160 genus natural_expansion 975 1.308 1.197 0.111 9.3 0.379 0.130 genus neg_ribosomal 276 1.011 1.127 -0.116 -10.3 0.261 -0.149 genus neg_rnap_core 237 1.013 1.120 -0.107 -9.6 0.260 -0.157 genus neg_trna_synth 199 1.015 1.124 -0.109 -9.7 0.282 -0.153 genus pos_amr 301 1.163 1.223 -0.061 -5.0 0.448 -0.035 genus pos_tcs_hk 73 1.123 1.268 -0.144 -11.4 0.560 -0.098 order natural_expansion 654 1.561 1.277 0.284 22.3 0.565 0.307 order neg_ribosomal 258 1.023 1.164 -0.141 -12.1 0.407 -0.194 order neg_rnap_core 211 1.019 1.191 -0.172 -14.4 0.451 -0.196 order neg_trna_synth 208 1.019 1.196 -0.177 -14.8 0.475 -0.199 order pos_amr 334 1.159 1.291 -0.132 -10.2 0.627 -0.114 order pos_tcs_hk 157 1.057 1.263 -0.206 -16.3 0.646 -0.191 phylum natural_expansion 457 1.772 1.271 0.501 39.5 0.598 0.546 phylum neg_ribosomal 315 1.041 1.175 -0.134 -11.4 0.463 -0.151 phylum neg_rnap_core 284 1.014 1.196 -0.182 -15.2 0.499 -0.238 phylum neg_trna_synth 248 1.020 1.200 -0.179 -15.0 0.506 -0.192 phylum pos_amr 272 1.173 1.342 -0.170 -12.6 0.783 -0.160 phylum pos_tcs_hk 186 1.065 1.305 -0.240 -18.4 0.783 -0.231
Stage 3 — Alm 2006 power analysis at pilot scale¶
Addresses adversarial I2: "The Alm 2006 reproduction failure (TCS HK mean producer z ≈ −0.2) is dismissed as resolution-dependent without power calculations."
One-sample t-test against 0 (null: producer z mean = 0; alternative: producer z mean > 0, i.e. paralog expansion). Compute:
- Minimum detectable effect size at α=0.05, 80% power
- Power if Alm 2006-equivalent effect were
d=0.3(medium),d=0.5(large),d=0.1(small)
analysis = TTestPower()
tcs_n_per_rank = (
scores[scores["control_class"] == "pos_tcs_hk"]
.groupby("rank").size()
.to_dict()
)
power_rows = []
for rank in ["genus", "family", "order", "class", "phylum"]:
n = tcs_n_per_rank.get(rank, 0)
if n < 3:
continue
obs_z = scores[(scores["rank"] == rank) & (scores["control_class"] == "pos_tcs_hk")]["producer_z"].mean()
min_d = analysis.solve_power(effect_size=None, nobs=n, alpha=0.05, power=0.8, alternative="larger")
pwr_d01 = analysis.solve_power(effect_size=0.1, nobs=n, alpha=0.05, power=None, alternative="larger")
pwr_d03 = analysis.solve_power(effect_size=0.3, nobs=n, alpha=0.05, power=None, alternative="larger")
pwr_d05 = analysis.solve_power(effect_size=0.5, nobs=n, alpha=0.05, power=None, alternative="larger")
power_rows.append({
"rank": rank, "n": n,
"min_detectable_d_at_80pct_power": round(float(min_d), 3),
"observed_producer_z_mean": round(float(obs_z), 3),
"power_if_d_0.1": round(float(pwr_d01), 3),
"power_if_d_0.3": round(float(pwr_d03), 3),
"power_if_d_0.5": round(float(pwr_d05), 3),
})
power_df = pd.DataFrame(power_rows)
power_path = DATA_DIR / "p1a_alm2006_power_analysis.tsv"
power_df.to_csv(power_path, sep="\t", index=False)
diagnostics["power_analysis_path"] = str(power_path.name)
diagnostics["power_analysis_rows"] = len(power_df)
print("=== Alm 2006 power analysis (TCS HK rank-specific) ===")
print(power_df.to_string(index=False))
=== Alm 2006 power analysis (TCS HK rank-specific) === rank n min_detectable_d_at_80pct_power observed_producer_z_mean power_if_d_0.1 power_if_d_0.3 power_if_d_0.5 genus 73 0.294 -0.098 0.212 0.814 0.995 family 124 0.225 -0.160 0.295 0.953 1.000 order 157 0.199 -0.191 0.346 0.982 1.000 class 174 0.189 -0.231 0.370 0.989 1.000 phylum 186 0.183 -0.231 0.387 0.992 1.000
Stage 4 — Interpretation¶
Effect sizes (Stage 2):
natural_expansionat phylum: paralog count 1.77 vs cohort 1.27 = +39.5% above cohort (raw paralog Δ = +0.50). At genus, signal is +9.3% (raw Δ = +0.11); the effect grows monotonically with rank as more species aggregate.- Negative controls (ribosomal / tRNA-synth / RNAP) sit ~12–16% below cohort baseline — the dosage-constrained signature.
- TCS HK sits 18% below cohort at phylum — opposite direction from the Alm 2006 family-level expansion claim.
Power analysis (Stage 3):
- Minimum detectable effect (80% power, α=0.05) ranges from d=0.18 (phylum, n=186) to d=0.29 (genus, n=73). All ranks except genus have power ≥ 95% to detect d=0.3.
- If Alm 2006's effect at UniRef50 were medium (d=0.3), we'd detect it with 81–99% probability.
- If Alm 2006's effect at UniRef50 were small (d=0.1), power is 21–39% — i.e. a small effect could plausibly be missed.
Conclusion: The Phase 1A negative result for Alm 2006 reproduction at UniRef50 is not underpowered for medium-or-larger effects. The observed mean z is negative (−0.10 to −0.23) — opposite direction from a positive paralog-expansion claim. This supports the substrate-hierarchy argument: Alm 2006's family-level paralog signal is not detectable at sequence-cluster (UniRef50) resolution by construction, not by underpowering. The v2 plan's reservation of Alm 2006 reproduction for Phase 2 (KO) and Phase 3 (Pfam architecture) is empirically validated.
Caveat: A small Alm 2006 effect at UniRef50 (d=0.1 or below) could be missed at genus-rank n=73. This caveat is recorded in REPORT.md v1.1 and acknowledged as a residual uncertainty.
# Materialize diagnostics
diagnostics["completed_utc"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
with open(DATA_DIR / "p1a_review_response_diagnostics.json", "w") as f:
json.dump(diagnostics, f, indent=2, default=str)
print(f"Wrote p1a_review_response_diagnostics.json")
Wrote p1a_review_response_diagnostics.json