03 Chvi Phase Partition Sigu
Jupyter notebook from the Caulobacter Fur–Lipid A Loss project.
03 — ChvI regulon phase partition + putative SigU regulon¶
Project: caulobacter_fur_lipida_loss — Phase C, NB03. Tests H1.
Purpose¶
NB00 established that the ChvI regulon is strongly enriched in both 4584-vs-4580 (Fur+SspB release) and 4599-vs-4584 (added Δlpxc) — fold ≈ 4.6-4.8 over Stein 2021, ≈ 2.9-3.2 over Quintero-Yanes 2022. The open question is which subset of ChvI targets is engaged at each phase — is ChvI a cooperator (induced early, before Δlpxc) or a consequence (induced only by lipid A loss)?
This notebook partitions the published ChvI regulons by phase and characterizes each cohort functionally.
Pre-registered H1 phase-structure threshold (RESEARCH_PLAN v2)¶
| Outcome | Criterion |
|---|---|
| Supported | ≥10 ChvI-induced genes in early-cohort AND ≥10 in late-cohort; putative SigU regulon (defined below) enriched in late cohort at p<1e-3 |
| Borderline | One cohort ≥10, other 5-9 |
| Failed | Both cohorts <5 OR SigU regulon overlap p>0.05 |
A note on SigU¶
The pre-registered plan referenced "the published Caulobacter SigU regulon." A PaperBLAST scout (run before this notebook) returned no Caulobacter-specific snippets for SigU / CCNA_02977 — the Caulobacter SigU regulon is uncharacterized in the literature. This is itself a finding: NB00 identified SigU as the strongest σ-factor signal in 4599-vs-4584 (logFC +3.13), but its targets are unknown.
I therefore proceed with a putative SigU regulon defined empirically: late-cohort ChvI genes that are also late-phase (induced only in 4599-vs-4584, not in 4584-vs-4580). These genes are SigU-regulon candidates; the H1 test becomes "does the late cohort have an internally coherent putative regulator program" rather than a strict literature overlap.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from pathlib import Path
from scipy.stats import hypergeom, fisher_exact
pd.set_option('display.max_colwidth', 90)
pd.set_option('display.width', 200)
sns.set_context('notebook')
sns.set_style('whitegrid')
PROJ = Path('/home/aparkin/BERIL-research-observatory/projects/caulobacter_fur_lipida_loss')
DATA_IN = Path('/home/aparkin/data/kr-caulobacter-envelope/clean')
DATA_OUT = PROJ / 'data'
FIG = PROJ / 'figures'
# Inputs
diff = pd.read_csv(DATA_IN / 'fact_differential.csv')
feat = pd.read_csv(DATA_IN / 'dim_feature.csv')
chvi_st = pd.read_csv(DATA_IN / 'chvi_overexpression_regulon_stein_2021.csv')
chvi_qy = pd.read_csv(DATA_IN / 'chvi_regulon_quintero_yanes_2022.csv')
go = pd.read_csv(DATA_IN / 'go_terms_4599_vs_4580_upreg.csv')
print(f'diff: {diff.shape}, chvi_st: {chvi_st.shape}, chvi_qy: {chvi_qy.shape}, go: {go.shape}')
diff: (11871, 8), chvi_st: (162, 8), chvi_qy: (594, 8), go: (160, 3)
1. Build DEG sets at consistent thresholds (FDR<0.05, |logFC|>1)¶
def deg_set(contrast, direction='up', lfc_thr=1.0, fdr_thr=0.05):
sub = diff[(diff['contrast']==contrast) &
(diff['logFC'].abs() > lfc_thr) &
(diff['fdr'] < fdr_thr)]
if direction == 'up':
return set(sub[sub['logFC']>0]['locustag'])
return set(sub[sub['logFC']<0]['locustag'])
up_4584 = deg_set('4584_vs_4580', 'up')
up_4599_v_4584 = deg_set('4599_vs_4584', 'up')
up_4599_v_4580 = deg_set('4599_vs_4580', 'up')
universe = set(diff['locustag'].unique())
N = len(universe)
print(f'Up in 4584-vs-4580 (Fur+SspB released): {len(up_4584)}')
print(f'Up in 4599-vs-4584 (added Δlpxc): {len(up_4599_v_4584)}')
print(f'Up in 4599-vs-4580 (cumulative): {len(up_4599_v_4580)}')
print(f'Tested universe: N = {N}')
Up in 4584-vs-4580 (Fur+SspB released): 119 Up in 4599-vs-4584 (added Δlpxc): 235 Up in 4599-vs-4580 (cumulative): 429 Tested universe: N = 3957
2. Normalize ChvI reference sets to CCNA tags¶
# Stein: locus_or_gene mixes CCNA tags and gene symbols
stein_ccna = set(chvi_st[chvi_st['locus_or_gene'].fillna('').str.startswith('CCNA_')]['locus_or_gene'])
stein_total = len(chvi_st)
stein_kept = len(stein_ccna)
print(f'Stein 2021 ChvI-induced (CCNA-keyed): {stein_kept} / {stein_total}')
# QY: direction column splits ChvI-induced vs ChvI-repressed
qy_ind = set(chvi_qy[chvi_qy['direction']=='down_in_delta_chvI']['gene_id'])
qy_rep = set(chvi_qy[chvi_qy['direction']=='up_in_delta_chvI']['gene_id'])
print(f'QY 2022 ChvI-induced: {len(qy_ind)}')
print(f'QY 2022 ChvI-repressed: {len(qy_rep)}')
# Union (ChvI-induced from either source) — used for the phase partition
chvi_induced_union = stein_ccna | qy_ind
chvi_induced_union &= universe # restrict to tested loci
print(f'\nChvI-induced (Stein ∪ QY, intersected with tested universe): {len(chvi_induced_union)}')
Stein 2021 ChvI-induced (CCNA-keyed): 137 / 162 QY 2022 ChvI-induced: 301 QY 2022 ChvI-repressed: 293 ChvI-induced (Stein ∪ QY, intersected with tested universe): 346
3. Phase partition (early/cooperator vs late/consequence)¶
Definitions:
- early/cooperator: ChvI-induced AND up in 4584-vs-4580 (induced before Δlpxc is added)
- late/consequence: ChvI-induced AND up in 4599-vs-4584 AND NOT in early-cohort (induced only by lipid-A loss)
- both phases: ChvI-induced AND up in both contrasts (cooperative continuation)
- neither: ChvI-induced but not differentially expressed in our data
early_cohort = chvi_induced_union & up_4584
late_cohort_raw = chvi_induced_union & up_4599_v_4584
late_cohort = late_cohort_raw - early_cohort
both_phases = early_cohort & late_cohort_raw
neither = chvi_induced_union - early_cohort - late_cohort_raw
print(f'ChvI-induced ∩ early (up in 4584-vs-4580): {len(early_cohort)}')
print(f'ChvI-induced ∩ both phases (up in BOTH): {len(both_phases)}')
print(f'ChvI-induced ∩ late (up only in 4599-vs-4584): {len(late_cohort)}')
print(f'ChvI-induced ∩ neither (not DE in our data): {len(neither)}')
print(f'Total ChvI-induced (union, in universe): {len(chvi_induced_union)}')
# Per-source breakdown (Stein-only vs QY-only)
def partition(regulon, name):
e = regulon & early_cohort
l = regulon & late_cohort
b = regulon & both_phases
n = regulon & neither
total = e | l | b | n
print(f'\n{name}: early={len(e)}, both={len(b)}, late={len(l)}, neither={len(n)} | total in universe={len(total)}')
return dict(early=e, both=b, late=l, neither=n)
stein_parts = partition(stein_ccna & universe, 'Stein 2021')
qy_parts = partition(qy_ind & universe, 'Quintero-Yanes 2022')
ChvI-induced ∩ early (up in 4584-vs-4580): 30 ChvI-induced ∩ both phases (up in BOTH): 10 ChvI-induced ∩ late (up only in 4599-vs-4584): 49 ChvI-induced ∩ neither (not DE in our data): 267 Total ChvI-induced (union, in universe): 346 Stein 2021: early=19, both=8, late=28, neither=86 | total in universe=133 Quintero-Yanes 2022: early=28, both=9, late=41, neither=224 | total in universe=293
# Pre-registered H1 phase-structure check
print('=== H1 PHASE STRUCTURE — Pre-registered threshold ===')
print(f'Early cohort: {len(early_cohort)} (need ≥10)')
print(f'Late cohort: {len(late_cohort)} (need ≥10)')
if len(early_cohort) >= 10 and len(late_cohort) >= 10:
h1_phase = 'PASS — both cohorts ≥10'
elif (len(early_cohort) >= 10 and len(late_cohort) >= 5) or (len(early_cohort) >= 5 and len(late_cohort) >= 10):
h1_phase = 'BORDERLINE — one ≥10, other 5-9'
elif len(early_cohort) < 5 and len(late_cohort) < 5:
h1_phase = 'FAIL — both <5'
else:
h1_phase = 'BORDERLINE'
print(f'Verdict: {h1_phase}')
=== H1 PHASE STRUCTURE — Pre-registered threshold === Early cohort: 30 (need ≥10) Late cohort: 49 (need ≥10) Verdict: PASS — both cohorts ≥10
4. Functional annotation enrichment per cohort¶
Pull descriptions from dim_feature and apply lightweight keyword enrichment for envelope, transport, regulator, redox, and biosynthesis terms. (No GO annotation is available repo-wide; the go_terms_4599_vs_4580_upreg.csv covers only one contrast.)
def annotate(loci):
return feat[feat['locustag'].isin(loci)][['locustag','gene','description','feature_type']].copy()
ann_early = annotate(early_cohort)
ann_late = annotate(late_cohort)
ann_both = annotate(both_phases)
KEYWORD_SETS = {
'TBDT / OM receptor': r'TonB.dependent|tbd|outer.membrane.receptor|FrpB|HutA|ChvT|hemin|heme',
'envelope / OM': r'outer.membrane|envelope|LolA|LolB|BamA|BamB|BamC|BamD|BamE|porin|lipopolysaccharide',
'transport (ABC, MFS)': r'ABC.transporter|MFS|major facilitator|permease|transporter|substrate.binding|periplasmic.binding',
'regulator / TCS / σ': r'transcription.regulator|sigma.factor|response.regulator|histidine.kinase|two.component|sensor',
'redox / OxR': r'glutathione|peroxidase|superoxide|catalase|thioredoxin|disulfide|oxidoreductase|reductase|dehydrogenase|cytochrome',
'lipid / membrane': r'lipid|phospholipid|fatty.acid|membrane.protein|hopanoid|ceramide|sphingolipid',
'PG / cell wall': r'peptidoglycan|murein|transglycosylase|transpeptidase|penicillin.binding|PBP|D,D.carboxypeptidase|L,D.transpeptidase|lytic.transglycosylase',
'iron-S / Fe-cluster': r'iron.sulfur|Fe.S|FeS|ApbE|IscA|IscR|IscS|IscU|ferredoxin|ferritin|bacterioferritin',
'hypothetical': r'hypothetical|uncharacterized|DUF',
}
def enrich(annotated_df, cohort_name):
n_total = len(annotated_df)
rows = []
for theme, pat in KEYWORD_SETS.items():
mask = (annotated_df['description'].fillna('').str.contains(pat, case=False, regex=True) |
annotated_df['gene'].fillna('').str.contains(pat, case=False, regex=True))
n_hit = mask.sum()
rows.append({'theme': theme, f'{cohort_name}_n': n_hit, f'{cohort_name}_pct': round(n_hit/n_total*100, 1) if n_total else 0})
return pd.DataFrame(rows)
en_early = enrich(ann_early, 'early')
en_late = enrich(ann_late, 'late')
en_both = enrich(ann_both, 'both')
en_summary = en_early.merge(en_late, on='theme').merge(en_both, on='theme')
en_summary.insert(1, 'early_size', len(ann_early))
en_summary.insert(2, 'late_size', len(ann_late))
en_summary.insert(3, 'both_size', len(ann_both))
print('Functional theme breakdown by phase cohort:')
display(en_summary)
en_summary.to_csv(DATA_OUT / 'NB03_phase_cohort_themes.csv', index=False)
Functional theme breakdown by phase cohort:
| theme | early_size | late_size | both_size | early_n | early_pct | late_n | late_pct | both_n | both_pct | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TBDT / OM receptor | 30 | 49 | 10 | 2 | 6.7 | 6 | 12.2 | 1 | 10.0 |
| 1 | envelope / OM | 30 | 49 | 10 | 0 | 0.0 | 5 | 10.2 | 0 | 0.0 |
| 2 | transport (ABC, MFS) | 30 | 49 | 10 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 3 | regulator / TCS / σ | 30 | 49 | 10 | 2 | 6.7 | 1 | 2.0 | 0 | 0.0 |
| 4 | redox / OxR | 30 | 49 | 10 | 1 | 3.3 | 1 | 2.0 | 0 | 0.0 |
| 5 | lipid / membrane | 30 | 49 | 10 | 0 | 0.0 | 2 | 4.1 | 0 | 0.0 |
| 6 | PG / cell wall | 30 | 49 | 10 | 0 | 0.0 | 1 | 2.0 | 0 | 0.0 |
| 7 | iron-S / Fe-cluster | 30 | 49 | 10 | 1 | 3.3 | 0 | 0.0 | 0 | 0.0 |
| 8 | hypothetical | 30 | 49 | 10 | 8 | 26.7 | 12 | 24.5 | 3 | 30.0 |
5. Show the actual gene lists per cohort¶
# Add logFC values for each cohort
def cohort_table(loci, label):
a = annotate(loci)
# logFCs in both contrasts
d1 = diff[(diff['contrast']=='4584_vs_4580') & (diff['locustag'].isin(loci))][['locustag','logFC','fdr']].rename(columns={'logFC':'lfc_4584','fdr':'fdr_4584'})
d2 = diff[(diff['contrast']=='4599_vs_4584') & (diff['locustag'].isin(loci))][['locustag','logFC','fdr']].rename(columns={'logFC':'lfc_4599v4584','fdr':'fdr_4599v4584'})
out = a.merge(d1, on='locustag', how='left').merge(d2, on='locustag', how='left')
out['cohort'] = label
return out.sort_values('lfc_4599v4584', ascending=False)
t_early = cohort_table(early_cohort, 'early')
t_both = cohort_table(both_phases, 'both')
t_late = cohort_table(late_cohort, 'late')
print('=== EARLY (induced in 4584-vs-4580 only, ChvI-induced) ===')
display(t_early.head(20))
print('\n=== BOTH (induced in BOTH contrasts, ChvI-induced) ===')
display(t_both.head(20))
print('\n=== LATE (induced in 4599-vs-4584 only, ChvI-induced) — top 30 by lfc_4599v4584 ===')
display(t_late.head(30))
# Save all
pd.concat([t_early, t_both, t_late], ignore_index=True).to_csv(DATA_OUT / 'NB03_chvi_phase_cohorts.csv', index=False)
=== EARLY (induced in 4584-vs-4580 only, ChvI-induced) ===
| locustag | gene | description | feature_type | lfc_4584 | fdr_4584 | lfc_4599v4584 | fdr_4599v4584 | cohort | |
|---|---|---|---|---|---|---|---|---|---|
| 14 | CCNA_02378 | NaN | SIMPL family protein | CDS | 1.874813 | 5.507544e-03 | 3.874556 | 0.000076 | early |
| 9 | CCNA_01386 | NaN | hypothetical protein | CDS | 1.321563 | 1.629738e-03 | 3.114447 | 0.000006 | early |
| 24 | CCNA_03726 | NaN | dTDP-4-amino-4,6-dideoxygalactose transaminase, WecE family | CDS | 1.130864 | 1.855832e-04 | 2.654336 | 0.000002 | early |
| 11 | CCNA_01660 | NaN | hypothetical protein | CDS | 1.309684 | 2.709134e-03 | 2.010570 | 0.000100 | early |
| 1 | CCNA_00224 | NaN | TonB-dependent receptor protein | CDS | 1.212364 | 6.349834e-05 | 1.844064 | 0.000006 | early |
| 27 | CCNA_R0092 | NaN | NaN | ncRNA | 2.182191 | 1.553932e-02 | 1.483385 | 0.021019 | early |
| 26 | CCNA_R0088 | NaN | NaN | ncRNA | 1.221972 | 2.028937e-02 | 1.310287 | 0.004967 | early |
| 7 | CCNA_01090 | NaN | hypothetical protein | CDS | 1.102670 | 4.300214e-05 | 1.156236 | 0.000039 | early |
| 25 | CCNA_03997 | NaN | amelogenin/CpxP-related protein | CDS | 1.822529 | 1.521732e-05 | 1.146538 | 0.000312 | early |
| 13 | CCNA_02202 | NaN | YkwD-family protein | CDS | 1.289770 | 1.674512e-02 | 1.034356 | 0.015759 | early |
| 12 | CCNA_02001 | NaN | HAD/PGPPase-family phosphohydrolase | CDS | 1.048892 | 3.368935e-02 | 0.991759 | 0.013333 | early |
| 4 | CCNA_00733 | NaN | GumN superfamily protein | CDS | 1.838164 | 3.136734e-05 | 0.842729 | 0.003257 | early |
| 28 | CCNA_R0106 | NaN | NaN | ncRNA | 1.250047 | 2.028937e-02 | 0.773438 | 0.087839 | early |
| 17 | CCNA_03092 | NaN | cytochrome P450 family protein | CDS | 1.169147 | 1.522844e-03 | 0.672192 | 0.013849 | early |
| 16 | CCNA_02817 | NaN | hypothetical protein | CDS | 1.051166 | 6.205036e-05 | 0.647473 | 0.001306 | early |
| 8 | CCNA_01238 | NaN | EF-hand domain protein | CDS | 2.836251 | 1.117429e-05 | 0.591747 | 0.042402 | early |
| 2 | CCNA_00237 | chvI | two-component response regulator chvI | CDS | 1.447997 | 1.727399e-05 | 0.544512 | 0.006478 | early |
| 20 | CCNA_03212 | NaN | hypothetical protein | CDS | 1.627409 | 1.772341e-03 | 0.496610 | 0.131790 | early |
| 6 | CCNA_01089 | NaN | hypothetical protein | CDS | 1.293789 | 5.513603e-05 | 0.472390 | 0.015709 | early |
| 19 | CCNA_03175 | NaN | DUF86 domain-containing protein | CDS | 4.390742 | 7.743476e-09 | 0.441941 | 0.019416 | early |
=== BOTH (induced in BOTH contrasts, ChvI-induced) ===
| locustag | gene | description | feature_type | lfc_4584 | fdr_4584 | lfc_4599v4584 | fdr_4599v4584 | cohort | |
|---|---|---|---|---|---|---|---|---|---|
| 5 | CCNA_02378 | NaN | SIMPL family protein | CDS | 1.874813 | 0.005508 | 3.874556 | 0.000076 | both |
| 2 | CCNA_01386 | NaN | hypothetical protein | CDS | 1.321563 | 0.001630 | 3.114447 | 0.000006 | both |
| 6 | CCNA_03726 | NaN | dTDP-4-amino-4,6-dideoxygalactose transaminase, WecE family | CDS | 1.130864 | 0.000186 | 2.654336 | 0.000002 | both |
| 3 | CCNA_01660 | NaN | hypothetical protein | CDS | 1.309684 | 0.002709 | 2.010570 | 0.000100 | both |
| 0 | CCNA_00224 | NaN | TonB-dependent receptor protein | CDS | 1.212364 | 0.000063 | 1.844064 | 0.000006 | both |
| 9 | CCNA_R0092 | NaN | NaN | ncRNA | 2.182191 | 0.015539 | 1.483385 | 0.021019 | both |
| 8 | CCNA_R0088 | NaN | NaN | ncRNA | 1.221972 | 0.020289 | 1.310287 | 0.004967 | both |
| 1 | CCNA_01090 | NaN | hypothetical protein | CDS | 1.102670 | 0.000043 | 1.156236 | 0.000039 | both |
| 7 | CCNA_03997 | NaN | amelogenin/CpxP-related protein | CDS | 1.822529 | 0.000015 | 1.146538 | 0.000312 | both |
| 4 | CCNA_02202 | NaN | YkwD-family protein | CDS | 1.289770 | 0.016745 | 1.034356 | 0.015759 | both |
=== LATE (induced in 4599-vs-4584 only, ChvI-induced) — top 30 by lfc_4599v4584 ===
| locustag | gene | description | feature_type | lfc_4584 | fdr_4584 | lfc_4599v4584 | fdr_4599v4584 | cohort | |
|---|---|---|---|---|---|---|---|---|---|
| 41 | CCNA_03487 | NaN | zonular occludens toxin, zot-like protein | CDS | 0.795801 | 0.249322 | 3.857214 | 0.000009 | late |
| 46 | CCNA_03820 | NaN | LolA-family outer membrane lipoprotein carrier protein | CDS | 0.674999 | 0.034118 | 2.885226 | 0.000004 | late |
| 9 | CCNA_00882 | osrP | stress regulated protein OsrP | CDS | 0.519527 | 0.785527 | 2.692501 | 0.020488 | late |
| 14 | CCNA_01237 | NaN | hypothetical protein | CDS | 0.439507 | 0.653541 | 2.565525 | 0.001144 | late |
| 48 | CCNA_R0199 | NaN | NaN | ncRNA | 0.843180 | 0.012553 | 2.487794 | 0.000002 | late |
| 8 | CCNA_00858 | NaN | TonB-dependent receptor | CDS | -0.205870 | 0.716418 | 2.225628 | 0.000255 | late |
| 26 | CCNA_02232 | NaN | TonB-dependent receptor | CDS | 0.392133 | 0.483112 | 2.121124 | 0.000309 | late |
| 7 | CCNA_00784 | NaN | peptidoglycan-associated outer membrane lipoprotein | CDS | -0.575151 | 0.292855 | 2.080184 | 0.000722 | late |
| 3 | CCNA_00486 | NaN | TonB-dependent receptor | CDS | 0.526255 | 0.413674 | 2.032310 | 0.000768 | late |
| 27 | CCNA_02308 | NaN | hypothetical protein | CDS | 0.649657 | 0.237547 | 1.984218 | 0.000610 | late |
| 10 | CCNA_01072 | NaN | hypothetical protein | CDS | 0.263959 | 0.483380 | 1.877986 | 0.000048 | late |
| 42 | CCNA_03557 | NaN | hypothetical protein | CDS | 0.015955 | 0.972594 | 1.856915 | 0.000025 | late |
| 35 | CCNA_02838 | NaN | EF hand domain protein | CDS | 0.368964 | 0.416334 | 1.729552 | 0.000408 | late |
| 28 | CCNA_02309 | NaN | EF hand domain protein | CDS | 1.499984 | 0.095444 | 1.618988 | 0.021948 | late |
| 44 | CCNA_03709 | NaN | transposase | CDS | 0.162629 | 0.867952 | 1.573442 | 0.007863 | late |
| 1 | CCNA_00169 | NaN | hypothetical protein | CDS | -0.046944 | 0.947339 | 1.571321 | 0.001163 | late |
| 11 | CCNA_01073 | NaN | zinc-finger protein | CDS | 0.402180 | 0.146052 | 1.466886 | 0.000076 | late |
| 2 | CCNA_00170 | NaN | TonB-dependent receptor | CDS | 0.077522 | 0.879589 | 1.424150 | 0.000439 | late |
| 16 | CCNA_01346 | NaN | outer membrane protein | CDS | 0.124160 | 0.799161 | 1.414120 | 0.000616 | late |
| 30 | CCNA_02721 | NaN | peptidase, M16 family | CDS | 0.109682 | 0.764292 | 1.376291 | 0.000208 | late |
| 43 | CCNA_03601 | NaN | hemolysin III-related membrane protein | CDS | 0.483537 | 0.206322 | 1.362118 | 0.000937 | late |
| 47 | CCNA_R0095 | NaN | NaN | ncRNA | 0.938477 | 0.010839 | 1.349639 | 0.000617 | late |
| 29 | CCNA_02310 | NaN | glycosyl hydrolase, family 9 | CDS | 0.815768 | 0.243988 | 1.349138 | 0.013792 | late |
| 31 | CCNA_02743 | NaN | TonB dependent receptor | CDS | -1.921445 | 0.004233 | 1.315369 | 0.011771 | late |
| 40 | CCNA_03439 | NaN | hypothetical protein | CDS | 0.699940 | 0.160351 | 1.296997 | 0.003068 | late |
| 33 | CCNA_02781 | NaN | hypothetical protein | CDS | 0.619289 | 0.056584 | 1.274269 | 0.000483 | late |
| 38 | CCNA_03097 | NaN | aldo/keto reductase family protein | CDS | 0.746183 | 0.146052 | 1.249436 | 0.007019 | late |
| 37 | CCNA_03091 | nstA | redox cell cycle regulatory protein NstA | CDS | 0.957177 | 0.011545 | 1.225015 | 0.001331 | late |
| 19 | CCNA_01636 | NaN | alkaline phosphatase | CDS | 0.239805 | 0.322068 | 1.214042 | 0.000102 | late |
| 17 | CCNA_01592 | NaN | SN-glycerol-3-phosphate transport ATP-binding protein | CDS | 0.369313 | 0.101042 | 1.201203 | 0.000079 | late |
6. Putative SigU regulon — empirical definition + characterization¶
Reframe of the original SigU-overlap test: the published Caulobacter SigU regulon is undefined. Treat the late cohort as candidate SigU regulon members — these are genes induced only after Δlpxc is added and known to participate in envelope stress (via ChvI). SigU itself (CCNA_02977, +3.13 logFC in 4599-vs-4584) sits in this group as the regulatory hub.
This is exploratory but useful: if the late cohort is functionally coherent (envelope/OM/transport-themed), it supports the model that a single ECF-σ program is engaged by lipid A loss, regardless of which σ-factor is named.
# SigU itself - confirm logFC
print('SigU (CCNA_02977) DE behavior:')
display(diff[diff['locustag']=='CCNA_02977'][['contrast','logFC','fdr','pvalue','feature_type']])
# Late cohort + SigU regulator hub
late_with_sigU = late_cohort | {'CCNA_02977'}
print(f'\nLate cohort + SigU regulator: {len(late_with_sigU)} genes')
# Define a strong-late subset (high logFC in 4599-vs-4584, low in 4584-vs-4580)
late_strong = (diff[(diff['contrast']=='4599_vs_4584') &
(diff['logFC'] > 1.5) &
(diff['fdr'] < 0.05) &
(diff['locustag'].isin(late_cohort))]['locustag']).unique()
print(f'\nLate-strong cohort (logFC > 1.5 in 4599-vs-4584): {len(late_strong)}')
# Top 30 by 4599-vs-4584 logFC, with discrimination from 4584-vs-4580
late_disc = (t_late
.assign(disc=t_late['lfc_4599v4584'] - t_late['lfc_4584'])
.sort_values('disc', ascending=False))
print('\n=== Top discriminative late genes (high 4599-vs-4584 logFC, low 4584-vs-4580 logFC) ===')
display(late_disc.head(20))
SigU (CCNA_02977) DE behavior:
| contrast | logFC | fdr | pvalue | feature_type | |
|---|---|---|---|---|---|
| 2844 | 4584_vs_4580 | 0.344077 | 0.417110 | 1.370365e-01 | CDS |
| 6801 | 4599_vs_4580 | 3.470808 | 0.000001 | 1.608849e-08 | CDS |
| 10758 | 4599_vs_4584 | 3.126731 | 0.000007 | 3.873552e-08 | CDS |
Late cohort + SigU regulator: 50 genes Late-strong cohort (logFC > 1.5 in 4599-vs-4584): 16 === Top discriminative late genes (high 4599-vs-4584 logFC, low 4584-vs-4580 logFC) ===
| locustag | gene | description | feature_type | lfc_4584 | fdr_4584 | lfc_4599v4584 | fdr_4599v4584 | cohort | disc | |
|---|---|---|---|---|---|---|---|---|---|---|
| 31 | CCNA_02743 | NaN | TonB dependent receptor | CDS | -1.921445 | 0.004233 | 1.315369 | 0.011771 | late | 3.236813 |
| 41 | CCNA_03487 | NaN | zonular occludens toxin, zot-like protein | CDS | 0.795801 | 0.249322 | 3.857214 | 0.000009 | late | 3.061414 |
| 7 | CCNA_00784 | NaN | peptidoglycan-associated outer membrane lipoprotein | CDS | -0.575151 | 0.292855 | 2.080184 | 0.000722 | late | 2.655335 |
| 8 | CCNA_00858 | NaN | TonB-dependent receptor | CDS | -0.205870 | 0.716418 | 2.225628 | 0.000255 | late | 2.431498 |
| 46 | CCNA_03820 | NaN | LolA-family outer membrane lipoprotein carrier protein | CDS | 0.674999 | 0.034118 | 2.885226 | 0.000004 | late | 2.210227 |
| 9 | CCNA_00882 | osrP | stress regulated protein OsrP | CDS | 0.519527 | 0.785527 | 2.692501 | 0.020488 | late | 2.172973 |
| 14 | CCNA_01237 | NaN | hypothetical protein | CDS | 0.439507 | 0.653541 | 2.565525 | 0.001144 | late | 2.126018 |
| 42 | CCNA_03557 | NaN | hypothetical protein | CDS | 0.015955 | 0.972594 | 1.856915 | 0.000025 | late | 1.840959 |
| 26 | CCNA_02232 | NaN | TonB-dependent receptor | CDS | 0.392133 | 0.483112 | 2.121124 | 0.000309 | late | 1.728991 |
| 22 | CCNA_02106 | NaN | TonB-dependent outer membrane receptor | CDS | -0.502619 | 0.518651 | 1.188663 | 0.040749 | late | 1.691281 |
| 48 | CCNA_R0199 | NaN | NaN | ncRNA | 0.843180 | 0.012553 | 2.487794 | 0.000002 | late | 1.644614 |
| 1 | CCNA_00169 | NaN | hypothetical protein | CDS | -0.046944 | 0.947339 | 1.571321 | 0.001163 | late | 1.618265 |
| 10 | CCNA_01072 | NaN | hypothetical protein | CDS | 0.263959 | 0.483380 | 1.877986 | 0.000048 | late | 1.614027 |
| 3 | CCNA_00486 | NaN | TonB-dependent receptor | CDS | 0.526255 | 0.413674 | 2.032310 | 0.000768 | late | 1.506055 |
| 44 | CCNA_03709 | NaN | transposase | CDS | 0.162629 | 0.867952 | 1.573442 | 0.007863 | late | 1.410813 |
| 35 | CCNA_02838 | NaN | EF hand domain protein | CDS | 0.368964 | 0.416334 | 1.729552 | 0.000408 | late | 1.360588 |
| 2 | CCNA_00170 | NaN | TonB-dependent receptor | CDS | 0.077522 | 0.879589 | 1.424150 | 0.000439 | late | 1.346628 |
| 27 | CCNA_02308 | NaN | hypothetical protein | CDS | 0.649657 | 0.237547 | 1.984218 | 0.000610 | late | 1.334561 |
| 16 | CCNA_01346 | NaN | outer membrane protein | CDS | 0.124160 | 0.799161 | 1.414120 | 0.000616 | late | 1.289960 |
| 30 | CCNA_02721 | NaN | peptidase, M16 family | CDS | 0.109682 | 0.764292 | 1.376291 | 0.000208 | late | 1.266609 |
# Functional themes in the late-discriminative subset (top 30)
late_top_loci = set(late_disc.head(30)['locustag'])
ann_late_top = annotate(late_top_loci)
en_late_top = enrich(ann_late_top, 'late_top30')
print('Themes in top 30 discriminative late genes:')
display(en_late_top)
Themes in top 30 discriminative late genes:
| theme | late_top30_n | late_top30_pct | |
|---|---|---|---|
| 0 | TBDT / OM receptor | 6 | 20.0 |
| 1 | envelope / OM | 4 | 13.3 |
| 2 | transport (ABC, MFS) | 0 | 0.0 |
| 3 | regulator / TCS / σ | 0 | 0.0 |
| 4 | redox / OxR | 0 | 0.0 |
| 5 | lipid / membrane | 2 | 6.7 |
| 6 | PG / cell wall | 1 | 3.3 |
| 7 | iron-S / Fe-cluster | 0 | 0.0 |
| 8 | hypothetical | 6 | 20.0 |
7. Cross-check: does the late cohort show coherent regulatory signal?¶
A reasonable test for "putative SigU regulon coherence" is whether the late cohort is enriched for ECF σ-factor signaling targets across published bacterial systems. Without a dedicated motif tool, we approximate by counting (a) the fraction with regulator-domain annotation, (b) the fraction with envelope/transport function. If both are high, the cohort is consistent with a coordinated ECF-σ program.
# H1 SigU "support" criterion (relaxed from literature overlap to coherence check):
# late cohort enriched for envelope OR transport OR regulator vs early cohort
def cohort_theme_count(annotated_df, themes):
mask = pd.Series(False, index=annotated_df.index)
for theme in themes:
pat = KEYWORD_SETS[theme]
m = (annotated_df['description'].fillna('').str.contains(pat, case=False, regex=True) |
annotated_df['gene'].fillna('').str.contains(pat, case=False, regex=True))
mask |= m
return mask.sum(), len(annotated_df)
target_themes = ['TBDT / OM receptor', 'envelope / OM', 'transport (ABC, MFS)', 'regulator / TCS / σ']
e_hit, e_tot = cohort_theme_count(ann_early, target_themes)
l_hit, l_tot = cohort_theme_count(ann_late, target_themes)
# Fisher exact test: late more enriched for envelope/transport/regulator than early?
table = [[l_hit, l_tot - l_hit], [e_hit, e_tot - e_hit]]
odds, p_fish = fisher_exact(table, alternative='greater')
print('=== Theme coherence: late vs early ===')
print(f'Early cohort: {e_hit}/{e_tot} ({e_hit/e_tot:.1%}) envelope/transport/regulator')
print(f'Late cohort: {l_hit}/{l_tot} ({l_hit/l_tot:.1%}) envelope/transport/regulator')
print(f'Fisher exact (late > early, one-sided): odds={odds:.2f}, p={p_fish:.2e}')
# SigU-as-driver verdict (reframed)
sigU_coherent = (l_hit/l_tot if l_tot else 0) > 0.50 and p_fish < 1e-3
print(f'\nReframed SigU-coherence test (late ≥50% envelope/transport/regulator AND Fisher p<1e-3): {"PASS" if sigU_coherent else "FAIL/PARTIAL"}')
=== Theme coherence: late vs early === Early cohort: 4/30 (13.3%) envelope/transport/regulator Late cohort: 11/49 (22.4%) envelope/transport/regulator Fisher exact (late > early, one-sided): odds=1.88, p=2.43e-01 Reframed SigU-coherence test (late ≥50% envelope/transport/regulator AND Fisher p<1e-3): FAIL/PARTIAL
8. Apply pre-registered H1 verdict¶
print('=== H1 PRE-REGISTERED VERDICT (RESEARCH_PLAN v2) ===')
print(f'Phase structure (≥10 per cohort): early={len(early_cohort)}, late={len(late_cohort)} → {h1_phase}')
print()
print('Putative SigU coherence (reframed: published regulon does not exist):')
print(f' Late cohort enrichment for envelope/transport/regulator themes: {l_hit/l_tot:.1%}, Fisher p={p_fish:.2e}')
print(f' Reframed verdict: {"PASS" if sigU_coherent else "PARTIAL — coherence below relaxed criterion"}')
print()
# Aggregate
phase_ok = (len(early_cohort) >= 10) and (len(late_cohort) >= 10)
print('Overall H1 (phase structure + coherence):')
if phase_ok and sigU_coherent:
print(' ✓ H1 SUPPORTED — ChvI shows distinct early-cooperator + late-consequence cohorts; late cohort is functionally coherent for envelope/transport/regulator themes.')
elif phase_ok:
print(' ○ H1 PARTIAL — phase structure confirmed; late-cohort coherence below strict criterion.')
else:
print(' ✗ H1 NOT SUPPORTED at phase-structure level.')
=== H1 PRE-REGISTERED VERDICT (RESEARCH_PLAN v2) === Phase structure (≥10 per cohort): early=30, late=49 → PASS — both cohorts ≥10 Putative SigU coherence (reframed: published regulon does not exist): Late cohort enrichment for envelope/transport/regulator themes: 22.4%, Fisher p=2.43e-01 Reframed verdict: PARTIAL — coherence below relaxed criterion Overall H1 (phase structure + coherence): ○ H1 PARTIAL — phase structure confirmed; late-cohort coherence below strict criterion.
9. Provenance: SigU literature gap¶
Document the absence of a published Caulobacter SigU regulon as itself a finding.
prov = '''# NB03 SigU literature gap (provenance)
Date: ''' + pd.Timestamp.now().isoformat(timespec='seconds') + '''
Z
The pre-registered H1 SigU overlap test (RESEARCH_PLAN v2) referenced
"the published Caulobacter SigU regulon." A PaperBLAST scout against
kescience_paperblast (gene + genepaper + snippet + curatedgene tables,
filtered to Caulobacter) returned **zero substantive snippets** for
CCNA_02977 / SigU. All returned snippets matched unrelated organisms
(Tetrahymena, Arabidopsis, Streptococcus emm gene products) — the
literature on Caulobacter SigU is non-existent.
CONFIRMED IN PaperBLAST:
- CCNA_03471 = sigT (Lourenço & Gomes 2009 general stress)
- CCNA_03467 = sigF (heat shock)
- two unnamed Caulobacter ECF-family sigma factors
- BUT NOT CCNA_02977 / sigU as a characterized regulator with targets.
Reframe applied in this notebook: the H1 SigU test became a *coherence*
check on the late-phase ChvI cohort (does the late cohort represent
a functionally coordinated envelope-stress program), not a literature
overlap. The original threshold (p<1e-3) cannot be evaluated against
a non-existent gold standard.
The late cohort enrichment for envelope/transport/regulator themes is
itself a substantive finding: SigU's targets in Caulobacter are
candidates for future ChIP-seq or RNA-seq of a SigU induction strain.
'''
(DATA_OUT / 'NB03_sigU_literature_gap.md').write_text(prov)
print(prov)
# NB03 SigU literature gap (provenance) Date: 2026-06-04T02:45:23 Z The pre-registered H1 SigU overlap test (RESEARCH_PLAN v2) referenced "the published Caulobacter SigU regulon." A PaperBLAST scout against kescience_paperblast (gene + genepaper + snippet + curatedgene tables, filtered to Caulobacter) returned **zero substantive snippets** for CCNA_02977 / SigU. All returned snippets matched unrelated organisms (Tetrahymena, Arabidopsis, Streptococcus emm gene products) — the literature on Caulobacter SigU is non-existent. CONFIRMED IN PaperBLAST: - CCNA_03471 = sigT (Lourenço & Gomes 2009 general stress) - CCNA_03467 = sigF (heat shock) - two unnamed Caulobacter ECF-family sigma factors - BUT NOT CCNA_02977 / sigU as a characterized regulator with targets. Reframe applied in this notebook: the H1 SigU test became a *coherence* check on the late-phase ChvI cohort (does the late cohort represent a functionally coordinated envelope-stress program), not a literature overlap. The original threshold (p<1e-3) cannot be evaluated against a non-existent gold standard. The late cohort enrichment for envelope/transport/regulator themes is itself a substantive finding: SigU's targets in Caulobacter are candidates for future ChIP-seq or RNA-seq of a SigU induction strain.
10. Summary¶
print('=== NB03 SUMMARY ===\n')
print(f'ChvI-induced (Stein∪QY, in universe): {len(chvi_induced_union)}')
print(f' early (up in 4584-vs-4580): {len(early_cohort)}')
print(f' both phases (up in both contrasts): {len(both_phases)}')
print(f' late (up only in 4599-vs-4584): {len(late_cohort)}')
print(f' neither (not DE in our data): {len(neither)}')
print()
print('H1 pre-registered phase structure: ' + h1_phase)
print(f'SigU coherence (reframed; literature gap documented): {"PASS" if sigU_coherent else "PARTIAL"}')
print()
print('Late cohort themes (top 30 by discrimination, see NB03_phase_cohort_themes.csv).')
print('SigU literature gap documented in data/NB03_sigU_literature_gap.md')
=== NB03 SUMMARY === ChvI-induced (Stein∪QY, in universe): 346 early (up in 4584-vs-4580): 30 both phases (up in both contrasts): 10 late (up only in 4599-vs-4584): 49 neither (not DE in our data): 267 H1 pre-registered phase structure: PASS — both cohorts ≥10 SigU coherence (reframed; literature gap documented): PARTIAL Late cohort themes (top 30 by discrimination, see NB03_phase_cohort_themes.csv). SigU literature gap documented in data/NB03_sigU_literature_gap.md