Research Question

Do Type IV secretion systems (T4SS) in environmental bacteria physically co-localise with
carbohydrate-active enzyme (CAZy) genes in a manner consistent with horizontal gene transfer,
and does phylogenetic analysis of syntenic GT2 glycosyltransferases reveal cross-phylum
transfer events that cannot be explained by vertical inheritance?

Key Findings

  • 21.8% of 30,497 high-quality environmental MAGs carry T4SS/conjugative machinery
    (6,652 MAGs; multi-marker definition: VirB4/6/8/9/10/11, VirD4, TraI/D, TrwB, TraG).
  • 92 CAZy families show elevated co-occurrence with T4SS loci at ≤10 kb (threshold
    validation pending); GT2 glycosyltransferases are the top hit (767 genomes, avg 5,041 bp).
  • Biome enrichment: marine sediment (OR=5.5, q<10⁻⁹⁸), barley rhizosphere (OR=10.4),
    maize rhizosphere (OR=4.1).
  • 77 HGT events detected in GT2 gene tree; 32 high-confidence cross-phylum events
    (normalised). Strongest: Node_4915 spans 8 phyla at max divergence 4.843.
  • No evidence that CAZy genes are plasmid-borne (ICEfinder: 12 IMEs in top 100
    accumulators); T4SS+ genomes have 10× higher MGE density (p<0.001), consistent with
    chromosomal/integrative transfer.
  • Discovery/validation split (70/30): enrichment patterns replicate across sets,
    supporting the robustness of the biome signal.

Interpretation

T4SS machinery in environmental MAGs is significantly co-localised with GT2
glycosyltransferase cassettes, and phylogenetic incongruence in the GT2 gene tree
provides positive evidence for 32 cross-phylum HGT events. The ICEfinder null result
(no plasmid-borne CAZy genes) combined with the 10× MGE density elevation in T4SS+
genomes is consistent with chromosomal or integrative transfer mechanisms (IMEs/ICEs)
rather than plasmid mobilisation.

These results support the hypothesis that T4SS machinery mediates environmental
dissemination of carbohydrate-active enzyme diversity across phylogenetically distant
bacteria. All associations are observational; mechanistic confirmation requires
experimental validation.

Status

Preliminary — core analyses complete (NB01–NB04). NB05 threshold validation and
NB06 manuscript figures pending.

NB05 Analysis Results

HGT event characterisation (Detected_HGT_Events.csv, n=77):
- Distinct_Phyla distribution: 65 events span 2 phyla, 12 span ≥3 phyla (15.6%).
- Node_4915 confirmed: 35 genes, 82.9% syntenic, 8 phyla (WOR-3, Desulfobacterota,
Patescibacteria, Bacteroidota, Firmicutes_A, Methanobacteriota, Bdellovibrionota,
Acidobacteriota), Max_Divergence = 4.843.
- Divergence vs. synteny: Spearman ρ = −0.615 (p<0.001) — more phylogenetically
distant events have lower syntenic percentage, consistent with sequence divergence
after transfer. This is the expected biological pattern.
- Most-involved phyla across all events: Firmicutes_A (27), Pseudomonadota (22),
Bacillota_A (19), Actinomycetota (10).
- Figure: figures/fig_nb05_hgt_scatter.png
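
The divergence vs. synteny correlation above is a rank correlation, and a minimal standard-library sketch shows what the statistic measures. The helper names and the toy values are assumptions made for illustration; they are not rows from Detected_HGT_Events.csv, and in practice a library routine such as scipy.stats.spearmanr would normally be used.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Toy values only (hypothetical, not project data).
max_divergence = [0.4, 1.1, 1.9, 2.7, 3.5, 4.8]
syntenic_pct = [95.0, 90.0, 92.0, 70.0, 75.0, 60.0]
rho = spearman_rho(max_divergence, syntenic_pct)
print(f"Spearman rho = {rho:.3f}")
```

A negative rho, as in the reported result, means that events with larger divergence tend to have a lower syntenic percentage; the sketch does not compute a p-value.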

GT2 neighbourhood analysis (376 genomes):
- Neighbourhood column parsed as a list: T4SS appears in 503 neighbourhood entries and
GT2 in 495, consistent with syntenic co-localisation at the contig level.
- GH23 (murein lytic transglycosylase) is the second most common CAZy family in GT2
neighbourhoods (106 occurrences), suggesting cell wall remodelling genes cluster with
GT2-T4SS syntenic loci.
- Figure: figures/fig_nb05_neighbourhood_functions.png

GT2 × metal resistance (new finding):
GT2-neighbourhood MAGs (n=376) have a mean of 0.045 metal resistance types vs 0.004 for
non-GT2 MAGs (n=260,276); Mann-Whitney p=8.6e-27. On average, genomes with GT2 in
T4SS-proximal neighbourhoods therefore carry about 11× more metal resistance types
(0.045 / 0.004 ≈ 11). This is an independent link between CAZy-T4SS synteny and the
metal resistance niche breadth hypothesis: genomes that serve as hubs for GT2 horizontal
transfer are also enriched for metal resistance functions.

Pending Validation

  1. Synteny threshold permutation test (requires Spark unfiltered data)
  2. Node_4915 BLAST validation (requires BLAST against NCBI nr)
  3. Housekeeping gene null baseline
  4. Biome enrichment factorisation: θ = OR(T4SS-CAZy) / [OR(T4SS) × OR(CAZy)]
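
The factorisation in item 4 can be written out as a short calculation. This is a minimal sketch under stated assumptions: the function names, the 2×2 table layout, the zero-cell correction, and all counts are hypothetical and chosen only to make the arithmetic visible; none of them come from the project data.

```python
def odds_ratio(a, b, c, d, correction=0.5):
    """Odds ratio for a 2x2 table [[a, b], [c, d]].

    a: feature present, in biome     b: feature present, outside biome
    c: feature absent, in biome      d: feature absent, outside biome
    A Haldane-Anscombe correction (add 0.5 to every cell) is applied only
    when some cell is zero, so the ratio stays finite.
    """
    if 0 in (a, b, c, d):
        a, b, c, d = (x + correction for x in (a, b, c, d))
    return (a * d) / (b * c)


def factorisation_theta(joint, t4ss, cazy):
    """theta = OR(T4SS-CAZy) / [OR(T4SS) * OR(CAZy)]; each argument is (a, b, c, d)."""
    return odds_ratio(*joint) / (odds_ratio(*t4ss) * odds_ratio(*cazy))


# Hypothetical counts for one biome (1,000 MAGs, 200 of them in the biome).
joint = (40, 60, 160, 740)   # MAGs with a T4SS-CAZy syntenic locus
t4ss = (80, 220, 120, 580)   # MAGs with any T4SS
cazy = (150, 450, 50, 350)   # MAGs with the CAZy family anywhere

theta = factorisation_theta(joint, t4ss, cazy)
print(f"theta = {theta:.3f}")
```

Under this reading, θ above 1 would suggest that the joint enrichment exceeds what the separate T4SS and CAZy enrichments predict, while θ near or below 1 would point towards biome confounding.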

Review

AI Review: BERIL Automated Review (Claude, claude-sonnet-4-6), 2026-05-02. Status: Needs Re-review

Summary

This is a scientifically ambitious and clearly motivated project investigating whether Type IV secretion systems physically co-localise with CAZy genes in environmental MAGs, and whether phylogenetic incongruence in GT2 gene trees provides evidence for cross-phylum HGT. The research question is specific and testable, the preliminary findings are quantitative and internally consistent, and the claim framing — explicitly associative and clearly distinguished from the PUL/TBDT literature — is exemplary. However, the project is not yet self-contained: its notebooks/ and data/ directories are both empty, with NB01–NB04 remaining in misc_exploratory/exploratory/ and all source data hard-coded to paths in that project. NB05 exists only as a standalone Python script (scripts/nb05_hgt_analysis.py) rather than the Jupyter notebook listed in the README, and none of the four adversarially-identified validation analyses (permutation test, housekeeping gene null, biome factorisation, Node_4915 BLAST) have been implemented. The project has strong scientific bones but requires significant work on self-containment, reproducibility scaffolding, and completion of the outstanding validation pipeline before it is ready for external review or manuscript submission.

Methodology

Research question: Clearly and precisely stated. The dual focus on physical co-localisation (≤10 kb synteny) and phylogenetic incongruence (GT2 gene tree) provides two independent lines of evidence, which is methodologically sound.

Approach: Appropriate. The multi-marker T4SS definition (VirB4/6/8/9/10/11, VirD4, TraI/D, TrwB, TraG) is rigorous; the 70/30 discovery/validation split for enrichment patterns is good statistical practice and replication is mentioned in REPORT.md. The negative binomial GLM with cluster-robust standard errors by taxonomic class is a well-chosen model for overdispersed count data with non-independence.

Statistical methods: Fisher's exact + FDR for biome enrichment, Spearman for divergence–synteny correlation, Mann-Whitney for metal resistance comparison — all appropriate for the data types. The OR=10.4 for barley rhizosphere is striking and warrants confirmation via the pending biome factorisation (NB05.3), which correctly asks whether this exceeds independent T4SS and CAZy prevalence effects.
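
For readers unfamiliar with the FDR step, a minimal sketch of one common procedure follows. The review does not say which FDR method the project uses, so the Benjamini-Hochberg adjustment here is an assumption, as are the function name and the example p-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone in rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical per-biome p-values (not project results).
p_vals = [0.001, 0.008, 0.039, 0.041, 0.6]
q_vals = benjamini_hochberg(p_vals)
for p_val, q_val in zip(p_vals, q_vals):
    print(f"p = {p_val:<6} q = {q_val:.5f}")
```

Each q-value is the smallest FDR level at which that test would still be called significant, which is how a threshold such as q<0.05 is applied across many biome comparisons.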

Data sources: Clearly tabulated in README (kescience_mgnify tables, kescience_fitnessbrowser, kescience_bacdive, NCBI RefSeq). The distinction between Spark-required and local-only queries is not documented.

Outstanding gap — four validation analyses not yet run: The adversarial review identified four specific methodological concerns, all still open at time of review:
1. Permutation test for the 10 kb synteny threshold: the threshold is arbitrary and unvalidated.
2. Housekeeping gene null baseline: without this, the 76 detected HGT events (32 high-confidence) cannot be shown to exceed background phylogenetic incongruence.
3. Biome enrichment factorisation (θ = OR_joint / [OR_T4SS × OR_CAZy]): biome confounding is not yet ruled out.
4. Node_4915 BLAST validation against NCBI nr: the 8-phyla, divergence-4.843 claim is the headline result and needs sequence-level verification.

These are not cosmetic; they are the primary claims the manuscript will need to defend.
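
As an illustration of the first open item, a permutation test for a proximity window could take roughly the following shape. This is a sketch only: the function names, the single toy contig, and the label-shuffling null model are assumptions, and the project may well choose a different null (for example, one that preserves gene density per contig or that varies the window).

```python
import random


def count_proximal(positions, is_t4ss, is_cazy, window=10_000):
    """Count CAZy genes whose position lies within `window` bp of any T4SS gene."""
    t4ss_positions = [pos for pos, flag in zip(positions, is_t4ss) if flag]
    return sum(
        1
        for pos, flag in zip(positions, is_cazy)
        if flag and any(abs(pos - t) <= window for t in t4ss_positions)
    )


def permutation_test(positions, is_t4ss, is_cazy, window=10_000, n_perm=999, seed=0):
    """One-sided test: is the observed proximal count higher than under shuffling?

    Null model (an assumption): CAZy labels are shuffled across all genes on the
    contig while gene positions and T4SS labels stay fixed.
    """
    rng = random.Random(seed)
    observed = count_proximal(positions, is_t4ss, is_cazy, window)
    shuffled = list(is_cazy)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if count_proximal(positions, is_t4ss, shuffled, window) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (at_least_as_extreme + 1) / (n_perm + 1)


# Toy contig with hypothetical gene midpoints in bp (not project data).
positions = [1_000, 4_000, 7_000, 12_000, 50_000,
             80_000, 120_000, 150_000, 200_000, 260_000]
is_t4ss = [True, True] + [False] * 8
is_cazy = [False, False, True, True] + [False] * 6

observed, p_value = permutation_test(positions, is_t4ss, is_cazy)
print(f"observed proximal CAZy genes = {observed}, permutation p = {p_value:.3f}")
```

Repeating the call with several `window` values (for example 5, 10 and 20 kb) would show whether the signal depends on the 10 kb choice.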

Code Quality

scripts/nb05_hgt_analysis.py: Well-structured, readable Python with clear section separators. Statistical calls are correct (Spearman, Mann-Whitney with alternative='greater'). Figures are labelled and saved with sufficient DPI.

One style issue: from collections import Counter is imported at line 41, mid-script, rather than at the top with the other imports. This is not a functional bug, but it is poor practice and will raise linter warnings. Separately, the list comprehension on line 40 reuses the name p as its loop variable; because Python 3 comprehensions have their own scope, this does not overwrite the Spearman p-value.
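
The scoping point can be checked directly. The snippet below is an illustrative stand-in, not an excerpt from scripts/nb05_hgt_analysis.py; the variable names and values are hypothetical.

```python
# `p` stands in for a p-value assigned earlier in a script.
p = 0.001

# The comprehension reuses the name `p` as its loop variable.
phyla = [p.strip() for p in ["Bacteroidota ", " Acidobacteriota"]]

# In Python 3 the loop variable lives in the comprehension's own scope,
# so the outer `p` still holds the number.
p_after_comprehension = p

# A plain for-loop has no separate scope and does rebind the outer name.
for p in ["Bacteroidota"]:
    pass
p_after_loop = p

print(p_after_comprehension, p_after_loop)
```

Reusing the name is therefore harmless inside a comprehension, but a distinct loop variable name would still be clearer and would stay safe if the code were later rewritten as an ordinary loop.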

Hard-coded absolute paths: All three input files use absolute paths to misc_exploratory/exploratory/data/. Anyone other than the original author — or the author on a different machine — cannot run the script without editing these paths.
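
One way to remove the machine dependence is to resolve inputs relative to the script's own location. The sketch below assumes the layout described in this review (a scripts/ directory next to data/); the helper name is hypothetical, and Detected_HGT_Events.csv is used only as an example input.

```python
from pathlib import Path


def project_data_dir(script_path, data_dirname="data"):
    """Return <project root>/data for a script stored in <project root>/scripts/."""
    return Path(script_path).resolve().parent.parent / data_dirname


# Inside scripts/nb05_hgt_analysis.py this would be called with __file__.
data_dir = project_data_dir("scripts/nb05_hgt_analysis.py")
hgt_events_csv = data_dir / "Detected_HGT_Events.csv"
print(hgt_events_csv)
```

Because the path is derived from the script location rather than the current working directory, the script would run unchanged from any checkout once the intermediate files are staged in data/.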

NB05 implements different analyses than planned: The README's NB05 section specifies four validation analyses (permutation test, housekeeping baseline, biome factorisation, Node_4915 BLAST). The script implements three different analyses (HGT event characterisation from CSV, GT2 neighbourhood function frequency, GT2 × metal resistance). The neighbourhood and metal resistance analyses are valuable and produce reproducible figures, but they do not address the adversarial validation agenda.

Source notebooks (NB01–NB04 in misc_exploratory): All three notebook files have saved outputs (T4SS_CAZy_Refined.ipynb: 21/22 cells with outputs; T4SS_CAzy_Tree_v3.ipynb: 4/5 cells; Alternative_HGT_Mechanisms_Plasticity_vs_Plasmids.ipynb: 6/14 cells). The Alternative_HGT notebook has notably fewer cells with saved outputs (43%), which means portions of NB04's analysis cannot be verified without re-running against Spark.

Pitfall awareness: No specific issues from docs/pitfalls.md are explicitly documented. The project uses kescience_mgnify (not kbase_ke_pangenome), so the pangenome-specific pitfalls (taxonomy join key, reserved order keyword) do not apply here. The general pitfall about string-typed numeric columns (kescience_fitnessbrowser coordinates and fitness values) would be relevant for any GLM cells in NB01 that use FitnessBrowser data — this is not verifiable without inspecting those notebooks in misc_exploratory.

Findings Assessment

Are conclusions supported? The preliminary conclusions reported in README and REPORT.md are quantitative and internally consistent. The OR values, q-values, GLM coefficients, and Spearman ρ are all cited with confidence intervals or p-values. The GT2 × metal resistance finding (11× enrichment, Mann-Whitney p=8.6e-27, n=376 vs 260,276) is striking and biologically interesting as a secondary result.

Count inconsistency: README states "76 HGT events detected in GT2 gene tree; 32 high-confidence cross-phylum events," but REPORT.md's NB05 section reports "n=77" for the Detected_HGT_Events.csv. This one-event discrepancy suggests either a version difference between the raw and normalised files, or an off-by-one in how one report counted. It should be reconciled and unified.

Limitations acknowledged: The claim framing section in README is unusually careful and commendable — explicitly associative, not causal; distinguishes mechanistically from PUL/TBDT literature; acknowledges experimental validation is required. This is the correct framing for an observational metagenomic study.

Incomplete analysis noted:
- NB05 validation analyses (all 4) pending.
- NB06 (synthesis/manuscript figures) not started — Fig. 1 (KEGG bubble chart) and Fig. 3 (GLM forest plot) are "TBD"; these are key figures for a manuscript.
- Only 2 of the 4 planned manuscript figures exist (Fig. 2, Fig. 4).

Figures: The two generated figures (fig_nb05_hgt_scatter.png, fig_nb05_neighbourhood_functions.png) exist in figures/ and are the output of the NB05 script. They cover the supplementary NB05 analyses. The primary manuscript figures (KEGG bubble chart, GLM forest plot) have not been generated.

Suggestions

  1. [Critical] Make the project self-contained: Copy or symlink NB01–NB04 notebooks from misc_exploratory/exploratory/ into notebooks/ with standardised names (01_t4ss_identification.ipynb, 02_synteny_biome_enrichment.ipynb, 03_gt2_gene_tree.ipynb, 04_mobilome_scan.ipynb). A project whose core analysis lives in a different project directory is not reproducible.

  2. [Critical] Implement the four NB05 validation analyses: The permutation test, housekeeping gene null, biome factorisation, and Node_4915 BLAST were identified by adversarial review as necessary before submission. These should be in a proper notebooks/05_validation_analyses.ipynb, not deferred further.

  3. [Critical] Stage intermediate data: Populate data/ with the key intermediate files needed for NB05 and NB06 (at minimum: Detected_HGT_Events.csv, GT2_neighborhood_signatures.csv, Normalized_Detected_HGT_Events.csv, tree_metadata.csv). Replace hard-coded misc_exploratory paths in scripts/nb05_hgt_analysis.py with relative paths to the project's own data/ directory.

  4. [High] Add a Reproduction section to README: Document which notebooks require JupyterHub Spark (NB01–NB04, NB05 permutation test), which can run locally from cached data (NB05 descriptives, NB06), and approximate expected runtimes. This is a standard BERDL project requirement.

  5. [High] Add requirements.txt: List all Python package dependencies (pandas, numpy, scipy, matplotlib, and any Spark/BERDL utilities used in NB01–NB04). This is needed for reproducibility.

  6. [Medium] Convert scripts/nb05_hgt_analysis.py to a Jupyter notebook: The script implements real analysis and generates published figures, but it has no cell-level outputs or documentation. Converting to .ipynb will allow narrative documentation alongside code and preserve outputs between runs.

  7. [Medium] Reconcile the 76 vs. 77 HGT event count: README says 76, REPORT.md NB05 section says 77. Identify whether this is a normalisation filter difference (raw vs. cross-phylum subset), a file version issue, or an off-by-one, and use a single consistent number throughout all documents.

  8. [Medium] Start NB06 synthesis figures: Fig. 1 (KEGG bubble chart) and Fig. 3 (GLM forest plot) are central manuscript figures marked TBD. These should be implemented before seeking external collaboration or submitting.

  9. [Low] Fix the mid-script import: Move from collections import Counter to the import block at the top of scripts/nb05_hgt_analysis.py (line 41 → top of file).

  10. [Low] Update the NB05 analysis plan in README: The four validation analyses in the README's NB05 section do not match what nb05_hgt_analysis.py actually implements. Either update the plan to reflect what was done, or clearly distinguish between "supplementary characterisation" (what the script does) and "validation" (the four adversarial items, still pending).

This review was generated by an AI system. It should be treated as advisory input, not a definitive assessment.

Visualizations

Fig Nb05 Hgt Scatter

Fig Nb05 Neighbourhood Functions

Notebooks