Implementation: P1.3 Messaging Federation/Clustering (NATS JetStream)

Scope

Production-grade multi-node messaging for SEA cells: NATS JetStream clustering for high availability, optional inter-cell federation via NATS leafnodes and stream mirroring, plus cluster-aware workers and observability. The result is a self-healing, resilient messaging layer that integrates seamlessly with sea-cell and the existing outbox/inbox patterns.

Non-negotiable constraints (project-aligned)

Current baseline

Target state (production-ready)


Implementation Plan

Workstream A: NATS JetStream cluster configuration

Goal: Introduce a stable, replicated NATS cluster with deterministic configuration for both dev and prod.

Deliverables

Plan

  1. Add a canonical NATS cluster config file under infra/nats/ with:
  2. Extend infra/docker/docker-compose.dev.yml with a 3-node NATS cluster profile:
  3. Ensure stream defaults reflect HA:
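
The HA stream defaults can be sketched as a JetStream stream configuration; the stream name, subjects, and limits below are illustrative assumptions, not values from the repo. `num_replicas: 3` lets each stream survive one node loss, and the duplicate window (given in nanoseconds, 120s here) backs the outbox dedupe behavior:

```json
{
  "name": "SEA_EVENTS",
  "subjects": ["sea.events.>"],
  "storage": "file",
  "retention": "limits",
  "num_replicas": 3,
  "duplicate_window": 120000000000
}
```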

Validation


Workstream B: Sea-cell integration (Kubernetes)

Goal: First-class NATS cluster in sea-cell deployments, self-healing and storage-safe.

Deliverables

Plan

  1. Update deploy/pulumi/components/sea-cell.ts to add:
  2. Expose cluster endpoints to workloads:
  3. Integrate config and secrets:

Validation


Workstream C: Cluster-aware sea-mq-worker

Goal: Worker is resilient across NATS cluster nodes and scales horizontally without duplicate processing.

Deliverables

Plan

  1. Extend worker configuration to support:
  2. Use async_nats::ConnectOptions for:
  3. Ensure consumer creation is idempotent:
  4. Preserve backpressure and dedupe behavior:
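
Two of the pure building blocks the plan relies on can be sketched in Rust; function names are illustrative, not from sea-mq-worker. A capped exponential backoff governs reconnect attempts across cluster nodes, and a stable `Nats-Msg-Id` derived from the outbox row means replays after a failover fall inside JetStream's duplicate window and are dropped:

```rust
/// Reconnect delay in milliseconds: 500 ms doubling per attempt, capped at 30 s.
/// The shift is clamped so large attempt counts cannot overflow.
fn reconnect_backoff_ms(attempt: u32) -> u64 {
    let base: u64 = 500;
    base.saturating_mul(1u64 << attempt.min(16)).min(30_000)
}

/// Deterministic dedupe id for an outbox row. Republishing the same row
/// after a crash or reconnect produces the same id, so JetStream's
/// duplicate window deduplicates it.
fn dedupe_msg_id(stream: &str, row_id: u64) -> String {
    format!("outbox:{stream}:{row_id}")
}

fn main() {
    assert_eq!(reconnect_backoff_ms(0), 500);
    assert_eq!(reconnect_backoff_ms(3), 4_000);
    assert_eq!(reconnect_backoff_ms(20), 30_000);
    assert_eq!(dedupe_msg_id("SEA_EVENTS", 42), "outbox:SEA_EVENTS:42");
    println!("ok");
}
```

In the worker itself these would sit behind the async_nats connection and publish paths; only the arithmetic and id scheme are shown here.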

Validation


Workstream D: Federation between sea-cells

Goal: Allow controlled cross-cell messaging without coupling clusters into a single failure domain.

Deliverables

Plan

  1. Use NATS leafnodes for cross-cell federation:
  2. Apply account isolation:
  3. For selective replication, use JetStream stream sources:
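
Selective replication can be sketched as a stream that sources from the remote cell's JetStream over the leafnode link. The stream names and the `cell-a` domain below are assumptions; the `external.api` prefix follows the standard `$JS.<domain>.API` form for cross-domain sources:

```json
{
  "name": "SEA_EVENTS_FROM_CELL_A",
  "num_replicas": 3,
  "sources": [
    {
      "name": "SEA_EVENTS",
      "external": {
        "api": "$JS.cell-a.API"
      }
    }
  ]
}
```

A source (unlike a mirror) lets the local stream also accept its own publishes, which keeps the cells decoupled: the link can drop and the source resumes from its last sequence on reconnect.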

Validation


Workstream E: Observability, runbooks, and resilience testing

Goal: Production-grade monitoring and automated recovery verification.

Deliverables

Plan

  1. NATS monitoring:
  2. Worker metrics:
  3. Runbooks:
  4. Chaos and failover tests:

Validation


Decisions and Recommendations

  1. Cluster size
  2. Federation topology
  3. Security model
  4. Stream replication factor
  5. Worker scaling mode
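
The cluster-size and replication-factor decisions both reduce to Raft quorum arithmetic: with n replicas, a majority of ⌊n/2⌋+1 must survive, so f = ⌊(n−1)/2⌋ failures are tolerated. A minimal check:

```rust
/// Number of node failures a Raft group of `replicas` members tolerates.
fn tolerated_failures(replicas: u32) -> u32 {
    (replicas - 1) / 2
}

fn main() {
    assert_eq!(tolerated_failures(3), 1); // 3-node cluster survives one loss
    assert_eq!(tolerated_failures(5), 2); // 5 nodes buy a second failure
    println!("ok");
}
```

This is why 3 nodes with R=3 streams is the smallest configuration that survives a single node loss without losing quorum.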

Acceptance Criteria (100% completion)


Parity Checklist (Docker → K3s → Mesh)

1) Single Source of Truth (NATS config)

2) Local HA Profile (Docker)

3) K3s + Mesh HA (Kubernetes)

4) Uniform Client Config

5) Observability Parity

6) Failover Validation


Parity Validation (Commands + Expected Outcomes)

Docker HA profile

Start cluster

docker compose -f infra/docker/docker-compose.dev.yml --profile nats-ha up -d

Expected

Verify replication

curl -s http://localhost:8222/jsz | jq '.meta.cluster'

Expected

Failover check

docker stop nats-1

Expected

K3s/Mesh (Kubernetes)

Check pods and services

kubectl -n <namespace> get pods -l app=nats
kubectl -n <namespace> get svc nats nats-headless

Expected

Health and replication

kubectl -n <namespace> port-forward svc/nats 8222:8222
curl -s http://localhost:8222/jsz | jq '.meta.cluster'

Expected

Failover check

kubectl -n <namespace> delete pod nats-0

Expected

Federation (Leafnode)

Connectivity check

curl -s http://localhost:8222/connz | jq '.connections[] | select(.name | test("leafnode"))'

Expected

Partition + resync

# Simulate link loss by blocking leafnode port or stopping the remote NATS

Expected


Recommendation

Layout

Why this layout

Client endpoint strategy

Compose profile snippet (docker)

services:
  nats-1:
    image: nats:2.10-alpine
    container_name: sea-nats-1
    command: ["-c", "/etc/nats/nats.conf"]
    environment:
      NATS_SERVER_NAME: nats-1
    ports:
      - "4222:4222"
      - "8222:8222"
      - "6222:6222"
    volumes:
      - ./infra/nats/nats.conf:/etc/nats/nats.conf:ro
      - sea_nats_data_1:/data
    networks: [sea-net]
    profiles: ["nats-ha"]

  nats-2:
    image: nats:2.10-alpine
    container_name: sea-nats-2
    command: ["-c", "/etc/nats/nats.conf"]
    environment:
      NATS_SERVER_NAME: nats-2
    ports:
      - "4223:4222"
      - "8223:8222"
      - "6223:6222"
    volumes:
      - ./infra/nats/nats.conf:/etc/nats/nats.conf:ro
      - sea_nats_data_2:/data
    networks: [sea-net]
    profiles: ["nats-ha"]

  nats-3:
    image: nats:2.10-alpine
    container_name: sea-nats-3
    command: ["-c", "/etc/nats/nats.conf"]
    environment:
      NATS_SERVER_NAME: nats-3
    ports:
      - "4224:4222"
      - "8224:8222"
      - "6224:6222"
    volumes:
      - ./infra/nats/nats.conf:/etc/nats/nats.conf:ro
      - sea_nats_data_3:/data
    networks: [sea-net]
    profiles: ["nats-ha"]

networks:
  sea-net:
    name: sea-net

volumes:
  sea_nats_data_1:
  sea_nats_data_2:
  sea_nats_data_3:

NATS config (shared)

server_name: $NATS_SERVER_NAME
port: 4222
http: 8222
jetstream {
  store_dir: /data
}
cluster {
  name: sea-nats
  port: 6222
  routes: [
    $NATS_CLUSTER_ROUTES
  ]
}

Environment Variables for Routes:
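
A hedged example of the route values (hostnames match the compose container names above; NATS ignores a route pointing at the node itself, so all three nodes can share one value). Whether a comma-separated list in a single variable expands into multiple array entries depends on the config parser; listing the routes explicitly in the config file per environment is the safer option:

```yaml
# Illustrative only: set per service in the compose file.
environment:
  NATS_SERVER_NAME: nats-1
  NATS_CLUSTER_ROUTES: "nats-route://sea-nats-1:6222, nats-route://sea-nats-2:6222, nats-route://sea-nats-3:6222"
```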

K3s/Mesh sketch (StatefulSet + Services)

apiVersion: v1
kind: Service
metadata:
  name: nats
spec:
  selector:
    app: nats
  ports:
    - name: client
      port: 4222
      targetPort: 4222
    - name: monitor
      port: 8222
      targetPort: 8222
---
apiVersion: v1
kind: Service
metadata:
  name: nats-headless
spec:
  clusterIP: None
  selector:
    app: nats
  ports:
    - name: routes
      port: 6222
      targetPort: 6222
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nats
spec:
  serviceName: nats-headless
  replicas: 3
  selector:
    matchLabels:
      app: nats
  template:
    metadata:
      labels:
        app: nats
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - nats
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: nats
          image: nats:2.10-alpine
          command: ["nats-server", "-c", "/etc/nats/nats.conf"]
          env:
            # The shared config references $NATS_SERVER_NAME; derive it from the
            # pod name. Cluster routes belong in the nats-config ConfigMap using
            # the headless service DNS names (e.g. nats-0.nats-headless:6222).
            - name: NATS_SERVER_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          ports:
            - containerPort: 4222
              name: client
            - containerPort: 8222
              name: monitor
            - containerPort: 6222
              name: routes
          volumeMounts:
            - name: config
              mountPath: /etc/nats
            - name: data
              mountPath: /data
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8222
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8222
      volumes:
        - name: config
          configMap:
            name: nats-config
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nats-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nats


Implementation Notes (Secrets + Environments)

Dev (env vars)

Production (SOPS)

Required keys