14.4 Infrastructure and deployment

Cloud

Primary: AWS Mumbai (ap-south-1).
DR: AWS Hyderabad (ap-south-2).
Multi-AZ deployment across 2 to 3 AZs in the primary region.
Multi-account under AWS Organizations: prod, pre-prod, dev, security, audit.

Alternative clouds: Azure (Pune / Mumbai), OCI (Mumbai / Hyderabad), GCP (Mumbai). Choose based on team familiarity and pricing.

Compute

Kubernetes (EKS) for stateless services.
EC2 for stateful workloads as needed (Camunda; OpenSearch where a managed service is not preferred).
AWS Fargate for lightly utilised workers (cost optimisation).

Networking

VPC with public / private / restricted subnets.
Transit Gateway for multi-account / hybrid.
NAT Gateway for egress (one per AZ).
VPC Endpoints for AWS services (avoid internet hops).
Security Groups scoped to least privilege.
Network ACLs for subnet-level controls.
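The public / private / restricted split across AZs can be sketched as a simple address plan. This is a minimal illustration only: the VPC CIDR 10.20.0.0/16, the /20 subnet size, and the function name plan_subnets are assumptions, not values from this design.

```python
import ipaddress

def plan_subnets(vpc_cidr, azs, tiers=("public", "private", "restricted"), new_prefix=20):
    """Carve one equal-sized subnet per (tier, AZ) pair out of the VPC CIDR."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of consecutive blocks
    plan = {}
    for tier in tiers:
        for az in azs:
            try:
                plan[(tier, az)] = next(blocks)
            except StopIteration:
                raise ValueError("VPC CIDR too small for the requested layout") from None
    return plan

# Hypothetical CIDR; AZ names follow the ap-south-1 naming pattern.
plan = plan_subnets("10.20.0.0/16", ["ap-south-1a", "ap-south-1b", "ap-south-1c"])
for (tier, az), net in plan.items():
    print(f"{tier:<10} {az}  {net}")
```

Allocating tiers in contiguous ranges keeps Network ACL and route-table rules short, since each tier can be described by a small number of prefixes.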

Egress control

All outbound vendor traffic via an egress proxy (Squid / Envoy / NGINX) with allowlist.
Logs centralised for audit.
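The allowlist decision the egress proxy makes can be sketched as follows. This is an illustration of the matching logic, not proxy configuration: the vendor hostnames are hypothetical, and the convention that a leading dot covers a domain and its subdomains is an assumption of this sketch.

```python
from urllib.parse import urlsplit

# Hypothetical vendor hosts. A leading dot means "this domain and any subdomain".
ALLOWLIST = {"api.vendor-a.example", ".vendor-b.example"}

def is_allowed(url, allowlist=ALLOWLIST):
    """Return True only if the URL's host matches an allowlist entry."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        return False  # unparseable or host-less URLs are denied by default
    for entry in allowlist:
        if entry.startswith("."):
            # Match the bare domain or any subdomain, but not look-alike suffixes.
            if host == entry[1:] or host.endswith(entry):
                return True
        elif host == entry:
            return True
    return False

print(is_allowed("https://api.vendor-a.example/v1/score"))
print(is_allowed("https://unknown.example/"))
```

Denying by default and logging every decision is what makes the centralised proxy logs useful as an audit trail.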

IaC

Terraform for all infrastructure.
Terragrunt for environment overlays.
State stored in an encrypted S3 bucket, with DynamoDB for state locking.
Internal module library.

CI / CD

GitHub Actions (or GitLab CI / Jenkins).
Per-module pipelines for the monolith.
Artifact: Docker images pushed to ECR.
Image scanning: Trivy / ECR scan.
Deployment: Helm + ArgoCD (GitOps).
Progressive rollout: canary / blue-green for critical services.
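A canary rollout needs a promotion gate that compares the canary against the stable baseline. The sketch below shows one possible gate based on error rates; the thresholds (max_ratio, min_requests, the 0.1% floor) and the function name are hypothetical, and in practice the rollout tooling would evaluate such a rule against live metrics.

```python
def canary_decision(baseline_errors, baseline_total, canary_errors, canary_total,
                    max_ratio=1.5, min_requests=500):
    """Return "wait", "promote", or "rollback" for a canary step."""
    if canary_total < min_requests:
        return "wait"  # not enough traffic yet to judge the canary
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    # Absolute floor so that a near-zero baseline does not make
    # a single canary error trigger a rollback.
    allowed = max(baseline_rate * max_ratio, 0.001)
    return "promote" if canary_rate <= allowed else "rollback"

print(canary_decision(10, 10_000, 1, 100))    # too little canary traffic
print(canary_decision(10, 10_000, 1, 1_000))  # canary comparable to baseline
print(canary_decision(10, 10_000, 5, 1_000))  # canary clearly worse
```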

Observability

Logs: structured JSON; centralised via Fluent Bit → Loki / CloudWatch.
Metrics: Prometheus; alerts via Alertmanager → PagerDuty.
Traces: OpenTelemetry SDKs → Tempo / Datadog APM.
Synthetic monitoring: external probes hitting key endpoints.
Dashboards: Grafana.
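For the structured JSON logs, a service can emit one JSON object per line to stdout and leave collection to the log shipper. A minimal sketch using only the Python standard library follows; the field names (ts, level, logger, message) and the "ctx" convention for extra fields are assumptions of this example, not a schema defined by this design.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""
    def format(self, record):
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional per-call fields passed as extra={"ctx": {...}}.
        ctx = getattr(record, "ctx", None)
        if isinstance(ctx, dict):
            payload.update(ctx)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)

def build_logger(name, stream=None):
    """Create a logger that writes JSON lines to the given stream (stdout by default)."""
    handler = logging.StreamHandler(stream or sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

log = build_logger("orders")
log.info("order accepted", extra={"ctx": {"order_id": "o-123", "trace_id": "abc"}})
```

Carrying a trace identifier in every log line is what lets logs be correlated with traces in the dashboards.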

Alternative: Datadog as a one-stop platform (more expensive; faster to set up).

DR / BCP

RPO: <= 15 minutes for OLTP (via continuous CDC).
RTO: <= 4 hours for full system recovery.
Quarterly DR drill of critical paths.
Documented runbooks per failure scenario.
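The RPO and RTO targets above translate into two simple checks that a DR drill or a replication monitor can apply. This is a sketch under assumptions: the function names are hypothetical, and how the timestamp of the last replicated commit is obtained depends on the CDC tooling in use.

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(minutes=15)  # maximum tolerated data-loss window for OLTP
RTO = timedelta(hours=4)     # maximum tolerated time to full system recovery

def rpo_breached(last_replicated_commit, now, rpo=RPO):
    """Exposure is the age of the newest change already safe in the DR region."""
    return (now - last_replicated_commit) > rpo

def rto_met(outage_start, service_restored, rto=RTO):
    """True if recovery completed within the RTO."""
    return (service_restored - outage_start) <= rto

now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
print(rpo_breached(now - timedelta(minutes=5), now))
print(rto_met(now, now + timedelta(hours=3, minutes=30)))
```

Recording these two numbers for every quarterly drill gives a trend that shows whether the targets remain realistic.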

Cost management

Budget alerts per environment.
Right-sizing reviews monthly.
Reserved instances / savings plans for predictable workloads.
Spot instances for non-critical batch.
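Per-environment budget alerts can be driven by a forecast rather than by spend to date, so that a breach is flagged before month end. The sketch below uses a naive linear projection; the budget figures, the 80% / 100% thresholds, and the function names are hypothetical.

```python
import calendar
from datetime import date

# Hypothetical monthly budgets per environment.
BUDGETS = {"prod": 50_000.0, "pre-prod": 10_000.0, "dev": 5_000.0}

def forecast_month_end(spend_to_date, today):
    """Linear projection of month-to-date spend to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date * days_in_month / today.day

def budget_alerts(spend_by_env, today, budgets=BUDGETS, thresholds=(0.8, 1.0)):
    """Return (env, highest threshold crossed, forecast) for each environment at risk."""
    alerts = []
    for env, spend in spend_by_env.items():
        forecast = forecast_month_end(spend, today)
        crossed = [t for t in thresholds if forecast >= t * budgets[env]]
        if crossed:
            alerts.append((env, max(crossed), round(forecast, 2)))
    return alerts

spend = {"prod": 10_000.0, "pre-prod": 4_500.0, "dev": 3_000.0}
for alert in budget_alerts(spend, date(2024, 6, 15)):
    print(alert)
```

A linear projection overstates spend for front-loaded batch workloads and understates it for month-end peaks, so it is best treated as an early warning rather than a precise figure.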