14.4 Infrastructure and deployment

  • Primary: AWS Mumbai (ap-south-1).
  • DR: AWS Hyderabad (ap-south-2).
  • Multi-AZ: 2–3 AZs in the primary region.
  • Multi-account under AWS Organizations: prod, pre-prod, dev, security, audit.

Alternative clouds: Azure (Pune / Mumbai), OCI (Mumbai / Hyderabad), GCP (Mumbai). Choose based on team familiarity and pricing.

  • Kubernetes (EKS) for stateless services.
  • EC2 for stateful workloads as needed (Camunda, OpenSearch where a managed service is not preferred).
  • AWS Fargate for lightly utilised workers (cost optimisation).
  • VPC with public / private / restricted subnets.
  • Transit Gateway for multi-account / hybrid.
  • NAT Gateway for egress (one per AZ).
  • VPC Endpoints for AWS services (avoid internet hops).
  • Security Groups with least-privilege rules.
  • Network ACLs for subnet-level controls.
  • All outbound vendor traffic via an egress proxy (Squid / Envoy / NGINX) with allowlist.
  • Logs centralised for audit.
  • Terraform for all infrastructure.
  • Terragrunt for environment overlays.
  • State stored in S3 with DynamoDB locks; encrypted.
  • Internal module library.
  • GitHub Actions (or GitLab CI / Jenkins).
  • Per-module pipelines for the monolith.
  • Artifact: Docker images pushed to ECR.
  • Image scanning: Trivy / ECR scan.
  • Deployment: Helm + ArgoCD (GitOps).
  • Progressive rollout: canary / blue-green for critical services.
  • Logs: structured JSON; centralised via Fluent Bit → Loki / CloudWatch.
  • Metrics: Prometheus; alerts via Alertmanager → PagerDuty.
  • Traces: OpenTelemetry SDKs → Tempo / Datadog APM.
  • Synthetic monitoring: external probes hitting key endpoints.
  • Dashboards: Grafana.

Alternative: Datadog as one-stop (more expensive; faster to set up).
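
To make the "structured JSON" logging item concrete, here is a minimal sketch using only the Python standard library. It assumes services write one JSON object per line to stdout, where a collector such as Fluent Bit can pick it up. The field names (`ts`, `level`, `logger`, `message`, `trace_id`) and the helper `build_logger` are illustrative assumptions, not a schema defined by this document.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical correlation field; callers attach it with
        # extra={"trace_id": "..."} so logs can be joined with traces.
        trace_id = getattr(record, "trace_id", None)
        if trace_id is not None:
            payload["trace_id"] = trace_id
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that emits JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # keep the root logger from re-emitting as plain text
    return logger
```

A service would call `build_logger("payments").info("order %s created", order_id, extra={"trace_id": tid})`. Writing to stdout rather than to a file keeps the application unaware of the log pipeline, which suits containers on EKS or Fargate.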

  • RPO: <= 15 minutes for OLTP (CDC continuous).
  • RTO: <= 4 hours for full system recovery.
  • Quarterly DR drill of critical paths.
  • Documented runbooks per failure scenario.
  • Budget alerts per environment.
  • Right-sizing reviews monthly.
  • Reserved instances / savings plans for predictable workloads.
  • Spot instances for non-critical batch.
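
The RPO and RTO targets above can be checked mechanically after each DR drill. The sketch below assumes the drill records three timestamps: the newest change confirmed at the DR site, the moment of the simulated failure, and the moment service was restored. The `DrillResult` type and `evaluate_drill` function are hypothetical names introduced for illustration; only the 15-minute and 4-hour thresholds come from the targets listed above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

RPO_TARGET = timedelta(minutes=15)  # maximum tolerated data-loss window (OLTP)
RTO_TARGET = timedelta(hours=4)     # maximum tolerated time to full recovery


@dataclass(frozen=True)
class DrillResult:
    last_replicated_at: datetime   # newest change confirmed at the DR site
    failure_at: datetime           # when the primary was declared failed
    service_restored_at: datetime  # when the system was serving traffic again


def evaluate_drill(result: DrillResult) -> dict:
    """Compare measured data loss and recovery time with the targets."""
    # Data written after the last replicated change is lost on failover.
    data_loss = max(result.failure_at - result.last_replicated_at, timedelta(0))
    recovery_time = result.service_restored_at - result.failure_at
    return {
        "data_loss_window": data_loss,
        "recovery_time": recovery_time,
        "rpo_met": data_loss <= RPO_TARGET,
        "rto_met": recovery_time <= RTO_TARGET,
    }


if __name__ == "__main__":
    failure = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    drill = DrillResult(
        last_replicated_at=failure - timedelta(minutes=10),
        failure_at=failure,
        service_restored_at=failure + timedelta(hours=3, minutes=30),
    )
    print(evaluate_drill(drill))
```

Recording these values per drill gives the quarterly exercise a pass/fail outcome and shows whether replication lag or recovery time is drifting toward the limits.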