14.4 Infrastructure and deployment
- Primary: AWS Mumbai (ap-south-1).
- DR: AWS Hyderabad (ap-south-2).
- Multi-AZ across 2–3 AZs in primary.
- Multi-account under AWS Organizations: prod, pre-prod, dev, security, audit.
Alternative clouds: Azure (Pune / Mumbai), OCI (Mumbai / Hyderabad), GCP (Mumbai). Choose based on team familiarity and pricing.
Compute
- Kubernetes (EKS) for stateless services.
- EC2 for stateful workloads as needed (Camunda, OpenSearch where a managed service is not preferred).
- AWS Fargate for less-utilised workers (cost optimisation).
Networking
- VPC with public / private / restricted subnets.
- Transit Gateway for multi-account / hybrid.
- NAT Gateway for egress (one per AZ).
- VPC Endpoints for AWS services (avoid internet hops).
- Security Groups scoped to least privilege.
- Network ACLs for subnet-level controls.
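The public / private / restricted layout above implies one subnet per tier in each AZ. A minimal sketch of carving those out of a VPC range with Python's standard `ipaddress` module follows; the `10.0.0.0/16` range, the `/20` subnet size, and the function name `plan_subnets` are illustrative assumptions, not values from this design.

```python
import ipaddress

def plan_subnets(vpc_cidr, azs, tiers=("public", "private", "restricted"), new_prefix=20):
    """Assign one non-overlapping subnet to every (tier, AZ) pair."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = list(vpc.subnets(new_prefix=new_prefix))
    needed = len(tiers) * len(azs)
    if needed > len(blocks):
        raise ValueError(f"{vpc_cidr} yields {len(blocks)} /{new_prefix} blocks, need {needed}")
    pairs = [(tier, az) for tier in tiers for az in azs]
    # zip stops at the shorter sequence, so spare blocks stay unallocated.
    return dict(zip(pairs, blocks))

# Assumed VPC range and AZ names, for illustration only.
plan = plan_subnets("10.0.0.0/16", ["ap-south-1a", "ap-south-1b", "ap-south-1c"])
for (tier, az), block in plan.items():
    print(f"{tier:<10} {az}  {block}")
```

Leaving the remaining blocks unallocated keeps room for an extra tier or AZ later without renumbering existing subnets.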
Egress control
- All outbound vendor traffic via an egress proxy (Squid / Envoy / NGINX) with allowlist.
- Logs centralised for audit.
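The allowlist decision the proxy makes can be sketched in a few lines. This is only an illustration of the matching logic, not the configuration of Squid, Envoy, or NGINX; the vendor hostnames are hypothetical, and the convention that a leading dot covers a domain and its subdomains is an assumption of this sketch.

```python
from urllib.parse import urlsplit

# Hypothetical vendor hosts. A leading dot means "this domain and any subdomain".
ALLOWLIST = {"api.vendor-a.example", ".vendor-b.example"}

def is_allowed(url, allowlist=ALLOWLIST):
    """Return True only if the URL's host matches an allowlist entry."""
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        return False  # no scheme or no host: deny by default
    for entry in allowlist:
        if entry.startswith("."):
            # Match the bare domain or a true subdomain, never a lookalike suffix.
            if host == entry[1:] or host.endswith(entry):
                return True
        elif host == entry:
            return True
    return False

print(is_allowed("https://api.vendor-a.example/v1/check"))
print(is_allowed("https://evilvendor-b.example/"))
```

Matching on the parsed hostname rather than on the raw URL string avoids being fooled by an allowed name that appears in a path or as a prefix of an attacker-controlled domain.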
Infrastructure as code
- Terraform for all infrastructure.
- Terragrunt for environment overlays.
- State stored in S3 with DynamoDB locks; encrypted.
- Internal module library.
CI / CD
- GitHub Actions (or GitLab CI / Jenkins).
- Per-module pipelines for the monolith.
- Artifact: Docker images pushed to ECR.
- Image scanning: Trivy / ECR scan.
- Deployment: Helm + ArgoCD (GitOps).
- Progressive rollout: canary / blue-green for critical services.
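To make the canary idea concrete, the sketch below shows the decision a progressive rollout repeats at each step: shift more traffic if the canary is healthy, otherwise roll back. The traffic steps, the 1% error threshold, and the function name `next_canary_step` are assumptions for illustration; in practice a rollout controller would make this decision from live metrics.

```python
def next_canary_step(current_weight, error_rate, steps=(5, 25, 50, 100), max_error_rate=0.01):
    """Return the next canary traffic weight in percent, or 0 to roll back."""
    if error_rate > max_error_rate:
        return 0  # canary is unhealthy: send all traffic back to stable
    for step in steps:
        if step > current_weight:
            return step
    return current_weight  # already at the final step

print(next_canary_step(5, 0.002))   # healthy canary moves to the next step
print(next_canary_step(25, 0.05))   # error rate above threshold triggers rollback
```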
Observability
- Logs: structured JSON; centralised via Fluent Bit → Loki / CloudWatch.
- Metrics: Prometheus; alerts via Alertmanager → PagerDuty.
- Traces: OpenTelemetry SDKs → Tempo / Datadog APM.
- Synthetic monitoring: external probes hitting key endpoints.
- Dashboards: Grafana.
Alternative: Datadog as one-stop (more expensive; faster to set up).
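As an illustration of the structured JSON logs listed above, here is a minimal sketch using only Python's standard `logging` module. The field names (`ts`, `service`, `trace_id`) and the logger name are assumptions; the actual schema should follow whatever the log pipeline is configured to parse.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in ("service", "trace_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")  # hypothetical service logger
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

logger.info("order accepted", extra={"service": "orders", "trace_id": "abc123"})
```

Carrying a trace identifier in every log line is what lets a log entry be joined to the corresponding trace later.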
DR / BCP
- RPO: <= 15 minutes for OLTP (CDC continuous).
- RTO: <= 4 hours for full system recovery.
- Quarterly DR drill of critical paths.
- Documented runbooks per failure scenario.
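The two targets above translate into simple checks: replication lag must stay within the RPO, and a drill's measured recovery time must stay within the RTO. The sketch below assumes the timestamps come from CDC monitoring and from drill records; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(minutes=15)  # maximum tolerated data loss for OLTP
RTO = timedelta(hours=4)     # maximum tolerated time to full recovery

def rpo_breached(last_replicated_at, now, rpo=RPO):
    """True if the DR copy lags the primary by more than the RPO."""
    return now - last_replicated_at > rpo

def drill_met_rto(failover_started, service_restored, rto=RTO):
    """True if a DR drill restored service within the RTO."""
    return service_restored - failover_started <= rto

now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
print(rpo_breached(now - timedelta(minutes=20), now))    # lag of 20 minutes exceeds the RPO
print(drill_met_rto(now, now + timedelta(hours=3, minutes=30)))
```

A check like `rpo_breached` is a natural alert condition, while `drill_met_rto` is the pass/fail criterion for each quarterly drill.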
Cost management
- Budget alerts per environment.
- Right-sizing reviews monthly.
- Reserved instances / savings plans for predictable workloads.
- Spot instances for non-critical batch.
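A per-environment budget alert can be as simple as projecting month-to-date spend to month end and comparing it with the budget. The sketch below uses a naive linear forecast; the budget figures, the 80% warning threshold, and the function name `forecast_alerts` are assumptions for illustration only.

```python
def forecast_alerts(budgets, spend_to_date, day_of_month, days_in_month, warn_at=0.8):
    """Flag environments whose linearly projected month-end spend nears or exceeds budget."""
    alerts = {}
    for env, budget in budgets.items():
        projected = spend_to_date.get(env, 0.0) / day_of_month * days_in_month
        ratio = projected / budget
        if ratio >= 1.0:
            alerts[env] = "over"
        elif ratio >= warn_at:
            alerts[env] = "warn"
    return alerts

# Illustrative figures in an arbitrary currency, on day 10 of a 30-day month.
budgets = {"prod": 10000, "pre-prod": 3000, "dev": 2000}
spend = {"prod": 4000, "pre-prod": 500, "dev": 550}
print(forecast_alerts(budgets, spend, day_of_month=10, days_in_month=30))
```

Alerting on the projection rather than on actual spend gives time to react before the budget is exhausted.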