5.6 Data platform and warehouse
Why a separate warehouse
Section titled “Why a separate warehouse”OLTP (PostgreSQL) is optimised for per-borrower, per-loan transactional reads and writes. Analytics (cohort vintage, ECL, dashboards, board MIS, fraud analytics, model training) need columnar storage with optimised aggregation.
Running analytics on OLTP kills both:
- Analytics is slow on row-store with operational indexes.
- OLTP performance degrades when long-running analytical queries lock or starve resources.
Build the data platform from MVP. Skipping it means the analytics module is built on hacks and never recovers — by year 2, you’ll be drowning in operational fires from “the dashboard query brought down the database”.
Architecture
Section titled “Architecture” Production OLTP (PostgreSQL Multi-AZ) │ │ CDC (Debezium / Aurora native logical replication) │ + scheduled extracts for slow-changing reference data ▼ Raw / Bronze layer in warehouse │ │ dbt models (SQL transforms) ▼ Cleansed / Silver layer │ │ dbt models ▼ Modelled / Gold layer (star schemas, aggregates) │ ├── BI tool (Metabase / Superset / Looker / Tableau) ├── Operational dashboards ├── Compliance reports (RBI returns, bureau) ├── Board MIS ├── Cohort analytics └── ML feature store (Feast)Warehouse choice
Section titled “Warehouse choice”| Option | Best for | Cost shape |
|---|---|---|
| Snowflake | Best balance of DX, performance, ops simplicity | Compute credits + storage |
| Amazon Redshift | AWS-native; tight integration; cluster-based | Cluster cost (predictable but less elastic) |
| Google BigQuery | GCP-native; serverless; pay-per-query | Query cost (can surprise; budget alerts essential) |
| ClickHouse (managed or self-hosted) | Very fast on large volumes; cost-effective at scale | Lower cost at scale; ops effort if self-hosted |
| PostgreSQL replica + extensions (Citus) | Smallest scale; familiar | Limited at scale; tipping point around 10M rows per table |
For an India NBFC starting at ₹30 Cr book → ₹500 Cr over 3 years:
- MVP (
< ₹50 Cr book): PostgreSQL read replica + dbt — sufficient; defer warehouse cost. - Phase 2 onwards: Snowflake (faster delivery, lower ops) or ClickHouse (cost-sensitive, ops capacity exists).
ETL / ELT
Section titled “ETL / ELT”CDC (Change Data Capture)
Section titled “CDC (Change Data Capture)”- Debezium + Kafka Connect for PostgreSQL → Kafka → warehouse.
- Aurora native logical replication can also feed CDC consumers directly.
- Captures every INSERT / UPDATE / DELETE in real time with
< 1 minutelag typically.
Event stream → warehouse
Section titled “Event stream → warehouse”- For events on Kafka (
loan_event, repayments, classifications), pipe directly to warehouse (Bronze layer). - Schema-evolution rules preserve compatibility.
Batch loads
Section titled “Batch loads”For occasional batch (vendor MIS files, bank statements, reference data):
- Scheduled jobs (Airflow / Dagster / cron) pull files / API.
- Write to S3 staging.
- Load via warehouse-native loaders.
Refresh cadence
Section titled “Refresh cadence”- Streaming (CDC): near-real-time for transactional tables.
- Hourly for batch-loaded reference data.
- Daily for downstream rollups and aggregates.
- Monthly for slow-changing metadata (NIC codes, IFSC list).
Use dbt (data build tool) for SQL-based transformations Bronze → Silver → Gold.
Why dbt
Section titled “Why dbt”- Versioned models in git — every transformation reviewable.
- Tested with assertions — declarative
assertstatements catch data-quality drift. - Lineage visible — automatic dependency graph.
- Standard practice in modern analytics teams.
- dbt-aware engineers are widely hireable in India by 2026.
dbt structure
Section titled “dbt structure”warehouse/├── dbt_project.yml├── models/│ ├── bronze/ # 1:1 with source tables; minimal transformation│ ├── silver/ # Cleaned, deduplicated, joined│ │ ├── borrower/│ │ ├── loan/│ │ └── settlement/│ └── gold/ # Business-ready: dimensional models│ ├── fact_loan_daily.sql│ ├── dim_borrower.sql│ ├── dim_partner.sql│ └── mart_cohort_vintage.sql├── tests/ # Cross-model integrity checks├── macros/ # Reusable SQL fragments└── snapshots/ # Slowly Changing Dimension trackingSample dbt model
Section titled “Sample dbt model”-- gold/fact_loan_daily.sql{{ config(materialized='incremental', unique_key=['loan_id','snapshot_date']) }}
WITH classifications AS ( SELECT * FROM {{ ref('silver_classification_snapshot') }}),provisions AS ( SELECT * FROM {{ ref('silver_provision') }}),loans AS ( SELECT * FROM {{ ref('silver_loan_account') }})SELECT l.loan_id, l.borrower_id, l.product_id, c.snapshot_date, c.class AS classification, c.dpd_days, l.current_outstanding, p.provision_amount, CASE WHEN l.partner_lender_id IS NOT NULL THEN 'co_lent' ELSE 'own_book' END AS book_typeFROM loans lJOIN classifications c ON l.loan_id = c.loan_idLEFT JOIN provisions p ON l.loan_id = p.loan_id AND c.snapshot_date = p.snapshot_date
{% if is_incremental() %}WHERE c.snapshot_date > (SELECT max(snapshot_date) FROM {{ this }}){% endif %}| Tool | Best for | TCO |
|---|---|---|
| Metabase | MVP; broad usability; lowest TCO | Open source / hosted |
| Superset | Scale; richer features; self-host | Open source |
| Looker / Tableau / PowerBI | Enterprise rigor; SSO; access control | Commercial |
Recommendation: Metabase at MVP, Superset or Looker at scale based on team preference.
Compliance reporting from the warehouse
Section titled “Compliance reporting from the warehouse”Regulatory reports (RBI returns, bureau monthly files) are best generated from the warehouse — cleansed, modelled data with reproducibility. The reporting module (3.N) reads from Gold tables.
Benefits:
- Reproducibility — same model generates same output for same date range.
- Audit trail — dbt model versions track exactly which logic produced the output.
- Speed — warehouse aggregation is fast; same query on OLTP would be slow and disruptive.
Feature store for ML
Section titled “Feature store for ML”When ML models enter production (Phase 4 onwards):
- Feast (
feast.dev) — open-source feature store. - Tecton — managed feature store.
- Custom on top of Snowflake / ClickHouse — common pattern for teams with strong data engineering.
The platform’s features (BSA-derived metrics, GST-derived metrics, behaviour features) should be defined once in the feature store and served both to training (offline) and to real-time decisioning (online).
Model registry and monitoring
Section titled “Model registry and monitoring”- MLflow for model registry + experiment tracking.
- Monitoring — drift, calibration, decision distribution:
- Built as dbt models on the
decision_runtable. - Or via specialised tooling (Arize, Fiddler, Evidently).
- Built as dbt models on the
Operational concerns
Section titled “Operational concerns”- CDC lag — typically
< 1 minute; acceptable for analytics. For real-time dashboards, augment with streaming aggregations (Materialize, Flink) at scale. - Data quality — assertions in dbt; anomaly detection on key metrics; daily quality report.
- PII handling in warehouse — minimised; tokens preferred; access strictly RBAC-controlled; masking for non-need-to-know users.
- Cost — Snowflake / BigQuery can surprise; budget alerts mandatory; per-team / per-query cost attribution.
What the warehouse is and isn’t
Section titled “What the warehouse is and isn’t”- It is the source of truth for analytics, reporting, ML, board MIS.
- It is not the operational source of truth for cash-affecting decisions — always reconcile back to OLTP for any cash-flow-affecting action.
- It is not the place to store raw vendor PDFs — use object store.
- It is not the place to compute current loan outstanding for repayment processing — use OLTP / event-source projection.
Related
Section titled “Related”- 3.L Portfolio monitoring — what the warehouse powers.
- 3.N Reporting and compliance — RBI returns generated from gold tables.
- 3.P Analytics and intelligence — dashboards and EWS.
- 14.2 Data and storage — concrete stack choices.
- 6.7 MVP vs advanced scorecard — ML scorecard roadmap.