5.6 Data platform and warehouse

Why a separate warehouse

OLTP (PostgreSQL) is optimised for per-borrower, per-loan transactional reads and writes. Analytics (cohort vintage, ECL, dashboards, board MIS, fraud analytics, model training) need columnar storage with optimised aggregation.

Running analytics on OLTP kills both:

Analytics is slow on row-store with operational indexes.
OLTP performance degrades when long-running analytical queries lock or starve resources.

Build the data platform from MVP. Skipping it means the analytics module is built on hacks and never recovers — by year 2, you’ll be drowning in operational fires from “the dashboard query brought down the database”.

Architecture

   Production OLTP (PostgreSQL Multi-AZ)
        │
        │  CDC (Debezium / Aurora native logical replication)
        │  + scheduled extracts for slow-changing reference data
        ▼
   Raw / Bronze layer in warehouse
        │
        │  dbt models (SQL transforms)
        ▼
   Cleansed / Silver layer
        │
        │  dbt models
        ▼
   Modelled / Gold layer (star schemas, aggregates)
        │
        ├── BI tool (Metabase / Superset / Looker / Tableau)
        ├── Operational dashboards
        ├── Compliance reports (RBI returns, bureau)
        ├── Board MIS
        ├── Cohort analytics
        └── ML feature store (Feast)

Warehouse choice

Option	Best for	Cost shape
Snowflake	Best balance of DX, performance, ops simplicity	Compute credits + storage
Amazon Redshift	AWS-native; tight integration; cluster-based	Cluster cost (predictable but less elastic)
Google BigQuery	GCP-native; serverless; pay-per-query	Query cost (can surprise; budget alerts essential)
ClickHouse (managed or self-hosted)	Very fast on large volumes; cost-effective at scale	Lower cost at scale; ops effort if self-hosted
PostgreSQL replica + extensions (Citus)	Smallest scale; familiar	Limited at scale; tipping point around `10M rows` per table

For an India NBFC starting at ₹30 Cr book → ₹500 Cr over 3 years:

MVP (< ₹50 Cr book): PostgreSQL read replica + dbt — sufficient; defer warehouse cost.
Phase 2 onwards: Snowflake (faster delivery, lower ops) or ClickHouse (cost-sensitive, ops capacity exists).

ETL / ELT

CDC (Change Data Capture)

Debezium + Kafka Connect for PostgreSQL → Kafka → warehouse.
Aurora native logical replication can also feed CDC consumers directly.
Captures every INSERT / UPDATE / DELETE in real time with < 1 minute lag typically.

Event stream → warehouse

For events on Kafka (loan_event, repayments, classifications), pipe directly to warehouse (Bronze layer).
Schema-evolution rules preserve compatibility.

Batch loads

For occasional batch (vendor MIS files, bank statements, reference data):

Scheduled jobs (Airflow / Dagster / cron) pull files / API.
Write to S3 staging.
Load via warehouse-native loaders.

Refresh cadence

Streaming (CDC): near-real-time for transactional tables.
Hourly for batch-loaded reference data.
Daily for downstream rollups and aggregates.
Monthly for slow-changing metadata (NIC codes, IFSC list).

dbt

Use dbt (data build tool) for SQL-based transformations Bronze → Silver → Gold.

Why dbt

Versioned models in git — every transformation reviewable.
Tested with assertions — declarative assert statements catch data-quality drift.
Lineage visible — automatic dependency graph.
Standard practice in modern analytics teams.
dbt-aware engineers are widely hireable in India by 2026.

dbt structure

warehouse/
├── dbt_project.yml
├── models/
│   ├── bronze/          # 1:1 with source tables; minimal transformation
│   ├── silver/          # Cleaned, deduplicated, joined
│   │   ├── borrower/
│   │   ├── loan/
│   │   └── settlement/
│   └── gold/            # Business-ready: dimensional models
│       ├── fact_loan_daily.sql
│       ├── dim_borrower.sql
│       ├── dim_partner.sql
│       └── mart_cohort_vintage.sql
├── tests/               # Cross-model integrity checks
├── macros/              # Reusable SQL fragments
└── snapshots/           # Slowly Changing Dimension tracking

Sample dbt model

-- gold/fact_loan_daily.sql
{{ config(materialized='incremental', unique_key=['loan_id','snapshot_date']) }}

WITH classifications AS (
    SELECT * FROM {{ ref('silver_classification_snapshot') }}
),
provisions AS (
    SELECT * FROM {{ ref('silver_provision') }}
),
loans AS (
    SELECT * FROM {{ ref('silver_loan_account') }}
)
SELECT
    l.loan_id,
    l.borrower_id,
    l.product_id,
    c.snapshot_date,
    c.class AS classification,
    c.dpd_days,
    l.current_outstanding,
    p.provision_amount,
    CASE WHEN l.partner_lender_id IS NOT NULL THEN 'co_lent' ELSE 'own_book' END AS book_type
FROM loans l
JOIN classifications c ON l.loan_id = c.loan_id
LEFT JOIN provisions p ON l.loan_id = p.loan_id AND c.snapshot_date = p.snapshot_date

{% if is_incremental() %}
WHERE c.snapshot_date > (SELECT max(snapshot_date) FROM {{ this }})
{% endif %}

BI

Tool	Best for	TCO
Metabase	MVP; broad usability; lowest TCO	Open source / hosted
Superset	Scale; richer features; self-host	Open source
Looker / Tableau / PowerBI	Enterprise rigor; SSO; access control	Commercial

Recommendation: Metabase at MVP, Superset or Looker at scale based on team preference.

Compliance reporting from the warehouse

Regulatory reports (RBI returns, bureau monthly files) are best generated from the warehouse — cleansed, modelled data with reproducibility. The reporting module (3.N) reads from Gold tables.

Benefits:

Reproducibility — same model generates same output for same date range.
Audit trail — dbt model versions track exactly which logic produced the output.
Speed — warehouse aggregation is fast; same query on OLTP would be slow and disruptive.

Feature store for ML

When ML models enter production (Phase 4 onwards):

Feast (feast.dev) — open-source feature store.
Tecton — managed feature store.
Custom on top of Snowflake / ClickHouse — common pattern for teams with strong data engineering.

The platform’s features (BSA-derived metrics, GST-derived metrics, behaviour features) should be defined once in the feature store and served both to training (offline) and to real-time decisioning (online).

Model registry and monitoring

MLflow for model registry + experiment tracking.
Monitoring — drift, calibration, decision distribution:
- Built as dbt models on the decision_run table.
- Or via specialised tooling (Arize, Fiddler, Evidently).

Operational concerns

CDC lag — typically < 1 minute; acceptable for analytics. For real-time dashboards, augment with streaming aggregations (Materialize, Flink) at scale.
Data quality — assertions in dbt; anomaly detection on key metrics; daily quality report.
PII handling in warehouse — minimised; tokens preferred; access strictly RBAC-controlled; masking for non-need-to-know users.
Cost — Snowflake / BigQuery can surprise; budget alerts mandatory; per-team / per-query cost attribution.

What the warehouse is and isn’t

It is the source of truth for analytics, reporting, ML, board MIS.
It is not the operational source of truth for cash-affecting decisions — always reconcile back to OLTP for any cash-flow-affecting action.
It is not the place to store raw vendor PDFs — use object store.
It is not the place to compute current loan outstanding for repayment processing — use OLTP / event-source projection.

3.L Portfolio monitoring — what the warehouse powers.
3.N Reporting and compliance — RBI returns generated from gold tables.
3.P Analytics and intelligence — dashboards and EWS.
14.2 Data and storage — concrete stack choices.
6.7 MVP vs advanced scorecard — ML scorecard roadmap.