Skip to content

5.6 Data platform and warehouse

OLTP (PostgreSQL) is optimised for per-borrower, per-loan transactional reads and writes. Analytics (cohort vintage, ECL, dashboards, board MIS, fraud analytics, model training) need columnar storage with optimised aggregation.

Running analytics on OLTP kills both:

  • Analytics is slow on row-store with operational indexes.
  • OLTP performance degrades when long-running analytical queries lock or starve resources.

Build the data platform from MVP. Skipping it means the analytics module is built on hacks and never recovers — by year 2, you’ll be drowning in operational fires from “the dashboard query brought down the database”.

Production OLTP (PostgreSQL Multi-AZ)
│ CDC (Debezium / Aurora native logical replication)
│ + scheduled extracts for slow-changing reference data
Raw / Bronze layer in warehouse
│ dbt models (SQL transforms)
Cleansed / Silver layer
│ dbt models
Modelled / Gold layer (star schemas, aggregates)
├── BI tool (Metabase / Superset / Looker / Tableau)
├── Operational dashboards
├── Compliance reports (RBI returns, bureau)
├── Board MIS
├── Cohort analytics
└── ML feature store (Feast)
OptionBest forCost shape
SnowflakeBest balance of DX, performance, ops simplicityCompute credits + storage
Amazon RedshiftAWS-native; tight integration; cluster-basedCluster cost (predictable but less elastic)
Google BigQueryGCP-native; serverless; pay-per-queryQuery cost (can surprise; budget alerts essential)
ClickHouse (managed or self-hosted)Very fast on large volumes; cost-effective at scaleLower cost at scale; ops effort if self-hosted
PostgreSQL replica + extensions (Citus)Smallest scale; familiarLimited at scale; tipping point around 10M rows per table

For an India NBFC starting at ₹30 Cr book → ₹500 Cr over 3 years:

  • MVP (< ₹50 Cr book): PostgreSQL read replica + dbt — sufficient; defer warehouse cost.
  • Phase 2 onwards: Snowflake (faster delivery, lower ops) or ClickHouse (cost-sensitive, ops capacity exists).
  • Debezium + Kafka Connect for PostgreSQL → Kafka → warehouse.
  • Aurora native logical replication can also feed CDC consumers directly.
  • Captures every INSERT / UPDATE / DELETE in real time with < 1 minute lag typically.
  • For events on Kafka (loan_event, repayments, classifications), pipe directly to warehouse (Bronze layer).
  • Schema-evolution rules preserve compatibility.

For occasional batch (vendor MIS files, bank statements, reference data):

  • Scheduled jobs (Airflow / Dagster / cron) pull files / API.
  • Write to S3 staging.
  • Load via warehouse-native loaders.
  • Streaming (CDC): near-real-time for transactional tables.
  • Hourly for batch-loaded reference data.
  • Daily for downstream rollups and aggregates.
  • Monthly for slow-changing metadata (NIC codes, IFSC list).

Use dbt (data build tool) for SQL-based transformations Bronze → Silver → Gold.

  • Versioned models in git — every transformation reviewable.
  • Tested with assertions — declarative assert statements catch data-quality drift.
  • Lineage visible — automatic dependency graph.
  • Standard practice in modern analytics teams.
  • dbt-aware engineers are widely hireable in India by 2026.
warehouse/
├── dbt_project.yml
├── models/
│ ├── bronze/ # 1:1 with source tables; minimal transformation
│ ├── silver/ # Cleaned, deduplicated, joined
│ │ ├── borrower/
│ │ ├── loan/
│ │ └── settlement/
│ └── gold/ # Business-ready: dimensional models
│ ├── fact_loan_daily.sql
│ ├── dim_borrower.sql
│ ├── dim_partner.sql
│ └── mart_cohort_vintage.sql
├── tests/ # Cross-model integrity checks
├── macros/ # Reusable SQL fragments
└── snapshots/ # Slowly Changing Dimension tracking
-- gold/fact_loan_daily.sql
{{ config(materialized='incremental', unique_key=['loan_id','snapshot_date']) }}
WITH classifications AS (
SELECT * FROM {{ ref('silver_classification_snapshot') }}
),
provisions AS (
SELECT * FROM {{ ref('silver_provision') }}
),
loans AS (
SELECT * FROM {{ ref('silver_loan_account') }}
)
SELECT
l.loan_id,
l.borrower_id,
l.product_id,
c.snapshot_date,
c.class AS classification,
c.dpd_days,
l.current_outstanding,
p.provision_amount,
CASE WHEN l.partner_lender_id IS NOT NULL THEN 'co_lent' ELSE 'own_book' END AS book_type
FROM loans l
JOIN classifications c ON l.loan_id = c.loan_id
LEFT JOIN provisions p ON l.loan_id = p.loan_id AND c.snapshot_date = p.snapshot_date
{% if is_incremental() %}
WHERE c.snapshot_date > (SELECT max(snapshot_date) FROM {{ this }})
{% endif %}
ToolBest forTCO
MetabaseMVP; broad usability; lowest TCOOpen source / hosted
SupersetScale; richer features; self-hostOpen source
Looker / Tableau / PowerBIEnterprise rigor; SSO; access controlCommercial

Recommendation: Metabase at MVP, Superset or Looker at scale based on team preference.

Regulatory reports (RBI returns, bureau monthly files) are best generated from the warehouse — cleansed, modelled data with reproducibility. The reporting module (3.N) reads from Gold tables.

Benefits:

  • Reproducibility — same model generates same output for same date range.
  • Audit trail — dbt model versions track exactly which logic produced the output.
  • Speed — warehouse aggregation is fast; same query on OLTP would be slow and disruptive.

When ML models enter production (Phase 4 onwards):

  • Feast (feast.dev) — open-source feature store.
  • Tecton — managed feature store.
  • Custom on top of Snowflake / ClickHouse — common pattern for teams with strong data engineering.

The platform’s features (BSA-derived metrics, GST-derived metrics, behaviour features) should be defined once in the feature store and served both to training (offline) and to real-time decisioning (online).

  • MLflow for model registry + experiment tracking.
  • Monitoring — drift, calibration, decision distribution:
    • Built as dbt models on the decision_run table.
    • Or via specialised tooling (Arize, Fiddler, Evidently).
  • CDC lag — typically < 1 minute; acceptable for analytics. For real-time dashboards, augment with streaming aggregations (Materialize, Flink) at scale.
  • Data quality — assertions in dbt; anomaly detection on key metrics; daily quality report.
  • PII handling in warehouse — minimised; tokens preferred; access strictly RBAC-controlled; masking for non-need-to-know users.
  • Cost — Snowflake / BigQuery can surprise; budget alerts mandatory; per-team / per-query cost attribution.
  • It is the source of truth for analytics, reporting, ML, board MIS.
  • It is not the operational source of truth for cash-affecting decisions — always reconcile back to OLTP for any cash-flow-affecting action.
  • It is not the place to store raw vendor PDFs — use object store.
  • It is not the place to compute current loan outstanding for repayment processing — use OLTP / event-source projection.