Skip to content

5.4 Events and async

Lending is intrinsically event-driven: sanction triggers documentation; documentation triggers disbursement; disbursement triggers LMS activation; every repayment triggers ledger updates, classification, and analytics. Modelling these as synchronous chains is fragile (one slow downstream blocks everything) and unauditable.

Events give:

  • Loose coupling between modules.
  • Audit trail by default — events are immutable.
  • Replay capability for reconciliation and recovery.
  • Easy fan-out to new consumers (analytics, partner systems, audit, ML feature store).
OptionWhen to chooseTrade-off
RabbitMQMVP, lower throughput, simpler opsLess ordering / partitioning power
Kafka (MSK)Scale, partitioning, ordering, large fan-outOperational overhead, more complex client semantics
SQS + SNSAWS-native simple at very low volumeLimited features for complex routing

For this platform: RabbitMQ at MVP (lower ops; sufficient throughput for < 1000 events/sec). Kafka / MSK at scale when ordering per loan and very high fan-out matter.

Events are namespaced by module. Sample of the core set:

application.created application.submitted
kyc.individual.completed kyc.individual.failed
kyb.entity.completed
data.aa.consent_granted data.aa.fetch_succeeded data.aa.fetch_failed
data.bank_statement.parsed data.gst.pulled data.bureau.pulled
decision.run.started decision.run.completed
decision.approved decision.declined decision.referred
deviation.requested deviation.approved deviation.declined
sanction.issued
document.generated kfs.acknowledged
esign.completed esign.failed
mandate.active mandate.failed
disbursement.created disbursement.approved disbursement.executed
disbursement.failed utr.captured
loan.activated
accrual.posted
repayment.received repayment.allocated
bounce.received
classification.changed
provision.computed
restructure.applied
writeoff.applied
loan.closed
lender_share.posted settlement.completed
dlg.invoked
collection.case.opened ews.raised
audit.event

Every event has a schema in a schema registry (Confluent Schema Registry, AWS Glue Schema Registry, or a homegrown JSON-Schema store).

Rules:

  • Backward compatibility mandatory — consumers tolerate optional new fields.
  • No breaking changes without versioning the topic (loan.activated.v2).
  • Schema review in CI — every new event or change requires registry update + compatibility check.
  • Consumer SDK generated from schema so consumers compile-fail on incompatible reads.

Every event-producing API call carries an idempotency key (UUID); every consumer dedupes by key.

Pattern:

Producer:
- Include `idempotency_key` in event metadata (same as request's idempotency key).
Consumer:
- On receive: check `seen_idempotency_keys` table for this consumer+topic.
- If present: skip (it's a duplicate replay).
- Else: insert key into seen table in same transaction as business processing.
- Commit.
- Ack to broker.

This makes consumers safe under at-least-once delivery, which is the only reliable delivery semantic across broker failures.

  • Transient failures (network, vendor 5xx, DB deadlock) → retry with exponential backoff (100ms, 200ms, 400ms, …, capped at 30s, max 10 attempts).
  • Permanent failures (schema mismatch, business-rule violation in consumer) → dead-letter queue (DLQ) with full message + error context.
  • DLQ inspection UI for ops to triage, fix root cause, and re-process.
  • Alerts on DLQ size growth (PagerDuty / Slack).

For events that need ordering (e.g., all events for the same loan):

  • Partition key = loan_id (Kafka); ensures all events for the same loan land on the same partition and consumer.
  • Within a partition, Kafka guarantees order.
  • Across partitions, no order guarantee — design consumers to be order-tolerant per partition only.

For events without ordering requirements (e.g., notifications), any partition strategy works.

Loan-affecting writes (disbursement, accrual, repayment, charge, classification, write-off) are append-only events on the loan_event table (and emitted on the bus).

Current state is derived. The loan_account table’s current_outstanding, current_classification, etc. are projections from the event stream.

  • Auditability — every state change explicit and immutable.
  • Time travel — query the loan’s state at any past timestamp.
  • Reconciliation — replay against an alternate projection to verify.
  • Debugging — see exactly what happened in sequence.
  • Discipline required — every state-changing write must go through the event path.
  • Storage growth — events accumulate; old events partition / archive policy needed.
  • Projection complexity — projections must be carefully built and tested.

Event source the LMS module from day 1. Other modules can use simpler state-machine patterns; event-source only where the audit and reconciliation benefits justify the discipline cost.

Producers write event records to an outbox table in the same database transaction as the business change. A separate outbox publisher process reads the outbox and publishes to the bus.

This avoids the dual-write inconsistency where a business change commits but the event publication fails.

Business transaction:
INSERT INTO loan_event (...);
INSERT INTO outbox (event_type, payload, status='pending');
COMMIT;
Outbox publisher (async, separate process):
SELECT * FROM outbox WHERE status='pending' LIMIT 100;
for each row:
publish to bus;
UPDATE outbox SET status='published' WHERE id=...;

Most events in the platform use this pattern.

For multi-step workflows that span modules (sanction → docs → disbursement → LMS activation), use sagas orchestrated by the workflow engine (5.5). Sagas handle compensating actions on failure (e.g., revoke docs, cancel mandate if disbursement fails).

For high-read entities (borrower view, loan view), maintain a read model distinct from the write model. The read model is updated by consuming events. This lets reads scale independently and avoids contention on the write model.

Use CQRS selectively — not every entity needs it. LMS read views (statements, dashboards) are good candidates.

  • Event lag monitoring — consumer lag per topic.
  • Throughput per topic and per consumer.
  • DLQ size alerts.
  • Schema-incompatibility alerts in CI.
  • Per-event tracing via distributed-tracing IDs propagated through event metadata.

Standard dashboards:

  • Topic-level: rate, lag, error rate.
  • Consumer-level: throughput, lag, retry rate, DLQ rate.
  • Event-type-level: count, age distribution, top consumers.

A representative event schema (loan.activated):

{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"title": "loan.activated v1",
"required": ["event_id","occurred_at","loan_id","sanction_id","borrower_id"],
"properties": {
"event_id": { "type": "string", "format": "uuid" },
"occurred_at": { "type": "string", "format": "date-time" },
"loan_id": { "type": "integer" },
"sanction_id": { "type": "integer" },
"borrower_id": { "type": "integer" },
"product_id": { "type": "integer" },
"initial_principal": { "type": "number" },
"co_lender_share": {
"type": "object",
"properties": {
"partner_lender_id": { "type": "integer" },
"originator_pct": { "type": "number" },
"partner_pct": { "type": "number" }
}
},
"metadata": {
"type": "object",
"properties": {
"trace_id": { "type": "string" },
"idempotency_key": { "type": "string" }
}
}
}
}
  • It is the channel for loose-coupled state propagation between modules and for downstream fan-out.
  • It is not the transactional persistence layer — that remains PostgreSQL with ACID transactions.
  • It is not a substitute for synchronous APIs — those exist for read paths and command paths where the caller needs immediate confirmation.
  • It is not unbounded — every event-bus topic, every consumer, every retry policy is deliberate.