Implementing CORAL for Ordinal Regression to Continuous Targets

1. Correcting the probabilistic mistake (core insight)

CORAL does not output class probabilities. It outputs cumulative (ordinal) probabilities:

\[ p_t = P(y > t), \quad t = 0,\dots,K-2 \]

These are threshold probabilities, not marginals. Treating them as \(P(y=t)\) leads to systematic overestimation because probability mass is counted multiple times.

1.1 Recovering marginal class probabilities

Define: \[ p_{-1}=1,\quad p_{K-1}=0 \]

Then: \[ \pi_k = P(y=k) = p_{k-1} - p_k,\quad k=0,\dots,K-1 \]

This converts cumulative → marginal, which is mandatory before computing expectations.
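As a concrete check, the cumulative-to-marginal conversion can be sketched in a few lines of NumPy (illustrative only; the codebase's `coral_logits_to_prob` is the authoritative implementation):

```python
import numpy as np

def cumulative_to_marginal(p_cum: np.ndarray) -> np.ndarray:
    """Convert CORAL cumulative probabilities p_t = P(y > t), t = 0..K-2,
    into marginal class probabilities pi_k = p_{k-1} - p_k."""
    # Pad with the boundary conventions p_{-1} = 1 and p_{K-1} = 0.
    padded = np.concatenate(([1.0], p_cum, [0.0]))
    return padded[:-1] - padded[1:]

# Example with K = 4 bins, i.e. K-1 = 3 cumulative probabilities.
p_cum = np.array([0.9, 0.5, 0.2])
pi = cumulative_to_marginal(p_cum)  # [0.1, 0.4, 0.3, 0.2], sums to 1
```

The marginals are guaranteed nonnegative only when the cumulative probabilities are nonincreasing, which CORAL's shared-weight construction enforces by design.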


2. Correct computation of the regression prediction

Let \(r\) be the original continuous regression target (e.g. RRI), discretized into \(K\) ordinal bins.

Let \(u_k\) be a representative real value for bin \(k\) (initially: empirical mean or bin center).

The Bayes-optimal MSE predictor is the conditional mean:

\[ \hat r = \mathbb{E}[r \mid x] \approx \sum_{k=0}^{K-1} \pi_k \, u_k \]

Under squared-error loss, this expectation over the marginals (not the cumulative probabilities) is the correct way to derive a scalar regression prediction from CORAL.
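A minimal sketch of the expectation, with illustrative marginals and bin representatives (the values below are made up for the example):

```python
import numpy as np

def expected_value(pi: np.ndarray, u: np.ndarray) -> float:
    """Conditional-mean prediction: r_hat = sum_k pi_k * u_k."""
    return float(np.dot(pi, u))

pi = np.array([0.1, 0.4, 0.3, 0.2])  # marginal class probabilities
u = np.array([0.5, 0.7, 0.9, 1.2])   # bin representatives (e.g. empirical bin means)
r_hat = expected_value(pi, u)        # 0.05 + 0.28 + 0.27 + 0.24 = 0.84
```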


3. Single ordinal head + auxiliary regression loss (preferred design)

Rather than adding a second regression head, we keep one CORAL head and derive \(\hat r\) from it.

3.1 Loss

\[ \mathcal{L} = \mathcal{L}_{\text{CORAL}} + \lambda \, \mathcal{L}_{\text{reg}}(\hat r, r) \]

  • \(\mathcal{L}_{\text{CORAL}}\): binary cross-entropy on thresholds
  • \(\mathcal{L}_{\text{reg}}\): Huber / Smooth-L1 (preferred over L2 for heavy-tailed RRI)
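A hedged PyTorch sketch of this combined loss. The function names and the sum-over-thresholds reduction are assumptions for illustration (the codebase's loss may differ); with this reduction, chance-level logits give a CORAL term near \((K-1)\log 2\):

```python
import torch
import torch.nn.functional as F

def coral_loss(logits: torch.Tensor, y_bin: torch.Tensor, K: int) -> torch.Tensor:
    """CORAL loss: BCE over the K-1 thresholds with extended binary
    labels 1[y > t], summed over thresholds, averaged over the batch."""
    thresholds = torch.arange(K - 1, device=y_bin.device)
    levels = (y_bin.unsqueeze(-1) > thresholds).float()          # (B, K-1)
    per_elem = F.binary_cross_entropy_with_logits(logits, levels, reduction="none")
    return per_elem.sum(dim=-1).mean()

def combined_loss(logits, y_bin, r_hat, r, K, lam=1.0):
    """L = L_CORAL + lambda * Huber(r_hat, r)."""
    return coral_loss(logits, y_bin, K) + lam * F.smooth_l1_loss(r_hat, r)
```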

3.2 Intuition

  • Enforces distributional correctness (ordinal structure)
  • Enforces calibration in real units
  • Avoids two heads learning the same quantity inconsistently

4. Learnable bin representatives \(u_k\)

Instead of fixing \(u_k=\mu_k\) (empirical mean), make them learnable, initialized from \(\mu_k\).

\[ \hat r = \sum_k \pi_k \, u_k \]

4.1 Why this helps

  • Corrects discretization bias
  • Adapts to distribution shift
  • Improves calibration without adding a full regression head

4.2 Regularization (important)

Prevent \(u_k\) from drifting and “absorbing” all error:

\[ \mathcal{L}_{u} = \alpha \sum_{k=0}^{K-1} (u_k - \mu_k)^2 \]

This keeps them tethered to empirical reality.


5. Enforcing ordinal ordering on \(u_k\)

You usually want: \[ u_0 \le u_1 \le \dots \le u_{K-1} \]

Otherwise the learned representatives can break the bins' semantic ordering. A standard way to enforce this is to parameterize the gaps \(u_{k+1} - u_k\) as strictly positive (e.g. via softplus) and recover \(u_k\) by cumulative summation.
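One common way to enforce this ordering is to learn a base value plus positive gaps and cumulatively sum them. A minimal PyTorch sketch (the class name echoes the codebase's `MonotoneBinValues`, but this body is illustrative, not the actual implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotoneBinValuesSketch(nn.Module):
    """Learnable bin representatives u_0 <= u_1 <= ... <= u_{K-1},
    parameterized as a base value plus a cumsum of softplus gaps."""

    def __init__(self, init_values: torch.Tensor):
        super().__init__()
        self.base = nn.Parameter(init_values[:1].clone())
        gaps = torch.clamp(init_values[1:] - init_values[:-1], min=1e-4)
        # Inverse softplus, so forward() initially reproduces init_values.
        self.raw_gaps = nn.Parameter(torch.log(torch.expm1(gaps)))

    def forward(self) -> torch.Tensor:
        gaps = F.softplus(self.raw_gaps)  # strictly positive gaps
        return torch.cat([self.base, self.base + torch.cumsum(gaps, dim=0)])
```

Because gradients flow through the gaps rather than the values directly, monotonicity holds at every optimization step without projections or clipping.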

6. Residual correction (controlled extra expressivity)

If bins are coarse and you want within-bin refinement, add a small residual head:

\[ \hat r = \sum_k \pi_k \, u_k + \delta(x) \]

Where:

  • \(\delta(x)\): tiny MLP / linear head
  • Regularize strongly: \[ \mathcal{L}_{\delta} = \beta |\delta(x)|^2 \]

6.1 Interpretation

  • CORAL handles rank / coarse structure
  • Residual handles local, continuous variation
  • Much safer than a full parallel regression head
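The residual head above can be sketched as a zero-initialized linear layer whose output and L2 penalty are returned together (names and structure are illustrative, not the codebase's):

```python
import torch
import torch.nn as nn

class ResidualCorrection(nn.Module):
    """Tiny linear head delta(x) added to the CORAL expectation;
    its L2 penalty keeps it from competing with the ordinal head."""

    def __init__(self, feat_dim: int, beta: float = 1.0):
        super().__init__()
        self.delta = nn.Linear(feat_dim, 1)
        self.beta = beta
        nn.init.zeros_(self.delta.weight)  # start as an exact no-op
        nn.init.zeros_(self.delta.bias)

    def forward(self, expectation: torch.Tensor, feats: torch.Tensor):
        d = self.delta(feats).squeeze(-1)
        penalty = self.beta * (d ** 2).mean()
        return expectation + d, penalty
```

Zero initialization means training starts from the pure CORAL expectation, and the residual only grows where the data demands it.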

7. Final full objective

Putting it all together:

\[ \mathcal{L} = \mathcal{L}_{\text{CORAL}} + \lambda \, \mathcal{L}_{\text{reg}}(\hat r, r) + \alpha \sum_{k=0}^{K-1} (u_k - \mu_k)^2 + \beta \, |\delta(x)|^2, \qquad \hat r = \sum_k \pi_k \, u_k + \delta(x) \]

(with \(\delta(x)=0\) if you skip the residual)


8. Conceptual summary (one sentence per component)

  • CORAL head learns ordinal uncertainty via cumulative probabilities.
  • Marginalization converts ordinal outputs into a proper distribution.
  • Expectation over bins yields the Bayes-optimal regression estimate.
  • Learnable \(u_k\) correct discretization bias while preserving structure.
  • Monotonicity preserves semantic ordering.
  • Residual adds controlled continuous flexibility.

8.1 Bottom line

This design gives you:

  • Ordinal correctness
  • Probabilistic consistency
  • Real-unit calibration
  • Minimal redundancy
  • Explicit inductive bias aligned with your problem

In short: this is a principled ordinal-to-regression bridge, not a hack.


9. Implementation details (NBV codebase)

9.1 Core utilities (oracle_rri/oracle_rri/rri_metrics/coral.py)

We implement monotone bin representatives in MonotoneBinValues and wire them into CoralLayer so the expected value can be computed directly from CORAL marginals:

from oracle_rri.rri_metrics.coral import CoralLayer, coral_logits_to_prob

logits = head_coral(feats)               # (..., K-1)
probs = coral_logits_to_prob(logits)     # (..., K)
pred_rri = head_coral.expected_from_probs(probs)

This uses the learned monotone values u_k, initialized from bin means or midpoints and optionally regularized with:

u_reg = head_coral.bin_value_regularizer(target_values)

9.2 Initialization hook (oracle_rri/oracle_rri/vin/model_v2.py)

VinModelV2 exposes a lightweight hook to initialize the CORAL bin values:

model = VinModelV2(...)
model.init_bin_values(bin_means)  # Tensor[K]

9.3 Fitting + training integration (oracle_rri/oracle_rri/lightning/lit_module.py)

After loading the fitted RriOrdinalBinner, the Lightning module calls _maybe_init_bin_values() to seed u_k from bin_means (or midpoints if means are unavailable). During training, the auxiliary regression loss uses these values to compute a calibrated scalar prediction:

probs = pred.prob.squeeze(0)
pred_rri = head_coral.expected_from_probs(probs)
aux_loss = smooth_l1(pred_rri, rri)

This keeps the model single‑head, ensures probabilistic correctness, and lets the bin representatives adapt while remaining monotone.


10. Training dynamics, monitoring, and two practical fixes

During offline training runs we repeatedly observed the same signature:

  • Aux regression loss (Huber/Smooth‑L1 on the CORAL-derived expected value) drops very quickly (often “exponentially” in the first epochs).
  • CORAL loss improves more slowly.
  • Early confusion matrices show a central-band pattern (many samples predicted into a narrow set of middle bins).

This is expected given the current implementation, and it leads directly to two high-impact practical fixes.

10.1 Why the auxiliary loss drops so fast

In oracle_rri/oracle_rri/lightning/lit_module.py, the combined objective is currently implemented as:

combined_loss = coral_loss_value + aux_loss

This implicitly sets the auxiliary weight to λ = 1, even though CORAL loss and auxiliary regression loss typically live on very different numeric scales:

  • CORAL loss with K=15 bins starts around ~7–10.
  • Huber on a normalized-ish scalar target can quickly drop to ~1e-2.

So the auxiliary term provides a clean early calibration gradient and then rapidly becomes negligible. The “central-band” confusion matrix is consistent with the model initially learning a conservative conditional-mean / median proxy before it has learned sharp ordinal separation.

10.2 Fix #1: add an explicit auxiliary weight λ

Make the combined loss explicit:

\[ \mathcal{L} = \mathcal{L}_{\text{CORAL}} + \lambda_{\text{aux}}\,\mathcal{L}_{\text{aux}}(\hat r, r) \]

Even if \(\mathcal{L}_{\text{aux}}\) becomes small, \(\lambda_{\text{aux}}\) controls how much the auxiliary objective shapes representation learning early on.

Rule of thumb from the observed magnitudes:

  • If CORAL ≈ 8 and aux ≈ 0.03, then values like \(\lambda_{\text{aux}}\in[5,50]\) make the aux signal meaningful without dominating.
  • Alternatively set \(\lambda_{\text{aux}}\approx 0.1\) if aux should be a light calibrator only.

Implementation-wise, this is best exposed as a config field (e.g. aux_regression_weight) and applied as:

combined_loss = coral_loss_value + aux_regression_weight * aux_loss

10.3 Fix #2: schedule LR on the hard metric, not the easy one

The default ReduceLrOnPlateauConfig monitors:

monitor: str = "train/loss"

But train/loss is the sum of CORAL and aux. Because aux improves quickly, the LR scheduler can keep seeing improvements even when CORAL plateaus.

For CORAL-driven learning, prefer monitoring the validation CORAL loss:

  • monitor = "val/coral_loss"

In TOML, that looks like:

[module_config.lr_scheduler]
monitor = "val/coral_loss"

10.4 Sanity checks for CORAL learning signals

  • Chance-level baseline: if CORAL loss is the sum over K-1 binary thresholds, a rough random baseline is \((K-1)\log 2\) (for K=15 this is \(\approx 14\log 2 \approx 9.70\)). Our code logs coral_loss_rel_random using coral_random_loss(K) so you can track progress relative to chance.
  • Central-band confusion matrices: often indicate “mean-optimal” predictions early, not necessarily a bug. They should widen over epochs as ordinal separation improves.
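The chance-level baseline is cheap to verify numerically. This is a sketch of what a helper like `coral_random_loss` would presumably compute (assumption; the codebase's version may differ):

```python
import math

def coral_random_loss(K: int) -> float:
    """Chance-level CORAL loss: K-1 independent binary thresholds,
    each contributing log(2) at p = 0.5."""
    return (K - 1) * math.log(2.0)

coral_random_loss(15)  # ~9.70 for K = 15, matching the observed starting loss
```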

10.5 Learnable bin representatives: initialization is not enough

We already support monotone learnable bin values u_k and initialization from the fitted binner. If you make u_k learnable, consider adding the regularizer term to the training objective explicitly:

\[ \mathcal{L}_{u} = \alpha \sum_{k=0}^{K-1} (u_k - \mu_k)^2 \]

This prevents the model from “cheating” by shifting bin values instead of learning the correct ordinal probabilities.
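The regularizer itself is a one-liner; this free-function sketch is illustrative (the codebase exposes it as `head_coral.bin_value_regularizer`, whose exact form may differ):

```python
import torch

def bin_value_regularizer(u: torch.Tensor, mu: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """alpha * sum_k (u_k - mu_k)^2: anchors learned bin values u_k
    to the empirical bin means mu_k from the fitted binner."""
    return alpha * ((u - mu) ** 2).sum()
```

Adding this term to the combined loss keeps the calibration burden on the probabilities \(\pi_k\) rather than on drifting bin values.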