Implementing CORAL for Ordinal Regression to Continuous Targets
1. Correcting the probabilistic mistake (core insight)
CORAL does not output class probabilities. It outputs cumulative (ordinal) probabilities:
\[ p_t \;=\; P(y > t), \quad t = 0,\dots,K-2 \]
These are threshold probabilities, not marginals. Treating them as \(P(y=t)\) leads to systematic overestimation because probability mass is counted multiple times.
1.1 Recovering marginal class probabilities
Define: \[ p_{-1}=1,\quad p_{K-1}=0 \]
Then: \[ \pi_k \;=\; P(y=k) \;=\; p_{k-1} - p_k,\quad k=0,\dots,K-1 \]
This converts cumulative → marginal, which is mandatory before computing expectations.
2. Correct computation of the regression prediction
Let \(r\) be the original continuous regression target (e.g. RRI), discretized into \(K\) ordinal bins.
Let \(u_k\) be a representative real value for bin \(k\) (initially: empirical mean or bin center).
The Bayes-optimal MSE predictor is the conditional mean:
\[ \hat r \;=\; \mathbb{E}[r \mid x] \;\approx\; \sum_{k=0}^{K-1} \pi_k \, u_k \]
This is the only correct way to derive a scalar regression prediction from CORAL.
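For reference, a minimal PyTorch sketch of the full chain logits → cumulative probabilities → marginals → expected value; the shapes, placeholder bin values, and helper names here are illustrative and are not the codebase's coral_logits_to_prob / expected_from_probs implementations.

```python
import torch

def cumulative_to_marginal(cum_probs: torch.Tensor) -> torch.Tensor:
    """pi_k = p_{k-1} - p_k with p_{-1} = 1 and p_{K-1} = 0; input is P(y > t), t = 0..K-2."""
    ones = torch.ones_like(cum_probs[..., :1])
    zeros = torch.zeros_like(cum_probs[..., :1])
    padded = torch.cat([ones, cum_probs, zeros], dim=-1)  # (..., K+1)
    return padded[..., :-1] - padded[..., 1:]              # (..., K), sums to 1 along the last dim

def expected_value(marginals: torch.Tensor, bin_values: torch.Tensor) -> torch.Tensor:
    """Conditional-mean prediction: sum_k pi_k * u_k."""
    return (marginals * bin_values).sum(dim=-1)

# Example with K = 15 bins: CORAL emits K-1 threshold logits per sample.
logits = torch.randn(4, 14)              # random, for shape illustration only
cum_probs = torch.sigmoid(logits)        # P(y > t); a trained CORAL head keeps these non-increasing in t
pi = cumulative_to_marginal(cum_probs)   # (4, 15); entries can dip below 0 if cum_probs is not monotone
u = torch.linspace(0.4, 2.0, 15)         # placeholder bin representatives (e.g. bin means)
r_hat = expected_value(pi, u)            # (4,)
```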
3. Single ordinal head + auxiliary regression loss (preferred design)
Rather than adding a second regression head, we keep one CORAL head and derive \(\hat r\) from it.
3.1 Loss
\[ \mathcal{L} \;=\; \mathcal{L}_{\text{CORAL}} \;+\; \lambda \, \mathcal{L}_{\text{reg}}(\hat r, r) \]
- \(\mathcal{L}_{\text{CORAL}}\): binary cross-entropy on thresholds
- \(\mathcal{L}_{\text{reg}}\): Huber / Smooth-L1 (preferred over L2 for heavy-tailed RRI)
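As a concrete sketch of this objective (not the actual training code): the CORAL term is written out as binary cross-entropy over the K−1 threshold indicators, summed over thresholds and averaged over the batch, and the regression term is Smooth-L1 on the expected value. The importance weights from the original CORAL paper are omitted, and all names below are illustrative.

```python
import torch
import torch.nn.functional as F

def threshold_targets(y_bin: torch.Tensor, num_bins: int) -> torch.Tensor:
    """Binary indicators 1[y > t] for t = 0..K-2, shape (N, K-1)."""
    thresholds = torch.arange(num_bins - 1, device=y_bin.device)
    return (y_bin.unsqueeze(-1) > thresholds).float()

def combined_loss(logits, y_bin, r_hat, r, num_bins, lam=1.0):
    """L = L_CORAL + lambda * L_reg(r_hat, r); logits are the (N, K-1) threshold logits."""
    levels = threshold_targets(y_bin, num_bins)
    coral = F.binary_cross_entropy_with_logits(logits, levels, reduction="none").sum(-1).mean()
    huber = F.smooth_l1_loss(r_hat, r)
    return coral + lam * huber
```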
3.2 Intuition
- Enforces distributional correctness (ordinal structure)
- Enforces calibration in real units
- Avoids two heads learning the same quantity inconsistently
4. Learnable bin representatives \(u_k\)
Instead of fixing \(u_k=\mu_k\) (empirical mean), make them learnable, initialized from \(\mu_k\).
\[ \hat r = \sum_k \pi_k \, u_k \]
4.1 Why this helps
- Corrects discretization bias
- Adapts to distribution shift
- Improves calibration without adding a full regression head
4.2 Regularization (important)
Prevent \(u_k\) from drifting and “absorbing” all error:
$$ \mathcal{L}_{u} \;=\; \alpha \sum_{k=0}^{K-1} (u_k - \mu_k)^2 $$
This keeps them tethered to empirical reality.
5. Enforcing ordinal ordering on \(u_k\)
You usually want: \[ u_0 \le u_1 \le \dots \le u_{K-1} \]
5.1 Monotone parameterization (recommended)
Let: \[ u_0 \in \mathbb{R},\qquad u_k = u_0 + \sum_{j=1}^{k} \text{softplus}(\delta_j) \]
- Guarantees monotonicity
- Differentiable
- Stable under optimization
This aligns ordinal meaning ↔︎ numeric meaning.
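A minimal sketch of such a parameterization, including the tether regularizer from section 4.2; this is an illustration of the formulas above, not the codebase's MonotoneBinValues module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotoneBinValuesSketch(nn.Module):
    """u_0 free, u_k = u_0 + cumsum(softplus(delta_j)), so u_0 <= u_1 <= ... <= u_{K-1}."""

    def __init__(self, init_values: torch.Tensor):
        super().__init__()
        init_values = init_values.float()
        self.u0 = nn.Parameter(init_values[:1].clone())
        # Invert softplus on the initial gaps so forward() reproduces init_values at the start.
        gaps = (init_values[1:] - init_values[:-1]).clamp_min(1e-4)
        self.raw_deltas = nn.Parameter(torch.log(torch.expm1(gaps)))
        self.register_buffer("init_values", init_values.clone())

    def forward(self) -> torch.Tensor:
        increments = F.softplus(self.raw_deltas)                 # strictly positive gaps
        return torch.cat([self.u0, self.u0 + increments.cumsum(dim=0)])

    def regularizer(self, alpha: float = 1e-2) -> torch.Tensor:
        """alpha * sum_k (u_k - mu_k)^2, keeping u_k tethered to the empirical values."""
        return alpha * (self() - self.init_values).pow(2).sum()
```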
6. Residual correction (controlled extra expressivity)
If bins are coarse and you want within-bin refinement, add a small residual head:
$$ \hat r \;=\; \sum_k \pi_k \, u_k \;+\; \delta(x) $$
Where:
- \(\delta(x)\): tiny MLP / linear head
- Regularize strongly: \[ \mathcal{L}_{\delta} = \beta |\delta(x)|^2 \]
6.1 Interpretation
- CORAL handles rank / coarse structure
- Residual handles local, continuous variation
- Much safer than a full parallel regression head
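A minimal sketch of wiring in such a residual, assuming the same features feed both the CORAL head and a small zero-initialized linear head; the names, feature dimension, and beta value are illustrative.

```python
import torch
import torch.nn as nn

def predict_with_residual(marginals, bin_values, feats, residual_head, beta=1e-2):
    """r_hat = sum_k pi_k u_k + delta(x), plus the penalty beta * |delta(x)|^2."""
    delta = residual_head(feats).squeeze(-1)             # delta(x): tiny linear / MLP head
    r_hat = (marginals * bin_values).sum(dim=-1) + delta
    penalty = beta * delta.pow(2).mean()                 # strong regularization keeps delta small
    return r_hat, penalty

# Zero-initializing the residual head makes training start exactly at delta(x) = 0.
residual_head = nn.Linear(64, 1)                         # 64 = assumed feature dimension
nn.init.zeros_(residual_head.weight)
nn.init.zeros_(residual_head.bias)
```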
7. Final full objective
Putting it all together:
$$ \mathcal{L} \;=\; \mathcal{L}_{\text{CORAL}} \;+\; \lambda \, \mathcal{L}_{\text{reg}}(\hat r, r) \;+\; \alpha \sum_{k=0}^{K-1} (u_k - \mu_k)^2 \;+\; \beta \, |\delta(x)|^2, \qquad \hat r \;=\; \sum_k \pi_k \, u_k + \delta(x) $$
(with \(\delta(x)=0\) if you skip the residual)
8. Conceptual summary (one sentence per component)
- CORAL head learns ordinal uncertainty via cumulative probabilities.
- Marginalization converts ordinal outputs into a proper distribution.
- Expectation over bins yields the Bayes-optimal regression estimate.
- Learnable \(u_k\) correct discretization bias while preserving structure.
- Monotonicity preserves semantic ordering.
- Residual adds controlled continuous flexibility.
8.1 Bottom line
This design gives you:
- Ordinal correctness
- Probabilistic consistency
- Real-unit calibration
- Minimal redundancy
- Explicit inductive bias aligned with your problem
In short: this is a principled ordinal-to-regression bridge, not a hack.
9. Implementation details (NBV codebase)
9.1 Core utilities (oracle_rri/oracle_rri/rri_metrics/coral.py)
We implement monotone bin representatives in MonotoneBinValues and wire them into CoralLayer so the expected value can be computed directly from CORAL marginals:
```python
from oracle_rri.rri_metrics.coral import CoralLayer, coral_logits_to_prob

logits = head_coral(feats)                        # (..., K-1)
probs = coral_logits_to_prob(logits)              # (..., K)
pred_rri = head_coral.expected_from_probs(probs)
```
This uses the learned monotone values u_k, initialized from bin means or midpoints and optionally regularized with:
```python
u_reg = head_coral.bin_value_regularizer(target_values)
```
9.2 Initialization hook (oracle_rri/oracle_rri/vin/model_v2.py)
VinModelV2 exposes a lightweight hook to initialize the CORAL bin values:
```python
model = VinModelV2(...)
model.init_bin_values(bin_means)  # Tensor[K]
```
9.3 Fitting + training integration (oracle_rri/oracle_rri/lightning/lit_module.py)
After loading the fitted RriOrdinalBinner, the Lightning module calls _maybe_init_bin_values() to seed u_k from bin_means (or midpoints if means are unavailable). During training, the auxiliary regression loss uses these values to compute a calibrated scalar prediction:
```python
probs = pred.prob.squeeze(0)
pred_rri = head_coral.expected_from_probs(probs)
aux_loss = smooth_l1(pred_rri, rri)
```
This keeps the model single‑head, ensures probabilistic correctness, and lets the bin representatives adapt while remaining monotone.
10. Training dynamics, monitoring, and two practical fixes
During offline training runs we repeatedly observed the same signature:
- Aux regression loss (Huber/Smooth‑L1 on the CORAL-derived expected value) drops very quickly (often “exponentially” in the first epochs).
- CORAL loss improves more slowly.
- Early confusion matrices show a central-band pattern (many samples predicted into a narrow set of middle bins).
This is expected given the current implementation, and it leads directly to two high-impact practical fixes.
10.1 Why the auxiliary loss drops so fast
In oracle_rri/oracle_rri/lightning/lit_module.py, the combined objective is currently implemented as:
```python
combined_loss = coral_loss_value + aux_loss
```
This implicitly sets the auxiliary weight to λ = 1, even though CORAL loss and auxiliary regression loss typically live on very different numeric scales:
- CORAL loss with `K=15` bins starts around ~7–10.
- Huber on a normalized-ish scalar target can quickly drop to ~1e-2.
So the auxiliary term provides a clean early calibration gradient and then rapidly becomes negligible. The “central-band” confusion matrix is consistent with the model initially learning a conservative conditional-mean / median proxy before it has learned sharp ordinal separation.
10.2 Fix #1: add an explicit auxiliary weight λ
Make the combined loss explicit:
\[ \mathcal{L} = \mathcal{L}_{\text{CORAL}} + \lambda_{\text{aux}}\,\mathcal{L}_{\text{aux}}(\hat r, r) \]
Even if \(\mathcal{L}_{\text{aux}}\) becomes small, \(\lambda_{\text{aux}}\) controls how much the auxiliary objective shapes representation learning early on.
Rule of thumb from the observed magnitudes:
- If CORAL ≈ 8 and aux ≈ 0.03, then values like \(\lambda_{\text{aux}}\in[5,50]\) make the aux signal meaningful without dominating.
- Alternatively set \(\lambda_{\text{aux}}\approx 0.1\) if aux should be a light calibrator only.
Implementation-wise, this is best exposed as a config field (e.g. aux_regression_weight) and applied as:
```python
combined_loss = coral_loss_value + aux_regression_weight * aux_loss
```
10.2.1 Exponential decay schedule (recommended)
Given the training dynamics we observed, make \(\lambda_{\text{aux}}\) decay exponentially over time while keeping a small nonzero floor:
\[ \lambda_{\text{aux}}(t) = \max(\lambda_{\min}, \lambda_0 \cdot \gamma^{t}) \]
where \(t\) is either epoch or global step, and \(\gamma \in (0, 1]\) controls the decay speed. This preserves the early calibration signal but gradually shifts focus to ordinal separation.
Theory intuition. The combined objective is a weighted sum of two signals with very different curvature scales. Early in training, the aux term provides a strong gradient toward a reasonable conditional mean, but once it enters the quadratic (Huber) regime it quickly becomes small. An exponential schedule implements a smooth continuation from a calibrated-regression objective to an ordinal-ranking objective, without abruptly removing the auxiliary signal. The nonzero floor prevents drift of learnable bin values and keeps mild calibration pressure late in training.
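A minimal sketch of this schedule, assuming the weight is recomputed once per epoch (or step); the config keys shown next would feed \(\lambda_0\), \(\gamma\), and \(\lambda_{\min}\), and the function name here is illustrative.

```python
def aux_weight(t: int, lam0: float = 10.0, gamma: float = 0.98, lam_min: float = 0.5) -> float:
    """lambda_aux(t) = max(lambda_min, lambda_0 * gamma**t), with t counted in epochs or steps."""
    return max(lam_min, lam0 * gamma ** t)

# With lam0=10, gamma=0.98 per epoch: ~10.0 at epoch 0, ~3.6 at epoch 50, floor of 0.5 near epoch ~150.
# Inside the training step (illustrative): combined_loss = coral_loss_value + aux_weight(epoch) * aux_loss
```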
In code/config:
```toml
[module_config]
aux_regression_weight = 10.0
aux_regression_weight_gamma = 0.98
aux_regression_weight_min = 0.5
aux_regression_weight_interval = "epoch"  # or "step"
```
10.3 Fix #2: schedule LR on the hard metric, not the easy one
The default ReduceLrOnPlateauConfig monitors:
```python
monitor: str = "train/loss"
```
But train/loss is the sum of CORAL and aux. Because aux improves quickly, the LR scheduler can keep seeing improvements even when CORAL plateaus.
For CORAL-driven learning, prefer monitoring the validation CORAL loss:
monitor = "val/coral_loss"
In TOML, that looks like:
```toml
[module_config.lr_scheduler]
monitor = "val/coral_loss"
```
10.4 Sanity checks for CORAL learning signals
- Chance-level baseline: if CORAL loss is the sum over `K-1` binary thresholds, a rough random baseline is \((K-1)\log 2\) (for `K=15` this is \(\approx 14\log 2 \approx 9.70\)). Our code logs `coral_loss_rel_random` using `coral_random_loss(K)` so you can track progress relative to chance.
- Central-band confusion matrices: often indicate “mean-optimal” predictions early, not necessarily a bug. They should widen over epochs as ordinal separation improves.
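For reference, a plausible form of that chance-level helper, assuming it simply returns the sum of K−1 chance-level binary cross-entropies; the actual coral_random_loss implementation may differ.

```python
import math

def coral_random_loss_sketch(num_bins: int) -> float:
    """Chance-level CORAL loss: (K - 1) * log(2), i.e. every threshold probability stuck at 0.5."""
    return (num_bins - 1) * math.log(2.0)

# coral_random_loss_sketch(15) -> about 9.70, the reference value quoted above.
```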
10.5 Learnable bin representatives: initialization is not enough
We already support monotone learnable bin values u_k and initialization from the fitted binner. If you make u_k learnable, consider adding the regularizer term to the training objective explicitly:
\[ \mathcal{L}_{u} = \alpha \sum_{k=0}^{K-1} (u_k - \mu_k)^2 \]
This prevents the model from “cheating” by shifting bin values instead of learning the correct ordinal probabilities.
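A minimal sketch of adding that term to the combined objective, reusing the bin_value_regularizer hook from section 9.1; the alpha weight and where target_values comes from are assumptions.

```python
# Sketch: tether the learnable bin values to the fitted binner's means (alpha is a new config knob).
u_reg = head_coral.bin_value_regularizer(target_values)     # ~ sum_k (u_k - mu_k)^2
combined_loss = coral_loss_value + aux_regression_weight * aux_loss + alpha * u_reg
```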