Implementing CORAL for Ordinal Regression to Continuous Targets

1. Correcting the probabilistic mistake (core insight)

CORAL does not output class probabilities. It outputs cumulative (ordinal) probabilities:

\[ p_t = P(y > t), \quad t = 0,\dots,K-2 \]

These are threshold probabilities, not marginals. Treating them as \(P(y=t)\) leads to systematic overestimation because probability mass is counted multiple times.

1.1 Recovering marginal class probabilities

Define: \[ p_{-1}=1,\quad p_{K-1}=0 \]

Then: \[ \pi_k = P(y=k) = p_{k-1} - p_k,\quad k=0,\dots,K-1 \]

This converts cumulative → marginal, which is mandatory before computing expectations.
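As a concrete check, the cumulative-to-marginal conversion can be sketched in a few lines of NumPy (illustrative only; the codebase's `coral_logits_to_prob` is the authoritative implementation):

```python
import numpy as np

def cumulative_to_marginal(p_cum: np.ndarray) -> np.ndarray:
    """Convert CORAL cumulative probabilities p_t = P(y > t), t = 0..K-2,
    into marginal class probabilities pi_k = p_{k-1} - p_k."""
    # Pad with the boundary conventions p_{-1} = 1 and p_{K-1} = 0.
    padded = np.concatenate(([1.0], p_cum, [0.0]))
    return padded[:-1] - padded[1:]

# Example with K = 4 bins, i.e. K-1 = 3 cumulative probabilities.
p_cum = np.array([0.9, 0.5, 0.2])
pi = cumulative_to_marginal(p_cum)  # [0.1, 0.4, 0.3, 0.2], sums to 1
```

The marginals are guaranteed nonnegative only when the cumulative probabilities are nonincreasing, which CORAL's shared-weight construction enforces by design.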


2. Correct computation of the regression prediction

Let \(r\) be the original continuous regression target (e.g. RRI), discretized into \(K\) ordinal bins.

Let \(u_k\) be a representative real value for bin \(k\) (initially: empirical mean or bin center).

The Bayes-optimal MSE predictor is the conditional mean:

\[ \hat r = \mathbb{E}[r \mid x] \approx \sum_{k=0}^{K-1} \pi_k \, u_k \]

Under squared-error loss, this expectation over the marginals (not the cumulative probabilities) is the correct way to derive a scalar regression prediction from CORAL.
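A minimal sketch of the expectation, with illustrative marginals and bin representatives (the values below are made up for the example):

```python
import numpy as np

def expected_value(pi: np.ndarray, u: np.ndarray) -> float:
    """Conditional-mean prediction: r_hat = sum_k pi_k * u_k."""
    return float(np.dot(pi, u))

pi = np.array([0.1, 0.4, 0.3, 0.2])  # marginal class probabilities
u = np.array([0.5, 0.7, 0.9, 1.2])   # bin representatives (e.g. empirical bin means)
r_hat = expected_value(pi, u)        # 0.05 + 0.28 + 0.27 + 0.24 = 0.84
```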


3. Single ordinal head + auxiliary regression loss (preferred design)

Rather than adding a second regression head, we keep one CORAL head and derive \(\hat r\) from it.

3.1 Loss

\[ \mathcal{L} = \mathcal{L}_{\text{CORAL}} + \lambda \, \mathcal{L}_{\text{reg}}(\hat r, r) \]

  • \(\mathcal{L}_{\text{CORAL}}\): binary cross-entropy on thresholds
  • \(\mathcal{L}_{\text{reg}}\): Huber / Smooth-L1 (preferred over L2 for heavy-tailed RRI)
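A hedged PyTorch sketch of this combined loss. The function names and the sum-over-thresholds reduction are assumptions for illustration (the codebase's loss may differ); with this reduction, chance-level logits give a CORAL term near \((K-1)\log 2\):

```python
import torch
import torch.nn.functional as F

def coral_loss(logits: torch.Tensor, y_bin: torch.Tensor, K: int) -> torch.Tensor:
    """CORAL loss: BCE over the K-1 thresholds with extended binary
    labels 1[y > t], summed over thresholds, averaged over the batch."""
    thresholds = torch.arange(K - 1, device=y_bin.device)
    levels = (y_bin.unsqueeze(-1) > thresholds).float()          # (B, K-1)
    per_elem = F.binary_cross_entropy_with_logits(logits, levels, reduction="none")
    return per_elem.sum(dim=-1).mean()

def combined_loss(logits, y_bin, r_hat, r, K, lam=1.0):
    """L = L_CORAL + lambda * Huber(r_hat, r)."""
    return coral_loss(logits, y_bin, K) + lam * F.smooth_l1_loss(r_hat, r)
```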

3.2 Intuition

  • Enforces distributional correctness (ordinal structure)
  • Enforces calibration in real units
  • Avoids two heads learning the same quantity inconsistently

4. Learnable bin representatives \(u_k\)

Instead of fixing \(u_k=\mu_k\) (empirical mean), make them learnable, initialized from \(\mu_k\).

\[ \hat r = \sum_k \pi_k \, u_k \]

4.1 Why this helps

  • Corrects discretization bias
  • Adapts to distribution shift
  • Improves calibration without adding a full regression head

4.2 Regularization (important)

Prevent \(u_k\) from drifting and “absorbing” all error:

\[ \mathcal{L}_{u} = \alpha \sum_{k=0}^{K-1} (u_k - \mu_k)^2 \]

This keeps them tethered to empirical reality.


5. Enforcing ordinal ordering on \(u_k\)

You usually want: \[ u_0 \le u_1 \le \dots \le u_{K-1} \]

Otherwise the learned representatives can break the bins' semantic ordering. A standard way to enforce this is to parameterize the gaps \(u_{k+1} - u_k\) as strictly positive (e.g. via softplus) and recover \(u_k\) by cumulative summation.
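One common way to enforce this ordering is to learn a base value plus positive gaps and cumulatively sum them. A minimal PyTorch sketch (the class name echoes the codebase's `MonotoneBinValues`, but this body is illustrative, not the actual implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotoneBinValuesSketch(nn.Module):
    """Learnable bin representatives u_0 <= u_1 <= ... <= u_{K-1},
    parameterized as a base value plus a cumsum of softplus gaps."""

    def __init__(self, init_values: torch.Tensor):
        super().__init__()
        self.base = nn.Parameter(init_values[:1].clone())
        gaps = torch.clamp(init_values[1:] - init_values[:-1], min=1e-4)
        # Inverse softplus, so forward() initially reproduces init_values.
        self.raw_gaps = nn.Parameter(torch.log(torch.expm1(gaps)))

    def forward(self) -> torch.Tensor:
        gaps = F.softplus(self.raw_gaps)  # strictly positive gaps
        return torch.cat([self.base, self.base + torch.cumsum(gaps, dim=0)])
```

Because gradients flow through the gaps rather than the values directly, monotonicity holds at every optimization step without projections or clipping.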

6. Residual correction (controlled extra expressivity)

If bins are coarse and you want within-bin refinement, add a small residual head:

\[ \hat r = \sum_k \pi_k \, u_k + \delta(x) \]

Where:

  • \(\delta(x)\): tiny MLP / linear head
  • Regularize strongly: \[ \mathcal{L}_{\delta} = \beta |\delta(x)|^2 \]

6.1 Interpretation

  • CORAL handles rank / coarse structure
  • Residual handles local, continuous variation
  • Much safer than a full parallel regression head
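The residual head above can be sketched as a zero-initialized linear layer whose output and L2 penalty are returned together (names and structure are illustrative, not the codebase's):

```python
import torch
import torch.nn as nn

class ResidualCorrection(nn.Module):
    """Tiny linear head delta(x) added to the CORAL expectation;
    its L2 penalty keeps it from competing with the ordinal head."""

    def __init__(self, feat_dim: int, beta: float = 1.0):
        super().__init__()
        self.delta = nn.Linear(feat_dim, 1)
        self.beta = beta
        nn.init.zeros_(self.delta.weight)  # start as an exact no-op
        nn.init.zeros_(self.delta.bias)

    def forward(self, expectation: torch.Tensor, feats: torch.Tensor):
        d = self.delta(feats).squeeze(-1)
        penalty = self.beta * (d ** 2).mean()
        return expectation + d, penalty
```

Zero initialization means training starts from the pure CORAL expectation, and the residual only grows where the data demands it.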

7. Final full objective

Putting it all together:

\[ \mathcal{L} = \mathcal{L}_{\text{CORAL}} + \lambda \, \mathcal{L}_{\text{reg}}(\hat r, r) + \alpha \sum_{k=0}^{K-1} (u_k - \mu_k)^2 + \beta \, |\delta(x)|^2, \qquad \hat r = \sum_k \pi_k \, u_k + \delta(x) \]

(with \(\delta(x)=0\) if you skip the residual)


8. Conceptual summary (one sentence per component)

  • CORAL head learns ordinal uncertainty via cumulative probabilities.
  • Marginalization converts ordinal outputs into a proper distribution.
  • Expectation over bins yields the Bayes-optimal regression estimate.
  • Learnable \(u_k\) correct discretization bias while preserving structure.
  • Monotonicity preserves semantic ordering.
  • Residual adds controlled continuous flexibility.

8.1 Bottom line

This design gives you:

  • Ordinal correctness
  • Probabilistic consistency
  • Real-unit calibration
  • Minimal redundancy
  • Explicit inductive bias aligned with your problem

In short: this is a principled ordinal-to-regression bridge, not a hack.


9. Implementation details (NBV codebase)

9.1 Core utilities (oracle_rri/oracle_rri/rri_metrics/coral.py)

We implement monotone bin representatives in MonotoneBinValues and wire them into CoralLayer so the expected value can be computed directly from CORAL marginals:

from oracle_rri.rri_metrics.coral import CoralLayer, coral_logits_to_prob

logits = head_coral(feats)               # (..., K-1)
probs = coral_logits_to_prob(logits)     # (..., K)
pred_rri = head_coral.expected_from_probs(probs)

This uses the learned monotone values u_k, initialized from bin means or midpoints and optionally regularized with:

u_reg = head_coral.bin_value_regularizer(target_values)

9.2 Initialization hook (oracle_rri/oracle_rri/vin/model_v2.py)

VinModelV2 exposes a lightweight hook to initialize the CORAL bin values:

model = VinModelV2(...)
model.init_bin_values(bin_means)  # Tensor[K]

9.3 Fitting + training integration (oracle_rri/oracle_rri/lightning/lit_module.py)

After loading the fitted RriOrdinalBinner, the Lightning module calls _maybe_init_bin_values() to seed u_k from bin_means (or midpoints if means are unavailable). During training, the auxiliary regression loss uses these values to compute a calibrated scalar prediction:

probs = pred.prob.squeeze(0)
pred_rri = head_coral.expected_from_probs(probs)
aux_loss = smooth_l1(pred_rri, rri)

This keeps the model single‑head, ensures probabilistic correctness, and lets the bin representatives adapt while remaining monotone.


10. Training dynamics, monitoring, and two practical fixes

During offline training runs we repeatedly observed the same signature:

  • Aux regression loss (Huber/Smooth‑L1 on the CORAL-derived expected value) drops very quickly (often “exponentially” in the first epochs).
  • CORAL loss improves more slowly.
  • Early confusion matrices show a central-band pattern (many samples predicted into a narrow set of middle bins).

This is expected given the current implementation, and it leads directly to two high-impact practical fixes.

10.1 Why the auxiliary loss drops so fast

In oracle_rri/oracle_rri/lightning/lit_module.py, the combined objective is currently implemented as:

combined_loss = coral_loss_value + aux_loss

This implicitly sets the auxiliary weight to λ = 1, even though CORAL loss and auxiliary regression loss typically live on very different numeric scales:

  • CORAL loss with K=15 bins starts around ~7–10.
  • Huber on a normalized-ish scalar target can quickly drop to ~1e-2.

So the auxiliary term provides a clean early calibration gradient and then rapidly becomes negligible. The “central-band” confusion matrix is consistent with the model initially learning a conservative conditional-mean / median proxy before it has learned sharp ordinal separation.

10.2 Fix #1: add an explicit auxiliary weight λ

Make the combined loss explicit:

\[ \mathcal{L} = \mathcal{L}_{\text{CORAL}} + \lambda_{\text{aux}}\,\mathcal{L}_{\text{aux}}(\hat r, r) \]

Even if \(\mathcal{L}_{\text{aux}}\) becomes small, \(\lambda_{\text{aux}}\) controls how much the auxiliary objective shapes representation learning early on.

Rule of thumb from the observed magnitudes:

  • If CORAL ≈ 8 and aux ≈ 0.03, then values like \(\lambda_{\text{aux}}\in[5,50]\) make the aux signal meaningful without dominating.
  • Alternatively set \(\lambda_{\text{aux}}\approx 0.1\) if aux should be a light calibrator only.

Implementation-wise, this is best exposed as a config field (e.g. aux_regression_weight) and applied as:

combined_loss = coral_loss_value + aux_regression_weight * aux_loss

10.3 Fix #2: schedule LR on the hard metric, not the easy one

The default ReduceLrOnPlateauConfig monitors:

monitor: str = "train/loss"

But train/loss is the sum of CORAL and aux. Because aux improves quickly, the LR scheduler can keep seeing improvements even when CORAL plateaus.

For CORAL-driven learning, prefer monitoring the validation CORAL loss:

  • monitor = "val/coral_loss"

In TOML, that looks like:

[module_config.lr_scheduler]
monitor = "val/coral_loss"

10.4 Sanity checks for CORAL learning signals

  • Chance-level baseline: if CORAL loss is the sum over K-1 binary thresholds, a rough random baseline is \((K-1)\log 2\) (for K=15 this is \(\approx 14\log 2 \approx 9.70\)). Our code logs coral_loss_rel_random using coral_random_loss(K) so you can track progress relative to chance.
  • Central-band confusion matrices: often indicate “mean-optimal” predictions early, not necessarily a bug. They should widen over epochs as ordinal separation improves.
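The chance-level baseline is cheap to verify numerically. This is a sketch of what a helper like `coral_random_loss` would presumably compute (assumption; the codebase's version may differ):

```python
import math

def coral_random_loss(K: int) -> float:
    """Chance-level CORAL loss: K-1 independent binary thresholds,
    each contributing log(2) at p = 0.5."""
    return (K - 1) * math.log(2.0)

coral_random_loss(15)  # ~9.70 for K = 15, matching the observed starting loss
```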

10.5 Learnable bin representatives: initialization is not enough

We already support monotone learnable bin values u_k and initialization from the fitted binner. If you make u_k learnable, consider adding the regularizer term to the training objective explicitly:

\[ \mathcal{L}_{u} = \alpha \sum_{k=0}^{K-1} (u_k - \mu_k)^2 \]

This prevents the model from “cheating” by shifting bin values instead of learning the correct ordinal probabilities.
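The regularizer itself is a one-liner; this free-function sketch is illustrative (the codebase exposes it as `head_coral.bin_value_regularizer`, whose exact form may differ):

```python
import torch

def bin_value_regularizer(u: torch.Tensor, mu: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """alpha * sum_k (u_k - mu_k)^2: anchors learned bin values u_k
    to the empirical bin means mu_k from the fitted binner."""
    return alpha * ((u - mu) ** 2).sum()
```

Adding this term to the combined loss keeps the calibration burden on the probabilities \(\pi_k\) rather than on drifting bin values.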