# Specification (v7.1)

This document is the authoritative specification for tacet, a Bayesian timing side-channel detection system. It defines the statistical methodology, abstract types, and requirements that implementations MUST follow to be conformant.

For implementation details (algorithms, numerical procedures), see the [Implementation Guide](https://tacet.sh/reference/implementation-guide). For language-specific APIs, see the [Rust API](https://tacet.sh/api/rust), [C API](https://tacet.sh/api/c), or [Go API](https://tacet.sh/api/go). For interpreting results, see [Interpreting Results](https://tacet.sh/core-concepts/interpreting-results) and [Attacker Models](https://tacet.sh/core-concepts/attacker-models).

***

## Terminology (RFC 2119)

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://www.rfc-editor.org/rfc/rfc2119).

In summary:
- **MUST** / **REQUIRED** / **SHALL**: Absolute requirement
- **MUST NOT** / **SHALL NOT**: Absolute prohibition
- **SHOULD** / **RECOMMENDED**: Strong recommendation (valid reasons to deviate may exist)
- **SHOULD NOT**: Strong discouragement (valid reasons to deviate may exist)
- **MAY** / **OPTIONAL**: Truly optional

***

## 1. Overview

### 1.1 Problem Statement

Timing side-channel attacks exploit data-dependent execution time in cryptographic implementations. Existing detection tools have significant limitations:

- **T-test approaches** (DudeCT) compare means, missing distributional differences such as cache effects that only affect upper quantiles
- **P-value misinterpretation**: Statistical significance does not equal practical significance; with enough samples, negligible effects become "significant"
- **CI flakiness**: Fixed sample sizes cause tests to pass locally but fail in CI (or vice versa) due to environmental noise
- **Binary output**: No distinction between "no leak detected" and "couldn't measure reliably"

### 1.2 Solution

tacet addresses these issues with:

1. **Wasserstein-1 distance**: A single metric capturing both uniform shifts and tail effects, measuring the minimum cost to transform one distribution into another
2. **Adaptive Bayesian inference**: Collect samples until confident, with natural early stopping
3. **Three-way decisions**: Pass / Fail / Inconclusive, distinguishing "safe" from "unmeasurable"
4. **Interpretable output**: Posterior probability (0-100%) instead of p-values
5. **Fail-safe design**: Prefer Inconclusive over confidently wrong

### 1.3 Design Goals

- **Interpretable**: Output is a probability, not a t-statistic
- **Adaptive**: Collects more samples when uncertain, stops early when confident
- **CI-friendly**: Three-way output prevents flaky tests
- **Portable**: Handles different timer resolutions via adaptive batching
- **Honest**: Never silently clamps thresholds; reports what it can actually resolve
- **Fail-safe**: CI verdicts SHOULD almost never be confidently wrong
- **Reproducible**: Deterministic results given identical samples and configuration

***

## 2. Abstract Types and Semantics

This section defines the types that all implementations MUST provide. Types are specified using language-agnostic pseudocode.

### 2.1 Outcome

The primary result type returned by the oracle:

```
Outcome =
  | Pass {
      leak_probability: Float,        // P(m(δ) > θ_eff | Δ), always at θ_eff
      effect: EffectEstimate,
      theta_user: Float,              // User-requested threshold (ns)
      theta_eff: Float,               // Effective threshold used (ns)
      theta_floor: Float,             // Measurement floor at decision time (ns)
      decision_threshold_ns: Float,   // θ_eff at which decision was made
      samples_used: Int,
      quality: MeasurementQuality,
      diagnostics: Diagnostics
    }
  | Fail {
      leak_probability: Float,        // P(m(δ) > θ_eff | Δ)
      effect: EffectEstimate,
      theta_user: Float,
      theta_eff: Float,               // MAY exceed θ_user (leak detected above floor)
      theta_floor: Float,
      decision_threshold_ns: Float,   // θ_eff at which decision was made
      samples_used: Int,
      quality: MeasurementQuality,
      diagnostics: Diagnostics
    }
  | Inconclusive {
      reason: InconclusiveReason,
      leak_probability: Float,        // P(m(δ) > θ_eff | Δ)
      effect: EffectEstimate,
      theta_user: Float,
      theta_eff: Float,
      theta_floor: Float,
      samples_used: Int,
      quality: MeasurementQuality,
      diagnostics: Diagnostics
    }
  | Unmeasurable {
      operation_ns: Float,            // Estimated operation time
      threshold_ns: Float,            // Timer resolution
      platform: String,
      recommendation: String
    }
```

**Semantics:**

The field `leak_probability` MUST be computed as $P(m(\delta) > \theta_{\text{eff}} \mid \Delta)$, where $\theta_{\text{eff}}$ is the effective threshold used for inference at decision time.

When $\theta_{\text{eff}}$ > $\theta_{\text{user}}$, the oracle cannot support a **Pass claim at $\theta_{\text{user}}$**, because effects in the range ($\theta_{\text{user}}$, $\theta_{\text{eff}}$] are not distinguishable from noise under the measured conditions.

Implementations MUST NOT substitute $\theta_{\text{user}}$ into `leak_probability` when $\theta_{\text{eff}}$ > $\theta_{\text{user}}$.

- **Pass**: MUST be returned when ALL of the following hold:
  1. `leak_probability < pass_threshold` (default 0.05)
  2. $\theta_{\text{eff}} \le \theta_{\text{user}} + \varepsilon_{\theta}$ where $\varepsilon_{\theta} = \max(\theta_{\text{tick}}, 10^{-6} \cdot \theta_{\text{user}})$
  3. All verdict-blocking quality gates pass
  4. $\theta_{\text{user}}$ > 0

- **Fail**: MUST be returned when ALL of the following hold:
  1. `leak_probability > fail_threshold` (default 0.95)
  2. All verdict-blocking quality gates pass

  Note: Fail MAY be returned when $\theta_{\text{eff}}$ > $\theta_{\text{user}}$. Detecting $m(\delta)$ > $\theta_{\text{eff}}$ implies $m(\delta)$ > $\theta_{\text{user}}$ (since $\theta_{\text{eff}}$ ≥ $\theta_{\text{user}}$ by construction).

- **Inconclusive**: MUST be returned when ANY of the following hold:
  1. A verdict-blocking quality gate fails
  2. Resource budgets exhausted without reaching a decision threshold
  3. `leak_probability < pass_threshold` but $\theta_{\text{eff}} > \theta_{\text{user}} + \varepsilon_{\theta}$ (threshold elevated)

- **Unmeasurable**: MUST be returned when the operation is too fast to measure reliably (see §4.5)

**Exploratory mode (θ_user = 0):** When $\theta_{\text{user}}$ = 0, Pass/Fail semantics do not apply. The oracle MUST return Inconclusive with the posterior estimates, allowing users to interpret the results themselves. This mode is useful for profiling and debugging but is not suitable for CI gating.

### 2.2 AttackerModel

Threat model presets defining the minimum effect size worth detecting:

```
AttackerModel =
  | SharedHardware      // θ = 0.4 ns (~2 cycles @ 5GHz)
  | PostQuantumSentinel // θ = 2.0 ns (~10 cycles @ 5GHz)
  | AdjacentNetwork     // θ = 100 ns
  | RemoteNetwork       // θ = 50,000 ns (50 μs)
  | Custom { threshold_ns: Float }
```

| Model | Threshold | Use Case |
|-------|-----------|----------|
| SharedHardware | 0.4 ns | SGX enclaves, cross-VM, containers, hyperthreading |
| PostQuantumSentinel | 2.0 ns | Post-quantum crypto (ML-KEM, ML-DSA) |
| AdjacentNetwork | 100 ns | LAN services, HTTP/2 APIs |
| RemoteNetwork | 50 μs | Internet-exposed services |

Cycle-based thresholds use a conservative 5 GHz reference frequency (assumes fast attacker hardware—smaller θ = more sensitive = safer).

**There is no single correct threshold.** The choice of attacker model is a statement about your threat model.

For exploratory analysis without a decision threshold, use `Custom { threshold_ns: 0.0 }`.

### 2.3 EffectEstimate

Summary of the detected timing effect:

```
EffectEstimate = {
  max_effect_ns: Float,                 // Posterior mean of W₁ distance (in nanoseconds)
  credible_interval_ns: (Float, Float), // 95% CI for W₁
  tail_diagnostics: TailDiagnostics     // Decomposition of effect into shift and tail components
}

TailDiagnostics = {
  shift_ns: Float,             // Uniform shift component (median difference)
  tail_ns: Float,              // Tail-specific component (beyond shift)
  tail_share: Float,           // Fraction of effect from tail [0-1]
  tail_slow_share: Float,      // Directionality of tail (p95+): fraction that are slowdowns [0-1]
  quantile_shifts: QuantileShifts,  // Per-quantile differences for interpretation
  pattern_label: EffectPattern // Classification of effect pattern
}

QuantileShifts = {
  p50_ns: Float,  // Median difference (50th percentile shift)
  p90_ns: Float,  // 90th percentile shift
  p95_ns: Float,  // 95th percentile shift
  p99_ns: Float   // 99th percentile shift
}

EffectPattern =
  | TailEffect      // Leak concentrated in upper quantiles (tail_share > 0.6)
  | UniformShift    // Leak affects all quantiles equally (tail_share < 0.3)
  | Mixed           // Combination of shift and tail (0.3 ≤ tail_share ≤ 0.6)
  | Negligible      // No significant effect detected
```

**Interpreting W₁ distance:**

The W₁ (Wasserstein-1) distance measures the minimum cost to transform one distribution into another. The `tail_diagnostics` field decomposes this single scalar metric into interpretable components, helping users understand whether the leak is:
- A **uniform shift** (all quantiles affected equally, e.g., constant-time overhead difference)
- A **tail effect** (upper quantiles affected more, e.g., cache misses)
- A **mixed pattern** (combination of both)

The decomposition works by comparing the W₁ distance to the median difference:
- If W₁ ≈ median difference, the effect is uniform (constant shift)
- If W₁ >> median difference, the effect is concentrated in the tail (e.g., cache misses)

The `tail_slow_share` field indicates the directionality of tail deviations (p95+): values near 1.0 indicate tail deviations are predominantly slowdowns, values near 0.0 indicate speedups, and values near 0.5 indicate balanced directionality. This metric operates on quantile-aligned differences and measures what fraction of tail deviation magnitude comes from positive (slowdown) differences.

The `quantile_shifts` provide per-quantile differences for understanding the effect distribution, helping identify at which percentiles the leak manifests.

### 2.4 MeasurementQuality

Assessment of measurement precision:

```
MeasurementQuality =
  | Excellent  // MDE < 5 ns
  | Good       // MDE 5–20 ns
  | Poor       // MDE 20–100 ns
  | TooNoisy   // MDE > 100 ns
```

### 2.5 InconclusiveReason

```
InconclusiveReason =
  | DataTooNoisy { message: String, guidance: String }
  | NotLearning { message: String, guidance: String }
  | WouldTakeTooLong { estimated_time_secs: Float, samples_needed: Int, guidance: String }
  | ThresholdElevated {
      theta_user: Float,                    // What user requested
      theta_eff: Float,                     // What we measured at
      leak_probability_at_eff: Float,       // P(m(δ) > θ_eff | Δ)
      meets_pass_criterion_at_eff: Bool,    // True if P < pass_threshold at θ_eff
      achievable_at_max: Bool,              // Could θ_user be reached with max budget?
      message: String,
      guidance: String
    }
  | TimeBudgetExceeded { current_probability: Float, samples_collected: Int }
  | SampleBudgetExceeded { current_probability: Float, samples_collected: Int }
  | ConditionsChanged { drift: ConditionDrift }
```

The `meets_pass_criterion_at_eff` field indicates whether $P(m(\delta) > \theta_{\text{eff}} \mid \Delta)$ < pass_threshold. This allows CI systems to implement policies like "treat pass-criterion-met-at-floor as acceptable" without changing inference semantics.

The `achievable_at_max` field distinguishes:
- `true`: $\theta_{\text{floor}}$ > $\theta_{\text{user}}$ now, but $\theta_{\text{floor}}(n_{\max}) \le \theta_{\text{user}}$ (more sampling may help)
- `false`: $\theta_{\text{floor}}(n_{\max}) > \theta_{\text{user}}$ (cannot reach $\theta_{\text{user}}$ on this platform/configuration)

### 2.6 Diagnostics

```
Diagnostics = {
  // Dependence and effective samples
  dependence_length: Int,           // b̂ from Politis-White (bootstrap only)
  effective_sample_size: Int,       // n / τ̂
  iact_combined: Float,             // max(τ̂_F, τ̂_R)

  // Stationarity
  stationarity_ratio: Float,
  stationarity_ok: Bool,

  // Outlier handling
  outlier_rate_baseline: Float,
  outlier_rate_sample: Float,
  outlier_asymmetry_ok: Bool,

  // Timer and mode
  discrete_mode: Bool,
  timer_resolution_ns: Float,
  duplicate_fraction: Float,

  // Run information
  preflight_ok: Bool,
  calibration_samples: Int,
  total_time_secs: Float,
  seed: Option<Int>,
  threshold_ns: Float,
  timer_name: String,
  platform: String,

  // Gibbs sampler
  gibbs_iters_total: Int,
  gibbs_burnin: Int,
  gibbs_retained: Int,
  lambda_mean: Float,               // Posterior mean of λ (prior precision)
  lambda_mixing_ok: Bool,

  // Robust likelihood
  likelihood_inflated: Bool,        // True if κ_mean < 0.3

  // Warnings
  warnings: List<String>,
  quality_issues: List<QualityIssue>
}

QualityIssue = {
  code: IssueCode,
  message: String,
  guidance: String
}

IssueCode =
  | DependenceHigh       // High autocorrelation, reduced effective samples
  | PrecisionLow         // Limited measurement precision
  | DiscreteMode         // Coarse timer resolution
  | ThresholdIssue       // Cannot achieve requested threshold
  | FilteringApplied     // Outliers were capped
  | StationarityIssue    // Conditions may have changed
  | NumericalIssue       // Gibbs sampler convergence concern
  | LikelihoodInflated   // Uncertainty inflated for robustness
```

***

## 3. Statistical Methodology

This section describes the mathematical foundation of tacet. All formulas in this section are normative; implementations MUST produce equivalent results.

### 3.1 Test Statistic: Wasserstein-1 Distance

We collect timing samples from two classes:
- **Fixed class (F)**: A specific input (e.g., all zeros)
- **Random class (R)**: Randomly sampled inputs

We measure the **Wasserstein-1 (W₁) distance** between the two timing distributions. The W₁ distance measures the minimum cost to transform one distribution into another, where cost is the amount of probability mass times the distance it must be moved.

The test statistic is a **1D scalar** (in nanoseconds):

$$
\Delta = W_1(\hat{F}_{\text{Fixed}}, \hat{F}_{\text{Random}}) \in \mathbb{R}
$$

where $\hat{F}$ denotes the empirical cumulative distribution function.

**Debiased W₁ distance:**

For human-readable output, implementations MAY compute a debiased estimator:

$$
W_1^{\text{deb}} = \max\left(0, W_1(\text{Fixed}, \text{Random}) - \theta_{\text{floor}}\right)
$$

where $\theta_{\text{floor}}$ is the measurement noise floor (§3.3.3). The debiased estimator provides a display-friendly interpretation of effect magnitude relative to measurement noise. Inference (§3.4) uses the raw W₁ distance.

**Why W₁ distance?**

Timing leaks manifest in different ways:
- **Uniform shift**: A different code path adds constant overhead → entire distribution shifts
- **Tail effect**: Cache misses occur probabilistically → upper quantiles shift more

The W₁ distance captures both patterns in a single scalar metric:
- For uniform shifts, W₁ ≈ shift magnitude (median difference)
- For tail effects, W₁ emphasizes the tail differences
- Mixed patterns are captured proportionally

**W₁ properties:**
- **Single metric**: Captures distributional differences in nanoseconds
- **Optimal transport**: Weights differences by their quantile position
- **Natural zero**: W₁ between identical distributions equals zero
- **1D inference**: Enables fast Bayesian updating with simple conjugacy

**W₁ computation:** Implementations MUST compute W₁ using the sorted-samples method for discrete distributions. See the [Implementation Guide](https://tacet.sh/reference/implementation-guide) for the algorithm.

### 3.2 Two-Phase Architecture

The system operates in two phases:

```
┌─────────────────────────────────────────────────────────────────┐
│                           Architecture                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │  Calibration │───▶│   Adaptive   │───▶│   Decision   │       │
│  │    Phase     │    │     Loop     │    │    Output    │       │
│  └──────────────┘    └──────────────┘    └──────────────┘       │
│         │                   │                   │                │
│         ▼                   ▼                   ▼                │
│   • Estimate Σ_rate    • Collect batch    • Pass (P<5%)         │
│   • Compute θ_floor    • Update Δ         • Fail (P>95%)        │
│   • Set prior scale    • Scale Σ by 1/n   • Inconclusive        │
│   • Warmup caches      • Update θ_floor                         │
│   • Pre-flight checks  • Compute P(>θ)                          │
│                        • Check quality                          │
│                        • Check stopping                          │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

**Phase 1: Calibration** (runs once)
- Collect initial samples to characterize measurement noise
- Estimate covariance structure via stream-based block bootstrap
- Compute "covariance rate" $\Sigma_{\text{rate}}$ that scales as $\Sigma = \Sigma_{\text{rate}} / n$
- Compute initial measurement floor $\theta_{\text{floor}}$ and floor-rate constant c_floor
- Compute effective threshold $\theta_{\text{eff}}$ and calibrate prior scale $\sigma$
- Run pre-flight checks (timer sanity, harness sanity, stationarity)

**Phase 2: Adaptive Loop** (iterates until decision)
- Collect batches of samples
- Update quantile estimates from all data collected so far
- Scale covariance: $\Sigma_n = \Sigma_{\text{rate}}/n$
- Update $\theta_{\text{floor}}(n)$ using floor-rate constant
- Run Gibbs sampler to approximate posterior and compute $P(\text{effect} > \theta_{\text{eff}})$
- Check quality gates (posterior ≈ prior → Inconclusive)
- Check decision boundaries (P > 95% → Fail, P < 5% → Pass)
- Check time/sample budgets

**Why this structure?**

The key insight is that covariance scales as 1/n for quantile estimators. By estimating the covariance *rate* once during calibration, we can cheaply update the posterior as more data arrives; no re-bootstrapping needed. This makes adaptive sampling computationally tractable.

### 3.3 Calibration Phase

The calibration phase runs once at startup to characterize measurement noise.

**Sample collection:**

Implementations SHOULD collect n_cal samples per class (default: 5,000). This is enough to estimate covariance structure reliably while keeping calibration fast.

**Fragile regime:**

A **fragile regime** is a measurement condition where standard statistical assumptions may not hold, requiring more conservative estimation. The fragile regime is triggered when either:

- **Discrete timer mode**: The timer has coarse resolution (see §3.6)
- **Low uniqueness**: The minimum uniqueness ratio across classes is below 10%:
  $$\min\left(\frac{|\text{unique}(F)|}{n_F}, \frac{|\text{unique}(R)|}{n_R}\right) < 0.10$$

When a fragile regime is detected, implementations apply more conservative procedures (larger block lengths, regularized covariance) to maintain calibration.

#### 3.3.1 Acquisition Stream Model

Measurement produces an interleaved acquisition stream indexed by time:

$$
\{(c_t, y_t)\}_{t=1}^{T}, \quad c_t \in \{F, R\}, \; T \approx 2n
$$

where $y_t$ is the measured runtime (or ticks) at acquisition index $t$, and F/R denote Fixed and Random classes.

Per-class samples are obtained by filtering the stream:

$$
F := \{y_t : c_t = F\}, \quad R := \{y_t : c_t = R\}
$$

**Critical principle:** The acquisition stream is the data-generating process. All bootstrap and dependence estimation MUST preserve adjacency in acquisition order, not per-class position.

#### 3.3.2 Variance Estimation via Stream-Based Block Bootstrap

Timing measurements exhibit autocorrelation: nearby samples are more similar than distant ones due to cache state, frequency scaling, etc. Standard bootstrap assumes i.i.d. samples, underestimating variance. Implementations MUST use block bootstrap on the acquisition stream to preserve the true dependence structure.

**Block length selection:** Implementations SHOULD use the Politis-White algorithm to select the optimal block length, with class-conditional ACF to avoid underestimation from interleaved sampling. See the [Implementation Guide](https://tacet.sh/reference/implementation-guide#22-politis-white-block-length-selection) for the algorithm.

**Variance rate (scalar):**

The variance of the W₁ estimator scales as 1/n. The **variance rate** is a scalar for the 1D W₁ distance:

$$
\text{var}_{\text{rate}} = \hat{\sigma}^2_{\text{cal}} \cdot n_{\text{cal}}
$$

where $\hat{\sigma}^2_{\text{cal}}$ is the bootstrap variance estimate from calibration.

Define the **calibration long-run variance proxy**:

$$
V_{\text{cal}} := \text{var}_{\text{cal}} \cdot n_{\text{cal}}
$$

This is a **scalar** (not a matrix) because W₁ is a 1D statistic.

**Effective sample size via IACT:**

Under strong temporal dependence, n samples do not provide n independent observations. Implementations SHOULD compute integrated autocorrelation time (IACT) for diagnostic purposes:

$$
n_{\text{eff}} := \frac{n}{\hat{\tau}}
$$

where $\hat{\tau}$ is estimated via the Geyer Initial Monotone Sequence (IMS) method or equivalent. See the [Implementation Guide](https://tacet.sh/reference/implementation-guide#3-iact-computation) for algorithms.

Implementations SHOULD report as diagnostics:
- `Diagnostics.dependence_length` = $\hat{b}$ (block length from bootstrap)
- `Diagnostics.iact_combined` = $\hat{\tau}$ (IACT estimate)
- `Diagnostics.effective_sample_size` = $n_{\text{eff}}$ (for diagnostics only)

**Variance scaling during adaptive loop:**

Since $V_{\text{cal}}$ is computed via block bootstrap (which preserves temporal dependence structure), it already represents the long-run variance rate. During inference with $n$ samples:

$$
\text{var}_n = \frac{V_{\text{cal}}}{n}
$$

The block bootstrap accounts for autocorrelation in the variance estimate. IACT estimates are computed and reported as diagnostics but do not affect inference.

#### 3.3.3 Measurement Floor and Effective Threshold

A critical design element is distinguishing between what the user *wants* to detect ($\theta_{\text{user}}$) and what the measurement *can* detect ($\theta_{\text{floor}}$).

**Floor-rate constant (from null distribution):**

Implementations MUST compute a floor-rate constant once at calibration by bootstrapping the null distribution of raw W₁:

1. Generate null W₁ replicates via within-class splits: split baseline into two halves and compute W₁ between them; similarly for sample class
2. Scale each null replicate by $\sqrt{n_{\text{blocks}}}$ where $n_{\text{blocks}} = \max(1, \lfloor n_{\text{cal}} / L \rfloor)$ is the number of independent blocks at calibration
3. Compute $c_{\text{floor}}$ as the 95th percentile of these scaled null replicates

This SHOULD use at least 2,000 null bootstrap replicates for stable percentile estimation.

**Measurement floor (dynamic):**

During the adaptive loop, the statistical floor decreases as sample size grows:

$$
\theta_{\text{floor,stat}}(n) = \max\left(\theta_{\text{tick}}, \frac{c_{\text{floor}}}{\sqrt{n_{\text{blocks}}(n)}}\right)
$$

where $n_{\text{blocks}}(n) = \max(1, \lfloor n / L \rfloor)$ is the number of independent blocks for sample size $n$, and $L$ is the block length.

The tick floor is fixed once batching is determined:

$$
\theta_{\text{tick}} = \frac{\text{1 tick (ns)}}{K}
$$

where $K$ is the batch size. The combined floor:

$$
\theta_{\text{floor}}(n) = \theta_{\text{floor,stat}}(n)
$$

(The tick floor is already incorporated in the $\max$ within $\theta_{\text{floor,stat}}$.)

**Effective threshold ($\theta_{\text{eff}}$):**

$$
\theta_{\text{eff}} = \max(\theta_{\text{user}}, \theta_{\text{floor}})
$$

**Threshold elevation decision rule:**

When $\theta_{\text{eff}}$ > $\theta_{\text{user}}$:

1. **Fail propagates**: Detecting $m(\delta)$ > $\theta_{\text{eff}}$ implies $m(\delta)$ > $\theta_{\text{user}}$. Implementations MAY return Fail.

2. **Pass does not propagate**: "No detectable effect above $\theta_{\text{eff}}$" is compatible with effects in ($\theta_{\text{user}}$, $\theta_{\text{eff}}$]. Implementations MUST NOT return Pass when $\theta_{\text{eff}}$ > $\theta_{\text{user}}$ + $\varepsilon_{\theta}$.

**Dynamic floor updates:**

During the adaptive loop, $\theta_{\text{floor}}(n)$ decreases as n grows. Implementations MUST:

1. Recompute $\theta_{\text{eff}} = \max(\theta_{\text{user}}, \theta_{\text{floor}}(n))$ after each batch
2. If $\theta_{\text{floor}}(n)$ drops to $\theta_{\text{user}}$ or below, Pass becomes possible (subject to posterior)
3. Report the $\theta_{\text{eff}}$ used for the **final** decision

#### 3.3.4 Prior Scale Calibration (Half-t Prior)

The prior on the 1D W₁ effect $\delta$ MUST be a **half-t distribution** (Student's t restricted to positive values):

$$
\delta \sim \text{half-}t_\nu(0, \sigma^2)
$$

where $\nu$ is a fixed degrees-of-freedom parameter and $\sigma$ is a scale parameter calibrated to an exceedance target.

W₁ distances are non-negative, so the prior is a half-t distribution (Student's t restricted to positive values). This is implemented via a scale mixture of Gaussians (see §3.4.3).

**Degrees of freedom ($\nu$):**

Implementations MUST use $\nu := 4$.

**Calibrating $\sigma$ via exceedance target:**

The scale $\sigma$ MUST be chosen so that the prior exceedance probability equals a fixed target $\pi_0$ (default 0.62):

$$
P\left(\delta > \theta_{\text{user}} \;\middle|\; \delta \sim \text{half-}t_\nu(0, \sigma^2)\right) = \pi_0
$$

This MUST be solved by deterministic 1D root-finding using Monte Carlo integration. See the [Implementation Guide](https://tacet.sh/reference/implementation-guide#7-prior-scale-calibration) for the algorithm.

The prior is calibrated against $\theta_{\text{user}}$ (the user's threat-model threshold). The measurement floor $\theta_{\text{floor}}$ is handled separately in the decision logic (§3.3.3).

**Design properties:** The half-t with ν = 4 provides heavy tails while maintaining finite variance. The calibrated scale σ balances informativeness and flexibility: the prior is weakly informative (allowing data to dominate when signal is strong) without being vacuous.

#### 3.3.5 Deterministic Seeding Policy

To ensure reproducible results, all random number generation MUST be deterministic by default.

**Normative requirement:**

> Given identical timing samples and configuration, the oracle MUST produce identical results (up to floating-point roundoff).

**Seeding policy:**

- The bootstrap RNG seed MUST be deterministically derived from:
  - A fixed library constant seed (default: 0x74696D696E67, "timing" in ASCII)
  - A stable hash of configuration parameters
- All Monte Carlo RNG seeds (leak probability, floor constant, prior scale) MUST be similarly deterministic
- The Gibbs sampler RNG seed MUST be deterministic
- The chosen seeds SHOULD be reported in diagnostics

### 3.4 Bayesian Model

The prior is a half-t distribution over the 1D W₁ distance, implemented via Gibbs sampling on the scale-mixture representation.

#### 3.4.1 Latent Parameter

The latent parameter is the true W₁ distance between timing distributions:

$$
\delta \in \mathbb{R}^+ \quad \text{(true W₁ distance in nanoseconds)}
$$

This is constrained to be non-negative (W₁ distances are always ≥ 0).

#### 3.4.2 Likelihood (Robust t-likelihood via scale mixture)

The observed W₁ statistic $\Delta$ may deviate from the Gaussian approximation when $\text{var}_n$ is underestimated. To prevent pathological posterior certainty under variance misestimation, implementations MUST use a robust likelihood with a scalar precision factor $\kappa$.

$$
\Delta \mid \delta, \kappa \sim \mathcal{N}\!\left(\delta, \frac{\text{var}_n}{\kappa}\right)
$$

$$
\kappa \sim \text{Gamma}\!\left(\frac{\nu_\ell}{2}, \frac{\nu_\ell}{2}\right)
$$

Marginally, this gives a univariate t-distribution:

$$
\Delta \mid \delta \sim t_{\nu_\ell}(\delta, \text{var}_n)
$$

**Gamma parameterization:** shape–rate. $E[\kappa] = 1$.

**Degrees of freedom for likelihood ($\nu_{\ell}$):**

Implementations MUST use $\nu_{\ell}$ := 4.

**Robustness mechanism:** When $\text{var}_n$ is underestimated, $\kappa$ is pulled downward and inflates uncertainty automatically. The parameter ν_ℓ = 4 matches the prior degrees of freedom, maintaining consistency in tail behavior between prior and likelihood.

**Likelihood inflation warning:**

Implementations SHOULD set `Diagnostics.likelihood_inflated = true` when $\kappa_{\text{mean}} < 0.3$, indicating the likelihood variance was effectively scaled up for robustness.

#### 3.4.3 Prior (Half-t via scale mixture)

The prior on $\delta \in \mathbb{R}^+$ is:

$$
\delta \sim \text{half-}t_\nu(0, \sigma^2), \quad \nu = 4
$$

Implementations MUST implement inference using the equivalent hierarchical model:

$$
\lambda \sim \text{Gamma}\left(\frac{\nu}{2}, \frac{\nu}{2}\right)
$$

$$
\delta \mid \lambda \sim \text{half-}\mathcal{N}\left(0, \frac{\sigma^2}{\lambda}\right)
$$

where Gamma uses **shape–rate** parameterization, and half-𝒩 denotes a half-normal distribution (normal restricted to positive values).

**Marginal prior variance:**

For $\nu$ = 4:

$$
V_0^{\text{marginal}} := \text{Var}(\delta) = 2\sigma^2
$$

This scalar variance is used as the prior variance reference in Gate 1 (§3.5.2).

#### 3.4.4 Posterior Inference (Deterministic Gibbs Sampling)

The posterior is approximated using a **deterministic Gibbs sampler** over $(\delta, \lambda, \kappa)$.

**Gibbs schedule (normative):**

| Parameter | Value | Description |
|-----------|-------|-------------|
| N_gibbs | 5000 | Total iterations |
| N_burn | 1000 | Burn-in (discarded) |
| N_keep | 4000 | Retained samples |

**Initialization:**

$$
\lambda^{(0)} = 1, \quad \kappa^{(0)} = 1
$$

**Iteration order:**

For t = 1, ..., N_gibbs:
1. Sample $\delta^{(t)} \sim p(\delta \mid \lambda^{(t-1)}, \kappa^{(t-1)}, \Delta)$ (truncated normal, positive only)
2. Sample $\lambda^{(t)} \sim p(\lambda \mid \delta^{(t)})$
3. Sample $\kappa^{(t)} \sim p(\kappa \mid \delta^{(t)}, \Delta)$

The Gibbs conditionals and numerical implementation are detailed in the [Implementation Guide](https://tacet.sh/reference/implementation-guide#5-gibbs-sampler-implementation).

**Why 5000 iterations?**

The 1D Gibbs sampler converges much faster than the previous 9D sampler, but we use more iterations to ensure high-quality posterior approximation with minimal Monte Carlo error. The computational cost is still 5-10× lower than the 9D approach due to simpler conditionals.

**$\lambda$ mixing diagnostics:**

Implementations SHOULD set `Diagnostics.lambda_mixing_ok = false` when lambda_cv < 0.1 or lambda_ess < 20, indicating the sampler may not have converged.

#### 3.4.5 Decision Functional and Leak Probability

The decision functional is simply the W₁ distance itself:

$$
m(\delta) := \delta
$$

The leak probability is:

$$
P(\text{leak} > \theta_{\text{eff}} \mid \Delta) = P(\delta > \theta_{\text{eff}} \mid \Delta)
$$

**Estimation via Gibbs samples:**

$$
\widehat{P}(\text{leak}) = \frac{1}{N_{\text{keep}}} \sum_{s=1}^{N_{\text{keep}}} \mathbf{1}\left[\delta^{(s)} > \theta_{\text{eff}}\right]
$$

**Posterior summaries:**

Implementations MUST compute:

- **Posterior mean:** $\delta_{\text{post}} := \frac{1}{N_{\text{keep}}} \sum_s \delta^{(s)}$
- **Credible interval:** 95% CI for $\delta$ from empirical quantiles of $\{\delta^{(s)}\}_{s=1}^{N_{\text{keep}}}$

**Interpreting the probability:**

This is a **posterior probability**, not a p-value. When we report "72% probability of a leak," we mean: given the data and our model, 72% of the posterior mass corresponds to W₁ distances exceeding $\theta_{\text{eff}}$.

### 3.5 Adaptive Sampling Loop

The core innovation: collect samples until confident, with natural early stopping.

**Verdict-blocking semantics:**

> Pass/Fail verdicts MUST be emitted **only** if all measurement quality gates pass. Otherwise the oracle MUST return Inconclusive.

This policy ensures: **CI verdicts should almost never be confidently wrong.**

#### 3.5.1 Stopping Criteria

The adaptive loop terminates when any of these conditions is met:

1. **Pass**: `leak_probability < pass_threshold` (default 0.05) AND all quality gates pass
2. **Fail**: `leak_probability > fail_threshold` (default 0.95) AND all quality gates pass
3. **Inconclusive**: Any quality gate fails OR budget exhausted without reaching decision threshold

**Why adaptive sampling works for Bayesian inference:**

Frequentist methods suffer from **optional stopping**: if you keep sampling until you get a significant result, you inflate your false positive rate.

Bayesian methods don't have this problem. The posterior probability is valid regardless of when you stop; this is the **likelihood principle**. Your inference depends only on the data observed, not your sampling plan.

#### 3.5.2 Quality Gates (Verdict-Blocking)

Quality gates detect when data is too poor to reach a confident decision. When any gate triggers, the outcome MUST be Inconclusive.

**Gate 1: Insufficient Information Gain**

This gate detects when the data provides insufficient information relative to the prior, either because the posterior barely moved from the prior (data too noisy) or because the posterior stopped updating (not learning).

Implementations MUST compute the KL divergence between Gaussian surrogates of the prior and posterior. For the 1D case:

$$
\mathrm{KL} = \frac{1}{2}\left(
\frac{V_{\text{post}}}{V_0^{\text{marginal}}}
+ \frac{\mu_{\text{post}}^2}{V_0^{\text{marginal}}}
- 1 + \ln\frac{V_0^{\text{marginal}}}{V_{\text{post}}}
\right)
$$

where $V_{\text{post}}$ is the posterior variance of $\delta$ (estimated from Gibbs samples) and $V_0^{\text{marginal}} = 2\sigma^2$ is the marginal prior variance (§3.4.3).

**Trigger conditions:**
- KL < KL_min (default 0.7 nats) → `DataTooNoisy`
- Sum of recent inter-batch KL divergences < 0.001 for 5+ batches → `NotLearning`

**Gate 2: Would Take Too Long**

Extrapolate time to decision based on current convergence rate. If projected time exceeds budget by a large margin (e.g., 10×), trigger Inconclusive with reason `WouldTakeTooLong`.

**Gate 3: Resource Budget Exceeded**

If elapsed time exceeds configured time budget → `TimeBudgetExceeded`
If total samples per class exceeds configured maximum → `SampleBudgetExceeded`

**Gate 4: Condition Drift Detected**

The covariance estimate $\Sigma_{\text{rate}}$ is computed during calibration. If measurement conditions change during the adaptive loop, this estimate becomes invalid.

Detect condition drift by comparing measurement statistics from calibration against the full test run:

- Variance ratio: $\sigma_{\text{post}}^2 / \sigma_{\text{cal}}^2$
- Autocorrelation change: $|\rho_{\text{post}}(1) - \rho_{\text{cal}}(1)|$
- Mean drift: $|\mu_{\text{post}} - \mu_{\text{cal}}| / \sigma_{\text{cal}}$

If variance ratio is outside [0.5, 2.0], or autocorrelation change exceeds 0.3, or mean drift exceeds 3.0, trigger Inconclusive with reason `ConditionsChanged`.

**Note on threshold elevation:** The case where $\theta_{\text{floor}}$ > $\theta_{\text{user}}$ is handled by the threshold elevation decision rule (§3.3.3), not by a quality gate.

### 3.6 Discrete Timer Mode

When the timer has low resolution (e.g., Apple Silicon's 41ns cntvct_el0), quantile estimation behaves differently due to tied values.

**Trigger condition:**

Discrete timer mode triggers when minimum uniqueness ratio is below 10%.

**Mid-distribution quantiles:**

Instead of standard quantile estimators, use mid-distribution quantiles which handle ties correctly. See the [Implementation Guide](https://tacet.sh/reference/implementation-guide#62-mid-distribution-quantiles).

**Work in ticks internally:**

In discrete mode, implementations SHOULD perform computations in **ticks** (timer's native unit) and convert to nanoseconds only for display.

**Gaussian approximation caveat:**

The Gaussian likelihood is a rougher approximation with discrete data. Implementations MUST report a quality issue about discrete timer mode.

### 3.7 Calibration Validation Requirements

The Bayesian approach requires empirical validation that posteriors are well-calibrated.

**Null calibration test (normative requirement):**

Implementations MUST provide a "fixed-vs-fixed" validation that measures end-to-end false positive rates:

- FPR_gated = P(Fail | H₀, all verdict-blocking gates pass)

**Acceptance criteria:**

| Metric | Target | Action if Exceeded |
|--------|--------|-------------------|
| FPR_gated | 2-5% | ≤ 5% | MUST escalate conservatism |

**Large-effect detection tests:**

Implementations MUST include validation tests ensuring the Student's *t* prior correctly detects obvious leaks even when measurement noise is high.

***

## 4. Measurement Requirements

This section defines abstract requirements for measurement. For implementation details, see the [Implementation Guide](https://tacet.sh/reference/implementation-guide).

### 4.1 Timer Requirements

Implementations MUST use a timer that:
- Is monotonic (never decreases)
- Has known resolution
- Reports results that can be converted to nanoseconds

Implementations SHOULD use the highest-resolution timer available on the platform.

### 4.2 Acquisition Stream Requirements

Measurements MUST be collected as an interleaved acquisition stream (see §3.3.1):
- Fixed and Random class measurements MUST be interleaved
- The interleaving order SHOULD be randomized
- The full acquisition stream (with class labels) MUST be preserved for bootstrap

### 4.3 Input Pre-generation

All inputs MUST be generated before the measurement loop begins. Generating inputs inside the timed region causes false positives.

### 4.4 Outlier Handling

Implementations MUST cap (winsorize), not drop, outliers:
1. Compute t_cap = 99.99th percentile from pooled data
2. Cap samples exceeding t_cap
3. Winsorization happens before quantile computation

**Quality thresholds:** >0.1% capped → warning; >1% → acceptable; >5% → `TooNoisy`.

### 4.5 Adaptive Batching

On platforms with coarse timer resolution, implementations SHOULD batch operations:

**When batching is needed:**

If pilot measurement shows fewer than 5 ticks per call, enable batching.

**Batch size selection:**

$$
K = \text{clamp}\left( \left\lceil \frac{50}{\text{ticks\_per\_call}} \right\rceil, 1, 20 \right)
$$

**Effect scaling:**

Reported effects MUST be divided by K to give per-operation estimates.

### 4.6 Measurability

If ticks per call < 5 even with maximum batching (K=20), implementations MUST return `Unmeasurable`.

### 4.7 Pre-flight Checks

Implementations SHOULD perform pre-flight checks:
- **Timer sanity**: Verify monotonicity and reasonable resolution
- **Harness sanity (fixed-vs-fixed)**: Detect test harness bugs
- **Stationarity**: Detect drift during measurement

***

## 5. API Design Principles

This section provides language-agnostic guidance for API design. These are recommendations (SHOULD) unless marked otherwise.

### 5.1 Input Specification

**Two-class pattern:**

Implementations SHOULD expose the DudeCT two-class pattern:
- **Baseline class**: Fixed input (typically all zeros)
- **Sample class**: Variable input (typically random)

This pattern tests for data-dependent timing, not specific value comparisons.

### 5.2 Configuration Ergonomics

**Attacker model presets as primary entry point:**

The primary configuration entry point SHOULD be attacker model selection, not raw threshold values.

**Sane defaults:**

Default configuration SHOULD:
- Use `AdjacentNetwork` attacker model (or equivalent)
- Set time budget to 60 seconds
- Set sample budget to 1,000,000
- Set pass/fail thresholds to 0.05/0.95

### 5.3 Result Communication

**Leak probability prominence:**

The leak probability MUST be prominently displayed in results and human-readable output.

**Threshold transparency:**

When $\theta_{\text{eff}}$ > $\theta_{\text{user}}$, implementations MUST clearly indicate this to the user.

**Inconclusive guidance:**

For `Inconclusive` outcomes, implementations MUST provide the reason and SHOULD provide actionable guidance.

***

## 6. Configuration Parameters

### 6.1 Required Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| attacker_model OR threshold_ns | AttackerModel or Float | Defines the effect threshold |

### 6.2 Optional Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| time_budget | 60 seconds | Maximum test duration |
| max_samples | 1,000,000 | Maximum samples per class |
| pass_threshold | 0.05 | P(leak) below this → Pass |
| fail_threshold | 0.95 | P(leak) above this → Fail |
| calibration_samples | 5,000 | Samples for calibration phase |
| batch_size | 1,000 | Samples per adaptive batch |
| bootstrap_iterations | 2,000 | Bootstrap iterations for covariance |

***

## Appendix A: Mathematical Notation

| Symbol | Meaning |
|--------|---------|
| {(c_t, y_t)} | Acquisition stream: class labels and timing measurements |
| T | Acquisition stream length (≈ 2n) |
| F, R | Per-class sample sets (filtered from stream) |
| $\Delta$ | Scalar: observed W₁ distance (in nanoseconds) |
| $\delta$ | Scalar: true (latent) W₁ distance (in nanoseconds) |
| W₁ | Wasserstein-1 distance between timing distributions |
| $\text{var}_n$ | Variance of W₁ estimator at sample size n |
| $\text{var}_{\text{rate}}$ | Variance rate (scalar) |
| $V_{\text{cal}}$ | Calibration long-run variance proxy: $\text{var}_{\text{cal}} \cdot n_{\text{cal}}$ |
| $\hat{\tau}$ | Integrated autocorrelation time (via Geyer IMS) |
| n_eff | Effective sample size: $n_{\text{eff}} = n / \hat{\tau}$ |
| n_blocks | Number of independent blocks: $\max(1, \lfloor n / L \rfloor)$ where $L$ is block length |
| $\nu$ | Half-t degrees of freedom (fixed at 4) |
| $\sigma$ | Half-t prior scale (calibrated via exceedance target) |
| $\lambda$ | Latent prior precision multiplier in scale-mixture representation |
| $\kappa$ | Latent likelihood precision multiplier (robust t-likelihood) |
| $\nu_{\ell}$ | Likelihood degrees of freedom (fixed at 4) |
| $t_{\nu}(0, \sigma^2)$ | Univariate Student's *t* with $\nu$ df, location 0, scale $\sigma^2$ |
| $\text{half-}t_{\nu}(0, \sigma^2)$ | Half-t distribution (t restricted to positive values) |
| $V_0^{\text{marginal}}$ | Marginal prior variance: $2\sigma^2$ for $\nu = 4$ |
| $V_{\text{post}}$ | Posterior variance of $\delta$ (estimated from Gibbs samples) |
| $\delta_{\text{post}}$ | Posterior mean of $\delta$ (from Gibbs samples) |
| N_gibbs | Total Gibbs iterations (5000) |
| N_burn | Burn-in iterations (1000) |
| N_keep | Retained Gibbs samples (4000) |
| $\theta_{\text{user}}$ | User-requested threshold |
| $\theta_{\text{floor}}$ | Measurement floor (smallest resolvable effect) |
| $c_{\text{floor}}$ | Floor-rate constant: $\theta_{\text{floor,stat}} = c_{\text{floor}}/\sqrt{n}$ |
| $\theta_{\text{tick}}$ | Timer resolution component of floor |
| $\theta_{\text{eff}}$ | Effective threshold used for inference |
| $m(\delta)$ | Decision functional: $\delta$ (the W₁ distance itself) |
| b̂ | Block length (Politis-White, on acquisition stream) |
| MDE | Minimum detectable effect |
| n | Samples per class |
| B | Bootstrap iterations |
| KL | KL divergence: KL(posterior ∥ prior) for Gate 1 |
| KL_min | Minimum KL threshold for conclusive verdict (default 0.7 nats) |

***

## Appendix B: Normative Constants

These constants define conformant implementations. Implementations MAY use different values only where noted.

| Constant | Default | Normative | Rationale |
|----------|---------|-----------|-----------|
| Test statistic | W₁ distance | MUST | 1D scalar, optimal transport |
| Prior family | Half-t | MUST | Continuous scale adaptation, non-negative |
| Degrees of freedom ($\nu$) | 4 | MUST | Heavy tails + finite variance |
| Gamma parameterization | shape–rate | MUST | Avoid library ambiguity |
| Gibbs iterations (N_gibbs) | 5000 | MUST | High-quality posterior approximation |
| Gibbs burn-in (N_burn) | 1000 | MUST | Conservative convergence |
| Gibbs retained (N_keep) | 4000 | MUST | Low MC variance |
| Gibbs initialization ($\lambda^{(0)}$, $\kappa^{(0)}$) | 1 | MUST | Prior means |
| Likelihood df ($\nu_{\ell}$) | 8 | MUST | Robustness to variance misestimation |
| Prior exceedance target ($\pi_0$) | 0.62 | SHOULD | Genuine uncertainty |
| Prior calibration MC draws | 50,000 | SHOULD | Stable $\sigma$ calibration |
| Bootstrap iterations | 2,000 | SHOULD | Variance estimation accuracy |
| Monte Carlo samples (c_floor) | 50,000 | SHOULD | Floor-rate constant estimation |
| Batch size | 1,000 | SHOULD | Adaptive iteration granularity |
| Calibration samples | 5,000 | SHOULD | Initial variance estimation |
| Pass threshold | 0.05 | SHOULD | 95% confidence of no leak |
| Fail threshold | 0.95 | SHOULD | 95% confidence of leak |
| KL_min (nats) | 0.7 | MUST | Minimum information gain for conclusive verdict |
| Block length cap | min(3√T, T/3) | SHOULD | Prevent degenerate blocks |
| Discrete threshold | 10% unique | SHOULD | Trigger discrete mode |
| Min ticks per call | 5 | SHOULD | Measurability floor |
| Max batch size | 20 | SHOULD | Limit microarch artifacts |
| Default time budget | 60 s | MAY | Maximum runtime |
| Default sample budget | 1,000,000 | MAY | Maximum samples |
| Default RNG seed | 0x74696D696E67 | SHOULD | "timing" in ASCII |
| Likelihood inflation threshold | 0.3 | SHOULD | $\kappa_{\text{mean}}$ triggering LikelihoodInflated |

***

## Appendix C: References

**Statistical methodology:**

1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Ch. 3. Springer. (Bayesian linear regression)

2. Politis, D. N. & White, H. (2004). "Automatic Block-Length Selection for the Dependent Bootstrap." Econometric Reviews 23(1):53–70.

3. Künsch, H. R. (1989). "The Jackknife and the Bootstrap for General Stationary Observations." Annals of Statistics. (Block bootstrap)

4. Hyndman, R. J. & Fan, Y. (1996). "Sample quantiles in statistical packages." The American Statistician 50(4):361–365.

5. Welford, B. P. (1962). "Note on a Method for Calculating Corrected Sums of Squares and Products." Technometrics 4(3):419–420.

6. Gelman, A. et al. (2013). Bayesian Data Analysis, 3rd ed., Ch. 11-12. CRC Press. (Gibbs sampling, scale mixtures)

7. Lange, K. L., Little, R. J. A., & Taylor, J. M. G. (1989). "Robust Statistical Modeling Using the t Distribution." JASA 84(408):881-896. (Student's t for robustness)

**Timing attacks:**

8. Reparaz, O., Balasch, J., & Verbauwhede, I. (2016). "Dude, is my code constant time?" DATE. (DudeCT methodology)

9. Crosby, S. A., Wallach, D. S., & Riedi, R. H. (2009). "Opportunities and Limits of Remote Timing Attacks." ACM TISSEC 12(3):17. (Exploitability thresholds)

10. Van Goethem, T., et al. (2020). "Timeless Timing Attacks." USENIX Security. (HTTP/2 timing attacks)

11. Bernstein, D. J. et al. (2024). "KyberSlash." (Timing vulnerability example)

12. Dunsche, M. et al. (2025). "SILENT: A New Lens on Statistics in Software Timing Side Channels." arXiv:2504.19821. (Relevant hypotheses framework)

**Existing tools:**

13. dudect (C): https://github.com/oreparaz/dudect
14. dudect-bencher (Rust): https://github.com/rozbb/dudect-bencher

***

## Appendix D: Changelog

### v7.1 (from v7.0)

**Statistical correctness refinements:**

This release fixes inference semantics in the v7.0 W₁ implementation based on statistician review.

**Core changes:**

- **Inference uses raw W₁ (§3.1):** Bayesian inference uses raw W₁ distance without debiasing or clamping. Debiased W₁ is computed only for display purposes to help users interpret effect magnitude above measurement noise.

- **Floor from null distribution (§3.3.3):** Measurement floor $c_{\text{floor}}$ is calibrated from the 95th percentile of null W₁ replicates (via within-class splits). This provides correct Type I error control under the null hypothesis.

- **Block count terminology (Appendix A):** Variable $n_{\text{blocks}} = \max(1, \lfloor n / L \rfloor)$ replaces ambiguous $n_{\text{eff}}$ in floor calculations. Effective sample size $n_{\text{eff}} = n / \hat{\tau}$ remains for IACT-based diagnostics.

- **Prior targets user threshold (§3.3.4):** Half-t prior scale $\sigma$ is calibrated so that $P(\delta > \theta_{\text{user}}) = \pi_0$. The prior encodes the user's security threshold.

- **Robust likelihood (§3.4.2):** Student-t likelihood degrees of freedom changed from $\nu_{\ell} = 8$ to $\nu_{\ell} = 4$ (matching prior ν = 4) for consistency. Robustness parameter $\kappa \sim \text{Gamma}(\nu_{\ell}/2, \nu_{\ell}/2)$ guards against variance underestimation.

- **Tail directionality metric (§2.3):** `tail_slow_share` measures fraction of tail deviation magnitude (p95+) from slowdowns, computed as $\sum \max(d_i - \text{shift}, 0) / \sum |d_i - \text{shift}|$ over tail indices. Operates on quantile-aligned differences, not sample identities.

**Migration notes:**

- These changes affect statistical inference but not API surface
- Results may differ slightly from v7.0 due to corrected floor calibration and prior targeting
- No code changes required for users; test outcomes may be more conservative (fewer false positives)

### v7.0 (from v6.0)

**Migration from 9D quantile differences to 1D Wasserstein-1 distance:**

This is a **breaking change** that fundamentally simplifies the statistical methodology while improving sensitivity and performance.

**Core changes:**

- **Test statistic (§3.1):** Replace 9D quantile-difference vector with 1D W₁ (Wasserstein-1) distance
  - W₁ measures the minimum cost to transform one distribution into another
  - Naturally captures both uniform shifts and tail effects in a single scalar
  - Debiased estimator: W₁_deb = max(0, W₁(baseline, sample) - θ_floor)

- **Prior (§3.4.3):** Replace 9D multivariate t-distribution with 1D half-t distribution
  - Half-t is appropriate for non-negative W₁ distances
  - Implemented via scale mixture: λ ~ Gamma(ν/2, ν/2), δ | λ ~ half-𝒩(0, σ²/λ)
  - Marginal prior variance: V₀^marginal = 2σ² (scalar, not matrix)

- **Variance estimation (§3.3.2):** Replace 9×9 covariance matrix with scalar variance
  - var_rate (scalar) instead of Σ_rate (matrix)
  - Variance scaled by sample size: var_n = var_rate / n
  - Block bootstrap still used to preserve temporal dependence

- **Bayesian inference (§3.4):** Replace 9D Gibbs sampler with 1D Gibbs sampler
  - Gibbs iterations increased from 256 to 5000 (1000 burn-in, 4000 retained)
  - 1D sampler is much simpler but we use more iterations for low MC variance
  - Decision functional simplified: m(δ) = δ (the W₁ distance itself)
  - Posterior probability: P(δ > θ_eff | Δ)

- **Quality gate (§3.5.2):** Simplified KL divergence formula for 1D case
  - KL = ½(V_post/V₀ + μ_post²/V₀ - 1 + ln(V₀/V_post))
  - Same threshold (KL_min = 0.7 nats) for data-too-noisy detection

- **EffectEstimate (§2.3):** Updated to reflect W₁ as primary metric
  - max_effect_ns: Posterior mean of W₁ distance
  - credible_interval_ns: 95% CI for W₁
  - tail_diagnostics: Decomposition into shift and tail components
    - shift_ns: Median difference (uniform shift component)
    - tail_ns: Tail-specific component (beyond shift)
    - tail_share: Fraction of effect from tail [0-1]
    - quantile_shifts: Per-quantile differences (p50, p90, p95, p99) for interpretation
    - pattern_label: TailEffect | UniformShift | Mixed | Negligible

**Expected benefits:**

1. **5-10× faster inference:** 1D Gibbs sampler converges much faster than 9D, despite using more iterations
2. **Better tail detection:** Optimal transport naturally emphasizes distributional differences
3. **Simpler interpretation:** Single distance in nanoseconds vs. 9 correlated quantile differences
4. **Natural debiasing:** W₁(F, F) = 0 exactly (unlike quantile differences which vary due to sampling)

**Migration notes:**

- Existing code using tacet v6.x will need to update result handling
- The top_quantiles field in EffectEstimate has been replaced with tail_diagnostics
- All posterior probabilities now refer to W₁ distance, not max quantile difference
- Mathematical notation updated in Appendix A to reflect 1D formulation

### v6.0 (from v5.7)

**Specification simplification and modularization:**

- **Removed:** Research mode as separate outcome type. When θ_user = 0, the oracle returns Inconclusive with posterior estimates. This removes ~50 lines and the ResearchStatus enum.

- **Removed:** 2D projection for reporting (§3.4.6 in v5.7). The shift/tail decomposition and EffectPattern enum have been removed. EffectEstimate now reports max|δ|, 95% CI, and top quantiles by exceedance probability.

- **Removed:** Exploitability classification (§2.4 in v5.7). This is now documented only in the user guide, not the statistical specification.

- **Simplified:** κ robust likelihood diagnostics. Only `likelihood_inflated: Bool` is exposed; detailed κ mixing diagnostics (kappa_mean, kappa_sd, kappa_cv, kappa_ess, kappa_mixing_ok) are internal implementation details.

- **Consolidated:** Quality gates from 6 to 4:
  - Gates 1+2 → Insufficient Information Gain
  - Gates 4+5 → Resource Budget Exceeded
  - Gate 3 → Would Take Too Long
  - Gate 6 → Conditions Changed

- **Consolidated:** Issue codes from 30+ to 8 categories: DependenceHigh, PrecisionLow, DiscreteMode, ThresholdIssue, FilteringApplied, StationarityIssue, NumericalIssue, LikelihoodInflated.

- **Created:** Separate Implementation Guide for algorithms (IACT, block bootstrap, numerical stability, Gibbs conditionals, quantile computation).

- **Created:** Separate Power Module Specification for power/EM analysis.

- **Removed:** Projection mismatch diagnostics and threshold calibration (no longer needed without 2D projection).

- **Removed:** Dimension-aware regularization details (moved to Implementation Guide; timing uses d=9 which rarely triggers stressed regimes).

**Rationale:** These changes reduce specification complexity by ~40% while preserving all normative statistical requirements. The removed features were either:
1. Reporting conveniences (2D projection, exploitability classification) that don't affect inference
2. Separate code paths for edge cases (Research mode) that can be handled by existing mechanisms
3. Implementation details (algorithms, numerical procedures) better suited to a separate guide