# How It Works

<Aside>
**For curious users and practitioners:** This page explains the statistical methodology in accessible terms. You can use tacet effectively without understanding these details; if you just want to run tests, start with the [Quick Start](https://tacet.sh/getting-started/quick-start) instead.

**For statisticians and researchers:** This is a simplified overview. For the complete mathematical specification with formal definitions, see the [Specification](https://tacet.sh/reference/specification).
</Aside>

This page describes the statistical methodology used by tacet to detect timing side channels.

## The core idea

**In plain English:** We run your code thousands of times with two types of inputs (baseline and sample). If the timing distributions differ by more than your threshold, we report a leak. The key innovation is *how* we make that comparison:

1. **Compare distributions, not averages**: A timing leak might only affect some runs (cache misses, branch mispredictions), so we compare the full timing distribution using multiple percentiles.

2. **Bayesian probability**: Instead of a yes/no answer, we compute the *probability* that there's a security-relevant leak. This lets us say "72% confident there's a leak" rather than just "the t-statistic is 4.7."

3. **Adaptive stopping**: We keep collecting samples until we're confident enough to decide. Clear cases stop early; borderline cases get more evidence.

## Two-phase pipeline

tacet uses an adaptive two-phase approach:

```
                    Timing Samples
                          │
           ┌──────────────┴──────────────┐
           ▼                             ▼
   ┌───────────────┐           ┌──────────────────┐
   │  Phase 1:     │           │  Phase 2:        │
   │  Calibration  │           │  Adaptive Loop   │
   ├───────────────┤           ├──────────────────┤
   │ 5,000 samples │           │ Batch collection │
   │ Covariance    │           │ Posterior update │
   │ estimation    │           │ Decision check   │
   │ Prior setup   │   ────>   │ Quality gates    │
   │               │           │                  │
   │ Fixed cost    │           │ Stops when       │
   │               │           │ P < 0.05 or      │
   │               │           │ P > 0.95         │
   └───────────────┘           └──────────────────┘
```

### Phase 1: Calibration

The calibration phase collects an initial batch of samples (default: 5,000 per class) to:

1. **Estimate noise covariance**: Timing measurements are noisy. A circular block bootstrap (described below) estimates the covariance structure of that noise.

2. **Compute minimum detectable effect (MDE)**: Based on noise levels, what's the smallest effect we could reliably detect?

3. **Set Bayesian priors**: The prior distribution is centered on zero effect with spread determined by the MDE.
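The three calibration steps can be sketched in a few lines. This is illustrative only: the `(z_alpha + z_power) * sigma` rule is a textbook MDE convention standing in for tacet's actual formula, and the sample standard deviation stands in for the bootstrap covariance estimate.

```python
import statistics

def calibrate(samples, alpha_z=1.645, power_z=0.842):
    """Illustrative calibration: noise scale -> MDE -> prior width.

    Uses the textbook (z_alpha + z_power) * sigma convention for the
    minimum detectable effect; tacet's exact formula may differ.
    """
    sigma = statistics.stdev(samples)      # stand-in for the bootstrap noise estimate
    mde = (alpha_z + power_z) * sigma      # smallest effect we could reliably detect
    prior = {"mean": 0.0, "scale": mde}    # zero-centered prior, spread set by the MDE
    return sigma, mde, prior
```

The point of the last step is that the prior's spread scales with the noise floor: on a noisy machine, only larger effects move the posterior away from "no leak."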

### Phase 2: Adaptive loop

After calibration, the adaptive loop collects samples in batches until a decision is reached:

1. **Collect batch**: Gather N samples per class (default: 1,000)

2. **Update posterior**: Combine new data with existing posterior using Bayesian updating

3. **Check decision thresholds**: If P(leak) < 0.05, pass. If P(leak) > 0.95, fail.

4. **Apply quality gates**: Check for conditions that would make continued sampling futile

5. **Repeat or stop**: Either collect more samples or return a result

This adaptive approach:
- **Stops early** on clear pass/fail cases (saves time)
- **Gathers more evidence** when the signal is near the threshold
- **Returns Inconclusive** when hitting resource limits or quality gates
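The loop above can be sketched as a skeleton. The function names here (`collect_batch`, `update_posterior`, `leak_probability`) are illustrative placeholders, not tacet's API, and only the sample-budget quality gate is shown:

```python
def adaptive_test(collect_batch, update_posterior, leak_probability,
                  max_samples=1_000_000, batch_size=1_000):
    """Skeleton of the adaptive loop (names are illustrative, not tacet's API).

    collect_batch(n)    -> new timing samples for both classes
    update_posterior(b) -> folds a batch into the running posterior
    leak_probability()  -> current P(leak) under the posterior
    """
    collected = 0
    while collected < max_samples:
        update_posterior(collect_batch(batch_size))
        collected += batch_size
        p = leak_probability()
        if p < 0.05:      # confident there is no security-relevant leak
            return "Pass"
        if p > 0.95:      # confident there is one
            return "Fail"
    return "Inconclusive"  # sample budget exhausted (a quality gate)
```

A clear-cut leak crosses 0.95 after a few batches and exits early; a borderline signal keeps consuming budget until a gate fires.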

## Wasserstein distance

Rather than comparing means (which can miss distributional differences), tacet uses the **Wasserstein-1 distance** (also known as the earth mover's distance) to measure how different the timing distributions are:

$$
W_1(\text{baseline}, \text{sample}) = \frac{1}{n} \sum_{i=1}^{n} \left| x_{\text{baseline}}^{(i)} - x_{\text{sample}}^{(i)} \right|
$$

where $x_{\text{baseline}}^{(i)}$ and $x_{\text{sample}}^{(i)}$ are the $i$-th smallest values from each class.

**Why W₁?** The Wasserstein-1 distance captures the "cost" of transforming one distribution into another via optimal transport. Unlike a simple mean comparison, it's sensitive to both location shifts and shape changes in the distribution.
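For equal-size samples, the rank-matched formula above is simple to compute: sort both samples and average the absolute differences of matched order statistics. A minimal sketch in plain Python (not tacet's implementation):

```python
def wasserstein_1(baseline, sample):
    """Empirical W1 distance between two equal-size samples.

    Sorts both samples and averages the absolute differences of
    rank-matched values (i-th smallest vs i-th smallest).
    """
    assert len(baseline) == len(sample)
    xs, ys = sorted(baseline), sorted(sample)
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# A uniform 5 ns slowdown yields W1 = 5 regardless of the noise shape:
base = [100, 102, 101, 99, 98]
samp = [t + 5 for t in base]
print(wasserstein_1(base, samp))  # → 5.0
```

Note that a pure mean comparison would also catch this uniform shift; W₁'s advantage shows when only part of the distribution moves, e.g. a slow tail on 10% of runs.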

### Effect decomposition

The W₁ distance can be decomposed into two interpretable components:

$$
W_1 = |\text{shift}| + \text{tail}
$$

where:

- **shift** = median of rank-matched differences $\{x_{\text{baseline}}^{(i)} - x_{\text{sample}}^{(i)}\}$
- **tail** = mean of centered residuals $\left| (x_{\text{baseline}}^{(i)} - x_{\text{sample}}^{(i)}) - \text{shift} \right|$

The **tail share** ratio distinguishes timing leak patterns:

$$
\text{tail\_share} = \frac{\text{tail}}{|\text{shift}| + \text{tail}}
$$

| Tail Share | Pattern | Likely cause |
|------------|---------|--------------|
| < 0.20 | **UniformShift** | Branch on secret data (all timings shift uniformly) |
| 0.20–0.50 | **Mixed** | Multiple leak sources (branches + cache effects) |
| > 0.50 | **TailEffect** | Data-dependent cache behavior (upper quantiles affected more) |

This decomposition is useful because:
- A pure **UniformShift** suggests a conditional branch on secret data
- A pure **TailEffect** suggests data-dependent cache misses or memory access patterns
- **Mixed** patterns suggest multiple leak sources or complex microarchitectural effects
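The decomposition and the pattern table translate directly into code. A sketch under the definitions above (again illustrative, not tacet's implementation):

```python
import statistics

def decompose_w1(baseline, sample):
    """Split rank-matched timing differences into shift and tail components.

    shift: median of the rank-matched differences (location component)
    tail:  mean absolute deviation of the differences around the shift
    """
    diffs = [x - y for x, y in zip(sorted(baseline), sorted(sample))]
    shift = statistics.median(diffs)
    tail = sum(abs(d - shift) for d in diffs) / len(diffs)
    tail_share = tail / (abs(shift) + tail) if (shift or tail) else 0.0
    return shift, tail, tail_share

def classify(tail_share):
    # Thresholds from the pattern table above.
    if tail_share < 0.20:
        return "UniformShift"
    if tail_share <= 0.50:
        return "Mixed"
    return "TailEffect"
```

For example, a sample class that is uniformly 10 ns slower gives `shift = -10`, `tail = 0`, and `tail_share = 0`, classifying as UniformShift.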

## Bayesian inference

The core question: given observed timing data, what's the probability that the timing difference exceeds your security threshold θ?

**In plain English:** We start by assuming there's probably no leak (the prior). As we collect data, we update our belief. If the data strongly suggests a difference, our confidence in a leak increases. If not, it decreases. The final number (72%, 3%, etc.) represents how confident we are that there's a security-relevant leak.

### Prior

We start by assuming "probably no leak":
- The prior is centered at zero effect
- Its spread is determined by how noisy our measurements are (the MDE)
- Default: 75% prior probability of no leak

### Likelihood

Each batch of samples updates our belief. The likelihood answers: "How probable is this data if the true W₁ distance is δ?" We model uncertainty in the W₁ estimate with a Student's t distribution, using a robust variance estimate to account for measurement noise.

### Posterior

After each batch, we combine our prior belief with the new evidence using Bayes' theorem:

$$
p(\delta \mid \text{data}) \propto p(\text{data} \mid \delta) \cdot p(\delta)
$$

This gives us an updated probability distribution over the true W₁ distance $\delta$.

### Decision

We integrate the posterior to answer: "What's the probability that the W₁ distance exceeds the security threshold θ?"

$$
P(\text{leak}) = P(\delta > \theta \mid \text{all data})
$$

| Probability | Decision |
|-------------|----------|
| $P(\text{leak}) < 0.05$ | **Pass**: confident there's no exploitable leak |
| $P(\text{leak}) > 0.95$ | **Fail**: confident there's a leak |
| $0.05 \leq P(\text{leak}) \leq 0.95$ | Continue collecting or **Inconclusive** |

This is different from DudeCT's t-test approach. DudeCT asks "is there *any* statistical difference?" while tacet asks "is there a *security-relevant* difference above my threshold $\theta$?"
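The update-and-decide cycle can be sketched numerically on a grid. This is a deliberately simplified stand-in: a zero-centered Gaussian prior and a normal likelihood replace tacet's actual prior and Student-t model, but the Bayes step and the threshold integration are the same shape.

```python
import math

def posterior_leak_probability(w1_hat, sigma, theta, prior_scale, n_grid=2001):
    """Grid approximation of P(delta > theta | data).

    Illustrative only: a zero-centered Gaussian prior and a normal
    likelihood stand in for tacet's actual prior and Student-t model.
    """
    grid_max = max(10 * sigma, 10 * prior_scale, 2 * theta)
    step = grid_max / (n_grid - 1)
    post = []
    for i in range(n_grid):
        delta = i * step                                          # candidate true W1 (>= 0)
        prior = math.exp(-0.5 * (delta / prior_scale) ** 2)       # "probably no leak"
        like = math.exp(-0.5 * ((w1_hat - delta) / sigma) ** 2)   # evidence from the data
        post.append(prior * like)                                 # Bayes: prior x likelihood
    total = sum(post)
    mass_above = sum(p for i, p in enumerate(post) if i * step > theta)
    return mass_above / total
```

With an observed W₁ well above θ and low noise, the posterior mass above θ approaches 1 (Fail); with an observed W₁ near zero, it approaches 0 (Pass).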

## Quality gates

The adaptive loop includes five quality gates that can stop sampling early:

| Gate | Triggers when | Outcome |
|------|---------------|---------|
| `DataTooNoisy` | Posterior ≈ prior after calibration | Inconclusive |
| `NotLearning` | Posterior barely changes (KL divergence near zero) for 5+ batches | Inconclusive |
| `WouldTakeTooLong` | Projected time exceeds budget | Inconclusive |
| `TimeBudgetExceeded` | Wall-clock time exceeded | Inconclusive |
| `SampleBudgetExceeded` | Sample limit reached | Inconclusive |

These gates prevent wasting time on tests that won't converge.

## Adaptive batching

On platforms with coarse timer resolution (e.g., Apple Silicon's ~42ns counter), single-operation measurements are quantized. Adaptive batching compensates:

1. **Pilot measurement**: Measure ~100 warmup iterations to estimate operation time $t_{\text{op}}$

2. **$K$ selection**: Choose batch size $K$ to achieve ≥50 timer ticks per measurement:

$$
K = \max\!\left(1, \left\lceil \frac{50 \cdot r_{\text{timer}}}{t_{\text{op}}} \right\rceil\right)
$$

   where $r_{\text{timer}}$ is the timer resolution in nanoseconds.

3. **Bound $K$**: Cap at $K = 20$ to avoid microarchitectural artifacts

4. **Batch measurement**: Measure $K$ iterations together, analyze batch totals
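The $K$ selection formula is a one-liner. A sketch of steps 2 and 3 (parameter names are illustrative):

```python
import math

def batch_size(t_op_ns, timer_res_ns, target_ticks=50, k_max=20):
    """Pick a batch size K so each measurement spans >= target_ticks timer ticks.

    Mirrors the formula above: K = max(1, ceil(50 * r_timer / t_op)),
    capped at k_max to limit microarchitectural artifacts.
    """
    k = max(1, math.ceil(target_ticks * timer_res_ns / t_op_ns))
    return min(k, k_max)

# ~42 ns counter, 100 ns operation: ceil(50 * 42 / 100) = 21, capped to 20
print(batch_size(100, 42))  # → 20
```

A 10 µs operation on the same timer gives $K = 1$: each call already spans hundreds of ticks, so no batching is needed.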

Batching is disabled when:
- Timer resolution is already fine (>5 ticks per call)
- Using cycle-accurate timers (kperf, perf\_event)
- Operation is slow enough that a single call already spans many timer ticks
- Running on x86_64 (rdtsc provides cycle-accurate timing)

## Block bootstrap

Variance estimation uses **circular block bootstrap** to account for potential autocorrelation in timing data:

1. Divide samples into overlapping blocks of length $\ell \approx n^{1/3}$
2. Resample blocks with replacement
3. Compute W₁ distance on resampled data
4. Repeat $B = 2{,}000$ times to estimate variance $\hat{\sigma}^2$

Block bootstrap handles short-range dependencies that might arise from CPU state, cache effects, or measurement overhead.
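The resampling scheme can be sketched as follows. For brevity this sketch recomputes the mean of each resample rather than the W₁ statistic tacet actually uses, but the circular block structure is the same:

```python
import random

def circular_block_bootstrap(samples, n_reps=2000, seed=0):
    """Estimate the variance of a statistic under circular block bootstrap.

    Illustrative sketch: resamples blocks of length ~ n^(1/3) with
    replacement (wrapping around the end of the series), recomputes the
    mean each time, and returns the variance of those replicates. tacet
    applies the same resampling to the W1 statistic instead of the mean.
    """
    rng = random.Random(seed)
    n = len(samples)
    block_len = max(1, round(n ** (1 / 3)))
    n_blocks = -(-n // block_len)  # ceiling division: blocks per resample
    reps = []
    for _ in range(n_reps):
        resampled = []
        for _ in range(n_blocks):
            start = rng.randrange(n)
            # circular: indices wrap past the end of the sample
            resampled.extend(samples[(start + j) % n] for j in range(block_len))
        reps.append(sum(resampled[:n]) / n)
    mean_rep = sum(reps) / n_reps
    return sum((r - mean_rep) ** 2 for r in reps) / (n_reps - 1)
```

Keeping consecutive samples together inside blocks preserves short-range autocorrelation, which a plain (single-sample) bootstrap would destroy and thereby underestimate the variance.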

## References

For the complete mathematical specification, see the [Specification](https://tacet.sh/reference/specification).

**Academic references:**
- Reparaz, O., Balasch, J., & Verbauwhede, I. (2016). "Dude, is my code constant time?" DATE. (Original DudeCT methodology)
- Van Goethem, T., et al. (2020). "Timeless Timing Attacks." USENIX Security. (HTTP/2 timing attacks)
- Crosby, S. A., Wallach, D. S., & Riedi, R. H. (2009). "Opportunities and limits of remote timing attacks." ACM TISSEC. (Effect size to exploitability mapping)