Introduction
Over the last few months I’ve been building a small ecosystem of tools to analyse trading signals: how often they fire, how strong they are, and whether they actually predict future price moves in a way that’s statistically robust rather than anecdotal.
At the core of this setup is a Python framework that:
- ingests historical bar data for a universe of symbols,
- attaches one or more signals (indicators with a direction and a strength),
- tracks subsequent price moves over different horizons, and
- computes summary metrics that try to quantify “how good” each signal is.
So far, most of my work has focused on getting the data pipeline and visualisations into a usable state. The next critical step is to make sure that the metrics themselves are correctly implemented.
In particular, I want to validate three metrics that I lean on heavily when evaluating individual signals:
- the Information Coefficient (IC),
- a custom Figure of Merit (FOM), and
- win rate.
Each of these metrics is supposed to capture a different aspect of predictive power. Before I start trusting them on messy real-world data, I first want to know:
If I feed these metrics a very simple synthetic signal whose behaviour I understand analytically, do they produce the values I expect?
This post is about that validation step.
How to proceed
The most reliable way to test metrics like IC, FOM and win rate is to feed them synthetic data where I control the ground truth.
The idea is:
- Define a simple toy model where I know the relationship between signal and future returns.
- Derive the theoretical values that IC, FOM and win rate should take under that model (or at least what qualitative pattern they should follow).
- Generate simulated data from that model and run it through the same pipeline I use for real signals.
- Compare theory vs implementation:
- If the numbers (or shapes of the curves) match, that’s a good sign.
- If not, either my code is wrong or my mental model of the metric is wrong.
This synthetic-data approach buys me two things:
- It decouples metric validation from the chaos of real markets.
- It forces me to be precise about what each metric actually measures.
Metrics to be validated
Information Coefficient (IC)
What IC represents
In quant finance, the Information Coefficient is usually defined as the correlation between signal values and subsequent returns. Intuitively:
- IC close to +1: high signal values tend to be followed by high future returns (strong, consistent positive relationship).
- IC close to –1: high signal values tend to be followed by low future returns (strong negative relationship).
- IC near 0: the signal behaves like noise with respect to future returns.
Strictly speaking, “IC” often refers to a cross-sectional correlation (across many names on the same date), but in this project I’m using the same term for a time-series correlation between one signal and its own future returns. The intuition is the same: does the ordering induced by the signal line up with the ordering of future returns?
Mathematical definition
Let

- $s_t$ be the signal value at time $t$,
- $r_{t,h}$ be the forward return over horizon $h$ starting at $t$.

For a fixed horizon $h$, the IC is

$$\mathrm{IC}(h) = \rho\left(s_t,\; r_{t,h}\right),$$

where $\rho$ is a correlation coefficient (e.g. Pearson or rank correlation).

In practice I compute this over all times $t$ where the signal is “actionable” at horizon $h$ (e.g. excluding missing data, holidays, etc.).
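As a concrete sketch, here is how that computation might look in numpy. The function and array names are illustrative, not the framework’s actual API; “rank” correlation is implemented as Pearson on ranks, which is Spearman for continuous data:

```python
import numpy as np

def _ranks(x):
    # ordinal ranks (ties broken by order); fine for continuous synthetic data
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x), dtype=float)
    return r

def information_coefficient(signal, fwd_returns, method="pearson"):
    """IC at one horizon: correlation between signal values and forward
    returns, restricted to bars where both are available."""
    signal = np.asarray(signal, dtype=float)
    fwd_returns = np.asarray(fwd_returns, dtype=float)
    # keep only "actionable" bars: both signal and forward return present
    mask = ~(np.isnan(signal) | np.isnan(fwd_returns))
    if mask.sum() < 2:
        return np.nan
    s, r = signal[mask], fwd_returns[mask]
    if method == "rank":  # Spearman = Pearson on ranks
        s, r = _ranks(s), _ranks(r)
    return float(np.corrcoef(s, r)[0, 1])
```

On a noiseless linear relationship this returns 1.0 (up to floating point), which already makes it a handy building block for the synthetic checks later in this post.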
How IC is useful here
IC gives me a scale-free measure of predictive power:
- It doesn’t care about the absolute units of the signal or returns.
- It directly encodes whether “larger signal” tends to mean “bigger subsequent move in the expected direction”.
For my use-case:
- I can compute $\mathrm{IC}(h)$ for multiple horizons $h$ and see where the signal is most informative.
- I can compare two signals on the same symbol by comparing their IC curves.
- IC is also a nice sanity check: on purely random synthetic data, it should cluster around zero.
Figure of Merit (FOM)
What FOM represents
The Figure of Merit (FOM) is a custom scalar metric designed for this project. It tries to answer:
Over a range of holding periods, how much better is this signal than a carefully matched random reference, in a risk-adjusted sense?
The logic is:
- For each direction (LONG or SHORT), I look at the distribution of price changes that follow signal occurrences.
- I also construct a random reference signal with the same unconditional properties (frequency and strength distribution) but no predictive power.
- I then compare the standardised mean returns of the studied signal vs the reference across multiple time-delta bins, down-weighting very long horizons.
- Finally, I combine LONG and SHORT sides into a single scalar.
So FOM is essentially a time-horizon-weighted, risk-adjusted performance gap between the studied signal and a null model that “looks similar” but is random.
Mathematical definition (brief version)
For a given direction (say LONG), I have observations:

- $x_i$: realised price moves following the signal,
- $\Delta t_i$: their associated time deltas (how long we held the position).

I also have the corresponding $(x^{\mathrm{ref}}_j, \Delta t^{\mathrm{ref}}_j)$ for the random reference.

I partition the time axis into bins $B_k$, $k = 0, \dots, K-1$.

For each bin I compute sample means and standard deviations:

$$\mu_k = \operatorname{mean}\{\, x_i : \Delta t_i \in B_k \,\}, \qquad \sigma_k = \operatorname{std}\{\, x_i : \Delta t_i \in B_k \,\},$$

and similarly for the reference. I then standardise:

$$z_k = \frac{\mu_k}{\sigma_k}, \qquad z^{\mathrm{ref}}_k = \frac{\mu^{\mathrm{ref}}_k}{\sigma^{\mathrm{ref}}_k}.$$

These values say “how many standard deviations away from zero the average move is in each bin”.

To build the per-direction FOM, I:

- take the bin-wise difference between studied and reference, and
- down-weight later bins by their index:

$$\mathrm{FOM}_{\mathrm{LONG}} = \sum_{k \ge 1} \frac{z_k - z^{\mathrm{ref}}_k}{k}.$$

(The $k = 0$ term is explicitly set to zero in the implementation.)

If either the studied or reference series has too few observations in a bin, I just define that bin’s contribution to be zero to avoid noisy artefacts.

Finally, the metric value reported for a signal is:

$$\mathrm{FOM} = \mathrm{FOM}_{\mathrm{LONG}} - \mathrm{FOM}_{\mathrm{SHORT}}.$$

So a positive FOM means “LONG tends to outperform its random reference and/or SHORT tends to underperform its random reference (in a good way for us).”
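A minimal sketch of the per-direction piece, assuming the bin edges and the minimum-observation cut-off are parameters (the real implementation’s names and defaults may well differ):

```python
import numpy as np

def fom_direction(moves, dts, ref_moves, ref_dts, bin_edges, min_obs=5):
    """Per-direction FOM: sum over bins k >= 1 of (z_k - z_k_ref) / k,
    where z_k is the standardised mean move in time-delta bin k.
    Bin k spans [bin_edges[k], bin_edges[k + 1]); bin 0 is skipped,
    matching the convention that the k = 0 term is set to zero."""
    total = 0.0
    for k in range(1, len(bin_edges) - 1):
        in_bin = (dts >= bin_edges[k]) & (dts < bin_edges[k + 1])
        in_bin_ref = (ref_dts >= bin_edges[k]) & (ref_dts < bin_edges[k + 1])
        x, x_ref = moves[in_bin], ref_moves[in_bin_ref]
        if len(x) < min_obs or len(x_ref) < min_obs:
            continue  # too few observations: this bin contributes zero
        z = x.mean() / x.std(ddof=1)
        z_ref = x_ref.mean() / x_ref.std(ddof=1)
        total += (z - z_ref) / k
    return total
```

The reported metric would then be `fom_direction(...)` evaluated on the LONG observations minus the same evaluated on the SHORT observations.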
How the random reference signal is generated
The random reference is designed to preserve the unconditional characteristics of the studied signal while removing any alignment with future returns.
Informally, the generator does two things:
1. Match the firing frequency. For each bar, I flip a Bernoulli coin with probability equal to the empirical fraction of bars where the studied signal fires. If the coin comes up “true”, the reference signal fires on that bar.
2. Match the strength distribution. When the reference fires, its strength is drawn from the empirical strength histogram of the studied signal, using inverse-CDF sampling. That way:
   - the unconditional strength distribution of the random reference matches the studied signal,
   - but where in time those strengths appear is completely random.
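Sketched in numpy (names are mine; resampling the observed strengths with `rng.choice` is equivalent to inverse-CDF sampling from the empirical distribution):

```python
import numpy as np

def random_reference(fires, strengths, rng=None):
    """Build a random reference: same firing frequency and same
    unconditional strength distribution as the studied signal, but with
    firing times completely decoupled from the price series.

    fires     : boolean array, one entry per bar (True = signal fired)
    strengths : strengths observed on the bars where the signal fired
    """
    rng = rng or np.random.default_rng()
    p_fire = fires.mean()                        # empirical firing frequency
    ref_fires = rng.random(len(fires)) < p_fire  # Bernoulli coin per bar
    # resampling observed strengths with replacement == inverse-CDF
    # sampling from the empirical strength distribution
    ref_strengths = rng.choice(strengths, size=int(ref_fires.sum()), replace=True)
    return ref_fires, ref_strengths
```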
This matters because it ensures we’re not comparing the studied signal to a toy benchmark that’s obviously different (e.g. “signal that fires every day with constant strength”).
When FOM is meaningful (and when it isn’t)
FOM is only really interpretable if the studied signal has some sparsity and variation:
- If the signal fires every single day with nearly constant strength, then a “random reference” that matches its frequency and distribution is almost trivially the same as the signal. The FOM will tend toward zero, but that doesn’t tell you much.
- Likewise, if you never apply strength cuts (e.g. “only trade when the strength exceeds a threshold”), the studied and random signals will look very similar in terms of which situations they’re exposed to.
In practice, I find FOM most informative when:
- I work with subsetted signals, e.g. “only take trades when strength is above a threshold”, and
- I compare FOM of the studied signal vs FOM of the random reference as a function of these strength thresholds.
One particularly interesting diagnostic is:
Fix a minimum strength of 0 and gradually increase the maximum strength of the studied signal. For each strength range, compute FOM for the studied signal and for the random reference, and plot both as a function of the maximum strength cut.
If the metric and code are behaving sensibly, I’d expect:
- for purely random synthetic data: studied and reference FOM curves should sit on top of each other (up to noise),
- for genuinely predictive synthetic data: the studied FOM curve should sit consistently above the reference curve in the strength ranges where the signal is informative.
How FOM is useful here
FOM complements IC and win rate by:
- Incorporating time horizon structure (via the bins),
- Adjusting for volatility (via standardisation),
- Explicitly benchmarking against a matched random reference.
It effectively answers: “If I account for how often the signal fires, how volatile its payoffs are, and how that compares to a null model, does this thing still look good?”
Win rate
What win rate represents
The win rate is intentionally simple: it’s the fraction of trades where the signal’s direction matches the sign of the realised price move over a chosen horizon.
- For a LONG signal, a “win” means the price went up over the holding period.
- For a SHORT signal, a “win” means the price went down (so the realised P&L is positive in the signal’s direction).
It doesn’t care about trade size or magnitude of returns; it’s just a directional accuracy metric.
Mathematical definition
Let’s focus on a single horizon $h$. For each signal occurrence $i$:

- $d_i \in \{+1, -1\}$ is the direction (LONG = $+1$, SHORT = $-1$),
- $r_i$ is the realised return from entry to exit at horizon $h$.

Define an indicator of a “correct direction”:

$$w_i = \mathbf{1}\!\left[\, d_i \, r_i > 0 \,\right],$$

i.e. $w_i = 1$ if the trade made money in the intended direction, and $w_i = 0$ otherwise.

Then the win rate is simply:

$$\mathrm{WR}(h) = \frac{1}{N} \sum_{i=1}^{N} w_i.$$

I can also split this into $\mathrm{WR}_{\mathrm{LONG}}(h)$ and $\mathrm{WR}_{\mathrm{SHORT}}(h)$ by restricting to each direction.
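The implementation is correspondingly tiny; a sketch with illustrative names:

```python
import numpy as np

def win_rate(directions, returns):
    """Fraction of trades whose realised return has the same sign as the
    trade direction (+1 = LONG, -1 = SHORT). Zero returns count as losses."""
    directions = np.asarray(directions, dtype=float)
    returns = np.asarray(returns, dtype=float)
    return float(np.mean(directions * returns > 0))
```

Splitting by direction is then just `win_rate(d[d > 0], r[d > 0])` and the SHORT analogue.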
How win rate is useful here
Win rate surfaces a different aspect of signal quality than IC or FOM:
- IC is sensitive to the strength ordering and the full joint distribution of signal and returns.
- FOM mixes risk-adjusted averages across multiple horizons against a random reference.
- Win rate just asks: “How often is this thing right?”
For validation on synthetic data, win rate is especially handy:
- On a noiseless, perfectly linear toy model, I should get win rates close to 100% for reasonable horizons.
- As I add noise, I can predict how the win rate should deteriorate and check that the implementation tracks that trend.
Synthetic data to be used
To keep everything interpretable, I start with linear data: a deliberately over-simplified model where the relationship between the signal and returns is almost cartoonishly clean.
Linear data
The basic idea is:
1. Generate a time series of “latent scores” or signal values $s_t$, e.g. from a normal distribution.
2. Define the actual signal as $\mathrm{signal}_t = s_t$, possibly rescaled or clipped to mimic realistic strength ranges (direction $= \operatorname{sign}(s_t)$, strength $= |s_t|$).
3. Define future returns over a fixed horizon $h$ as:

   $$r_{t,h} = \beta s_t + \varepsilon_t,$$

   where:
   - $\beta$ controls how strong the relationship is (signal-to-noise ratio),
   - $\varepsilon_t$ is independent noise, e.g. $\varepsilon_t \sim \mathcal{N}(0, \sigma_\varepsilon^2)$.
4. For directional signals:
   - LONG trades are opened when $s_t > 0$,
   - SHORT trades are opened when $s_t < 0$,
   - the strength $|s_t|$ can be used to apply strength cuts (e.g. only trade when $|s_t|$ exceeds a threshold).
This model is deliberately simple but has the key property we want:

Conditional on $s_t$, the expected return is linear in the signal:

$$\mathbb{E}[\, r_{t,h} \mid s_t \,] = \beta s_t.$$

That makes it much easier to derive what IC, FOM, and win rate should look like.
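Generating this toy model takes only a few lines; a sketch with assumed parameter names (`beta` for $\beta$, `sigma_eps` for the noise scale):

```python
import numpy as np

def make_linear_data(n, beta, sigma_s=1.0, sigma_eps=1.0, seed=0):
    """Toy linear model: r = beta * s + eps.
    Direction is sign(s), strength is |s|."""
    rng = np.random.default_rng(seed)
    s = rng.normal(0.0, sigma_s, n)       # latent scores / signal values
    eps = rng.normal(0.0, sigma_eps, n)   # independent noise
    r = beta * s + eps                    # forward return at a fixed horizon
    return s, r, np.sign(s), np.abs(s)    # signal, returns, direction, strength
```

All of the validation experiments below boil down to feeding different `(beta, sigma_s, sigma_eps)` triples through this generator and into the metrics pipeline.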
Theoretical values given linear data
Given the linear model

$$r_{t,h} = \beta s_t + \varepsilon_t,$$

with $\varepsilon_t$ independent of $s_t$, we can at least get closed-form or semi-closed-form expectations for the metrics.
IC under the linear model
Assume $s_t \sim \mathcal{N}(0, \sigma_s^2)$ and $\varepsilon_t \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, independent of each other. Then:

- $\operatorname{Cov}(s_t, r_{t,h}) = \beta \sigma_s^2$,
- $\operatorname{Var}(r_{t,h}) = \beta^2 \sigma_s^2 + \sigma_\varepsilon^2$.

So the IC is

$$\mathrm{IC} = \frac{\beta \sigma_s^2}{\sigma_s \sqrt{\beta^2 \sigma_s^2 + \sigma_\varepsilon^2}} = \frac{\beta \sigma_s}{\sqrt{\beta^2 \sigma_s^2 + \sigma_\varepsilon^2}}.$$

This tells us:

- If $\sigma_\varepsilon = 0$ (no noise), $\mathrm{IC} = \pm 1$ depending on the sign of $\beta$.
- If $\beta = 0$ (no true signal), $\mathrm{IC} = 0$.
- For fixed $\sigma_s$ and $\sigma_\varepsilon$, $\mathrm{IC}$ is monotone in $\beta$.

This gives a precise target: for any chosen $(\beta, \sigma_s, \sigma_\varepsilon)$, I can compute the theoretical IC and then check that the empirical IC from the code converges to this value as I simulate more data.
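This check is easy to script; a sketch (the helper name is mine) comparing the formula above against a Monte Carlo estimate:

```python
import numpy as np

def theoretical_ic(beta, sigma_s, sigma_eps):
    # IC = beta * sigma_s / sqrt(beta^2 * sigma_s^2 + sigma_eps^2)
    return beta * sigma_s / np.hypot(beta * sigma_s, sigma_eps)

# Monte Carlo: the empirical Pearson IC should converge to the formula
beta, sigma_s, sigma_eps = 0.5, 1.0, 1.0
rng = np.random.default_rng(42)
s = rng.normal(0.0, sigma_s, 200_000)
r = beta * s + rng.normal(0.0, sigma_eps, 200_000)
empirical_ic = float(np.corrcoef(s, r)[0, 1])
# theoretical value ≈ 0.447; empirical_ic should match up to Monte Carlo noise
```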
Win rate under the linear model
For win rate, the reasoning is slightly different. Ignoring SHORTs for a moment and assuming LONG only (i.e. take a trade whenever $s_t > 0$):

- A win occurs when $r_{t,h} > 0$.
- Since $r_{t,h} = \beta s_t + \varepsilon_t$, conditional on $s_t$, the probability of a win is:

$$\mathbb{P}(\text{win} \mid s_t) = \mathbb{P}(\varepsilon_t > -\beta s_t) = 1 - F_\varepsilon(-\beta s_t),$$

where $F_\varepsilon$ is the CDF of $\varepsilon_t$.

If $\varepsilon_t \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, this is a normal CDF evaluation. The overall win rate is then:

$$\mathrm{WR}_{\mathrm{LONG}} = \mathbb{E}\!\left[\, \Phi\!\left(\frac{\beta s_t}{\sigma_\varepsilon}\right) \,\middle|\; s_t > 0 \right].$$
This integral has a closed form for Gaussian assumptions, but for the purposes of this blog post the key qualitative points are:

- As $\beta$ increases (with the noise fixed), the win rate moves away from 50% toward 100%.
- As $\sigma_\varepsilon$ increases (with $\beta$ fixed), the win rate moves back toward 50%.
In practice, I’ll pick a few parameter sets (e.g. targeting a ~60%, ~70%, ~80% win rate) and compare the empirical win rates from the code against the numerical expectations from this integral.
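For the fully Gaussian case the closed form is in fact tidy: $(s_t, r_{t,h})$ are jointly Gaussian with correlation $\rho = \mathrm{IC}$, and the standard bivariate-normal orthant probability gives $\mathrm{WR}_{\mathrm{LONG}} = \tfrac{1}{2} + \arcsin(\rho)/\pi$. A sketch that cross-checks this against simulation (the function name is mine):

```python
import numpy as np

def gaussian_long_win_rate(beta, sigma_s, sigma_eps):
    """Closed-form LONG win rate under the Gaussian linear model:
    P(r > 0 | s > 0) = 1/2 + arcsin(rho) / pi, with rho = IC.
    (Standard bivariate-normal orthant probability.)"""
    rho = beta * sigma_s / np.hypot(beta * sigma_s, sigma_eps)
    return 0.5 + np.arcsin(rho) / np.pi

# Monte Carlo cross-check of the closed form
beta, sigma_s, sigma_eps = 0.5, 1.0, 1.0
rng = np.random.default_rng(7)
s = rng.normal(0.0, sigma_s, 500_000)
r = beta * s + rng.normal(0.0, sigma_eps, 500_000)
mc_long_win_rate = float((r[s > 0] > 0).mean())
```

With these parameters the closed form gives roughly 64.8%, so this triple is a natural candidate for a “~65% win rate” validation target; tuning `beta` up or down sweeps out the other target win rates.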
FOM under the linear model
FOM depends not just on the signal–return relationship, but also on how the time deltas are distributed across bins. For validation, I can define the toy model so that:
- The time deltas $\Delta t_i$ are drawn from a known distribution over a finite set of bins.
- The expected return conditional on the time delta is either:
- constant across bins (to check that the bin weighting and reference subtraction don’t introduce spurious structure), or
- systematically decaying or increasing with the time delta (to check that FOM responds in the expected direction).
One simple starting point is:
- Let all time deltas $\Delta t_i$ fall into the same bin: then all the contribution to FOM should come from that bin, and with a properly matched random reference we expect:
  - zero FOM when $\beta = 0$ (no predictive power),
  - positive FOM when $\beta > 0$ and LONG is the relevant direction,
  - negative FOM when $\beta < 0$ (signal predicts the opposite direction).
Then I can extend this to multi-bin scenarios where the expected return decays with the bin index, to see if the weighting and reference subtraction behave as I intend.
Again, the focus here is not on deriving a closed-form symbolic expression for FOM, but on:
- specifying synthetic models where the sign and relative magnitude of FOM are predictable, and
- verifying that the implementation matches those expectations.
Plots showing metrics with linear data
With the synthetic linear model in place, the final step is to actually run the pipeline and visualise the results.
The kinds of plots I’m planning to include:
- IC vs horizon
  - Simulated IC curve from the code over multiple horizons.
  - Theoretical IC as a horizontal reference line for each parameter set where the model is stationary.
  - Expectation: the empirical curve should converge toward the theoretical value as the simulated sample size increases.
- Win rate vs horizon / strength cut
  - For a fixed horizon, plot win rate as a function of a minimum strength threshold (only trade when the strength exceeds the threshold).
  - Alternatively, fix the strength threshold and show win rate vs horizon.
  - Expectation: higher strength thresholds should yield higher win rates (up to a point), and longer horizons may dilute the signal if the model is specified that way.
- FOM: studied vs random reference across strength ranges
  - Fix minimum strength = 0.
  - Vary the maximum strength included in the sample along the x-axis.
  - For each strength range $[0, s_{\max}]$, plot:
    - FOM for the studied signal,
    - FOM for the random reference.
  - Expectation:
    - Under the null (no predictive power), both curves sit together around zero.
    - Under the linear model with $\beta > 0$, the studied signal’s FOM curve rises above the reference in the region where the signal is actually informative.
These plots serve two purposes:
- They are a visual confirmation that the metrics are working as intended on controlled data.
- They double as documentation: anyone reading this (including future-me) can see exactly how IC, FOM and win rate behave on a simple benchmark.
In the next iteration of this series I’ll move beyond purely linear synthetic data and look at:
- signals that only “switch on” in certain regimes,
- asymmetric behaviour in LONG vs SHORT,
- and more realistic holding-period distributions.
But first, the goal is just to make sure that on the simplest possible toy models, the metrics line up with the maths.