Correlation model

The collector's job is to turn three independent streams into one curated row per interaction. This page documents exactly how that join works, because the join's confidence is itself a data-quality signal you review.

The authoritative implementation is build_curated() in FDA_Collector.py; a KQL-native equivalent (BuildFdaInteractionsKql) exists for spot-checks and validation.

Step 1 — pair prompts to responses (source A)

Graph rows arrive as individual messages (InteractionType = userPrompt or aiResponse). Within each ConversationId, messages are sorted by time, and every aiResponse is paired with the most recent preceding userPrompt:

for each conversation:
    sort messages by CreatedDateTime
    for each aiResponse R:
        P = last userPrompt with P.time <= R.time
        emit interaction(Question=P.body, Answer=R.body, Timestamp=R.time, PromptTime=P.time, User, ThreadId)

This yields a candidate interaction with question + answer + identity + a time bracket [PromptTime, Timestamp].
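
The pairing loop above can be sketched in Python as follows. This is a minimal illustration, not the code in build_curated(): the dictionary keys (Body, User, CreatedDateTime) are assumed field names, ThreadId is assumed to carry the ConversationId, and responses with no preceding prompt are simply skipped, which the text does not specify.

```python
from datetime import datetime


def pair_messages(messages):
    """Pair each aiResponse with the most recent preceding userPrompt.

    `messages` is a list of dicts with assumed keys: ConversationId,
    InteractionType, CreatedDateTime (datetime), Body, User.
    """
    by_conversation = {}
    for m in messages:
        by_conversation.setdefault(m["ConversationId"], []).append(m)

    interactions = []
    for conversation_id, msgs in by_conversation.items():
        # Sort by time; on a tie, prompts sort first so that a prompt with
        # the same timestamp as the response still satisfies P.time <= R.time.
        ordered = sorted(
            msgs,
            key=lambda m: (m["CreatedDateTime"], m["InteractionType"] != "userPrompt"),
        )
        last_prompt = None
        for m in ordered:
            if m["InteractionType"] == "userPrompt":
                last_prompt = m
            elif m["InteractionType"] == "aiResponse" and last_prompt is not None:
                interactions.append({
                    "Question": last_prompt["Body"],
                    "Answer": m["Body"],
                    "Timestamp": m["CreatedDateTime"],
                    "PromptTime": last_prompt["CreatedDateTime"],
                    "User": last_prompt["User"],   # assumption: identity comes from the prompt
                    "ThreadId": conversation_id,   # assumption: ThreadId == ConversationId
                })
    return interactions


def _msg(conv, kind, second, body, user="ana@example.com"):
    # Helper for the example data only.
    return {"ConversationId": conv, "InteractionType": kind,
            "CreatedDateTime": datetime(2024, 1, 1, 12, 0, second),
            "Body": body, "User": user}


messages = [
    _msg("c1", "aiResponse", 20, "A1"),
    _msg("c1", "userPrompt", 10, "Q1"),
    _msg("c1", "userPrompt", 30, "Q2"),
    _msg("c1", "aiResponse", 40, "A2"),
    _msg("c2", "aiResponse", 5, "no prompt before this"),
]
pairs = pair_messages(messages)
```

In this example the c2 response has no preceding prompt, so it produces no interaction; only the two c1 turns are emitted.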

Step 2 — attach executed DAX (source C) by user + time window

For each paired interaction, candidate DAX executions are those where:

ExecutingUser matches the interaction's User (case-insensitive), and
the DAX timestamp falls inside [PromptTime − window, Timestamp + window].

The window defaults to ±90 seconds (CORR_WINDOW_SEC). All matching executions are kept, sorted by time, and attached as the ordered DaxQueries array — an FDA turn frequently emits several validation/probe queries before the final one. The last execution in the window is treated as the primary ExecutedDax.

flowchart LR
    P[userPrompt<br/>PromptTime] --> R[aiResponse<br/>Timestamp]
    subgraph window["match window  [PromptTime − 90s … Timestamp + 90s]"]
        d1[DAX exec 1] --> d2[DAX exec 2] --> d3[DAX exec 3<br/>= primary ExecutedDax]
    end
    R -.->|same ExecutingUser<br/>same model| window
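
The window match described in this step can be sketched as below. It is a rough illustration under stated assumptions: the DAX row keys (Timestamp, QueryText) are hypothetical names, and the window boundaries are treated as inclusive, which the text does not confirm.

```python
from datetime import datetime, timedelta

CORR_WINDOW_SEC = 90  # default window from the text


def attach_dax(interaction, dax_rows, window_sec=CORR_WINDOW_SEC):
    """Attach all DAX executions by the same user inside the padded window."""
    pad = timedelta(seconds=window_sec)
    lo = interaction["PromptTime"] - pad
    hi = interaction["Timestamp"] + pad
    user = interaction["User"].lower()  # case-insensitive user match

    matches = sorted(
        (d for d in dax_rows
         if d["ExecutingUser"].lower() == user and lo <= d["Timestamp"] <= hi),
        key=lambda d: d["Timestamp"],
    )
    return {
        **interaction,
        # Every match is kept, in time order.
        "DaxQueries": [d["QueryText"] for d in matches],
        # The last execution in the window is the primary one.
        "ExecutedDax": matches[-1]["QueryText"] if matches else None,
    }


interaction = {
    "User": "Ana@Example.com",
    "PromptTime": datetime(2024, 1, 1, 12, 0, 0),
    "Timestamp": datetime(2024, 1, 1, 12, 0, 30),
}
dax_rows = [
    {"ExecutingUser": "ana@example.com", "Timestamp": datetime(2024, 1, 1, 12, 0, 25), "QueryText": "EVALUATE final"},
    {"ExecutingUser": "ana@example.com", "Timestamp": datetime(2024, 1, 1, 11, 59, 0), "QueryText": "EVALUATE probe"},
    {"ExecutingUser": "ana@example.com", "Timestamp": datetime(2024, 1, 1, 12, 5, 0), "QueryText": "EVALUATE late"},
    {"ExecutingUser": "bob@example.com", "Timestamp": datetime(2024, 1, 1, 12, 0, 10), "QueryText": "EVALUATE other"},
]
curated = attach_dax(interaction, dax_rows)
empty = attach_dax(interaction, [])
```

Here the probe query (60 seconds before the prompt) falls in the padding and is kept, the 12:05 query is outside the window, and the other user's query is ignored.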

Step 3 — score match confidence

`MatchConfidence`	Meaning
Exact	At least one DAX execution falls inside the turn's own bracket `[PromptTime, Timestamp]`, not merely in the ±window padding. This is the strongest signal that the DAX belongs to this turn.
Windowed	DAX matched only within the ±window padding (before the prompt or after the response). Plausible, but flagged.
Unmatched	No DAX execution matched. The interaction has question/answer text but no recovered DAX.

The review app colour-codes this (Exact green, Windowed amber, Unmatched red) so reviewers see join uncertainty rather than having it hidden.
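
A minimal sketch of the scoring rule in the table, assuming the input is the list of DAX timestamps that already passed the padded-window match, and assuming the `[PromptTime, Timestamp]` bracket is inclusive at both ends (the function name and signature are illustrative, not taken from the collector):

```python
from datetime import datetime


def match_confidence(prompt_time, response_time, matched_dax_times):
    """Classify a join given the timestamps of DAX rows matched in the padded window."""
    if not matched_dax_times:
        return "Unmatched"
    # Any execution inside the unpadded bracket upgrades the match to Exact.
    if any(prompt_time <= t <= response_time for t in matched_dax_times):
        return "Exact"
    # Otherwise every match came from the padding before or after the turn.
    return "Windowed"


prompt = datetime(2024, 1, 1, 12, 0, 0)
response = datetime(2024, 1, 1, 12, 0, 30)
inside = datetime(2024, 1, 1, 12, 0, 10)
padding_only = datetime(2024, 1, 1, 11, 59, 30)
```

With these values, `match_confidence(prompt, response, [padding_only])` yields Windowed, while adding `inside` to the list yields Exact.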

Step 4 — keep DAX orphans (no interaction matched)

Executed-DAX rows that matched no paired interaction are not discarded. Each becomes its own curated row with:

InteractionId = "dax-" + corr_key(...), MatchConfidence = "Unmatched", Sources = ["monitoring"],
empty Question/Answer (no text-bearing source recorded the interaction).

This guarantees no executed DAX is lost, even when Graph coverage is incomplete or absent. If Graph returns no rows at all, the curated build is monitoring-only: DAX without question/answer text.
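
The orphan rule could look roughly like this. The sketch assumes each DAX row has a hypothetical DaxId and QueryText, and it takes the key function as a parameter because the arguments passed to corr_key(...) are not spelled out here; the lambda in the example is a stand-in, not the real key.

```python
def build_orphan_rows(dax_rows, matched_dax_ids, key_fn):
    """Turn every DAX row that matched no interaction into its own curated row."""
    rows = []
    for d in dax_rows:
        if d["DaxId"] in matched_dax_ids:
            continue  # already attached to a paired interaction
        rows.append({
            "InteractionId": "dax-" + key_fn(d),
            "MatchConfidence": "Unmatched",
            "Sources": ["monitoring"],
            "Question": "",
            "Answer": "",
            "ExecutedDax": d["QueryText"],  # assumption: the orphan keeps its DAX text
        })
    return rows


dax_rows = [
    {"DaxId": "d1", "ExecutingUser": "ana@example.com", "QueryText": "EVALUATE Sales"},
    {"DaxId": "d2", "ExecutingUser": "bob@example.com", "QueryText": "EVALUATE Costs"},
]
# Stand-in key function for the example only.
orphans = build_orphan_rows(dax_rows, matched_dax_ids={"d1"},
                            key_fn=lambda d: "key-" + d["DaxId"])
# With no matched ids at all (for example, Graph returned nothing),
# every DAX row becomes an orphan: the monitoring-only build.
monitoring_only = build_orphan_rows(dax_rows, matched_dax_ids=set(),
                                    key_fn=lambda d: "key-" + d["DaxId"])
```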

The correlation key

corr = sha256( lower(user) + "|" + floor(timestamp → 1-minute) + "|" + semanticModelId )[:24]

CorrelationKey is stored on every curated row. It is a deterministic, content-derived id (user + minute bucket + model) used for de-duplication and as a stable handle for orphan DAX rows.
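
One way to compute a key of this shape is sketched below. The minute bucket is serialized as an ISO-8601 string and the 24-character slice is taken from the hex digest; both are assumptions, so keys produced by this sketch are not guaranteed to equal the ones FDA_Collector.py writes.

```python
import hashlib
from datetime import datetime


def corr_key(user, timestamp, semantic_model_id):
    """sha256(lower(user) | minute bucket | model id), truncated to 24 hex chars."""
    # Floor the timestamp to its 1-minute bucket.
    minute = timestamp.replace(second=0, microsecond=0)
    raw = f"{user.lower()}|{minute.isoformat()}|{semantic_model_id}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:24]


k1 = corr_key("Ana@Example.com", datetime(2024, 1, 1, 12, 0, 5), "model-1")
k2 = corr_key("ana@example.com", datetime(2024, 1, 1, 12, 0, 59), "model-1")
k3 = corr_key("ana@example.com", datetime(2024, 1, 1, 12, 1, 0), "model-1")
```

Because the key depends only on user, minute, and model, k1 and k2 are identical (same minute, user differing only in case), while k3 falls in the next bucket and differs. By the same logic, two distinct events from one user on one model within the same minute would share a key.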

De-duplication

Two layers protect against double-counting when the collector re-scans its trailing LOOKBACK_HOURS window:

Before append, the collector queries FdaInteractions for the InteractionIds it is about to write and drops any that already exist.
Within Raw_* reads, arg_max(IngestedAt, *) keeps the newest copy per natural key (the KQL-native builder and analyst queries use this pattern).
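
Both layers can be illustrated in plain Python. This is only an analogue: in the collector the first layer is a query against FdaInteractions and the second is a KQL arg_max, and the natural-key field used in the example (MessageId) is a hypothetical name.

```python
from datetime import datetime


def drop_existing(candidates, existing_ids):
    """Layer 1: drop candidate rows whose InteractionId is already curated."""
    existing = set(existing_ids)
    return [row for row in candidates if row["InteractionId"] not in existing]


def newest_per_key(raw_rows, key_fields):
    """Layer 2: keep the newest copy per natural key, like arg_max(IngestedAt, *)."""
    newest = {}
    for row in raw_rows:
        k = tuple(row[f] for f in key_fields)
        if k not in newest or row["IngestedAt"] > newest[k]["IngestedAt"]:
            newest[k] = row
    return list(newest.values())


to_write = drop_existing(
    [{"InteractionId": "a"}, {"InteractionId": "b"}],
    existing_ids=["a"],
)

raw = [
    {"MessageId": "m1", "IngestedAt": datetime(2024, 1, 1, 12, 0), "Body": "old copy"},
    {"MessageId": "m2", "IngestedAt": datetime(2024, 1, 1, 12, 0), "Body": "only copy"},
    {"MessageId": "m1", "IngestedAt": datetime(2024, 1, 2, 12, 0), "Body": "new copy"},
]
latest = newest_per_key(raw, key_fields=["MessageId"])
```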

Latency and back-fill

Audit/Graph records can lag the live interaction by anywhere from a few minutes to roughly 30 minutes, whereas workspace monitoring is near-real-time. Because the collector always re-scans a trailing window (LOOKBACK_HOURS, default 48), late-arriving records are picked up on a subsequent run and de-duplicated against what was already curated.
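
A toy simulation of this behaviour, assuming each source row carries a hypothetical EventTime (when the interaction happened) and ArrivedAt (when the record became readable). It models only the scan window and the id check, not the real collector:

```python
from datetime import datetime, timedelta

LOOKBACK_HOURS = 48  # default from the text


def run_collector(run_time, source_rows, curated_ids, lookback_hours=LOOKBACK_HOURS):
    """Return the ids newly curated by one run; adds them to curated_ids."""
    scan_start = run_time - timedelta(hours=lookback_hours)
    new_ids = []
    for row in source_rows:
        visible = row["ArrivedAt"] <= run_time             # record has landed
        in_window = scan_start <= row["EventTime"] <= run_time
        if visible and in_window and row["Id"] not in curated_ids:
            curated_ids.add(row["Id"])
            new_ids.append(row["Id"])
    return new_ids


event = datetime(2024, 1, 1, 12, 0)
rows = [
    {"Id": "a", "EventTime": event, "ArrivedAt": event + timedelta(minutes=1)},
    {"Id": "b", "EventTime": event, "ArrivedAt": event + timedelta(minutes=25)},  # late record
]
curated = set()
first = run_collector(datetime(2024, 1, 1, 12, 10), rows, curated)
second = run_collector(datetime(2024, 1, 1, 13, 10), rows, curated)
third = run_collector(datetime(2024, 1, 1, 14, 10), rows, curated)
```

The first run sees only record a; the second run re-scans the same window, picks up the late record b, and skips a because it is already curated; the third run adds nothing.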

KQL-native validation

BuildFdaInteractionsKql(lookback, windowSec) reproduces this join in pure KQL. Use it for validation, or as a fallback for spot-checking the notebook's output without re-running the notebook. The notebook remains authoritative; the function is only for cross-checking. See KQL functions.
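
Once both result sets are exported, a cross-check might be as simple as the sketch below. It assumes the KQL function's output exposes the same InteractionId and MatchConfidence columns as the curated table, which is not confirmed here; diff_curated is an illustrative helper, not part of the collector.

```python
def diff_curated(notebook_rows, kql_rows):
    """Compare two curated result sets by InteractionId and MatchConfidence."""
    nb = {r["InteractionId"]: r["MatchConfidence"] for r in notebook_rows}
    kq = {r["InteractionId"]: r["MatchConfidence"] for r in kql_rows}
    return {
        "only_in_notebook": sorted(nb.keys() - kq.keys()),
        "only_in_kql": sorted(kq.keys() - nb.keys()),
        "confidence_mismatch": sorted(
            i for i in nb.keys() & kq.keys() if nb[i] != kq[i]
        ),
    }


notebook_rows = [
    {"InteractionId": "i1", "MatchConfidence": "Exact"},
    {"InteractionId": "i2", "MatchConfidence": "Windowed"},
    {"InteractionId": "i3", "MatchConfidence": "Unmatched"},
]
kql_rows = [
    {"InteractionId": "i1", "MatchConfidence": "Exact"},
    {"InteractionId": "i2", "MatchConfidence": "Exact"},
    {"InteractionId": "i4", "MatchConfidence": "Unmatched"},
]
report = diff_curated(notebook_rows, kql_rows)
```

Any non-empty list in the report is a prompt to investigate, with the notebook's result taken as authoritative.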