Operations runbook
Deployment checklist
- (One-time, tenant admin) Enable browser sign-in: `UserDelegated` auth uses a public client app that must exist in the tenant. The least-friction option is to have an admin instantiate the Azure CLI well-known app once, after which `config.json` `ClientId` can stay `null`:

  ```powershell
  Connect-MgGraph -Scopes "Application.ReadWrite.All"
  New-MgServicePrincipal -AppId "04b07795-8ddb-461a-bbee-02f9e1bf7b46"  # Azure CLI
  ```

  The Azure CLI app carries broad pre-consented delegated permissions (Fabric/Power BI, ARM, Kusto), so nothing else is required. Without admin help, register your own public-client app (Allow public client flows = Yes, `http://localhost` redirect) and set its id in `config.json` `ClientId`. A missing or unprovisioned app surfaces as `AADSTS700016`. Not needed for ServicePrincipal or ManagedIdentity runners.

1. Sign in and provision: `$TenantId = Connect-FDAObservability -AuthMethod UserDelegated` (browser), then `Initialize-FDAObservability -TenantId $TenantId`. The Workspace / Eventhouse / Database default to `config.json` (`WorkspaceName`, `EventhouseName`, `DatabaseName`) and are created if missing. Override with `-WorkspaceName` / `-EventhouseName` / `-DatabaseName`, or target existing ids with `-WorkspaceId` / `-EventhouseId`.
2. `Initialize-FDAObservability` also creates tables, ingestion mappings, update policies, retention policies, and seed levels. Idempotent.
3. Wire the proxy: replace every direct call to the FDA published endpoint with `Invoke-FDAQuery`. App teams should not call FDA directly.
4. Enable Workspace Monitoring on the underlying semantic model (Power BI workspace settings). Direct the export to the same Eventhouse, or to an Eventhouse you can query from KQL. Then point downstream telemetry into the `FDAExecutionsRaw` table via a scheduled `.set-or-append` from the Workspace Monitoring source table.
5. Configure governance sync: schedule `Sync-FDAGovernanceLog` (see `examples/05-scheduled-governance-sync.ps1`). Run hourly to stay within the M365 audit feed's 7-day retention window.
6. Set retention. Defaults: 90 d on `FDAInteractions`, `FDAExecutions`, `FDALogEvents`; 365 d on `FDAAuthEvents` and `FDACostMetering`. Override per-table via the schema file or `Set-FDAObservabilityConfig -RetentionDays`.
7. Set min-level: `Set-FDAObservabilityConfig -MinLevel Information` (default). Use `Verbose` only in dev.
8. Tune the rate table for cost estimation: `Set-FDAObservabilityConfig -CapacityRates @{ TokensPerCU = 1000; USDPerCU = 0.18; Version = 'v1' }`.
9. Run `Test-FDAObservability`: every check should return Pass.
Monitoring
Wire a Fabric Activator or Azure Monitor alert against these KQL queries (snippets; full library in Schema/08-sample-queries.kql):
```kql
// Error rate spike
FDAInteractions
| where Timestamp > ago(15m)
| summarize Total = count(), Errors = countif(Status == 'Error')
| extend ErrRate = todouble(Errors) / iff(Total == 0, 1.0, todouble(Total))
| where ErrRate > 0.05

// Latency regression
FDAInteractions
| where Timestamp > ago(15m)
| summarize p95 = percentile(LatencyMs, 95)
| where p95 > 8000

// Auth failure spike
FDAAuthEvents
| where Timestamp > ago(15m) and Outcome != 'Success'
| summarize Failures = count()
| where Failures > 10

// Token-cost surge
FDACostMetering
| where Timestamp > ago(1h)
| summarize EstUSD = sum(EstimatedCostUSD)
| where EstUSD > 50
```
Daily ops digest: schedule `New-FDAObservabilityReport -Type DailyOps -OutFile ...` and email the markdown.
Capturing M365 Copilot-originated calls
Copilot calls the FDA published endpoint directly — the NL question is not visible to this module for those calls. Use the correlation pattern instead:
- Workspace Monitoring on the semantic model captures the executed DAX, user, ActivityID, timestamp, duration.
- `Sync-FDAGovernanceLog` captures the Copilot operation (`AnalyzedByExternalApplication` / `GetDataAgent`) with consent, RLS, client app.
- Join `FDAExecutions` and `FDAAuthEvents` on `CorrelationId` (M365 audit's `CorrelationId` matches the Workspace Monitoring ActivityID for the same request).
You will have user → DAX → result → consent end-to-end. You will not have the natural-language question text — that lives only with Copilot.
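As an illustration of the join described above, the following is a minimal sketch run through `Search-FDALog -KQL`. It assumes that both `FDAExecutions` and `FDAAuthEvents` expose `Timestamp` and `CorrelationId` columns; the one-day window and the row cap are arbitrary choices, not module defaults.

```powershell
# Sketch only: column names other than CorrelationId are assumptions about the schema.
# Joins executed queries to governance/auth events for the same request.
Search-FDALog -KQL @'
FDAExecutions
| where Timestamp > ago(1d)
| join kind=inner (
    FDAAuthEvents
    | where Timestamp > ago(1d)
) on CorrelationId
| take 100
'@
```

Bounding both sides of the join by time keeps the query cheap; widen the window if governance sync runs less often than hourly.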
Common operator queries
```powershell
# Tail the last hour of warnings or worse.
Search-FDALog -MinLevel Warning -Last 1h

# Find a specific user's failures today.
Search-FDALog -UserPrincipalName 'p@m.com' -Last 1d -KQL @'
FDAInteractions
| where Timestamp > ago(1d) and UserPrincipalName == "p@m.com" and Status != "Success"
'@

# Full timeline for one interaction (uses helper function from 06-functions.kql).
Search-FDALog -KQL "GetInteractionTimeline('<interactionId>')"

# Custom-level filter (e.g., Audit).
Search-FDALog -Table FDALogEvents -MinLevel 'Audit' -Last 7d
```
Retention tuning
Default retention is encoded in `Schema/05-retention-policies.kql`. To override at runtime:

```powershell
Set-FDAObservabilityConfig -Notes 'extend FDAInteractions to 180d'   # writes config history
# Then alter via KQL admin command:
Invoke-KustoManagementCommand -Command '.alter-merge table FDAInteractions policy retention softdelete = 180d recoverability = enabled'
```

(Or re-run `Initialize-FDAObservability` after editing the schema file.)
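To confirm that an override took effect, the effective policy can be read back. This is a small sketch that assumes `Invoke-KustoManagementCommand` returns the command output to the pipeline; `.show table ... policy retention` is a standard Kusto management command.

```powershell
# Read back the retention policy currently applied to the table.
# Assumption: the cmdlet emits the result rows so they can be inspected or piped.
Invoke-KustoManagementCommand -Command '.show table FDAInteractions policy retention'
```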
Schema migration
Adding a column to a curated table:
- Edit `Schema/02-create-tables.kql` to add the column with `.create-merge`.
- Edit `Schema/04-update-policies.kql` so `ExpandFDA*Raw()` projects the new column.
- Re-run `Initialize-FDAObservability`; both files are idempotent.
The raw landing rows already contain everything in the Payload dynamic column, so historical rows reshape into the new column on next ingest. To backfill historical curated rows, re-run the update-policy projection on the raw table:
Token & secret hygiene
- Tokens are kept only in module memory and cleared by `Disconnect-FDAObservability`.
- For SP auth, prefer certificate-based auth (`-Certificate $cert`) over secret-based.
- For MI on App Service / Functions, the module honors `IDENTITY_ENDPOINT` / `IDENTITY_HEADER`; no extra config required.
- The disk spool at `~/.fda-observability/spool/` holds records you were about to ingest. If a caller used `-PreservePII`, those records contain raw PII. Lock the spool directory to the runner's user.
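One way to lock the spool directory is sketched below. It is not part of the module: it assumes the spool lives at the default path under the runner's home directory and uses the operating system's own tools (`icacls` on Windows, `chmod` elsewhere).

```powershell
# Sketch only: restrict the spool directory to the account running this script.
$spool = Join-Path $HOME '.fda-observability/spool'
New-Item -ItemType Directory -Force -Path $spool | Out-Null

if ($env:OS -eq 'Windows_NT') {
    # Drop inherited ACEs, then grant full control to the current user only.
    $account = "$env:USERDOMAIN\$env:USERNAME"
    icacls $spool /inheritance:r /grant:r "${account}:(OI)(CI)F" | Out-Null
} else {
    # Owner-only read/write/execute on Linux and macOS.
    chmod 700 $spool
}
```

Run this as the same account that runs the module. On Windows, removing inheritance also removes access for SYSTEM and Administrators, so adjust the grants if backup or monitoring agents need to read the directory.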
Disaster recovery
| Failure | Behavior |
|---|---|
| Eventhouse unreachable | Records spool to disk; drained by Restore-FDASpool on next connect |
| Token refresh fails | Subsequent calls throw; reconnect re-installs the provider |
| Update policy throws on a record | The raw row is still ingested; the curated row is skipped. Inspect via .show ingestion failures |
| Cluster throttles | Exponential backoff, then spool |
| Process dies | In-memory buffer is lost; spool is on disk so anything written there is recovered. Recommend BatchMaxEvents = 50 for higher-criticality workloads |