WAI — Watcher AI.

Your application shouldn't have to manage its own errors.

You won't manage the errors. WAI will.

Event → fingerprint → AI doctor → playbook → resolution. Your app just sends the event; WAI remembers, classifies, and resolves the rest.

WAI aggregates every exception, slow query, cost spike, and error trace from your production systems in one place. AI-powered doctors classify events, recall past resolution knowledge, and apply automatic playbooks to known issues. Unresolvable issues are escalated as Plane tickets or pushed to an operator via Slack/Telegram, so resolution knowledge accumulates in one place rather than being scattered across individual apps.


The Problem

Today every product team manages its own errors. The same timeout is caught in 5 different ways across 5 different services, falls into 5 different logs, and 5 different operators solve the same problem for the 5th time. Organisations don't accumulate memory; teams don't learn from each other's experience.

Sentry collects exceptions but doesn't fix them. Datadog accumulates metrics but doesn't run playbooks. Grafana lights up red on dashboards but can't decide who resolves the incident or when. Existing tools pile up data while leaving the resolution flow unmanaged.

WAI fills exactly that gap. Event → fingerprint → find similar cases → apply known fix → if inapplicable, ask the doctor → if the doctor can't resolve, open a ticket. Resolution knowledge accumulates centrally; the same problem is handled automatically the second time around.
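
The flow above can be sketched in a few lines of Python. This is a minimal illustration rather than WAI's implementation: the `Event` fields, the digit-masking rule inside `fingerprint`, the in-memory `known_fixes` dictionary, and the `doctor` callable are all assumptions made for the example.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape, for illustration only.
    service: str
    error_type: str
    message: str


def fingerprint(event: Event) -> str:
    """Hash the error type plus the message with volatile digits masked.

    The service name is left out on purpose so the same failure seen in
    different services maps to one fingerprint (an assumption of this sketch).
    """
    normalised = re.sub(r"\d+", "<n>", event.message.lower())
    raw = f"{event.error_type}|{normalised}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def resolve(event: Event, known_fixes: dict[str, str], doctor) -> str:
    """Try a known fix first, then the doctor, then fall back to a ticket."""
    fp = fingerprint(event)
    if fp in known_fixes:
        return f"playbook:{known_fixes[fp]}"
    suggestion = doctor(event)  # returns a playbook name or None
    if suggestion is not None:
        known_fixes[fp] = suggestion  # remembered for the next occurrence
        return f"doctor:{suggestion}"
    return "ticket"
```

With an empty `known_fixes` and a doctor that answers `"restart_pod"` for timeouts, the first timeout event goes through the doctor and the second one, even from another service and with a different duration in the message, is served straight from the stored fix.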

WAI in 4 Sentences

SDK sends

Integrate via Python/TypeScript package; FastAPI middleware, SQLAlchemy hook, React ErrorBoundary, and fetch interceptor capture events automatically.

WAI ingests

REST + OTLP (OpenTelemetry gRPC/HTTP) ingest. 6 consumers over NATS JetStream handle parallel persist + classify.

Doctor thinks

Anthropic Claude-powered specialist doctors (network, db, auth, llm, infra, frontend, queue, runtime, cost, triage). They learn from past resolution knowledge via RAG and move through a lifecycle: shadow → advisory → acting.

Action taken

Known playbook (canonical JSON) runs automatically (pod restart, scale, notify); unresolvable issues are escalated as Plane tickets or pushed to an operator via Slack/Telegram.
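
"Canonical JSON" usually means a deterministic serialisation, so that two equal playbooks always produce identical bytes and can be hashed or compared reliably. The sketch below shows one simple way to get that with the standard library. The playbook field names (`name`, `steps`, `action`, `target`, `channel`) are hypothetical, and WAI's actual schema is not shown here. A full scheme such as RFC 8785 also pins down number formatting, which this sketch does not attempt.

```python
import hashlib
import json


def canonical_json(obj) -> str:
    """Serialise with sorted keys and no insignificant whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def playbook_digest(playbook: dict) -> str:
    """Content hash of a playbook that does not depend on key order."""
    return hashlib.sha256(canonical_json(playbook).encode("utf-8")).hexdigest()


# Hypothetical playbook; the field names are assumptions for illustration.
playbook = {
    "name": "restart-on-timeout",
    "steps": [
        {"action": "pod_restart", "target": "api"},
        {"action": "notify", "channel": "slack"},
    ],
}
```

Reordering the keys of `playbook` leaves `playbook_digest(playbook)` unchanged, which is the property a canonical form is meant to provide.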

Key capabilities

🔭

OpenTelemetry first-class

Industry-standard OTLP receiver (gRPC :4317 + HTTP :4318). If your OTEL SDK is already set up, WAIClient.from_otel(tracer_provider) connects to WAI without double instrumentation. Semantic conventions (HTTP/DB/GenAI/K8s/messaging) are first-class citizens.

🛡

Enterprise multi-tenancy

Mandatory PostgreSQL Row-Level Security; 3 isolation modes per tenant (shared / schema / database). For teams working towards SOC 2 / ISO 27001: immutable audit log, scope-aware admin permissions (super-admin separate), tenant suspend + soft delete.

💰

Cost intelligence (FinOps)

LLM token consumption, Kubernetes CPU/RAM/GPU, R2/S3 storage, Anthropic + OpenRouter + Cloudflare billing — all in a single hypertable. 3-sigma anomaly detection auto-identifies 'this deploy caused a cost spike'. GPU burn detection (DCGM). Per-tenant monthly budget + overage alerts.
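
As an illustration of the 3-sigma rule mentioned above, here is a minimal sketch that flags a cost sample lying more than three sample standard deviations above the mean of recent history. The function name, the one-sided check (spikes only), and the handling of short or flat histories are assumptions of this example, not a description of WAI's detector.

```python
from statistics import mean, stdev


def is_cost_anomaly(history: list[float], latest: float, sigmas: float = 3.0) -> bool:
    """Return True if `latest` exceeds mean(history) + sigmas * stdev(history)."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sd = stdev(history)  # sample standard deviation
    if sd == 0:
        return latest > mu  # flat history: any increase stands out
    return latest > mu + sigmas * sd
```

For a history of `[10.0, 11.0, 9.0, 10.5, 9.5]` the mean is 10.0 and the sample standard deviation is about 0.79, so the threshold sits near 12.37: a reading of 14.0 is flagged, a reading of 11.0 is not.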

🤖

AI doctor lifecycle

10 specialist doctors (network/db/auth/llm/infra/frontend/queue/runtime/triage/cost). Each doctor operates at 3 maturity levels: shadow (log only) → advisory (UI suggestion) → acting (automatic playbook). Promotion and demotion are automatic based on judge score. Response JSON schema enforced; evidence_event_id validated in DB (anti-hallucination).
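
The lifecycle and the evidence check can be sketched as follows. The thresholds (`promote_at`, `demote_below`), the judge score range of 0 to 1, the `diagnosis` field, and the use of an in-memory set in place of a database lookup are all assumptions for illustration. Only the three level names and the `evidence_event_id` field come from the description above.

```python
LEVELS = ["shadow", "advisory", "acting"]


def next_level(current: str, judge_score: float,
               promote_at: float = 0.9, demote_below: float = 0.6) -> str:
    """Move one step up or down the lifecycle based on a judge score in [0, 1]."""
    i = LEVELS.index(current)
    if judge_score >= promote_at:
        i = min(i + 1, len(LEVELS) - 1)  # already at the top: stay there
    elif judge_score < demote_below:
        i = max(i - 1, 0)  # already at the bottom: stay there
    return LEVELS[i]


def validate_response(response: dict, known_event_ids: set[str]) -> bool:
    """Reject a doctor response that is incomplete or cites unknown evidence.

    `known_event_ids` stands in for a database lookup in this sketch.
    """
    required = {"diagnosis", "evidence_event_id"}
    if not required.issubset(response):
        return False
    return response["evidence_event_id"] in known_event_ids
```

Checking the cited event against stored events is what keeps a model from grounding a diagnosis in an event that never happened.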

💬

ChatOps & ticket integration

Slack + Telegram (TR primary) + Microsoft Teams (Incoming Webhook), with inline buttons and slash commands. Automatic issue escalation to Plane with bi-directional sync (Plane close → incident resolved). Resolution comments automatically feed the knowledge base, so the doctor already has that knowledge when the next incident arrives.

Architecture

Layer 6 — CLI + Operator Panel (wai-cli, wai-ui)

Layer 5 — UI — React + Mantine + Recharts

Layer 4 — Classifier + Doctor Pool + Playbook Executor

Layer 3 — Data — PostgreSQL + TimescaleDB + pgvector

Layer 2 — Event Bus — NATS JetStream

Layer 1 — SDKs — Python (uais-wai) + TypeScript (@uais/wai)

↓ Outputs

Plane, Slack, Telegram, Grafana, Email

Who is it for?

Profile	Why WAI?
Startup CTO	5+ services, no dedicated operator, solutions scattered everywhere. WAI centralises resolution memory.
DevOps / SRE team	Pager fatigue. Known issues should be resolved automatically; only wake people up when it truly matters.
AI / LLM-focused companies	Token costs and rate-limit errors go unmanaged. WAI's cost-doctor + llm-doctor are built specifically for this.
Enterprise IT	You're running a multi-tenant SaaS and each customer's data must be isolated. RLS + schema/database isolation ready out of the box.
Regulatory compliance (GDPR, SOC 2)	Immutable audit log, scope-aware admin, separate super-admin — compliance reports generated without hassle.

Technology highlights

Python 3.12 + FastAPI, NATS JetStream, PostgreSQL 16 + TimescaleDB + pgvector, OpenTelemetry OTLP gRPC + HTTP, Anthropic Claude (Opus 4.6 + Sonnet 4.5), Kubernetes-native + Helm + KEDA, Plane integration, Slack / Telegram / Teams, Additive-only SDK contract

Why not Sentry/Datadog/Grafana?

WAI is not an observability tool; it is a resolution orchestrator.

Tool	Does	Doesn't
Sentry	Captures exceptions, shows stack traces	Does not apply fixes or open tickets
Datadog	Collects metrics + logs + APM	Doesn't suggest AI-driven fixes or accumulate resolution knowledge
Grafana / Prometheus	Dashboard + alerts	Visual only, no actions taken
PagerDuty	Alert routing + on-call	Cannot attempt to resolve issues itself
WAI	All of the above + AI doctor + playbook executor + institutional memory	Does not replace your existing tools; it sits alongside them, connected via OTLP

Where we are, where we're going

Today (2026 H1)

✓ Round 2: Studio dependency removed, generic library, doctor lifecycle, Plane bridge
✓ Round 3: OpenTelemetry first-class, multi-tenancy hardening (RLS + schema/database isolation), cost intelligence (FinOps + cost-doctor), ChatOps (Slack + Telegram + Teams)

Phase 1.5 (2026 H2)

→ Mattermost / Rocket.Chat / Zulip ChatOps adapters
→ Microsoft Teams two-way bot framework
→ Cost forecasting (ML model: 'expect $X next month')
→ Spot instance / pricing optimisation recommendations
→ AWS Cost Explorer + GCP BigQuery billing + Azure Cost Management

Phase 2 (2027)

◦ Hot-path Go port (ingest gateway only; Python stays everywhere else)
◦ Logs OTLP signal support (currently traces + metrics)
◦ Multi-region deployment
◦ Marketplace UI (playbook + doctor sharing community)

Frequently asked questions

Can I self-host WAI?

Yes. The Helm chart is open (github.com/mehmetulutug/uais-wai/infra/helm/wai) and deploys with a single command: make deploy-r3 TAG=<sha>. Requires Kubernetes 1.27+, the CNPG operator, and NATS JetStream.

Which LLM does it use?

Anthropic Claude (Opus + Sonnet) by default; can switch to other models via OpenRouter routing. A different model can be selected per doctor (e.g. triage-doctor on Haiku, cost-doctor on Opus).

Performance?

No LLM calls in shadow mode — operational cost is zero. Once promoted to acting mode, model costs are tracked by the tenant's own cost-doctor (it monitors its own costs).

Should I drop Sentry?

No. WAI can consume Sentry events (point Sentry's webhook at WAI), or you can send events directly to WAI using our SDK. WAI is not the observation layer; it is the layer that orchestrates the resolution.

Which exception/log types does it support?

All of them. SDK custom event API (source, severity, category, payload); plus standard Tracing + Metrics over OTLP. Semantic conventions are first-class (HTTP, DB, GenAI, Messaging, K8s).
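
To make the four custom event fields concrete, here is a hedged sketch of how such an event could be modelled and serialised on the client side. The `CustomEvent` class, the set of allowed severities, and the example values are assumptions for illustration. Only the field names (source, severity, category, payload) come from the answer above, and this is not the SDK's actual API.

```python
import json
from dataclasses import asdict, dataclass, field

# Assumed severity vocabulary; the real set is not documented here.
ALLOWED_SEVERITIES = {"debug", "info", "warning", "error", "critical"}


@dataclass
class CustomEvent:
    source: str
    severity: str
    category: str
    payload: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Fail early on typos instead of sending an unclassifiable event.
        if self.severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")

    def to_json(self) -> str:
        """Serialise the event as a JSON object with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)


event = CustomEvent(
    source="billing-api",
    severity="error",
    category="db",
    payload={"query_ms": 5200},
)
```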

I run a multi-tenant SaaS — how do I isolate my customers?

Per-tenant isolation_mode: shared (default — RLS) / schema (separate PG schema per tenant) / database (separate PG database per tenant). Upgradeable from the UI; data migration script handles it automatically.
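
One way to picture the three modes is as a mapping from a tenant to the database and schema that hold its rows. The sketch below is purely illustrative: the `IsolationMode` enum, the `Placement` type, and the naming scheme (`wai`, `tenant_<slug>`, `wai_<slug>`) are assumptions, not WAI's actual layout.

```python
from dataclasses import dataclass
from enum import Enum


class IsolationMode(str, Enum):
    SHARED = "shared"      # one schema for everyone, separated by RLS
    SCHEMA = "schema"      # separate PG schema per tenant
    DATABASE = "database"  # separate PG database per tenant


@dataclass(frozen=True)
class Placement:
    database: str
    schema: str


def placement_for(tenant_slug: str, mode: IsolationMode) -> Placement:
    """Return where a tenant's data would live under the assumed naming scheme."""
    if mode is IsolationMode.SHARED:
        return Placement(database="wai", schema="public")
    if mode is IsolationMode.SCHEMA:
        return Placement(database="wai", schema=f"tenant_{tenant_slug}")
    return Placement(database=f"wai_{tenant_slug}", schema="public")
```

Moving a tenant from `shared` to `schema` or `database` changes its placement, which is why an upgrade involves a data migration.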

Ways to try WAI

🚀 Try in 5 minutes 📖 Documentation 💬 Community

Built by a solo developer, tested within the UAIS ecosystem. Production-ready Helm chart + 730+ tests. Open communication, fast iteration.