WAI — Watcher AI.
Your application shouldn't have to manage its own errors.
You won't manage the errors. WAI will.
Event → fingerprint → AI doctor → playbook → resolution. Your app just sends the event; WAI remembers, classifies, and resolves the rest.
WAI aggregates every exception, slow query, cost spike, and error trace from your production systems in one place. AI-powered doctors classify events, recall past resolution knowledge, and apply automatic playbooks to known issues. Unresolvable issues are escalated as Plane tickets or pushed to an operator via Slack/Telegram — resolution knowledge accumulates in one centre rather than scattered across individual apps.
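As a rough illustration of the event → fingerprint step, the sketch below groups events that likely share a root cause by hashing the exception type, a normalised message, and the innermost stack frames. The function name, the masking rule, and the three-frame cutoff are all assumptions made for this example; the page does not describe WAI's actual fingerprinting scheme.

```python
import hashlib
import re


def fingerprint(exc_type: str, message: str, top_frames: list[str]) -> str:
    """Return a short, stable digest for events that look alike (hypothetical scheme)."""
    # Mask volatile tokens (hex addresses, then digit runs) so that
    # "query 8812" and "query 9001" normalise to the same message.
    normalised = re.sub(r"0x[0-9a-fA-F]+|\d+", "<n>", message)
    # Use only the first few frames; the rest tends to vary with the call path.
    key = "\n".join([exc_type, normalised, *top_frames[:3]])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


frames = ["db.py:run", "api.py:get_user"]
a = fingerprint("TimeoutError", "query 8812 timed out after 30s", frames)
b = fingerprint("TimeoutError", "query 9001 timed out after 45s", frames)
c = fingerprint("KeyError", "query 9001 timed out after 45s", frames)
```

Here `a` and `b` collapse to one fingerprint because they differ only in masked numbers, while `c` stays separate because the exception type differs.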
The Problem
WAI in 4 Sentences
SDK sends
Integrate via Python/TypeScript package; FastAPI middleware, SQLAlchemy hook, React ErrorBoundary, and fetch interceptor capture events automatically.
WAI ingests
REST + OTLP (OpenTelemetry gRPC/HTTP) ingest. Six consumers on NATS JetStream persist and classify events in parallel.
Doctor thinks
Anthropic Claude-powered specialist doctors (network, db, auth, llm, infra, frontend, queue, runtime, cost, triage). Doctors learn from past resolution knowledge via RAG and move through a lifecycle: shadow → advisory → acting.
Action taken
Known playbook (canonical JSON) runs automatically (pod restart, scale, notify); unresolvable issues are escalated as Plane tickets or pushed to an operator via Slack/Telegram.
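One way to picture a canonical-JSON playbook and its executor is sketched below. The playbook fields, the handler names, and the use of sorted keys with compact separators as the canonical form are assumptions for illustration; the actual WAI schema and executor are not shown on this page, and this is not a full canonicalisation standard.

```python
import hashlib
import json

# Hypothetical playbook; field names are illustrative, not WAI's schema.
playbook = {
    "fingerprint": "a1b2c3d4",
    "steps": [
        {"action": "restart_pod", "target": "api"},
        {"action": "notify", "channel": "slack"},
    ],
}


def canonical(doc: dict) -> str:
    """Serialise with sorted keys and no insignificant whitespace."""
    return json.dumps(doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def playbook_id(doc: dict) -> str:
    """Derive an identifier that ignores key order in the source document."""
    return hashlib.sha256(canonical(doc).encode("utf-8")).hexdigest()[:12]


# Stub handlers: a real executor would call Kubernetes or a chat API here.
HANDLERS = {
    "restart_pod": lambda step: f"restart pod {step['target']}",
    "scale": lambda step: f"scale {step['target']} to {step['replicas']}",
    "notify": lambda step: f"notify {step['channel']}",
}


def run(doc: dict) -> list[str]:
    """Run steps in order; an unknown action raises KeyError instead of being skipped."""
    return [HANDLERS[step["action"]](step) for step in doc["steps"]]


reordered = {"steps": playbook["steps"], "fingerprint": playbook["fingerprint"]}
log = run(playbook)
```

Because the serialisation is canonical, `playbook` and `reordered` yield the same identifier, which makes it straightforward to deduplicate or audit playbooks by hash.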
Key capabilities
OpenTelemetry first-class
Industry-standard OTLP receiver (gRPC :4317 + HTTP :4318). If your OTEL SDK is already set up, WAIClient.from_otel(tracer_provider) connects to WAI without double instrumentation. Semantic conventions (HTTP/DB/GenAI/K8s/messaging) are first-class citizens.
Enterprise multi-tenancy
Mandatory PostgreSQL Row-Level Security; 3 isolation modes per tenant (shared / schema / database). For teams working towards SOC 2 / ISO 27001: immutable audit log, scope-aware admin permissions (super-admin separate), tenant suspend + soft delete.
Cost intelligence (FinOps)
LLM token consumption, Kubernetes CPU/RAM/GPU, R2/S3 storage, Anthropic + OpenRouter + Cloudflare billing — all in a single hypertable. 3-sigma anomaly detection auto-identifies 'this deploy caused a cost spike'. GPU burn detection (DCGM). Per-tenant monthly budget + overage alerts.
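The 3-sigma rule mentioned above can be sketched in a few lines. This is a minimal example, assuming a plain list of recent daily costs as the baseline and a one-sided check (only spikes are flagged); the function name, the window, and the sample figures are invented for illustration and do not come from WAI.

```python
from statistics import mean, stdev


def is_cost_anomaly(history: list[float], latest: float, sigmas: float = 3.0) -> bool:
    """Flag `latest` if it exceeds the historical mean by more than `sigmas` sample std devs."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sd = stdev(history)
    if sd == 0:
        return latest > mu  # flat baseline: any increase stands out
    return latest > mu + sigmas * sd


# Illustrative daily costs in dollars (made-up numbers).
daily_cost = [101.0, 98.5, 103.2, 99.8, 100.4, 102.1, 97.9]
spike = is_cost_anomaly(daily_cost, 140.0)
normal = is_cost_anomaly(daily_cost, 104.0)
```

Tying a flagged value to a specific deploy would additionally require comparing the spike's timestamp against deploy events, which this sketch leaves out.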
AI doctor lifecycle
10 specialist doctors (network/db/auth/llm/infra/frontend/queue/runtime/triage/cost). Each doctor operates at 3 maturity levels: shadow (log only) → advisory (UI suggestion) → acting (automatic playbook). Promotion and demotion are automatic based on judge score. Response JSON schema enforced; evidence_event_id validated in DB (anti-hallucination).
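The maturity ladder and the evidence check described above might look roughly like the following. The promotion and demotion thresholds, the assumption that the judge score lies in [0, 1], and the in-memory set standing in for the database lookup are all hypothetical; only the three level names and the `evidence_event_id` field come from the text.

```python
LEVELS = ["shadow", "advisory", "acting"]


def next_level(current: str, judge_score: float,
               promote_at: float = 0.9, demote_at: float = 0.6) -> str:
    """Move at most one step up or down the ladder based on the judge score."""
    i = LEVELS.index(current)
    if judge_score >= promote_at:
        i = min(i + 1, len(LEVELS) - 1)  # already at the top: stay there
    elif judge_score < demote_at:
        i = max(i - 1, 0)  # already at the bottom: stay there
    return LEVELS[i]


def evidence_is_valid(response: dict, known_event_ids: set[str]) -> bool:
    """Reject a doctor response that cites an event that does not exist."""
    # `known_event_ids` stands in for a database lookup in this sketch.
    return response.get("evidence_event_id") in known_event_ids
```

A doctor in `advisory` with a score between the two thresholds stays where it is, so a single middling evaluation neither promotes nor demotes it.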
ChatOps & ticket integration
Slack, Telegram (TR primary), and Microsoft Teams (Incoming Webhook), with inline buttons and slash commands. Issues escalate automatically to Plane with bi-directional sync (Plane close → incident resolved). Resolution comments feed the knowledge base automatically, so the doctor arrives at the next incident having already learned from the last one.
Architecture
Who is it for?
| Profile | Why WAI? |
|---|---|
| Startup CTO | 5+ services, no dedicated operator, solutions scattered everywhere. WAI centralises resolution memory. |
| DevOps / SRE team | Pager fatigue. Known issues should be resolved automatically; only wake people up when it truly matters. |
| AI / LLM-focused companies | Token costs and rate-limit errors go unmanaged. WAI's cost-doctor + llm-doctor are built specifically for this. |
| Enterprise IT | You're running a multi-tenant SaaS and each customer's data must be isolated. RLS + schema/database isolation ready out of the box. |
| Regulatory compliance (GDPR, SOC 2) | Immutable audit log, scope-aware admin, separate super-admin — compliance reports generated without hassle. |
Technology highlights
Why not Sentry/Datadog/Grafana?
WAI is not an observability tool; it is a resolution orchestrator.
| Tool | Does | Doesn't |
|---|---|---|
| Sentry | Captures exceptions, shows stack traces | Does not apply fixes or open tickets |
| Datadog | Collects metrics + logs + APM | Does not suggest AI-driven fixes or accumulate resolution knowledge |
| Grafana / Prometheus | Dashboard + alerts | Visual only, no actions taken |
| PagerDuty | Alert routing + on-call | Cannot attempt to resolve issues itself |
| WAI | All of the above + AI doctor + playbook executor + institutional memory | Is not meant to replace your existing tools; it sits alongside them, connected via OTLP |
Where we are, where we're going
Today (2026 H1)
- ✓ Round 2: Studio dependency removed, generic library, doctor lifecycle, Plane bridge
- ✓ Round 3: OpenTelemetry first-class, multi-tenancy hardening (RLS + schema/database isolation), cost intelligence (FinOps + cost-doctor), ChatOps (Slack + Telegram + Teams)
Phase 1.5 (2026 H2)
- → Mattermost / Rocket.Chat / Zulip ChatOps adapters
- → Microsoft Teams two-way bot framework
- → Cost forecasting (ML model: 'expect $X next month')
- → Spot instance / pricing optimisation recommendations
- → AWS Cost Explorer + GCP BigQuery billing + Azure Cost Management
Phase 2 (2027)
- ◦ Hot-path Go port (ingest gateway only; Python stays everywhere else)
- ◦ Logs OTLP signal support (currently traces + metrics)
- ◦ Multi-region deployment
- ◦ Marketplace UI (playbook + doctor sharing community)
Frequently asked questions
Can I self-host WAI?
Which LLM does it use?
Performance?
Should I drop Sentry?
Which exception/log types does it support?
I run a multi-tenant SaaS. How do I isolate my customers?
Ways to try WAI
Built by a solo developer and tested within the UAIS ecosystem. Ships with a production-ready Helm chart and 730+ tests. Open communication, fast iteration.
