# Runbook

Operational runbook: incidents, recovery, and monitoring for the 1C MCP Gateway.

## Prerequisites

- On-call access to the gateway host, `.data` (`ONEC_MCP_DATA_DIR`), and systemd/Docker/k8s logs.
- Backup procedure: [PRODUCTION_STORAGE.md](./PRODUCTION_STORAGE.md).
- Contacts: platform admin, 1C admin, security.

## Step-by-step procedures

### INC-001: Gateway down / unhealthy

**Symptoms:** `/healthz` fail, MCP clients disconnected, 502 from proxy.

1. Check process:
   ```bash
   systemctl status onec-mcp-gateway
   # or: docker compose ps && docker compose logs gateway --tail=100
   ```
2. Verify port: `curl -s http://127.0.0.1:3000/healthz`
3. Disk space on data volume: `df -h "$ONEC_MCP_DATA_DIR"`
4. Restart:
   ```bash
   systemctl restart onec-mcp-gateway
   # or: docker compose restart gateway
   ```
5. If crash loop — run foreground debug:
   ```bash
   cd /opt/onec-mcp-gateway && ONEC_MCP_DATA_DIR=/var/lib/onec-mcp-gateway node dist/apps/gateway/src/http.js
   ```

**Escalate:** corrupt `.data` → restore backup (INC-003).
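
Steps 1 to 3 can be bundled into a read-only triage helper. The following is a minimal sketch, not something shipped with the gateway: the function names and the 90% disk threshold are assumptions, while the service name, port 3000, and the `/var/lib/onec-mcp-gateway` default are taken from the steps above.

```bash
#!/usr/bin/env bash
# Read-only triage helpers for INC-001 (hypothetical names, not part of the gateway).

# Print the used-space percentage (digits only) of the filesystem holding $1.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Succeed when a usage percentage ($1) is at or above a threshold ($2, assumed default 90).
is_disk_critical() {
  [ "$1" -ge "${2:-90}" ]
}

# Run the checks from steps 1-3 and print one line per finding. Changes nothing.
triage_gateway() {
  local data_dir="${ONEC_MCP_DATA_DIR:-/var/lib/onec-mcp-gateway}"
  local pct
  if systemctl is-active --quiet onec-mcp-gateway; then
    echo "process: active"
  else
    echo "process: NOT active"
  fi
  if curl -sf --max-time 5 http://127.0.0.1:3000/healthz >/dev/null; then
    echo "healthz: ok"
  else
    echo "healthz: FAIL"
  fi
  pct=$(disk_used_pct "$data_dir")
  if is_disk_critical "$pct"; then
    echo "disk: ${pct}% used (critical)"
  else
    echo "disk: ${pct}% used"
  fi
}
```

Source the file and run `triage_gateway` before deciding on a restart; on Docker or k8s hosts the `systemctl` line would need to be swapped for the matching status command.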

### INC-002: MCP auth failures spike

**Symptoms:** 401/403 on `/mcp`, audit `auth.failed`.

1. Identify token: audit `tokenPrefix`, client id.
2. Revoke compromised: `POST /api/mcp-tokens/:id/revoke`
3. Issue new scoped token; update client secrets (CI, IDE).
4. Check clock skew (JWT/OIDC if enabled).
5. Review `allowedClients` and expired tokens.
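
For step 1, failures can be grouped by token prefix straight from the audit log. This sketch assumes the log is `.data/audit.jsonl` with one compact JSON object per line, and that failed events contain the literal strings `"auth.failed"` and `"tokenPrefix":"..."` with no spaces around the colon; verify against a real log line before relying on it. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# Count auth.failed events per tokenPrefix in an audit JSONL file ($1).
# ASSUMPTION: compact JSON, one event per line, field spelled "tokenPrefix".
auth_failures_by_prefix() {
  grep -F '"auth.failed"' "$1" \
    | grep -o '"tokenPrefix":"[^"]*"' \
    | cut -d'"' -f4 \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'   # normalize to "<count> <prefix>"
}
```

Usage: `auth_failures_by_prefix "$ONEC_MCP_DATA_DIR/audit.jsonl" | head`. A single prefix dominating the list points at one misconfigured or compromised client; a flat distribution suggests a gateway-side cause such as clock skew.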

### INC-003: Data corruption / bad migration

**Symptoms:** startup error on `StoreMigrationRunner`, invalid JSON in `.data`.

1. **Stop** gateway.
2. List backups: `ls "$ONEC_MCP_DATA_DIR/backups/"` (`GET /api/system/backups` only answers while the gateway process is running).
3. Restore:
   ```bash
   curl -X POST http://127.0.0.1:3000/api/system/backups/<backupId>/restore
   ```
4. Start gateway; verify `GET /api/system/storage-meta`.
5. Post-incident: identify the root cause in the migration version and fix forward.
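
Before restoring, it can help to take a cold copy of the current data directory so the corrupted state remains available for root-cause analysis. This is a sketch under assumptions: the helper name and archive naming are hypothetical, and it should only run after step 1, while the gateway is stopped.

```bash
#!/usr/bin/env bash
# Archive a data directory ($1) into a timestamped tarball under $2 and print its path.
# Hypothetical helper; run it only while the gateway is stopped.
snapshot_data_dir() {
  local src=$1 dest_dir=$2
  local stamp out
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  out="$dest_dir/onec-data-$stamp.tar.gz"
  # -C keeps the archive paths relative to the parent of the data directory.
  tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")" && printf '%s\n' "$out"
}
```

Usage: `snapshot_data_dir "$ONEC_MCP_DATA_DIR" /var/tmp`. The archive contains the same secrets as `.data`, so it needs the same access restrictions.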

### INC-004: 1C connection degraded

**Symptoms:** `check_connection_health` fail, OData timeouts.

1. Test from gateway host:
   ```bash
   curl -sS -o /dev/null -w "%{http_code}" "https://1c-host/base/odata/standard.odata/\$metadata"
   ```
2. Check 1C cluster, web server, cert expiry.
3. Verify credentials (`passwordEnv` / vault) not rotated without update.
4. Temporary: mark profile unhealthy in comms; do not switch prod agents to dev profile without review.
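
The probe in step 1 prints only a status code; the sketch below maps that code to a coarse diagnosis and adds a certificate expiry check for step 2. Function names and the timeout are assumptions. The one fixed fact is that curl reports `000` for `%{http_code}` when it received no HTTP response at all.

```bash
#!/usr/bin/env bash
# Map a curl %{http_code} value ($1) to a coarse diagnosis (hypothetical labels).
classify_odata_status() {
  case "$1" in
    000)     echo "unreachable" ;;   # no HTTP response: DNS, TCP, TLS, or timeout
    2??)     echo "ok" ;;
    401|403) echo "auth" ;;          # server reachable; credentials missing or rejected
    5??)     echo "server-error" ;;
    *)       echo "unexpected" ;;
  esac
}

# Probe $1/$metadata with a hard timeout and print "<code> <diagnosis>".
# $1 is the OData base URL, e.g. https://1c-host/base/odata/standard.odata
probe_odata() {
  local code
  code=$(curl -sS -o /dev/null --max-time 10 -w '%{http_code}' "$1/\$metadata" 2>/dev/null)
  echo "$code $(classify_odata_status "$code")"
}

# Print the notAfter date of the certificate served by host $1 on port ${2:-443}.
cert_not_after() {
  openssl s_client -connect "$1:${2:-443}" -servername "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate
}
```

An `auth` result from an unauthenticated probe still confirms that the web server is up, which narrows the problem to credentials (step 3) rather than the cluster.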

### INC-005: Rate limit / quota exceeded

**Symptoms:** HTTP 429, billing alerts.

1. Identify actor from audit / usage events.
2. Adjust token `rateLimit` or plan upgrade ([BILLING_AND_LIMITS.md](./BILLING_AND_LIMITS.md)).
3. Throttle abusive automation; block token if abuse.
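
When the offending automation is one of your own scripts, throttling usually means adding backoff on HTTP 429. The following is a generic exponential-backoff sketch, not gateway behavior: the function names, the 1-second base, and the 60-second cap are assumptions, and it does not read any `Retry-After` header.

```bash
#!/usr/bin/env bash
# Delay in seconds before retry number $1 (0-based): base * 2^attempt, capped.
backoff_delay() {
  local attempt=$1 base=${2:-1} cap=${3:-60}
  local delay
  if [ "$attempt" -gt 20 ]; then
    delay=$cap                         # avoid shifting past a sane range
  else
    delay=$(( base * (1 << attempt) ))
  fi
  if [ "$delay" -gt "$cap" ]; then delay=$cap; fi
  echo "$delay"
}

# Run a command that prints an HTTP status code; retry while it prints 429.
# $1 is the maximum number of attempts, the rest is the command.
call_with_backoff() {
  local max=$1; shift
  local attempt=0 code
  while :; do
    code=$("$@")
    if [ "$code" != "429" ]; then echo "$code"; return 0; fi
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max" ]; then echo "$code"; return 1; fi
    sleep "$(backoff_delay "$((attempt - 1))")"
  done
}
```

Usage: `call_with_backoff 5 curl -s -o /dev/null -w '%{http_code}' "$URL"`, where `$URL` is whatever endpoint the automation calls.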

### INC-006: Suspected secret leak

1. Revoke all MCP tokens for the org; rotate the `ONEC_MCP_HTTP_TOKEN` bootstrap token.
2. Rotate 1C OData/HTTP service credentials.
3. Review `.data/audit.jsonl` export (redacted) for exfil patterns.
4. Notify security; reference [THREAT_MODEL.md](./THREAT_MODEL.md).
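
For the redacted export in step 3, a small filter can mask obvious secrets before anything leaves the host. This is a sketch with a hypothetical function name; it only covers `Bearer` values and `ONEC_MCP_HTTP_TOKEN=` assignments, and it is not a substitute for reviewing the export by hand.

```bash
#!/usr/bin/env bash
# Read text on stdin and mask bearer tokens and ONEC_MCP_HTTP_TOKEN assignments.
# Hypothetical helper; other secret formats pass through unchanged.
redact_secrets() {
  sed -E \
    -e 's#(Bearer )[A-Za-z0-9._~+/=-]+#\1[REDACTED]#g' \
    -e 's#(ONEC_MCP_HTTP_TOKEN=)[^[:space:]]+#\1[REDACTED]#g'
}
```

Usage: `redact_secrets < "$ONEC_MCP_DATA_DIR/audit.jsonl" > audit.redacted.jsonl`.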

## Copy-ready monitoring

Health probes (k8s/systemd already use these):

```bash
curl -sf http://127.0.0.1:3000/healthz
curl -sf http://127.0.0.1:3000/readyz
```
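
After a restart the probes may fail for a few seconds, so a bounded retry loop is often more useful than a single call. A minimal sketch, with a hypothetical function name and retry counts chosen for illustration:

```bash
#!/usr/bin/env bash
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry_until() {
  local attempts=$1 delay=$2; shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  return 1
}
```

Usage: `retry_until 30 2 curl -sf http://127.0.0.1:3000/readyz` waits up to about a minute for readiness and exits non-zero otherwise, which makes it usable in restart scripts.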

Optional OTEL:

```bash
export ONEC_OTEL_ENABLED=1
export ONEC_OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
```

Scheduled backup cron:

```bash
0 2 * * * curl -sf -X POST http://127.0.0.1:3000/api/system/backups
```

## tools/list check (post-recovery)

After restore/restart:

```
tools/list → list_profiles → check_profile_health (each critical profile)
```

Document results in incident ticket.
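
If no MCP client is at hand, the same sequence can be approximated with raw JSON-RPC requests. The payload shapes below follow the MCP `tools/list` and `tools/call` methods; everything about transport is an assumption: that `/mcp` accepts a plain POST, that it uses bearer auth, and that it does not first require an `initialize` handshake or session header. The function names and `$MCP_TOKEN` are hypothetical.

```bash
#!/usr/bin/env bash
# Build a JSON-RPC 2.0 request for the MCP tools/list method; $1 is the request id.
mcp_tools_list() {
  printf '{"jsonrpc":"2.0","id":%d,"method":"tools/list"}\n' "$1"
}

# Build a tools/call request for a tool that takes no arguments; $1 id, $2 tool name.
mcp_tool_call() {
  printf '{"jsonrpc":"2.0","id":%d,"method":"tools/call","params":{"name":"%s","arguments":{}}}\n' "$1" "$2"
}

# Example (ASSUMED transport details, adjust to the deployment):
#   mcp_tools_list 1 | curl -sS -X POST http://127.0.0.1:3000/mcp \
#     -H 'Content-Type: application/json' \
#     -H 'Accept: application/json, text/event-stream' \
#     -H "Authorization: Bearer $MCP_TOKEN" --data-binary @-
```

`mcp_tool_call 2 list_profiles` produces the follow-up request; per-profile health checks would need the profile passed in `arguments`, whose schema is not shown here.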

## First test prompt (post-incident validation)

```
After gateway recovery: list_profiles, get_active_profile, check_connection_health
for all production profiles. Confirm that write tools are unavailable on production.
```

## Common operator mistakes

| Mistake | Consequence |
|---------|-------------|
| Restore without stop | Partial write corruption |
| Delete `backups/` during restore | No rollback |
| Restart during index job | Stale index — re-run discover |
| Shared debug on prod data dir | Policy bypass risk |

## Security warning

- Runbook actions are logged; use a break-glass account with auditing enabled.
- Do not paste live tokens into tickets.
- Forensic copies of `.data`: encrypt at rest and limit access.

## Related docs

- [PRODUCTION_STORAGE.md](./PRODUCTION_STORAGE.md) — backup API
- [DEPLOYMENT.md](./DEPLOYMENT.md) — systemd/Docker/k8s
- [TROUBLESHOOTING.md](./TROUBLESHOOTING.md) — dev diagnostics
- [THREAT_MODEL.md](./THREAT_MODEL.md)
