Disaster Recovery: Backup, restore, and failover for agent platforms (when the agent 'forgets' what it was doing midway)

A DR strategy for AI agent platforms: from checkpointing cognitive state to multi-region failover, so that agents do not 'lose their minds' in the middle of long-running tasks.

Imagine your agent is handling a complex support ticket: it has exchanged 20 turns with the customer, called 3 internal APIs to look up the order, and is in the middle of the "refund" step when the primary server crashes. When the system fails over to the standby region, if you only restore the database without "cognitive continuity", the agent wakes up "petrified": it does not remember which ticket it was handling, does not know whether the API call already went out, and greets the customer from scratch with "Hello, how can I help you?". Disaster Recovery (DR) for an agent platform is not just backing up data; it is the art of preserving the "thinking state" (cognitive state) across regions.

The problem

AI agents are long-running, stateful processes, unlike traditional stateless web apps. An HTTP request is typically handled in about 200 ms and then forgotten; an agent can sustain a "conversation" whose context window spans hours, store episodic memory in a vector DB, hold temporary tool auth tokens, and maintain the routing topology between sub-agents.

Traditional DR focuses on database snapshots and volume backups, which leaves serious gaps for agents:

Context Window Loss: When an instance crashes, in-flight reasoning (the tokens being generated) and any conversation history not yet flushed to the persistent store disappear.
Tool Auth Drift: OAuth tokens or AWS credentials that the agent has just refreshed are lost, forcing re-authentication with the user or the external service.
Saga Interruption: Multi-step workflows (for example: "verify inventory → hold payment → create label → notify courier") stop midway and leave a "zombie" state: some steps have executed, others have not, and there is no checkpoint to resume from.
Topology Amnesia: After failover, the agent does not know which region it is running in, which tool servers it is connected to, or which subtask of the multi-agent mesh it was handling.

The result is "moment loss": the agent loses the ability to resume from the exact moment (down to the millisecond) at which it stopped. Users see an agent that seems absent-minded and repeats old questions or, worse, performs a double execution (charging twice because it does not remember that the payment hold already happened).

The core idea

Disaster Recovery for an agent platform rests on Cognitive Continuity: the ability to restore not only data but also the agent's "mind", meaning its reasoning state, memory embeddings, tool contracts, and network topology. The architecture has 4 pillars:

Stateful Checkpointing: Capturing the "unfinished dream"

Instead of only backing up the PostgreSQL database that holds user profiles, you need to capture the agent's working state:

Conversation State: Store the thread_ref, the context window contents (the message history; the transformer's KV cache normally stays on the inference server and is rebuilt from that history), and intent classification results. Use Redis with AOF (Append-Only File) plus RDB snapshots, or LangGraph's built-in persistence layer (Postgres checkpointing), to write each turn immediately.
Memory State: Episodic memory (a vector DB such as Pinecone or Qdrant) and semantic memory (a knowledge graph) need real-time replication, not just a nightly backup.
Execution State: For long-horizon tasks (>100 steps), use event sourcing: record the sequence ToolCall → Observation → Thought in an append-only log (Kafka, Redis Streams), which allows replay from step N on failover.
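To make the event-sourcing idea concrete, here is a minimal sketch of replaying an agent's log after a crash. The names AgentEvent, AgentState, apply_event, and replay are hypothetical, and a plain Python list stands in for the Kafka or Redis Streams log.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class AgentEvent:
    step: int      # monotonically increasing position in the log
    kind: str      # "thought", "tool_call", or "observation"
    payload: Any


@dataclass
class AgentState:
    last_step: int = 0
    thoughts: list = field(default_factory=list)
    observations: list = field(default_factory=list)
    pending_tool_call: Optional[dict] = None  # call issued, result not yet seen


def apply_event(state: AgentState, event: AgentEvent) -> None:
    """Fold one recorded event into the state (no LLM or tool is invoked)."""
    if event.kind == "thought":
        state.thoughts.append(event.payload)
    elif event.kind == "tool_call":
        state.pending_tool_call = event.payload
    elif event.kind == "observation":
        state.observations.append(event.payload)
        state.pending_tool_call = None
    else:
        raise ValueError(f"unknown event kind: {event.kind}")
    state.last_step = event.step


def replay(log: list, snapshot: Optional[AgentState] = None) -> AgentState:
    """Rebuild state from a snapshot plus every event recorded after it."""
    state = copy.deepcopy(snapshot) if snapshot is not None else AgentState()
    for event in sorted(log, key=lambda e: e.step):
        if event.step > state.last_step:
            apply_event(state, event)
    return state


# Hypothetical log for the support-ticket scenario: the crash happens
# after the refund call was issued but before its result was observed.
log = [
    AgentEvent(1, "thought", "look up the order"),
    AgentEvent(2, "tool_call", {"tool": "get_order", "order_id": "A1"}),
    AgentEvent(3, "observation", {"status": "paid"}),
    AgentEvent(4, "thought", "issue the refund"),
    AgentEvent(5, "tool_call", {"tool": "refund", "order_id": "A1"}),
]
recovered = replay(log)
```

After replay, recovered.pending_tool_call still holds the refund call, which tells the standby agent that the outcome of that call is unknown and must be checked before anything is retried.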

Checkpoint Pattern

Every 5 turns, or before each expensive tool call, the agent writes its "brain state" to Redis with a 24h TTL. A standby instance continuously replicates this stream, for example through Redis replication managed by Sentinel or through ElastiCache Global Datastore.
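The checkpoint rule can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation: InMemoryStore is a hypothetical stand-in for Redis so the example runs without a server, and the key layout checkpoint:<thread_ref> is invented for the sketch.

```python
import json
import time

CHECKPOINT_EVERY_TURNS = 5
CHECKPOINT_TTL_SECONDS = 24 * 60 * 60  # the 24h TTL from the text


class InMemoryStore:
    """Hypothetical stand-in for Redis: values expire after a TTL."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, ttl_seconds, now=None):
        now = time.time() if now is None else now
        self._data[key] = (value, now + ttl_seconds)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._data.get(key)
        if entry is None or entry[1] <= now:
            return None  # missing or expired
        return entry[0]


def should_checkpoint(turn: int, next_call_is_expensive: bool) -> bool:
    """Checkpoint every N turns, or right before an expensive tool call."""
    periodic = turn > 0 and turn % CHECKPOINT_EVERY_TURNS == 0
    return next_call_is_expensive or periodic


def save_checkpoint(store, thread_ref, turn, state, now=None):
    blob = json.dumps({"turn": turn, "state": state})
    store.set(f"checkpoint:{thread_ref}", blob, CHECKPOINT_TTL_SECONDS, now=now)


def load_checkpoint(store, thread_ref, now=None):
    raw = store.get(f"checkpoint:{thread_ref}", now=now)
    return json.loads(raw) if raw is not None else None


store = InMemoryStore()
if should_checkpoint(turn=5, next_call_is_expensive=False):
    save_checkpoint(store, "ticket-42", 5, {"intent": "refund"}, now=0)
```

With redis-py, the equivalent write would be r.set(key, blob, ex=CHECKPOINT_TTL_SECONDS); the explicit now parameter exists only to make expiry easy to exercise in this sketch.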

Infrastructure-as-Code Identity: Restoring the agent's "nervous system"

When failing over to a new region, the agent has to rebuild its exact "nervous system", not just the binaries:

Prompt Versioning: SOUL.md, SYSTEM_PROMPT, and tool descriptions must be pinned to a Git commit SHA, not "latest". Terraform/Pulumi ensures the standby region deploys the same prompt version that is running in primary.
Tool Contracts: Schema definitions (OpenAPI specs for MCP tools) and allowlist policies are versioned together with the code, so that the standby agent does not call a tool with the wrong parameters because the schema has drifted.
IAM & Network Identity: Service account tokens and VPC peering configs for tool servers (for example, the connection to an internal SAP API) are declared explicitly in the infrastructure definition instead of depending on runtime discovery.
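One way to check that a standby region really runs the same pinned identity is to fingerprint a manifest of the pinned versions and compare regions. The sketch below assumes a hypothetical manifest layout (prompt_commit, tool_schemas, allowlist); a real deployment would derive these values from its Terraform or Pulumi state.

```python
import hashlib
import json


def manifest_fingerprint(manifest: dict) -> str:
    """Hash a canonical JSON form so key order does not change the result."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(primary: dict, standby: dict) -> list:
    """Return the top-level manifest keys whose values differ between regions."""
    keys = sorted(set(primary) | set(standby))
    return [key for key in keys if primary.get(key) != standby.get(key)]


# Hypothetical manifests: the standby was deployed with an unpinned prompt.
primary = {
    "prompt_commit": "3f2a9c1",
    "tool_schemas": {"refund": "v3", "get_order": "v1"},
    "allowlist": ["get_order", "refund"],
}
standby = dict(primary, prompt_commit="latest")

in_sync = manifest_fingerprint(primary) == manifest_fingerprint(standby)
drifted_keys = find_drift(primary, standby)
```

Comparing a single fingerprint keeps the health check cheap, and find_drift is only needed once the fingerprints disagree.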

Graceful Task Draining: No "cutting off" mid-sentence

When an incident is detected, circuit breakers are needed so that in-flight operations are either completed or checkpointed:

LLM Call Completion: If a request is still streaming tokens back from GPT-4, drain it into a Redis stream before terminating the pod (Kubernetes preStop hooks).
Saga Checkpointing: For distributed transactions (multi-step workflows), use the Saga pattern with compensation logic. Each successful step is written to a "saga log" before proceeding; on failover, the coordinator replays the log to decide whether to resume, roll back, or abort.
User Notification: If failover happens mid-conversation, the agent should announce it gracefully ("I have just switched to a backup server, let me review the context...") rather than silently disappearing.
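The coordinator's resume, roll back, or abort decision can be sketched as a pure function over the saga log. The step names follow the workflow mentioned earlier; the compensation names, the log entry shape, and the rule that an out-of-order log means abort are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    RESUME = "resume"
    ROLLBACK = "rollback"
    ABORT = "abort"


SAGA_STEPS = ["verify_inventory", "hold_payment", "create_label", "notify_courier"]
# Hypothetical compensating actions for steps that have side effects.
COMPENSATIONS = {"hold_payment": "release_payment", "create_label": "void_label"}


@dataclass
class RecoveryPlan:
    decision: Decision
    remaining_steps: list
    compensations: list


def plan_recovery(saga_log: list) -> RecoveryPlan:
    """Decide what to do after failover from the recorded step outcomes."""
    completed = [e["step"] for e in saga_log if e["status"] == "success"]
    failed = [e["step"] for e in saga_log if e["status"] == "failed"]

    # The log must be a prefix of the known workflow; otherwise it cannot
    # be trusted and the case is handed to a human (assumed policy).
    if completed != SAGA_STEPS[: len(completed)]:
        return RecoveryPlan(Decision.ABORT, [], [])

    if failed:
        # Undo completed steps in reverse order, skipping side-effect-free ones.
        undo = [COMPENSATIONS[s] for s in reversed(completed) if s in COMPENSATIONS]
        return RecoveryPlan(Decision.ROLLBACK, [], undo)

    return RecoveryPlan(Decision.RESUME, SAGA_STEPS[len(completed):], [])


# Crash after the payment hold succeeded: resume with the label step.
plan = plan_recovery([
    {"step": "verify_inventory", "status": "success"},
    {"step": "hold_payment", "status": "success"},
])
```

Keeping the decision free of side effects means it can be exercised against recorded logs before it is trusted during a real failover.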

Tiered Replication Strategy: Tiering recovery speed

Not every kind of state needs the same RPO (Recovery Point Objective):

State Type	RPO Target	Replication Mode	Technology
Conversation State	< 30s	Active-Active	Redis Cluster cross-region, CockroachDB
Tool Auth Tokens	Real-time	Synchronous	HashiCorp Vault replication, AWS Secrets Manager multi-region
Vector Memory	< 5 minutes	Asynchronous	Qdrant Raft replication, Pinecone metadata sync
Model Weights	Hourly	Cold standby	S3/GCS replication, EKS AMI

Active-Active for Conversation: Conversation state (Redis) runs active-active across 2 regions, keeping RPO under 30s. On failover, users do not notice the region switch apart from roughly 50 ms of added latency.

Warm Standby for Inference: GPU nodes running model serving (vLLM, TGI) stay "warm": they receive no traffic but already have the weights loaded, for an RTO (Recovery Time Objective) under 2 minutes.
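The tiers in the table can be turned into a simple monitoring check. In this sketch the RPO targets are taken from the table (with "Real-time" modelled as 0 seconds and "Hourly" as 3600 seconds); the state-type keys and the idea of feeding in measured replication lag are assumptions for illustration.

```python
# RPO targets in seconds, one entry per row of the table.
RPO_TARGET_SECONDS = {
    "conversation_state": 30,
    "tool_auth_tokens": 0,       # synchronous replication: no tolerated lag
    "vector_memory": 5 * 60,
    "model_weights": 60 * 60,
}


def rpo_violations(observed_lag_seconds: dict) -> list:
    """Return the state types whose replication lag exceeds their RPO target.

    A state type with no measurement is treated as a violation, since an
    unmonitored replica cannot be assumed to be caught up.
    """
    violations = []
    for state_type, target in RPO_TARGET_SECONDS.items():
        lag = observed_lag_seconds.get(state_type, float("inf"))
        if lag > target:
            violations.append(state_type)
    return violations


# Hypothetical measurements: vector memory is 7 minutes behind.
lagging = rpo_violations({
    "conversation_state": 12,
    "tool_auth_tokens": 0,
    "vector_memory": 420,
    "model_weights": 1800,
})
```

Treating each target as an inclusive upper bound is a simplification; a stricter check could alert at some fraction of the target instead.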

Why it works

The key logic is separation of state and compute. In an agent platform, the LLM inference nodes (compute) are stateless; they can die and respawn in another region. What matters is that the state store (Redis, Postgres, Kafka) survives or is replicated continuously.

Event Sourcing for Agent Actions matters as well: instead of storing the "current state", store the sequence of events (the ReAct loop: Thought → Action → Observation). On failover, the standby agent replays events from the last checkpoint to reconstruct its state of mind. This sidesteps LLM non-determinism: the model may generate different text on two separate runs, but replay reuses the recorded thoughts and tool observations instead of regenerating them, so the agent arrives at the same decision point with the same context window.

Idempotency Keys for tool calls prevent double execution: each tool call is tagged with a unique execution_id. When failover triggers a retry, the tool layer (Stripe API, internal ERP) recognizes the duplicate and returns the earlier result instead of executing again.
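A minimal sketch of the idempotency-key mechanism follows. The execution_id derivation, the IdempotentToolLayer class, and the charge_card tool are hypothetical; an in-memory dict stands in for a result store that, in practice, would have to be durable and visible from both regions.

```python
import hashlib
import json


def execution_id(thread_ref: str, step: int, tool: str, args: dict) -> str:
    """Derive a stable key: the same logical call always maps to the same id."""
    payload = json.dumps([thread_ref, step, tool, args], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class IdempotentToolLayer:
    """Executes each execution_id at most once and replays the stored result."""

    def __init__(self):
        self._results = {}  # stand-in for a shared, durable result store

    def call(self, exec_id, fn, *args, **kwargs):
        if exec_id in self._results:
            return self._results[exec_id]  # duplicate: return earlier result
        result = fn(*args, **kwargs)
        self._results[exec_id] = result
        return result


charges = []  # records every real execution of the hypothetical tool


def charge_card(amount_cents: int) -> dict:
    charges.append(amount_cents)
    return {"charged": amount_cents, "receipt": len(charges)}


layer = IdempotentToolLayer()
key = execution_id("ticket-42", 7, "charge_card", {"amount_cents": 1999})
first = layer.call(key, charge_card, 1999)
retry = layer.call(key, charge_card, 1999)  # simulated retry after failover
```

Deriving the key from the thread, step, and arguments means the standby agent regenerates the same execution_id without coordination. A crash between executing the tool and recording its result is still a gap, which is why the external service also needs to honor the key.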

Practical implications

Compared with traditional web app DR:

Aspect	Stateless Web App	AI Agent Platform
State	Session token on the client	10K-token context window + vector memory
Failover	Load balancer routes to a healthy node	Cognitive state must be "thawed" from Redis
Data Loss	One lost request	20 conversation turns + intent history lost
RPO	Not important (stateless)	< 30s for conversation, < 5s for auth tokens

Production Story: An OpenClaw deployment on multi-region Kubernetes uses:

Redis Cluster (6 nodes, 3 AZs) for conversation state with appendfsync everysec
Litestream replicating SQLite checkpoints (for local agent memory) to S3 every 10s
K8s Pod Disruption Budgets so that in-flight requests drain before pods are terminated
Terraform workspaces keeping prompt versions and tool configs in sync between us-east-1 and eu-west-1

Result: Failover completes in 45s, the user only sees a "Reconnecting..." message for 2s, and the agent then continues its half-finished answer as if nothing had happened.

Limitations:

Replicating Redis active-active costs 3-4 times as much as a single node
Cold-starting GPU nodes still takes 2-3 minutes when there is no warm standby
The "exact thought process" cannot be recovered if an LLM call was in the middle of generating (the only option is to retry from the original prompt)

Going deeper

Official docs:

Redis Persistence Documentation (AOF vs RDB in detail, applied to agent state)
LangGraph Persistence (checkpointing and memory management)
AWS Well-Architected: Disaster Recovery (RPO/RTO definitions for mission-critical applications)

Related TroiSinh articles:

Same cluster, Production Deployment:

Deploy with Docker Compose

Deploy on Kubernetes

Observability for Agents

5-layer Security
