Performance Tuning (goroutine pool, connection pooling, caching): Optimizing the Agent Runtime for 10,000 Concurrent Tool Calls

Tune the goroutine pool, connection pooling, and caching so the agent platform can handle 10,000 concurrent tool calls without crashing, from sync.Pool to HTTP/2 connection reuse.

When your agent platform suddenly receives 10,000 simultaneous tool calls from an overdue cron job, spawning goroutines without a limit turns the server into a memory "black hole" within seconds. Unlike a typical web app, an agent runtime holds long-lived WebSocket/SSE connections for token streaming, spawns subprocesses in bulk for tool execution, and makes cascading API calls to LLM providers. These concurrency patterns make the "one goroutine per request" approach disastrous.

The Problem

Agent systems have three characteristics that make resource consumption explode:

1. Thundering herd from tool execution
When an agent team wakes up from a webhook or cron job, it can issue hundreds of tool calls at once (bash commands, API queries, file I/O). Each goroutine in Go starts with a stack of about 2KB, but that stack can grow to megabytes under deep recursion. 10,000 goroutines therefore need 20MB+ for stacks alone, before counting heap allocations for JSON parsing, token buffers, and API response objects.

2. Connection churn with LLM APIs
Each new connection to OpenAI/Anthropic costs 2-3 RTT before the first request byte is sent (1 RTT for the TCP handshake plus 1-2 RTT for TLS negotiation, depending on the TLS version). For an agent that streams tokens over HTTP/2, opening a new connection per request kills latency: users wait an extra 300-500ms before each response starts streaming instead of seeing tokens in real time.

3. GC pressure from ephemeral objects
Agents constantly allocate temporary objects: token buffers for LLM responses, JSON structs for tool schemas, intermediate embeddings. Once the system reaches 100K+ allocations per second, the garbage collector (GC) runs almost continuously. Go's collector is mostly concurrent, but its short stop-the-world phases and the allocation assists it imposes on busy goroutines stall every agent conversation in flight.
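The pressures listed above can be observed directly from the Go runtime. The following is a minimal sketch, not a benchmark: `runtimeSnapshot` and `takeSnapshot` are hypothetical helpers introduced here, and the 10,000 parked goroutines only stand in for an unbounded burst of tool calls.

```go
package main

import (
    "fmt"
    "runtime"
    "sync"
)

// runtimeSnapshot is a hypothetical helper type for this sketch.
type runtimeSnapshot struct {
    Goroutines int
    HeapAlloc  uint64 // bytes of allocated heap objects
    StackInuse uint64 // bytes in use by goroutine stacks
    NumGC      uint32 // completed GC cycles
    PauseTotal uint64 // cumulative stop-the-world pause, in nanoseconds
}

// takeSnapshot reads the counters from the runtime package.
func takeSnapshot() runtimeSnapshot {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    return runtimeSnapshot{
        Goroutines: runtime.NumGoroutine(),
        HeapAlloc:  m.HeapAlloc,
        StackInuse: m.StackInuse,
        NumGC:      m.NumGC,
        PauseTotal: m.PauseTotalNs,
    }
}

func main() {
    before := takeSnapshot()

    // Simulate an unbounded burst: one goroutine per tool call, all parked.
    const calls = 10000
    release := make(chan struct{})
    var wg sync.WaitGroup
    for i := 0; i < calls; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            <-release
        }()
    }

    during := takeSnapshot()
    close(release)
    wg.Wait()

    fmt.Printf("goroutines: %d -> %d\n", before.Goroutines, during.Goroutines)
    fmt.Printf("stack in use: %d KiB -> %d KiB\n", before.StackInuse/1024, during.StackInuse/1024)
    fmt.Printf("GC cycles so far: %d, total pause: %d ns\n", during.NumGC, during.PauseTotal)
}
```

The absolute numbers depend on the Go version and machine, so the sketch prints them rather than asserting specific values.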

The old approach, "spawn a goroutine per request", works for stateless HTTP APIs but collapses with stateful agent sessions that keep a context window and memory state alive for a long time.

Core Idea

Instead of treating goroutines, TCP connections, and heap objects as disposable consumables, treat them as reusable assets under bounded concurrency. The tuning architecture for an agent runtime rests on the following pillars:

Goroutine Pool with Backpressure

A fixed worker pool (typically sized at runtime.NumCPU() for CPU-bound work, or 2-4×CPU for I/O-bound tool calls) pulls tasks from a buffered channel. When the channel is full, the system returns a "busy" or queue-overflow error: an explicit backpressure mechanism instead of a silent crash from OOM.

Request → Channel Buffer (size=N) → Worker Pool (M goroutines)
                ↓ (if full)
         Reject/Retry with exponential backoff

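sync.Pool for Object Reuse

sync.Pool maintains per-P queues of reusable objects (byte buffers for JSON encode/decode, token slices), where P is a logical processor in the Go scheduler. An object that a goroutine puts back while running on a given P is usually handed to the next goroutine that calls Get on that same P, which avoids cross-core synchronization and cache-coherency traffic. Pooled objects are not held forever: the GC may drop them (since Go 1.13, an unused object typically survives one collection in a victim cache before being freed). On hot paths this reuse can cut the allocation rate by 90% or more.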

Connection Pooling with HTTP/2

Reusing a single HTTP client configured with MaxIdleConns and IdleConnTimeout keeps persistent TCP connections (multiplexed under HTTP/2) open to LLM providers and vector DBs. The TLS handshake happens once for thousands of requests, and the TCP congestion window is preserved across calls, which is critical for agent systems that make 10-50 tool calls per session.

Tiered Caching Layers

sync.Map: for read-heavy concurrent maps such as the tool schema registry and agent card metadata; no global mutex required.
Segmented LRU: for embedding caches and RAG retrieval results, avoiding disk/network access for frequently accessed knowledge.
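As an illustration of the first layer, a read-mostly tool schema registry on top of sync.Map might look like the sketch below. `ToolSchema` and `SchemaRegistry` are hypothetical types invented for this example, not part of any existing platform API.

```go
package main

import (
    "fmt"
    "sync"
)

// ToolSchema is a hypothetical, simplified tool description.
type ToolSchema struct {
    Name       string
    Parameters []string
}

// SchemaRegistry is a read-mostly registry backed by sync.Map.
type SchemaRegistry struct {
    schemas sync.Map // holds string -> *ToolSchema
}

// Register stores the schema unless one with the same name already exists,
// and returns whichever schema ends up in the registry.
func (r *SchemaRegistry) Register(s *ToolSchema) *ToolSchema {
    actual, _ := r.schemas.LoadOrStore(s.Name, s)
    return actual.(*ToolSchema)
}

// Lookup reads a schema without taking a global lock.
func (r *SchemaRegistry) Lookup(name string) (*ToolSchema, bool) {
    v, ok := r.schemas.Load(name)
    if !ok {
        return nil, false
    }
    return v.(*ToolSchema), true
}

func main() {
    var reg SchemaRegistry
    reg.Register(&ToolSchema{Name: "bash", Parameters: []string{"command"}})
    reg.Register(&ToolSchema{Name: "http_get", Parameters: []string{"url"}})

    // Many concurrent readers, as in a burst of tool calls.
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            if _, ok := reg.Lookup("bash"); !ok {
                panic("schema missing")
            }
        }()
    }
    wg.Wait()

    s, _ := reg.Lookup("http_get")
    fmt.Println(s.Name, s.Parameters) // prints "http_get [url]"
}
```

sync.Map suits this case because schemas are written once and read many times. For write-heavy maps, a regular map guarded by a mutex is often the better choice.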

Code/Config Examples

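// Package name, Task, ErrPoolFull, NewAgentWorkerPool, and worker are
// assumed definitions added so the snippets compile together.
package agentruntime

import (
    "bytes"
    "errors"
    "net/http"
    "sync"
    "time"
)

// Task is one unit of tool-execution work.
type Task func()

// ErrPoolFull signals backpressure: every slot is queued or running.
var ErrPoolFull = errors.New("agent worker pool is full")

// Goroutine pool with explicit backpressure
type AgentWorkerPool struct {
    workers  int
    jobQueue chan Task
    sem      chan struct{} // Bounded concurrency: one slot per queued or running task
}

// NewAgentWorkerPool gives jobQueue and sem the same capacity, so a task
// that has acquired a slot can always be enqueued without blocking.
func NewAgentWorkerPool(workers, maxInFlight int) *AgentWorkerPool {
    return &AgentWorkerPool{
        workers:  workers,
        jobQueue: make(chan Task, maxInFlight),
        sem:      make(chan struct{}, maxInFlight),
    }
}

func (p *AgentWorkerPool) Submit(task Task) error {
    select {
    case p.sem <- struct{}{}: // Acquire slot
        p.jobQueue <- task
        return nil
    default:
        return ErrPoolFull // Backpressure!
    }
}

func (p *AgentWorkerPool) Start() {
    for i := 0; i < p.workers; i++ {
        go p.worker()
    }
}

// worker runs tasks until jobQueue is closed, releasing one slot per task.
func (p *AgentWorkerPool) worker() {
    for task := range p.jobQueue {
        task()
        <-p.sem // Release slot
    }
}

// sync.Pool for JSON buffers: reduces GC pressure in agent tool I/O
var jsonBufferPool = sync.Pool{
    New: func() interface{} {
        return bytes.NewBuffer(make([]byte, 0, 4096))
    },
}

func ProcessToolOutput(data []byte) {
    buf := jsonBufferPool.Get().(*bytes.Buffer)
    defer jsonBufferPool.Put(buf)

    buf.Reset()
    buf.Write(data)
    // Parse JSON...
}

// Connection pooling for the LLM API: critical for streaming latency
var llmHTTPClient = &http.Client{
    Transport: &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 100, // default is 2, which would discard most idle connections to a single provider
        MaxConnsPerHost:     100,
        IdleConnTimeout:     90 * time.Second,
        ForceAttemptHTTP2:   true, // HTTP/2 multiplexing
        TLSHandshakeTimeout: 10 * time.Second,
    },
    // Timeout covers the whole exchange, including reading a streamed body.
    Timeout: 120 * time.Second, // For long-running agent reasoning
}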

Why It Works

Backpressure as a Circuit Breaker
When the pool is full, the agent platform returns "503 Service Unavailable" or triggers horizontal scaling (HPA on Kubernetes) instead of silently dying from OOM. This lets an orchestrator such as Kubernetes Fleet mark the node "unhealthy" and route traffic to another pod, while agent state stays safe in external memory (Redis/Persistent Volume).

Processor-Local Optimization
sync.Pool works well because it sidesteps cache-coherency traffic: when a goroutine puts an object into the pool and a Get on the same P retrieves it, the object is likely still in that core's L1/L2 cache. Cross-core synchronization is expensive, and sync.Pool is designed to minimize it by maintaining per-P local queues.

Amortization of the TLS Handshake
The fixed cost of TCP+TLS setup (2-3 RTT ≈ 200-400ms) is spread across thousands of requests on the same connection. For an agent streaming tokens, this turns a 300ms "cold start" into a 50ms "hot path", so users perceive the response as real-time.
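The amortization can be written as a rough model. The symbols are assumptions introduced here: $t_{\text{hs}}$ is the one-time TCP+TLS setup cost, $t_{\text{req}}$ is the per-request latency on a warm connection, and $N$ is the number of requests sent over that connection. The values 250ms and 50ms are chosen only to match the 300ms and 50ms figures above.

```latex
\bar{t}(N) = t_{\text{req}} + \frac{t_{\text{hs}}}{N}

% With t_hs = 250 ms and t_req = 50 ms:
\bar{t}(1)    = 50 + \frac{250}{1}    = 300\ \text{ms}
\bar{t}(1000) = 50 + \frac{250}{1000} = 50.25\ \text{ms}
```

Under this model the handshake cost becomes negligible after a few hundred requests, which is why keeping the connection alive matters more than shaving the handshake itself.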

Trade-off: Sizing Rigidity
If the worker count is too low (< CPU cores), the CPU sits underutilized while workers are blocked on I/O. If it is too high, context-switch thrashing reduces throughput. A practical rule of thumb: for I/O-heavy agents (many API calls), set workers = 2 * NumCPU; for CPU-heavy local inference, set workers = NumCPU.
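The rule of thumb translates into a small helper. This is a sketch: the `Workload` type and `workerCount` function are hypothetical names, and the multiplier of 2 is just the starting point suggested above, to be tuned against real metrics.

```go
package main

import (
    "fmt"
    "runtime"
)

// Workload is a hypothetical classification of what the pool mostly does.
type Workload int

const (
    CPUBound Workload = iota // e.g. local inference
    IOBound                  // e.g. API-heavy tool calls
)

// workerCount applies the rule of thumb to a given core count.
func workerCount(kind Workload, cpus int) int {
    if cpus < 1 {
        cpus = 1
    }
    if kind == IOBound {
        return 2 * cpus
    }
    return cpus
}

func main() {
    cpus := runtime.NumCPU()
    fmt.Println("CPU-bound workers:", workerCount(CPUBound, cpus))
    fmt.Println("I/O-bound workers:", workerCount(IOBound, cpus))
}
```

In a container with a CPU limit, runtime.NumCPU() may report the host's cores instead of the limit, so runtime.GOMAXPROCS(0) or the configured limit may be a better input.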

Practical Impact

Metric	Unbounded Goroutines	Goroutine Pool + sync.Pool	Improvement
Memory (10K concurrent)	200MB+ (stacks + heap growth)	40MB	80% lower
p99 Latency	5s (GC thrashing)	200ms	25× faster
LLM API RTT	300ms (new TLS each time)	50ms (connection reuse)	6× faster
Max Throughput	Crashes at 8K req/s	Stable at 50K+ req/s	No crash

Production Stories
VictoriaMetrics processes millions of metric samples per second using sync.Pool for byte buffers, which shows the pattern scales to high data-ingestion rates. In the GoClaw/OpenClaw ecosystem, worker pools isolate "agent session coordination" (lightweight goroutines) from "tool execution bursts" (heavy goroutines), so that a single runaway agent stuck in an infinite spawn loop cannot take down the whole node.

Limitations

sync.Pool is not a long-term cache: the GC may drop pooled objects at any collection cycle. Use Redis for agent memory and long-term caching.
A badly sized pool is worse than unbounded goroutines (tune it against the metrics that drive HPA in K8s).
This pattern does not fix algorithmic complexity (if your agent reasoning is $O(n^2)$, tuning concurrency only delays the collapse).

Going Deeper

Official Documentation

Go sync.Pool internals: a deep dive into per-P local caches and how the pool cooperates with the GC.
GoPerf Worker Pools: sizing patterns and backpressure mechanics.

Related TroiSinh Articles

Same cluster (Production Deploy):

"Goroutine Worker Pools vs Unlimited Goroutines": detailed benchmarks on allocation reduction and p99 latency.
"Idiosyncrasies of Programmable Caching Engines" (arXiv:2603.14357): multi-tenant cache isolation architecture for agent fleets.
