Go and Prometheus: Metrics and Alerts in Production

Prometheus has become the de facto standard for monitoring cloud-native applications. It pairs naturally with Go: Prometheus was created at SoundCloud and is itself written in Go, and both projects share a philosophy of simplicity and performance.

In this guide, you will learn how to instrument Go applications with Prometheus metrics, build Grafana dashboards, and configure sensible alerts.

Table of Contents

  1. Prometheus Fundamentals
  2. Go Libraries for Prometheus
  3. Metric Types
  4. Instrumenting Applications
  5. Go Runtime Metrics
  6. Tracing and Context
  7. Grafana Dashboards
  8. Alerting with Alertmanager
  9. Production Patterns

Prometheus Fundamentals

Architecture

┌─────────────────────────────────────────────────────────────┐
│                       Go Applications                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │    App 1     │  │    App 2     │  │    App 3     │       │
│  │:8080/metrics │  │:8081/metrics │  │:8082/metrics │       │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘       │
└─────────┼─────────────────┼─────────────────┼───────────────┘
          │ scrape (pull)   │                 │
          ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────────┐
│                      Prometheus Server                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │  Retrieval   │  │     TSDB     │  │ Query Engine │       │
│  │   (Pull)     │  │  (Storage)   │  │   (PromQL)   │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└────────┬───────────────────────────────────────┬────────────┘
         │ HTTP (PromQL)                         │ alerts (push)
         ▼                                       ▼
┌──────────────────────────────┐  ┌──────────────────────────────┐
│           Grafana            │  │         Alertmanager         │
│ (Visualization / Dashboards) │  │ (Routing and Notifications)  │
└──────────────────────────────┘  └──────────────────────────────┘

Data Model

Prometheus uses a dimensional data model:

http_requests_total{method="GET", endpoint="/api/users", status="200"} 1027
http_requests_total{method="POST", endpoint="/api/users", status="201"} 45
http_requests_total{method="GET", endpoint="/api/users", status="500"} 3

  • Metric name: http_requests_total
  • Labels (dimensions): method, endpoint, status
  • Value: the current number (the timestamp is implicit)
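
Labels are what make the model powerful: PromQL can aggregate across any dimension. For example, to sum requests by status regardless of method and endpoint:

sum by (status) (http_requests_total)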

Go Libraries for Prometheus

Installation

go get github.com/prometheus/client_golang/prometheus
go get github.com/prometheus/client_golang/prometheus/promauto
go get github.com/prometheus/client_golang/prometheus/promhttp

Basic Setup

package main

import (
    "log"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Counter - only ever increases
    requestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    // Gauge - can go up or down
    activeConnections = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_connections",
            Help: "Number of active connections",
        },
    )

    // Histogram - distribution of observed values
    requestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: prometheus.DefBuckets, // 0.005, 0.01, 0.025, ..., 10
        },
        []string{"method", "endpoint"},
    )

    // Summary - like a histogram, but computes quantiles client-side
    responseSize = promauto.NewSummaryVec(
        prometheus.SummaryOpts{
            Name:       "http_response_size_bytes",
            Help:       "HTTP response size",
            Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
        },
        []string{"method", "endpoint"},
    )
)

func main() {
    // Metrics endpoint
    http.Handle("/metrics", promhttp.Handler())

    // Application routes
    http.HandleFunc("/api/data", handleRequest)

    log.Fatal(http.ListenAndServe(":8080", nil))
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
    start := time.Now()

    // Do the actual work (processRequest is application-specific)
    processRequest(w, r)

    // Record metrics
    duration := time.Since(start).Seconds()
    status := "200" // in real code, derive this from the response

    requestsTotal.WithLabelValues(r.Method, "/api/data", status).Inc()
    requestDuration.WithLabelValues(r.Method, "/api/data").Observe(duration)
}
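
Hitting GET /metrics returns Prometheus's plain-text exposition format. The output looks roughly like this (values are illustrative):

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{endpoint="/api/data",method="GET",status="200"} 42
http_requests_total{endpoint="/api/data",method="POST",status="200"} 7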

Metric Types

1. Counter

A counter only ever increases. Use it to count events.

var (
    // Simple counter
    errorsTotal = promauto.NewCounter(
        prometheus.CounterOpts{
            Name: "errors_total",
            Help: "Total number of errors",
        },
    )

    // Counter with labels
    httpErrors = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_errors_total",
            Help: "Total HTTP errors by type",
        },
        []string{"code", "endpoint"},
    )

    // Recovered panics
    panicRecoveries = promauto.NewCounter(
        prometheus.CounterOpts{
            Name: "panic_recoveries_total",
            Help: "Number of recovered panics",
        },
    )
)

// Usage
func processWithError() error {
    if err := doSomething(); err != nil {
        errorsTotal.Inc()
        httpErrors.WithLabelValues("500", "/api/endpoint").Inc()
        return err
    }
    return nil
}

// In Prometheus, query the per-second rate: rate(errors_total[5m])

2. Gauge

A gauge can go up or down. Use it for instantaneous values.

var (
    // Active connections
    activeConnections = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_connections",
            Help: "Number of active connections",
        },
    )

    // Queue size
    queueSize = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "queue_size",
            Help: "Queue size by queue name",
        },
        []string{"queue_name"},
    )

    // Temperature (IoT example)
    temperature = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "sensor_temperature_celsius",
            Help: "Sensor temperature",
        },
        []string{"sensor_id", "location"},
    )
)

// Usage
func handleConnection(conn net.Conn) {
    activeConnections.Inc()
    defer activeConnections.Dec()

    // Handle the connection
}

// Use Set() for absolute values
func updateQueueSize(name string, size int) {
    queueSize.WithLabelValues(name).Set(float64(size))
}

// Prometheus query: active_connections

3. Histogram

Samples observations and counts them in configurable buckets. Ideal for latencies.

var (
    // Request latency
    requestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: []float64{0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
        },
        []string{"method", "endpoint"},
    )

    // Payload sizes
    payloadSize = promauto.NewHistogram(
        prometheus.HistogramOpts{
            Name:    "request_payload_bytes",
            Help:    "Payload size in bytes",
            Buckets: prometheus.ExponentialBuckets(100, 10, 8), // 100, 1000, 10000, ...
        },
    )
)

// Usage
func timedHandler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    defer func() {
        duration := time.Since(start).Seconds()
        requestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
    }()

    // Handle the request
}

// Useful queries:
// - histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
// - http_request_duration_seconds_count
// - http_request_duration_seconds_sum / http_request_duration_seconds_count (mean)

4. Summary

Similar to a histogram, but computes quantiles on the client.

var (
    // Latency with precomputed quantiles
    requestLatency = promauto.NewSummaryVec(
        prometheus.SummaryOpts{
            Name: "http_request_latency_seconds",
            Help: "HTTP request latency",
            Objectives: map[float64]float64{
                0.5:  0.05,  // median with 5% error
                0.9:  0.01,  // P90 with 1% error
                0.99: 0.001, // P99 with 0.1% error
            },
            MaxAge:     10 * time.Minute,
            AgeBuckets: 5,
        },
        []string{"method", "endpoint"},
    )
)

// Summaries give more precise quantiles but cost more on the client, and
// their quantiles cannot be aggregated across instances. Prefer them when
// you need precise quantiles and have few time series.
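
As a quick usage sketch (reusing the requestLatency vector above; the handler body is hypothetical):

func instrumentedHandler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    defer func() {
        // Quantiles are maintained client-side over the MaxAge window
        requestLatency.WithLabelValues(r.Method, "/api/users").Observe(time.Since(start).Seconds())
    }()

    // Handle the request
}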

Instrumenting Applications

A Complete HTTP Middleware

package middleware

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    httpDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "path", "status"},
    )

    httpRequests = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total HTTP requests",
        },
        []string{"method", "path", "status"},
    )

    httpRequestSize = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_size_bytes",
            Help:    "HTTP request size",
            Buckets: prometheus.ExponentialBuckets(100, 10, 7),
        },
        []string{"method", "path"},
    )

    httpResponseSize = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_response_size_bytes",
            Help:    "HTTP response size",
            Buckets: prometheus.ExponentialBuckets(100, 10, 7),
        },
        []string{"method", "path"},
    )

    activeRequests = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "http_active_requests",
            Help: "HTTP requests currently in flight",
        },
    )
)

// responseWriter wraps http.ResponseWriter to capture the status code and bytes written
type responseWriter struct {
    http.ResponseWriter
    statusCode int
    written    int64
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

func (rw *responseWriter) Write(b []byte) (int, error) {
    n, err := rw.ResponseWriter.Write(b)
    rw.written += int64(n)
    return n, err
}

// PrometheusMiddleware instruments HTTP handlers
func PrometheusMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        activeRequests.Inc()
        defer activeRequests.Dec()

        wrapped := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
        
        next.ServeHTTP(wrapped, r)

        duration := time.Since(start).Seconds()
        status := strconv.Itoa(wrapped.statusCode)
        
        labels := prometheus.Labels{
            "method": r.Method,
            // NOTE: raw r.URL.Path can explode cardinality; normalize it
            // in production (see "1. Controlled Cardinality" below)
            "path":   r.URL.Path,
            "status": status,
        }

        httpDuration.With(labels).Observe(duration)
        httpRequests.With(labels).Inc()
        httpResponseSize.WithLabelValues(r.Method, r.URL.Path).Observe(float64(wrapped.written))
        
        if r.ContentLength > 0 {
            httpRequestSize.WithLabelValues(r.Method, r.URL.Path).Observe(float64(r.ContentLength))
        }
    })
}

// Usage (promhttp imported as in the earlier example)
func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/api/users", handleUsers)
    mux.HandleFunc("/api/orders", handleOrders)

    // Apply the middleware
    handler := PrometheusMiddleware(mux)

    // Metrics endpoint
    http.Handle("/metrics", promhttp.Handler())
    http.Handle("/", handler)

    http.ListenAndServe(":8080", nil)
}

Database Instrumentation

package database

import (
    "context"
    "database/sql"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    _ "github.com/lib/pq"
)

var (
    dbQueries = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "db_queries_total",
            Help: "Total queries executed",
        },
        []string{"operation", "table"},
    )

    dbQueryDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "db_query_duration_seconds",
            Help:    "Query duration",
            Buckets: []float64{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1},
        },
        []string{"operation", "table"},
    )

    dbConnections = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "db_connections",
            Help: "Number of connections in the pool",
        },
        []string{"state"},
    )

    dbErrors = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "db_errors_total",
            Help: "Database errors",
        },
        []string{"operation", "error_type"},
    )
)

// InstrumentedDB wraps *sql.DB with metrics
type InstrumentedDB struct {
    *sql.DB
}

func NewInstrumentedDB(driver, dsn string) (*InstrumentedDB, error) {
    db, err := sql.Open(driver, dsn)
    if err != nil {
        return nil, err
    }

    // Collect pool metrics in the background
    go collectPoolMetrics(db)

    return &InstrumentedDB{DB: db}, nil
}

func collectPoolMetrics(db *sql.DB) {
    ticker := time.NewTicker(10 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        stats := db.Stats()
        dbConnections.WithLabelValues("open").Set(float64(stats.OpenConnections))
        dbConnections.WithLabelValues("in_use").Set(float64(stats.InUse))
        dbConnections.WithLabelValues("idle").Set(float64(stats.Idle))
    }
}

func (db *InstrumentedDB) QueryContext(ctx context.Context, query string, args ...interface{}) (*sql.Rows, error) {
    // In real code, derive operation/table from the query (or pass them in)
    return db.instrumentQuery("SELECT", "unknown", func() (*sql.Rows, error) {
        return db.DB.QueryContext(ctx, query, args...)
    })
}

func (db *InstrumentedDB) instrumentQuery(operation, table string, fn func() (*sql.Rows, error)) (*sql.Rows, error) {
    start := time.Now()
    defer func() {
        dbQueryDuration.WithLabelValues(operation, table).Observe(time.Since(start).Seconds())
    }()

    dbQueries.WithLabelValues(operation, table).Inc()
    
    rows, err := fn()
    if err != nil {
        dbErrors.WithLabelValues(operation, "query_error").Inc()
    }
    
    return rows, err
}
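
Wiring it up is straightforward; a minimal sketch, assuming a PostgreSQL instance and a placeholder DSN:

func main() {
    db, err := NewInstrumentedDB("postgres", "postgres://user:pass@localhost/app?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    rows, err := db.QueryContext(context.Background(), "SELECT id, name FROM users")
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()
}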

Go Runtime Metrics

Note that client_golang's default registry already ships a Go collector that exposes go_goroutines, go_threads, and the go_memstats_* family out of the box. The collector below re-implements part of that for illustration, so register it on a custom registry (or rename the clashing metrics) to avoid an AlreadyRegisteredError panic at startup.

A Runtime Collector

package metrics

import (
    "runtime"
    "runtime/metrics"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // GC metrics
    gcCycles = promauto.NewCounter(
        prometheus.CounterOpts{
            Name: "go_gc_cycles_total",
            Help: "Number of garbage collection cycles",
        },
    )

    gcPauseNs = promauto.NewHistogram(
        prometheus.HistogramOpts{
            Name:    "go_gc_pause_duration_seconds",
            Help:    "GC pause duration",
            Buckets: []float64{0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05},
        },
    )

    // Memory
    heapAlloc = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "go_memory_heap_alloc_bytes",
            Help: "Bytes allocated on the heap",
        },
    )

    heapSys = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "go_memory_heap_sys_bytes",
            Help: "Bytes obtained from the OS for the heap",
        },
    )

    // Goroutines (clashes with the default collector's go_goroutines)
    goroutines = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "go_goroutines",
            Help: "Number of active goroutines",
        },
    )

    // Scheduler parallelism (GOMAXPROCS, not the actual OS thread count)
    threads = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "go_threads",
            Help: "Maximum number of CPUs executing Go code (GOMAXPROCS)",
        },
    )
)

// StartRuntimeMetrics starts periodic runtime metric collection
func StartRuntimeMetrics() {
    // Initial collection
    collectRuntimeMetrics()

    // Periodic collection
    ticker := time.NewTicker(15 * time.Second)
    go func() {
        for range ticker.C {
            collectRuntimeMetrics()
        }
    }()
}

var lastNumGC uint32

func collectRuntimeMetrics() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)

    heapAlloc.Set(float64(m.HeapAlloc))
    heapSys.Set(float64(m.HeapSys))
    goroutines.Set(float64(runtime.NumGoroutine()))
    threads.Set(float64(runtime.GOMAXPROCS(0)))

    // m.NumGC is cumulative, so add only the delta since the last collection
    if m.NumGC > lastNumGC {
        gcCycles.Add(float64(m.NumGC - lastNumGC))

        // Observe each pause since the last collection; PauseNs is a circular
        // buffer of the last 256 pauses, indexed by GC cycle number
        for c := lastNumGC + 1; c <= m.NumGC; c++ {
            gcPauseNs.Observe(float64(m.PauseNs[(c+255)%256]) / 1e9)
        }
        lastNumGC = m.NumGC
    }
}

// Richer metrics via runtime/metrics (Go 1.16+)
func collectAdvancedMetrics() {
    samples := []metrics.Sample{
        {Name: "/sched/gomaxprocs:threads"},
        {Name: "/sched/goroutines:goroutines"},
        {Name: "/memory/classes/heap/free:bytes"},
        {Name: "/memory/classes/heap/objects:bytes"},
        {Name: "/gc/heap/allocs:bytes"},
        {Name: "/gc/heap/frees:bytes"},
    }

    metrics.Read(samples)

    for _, sample := range samples {
        switch sample.Value.Kind() {
        case metrics.KindUint64:
            _ = sample.Value.Uint64() // e.g. feed into a Gauge
        case metrics.KindFloat64:
            _ = sample.Value.Float64()
        }
    }
}
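
If you only need a handful of runtime/metrics samples, a lighter pattern is to expose each one as a GaugeFunc that reads the sample on every scrape; a minimal sketch (the metric name is illustrative and chosen to avoid clashing with the default go_goroutines):

func registerGoroutineGauge() {
    gauge := prometheus.NewGaugeFunc(
        prometheus.GaugeOpts{
            Name: "app_runtime_goroutines",
            Help: "Goroutine count read from runtime/metrics on each scrape",
        },
        func() float64 {
            s := []metrics.Sample{{Name: "/sched/goroutines:goroutines"}}
            metrics.Read(s)
            return float64(s[0].Value.Uint64())
        },
    )
    prometheus.MustRegister(gauge)
}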

Tracing and Context

Distributed Tracing

The examples below use OpenTracing; the project has since been archived in favor of OpenTelemetry, but the instrumentation pattern is the same.

package tracing

import (
    "context"
    "net/http"
    "time"

    "github.com/opentracing/opentracing-go"
    "github.com/opentracing/opentracing-go/ext"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    traceSpans = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "trace_spans_total",
            Help: "Total spans created",
        },
        []string{"operation", "service"},
    )

    traceDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "trace_span_duration_seconds",
            Help:    "Span duration",
            Buckets: prometheus.DefBuckets,
        },
        []string{"operation", "service"},
    )
)

// TracedHandler adds tracing to HTTP handlers
func TracedHandler(operation string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        tracer := opentracing.GlobalTracer()

        // Extract the parent span context from incoming headers (distributed tracing)
        spanCtx, _ := tracer.Extract(
            opentracing.HTTPHeaders,
            opentracing.HTTPHeadersCarrier(r.Header),
        )

        span := tracer.StartSpan(
            operation,
            ext.RPCServerOption(spanCtx),
        )
        defer span.Finish()

        // Attach the span to the request context
        ctx := opentracing.ContextWithSpan(r.Context(), span)

        start := time.Now()
        next(w, r.WithContext(ctx))
        duration := time.Since(start).Seconds()

        // Metrics
        traceSpans.WithLabelValues(operation, "api").Inc()
        traceDuration.WithLabelValues(operation, "api").Observe(duration)
    }
}

// TracedFunction instruments internal functions
func TracedFunction(ctx context.Context, operation string, fn func() error) error {
    span, ctx := opentracing.StartSpanFromContext(ctx, operation)
    defer span.Finish()

    start := time.Now()
    err := fn()
    duration := time.Since(start).Seconds()

    if err != nil {
        ext.Error.Set(span, true)
        span.LogKV("error", err.Error())
    }

    traceSpans.WithLabelValues(operation, "internal").Inc()
    traceDuration.WithLabelValues(operation, "internal").Observe(duration)

    return err
}
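
Calling it from request code then looks like this (queryOrders is a hypothetical helper):

func fetchOrders(ctx context.Context) error {
    return TracedFunction(ctx, "db.fetch_orders", func() error {
        return queryOrders(ctx)
    })
}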

Grafana Dashboards

An Example Dashboard (JSON)

{
  "dashboard": {
    "title": "Go Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "type": "stat",
        "targets": [{
          "expr": "sum(rate(http_requests_total[5m]))",
          "legendFormat": "req/s"
        }],
        "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0}
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "targets": [{
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
          "legendFormat": "%"
        }],
        "gridPos": {"h": 8, "w": 6, "x": 6, "y": 0}
      },
      {
        "title": "Latency P95",
        "type": "graph",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
          "legendFormat": "{{method}} {{endpoint}}"
        }],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0}
      },
      {
        "title": "Goroutines",
        "type": "graph",
        "targets": [{
          "expr": "go_goroutines",
          "legendFormat": "goroutines"
        }],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8}
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "go_memory_heap_alloc_bytes",
            "legendFormat": "Heap Alloc"
          },
          {
            "expr": "go_memory_heap_sys_bytes",
            "legendFormat": "Heap Sys"
          }
        ],
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8}
      }
    ]
  }
}
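
Rather than importing this JSON by hand, you can let Grafana load it from disk via dashboard provisioning; a minimal sketch (provider name, folder, and path are placeholders):

# /etc/grafana/provisioning/dashboards/go-apps.yaml
apiVersion: 1
providers:
  - name: 'go-apps'
    folder: 'Go'
    type: file
    options:
      path: /var/lib/grafana/dashboards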

Essential PromQL Queries

# Requests per second
sum(rate(http_requests_total[5m]))

# P95 latency per endpoint
histogram_quantile(0.95, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) 
/ 
sum(rate(http_requests_total[5m])) * 100

# Memory usage
process_resident_memory_bytes{job="api"}

# Goroutines growing (possible leak) - use deriv(), not rate(), on gauges
deriv(go_goroutines[5m]) > 10

# GC pressure (average pause, from the default collector's summary)
rate(go_gc_duration_seconds_sum[5m]) / rate(go_gc_duration_seconds_count[5m])

Alerting with Alertmanager

Alert Rules

# alerts.yml
groups:
  - name: api_alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) 
            / 
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on the API"
          description: "Error rate: {{ $value | humanizePercentage }}"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, 
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 500ms"

      # Too many goroutines (possible leak)
      - alert: GoroutineLeak
        expr: go_goroutines > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Possible goroutine leak"

      # API down
      - alert: APIDown
        expr: up{job="api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "API is down"

      # Long GC pauses
      - alert: LongGCPauses
        expr: |
          histogram_quantile(0.99, 
            sum(rate(go_gc_pause_duration_seconds_bucket[5m])) by (le)
          ) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GC pauses above 100ms"

Alertmanager Configuration

# alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'

route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-service-key'
        severity: critical

  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname']

Production Patterns

1. Controlled Cardinality

// ❌ Avoid: unbounded cardinality
httpRequests.WithLabelValues(r.Method, r.URL.Path, r.UserAgent()).Inc()
// Paths can contain IDs: /users/123, /users/456 → series explosion
// User-Agent is even worse: nearly one series per client

// ✅ Do: normalize paths and keep every label bounded
var idPattern = regexp.MustCompile(`/\d+`)

func normalizePath(path string) string {
    // Replace numeric IDs with a placeholder
    return idPattern.ReplaceAllString(path, "/:id")
}

httpRequests.WithLabelValues(
    r.Method,
    normalizePath(r.URL.Path),
    status, // a bounded set of values, unlike User-Agent
).Inc()
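
The same idea applies to status codes: collapsing them into classes keeps the label set bounded, as the FAQ below also suggests. A small helper sketch:

// statusClass buckets raw status codes into a fixed set of classes
func statusClass(code int) string {
    switch {
    case code >= 500:
        return "5xx"
    case code >= 400:
        return "4xx"
    case code >= 300:
        return "3xx"
    case code >= 200:
        return "2xx"
    default:
        return "1xx"
    }
}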

2. Health Metrics

var (
    healthCheck = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "health_check",
            Help: "Health check status (1 = healthy, 0 = unhealthy)",
        },
        []string{"check"},
    )

    dependencyUp = promauto.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "dependency_up",
            Help: "External dependency status",
        },
        []string{"name"},
    )
)

func healthHandler(w http.ResponseWriter, r *http.Request) {
    // checkDatabase, checkCache, and checkQueue are application-specific probes
    checks := map[string]bool{
        "database": checkDatabase(),
        "cache":    checkCache(),
        "queue":    checkQueue(),
    }

    allHealthy := true
    for name, healthy := range checks {
        if healthy {
            healthCheck.WithLabelValues(name).Set(1)
        } else {
            healthCheck.WithLabelValues(name).Set(0)
            allHealthy = false
        }
    }

    if allHealthy {
        w.WriteHeader(http.StatusOK)
    } else {
        w.WriteHeader(http.StatusServiceUnavailable)
    }
}
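
The dependencyUp gauge declared above can be fed by a background prober; a minimal sketch (the probe signature is an assumption of this example):

// watchDependency polls an external dependency and exposes its status as 0/1
func watchDependency(name string, probe func() bool) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        if probe() {
            dependencyUp.WithLabelValues(name).Set(1)
        } else {
            dependencyUp.WithLabelValues(name).Set(0)
        }
    }
}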

3. Graceful Shutdown

var shutdownTimestamp = promauto.NewGauge(
    prometheus.GaugeOpts{
        Name: "app_shutdown_timestamp_seconds", // illustrative metric name
        Help: "Unix time at which the last shutdown signal was received",
    },
)

func main() {
    srv := &http.Server{Addr: ":8080"}

    // Serve metrics on a separate port
    go func() {
        http.Handle("/metrics", promhttp.Handler())
        http.ListenAndServe(":9090", nil)
    }()

    // Graceful shutdown
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)

    go func() {
        <-quit

        // Record when shutdown began
        shutdownTimestamp.SetToCurrentTime()

        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()

        srv.Shutdown(ctx)
    }()

    srv.ListenAndServe()
}

Conclusion

In this guide, you learned:

✅ Fundamentals: the Prometheus data model and architecture
✅ Metrics: Counter, Gauge, Histogram, and Summary
✅ Instrumentation: HTTP middleware, database, runtime
✅ Tracing: distributed tracing with OpenTracing
✅ Visualization: Grafana dashboards and PromQL queries
✅ Alerting: alert rules and Alertmanager
✅ Production: cardinality control, health checks, graceful shutdown

Next Steps

  1. Go and Grafana - advanced dashboards and visualizations
  2. Go Observability - unified logs, metrics, and traces
  3. Go and Jaeger - end-to-end distributed tracing

FAQ

Q: What is the difference between Histogram and Summary? A: Histograms bucket observations so the server can compute quantiles; Summaries compute quantiles on the client. Histograms are more flexible for aggregation; Summaries give more precise quantiles.

Q: How do I avoid high cardinality? A: Avoid unique values (IDs, timestamps) in labels. Normalize paths, group status codes (2xx, 4xx, 5xx), and use enums for finite value sets.

Q: Should I expose metrics on the same port as the application? A: For simplicity, yes. For security, use a separate port or protect /metrics with authentication.

Q: What is the ideal scrape interval? A: 15-30s is common. Tune it for the granularity you need versus the overhead you can afford.