Setting Up Monitoring

Learn how to monitor Redis Cloud and Enterprise deployments using redisctl with various monitoring stacks.

Overview

Effective monitoring requires:

  • Regular health checks
  • Metric collection
  • Alert configuration
  • Dashboard visualization
  • Log aggregation

Monitoring Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  redisctl   │────▶│ Redis APIs   │────▶│   Metrics   │
│   Scripts   │     │ Cloud/Ent.   │     │  Exporters  │
└─────────────┘     └──────────────┘     └─────────────┘
                                                │
                                                ▼
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Grafana   │◀────│  Prometheus  │◀────│   Format    │
│  Dashboards │     │   Storage    │     │ Conversion  │
└─────────────┘     └──────────────┘     └─────────────┘
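
As a quick sanity check of the first stage in this pipeline, pull raw database data with redisctl and inspect it with jq (the profile name and subscription ID below are placeholders):

redisctl --profile prod-cloud cloud database list \
    --subscription-id 123456 -o json | jq '.[0]'

If this returns a JSON object with fields such as status and memoryUsageInMB, the scripts and exporters below have everything they need.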

Basic Health Monitoring

Health Check Script

Create a basic health monitor:

#!/bin/bash
# health-check.sh

set -euo pipefail

# Configuration
PROFILE="${REDIS_PROFILE:-prod-cloud}"
CHECK_INTERVAL="${CHECK_INTERVAL:-60}"
ALERT_WEBHOOK="${ALERT_WEBHOOK:-}"  # optional; empty disables webhook alerts (needed under set -u)

# Color codes for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
}

send_alert() {
    local level=$1
    local message=$2
    
    if [ -n "$ALERT_WEBHOOK" ]; then
        curl -X POST "$ALERT_WEBHOOK" \
            -H 'Content-Type: application/json' \
            -d "{\"level\": \"$level\", \"message\": \"$message\"}"
    fi
    
    case $level in
        ERROR)   echo -e "${RED}[ERROR]${NC} $message" ;;
        WARNING) echo -e "${YELLOW}[WARN]${NC} $message" ;;
        INFO)    echo -e "${GREEN}[INFO]${NC} $message" ;;
    esac
}

check_databases() {
    local subscription_id=$1
    
    # Get all databases
    local databases=$(redisctl --profile "$PROFILE" cloud database list \
        --subscription-id "$subscription_id" \
        -q "[].{id: databaseId, name: name, status: status}" 2>/dev/null)
    
    if [ -z "$databases" ]; then
        send_alert "ERROR" "Failed to fetch databases for subscription $subscription_id"
        return 1
    fi
    
    echo "$databases" | jq -c '.[]' | while read db; do
        local id=$(echo $db | jq -r .id)
        local name=$(echo $db | jq -r .name)
        local status=$(echo $db | jq -r .status)
        
        if [ "$status" != "active" ]; then
            send_alert "ERROR" "Database $name ($id) is not active: $status"
        else
            log "Database $name ($id) is healthy"
        fi
    done
}

# Main monitoring loop
while true; do
    log "Starting health check..."
    
    # Get all subscriptions
    SUBSCRIPTIONS=$(redisctl --profile "$PROFILE" cloud subscription list \
        -q "[].id" 2>/dev/null | jq -r '.[]')
    
    for sub_id in $SUBSCRIPTIONS; do
        check_databases "$sub_id"
    done
    
    log "Health check complete. Sleeping for ${CHECK_INTERVAL}s..."
    sleep $CHECK_INTERVAL
done
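
To keep the monitor running in the background and capture its output, a simple invocation like this works (the webhook URL and log path are examples):

chmod +x health-check.sh
ALERT_WEBHOOK="https://hooks.example.com/redis-alerts" \
CHECK_INTERVAL=120 \
nohup ./health-check.sh >> /var/log/redis-health.log 2>&1 &

For production use, a process supervisor such as systemd is a better fit than nohup, since it restarts the script on failure.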

Prometheus Integration

Metrics Exporter

Create a Prometheus exporter for Redis metrics:

#!/usr/bin/env python3
# redis_exporter.py

import json
import os
import subprocess
import time

from prometheus_client import Gauge, start_http_server

# Prometheus metrics
db_memory_used = Gauge('redis_memory_used_mb', 'Memory used in MB', ['database', 'subscription'])
db_memory_limit = Gauge('redis_memory_limit_gb', 'Memory limit in GB', ['database', 'subscription'])
db_connections = Gauge('redis_connections_used', 'Connections used', ['database', 'subscription'])
db_ops = Gauge('redis_operations_per_second', 'Operations per second', ['database', 'subscription'])
db_status = Gauge('redis_database_status', 'Database status (1=active, 0=inactive)', ['database', 'subscription'])

def get_databases(profile, subscription_id):
    """Fetch database list using redisctl"""
    cmd = [
        'redisctl', '--profile', profile, 'cloud', 'database', 'list',
        '--subscription-id', str(subscription_id), '-o', 'json'
    ]
    
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return json.loads(result.stdout)
    except Exception as e:
        print(f"Error fetching databases: {e}")
        return []

def get_database_details(profile, subscription_id, database_id):
    """Fetch detailed database metrics"""
    cmd = [
        'redisctl', '--profile', profile, 'cloud', 'database', 'get',
        '--subscription-id', str(subscription_id),
        '--database-id', str(database_id),
        '-o', 'json'
    ]
    
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return json.loads(result.stdout)
    except Exception as e:
        print(f"Error fetching database {database_id}: {e}")
        return None

def collect_metrics():
    """Collect metrics from all databases"""
    profile = os.getenv('REDIS_PROFILE', 'prod-cloud')
    subscriptions = os.getenv('REDIS_SUBSCRIPTIONS', '').split(',')
    
    for sub_id in subscriptions:
        if not sub_id:
            continue
            
        databases = get_databases(profile, sub_id)
        
        for db in databases:
            db_id = db.get('databaseId')
            db_name = db.get('name', f'db-{db_id}')
            
            # Get detailed metrics
            details = get_database_details(profile, sub_id, db_id)
            if not details:
                continue
            
            # Update Prometheus metrics
            labels = {'database': db_name, 'subscription': sub_id}
            
            db_memory_used.labels(**labels).set(details.get('memoryUsageInMB', 0))
            db_memory_limit.labels(**labels).set(details.get('memoryLimitInGB', 0))
            db_connections.labels(**labels).set(details.get('connectionsUsed', 0))
            
            throughput = details.get('throughputMeasurement', {})
            db_ops.labels(**labels).set(throughput.get('value', 0))
            
            status_value = 1 if details.get('status') == 'active' else 0
            db_status.labels(**labels).set(status_value)
            
            print(f"Updated metrics for {db_name}")

def main():
    """Main exporter loop"""
    # Default to 8000 so the exporter does not clash with Prometheus itself (9090)
    port = int(os.getenv('EXPORTER_PORT', '8000'))
    interval = int(os.getenv('SCRAPE_INTERVAL', '30'))
    
    # Start Prometheus HTTP server
    start_http_server(port)
    print(f"Exporter listening on port {port}")
    
    while True:
        try:
            collect_metrics()
        except Exception as e:
            print(f"Error collecting metrics: {e}")
        
        time.sleep(interval)

if __name__ == '__main__':
    main()
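
To run the exporter, install the client library and set the environment variables it reads (the subscription IDs below are placeholders):

pip install prometheus-client
REDIS_PROFILE=prod-cloud \
REDIS_SUBSCRIPTIONS=123456,789012 \
EXPORTER_PORT=8000 \
python3 redis_exporter.py

# Verify metrics are being served
curl -s localhost:8000/metrics | grep '^redis_'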

Prometheus Configuration

Configure Prometheus to scrape the exporter:

# prometheus.yml
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'redis-metrics'
    static_configs:
      - targets: ['localhost:8000']
        labels:
          environment: 'production'
          service: 'redis'

# Alert rules
rule_files:
  - 'redis_alerts.yml'
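
Before reloading Prometheus, validate the configuration with promtool (shipped alongside Prometheus):

promtool check config prometheus.yml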

Alert Rules

Define Prometheus alert rules:

# redis_alerts.yml
groups:
  - name: redis_alerts
    interval: 30s
    rules:
      - alert: RedisHighMemoryUsage
        expr: |
          (redis_memory_used_mb / (redis_memory_limit_gb * 1024)) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.database }}"
          description: "Database {{ $labels.database }} is using {{ $value | humanizePercentage }} of available memory"
      
      - alert: RedisDatabaseDown
        expr: redis_database_status == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database {{ $labels.database }} is down"
          description: "Database {{ $labels.database }} has been inactive for more than 2 minutes"
      
      - alert: RedisHighConnections
        expr: redis_connections_used > 900
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High connection count on {{ $labels.database }}"
          description: "Database {{ $labels.database }} has {{ $value }} active connections"
      
      - alert: RedisLowThroughput
        expr: redis_operations_per_second < 100
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "Low throughput on {{ $labels.database }}"
          description: "Database {{ $labels.database }} has only {{ $value }} ops/sec"

Grafana Dashboards

Dashboard Configuration

Create a comprehensive Grafana dashboard:

{
  "dashboard": {
    "title": "Redis Production Monitoring",
    "panels": [
      {
        "title": "Database Status",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(redis_database_status)",
            "legendFormat": "Active Databases"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "redis_memory_used_mb",
            "legendFormat": "{{ database }}"
          }
        ]
      },
      {
        "title": "Operations/Second",
        "type": "graph",
        "targets": [
          {
            "expr": "redis_operations_per_second",
            "legendFormat": "{{ database }}"
          }
        ]
      },
      {
        "title": "Connection Count",
        "type": "graph",
        "targets": [
          {
            "expr": "redis_connections_used",
            "legendFormat": "{{ database }}"
          }
        ]
      }
    ]
  }
}
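
Because the JSON above already nests the panel definitions under a dashboard key, it can be pushed straight to Grafana's dashboard API (host and credentials are placeholders; overwrite replaces any existing dashboard with the same title):

curl -X POST "http://admin:admin@localhost:3000/api/dashboards/db" \
    -H 'Content-Type: application/json' \
    -d "$(jq '. + {overwrite: true}' dashboard.json)"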

Log Monitoring

Centralized Logging with ELK

Ship Redis logs to Elasticsearch:

#!/bin/bash
# ship-logs.sh

# For Redis Enterprise
redisctl enterprise logs list \
  --profile prod-enterprise \
  --output json | \
  jq -c '.[] | {
    "@timestamp": .time,
    "level": .severity,
    "message": .message,
    "node": .node_uid,
    "component": .component
  }' | \
  while read -r log; do
    curl -X POST "http://elasticsearch:9200/redis-logs/_doc" \
      -H 'Content-Type: application/json' \
      -d "$log"
  done
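
Posting one document per log line is fine for small volumes; for anything larger, Elasticsearch's _bulk API cuts the request count dramatically. A sketch of the same shipper in bulk form, using the field mapping above:

redisctl enterprise logs list --profile prod-enterprise --output json | \
  jq -c '.[] | ({index: {}},
                {"@timestamp": .time, level: .severity, message: .message,
                 node: .node_uid, component: .component})' \
  > bulk.ndjson

curl -s -X POST "http://elasticsearch:9200/redis-logs/_bulk" \
  -H 'Content-Type: application/x-ndjson' \
  --data-binary @bulk.ndjson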

Logstash Configuration

Process logs with Logstash:

# logstash.conf
input {
  exec {
    command => "redisctl enterprise logs list --output json"
    interval => 60
    codec => "json"
  }
}

filter {
  date {
    match => [ "time", "ISO8601" ]
    target => "@timestamp"
  }
  
  mutate {
    add_field => { "environment" => "production" }
  }
  
  if [severity] == "error" {
    mutate {
      add_tag => [ "alert" ]
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "redis-logs-%{+YYYY.MM.dd}"
  }
  
  if "alert" in [tags] {
    email {
      to => "ops-team@example.com"
      subject => "Redis Error Alert"
      body => "Error detected: %{message}"
    }
  }
}
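
Logstash can verify the pipeline syntax without starting it:

bin/logstash --config.test_and_exit -f logstash.conf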

Alerting Integration

Slack Notifications

Send alerts to Slack:

#!/bin/bash
# slack-alert.sh

send_slack_alert() {
    local level=$1
    local message=$2
    local webhook_url="${SLACK_WEBHOOK_URL}"
    
    local color="good"
    case $level in
        ERROR)   color="danger" ;;
        WARNING) color="warning" ;;
    esac
    
    curl -X POST "$webhook_url" \
        -H 'Content-Type: application/json' \
        -d "{
            \"attachments\": [{
                \"color\": \"$color\",
                \"title\": \"Redis Alert: $level\",
                \"text\": \"$message\",
                \"footer\": \"redisctl monitoring\",
                \"ts\": $(date +%s)
            }]
        }"
}

# Monitor and alert
while true; do
    STATUS=$(redisctl cloud database get \
        --subscription-id 123456 \
        --database-id 789 \
        -q "status")
    
    if [ "$STATUS" != "active" ]; then
        send_slack_alert "ERROR" "Database 789 is $STATUS"
    fi
    
    sleep 60
done
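
Before wiring the loop into production, test the webhook itself with a plain message (the URL is a placeholder):

curl -X POST "$SLACK_WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d '{"text": "redisctl monitoring: test message"}'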

PagerDuty Integration

Integrate with PagerDuty for critical alerts:

#!/usr/bin/env python3
# pagerduty_alert.py

import pdpyras
import subprocess
import json
import os

def check_redis_health():
    """Check Redis database health"""
    cmd = [
        'redisctl', 'cloud', 'database', 'list',
        '--subscription-id', os.getenv('SUBSCRIPTION_ID'),
        '-o', 'json'
    ]
    
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    databases = json.loads(result.stdout)
    
    alerts = []
    for db in databases:
        if db['status'] != 'active':
            alerts.append({
                'database': db['name'],
                'status': db['status'],
                'id': db['databaseId']
            })
    
    return alerts

def send_pagerduty_alert(session, alerts):
    """Send alerts via the PagerDuty Events API v2"""
    for alert in alerts:
        session.trigger(
            summary=f"Redis database {alert['database']} is {alert['status']}",
            source="redisctl-monitoring",
            severity="error",
            custom_details=alert
        )

def main():
    # The Events API v2 authenticates with a service integration (routing) key
    routing_key = os.getenv('PAGERDUTY_ROUTING_KEY')
    session = pdpyras.EventsAPISession(routing_key)
    
    alerts = check_redis_health()
    if alerts:
        send_pagerduty_alert(session, alerts)

if __name__ == '__main__':
    main()
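
To run it, install pdpyras and provide the Events API v2 routing key from your PagerDuty service integration (values below are placeholders):

pip install pdpyras
export PAGERDUTY_ROUTING_KEY=your-32-char-integration-key
export SUBSCRIPTION_ID=123456
python3 pagerduty_alert.py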

Custom Metrics Collection

Performance Baseline

Establish performance baselines:

#!/bin/bash
# baseline.sh

# Collect baseline metrics for 24 hours
DURATION=86400
INTERVAL=60
OUTPUT="baseline_$(date +%Y%m%d).csv"

echo "timestamp,database,ops,latency,memory,cpu" > $OUTPUT

END=$(($(date +%s) + DURATION))
while [ $(date +%s) -lt $END ]; do
    TIMESTAMP=$(date +%s)
    
    redisctl cloud database get \
        --subscription-id 123456 \
        --database-id 789 \
        -o json | \
    jq -r "\"$TIMESTAMP,prod-db,\(.throughputMeasurement.value),\(.latency),\(.memoryUsageInMB),\(.cpuUsagePercentage)\"" \
        >> "$OUTPUT"
    
    sleep $INTERVAL
done

# Analyze baseline
echo "Baseline collection complete. Analyzing..."
python3 analyze_baseline.py "$OUTPUT"
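
The analyze_baseline.py script is not shown here; as a quick stand-in, simple averages can be computed straight from the CSV with awk:

awk -F, 'NR > 1 { ops += $3; lat += $4; mem += $5; n++ }
         END { if (n) printf "avg ops=%.1f  latency=%.2f  memory=%.1f MB  (%d samples)\n",
                             ops/n, lat/n, mem/n, n }' "baseline_$(date +%Y%m%d).csv"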

Automation with Cron

Schedule monitoring tasks:

# crontab -e

# Health check every 5 minutes
*/5 * * * * /opt/monitoring/health-check.sh

# Collect metrics every minute
* * * * * /opt/monitoring/collect-metrics.sh

# Daily report
0 8 * * * /opt/monitoring/daily-report.sh

# Weekly capacity planning
0 0 * * 0 /opt/monitoring/capacity-planning.sh

# Backup monitoring config
0 2 * * * /opt/monitoring/backup-monitoring.sh
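
By default cron mails job output to the local user (or discards it if no mailer is configured); redirecting each job to a log file keeps a record for debugging (paths are examples):

*/5 * * * * /opt/monitoring/health-check.sh >> /var/log/monitoring/health.log 2>&1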

Best Practices

  1. Monitor proactively - Set up alerts before issues occur
  2. Use multiple data sources - Combine metrics, logs, and traces
  3. Set appropriate thresholds - Avoid alert fatigue
  4. Automate responses - Use runbooks for common issues
  5. Track trends - Look for patterns over time
  6. Test alert paths - Ensure alerts reach the right people
  7. Document procedures - Have clear escalation paths
  8. Review regularly - Update monitoring as systems evolve

Next Steps