Setting Up Monitoring
Learn how to monitor Redis Cloud and Enterprise deployments using redisctl with various monitoring stacks.
Overview
Effective monitoring requires:
- Regular health checks
- Metric collection
- Alert configuration
- Dashboard visualization
- Log aggregation
Monitoring Architecture
```
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  redisctl   │────▶│  Redis APIs  │────▶│   Metrics   │
│  Scripts    │     │  Cloud/Ent.  │     │  Exporters  │
└─────────────┘     └──────────────┘     └─────────────┘
                                                │
                                                ▼
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│   Grafana   │◀────│  Prometheus  │◀────│   Format    │
│ Dashboards  │     │   Storage    │     │ Conversion  │
└─────────────┘     └──────────────┘     └─────────────┘
```
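The "Format Conversion" stage above turns redisctl's JSON output into something Prometheus can scrape. A minimal sketch, assuming a database record with `name`, `status`, and `memoryUsageInMB` fields (the metric names here are illustrative):

```python
def to_prometheus(db: dict) -> list[str]:
    """Render one database record as Prometheus exposition-format lines."""
    labels = f'database="{db["name"]}"'
    return [
        f'redis_memory_used_mb{{{labels}}} {db.get("memoryUsageInMB", 0)}',
        f'redis_database_status{{{labels}}} {1 if db.get("status") == "active" else 0}',
    ]

record = {"name": "cache-prod", "status": "active", "memoryUsageInMB": 512}
for line in to_prometheus(record):
    print(line)
```

The sections below wrap this same idea in a long-running exporter process.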
Basic Health Monitoring
Health Check Script
Create a basic health monitor:
```bash
#!/bin/bash
# health-check.sh
set -euo pipefail

# Configuration
PROFILE="${REDIS_PROFILE:-prod-cloud}"
CHECK_INTERVAL="${CHECK_INTERVAL:-60}"
ALERT_WEBHOOK="${ALERT_WEBHOOK:-}"   # optional; empty disables webhook alerts

# Color codes for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
}

send_alert() {
    local level=$1
    local message=$2

    if [ -n "$ALERT_WEBHOOK" ]; then
        curl -s -X POST "$ALERT_WEBHOOK" \
            -H 'Content-Type: application/json' \
            -d "{\"level\": \"$level\", \"message\": \"$message\"}"
    fi

    case $level in
        ERROR)   echo -e "${RED}[ERROR]${NC} $message" ;;
        WARNING) echo -e "${YELLOW}[WARN]${NC} $message" ;;
        INFO)    echo -e "${GREEN}[INFO]${NC} $message" ;;
    esac
}

check_databases() {
    local subscription_id=$1

    # Get all databases
    local databases
    databases=$(redisctl --profile "$PROFILE" cloud database list \
        --subscription-id "$subscription_id" \
        -q "[].{id: databaseId, name: name, status: status}" 2>/dev/null)

    if [ -z "$databases" ]; then
        send_alert "ERROR" "Failed to fetch databases for subscription $subscription_id"
        return 1
    fi

    echo "$databases" | jq -c '.[]' | while read -r db; do
        local id name status
        id=$(echo "$db" | jq -r .id)
        name=$(echo "$db" | jq -r .name)
        status=$(echo "$db" | jq -r .status)

        if [ "$status" != "active" ]; then
            send_alert "ERROR" "Database $name ($id) is not active: $status"
        else
            log "Database $name ($id) is healthy"
        fi
    done
}

# Main monitoring loop
while true; do
    log "Starting health check..."

    # Get all subscriptions
    SUBSCRIPTIONS=$(redisctl --profile "$PROFILE" cloud subscription list \
        -q "[].id" 2>/dev/null | jq -r '.[]')

    for sub_id in $SUBSCRIPTIONS; do
        # Don't let one failing subscription abort the loop under `set -e`
        check_databases "$sub_id" || true
    done

    log "Health check complete. Sleeping for ${CHECK_INTERVAL}s..."
    sleep "$CHECK_INTERVAL"
done
```
Prometheus Integration
Metrics Exporter
Create a Prometheus exporter for Redis metrics:
```python
#!/usr/bin/env python3
# redis_exporter.py
import json
import os
import subprocess
import time

from prometheus_client import start_http_server, Gauge

# Prometheus metrics
db_memory_used = Gauge('redis_memory_used_mb', 'Memory used in MB', ['database', 'subscription'])
db_memory_limit = Gauge('redis_memory_limit_gb', 'Memory limit in GB', ['database', 'subscription'])
db_connections = Gauge('redis_connections_used', 'Connections used', ['database', 'subscription'])
db_ops = Gauge('redis_operations_per_second', 'Operations per second', ['database', 'subscription'])
db_status = Gauge('redis_database_status', 'Database status (1=active, 0=inactive)', ['database', 'subscription'])


def get_databases(profile, subscription_id):
    """Fetch the database list using redisctl."""
    cmd = [
        'redisctl', '--profile', profile, 'cloud', 'database', 'list',
        '--subscription-id', str(subscription_id), '-o', 'json'
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return json.loads(result.stdout)
    except (subprocess.CalledProcessError, json.JSONDecodeError) as e:
        print(f"Error fetching databases: {e}")
        return []


def get_database_details(profile, subscription_id, database_id):
    """Fetch detailed database metrics."""
    cmd = [
        'redisctl', '--profile', profile, 'cloud', 'database', 'get',
        '--subscription-id', str(subscription_id),
        '--database-id', str(database_id),
        '-o', 'json'
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return json.loads(result.stdout)
    except (subprocess.CalledProcessError, json.JSONDecodeError) as e:
        print(f"Error fetching database {database_id}: {e}")
        return None


def collect_metrics():
    """Collect metrics from all databases."""
    profile = os.getenv('REDIS_PROFILE', 'prod-cloud')
    subscriptions = os.getenv('REDIS_SUBSCRIPTIONS', '').split(',')

    for sub_id in subscriptions:
        if not sub_id:
            continue

        for db in get_databases(profile, sub_id):
            db_id = db.get('databaseId')
            db_name = db.get('name', f'db-{db_id}')

            # Get detailed metrics
            details = get_database_details(profile, sub_id, db_id)
            if not details:
                continue

            # Update Prometheus metrics
            labels = {'database': db_name, 'subscription': sub_id}
            db_memory_used.labels(**labels).set(details.get('memoryUsageInMB', 0))
            db_memory_limit.labels(**labels).set(details.get('memoryLimitInGB', 0))
            db_connections.labels(**labels).set(details.get('connectionsUsed', 0))

            throughput = details.get('throughputMeasurement', {})
            db_ops.labels(**labels).set(throughput.get('value', 0))

            status_value = 1 if details.get('status') == 'active' else 0
            db_status.labels(**labels).set(status_value)

            print(f"Updated metrics for {db_name}")


def main():
    """Main exporter loop."""
    port = int(os.getenv('EXPORTER_PORT', '9090'))
    interval = int(os.getenv('SCRAPE_INTERVAL', '30'))

    # Start the Prometheus HTTP server
    start_http_server(port)
    print(f"Exporter listening on port {port}")

    while True:
        try:
            collect_metrics()
        except Exception as e:
            print(f"Error collecting metrics: {e}")
        time.sleep(interval)


if __name__ == '__main__':
    main()
```
Prometheus Configuration
Configure Prometheus to scrape the exporter:
```yaml
# prometheus.yml
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'redis-metrics'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          environment: 'production'
          service: 'redis'

# Alert rules
rule_files:
  - 'redis_alerts.yml'
```
Alert Rules
Define Prometheus alert rules:
```yaml
# redis_alerts.yml
groups:
  - name: redis_alerts
    interval: 30s
    rules:
      - alert: RedisHighMemoryUsage
        expr: |
          (redis_memory_used_mb / (redis_memory_limit_gb * 1024)) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.database }}"
          description: "Database {{ $labels.database }} is using {{ $value | humanizePercentage }} of available memory"

      - alert: RedisDatabaseDown
        expr: redis_database_status == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database {{ $labels.database }} is down"
          description: "Database {{ $labels.database }} has been inactive for more than 2 minutes"

      - alert: RedisHighConnections
        expr: redis_connections_used > 900
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High connection count on {{ $labels.database }}"
          description: "Database {{ $labels.database }} has {{ $value }} active connections"

      - alert: RedisLowThroughput
        expr: redis_operations_per_second < 100
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "Low throughput on {{ $labels.database }}"
          description: "Database {{ $labels.database }} has only {{ $value }} ops/sec"
```
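The `RedisHighMemoryUsage` expression handles a unit mismatch: the exporter reports usage in MB but the limit in GB, so the limit is scaled by 1024 before dividing. A quick check of that arithmetic:

```python
def memory_ratio(used_mb: float, limit_gb: float) -> float:
    # Mirrors the RedisHighMemoryUsage expression: the limit is reported
    # in GB, so convert it to MB before dividing.
    return used_mb / (limit_gb * 1024)

assert memory_ratio(850, 1) > 0.8    # fires the warning
assert memory_ratio(700, 1) <= 0.8   # stays quiet
print(round(memory_ratio(850, 1), 3))
```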
Grafana Dashboards
Dashboard Configuration
Create a Grafana dashboard covering database status, memory, throughput, and connections:
```json
{
  "dashboard": {
    "title": "Redis Production Monitoring",
    "panels": [
      {
        "title": "Database Status",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(redis_database_status)",
            "legendFormat": "Active Databases"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "redis_memory_used_mb",
            "legendFormat": "{{ database }}"
          }
        ]
      },
      {
        "title": "Operations/Second",
        "type": "graph",
        "targets": [
          {
            "expr": "redis_operations_per_second",
            "legendFormat": "{{ database }}"
          }
        ]
      },
      {
        "title": "Connection Count",
        "type": "graph",
        "targets": [
          {
            "expr": "redis_connections_used",
            "legendFormat": "{{ database }}"
          }
        ]
      }
    ]
  }
}
```
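To load a dashboard like this programmatically, Grafana's HTTP API accepts it at `POST /api/dashboards/db`, wrapped in a small envelope. A sketch of building that request body (`overwrite` replaces an existing dashboard with the same identity):

```python
import json

def import_payload(dashboard: dict) -> str:
    """Wrap a dashboard definition in the body expected by Grafana's
    POST /api/dashboards/db endpoint."""
    return json.dumps({"dashboard": dashboard, "overwrite": True})

body = import_payload({"title": "Redis Production Monitoring", "panels": []})
print(body)
```

POST the result to your Grafana instance with a `Content-Type: application/json` header and an API token in the `Authorization` header.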
Log Monitoring
Centralized Logging with ELK
Ship Redis logs to Elasticsearch:
```bash
#!/bin/bash
# ship-logs.sh

# For Redis Enterprise
redisctl enterprise logs list \
    --profile prod-enterprise \
    --output json | \
jq -c '.[] | {
    "@timestamp": .time,
    "level": .severity,
    "message": .message,
    "node": .node_uid,
    "component": .component
}' | \
while read -r log; do
    curl -s -X POST "http://elasticsearch:9200/redis-logs/_doc" \
        -H 'Content-Type: application/json' \
        -d "$log"
done
```
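Posting one document per log line works, but at volume Elasticsearch's `_bulk` endpoint is far cheaper: the whole batch is sent as NDJSON, one action line followed by one source line per document. A sketch of building that body:

```python
import json

def bulk_body(index: str, docs: list[dict]) -> str:
    """Build an Elasticsearch _bulk request body (NDJSON): an action line
    plus a source line per document, terminated by a trailing newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

body = bulk_body("redis-logs", [{"level": "error", "message": "node down"}])
print(body, end="")
```

POST the result to `/_bulk` with `Content-Type: application/x-ndjson`.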
Logstash Configuration
Process logs with Logstash:
```conf
# logstash.conf
input {
  exec {
    command => "redisctl enterprise logs list --output json"
    interval => 60
    codec => "json"
  }
}

filter {
  date {
    match => [ "time", "ISO8601" ]
    target => "@timestamp"
  }
  mutate {
    add_field => { "environment" => "production" }
  }
  if [severity] == "error" {
    mutate {
      add_tag => [ "alert" ]
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "redis-logs-%{+YYYY.MM.dd}"
  }
  if "alert" in [tags] {
    email {
      to => "ops-team@example.com"
      subject => "Redis Error Alert"
      body => "Error detected: %{message}"
    }
  }
}
```
Alerting Integration
Slack Notifications
Send alerts to Slack:
```bash
#!/bin/bash
# slack-alert.sh

send_slack_alert() {
    local level=$1
    local message=$2
    local webhook_url="${SLACK_WEBHOOK_URL}"
    local color="good"

    case $level in
        ERROR)   color="danger" ;;
        WARNING) color="warning" ;;
    esac

    curl -s -X POST "$webhook_url" \
        -H 'Content-Type: application/json' \
        -d "{
            \"attachments\": [{
                \"color\": \"$color\",
                \"title\": \"Redis Alert: $level\",
                \"text\": \"$message\",
                \"footer\": \"redisctl monitoring\",
                \"ts\": $(date +%s)
            }]
        }"
}

# Monitor and alert
while true; do
    STATUS=$(redisctl cloud database get \
        --subscription-id 123456 \
        --database-id 789 \
        -q "status")

    if [ "$STATUS" != "active" ]; then
        send_slack_alert "ERROR" "Database 789 is $STATUS"
    fi

    sleep 60
done
```
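As written, the loop re-sends the same alert every 60 seconds for as long as the database stays unhealthy. A small cooldown guard (a hypothetical helper, not part of redisctl) keeps the channel usable:

```python
import time

def should_alert(state: dict, key: str, cooldown: int = 900, now=None) -> bool:
    """Fire at most once per `cooldown` seconds for the same alert key.
    `state` maps alert keys to the timestamp of the last alert sent."""
    now = time.time() if now is None else now
    last = state.get(key)
    if last is not None and now - last < cooldown:
        return False
    state[key] = now
    return True

state = {}
print(should_alert(state, "db-789:down", now=0))     # True: first alert fires
print(should_alert(state, "db-789:down", now=60))    # False: within cooldown
print(should_alert(state, "db-789:down", now=1000))  # True: cooldown elapsed
```

The same pattern drops into the PagerDuty script below unchanged; PagerDuty's own `dedup_key` mechanism is another way to get this server-side.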
PagerDuty Integration
Integrate with PagerDuty for critical alerts:
```python
#!/usr/bin/env python3
# pagerduty_alert.py
import json
import os
import subprocess

import pdpyras


def check_redis_health():
    """Check Redis database health."""
    cmd = [
        'redisctl', 'cloud', 'database', 'list',
        '--subscription-id', os.getenv('SUBSCRIPTION_ID', ''),
        '-o', 'json'
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    databases = json.loads(result.stdout)

    alerts = []
    for db in databases:
        if db['status'] != 'active':
            alerts.append({
                'database': db['name'],
                'status': db['status'],
                'id': db['databaseId']
            })
    return alerts


def send_pagerduty_alerts(session, alerts):
    """Send each alert to PagerDuty via the Events API v2."""
    for alert in alerts:
        session.trigger(
            summary=f"Redis database {alert['database']} is {alert['status']}",
            source="redisctl-monitoring",
            severity="error",
            custom_details=alert,
        )


def main():
    # Events API v2 integration (routing) key, not a REST API key
    routing_key = os.getenv('PAGERDUTY_ROUTING_KEY')
    session = pdpyras.EventsAPISession(routing_key)

    alerts = check_redis_health()
    if alerts:
        send_pagerduty_alerts(session, alerts)


if __name__ == '__main__':
    main()
```
Custom Metrics Collection
Performance Baseline
Establish performance baselines:
```bash
#!/bin/bash
# baseline.sh

# Collect baseline metrics for 24 hours
DURATION=86400
INTERVAL=60
OUTPUT="baseline_$(date +%Y%m%d).csv"

echo "timestamp,database,ops,latency,memory,cpu" > "$OUTPUT"

END=$(($(date +%s) + DURATION))
while [ "$(date +%s)" -lt "$END" ]; do
    TIMESTAMP=$(date +%s)

    redisctl cloud database get \
        --subscription-id 123456 \
        --database-id 789 \
        -o json | \
    jq -r "\"$TIMESTAMP,prod-db,\(.throughputMeasurement.value),\(.latency),\(.memoryUsageInMB),\(.cpuUsagePercentage)\"" \
        >> "$OUTPUT"

    sleep "$INTERVAL"
done

# Analyze baseline
echo "Baseline collection complete. Analyzing..."
python3 analyze_baseline.py "$OUTPUT"
```
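`analyze_baseline.py` is not shown here; as one possible sketch, this summarizes the collected CSV using only the standard library (column names match the header that `baseline.sh` writes):

```python
import csv
import io
import statistics

def summarize(csv_text: str) -> dict:
    """Summarize a baseline CSV (timestamp,database,ops,latency,memory,cpu):
    mean and p95 ops/sec plus peak memory."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    ops = sorted(float(r["ops"]) for r in rows)
    p95 = ops[min(len(ops) - 1, int(0.95 * len(ops)))]
    return {
        "mean_ops": statistics.mean(ops),
        "p95_ops": p95,
        "peak_memory_mb": max(float(r["memory"]) for r in rows),
    }

sample = (
    "timestamp,database,ops,latency,memory,cpu\n"
    "1,db,100,1,512,10\n"
    "2,db,200,1,600,20\n"
    "3,db,150,1,550,15\n"
)
s = summarize(sample)
print(s["mean_ops"], s["p95_ops"], s["peak_memory_mb"])  # → 150.0 200.0 600.0
```

These numbers become the starting point for the alert thresholds defined earlier: a threshold well above p95 avoids paging on normal variation.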
Automation with Cron
Schedule monitoring tasks:
```cron
# crontab -e

# Health check every 5 minutes
*/5 * * * * /opt/monitoring/health-check.sh

# Collect metrics every minute
* * * * * /opt/monitoring/collect-metrics.sh

# Daily report
0 8 * * * /opt/monitoring/daily-report.sh

# Weekly capacity planning
0 0 * * 0 /opt/monitoring/capacity-planning.sh

# Backup monitoring config
0 2 * * * /opt/monitoring/backup-monitoring.sh
```
Best Practices
- Monitor proactively - Set up alerts before issues occur
- Use multiple data sources - Combine metrics, logs, and traces
- Set appropriate thresholds - Avoid alert fatigue
- Automate responses - Use runbooks for common issues
- Track trends - Look for patterns over time
- Test alert paths - Ensure alerts reach the right people
- Document procedures - Have clear escalation paths
- Review regularly - Update monitoring as systems evolve