Configuration¶
This guide covers performance tuning and configuration options.
Connection URL¶
polars-redis uses standard Redis URLs:
```python
# Local Redis
url = "redis://localhost:6379"

# With password
url = "redis://:password@localhost:6379"

# With username and password
url = "redis://user:password@localhost:6379"

# Specific database
url = "redis://localhost:6379/1"

# TLS
url = "rediss://localhost:6379"
```
Batch Size¶
The `batch_size` parameter controls how many keys are processed per Arrow batch.
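As a sketch (reusing the `scan_hashes` call shown later in this guide; the import name, URL, and schema are illustrative assumptions):

```python
import polars as pl
import polars_redis as redis  # import name assumed from the package name

url = "redis://localhost:6379"
schema = {"name": pl.Utf8, "age": pl.Int64}  # illustrative schema

lf = redis.scan_hashes(
    url,
    pattern="user:*",
    schema=schema,
    batch_size=2000,  # keys per Arrow batch (default: 1000)
)
```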
Tuning Guidelines¶
| Batch Size | Memory | Latency | Use Case |
|---|---|---|---|
| 100-500 | Low | Higher | Memory-constrained, streaming |
| 1000 | Medium | Balanced | General purpose (default) |
| 5000-10000 | Higher | Lower | Large datasets, fast networks |
Count Hint¶
The `count_hint` parameter suggests to Redis how many keys to return per SCAN iteration.
Note
This is a hint, not a guarantee. Redis may return more or fewer keys.
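A sketch in the same style as the other examples in this guide (`url` and `schema` as defined earlier):

```python
lf = redis.scan_hashes(
    url,
    pattern="user:*",
    schema=schema,
    count_hint=500,  # ask SCAN for roughly 500 keys per iteration
)
```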
Tuning Guidelines¶
- Low values (10-50): More SCAN iterations, lower memory per iteration
- High values (500-1000): Fewer iterations, higher throughput
Parallel Fetching¶
The `parallel` parameter enables parallel data fetching with multiple workers:
```python
lf = redis.scan_hashes(
    url,
    pattern="user:*",
    schema=schema,
    parallel=4,  # Use 4 parallel workers
)
```
Each batch of keys is split across workers, with results collected in order.
Tuning Guidelines¶
| Workers | Use Case |
|---|---|
| None (default) | Small datasets, simple queries |
| 2-4 | Medium datasets, local Redis |
| 4-8 | Large datasets, remote Redis with higher latency |
Note
Parallel fetching uses multiple Redis connections. Ensure your Redis server and connection pool can handle the additional concurrent connections.
When to Use Parallel Fetching¶
- Large datasets: Thousands of keys
- High-latency connections: Remote Redis servers
- CPU-bound parsing: Complex schemas with many fields
When NOT to Use¶
- Small datasets: Overhead exceeds benefit
- Connection-limited environments: Limited connection pool
- Already saturated Redis: Adding connections won't help
Write Pipelining¶
Write operations automatically use Redis pipelining, sending commands in batches of 1000 keys.
This reduces network round-trips significantly for large writes.
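As a rough illustration of the saving, compare one round-trip per key against one round-trip per batch of 1000 (the batch size stated above); the key count is made up:

```python
import math

n_keys = 250_000
batch = 1000  # write pipeline batch size

round_trips_unpipelined = n_keys                   # one command per round-trip
round_trips_pipelined = math.ceil(n_keys / batch)  # one batch per round-trip

print(round_trips_unpipelined, round_trips_pipelined)  # 250000 250
```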
Memory Considerations¶
Scanning Large Datasets¶
For very large datasets, use streaming with smaller batches:
```python
lf = redis.scan_hashes(
    url,
    pattern="user:*",
    schema=schema,
    batch_size=500,  # Smaller batches
)

# Process in chunks
for batch_df in lf.collect_iter():
    process(batch_df)
```
Projection Pushdown¶
Always select only needed columns to reduce memory:
```python
# Good: only fetches 'name' and 'age' from Redis
df = lf.select(["name", "age"]).collect()

# Less efficient: fetches all fields, then discards
df = lf.collect().select(["name", "age"])
```
Error Handling¶
Connection Errors¶
```python
try:
    lf = redis.scan_hashes(url, pattern="*", schema=schema)
    df = lf.collect()
except Exception as e:
    print(f"Redis error: {e}")
```
Missing Fields¶
Missing hash fields become null:
```python
schema = {"name": pl.Utf8, "optional_field": pl.Utf8}
df = redis.read_hashes(url, pattern="user:*", schema=schema)

# optional_field will be null for hashes that don't have it
```
Type Conversion Errors¶
Values that cannot be converted to the schema's type become null.
Environment Variables¶
For Rust examples, the connection URL can be overridden: