Configuration

This guide covers performance tuning and configuration options.

Connection URL

polars-redis uses standard Redis URLs:

# Local Redis
url = "redis://localhost:6379"

# With password
url = "redis://:password@localhost:6379"

# With username and password
url = "redis://user:password@localhost:6379"

# Specific database
url = "redis://localhost:6379/1"

# TLS
url = "rediss://localhost:6379"

Batch Size

The batch_size parameter controls how many keys are processed per Arrow batch:

lf = redis.scan_hashes(
    url,
    pattern="user:*",
    schema=schema,
    batch_size=1000,  # default
)

Tuning Guidelines

Batch Size    Memory   Latency    Use Case
100-500       Low      Higher     Memory-constrained, streaming
1000          Medium   Balanced   General purpose (default)
5000-10000    Higher   Lower      Large datasets, fast networks
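
As a rough sizing aid, peak memory per batch scales with batch_size times the average row width. A back-of-the-envelope sketch (the byte figures are assumptions to plug your own numbers into):

# Illustrative sizing: adjust the estimates to your own data
avg_field_bytes = 32   # assumed average encoded field size
fields_per_row = 10    # number of fields in your schema
batch_size = 5000

approx_batch_bytes = batch_size * fields_per_row * avg_field_bytes
print(f"~{approx_batch_bytes / 1024 / 1024:.1f} MiB per batch")  # ~1.5 MiB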

Count Hint

The count_hint parameter suggests how many keys Redis should return per SCAN iteration:

lf = redis.scan_hashes(
    url,
    pattern="user:*",
    schema=schema,
    count_hint=100,  # default
)

Note

This is a hint, not a guarantee. Redis may return more or fewer keys.

Tuning Guidelines

  • Low values (10-50): More SCAN iterations, lower memory per iteration
  • High values (500-1000): Fewer iterations, higher throughput
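
For intuition, count_hint corresponds to the COUNT option of the underlying SCAN command. A minimal sketch of the equivalent raw loop using the redis-py client (installed separately; aliased to avoid clashing with the polars-redis import used elsewhere in this guide):

import redis as redis_py  # redis-py, installed separately

r = redis_py.Redis.from_url("redis://localhost:6379")
cursor = 0
while True:
    # COUNT is a hint: each iteration may return more or fewer keys
    cursor, keys = r.scan(cursor, match="user:*", count=100)
    print(f"got {len(keys)} keys this iteration")
    if cursor == 0:
        break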

Parallel Fetching

The parallel parameter enables parallel data fetching with multiple workers:

lf = redis.scan_hashes(
    url,
    pattern="user:*",
    schema=schema,
    parallel=4,  # Use 4 parallel workers
)

Each batch of keys is split across workers, with results collected in order.
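
Conceptually, this resembles splitting each batch of keys across a pool of workers, each with its own connection. A simplified illustration of the idea using redis-py and a thread pool (not the library's actual implementation):

from concurrent.futures import ThreadPoolExecutor
import redis as redis_py  # redis-py, installed separately

def fetch_chunk(url, keys):
    # One connection per worker; HGETALLs for the chunk are pipelined
    r = redis_py.Redis.from_url(url)
    pipe = r.pipeline()
    for key in keys:
        pipe.hgetall(key)
    return pipe.execute()

def fetch_parallel(url, keys, workers=4):
    if not keys:
        return []
    size = -(-len(keys) // workers)  # ceiling division
    chunks = [keys[i:i + size] for i in range(0, len(keys), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda chunk: fetch_chunk(url, chunk), chunks)
    # map() yields results in submission order, so key order is preserved
    return [fields for chunk in results for fields in chunk]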

Tuning Guidelines

Workers          Use Case
None (default)   Small datasets, simple queries
2-4              Medium datasets, local Redis
4-8              Large datasets, remote Redis with higher latency

Note

Parallel fetching uses multiple Redis connections. Ensure your Redis server and connection pool can handle the additional concurrent connections.

When to Use Parallel Fetching

  • Large datasets: Thousands of keys
  • High-latency connections: Remote Redis servers
  • CPU-bound parsing: Complex schemas with many fields

When NOT to Use

  • Small datasets: Overhead exceeds benefit
  • Connection-limited environments: Limited connection pool
  • Already saturated Redis: Adding connections won't help
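
When in doubt, measure on your own workload. A quick timing harness (assumes url and schema are defined as in the earlier snippets, and that parallel=None selects the sequential default):

import time

for workers in (None, 2, 4, 8):
    start = time.perf_counter()
    df = redis.scan_hashes(
        url,
        pattern="user:*",
        schema=schema,
        parallel=workers,
    ).collect()
    elapsed = time.perf_counter() - start
    print(f"parallel={workers}: {df.height} rows in {elapsed:.2f}s")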

Write Pipelining

Write operations automatically use Redis pipelining with batches of 1000 keys:

# Writes are pipelined automatically
redis.write_hashes(df, url)

This reduces network round-trips significantly for large writes.
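
For comparison, this is what the same batching looks like at the redis-py level; polars-redis performs the equivalent pipelining for you:

import redis as redis_py  # redis-py, installed separately

r = redis_py.Redis.from_url("redis://localhost:6379")

# Without pipelining: one round-trip per key.
# With pipelining: one round-trip per batch.
pipe = r.pipeline()
for i in range(1000):
    pipe.hset(f"user:{i}", mapping={"name": f"user{i}", "age": str(i % 80)})
pipe.execute()  # single round-trip for all 1000 HSETs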

Memory Considerations

Scanning Large Datasets

For very large datasets, use streaming with smaller batches:

lf = redis.scan_hashes(
    url,
    pattern="user:*",
    schema=schema,
    batch_size=500,  # Smaller batches
)

# Process in chunks
for batch_df in lf.collect_iter():
    process(batch_df)

Projection Pushdown

Always select only the columns you need to reduce memory:

# Good: Only fetches 'name' and 'age' from Redis
df = lf.select(["name", "age"]).collect()

# Less efficient: Fetches all fields, then discards
df = lf.collect().select(["name", "age"])
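
You can confirm that the projection reaches the scan by printing the optimized query plan (explain() is standard Polars LazyFrame API):

# The optimized plan should list only the projected columns
print(lf.select(["name", "age"]).explain())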

Error Handling

Connection Errors

try:
    lf = redis.scan_hashes(url, pattern="*", schema=schema)
    df = lf.collect()
except Exception as e:
    print(f"Redis error: {e}")

Missing Fields

Missing hash fields become null:

schema = {"name": pl.Utf8, "optional_field": pl.Utf8}
df = redis.read_hashes(url, pattern="user:*", schema=schema)
# optional_field will be null for hashes that don't have it
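
The resulting nulls can be counted or filled downstream with standard Polars expressions:

import polars as pl

# Count hashes where the field was absent, then fill with a default
n_missing = df["optional_field"].null_count()
df = df.with_columns(pl.col("optional_field").fill_null("unknown"))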

Type Conversion Errors

Values that cannot be parsed as the declared type become null:

schema = {"age": pl.Int64}
# If a hash has age="not a number", it becomes null
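
To audit how many values failed to convert, filter on the resulting nulls (note this also matches fields that were simply absent):

import polars as pl

schema = {"id": pl.Utf8, "age": pl.Int64}
df = redis.read_hashes(url, pattern="user:*", schema=schema)

# Rows where "age" was unparseable (or missing) are null
bad = df.filter(pl.col("age").is_null())
print(f"{bad.height} rows had a missing or unparseable age")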

Environment Variables

For the Rust examples, the connection URL can be overridden with the REDIS_URL environment variable:

export REDIS_URL="redis://custom-host:6379"
cargo run --example scan_hashes
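
The same convention is easy to follow from Python with the standard library (reading REDIS_URL here is a suggestion, not something polars-redis does automatically):

import os

# Fall back to a local instance when REDIS_URL is not set
url = os.environ.get("REDIS_URL", "redis://localhost:6379")
lf = redis.scan_hashes(url, pattern="user:*", schema=schema)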