Schema Inference¶

polars-redis can automatically infer schemas from existing Redis data.

Inferring Hash Schema¶

infer_hash_schema samples Redis hashes to detect field names and types:

import polars_redis as redis

schema = redis.infer_hash_schema(
    "redis://localhost:6379",
    pattern="user:*",
    sample_size=100,
)
print(schema)
# {'name': Utf8, 'age': Int64, 'score': Float64, 'active': Boolean}

Parameters¶

Parameter	Type	Default	Description
`url`	str	required	Redis connection URL
`pattern`	str	`"*"`	Key pattern to sample
`sample_size`	int	`100`	Maximum keys to sample
`type_inference`	bool	`True`	Infer types (vs all Utf8)

Type Detection¶

When type_inference=True, values are analyzed:

Pattern	Detected Type
Integer strings (`"123"`)	`Int64`
Float strings (`"3.14"`)	`Float64`
Boolean strings (`"true"`, `"false"`)	`Boolean`
ISO dates (`"2024-01-15"`)	`Date`
ISO datetimes (`"2024-01-15T10:30:00"`)	`Datetime`
Everything else	`Utf8`

Without Type Inference¶

Set type_inference=False to treat all fields as strings:

schema = redis.infer_hash_schema(
    url,
    pattern="user:*",
    type_inference=False,
)
# All fields are Utf8

Inferring JSON Schema¶

infer_json_schema samples RedisJSON documents:

schema = redis.infer_json_schema(
    "redis://localhost:6379",
    pattern="doc:*",
    sample_size=100,
)
print(schema)
# {'title': Utf8, 'views': Int64, 'rating': Float64}

JSON type inference uses native JSON types (number, string, boolean) rather than parsing strings.

Using Inferred Schemas¶

Pass the inferred schema directly to scan functions:

# Infer
schema = redis.infer_hash_schema(url, pattern="user:*")

# Scan
lf = redis.scan_hashes(url, pattern="user:*", schema=schema)
df = lf.collect()

Sampling Strategy¶

Schema inference uses Redis SCAN to sample keys:

Keys matching the pattern are scanned
Up to sample_size keys are fetched
All unique field names are collected
Field types are inferred from sampled values

Tip

For heterogeneous data, increase sample_size to capture more field variations.

Handling Missing Fields¶

If some hashes have fields that others don't:

All discovered fields are included in the schema
Missing values become null when scanning

# user:1 has {name, age}
# user:2 has {name, age, email}

schema = redis.infer_hash_schema(url, pattern="user:*")
# {'name': Utf8, 'age': Int64, 'email': Utf8}

df = redis.read_hashes(url, pattern="user:*", schema=schema)
# user:1 will have email=null

Confidence Scores¶

For production use, you may want to validate the quality of inferred schemas before processing large datasets. The infer_hash_schema_with_confidence function provides detailed confidence information:

result = redis.infer_hash_schema_with_confidence(
    "redis://localhost:6379",
    pattern="user:*",
    sample_size=100
)

# Check overall confidence
print(f"Average confidence: {result.average_confidence:.1%}")
print(f"All fields confident: {result.all_confident}")

# Get the schema if confidence is high
if result.all_confident:
    df = redis.read_hashes(url, pattern="user:*", schema=result.schema)
else:
    # Investigate low-confidence fields
    for field, conf in result.low_confidence_fields(threshold=0.8):
        print(f"Warning: {field} has {conf:.0%} confidence")

SchemaConfidence Properties¶

Property	Type	Description
`schema`	`dict`	The inferred schema
`sample_count`	`int`	Number of keys sampled
`average_confidence`	`float`	Average confidence across all fields (0.0-1.0)
`all_confident`	`bool`	Whether all fields have confidence >= 0.9
`field_info`	`dict`	Detailed per-field inference information

Per-Field Information¶

The field_info property provides detailed statistics for each field:

for name, info in result.field_info.items():
    print(f"{name}:")
    print(f"  Type: {info['type']}")
    print(f"  Confidence: {info['confidence']:.1%}")
    print(f"  Samples: {info['samples']}")
    print(f"  Valid: {info['valid']}")
    print(f"  Nulls: {info['nulls']} ({info['null_ratio']:.1%})")
    print(f"  Type candidates: {info['type_candidates']}")

Low Confidence Fields¶

Use low_confidence_fields() to identify fields that may need manual type specification:

# Get fields with confidence below 80%
low_conf = result.low_confidence_fields(threshold=0.8)
for field, confidence in low_conf:
    print(f"{field}: {confidence:.0%} confidence")

Schema Overwrite¶

When schema inference gets a type wrong or you need to enforce specific types, use the overwrite functions:

# Infer schema but override specific field types
schema = redis.infer_hash_schema_with_overwrite(
    "redis://localhost:6379",
    pattern="user:*",
    schema_overwrite={
        "age": pl.Int64,           # Force age to Int64
        "created_at": pl.Datetime, # Force timestamp field
        "score": pl.Float64,       # Ensure float precision
    }
)

Use Cases for Overwrite¶

Fix incorrect inference: When a field looks like one type but should be another
Add missing fields: Fields not in sampled data will be added
Force timestamp parsing: Override string fields to Datetime or Date
Ensure numeric precision: Override ambiguous numeric fields to specific types

JSON Schema Overwrite¶

The same pattern works for JSON documents:

schema = redis.infer_json_schema_with_overwrite(
    "redis://localhost:6379",
    pattern="doc:*",
    schema_overwrite={
        "timestamp": pl.Datetime,
        "count": pl.Int64,
    }
)

Limitations¶

Sampling may miss rare fields (increase sample_size)
Type inference is based on sampled values only
Mixed types in the same field default to Utf8
Nested JSON structures are not supported (top-level fields only)