Schema Inference¶
polars-redis can automatically infer schemas from existing Redis data.
Inferring Hash Schema¶
infer_hash_schema samples Redis hashes to detect field names and types:
import polars_redis as redis
schema = redis.infer_hash_schema(
    "redis://localhost:6379",
    pattern="user:*",
    sample_size=100,
)
print(schema)
# {'name': Utf8, 'age': Int64, 'score': Float64, 'active': Boolean}
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | str | required | Redis connection URL |
| pattern | str | "*" | Key pattern to sample |
| sample_size | int | 100 | Maximum keys to sample |
| type_inference | bool | True | Infer types (vs all Utf8) |
Type Detection¶
When type_inference=True, values are analyzed:
| Pattern | Detected Type |
|---|---|
| Integer strings ("123") | Int64 |
| Float strings ("3.14") | Float64 |
| Boolean strings ("true", "false") | Boolean |
| ISO dates ("2024-01-15") | Date |
| ISO datetimes ("2024-01-15T10:30:00") | Datetime |
| Everything else | Utf8 |
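The detection rules in the table can be approximated in plain Python. The sketch below is not the polars-redis implementation: the function name detect_type, the regular expressions, and the order of the checks are assumptions made for illustration, and type names are returned as plain strings rather than Polars dtypes.

```python
import re
from datetime import date, datetime

INT_RE = re.compile(r"[+-]?\d+")
FLOAT_RE = re.compile(r"[+-]?\d+\.\d+")
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def _parses(parser, value: str) -> bool:
    """Return True if parser(value) succeeds without a ValueError."""
    try:
        parser(value)
        return True
    except ValueError:
        return False


def detect_type(value: str) -> str:
    """Map one Redis hash value (a string) to a type name (hypothetical helper)."""
    if INT_RE.fullmatch(value):
        return "Int64"
    if FLOAT_RE.fullmatch(value):
        return "Float64"
    if value in ("true", "false"):
        return "Boolean"
    # The regex fixes the shape; fromisoformat rejects impossible dates.
    if DATE_RE.fullmatch(value) and _parses(date.fromisoformat, value):
        return "Date"
    if "T" in value and _parses(datetime.fromisoformat, value):
        return "Datetime"
    return "Utf8"


for sample in ["123", "3.14", "true", "2024-01-15", "2024-01-15T10:30:00", "hello"]:
    print(sample, "->", detect_type(sample))
```

Checking the narrow numeric patterns first and falling back to Utf8 last means an unrecognised value never raises; it is simply treated as a string.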
Without Type Inference¶
Set type_inference=False to treat all fields as strings:
schema = redis.infer_hash_schema(
    url,
    pattern="user:*",
    type_inference=False,
)
# All fields are Utf8
Inferring JSON Schema¶
infer_json_schema samples RedisJSON documents:
schema = redis.infer_json_schema(
    "redis://localhost:6379",
    pattern="doc:*",
    sample_size=100,
)
print(schema)
# {'title': Utf8, 'views': Int64, 'rating': Float64}
JSON type inference uses native JSON types (number, string, boolean) rather than parsing strings.
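To make the difference from hash inference concrete, here is a minimal sketch that maps already-parsed JSON values to type names. The helper json_value_type and the sample document are hypothetical, and the split of JSON numbers into Int64 and Float64 is an assumption based on the example output above.

```python
import json


def json_value_type(value):
    """Return a type name for a parsed JSON value, or None if not handled here."""
    # bool must be tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        return "Int64"
    if isinstance(value, float):
        return "Float64"
    if isinstance(value, str):
        return "Utf8"
    return None  # null, arrays, and nested objects are outside this sketch


doc = json.loads('{"title": "Intro", "views": 1200, "rating": 4.5, "views_text": "1200"}')
schema = {name: json_value_type(value) for name, value in doc.items()}
print(schema)
```

Note that views_text stays Utf8: a JSON string is never parsed for numeric content, unlike a hash value.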
Using Inferred Schemas¶
Pass the inferred schema directly to scan functions:
# Infer
schema = redis.infer_hash_schema(url, pattern="user:*")
# Scan
lf = redis.scan_hashes(url, pattern="user:*", schema=schema)
df = lf.collect()
Sampling Strategy¶
Schema inference uses Redis SCAN to sample keys:
- Keys matching the pattern are scanned
- Up to sample_size keys are fetched
- All unique field names are collected
- Field types are inferred from sampled values
Tip
For heterogeneous data, increase sample_size to capture more field variations.
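The effect of sample_size on field discovery can be illustrated without a Redis server. The sketch below uses an in-memory dict as a stand-in for Redis and fnmatch as an approximation of Redis glob matching; sample_field_names and the sample data are hypothetical. A real SCAN also returns keys in no guaranteed order, whereas this dict iterates in insertion order.

```python
from fnmatch import fnmatchcase
from itertools import islice


def sample_field_names(hashes: dict, pattern: str = "*", sample_size: int = 100) -> list:
    """Collect unique field names from at most sample_size matching hashes."""
    matching = (key for key in hashes if fnmatchcase(key, pattern))
    fields = []
    for key in islice(matching, sample_size):
        for name in hashes[key]:
            if name not in fields:
                fields.append(name)
    return fields


store = {
    "user:1": {"name": "Ada", "age": "36"},
    "user:2": {"name": "Bob", "age": "41", "email": "bob@example.com"},
    "order:1": {"total": "9.99"},
}

print(sample_field_names(store, "user:*"))                 # both user hashes sampled
print(sample_field_names(store, "user:*", sample_size=1))  # email is never seen
```

With sample_size=1 only the first matching hash is read, so the email field does not appear in the result. This is the situation the tip above warns about.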
Handling Missing Fields¶
If some hashes have fields that others don't:
- All discovered fields are included in the schema
- Missing values become null when scanning
# user:1 has {name, age}
# user:2 has {name, age, email}
schema = redis.infer_hash_schema(url, pattern="user:*")
# {'name': Utf8, 'age': Int64, 'email': Utf8}
df = redis.read_hashes(url, pattern="user:*", schema=schema)
# user:1 will have email=null
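Conceptually, each hash is aligned to the full schema and absent fields are filled with null. A minimal sketch of that alignment, using a hypothetical helper align_to_schema and Python's None in place of a Polars null:

```python
def align_to_schema(row: dict, schema: dict) -> dict:
    """Return the row with exactly the schema's fields; absent fields become None."""
    return {field: row.get(field) for field in schema}


schema = {"name": "Utf8", "age": "Int64", "email": "Utf8"}
user_1 = {"name": "Ada", "age": 36}

aligned = align_to_schema(user_1, schema)
print(aligned)
```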
Confidence Scores¶
For production use, you may want to validate the quality of inferred schemas before processing large datasets. The infer_hash_schema_with_confidence function provides detailed confidence information:
result = redis.infer_hash_schema_with_confidence(
    "redis://localhost:6379",
    pattern="user:*",
    sample_size=100,
)
# Check overall confidence
print(f"Average confidence: {result.average_confidence:.1%}")
print(f"All fields confident: {result.all_confident}")
# Get the schema if confidence is high
if result.all_confident:
    df = redis.read_hashes(url, pattern="user:*", schema=result.schema)
else:
    # Investigate low-confidence fields
    for field, conf in result.low_confidence_fields(threshold=0.8):
        print(f"Warning: {field} has {conf:.0%} confidence")
SchemaConfidence Properties¶
| Property | Type | Description |
|---|---|---|
| schema | dict | The inferred schema |
| sample_count | int | Number of keys sampled |
| average_confidence | float | Average confidence across all fields (0.0-1.0) |
| all_confident | bool | Whether all fields have confidence >= 0.9 |
| field_info | dict | Detailed per-field inference information |
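The two aggregate properties can be read as simple reductions over the per-field confidences. The sketch below shows one plausible reading; the function summarize is hypothetical, and the use of an arithmetic mean is an assumption, since the table only says "average". The 0.9 threshold comes from the table.

```python
from statistics import mean

CONFIDENT = 0.9  # threshold stated for all_confident in the table above


def summarize(field_confidence: dict) -> tuple:
    """Return (average confidence, whether every field meets the threshold)."""
    scores = list(field_confidence.values())
    average = mean(scores)  # raises StatisticsError if there are no fields
    all_confident = all(score >= CONFIDENT for score in scores)
    return average, all_confident


average, all_confident = summarize({"name": 1.0, "age": 0.95, "score": 0.6})
print(f"Average confidence: {average:.1%}")
print(f"All fields confident: {all_confident}")
```

A single weak field is enough to make all_confident false even when the average looks acceptable, so both values are worth checking.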
Per-Field Information¶
The field_info property provides detailed statistics for each field:
for name, info in result.field_info.items():
    print(f"{name}:")
    print(f"  Type: {info['type']}")
    print(f"  Confidence: {info['confidence']:.1%}")
    print(f"  Samples: {info['samples']}")
    print(f"  Valid: {info['valid']}")
    print(f"  Nulls: {info['nulls']} ({info['null_ratio']:.1%})")
    print(f"  Type candidates: {info['type_candidates']}")
Low Confidence Fields¶
Use low_confidence_fields() to identify fields that may need manual type specification:
# Get fields with confidence below 80%
low_conf = result.low_confidence_fields(threshold=0.8)
for field, confidence in low_conf:
    print(f"{field}: {confidence:.0%} confidence")
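For intuition, the same filtering can be reproduced over a field_info-shaped dict. This is a sketch, not the library's code: the standalone function, its default threshold, and the ascending sort are assumptions, and the sample data is invented for illustration.

```python
def low_confidence_fields(field_info: dict, threshold: float = 0.8) -> list:
    """Return (field, confidence) pairs strictly below threshold, weakest first."""
    flagged = [
        (name, info["confidence"])
        for name, info in field_info.items()
        if info["confidence"] < threshold
    ]
    # Sorting is a choice made for this sketch so the weakest field is listed first.
    return sorted(flagged, key=lambda item: item[1])


field_info = {
    "name": {"type": "Utf8", "confidence": 1.0},
    "age": {"type": "Int64", "confidence": 0.75},
    "score": {"type": "Float64", "confidence": 0.5},
}

flagged = low_confidence_fields(field_info)
for field, confidence in flagged:
    print(f"{field}: {confidence:.0%} confidence")
```

The flagged field names are natural candidates for an explicit type in a schema overwrite.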
Schema Overwrite¶
When schema inference gets a type wrong or you need to enforce specific types, use the overwrite functions:
import polars as pl
# Infer schema but override specific field types
schema = redis.infer_hash_schema_with_overwrite(
    "redis://localhost:6379",
    pattern="user:*",
    schema_overwrite={
        "age": pl.Int64,  # Force age to Int64
        "created_at": pl.Datetime,  # Force timestamp field
        "score": pl.Float64,  # Ensure float precision
    },
)
Use Cases for Overwrite¶
- Fix incorrect inference: When a field looks like one type but should be another
- Add missing fields: Fields not in sampled data will be added
- Force timestamp parsing: Override string fields to Datetime or Date
- Ensure numeric precision: Override ambiguous numeric fields to specific types
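The merge behaviour described in this list amounts to a dictionary update in which the overwrite wins. A minimal sketch, assuming the helper name apply_overwrite and using string type names instead of Polars dtypes:

```python
def apply_overwrite(inferred: dict, overwrite: dict) -> dict:
    """Return a new schema where overwrite entries replace or extend inferred ones."""
    merged = dict(inferred)   # copy so the inferred schema is left untouched
    merged.update(overwrite)  # overwrite wins; unseen fields are appended
    return merged


inferred = {"name": "Utf8", "age": "Utf8", "score": "Int64"}
overwrite = {"age": "Int64", "created_at": "Datetime", "score": "Float64"}

schema = apply_overwrite(inferred, overwrite)
print(schema)
```

Here age and score are corrected, created_at is added even though no sampled hash contained it, and name keeps its inferred type.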
JSON Schema Overwrite¶
The same pattern works for JSON documents:
schema = redis.infer_json_schema_with_overwrite(
    "redis://localhost:6379",
    pattern="doc:*",
    schema_overwrite={
        "timestamp": pl.Datetime,
        "count": pl.Int64,
    },
)
Limitations¶
- Sampling may miss rare fields (increase sample_size)
- Type inference is based on sampled values only
- Mixed types in the same field default to Utf8
- Nested JSON structures are not supported (top-level fields only)
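The mixed-type rule can be pictured as follows: a field keeps a specific type only when every sampled value agrees on it. The sketch below applies that rule literally with hypothetical helpers value_type and field_type. Whether the library instead promotes a mix of Int64 and Float64 to Float64 is not stated here, so treat that case as an assumption.

```python
import re


def value_type(value: str) -> str:
    """Classify one string value; only numeric types are distinguished here."""
    if re.fullmatch(r"[+-]?\d+", value):
        return "Int64"
    if re.fullmatch(r"[+-]?\d+\.\d+", value):
        return "Float64"
    return "Utf8"


def field_type(values: list) -> str:
    """Return the shared type of all values, or Utf8 when they disagree."""
    kinds = {value_type(v) for v in values}
    return kinds.pop() if len(kinds) == 1 else "Utf8"


print(field_type(["1", "2", "3"]))     # every value is an integer string
print(field_type(["1", "n/a", "3"]))   # one stray value forces the fallback
```

A single placeholder such as "n/a" in the sample is enough to turn a numeric field into Utf8, which is a common reason to reach for a schema overwrite.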