Skip to content

RediSearch Overview

Why RediSearch?

When you have thousands or millions of documents in Redis, scanning everything and filtering in Python is slow and wasteful. RediSearch changes this by filtering and aggregating inside Redis - only the data you need crosses the network.

With polars-redis + RediSearch, you can:

  • Filter server-side: Query @age > 30 AND @status = "active" returns only matching documents
  • Aggregate server-side: Get COUNT, AVG, SUM by group without transferring raw data
  • Search text: Full-text search with stemming, fuzzy matching, and phrase queries
  • Query geospatially: Find documents within a radius or polygon
  • Search vectors: K-nearest neighbors for semantic similarity

The result? 90%+ reduction in data transfer for selective queries.

When to Use RediSearch

RediSearch shines when:

  • You have 10K+ documents and selective queries (filtering returns <50% of data)
  • You need server-side aggregation (GROUP BY, COUNT, AVG, SUM)
  • You want full-text search with stemming, fuzzy matching, or phrase queries
  • You need geo queries (find within radius/polygon)
  • You're doing vector similarity search

When NOT to Use RediSearch

Skip RediSearch and use scan_hashes() + Polars filtering when:

  • Small datasets (<10K documents) - scanning is fast enough
  • One-off exploratory queries - creating an index isn't worth it
  • You need all or most of the data - no filtering benefit
  • Complex transforms that Polars handles better than RediSearch
# For small datasets, this is often simpler and fast enough
df = scan_hashes(url, "user:*", schema).filter(
    pl.col("age") > 30
).collect()

Setup: Test Data

To follow along with examples in this guide, set up sample data:

import polars as pl
import polars_redis as pr
from polars_redis import Index, TextField, NumericField, TagField

url = "redis://localhost:6379"

# Create sample employees
df = pl.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["Alice", "Bob", "Carol", "Dave", "Eve", "Frank", "Grace", "Henry"],
    "age": [32, 28, 45, 35, 29, 52, 38, 41],
    "department": ["engineering", "engineering", "product", "product", 
                   "marketing", "engineering", "marketing", "engineering"],
    "salary": [120000, 95000, 140000, 110000, 85000, 150000, 95000, 130000],
    "status": ["active", "active", "active", "inactive", "active", 
               "active", "active", "inactive"],
})

# Write to Redis
pr.write_hashes(df, url, key_column="id", key_prefix="employee:")

# Create index
idx = Index(
    name="employees_idx",
    prefix="employee:",
    schema=[
        TextField("name", sortable=True),
        NumericField("age", sortable=True),
        TagField("department"),
        NumericField("salary", sortable=True),
        TagField("status"),
    ]
)
idx.ensure_exists(url)

print("Created 8 employees and index")

Prerequisites

RediSearch requires Redis Stack or Redis with the RediSearch module:

# Start Redis Stack (includes RediSearch)
docker run -d -p 6379:6379 redis/redis-stack:latest

# Verify RediSearch is available
redis-cli MODULE LIST
# Should include "search"

Two Ways to Query

You can write queries using native RediSearch syntax or the polars-redis query builder:

from polars_redis import col, search_hashes

# Polars-like syntax you already know
query = (col("age") > 30) & (col("status") == "active")

df = search_hashes(
    url,
    index="employees_idx",
    query=query,
    schema={"name": pl.Utf8, "age": pl.Int64},
).collect()
# Native RediSearch query string
df = search_hashes(
    url,
    index="employees_idx",
    query="@age:[(30 +inf] @status:{active}",
    schema={"name": pl.Utf8, "age": pl.Int64},
).collect()

The query builder generates valid RediSearch queries while giving you a familiar, composable API. Use query.to_redis() anytime to see the generated query string.

search_hashes() uses RediSearch FT.SEARCH to filter data server-side:

import polars as pl
import polars_redis as pr
from polars_redis import col

url = "redis://localhost:6379"

# Find active employees over 30
df = pr.search_hashes(
    url,
    index="employees_idx",
    query=(col("age") > 30) & (col("status") == "active"),
    schema={"name": pl.Utf8, "age": pl.Int64, "department": pl.Utf8},
).collect()

print(df)
# shape: (4, 3)
# +-------+-----+-------------+
# | name  | age | department  |
# | ---   | --- | ---         |
# | str   | i64 | str         |
# +=======+=====+=============+
# | Alice | 32  | engineering |
# | Carol | 45  | product     |
# | Frank | 52  | engineering |
# | Grace | 38  | marketing   |
# +-------+-----+-------------+

Basic Aggregation

aggregate_hashes() uses FT.AGGREGATE for server-side grouping:

# Salary statistics by department
df = pr.aggregate_hashes(
    url,
    index="employees_idx",
    query="@status:{active}",
    group_by=["@department"],
    reduce=[
        ("COUNT", [], "headcount"),
        ("AVG", ["@salary"], "avg_salary"),
        ("SUM", ["@salary"], "total_payroll"),
    ],
    sort_by=[("@avg_salary", False)],  # Descending
)

print(df)
# shape: (3, 4)
# +-------------+-----------+------------+--------------+
# | department  | headcount | avg_salary | total_payroll|
# | ---         | ---       | ---        | ---          |
# | str         | i64       | f64        | f64          |
# +=============+===========+============+==============+
# | engineering | 3         | 121666.67  | 365000.0     |
# | product     | 1         | 140000.0   | 140000.0     |
# | marketing   | 2         | 90000.0    | 180000.0     |
# +-------------+-----------+------------+--------------+

Only the aggregated results transfer - not the raw documents.

Smart Scan: Automatic Optimization

Don't want to think about indexes? Use smart_scan():

from polars_redis import smart_scan

# Auto-detect index and use FT.SEARCH if available
df = smart_scan(
    url,
    "employee:*",
    schema={"name": pl.Utf8, "age": pl.Int64},
).filter(pl.col("age") > 30).collect()

If an index exists for the pattern, smart_scan uses FT.SEARCH automatically. Otherwise, it falls back to SCAN - same API, optimal execution.

Debugging Queries

See what RediSearch query gets generated:

query = (col("department") == "engineering") & (col("salary") > 100000)
print(query.to_redis())
# @department:{engineering} @salary:[(100000 +inf]

See how a query will execute:

from polars_redis import explain_scan

plan = explain_scan(url, "employee:*", schema={"name": pl.Utf8})
print(plan.explain())
# Strategy: SEARCH
# Index: employees_idx
#   Prefixes: employee:
#   Type: HASH

Next Steps

Performance Comparison

Method Data Transfer Use Case
scan_hashes() All matching keys No index, small datasets
scan_hashes() + filter All keys, filter client-side No index available
search_hashes() Only matching docs Indexed data, selective queries
smart_scan() Optimized automatically Best default - auto-selects strategy
aggregate_hashes() Only aggregated results Analytics, reporting