Large Datasets & Parquet¶
jpx can query large datasets that would overwhelm an LLM's context window. With Parquet support, you get columnar compression and fast reads — the same queries work on both JSON and Parquet files.
This guide uses the Chicago Crimes dataset — 200,000 real crime reports with dates, locations, crime types, and arrest records.
Why this matters for AI agents¶
When an AI agent needs to analyze JSON data, it typically reads the file into its context window. That costs tokens and has hard limits:
| Data size | Tokens (approx.) | Fits in context? |
|---|---|---|
| 60 KB (85 events) | 21,000 | Yes, but expensive |
| 46 MB (50K records) | 15,000,000 | No |
| 196 MB (200K records) | 65,000,000 | Absolutely not |
With jpx as an MCP sidecar, the file stays on disk. The agent sends a short expression and gets back only the result — typically a few hundred tokens regardless of input size.
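The estimates above come from a rough rule of thumb of about three bytes per token for compact JSON (an assumption; real tokenizer ratios vary by model):

```python
# Rough token-cost estimate for putting raw JSON into an LLM context.
# Assumes ~3 bytes per token for compact JSON -- an approximation;
# actual ratios depend on the model's tokenizer.
BYTES_PER_TOKEN = 3

def approx_tokens(size_bytes: int) -> int:
    """Estimate how many tokens a JSON payload of this size would cost."""
    return size_bytes // BYTES_PER_TOKEN

for label, size in [("60 KB", 60_000), ("46 MB", 46_000_000), ("196 MB", 196_000_000)]:
    print(f"{label}: ~{approx_tokens(size):,} tokens")
```

The 196 MB file lands in the tens of millions of tokens, far past any context window, which is exactly why the query engine belongs outside the model.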
Getting the data¶
# Download 200K recent Chicago crime records (≈196 MB)
for offset in 0 50000 100000 150000; do
  curl -s "https://data.cityofchicago.org/resource/ijzp-q8t2.json?\$limit=50000&\$offset=$offset"
done | python3 -c "
import json, sys
records = []
for line in sys.stdin:
    records.extend(json.loads(line))
json.dump(records, open('chicago_crimes.json', 'w'))
print(f'{len(records)} records')
"
Converting to Parquet¶
Requires jpx built with --features parquet.
Select the fields you need and write to Parquet:
jpx '[*].{
  id: id,
  case_number: case_number,
  date: date,
  primary_type: primary_type,
  description: description,
  location_description: location_description,
  arrest: arrest,
  domestic: domestic,
  beat: beat,
  district: district,
  ward: ward,
  year: year,
  latitude: latitude,
  longitude: longitude
}' -f chicago_crimes.json --parquet -o chicago_crimes.parquet
The result: a 6.8 MB Parquet file, down from 196 MB of JSON.
jpx auto-detects .parquet files — no flags needed to read them.
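The multiselect in the jpx command above is a per-record projection: keep the named fields, drop everything else. In plain Python the same field selection looks like this (a sketch with a tiny made-up sample and a subset of the fields):

```python
import json

# A subset of the fields kept by the multiselect in the jpx command above.
FIELDS = ["id", "case_number", "date", "primary_type", "arrest",
          "latitude", "longitude"]

def project(record: dict) -> dict:
    """Keep only the selected fields; missing keys become None (a null field)."""
    return {k: record.get(k) for k in FIELDS}

# Tiny inline sample standing in for chicago_crimes.json.
sample = [{"id": "1", "case_number": "JK1", "date": "2026-01-01T00:00:00.000",
           "primary_type": "THEFT", "arrest": False, "latitude": "41.9",
           "longitude": "-87.6", "updated_on": "this field is dropped"}]
projected = [project(r) for r in sample]
print(json.dumps(projected[0], indent=2))
```

Narrowing the schema before writing Parquet is what makes the columnar file so much smaller than the raw JSON.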
Data structure¶
Each record looks like:
{
"id": "14101518",
"case_number": "JK138251",
"date": "2026-02-02T00:00:00.000",
"primary_type": "BURGLARY",
"description": "UNLAWFUL ENTRY",
"location_description": "RESIDENCE",
"arrest": false,
"domestic": false,
"beat": "0522",
"district": "005",
"ward": "9",
"year": "2026",
"latitude": "41.679126158",
"longitude": "-87.62632452"
}
Basic exploration¶
# How many records?
jpx 'length(@)' -f chicago_crimes.parquet
# → 200000
# What years are covered?
jpx 'sort(unique([*].year))' -f chicago_crimes.parquet
# → ["2025", "2026"]
# Summary stats
jpx '{
  total: length(@),
  years: sort(unique([*].year)),
  arrest_rate: round(divide(length([?arrest == `true`]), length(@)), `3`)
}' -f chicago_crimes.parquet
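For reference, the same summary shape over a small in-memory sample in plain Python (the sample records are made up for illustration):

```python
# Made-up sample standing in for the 200K-record dataset.
records = [
    {"year": "2025", "arrest": True},
    {"year": "2025", "arrest": False},
    {"year": "2026", "arrest": False},
    {"year": "2026", "arrest": False},
]

stats = {
    "total": len(records),
    "years": sorted({r["year"] for r in records}),   # like sort(unique(...))
    "arrest_rate": round(sum(r["arrest"] for r in records) / len(records), 3),
}
print(stats)
# {'total': 4, 'years': ['2025', '2026'], 'arrest_rate': 0.25}
```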
Crime type analysis¶
Top 10 crime types¶
jpx '[*].{type: primary_type}
| group_by(@, '\''type'\'')
| items(@)
| [*].{crime: [0], count: length([1])}
| sort_by(@, &count)
| reverse(@)
| [:10]' -f chicago_crimes.parquet -t
╭───────┬─────────────────────╮
│ count │ crime │
├───────┼─────────────────────┤
│ 46138 │ THEFT │
│ 36512 │ BATTERY │
│ 22767 │ CRIMINAL DAMAGE │
│ 18450 │ ASSAULT │
│ 14928 │ MOTOR VEHICLE THEFT │
│ 13613 │ OTHER OFFENSE │
│ 11708 │ DECEPTIVE PRACTICE │
│ 8744 │ BURGLARY │
│ 5830 │ NARCOTICS │
│ 4864 │ ROBBERY │
╰───────┴─────────────────────╯
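The group_by / items / sort_by pipeline above is the classic count-and-rank pattern. The same shape in plain Python, with a made-up sample, for comparison:

```python
from collections import Counter

# Made-up sample standing in for the full dataset.
records = [{"primary_type": t} for t in
           ["THEFT", "THEFT", "BATTERY", "THEFT", "BATTERY", "ROBBERY"]]

counts = Counter(r["primary_type"] for r in records)
top = counts.most_common(2)   # like sort_by + reverse + [:2]
print(top)
# [('THEFT', 3), ('BATTERY', 2)]
```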
Arrest rate by crime type¶
Which crimes actually lead to arrests?
jpx 'let $types = [*].{type: primary_type, arrested: arrest} in
$types[]
| group_by(@, '\''type'\'')
| items(@)
| [*].{
    crime: [0],
    total: length([1]),
    arrests: length([1][?arrested == `true`]),
    rate: round(divide(length([1][?arrested == `true`]), length([1])), `3`)
  }
| sort_by(@, &total)
| reverse(@)
| [:10]' -f chicago_crimes.parquet -t
╭─────────┬─────────────────────┬───────┬───────╮
│ arrests │ crime │ rate │ total │
├─────────┼─────────────────────┼───────┼───────┤
│ 3952 │ THEFT │ 0.086 │ 46138 │
│ 6897 │ BATTERY │ 0.189 │ 36512 │
│ 939 │ CRIMINAL DAMAGE │ 0.041 │ 22767 │
│ 2280 │ ASSAULT │ 0.124 │ 18450 │
│ 467 │ MOTOR VEHICLE THEFT │ 0.031 │ 14928 │
│ 2578 │ OTHER OFFENSE │ 0.189 │ 13613 │
│ 362 │ DECEPTIVE PRACTICE │ 0.031 │ 11708 │
│ 345 │ BURGLARY │ 0.039 │ 8744 │
│ 5501 │ NARCOTICS │ 0.944 │ 5830 │
│ 420 │ ROBBERY │ 0.086 │ 4864 │
╰─────────┴─────────────────────┴───────┴───────╯
Narcotics has a 94% arrest rate — these are typically arrest-initiated reports. Motor vehicle theft is 3% — the car is usually gone.
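The per-group rate computation is a group-then-aggregate: collect the arrest flags per crime type, then divide. A plain-Python sketch with made-up records:

```python
from collections import defaultdict

# Made-up sample standing in for the full dataset.
records = [
    {"primary_type": "NARCOTICS", "arrest": True},
    {"primary_type": "NARCOTICS", "arrest": True},
    {"primary_type": "THEFT", "arrest": False},
    {"primary_type": "THEFT", "arrest": True},
]

# Group arrest flags by crime type (like group_by + items).
groups = defaultdict(list)
for r in records:
    groups[r["primary_type"]].append(r["arrest"])

# Per-group rate: arrests / total (booleans sum as 0/1).
rates = {crime: round(sum(flags) / len(flags), 3)
         for crime, flags in groups.items()}
print(rates)
# {'NARCOTICS': 1.0, 'THEFT': 0.5}
```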
Geographic analysis¶
Crimes farthest from downtown Chicago¶
jpx 'let $lat = `41.8781`, $lon = `-87.6298` in
[?latitude != null]
| [*].{
    type: primary_type,
    dist_km: round(geo_distance_km($lat, $lon,
      to_number(latitude), to_number(longitude)), `2`),
    date: date
  }
| sort_by(@, &dist_km)
| reverse(@)
| [:5]' -f chicago_crimes.parquet -t
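A great-circle distance like the one geo_distance_km returns can be sketched in Python with the standard haversine formula (assuming a mean Earth radius of 6371 km; this is an illustration, not jpx's implementation):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (approximation)

def geo_distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points via the haversine formula."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Downtown Chicago to O'Hare airport -- roughly 25 km.
print(round(geo_distance_km(41.8781, -87.6298, 41.9742, -87.9073), 2))
```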
Crime by district¶
jpx '[*].{district: district}
| group_by(@, '\''district'\'')
| items(@)
| [*].{district: [0], count: length([1])}
| sort_by(@, &count)
| reverse(@)
| [:10]' -f chicago_crimes.parquet -t
Using a query file¶
Save your queries as a reusable .jpx library:
cat > chicago.jpx << 'EOF'
-- :name stats
-- :desc Summary statistics
{
  total: length(@),
  years: sort(unique([*].year)),
  arrest_rate: round(divide(length([?arrest == `true`]), length(@)), `3`)
}
-- :name top-crimes
-- :desc Top 10 crime types by count
[*].{type: primary_type}
| group_by(@, 'type')
| items(@)
| [*].{crime: [0], count: length([1])}
| sort_by(@, &count)
| reverse(@)
| [:10]
-- :name arrests-by-type
-- :desc Arrest rate by crime type (top 10 crimes)
let $types = [*].{type: primary_type, arrested: arrest} in
$types[]
| group_by(@, 'type')
| items(@)
| [*].{
    crime: [0],
    total: length([1]),
    arrests: length([1][?arrested == `true`]),
    rate: round(divide(length([1][?arrested == `true`]), length([1])), `3`)
  }
| sort_by(@, &total)
| reverse(@)
| [:10]
-- :name by-district
-- :desc Crime count by district
[*].{district: district}
| group_by(@, 'district')
| items(@)
| [*].{district: [0], count: length([1])}
| sort_by(@, &count)
| reverse(@)
| [:10]
EOF
Then run any query against JSON or Parquet:
# Same query, either format
jpx -Q chicago.jpx:top-crimes -f chicago_crimes.json -t
jpx -Q chicago.jpx:top-crimes -f chicago_crimes.parquet -t
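The library format above is simple enough to parse by hand. A minimal sketch, assuming each query begins at a `-- :name` line, `-- :desc` lines are metadata, and the body runs until the next `-- :name` (illustrative only, not jpx's actual parser):

```python
def parse_query_file(text: str) -> dict[str, str]:
    """Split a named-query file into {name: query body} (a sketch)."""
    queries: dict[str, str] = {}
    name = None
    body: list[str] = []
    for line in text.splitlines():
        if line.startswith("-- :name "):
            if name is not None:          # flush the previous query
                queries[name] = "\n".join(body).strip()
            name = line[len("-- :name "):].strip()
            body = []
        elif line.startswith("-- :"):
            continue                      # skip :desc and other metadata
        elif name is not None:
            body.append(line)
    if name is not None:                  # flush the last query
        queries[name] = "\n".join(body).strip()
    return queries

sample = """-- :name stats
-- :desc Summary statistics
{ total: length(@) }
-- :name top-crimes
[*].primary_type
"""
q = parse_query_file(sample)
print(sorted(q))        # ['stats', 'top-crimes']
print(q["top-crimes"])  # [*].primary_type
```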
Performance¶
All timings on a laptop (Apple M-series), 200K records:
| Query | JSON (196 MB) | Parquet (6.8 MB) |
|---|---|---|
| `length(@)` | ~1s | ~1s |
| `group_by` + `items` (top crimes) | ~1.2s | ~1.2s |
| `geo_distance_km` on all rows | ~1.8s | ~1.8s |
| Arrest rate with `let` + nested filters | ~1.5s | ~1.5s |
Query time is similar — the JMESPath evaluation dominates. The Parquet advantage is storage (29x smaller) and transfer (faster to download, copy, or send to a remote agent).
The AI sidecar argument¶
When jpx runs as an MCP server, evaluate_file reads the Parquet file
on disk and returns only the query result. The token math:
| Scenario | Tokens |
|---|---|
| 200K records as raw JSON in context | ~65,000,000 (impossible) |
| jpx expression + result | ~150 |
The data never enters the context window. The result is deterministic — the query engine processes 200K rows, not the LLM. Same query, same answer, every time.
More datasets to try¶
- Chicago Crimes API — up to 8M+ records
- GH Archive — GitHub events, ~900 MB/hour (NDJSON)
- NYC Open Data — 311 complaints, taxi trips, more
- data.gov — US government open data