Introduction
unimorph-rs is a complete Rust toolkit for working with UniMorph morphological data. It provides both a command-line interface and a Rust library for downloading, querying, and analyzing morphological inflection data across 180+ languages.
What is UniMorph?
UniMorph is a collaborative project providing morphological paradigms for the world's languages. Each language dataset contains entries mapping lemmas (dictionary forms) to their inflected forms along with morphological feature annotations.
For example, in Spanish:
| Lemma | Form | Features |
|---|---|---|
| hablar | hablo | V;IND;PRS;1;SG |
| hablar | hablas | V;IND;PRS;2;SG |
| hablar | habla | V;IND;PRS;3;SG |
| hablar | hablamos | V;IND;PRS;1;PL |
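To make the structure of an entry concrete, here is a minimal sketch that splits the feature annotation of one row into its individual tags. It assumes the tab-separated lemma, form, features layout that unimorph uses for TSV output; the helper name `split_features` is hypothetical and not part of the tool.

```shell
# Hypothetical helper: print each feature tag of one entry on its own line.
# Assumes the entry is a single line: lemma<TAB>form<TAB>features,
# with the features joined by ';'.
split_features() {
    printf '%s\n' "$1" | cut -f3 | tr ';' '\n'
}

# One row from the table above, written as a tab-separated line.
split_features "$(printf 'hablar\thablo\tV;IND;PRS;1;SG')"
```

Each tag (part of speech, mood, tense, person, number) ends up on its own line, which is convenient for counting or filtering with standard tools.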
Features
- Fast lookups: SQLite-backed storage with indexed queries
- 180+ languages: Access to all UniMorph language datasets
- Transparent decompression: Handles .xz, .gz, and .zip compressed datasets automatically
- Flexible querying: Search by lemma, form, features, or part of speech
- Multiple output formats: Table, JSON, TSV for scripting
- Pipe-friendly: Output designed for Unix pipelines
- Offline-first: Data cached locally after download
- Library + CLI: Use as a Rust library or command-line tool
Use Cases
- Language learners: Look up conjugations and declensions
- NLP researchers: Training data for morphological models
- Lexicographers: Verify inflection paradigms
- Educators: Build conjugation practice tools
- Linguists: Cross-linguistic morphological analysis
Quick Example
# Download Hebrew dataset
unimorph download heb
# Look up all forms of a verb
unimorph inflect -l heb כתב
# Analyze a surface form
unimorph analyze -l heb כתבתי
# Search for plural masculine forms
unimorph search -l heb --contains PL,MASC --limit 10
Getting Started
Head to the Installation guide to get started, or jump straight to the Quick Start for a hands-on introduction.
Installation
Command-Line Tool
Homebrew (macOS/Linux)
brew tap joshrotenberg/brew
brew install unimorph
Cargo (from crates.io)
If you have Rust installed:
cargo install unimorph
Docker
Pull the image from GitHub Container Registry:
docker pull ghcr.io/joshrotenberg/unimorph-rs:latest
Run with a persistent data cache:
# Download a dataset
docker run -v ~/.cache/unimorph:/data ghcr.io/joshrotenberg/unimorph-rs download spa
# Query the data
docker run -v ~/.cache/unimorph:/data ghcr.io/joshrotenberg/unimorph-rs inflect -l spa hablar
# Export data
docker run -v ~/.cache/unimorph:/data -v "$(pwd)":/output ghcr.io/joshrotenberg/unimorph-rs \
export -l spa -F jsonl -o /output/spanish.jsonl
You can also create a shell alias for convenience:
alias unimorph='docker run -v ~/.cache/unimorph:/data ghcr.io/joshrotenberg/unimorph-rs'
From Source
git clone https://github.com/joshrotenberg/unimorph-rs
cd unimorph-rs
cargo install --path crates/unimorph-cli # directory still named unimorph-cli
Rust Library
Add to your Cargo.toml:
[dependencies]
unimorph-core = "0.1"
Or with cargo:
cargo add unimorph-core
Shell Completions
Generate completions for your shell:
# Bash
unimorph completions bash > ~/.local/share/bash-completion/completions/unimorph
# Zsh
unimorph completions zsh > ~/.zfunc/_unimorph
# Fish
unimorph completions fish > ~/.config/fish/completions/unimorph.fish
# PowerShell
unimorph completions powershell > _unimorph.ps1
For Zsh, ensure ~/.zfunc is in your fpath:
# Add to ~/.zshrc before compinit
fpath=(~/.zfunc $fpath)
autoload -Uz compinit && compinit
Verifying Installation
unimorph --version
unimorph --help
Data Storage
By default, unimorph stores data in:
- Linux/macOS: ~/.cache/unimorph/
- Custom: Set the UNIMORPH_DATA environment variable or use --data-dir
Configuration is stored in:
- All platforms: ~/.config/unimorph/config.toml
Quick Start
This guide will get you up and running with unimorph in under 5 minutes.
Download Your First Language
Let's start by downloading a language dataset. We'll use Hebrew (heb) as an example:
unimorph download heb
You'll see output like:
Downloading heb...
Downloaded 33177 entries for heb
Look Up Inflections
Now let's look up all the forms of a Hebrew verb. The inflect command takes a lemma (dictionary form) and shows all its inflected forms:
unimorph inflect -l heb כתב
Output:
LEMMA FORM FEATURES
------------------------------------------------------------
כתב אכתוב V;1;SG;FUT
כתב יכתבו V;3;PL;FUT;MASC
כתב יכתוב V;3;SG;FUT;MASC
כתב כותב V;SG;PRS;MASC
כתב כתב V;3;SG;PST;MASC
...
29 form(s) found.
Analyze a Surface Form
What if you have a word and want to know what it is? Use analyze:
unimorph analyze -l heb כתבתי
Output:
FORM LEMMA FEATURES
------------------------------------------------------------
כתבתי כתב V;1;SG;PST
1 analysis(es) found.
Search with Filters
Find entries matching specific criteria:
# Find all first person singular future forms
unimorph search -l heb --contains 1,SG,FUT --limit 5
# Find verbs (part of speech = V)
unimorph search -l heb --pos V --limit 5
# Search by lemma pattern (SQL LIKE wildcards)
unimorph search -l heb --lemma "כת%" --limit 5
Check Dataset Statistics
unimorph stats heb
Statistics for heb:
Total entries: 33177
Unique lemmas: 1176
Unique forms: 27286
Unique features: 55
Imported at: 2024-01-15 10:30:00 UTC
Set a Default Language
Tired of typing -l heb every time? Set a default:
export UNIMORPH_LANG=heb
Or create a config file:
unimorph config init
Then edit ~/.config/unimorph/config.toml:
default_lang = "heb"
Now you can just run:
unimorph inflect כתב
unimorph analyze כתבתי
Output Formats
JSON Output
Add --json for machine-readable output:
unimorph inflect -l heb כתב --json
TSV for Piping
Use --tsv for tab-separated output without headers:
unimorph inflect -l heb כתב --tsv | head -5
כתב אכתוב V;1;SG;FUT
כתב יכתבו V;3;PL;FUT;MASC
כתב יכתוב V;3;SG;FUT;MASC
כתב כותב V;SG;PRS;MASC
כתב כותבות V;PL;PRS;FEM
Export Full Dataset
Export an entire language to a file:
unimorph export -l heb -o hebrew.tsv
unimorph export -l heb -o hebrew.jsonl --format jsonl
Or to stdout for piping:
unimorph export -l heb -o - | grep "FUT" | wc -l
Next Steps
- Browse available languages
- Learn about the feature schema
- Explore the full CLI reference
- Use the Rust library in your projects
Configuration
unimorph can be configured through environment variables, a config file, or command-line flags. Settings are applied in this priority order (highest to lowest):
- Command-line flags
- Environment variables
- Config file
- Built-in defaults
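The priority order above can be sketched as a small shell function that resolves the language setting. This only illustrates the documented order and is not unimorph's implementation; `resolve_lang` and its two arguments are hypothetical, and the sketch assumes there is no built-in default language.

```shell
# Hypothetical sketch of the lookup order; not unimorph's actual code.
# $1 = value given via a command-line flag (may be empty)
# $2 = value read from the config file (may be empty)
resolve_lang() {
    flag_value=$1
    config_value=$2
    if [ -n "$flag_value" ]; then
        # 1. Command-line flag wins over everything else.
        printf '%s\n' "$flag_value"
    elif [ -n "${UNIMORPH_LANG:-}" ]; then
        # 2. Environment variable comes next.
        printf '%s\n' "$UNIMORPH_LANG"
    elif [ -n "$config_value" ]; then
        # 3. Then the config file.
        printf '%s\n' "$config_value"
    else
        # 4. Assumption: no built-in default language, so report failure.
        return 1
    fi
}
```

Calling `resolve_lang "" heb` prints the value of UNIMORPH_LANG if it is set and falls back to heb otherwise.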
Config File
The config file is located at ~/.config/unimorph/config.toml on all platforms.
Creating a Config File
# Create a config file with example content
unimorph config init
# View current configuration
unimorph config show
# Show config file path
unimorph config path
Config File Format
# Default language for commands (ISO 639-3 code)
default_lang = "heb"
# Custom data directory (default: ~/.cache/unimorph)
# data_dir = "/path/to/custom/data"
# Default output format: "table", "json", or "tsv"
# output_format = "table"
# Disable colored output
# no_color = true
# Language aliases for convenience
[languages]
hebrew = "heb"
spanish = "spa"
german = "deu"
finnish = "fin"
Language Aliases
Define shortcuts for language codes:
[languages]
he = "heb"
it = "ita"
de = "deu"
Then use:
unimorph inflect -l he כתב
# Resolves to: unimorph inflect -l heb כתב
Environment Variables
| Variable | Description | Example |
|---|---|---|
| UNIMORPH_LANG | Default language code | export UNIMORPH_LANG=heb |
| UNIMORPH_DATA | Custom data directory | export UNIMORPH_DATA=/data/unimorph |
| NO_COLOR | Disable colored output | export NO_COLOR=1 |
Command-Line Flags
Global flags available on all commands:
| Flag | Description |
|---|---|
| -d, --data-dir <PATH> | Custom data directory |
| -v, --verbose | Enable debug output (-vv for trace) |
| -q, --quiet | Suppress non-essential output |
Data Storage
Default Locations
- Dataset database: ~/.cache/unimorph/datasets.db
- API cache: ~/.cache/unimorph/available_languages.json
- Config file: ~/.config/unimorph/config.toml
Custom Data Directory
Override the data directory:
# Via environment variable
export UNIMORPH_DATA=/custom/path
unimorph download heb
# Via command-line flag
unimorph --data-dir /custom/path download heb
# Via config file
# data_dir = "/custom/path"
Resetting Data
# Clear API response cache
unimorph repair --clear-cache
# Clear all downloaded datasets (requires re-download)
unimorph repair --clear-data
Output Modes
Table (Default)
Human-readable formatted output with colors when connected to a terminal:
unimorph inflect -l heb כתב
JSON
Machine-readable JSON output:
unimorph inflect -l heb כתב --json
TSV
Tab-separated values without headers, ideal for piping:
unimorph inflect -l heb כתב --tsv
Pipe Detection
When stdout is not a terminal (e.g., piped to another command), unimorph automatically outputs in a pipe-friendly format:
# Automatically outputs just language codes, one per line
unimorph list | xargs -I{} echo "Language: {}"
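Pipe detection of this kind is commonly built on the POSIX terminal test `[ -t 1 ]`. The sketch below shows the general technique only; whether unimorph uses this exact check internally is an assumption, and `output_style` is a hypothetical name.

```shell
# Hypothetical sketch: choose an output style based on whether stdout
# (file descriptor 1) is attached to a terminal.
output_style() {
    if [ -t 1 ]; then
        echo "table"   # interactive terminal: human-readable output
    else
        echo "plain"   # pipe or file: script-friendly output
    fi
}

output_style | cat     # stdout is a pipe here, so this prints "plain"
```

The same function prints "table" when it is called directly in an interactive shell.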
CLI Overview
The unimorph command-line tool provides access to UniMorph morphological data through a set of intuitive subcommands.
Command Structure
unimorph [OPTIONS] <COMMAND> [ARGS]
Global Options
| Option | Description |
|---|---|
| -v, --verbose | Enable debug output (-vv for trace) |
| -q, --quiet | Suppress non-essential output |
| -d, --data-dir <PATH> | Custom data directory |
| -h, --help | Print help |
| -V, --version | Print version |
Commands at a Glance
| Command | Alias | Description |
|---|---|---|
| download | dl | Download a language dataset |
| list | ls | List available/cached languages |
| inflect | i | Look up all forms of a lemma |
| analyze | a | Analyze a surface form (reverse lookup) |
| search | s | Search entries with flexible filtering |
| stats | st | Show dataset statistics |
| info | in | Show detailed info about a language |
| export | x | Export dataset to file |
| update | up | Update cached datasets |
| features | f | Explore morphological features |
| delete | rm | Delete a cached dataset |
| repair | | Repair or reset data store |
| config | cfg | Manage configuration |
| completions | | Generate shell completions |
Common Workflows
First-Time Setup
# See what languages are available
unimorph list --available
# Download a language
unimorph download heb
# Set as default (optional)
export UNIMORPH_LANG=heb
Looking Up Words
# All forms of a lemma
unimorph inflect -l heb כתב
# What lemma does this form come from?
unimorph analyze -l heb כתבתי
Searching
# By features
unimorph search -l heb --contains PL,MASC
# By part of speech
unimorph search -l heb --pos V --limit 20
# By lemma pattern
unimorph search -l heb --lemma "כת%"
Data Management
# Check for updates
unimorph update --all --check
# Update a specific language
unimorph update heb
# Export for external use
unimorph export -l heb -o hebrew.tsv
Output Formats
Most commands support multiple output formats:
| Flag | Format | Use Case |
|---|---|---|
| (default) | Table | Human reading in terminal |
| --json | JSON | Machine parsing, APIs |
| --tsv | TSV | Piping to other tools |
Examples
# Pretty table output
unimorph inflect -l heb כתב
# JSON for parsing
unimorph inflect -l heb כתב --json | jq '.[0]'
# TSV for piping
unimorph inflect -l heb כתב --tsv | cut -f2 | sort -u
Piping and Scripting
When output is piped (not a terminal), unimorph automatically uses pipe-friendly formats:
# Get all cached language codes
unimorph list | while read lang; do
echo "Processing $lang..."
unimorph stats "$lang"
done
# Export to stdout and filter
unimorph export -l heb -o - | grep "FUT" > future_forms.tsv
# Count forms per lemma
unimorph search -l heb --pos V --tsv --limit 1000 | cut -f1 | sort | uniq -c | sort -rn | head
Error Handling
Commands provide helpful error messages:
$ unimorph inflect כתב
Error: No language specified.
Provide a language code as an argument, or set a default:
export UNIMORPH_LANG=heb
Or in ~/.config/unimorph/config.toml:
default_lang = "heb"
Run 'unimorph list --available' to see available languages.
Getting Help
# General help
unimorph --help
# Command-specific help
unimorph inflect --help
unimorph search --help
Commands
This section provides detailed documentation for each unimorph command.
Data Management
- download - Download language datasets from UniMorph
- list - List available and cached languages
- update - Update cached datasets to latest versions
- delete - Remove cached datasets
- repair - Repair or reset the data store
- export - Export datasets to files
Querying
- inflect - Look up all inflected forms of a lemma
- analyze - Analyze a surface form (reverse lookup)
- search - Search with flexible filtering
- features - Explore morphological features
Information
- stats - Show dataset statistics
- info - Show detailed info about a language
Configuration
- config - Manage configuration settings
download
Download a language dataset from UniMorph.
Alias: dl
Synopsis
unimorph download [OPTIONS] [LANG]
Description
Downloads a UniMorph language dataset from GitHub and imports it into the local SQLite database. Datasets are cached locally, so subsequent queries don't require network access.
If the dataset is already cached, this command does nothing unless --force is specified.
Arguments
| Argument | Description |
|---|---|
| [LANG] | Language code (ISO 639-3, e.g., heb, ita, deu). Optional if UNIMORPH_LANG is set or configured. |
Options
| Option | Description |
|---|---|
| -f, --force | Force re-download even if cached |
| --json | Output as JSON |
| -q, --quiet | Suppress progress output |
Examples
Basic Download
unimorph download heb
Downloading heb...
Downloaded 33177 entries for heb
Force Re-download
unimorph download heb --force
Quiet Mode
unimorph download heb --quiet
JSON Output
unimorph download heb --json
{
"language": "heb",
"entries": 33177,
"status": "downloaded"
}
Download Multiple Languages
for lang in heb ita deu spa; do
unimorph download "$lang"
done
With Default Language
export UNIMORPH_LANG=heb
unimorph download # Downloads Hebrew
Verbose Output
Use -v for detailed import reporting:
unimorph download spa --force -v
This shows:
parsed downloaded data lang=spa filename=["spa"] compression=none from_lfs=false valid_entries=1196224 blank_lines=0 malformed=21
malformed entry lang=spa line=80710 reason=empty form
malformed entry lang=spa line=134234 reason=empty form
...
additional malformed entries not shown lang=spa additional=11
Understanding the Output
| Field | Description |
|---|---|
| filename | Source file(s) downloaded |
| compression | Format: none, xz, gzip, or zip |
| from_lfs | Whether fetched via Git LFS (large files) |
| valid_entries | Successfully parsed entries |
| blank_lines | Empty lines skipped (not an error) |
| malformed | Entries that failed to parse |
Malformed Entry Details
When entries fail to parse, the first 10 are logged with:
- Line number: Where in the source file
- Reason: Why it failed (e.g., "empty form", "expected at least 3 columns")
Common reasons for malformed entries:
- empty form - The inflected form field is blank
- empty lemma - The dictionary form field is blank
- expected at least 3 columns - Line doesn't have lemma, form, and features
These indicate upstream data quality issues in the UniMorph repository.
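To check a raw data file for the same problems yourself, the reasons listed above can be approximated with a few awk rules. This is a sketch under assumptions: the order of the checks and the "blank line" and "ok" labels are invented for illustration, and `classify_line` is a hypothetical helper, not a unimorph command.

```shell
# Hypothetical sketch: classify one raw line using the reasons named above.
# The order of the checks is an assumption, not taken from unimorph's source.
classify_line() {
    printf '%s\n' "$1" | awk -F '\t' '
        $0 == ""  { print "blank line"; next }                   # skipped, not an error
        NF < 3    { print "expected at least 3 columns"; next }
        $1 == ""  { print "empty lemma"; next }
        $2 == ""  { print "empty form"; next }
                  { print "ok" }
    '
}

# A line whose form column is empty.
classify_line "$(printf 'hablar\t\tV;IND;PRS;1;SG')"
```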
Notes
- Language codes are ISO 639-3 (3 lowercase letters)
- Use unimorph list --available to see all available languages
- Downloads are atomic: partial downloads won't corrupt your data
- The first download creates the database at ~/.cache/unimorph/datasets.db
- Compressed files: Large datasets (Polish, Czech, Ukrainian, Slovak) use .xz compression - handled automatically
- Git LFS: Very large files (like Czech's full MorfFlex dataset) use Git LFS - also handled automatically
list
List available and cached languages.
Alias: ls
Synopsis
unimorph list [OPTIONS]
Description
Lists UniMorph languages. By default, shows cached (downloaded) languages with entry counts. Use --available to fetch the full list of available languages from GitHub.
Options
| Option | Description |
|---|---|
| --cached | Show only cached (downloaded) languages |
| --available | Fetch available languages from GitHub |
| --refresh | Refresh the cached list of available languages |
| --json | Output as JSON |
Examples
List Cached Languages
unimorph list
Cached languages:
fin (2737048 entries)
heb (33177 entries)
Use 'unimorph list --available' to see all available languages.
List All Available Languages
unimorph list --available
Available languages (145 total, 2 cached):
ady
afb
ain
...
heb [cached]
...
zul
Use 'unimorph download <code>' to download a language.
JSON Output
unimorph list --json
["fin", "heb"]
unimorph list --available --json
[
{"code": "ady", "cached": false},
{"code": "afb", "cached": false},
...
{"code": "heb", "cached": true},
...
]
Refresh Available List
unimorph list --available --refresh
Forces a fresh fetch from GitHub (the list is normally cached for 24 hours).
Pipe-Friendly Output
When piped, outputs just language codes:
unimorph list | head -3
fin
heb
# Download all available languages
unimorph list --available | while read lang; do
unimorph download "$lang"
done
inflect
Look up all inflected forms of a lemma.
Alias: i
Synopsis
unimorph inflect [OPTIONS] <LEMMA>
Description
Given a lemma (dictionary form), returns all its inflected forms with their morphological features. This is the primary way to see a word's full paradigm.
Arguments
| Argument | Description |
|---|---|
| <LEMMA> | The lemma (dictionary form) to look up |
Options
| Option | Description |
|---|---|
| -l, --lang <LANG> | Language code (ISO 639-3) |
| -f, --features <PATTERN> | Filter by feature pattern (e.g., V;IND;*;SG) |
| --json | Output as JSON |
| --tsv | Output as TSV (tab-separated, no headers) |
Examples
Basic Lookup
unimorph inflect -l heb כתב
LEMMA FORM FEATURES
------------------------------------------------------------
כתב אכתוב V;1;SG;FUT
כתב יכתבו V;3;PL;FUT;MASC
כתב יכתוב V;3;SG;FUT;MASC
כתב כותב V;SG;PRS;MASC
כתב כתב V;3;SG;PST;MASC
...
29 form(s) found.
Filter by Features
Use wildcards (*) to match any value at a position:
# Only singular forms
unimorph inflect -l heb כתב -f "V;*;SG;*"
# Only past tense
unimorph inflect -l heb כתב -f "V;*;*;PST;*"
JSON Output
unimorph inflect -l heb כתב --json
[
{
"lemma": "כתב",
"form": "אכתוב",
"features": {
"raw": "V;1;SG;FUT",
"features": ["V", "1", "SG", "FUT"]
}
},
...
]
TSV for Piping
unimorph inflect -l heb כתב --tsv
כתב אכתוב V;1;SG;FUT
כתב יכתבו V;3;PL;FUT;MASC
כתב יכתוב V;3;SG;FUT;MASC
...
Scripting Examples
# Get unique forms only
unimorph inflect -l heb כתב --tsv | cut -f2 | sort -u
# Count forms by tense
unimorph inflect -l heb כתב --tsv | cut -f3 | grep -o 'PST\|PRS\|FUT' | sort | uniq -c
# Find forms matching a pattern
unimorph inflect -l spa hablar --tsv | grep "1;SG"
Notes
- The lemma must match exactly (case-sensitive for most languages)
- Use search with --lemma for partial/wildcard matching
- Returns empty results if the lemma doesn't exist in the dataset
analyze
Analyze a surface form (reverse lookup).
Alias: a
Synopsis
unimorph analyze [OPTIONS] <FORM>
Description
Given a surface form (inflected word), returns all possible analyses: the lemma it comes from and its morphological features. This is the reverse of inflect.
A form may have multiple analyses if it's ambiguous (e.g., same spelling for different lemmas or different grammatical analyses).
Arguments
| Argument | Description |
|---|---|
| <FORM> | The surface form to analyze |
Options
| Option | Description |
|---|---|
| -l, --lang <LANG> | Language code (ISO 639-3) |
| --json | Output as JSON |
| --tsv | Output as TSV (tab-separated, no headers) |
Examples
Basic Analysis
unimorph analyze -l heb כתבתי
FORM LEMMA FEATURES
------------------------------------------------------------
כתבתי כתב V;1;SG;PST
1 analysis(es) found.
Ambiguous Forms
Some forms have multiple possible analyses:
unimorph analyze -l heb כתבו
FORM LEMMA FEATURES
------------------------------------------------------------
כתבו כתב V;3;PL;PST
כתבו כתב V;2;PL;IMP;MASC
2 analysis(es) found.
JSON Output
unimorph analyze -l heb כתבתי --json
[
{
"lemma": "כתב",
"form": "כתבתי",
"features": {
"raw": "V;1;SG;PST",
"features": ["V", "1", "SG", "PST"]
}
}
]
TSV for Piping
unimorph analyze -l heb כתבתי --tsv
כתבתי כתב V;1;SG;PST
Form Not Found
unimorph analyze -l heb xyz
No analyses found for 'xyz'.
The form may not exist in the dataset, or it could be:
- A proper noun or foreign word
- A misspelling
- A rare or archaic form
Scripting Examples
# Analyze words from a file
cat words.txt | while read word; do
echo "=== $word ==="
unimorph analyze -l heb "$word"
done
# Get just the lemma
unimorph analyze -l heb כתבתי --tsv | cut -f2
# Check if a word exists
if unimorph analyze -l heb כתבתי --tsv | grep -q .; then
echo "Found"
fi
Notes
- Analysis is case-sensitive for most languages
- Forms must match exactly (no fuzzy matching)
- Use search with --form for pattern matching
search
Search entries with flexible filtering.
Alias: s
Synopsis
unimorph search [OPTIONS]
Description
Search the dataset with flexible filtering by lemma, form, features, part of speech, and more. Supports wildcards and multiple filter combinations.
Options
| Option | Description |
|---|---|
| -l, --lang <LANG> | Language code (ISO 639-3) |
| --lemma <PATTERN> | Filter by lemma (supports SQL LIKE wildcards: % and _) |
| --form <PATTERN> | Filter by form (supports SQL LIKE wildcards) |
| -f, --features <PATTERN> | Filter by feature pattern (e.g., V;IND;*;1;*) |
| -c, --contains <FEATURES> | Filter by features contained (comma-separated, position-independent) |
| --pos <POS> | Filter by part of speech (e.g., V, N, ADJ) |
| --limit <N> | Limit number of results (default: 100) |
| --offset <N> | Skip first N results |
| --count | Just show count of matching entries |
| --json | Output as JSON |
| --tsv | Output as TSV |
Examples
Search by Lemma Pattern
# Lemmas starting with "כת"
unimorph search -l heb --lemma "כת%"
# Lemmas containing "בר"
unimorph search -l heb --lemma "%בר%"
# Exact 4-letter lemmas
unimorph search -l heb --lemma "____"
Search by Form Pattern
# Forms ending with "ים"
unimorph search -l heb --form "%ים"
Filter by Features (Position-Dependent)
Use semicolon-separated patterns with * as wildcard:
# First person singular verbs
unimorph search -l heb -f "V;1;SG;*"
# Past tense forms
unimorph search -l heb -f "V;*;*;PST;*"
Filter by Features (Position-Independent)
Use --contains for features that can be at any position:
# Plural masculine forms (regardless of position)
unimorph search -l heb --contains PL,MASC
# Future tense first person
unimorph search -l heb --contains FUT,1
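The position-independent rule can be pictured as a set-membership test: every requested feature must appear somewhere in the entry's feature bundle. The sketch below illustrates that reading; `contains_all` is a hypothetical helper, and it is an assumption that the tool compares whole feature tags rather than substrings.

```shell
# Hypothetical sketch of position-independent matching.
# $1 = comma-separated features that must all be present
# $2 = feature bundle of one entry, e.g. "V;3;PL;FUT;MASC"
# Exit status 0 means the bundle matches.
contains_all() {
    awk -v wanted="$1" -v bundle="$2" 'BEGIN {
        n = split(bundle, have, ";")
        for (i = 1; i <= n; i++) present[have[i]] = 1   # set of tags in the bundle
        m = split(wanted, want, ",")
        for (j = 1; j <= m; j++)
            if (!(want[j] in present)) exit 1           # one missing tag: no match
        exit 0
    }'
}

contains_all "PL,MASC" "V;3;PL;FUT;MASC" && echo "match"
```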
Filter by Part of Speech
# Only verbs
unimorph search -l heb --pos V
# Only nouns
unimorph search -l heb --pos N
Combine Filters
# Verbs with plural masculine future forms
unimorph search -l heb --pos V --contains PL,MASC,FUT
# Lemmas starting with "א" that are verbs
unimorph search -l heb --lemma "א%" --pos V
Pagination
# First 20 results
unimorph search -l heb --pos V --limit 20
# Results 21-40
unimorph search -l heb --pos V --limit 20 --offset 20
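The limit and offset arithmetic is easy to get off by one, so here is a small sketch of the semantics applied to plain lines on stdin. `page` is a hypothetical helper that mimics the documented behaviour (skip the first N results, then keep at most the limit); it does not call unimorph.

```shell
# Hypothetical sketch of --limit / --offset semantics over lines on stdin.
# $1 = limit (maximum rows to keep), $2 = offset (rows to skip first)
page() {
    awk -v limit="$1" -v offset="$2" 'NR > offset && NR <= offset + limit'
}

# Rows 21-40 of a 50-line input, matching the second command above.
awk 'BEGIN { for (i = 1; i <= 50; i++) print i }' | page 20 20
```

To walk through all results, increase the offset by the limit on each request: page k (counting from 1) starts at offset (k - 1) * limit.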
Count Only
unimorph search -l heb --pos V --count
15234 entries match.
Output Formats
# JSON
unimorph search -l heb --pos V --limit 5 --json
# TSV for piping
unimorph search -l heb --pos V --limit 5 --tsv
Scripting Examples
# Get unique lemmas for a part of speech
unimorph search -l heb --pos V --limit 10000 --tsv | cut -f1 | sort -u
# Count entries per lemma
unimorph search -l heb --pos V --limit 10000 --tsv | cut -f1 | sort | uniq -c | sort -rn | head
# Export filtered subset
unimorph search -l heb --contains FUT --tsv > future_forms.tsv
Wildcards Reference
SQL LIKE Wildcards (for --lemma and --form)
| Pattern | Matches |
|---|---|
| % | Any sequence of characters |
| _ | Any single character |
| abc% | Starts with "abc" |
| %abc | Ends with "abc" |
| %abc% | Contains "abc" |
| a_c | "a" + any char + "c" |
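For readers more used to shell globs, the two LIKE wildcards map directly onto `*` and `?`. The sketch below emulates the table with a shell `case` pattern. `like_match` is a hypothetical helper, and it assumes the pattern has no glob metacharacters of its own (`*`, `?`, `[`). It also differs from SQLite in one respect: SQLite's LIKE ignores case for ASCII letters by default, while shell globs are case-sensitive.

```shell
# Hypothetical sketch: emulate SQL LIKE with a shell glob (% -> *, _ -> ?).
# $1 = LIKE pattern, $2 = string to test. Exit status 0 means it matches.
like_match() {
    glob=$(printf '%s' "$1" | sed 's/%/*/g; s/_/?/g')
    case $2 in
        $glob) return 0 ;;   # unquoted on purpose so it is treated as a pattern
        *)     return 1 ;;
    esac
}

like_match 'habl%' 'hablar' && echo "starts with habl"
```

For non-ASCII text such as Hebrew, whether `?` matches one character or one byte depends on the shell's locale settings.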
Feature Pattern Wildcards (for -f)
| Pattern | Matches |
|---|---|
| * | Any value at that position |
| V;*;SG;* | Verb, any person, singular, any tense |
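Position-dependent matching compares the pattern and the feature bundle slot by slot. The sketch below shows one plausible reading of that rule. It assumes the pattern and the bundle must have the same number of positions; unimorph may handle length mismatches differently, and `features_match` is a hypothetical helper.

```shell
# Hypothetical sketch of position-dependent matching with * wildcards.
# Assumption: pattern and bundle must have the same number of positions.
# $1 = pattern, e.g. "V;*;SG;*"; $2 = feature bundle, e.g. "V;1;SG;FUT"
# Exit status 0 means the bundle matches.
features_match() {
    awk -v pattern="$1" -v bundle="$2" 'BEGIN {
        n = split(pattern, pat, ";")
        m = split(bundle, have, ";")
        if (n != m) exit 1
        for (i = 1; i <= n; i++)
            if (pat[i] != "*" && pat[i] != have[i]) exit 1   # slot differs
        exit 0
    }'
}

features_match "V;*;SG;*" "V;1;SG;FUT" && echo "match"
```

When the position of a feature varies between entries, --contains is the more forgiving option.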
stats
Show dataset statistics.
Alias: st
Synopsis
unimorph stats [OPTIONS] [LANG]
Description
Displays statistics about a downloaded language dataset, including entry counts, unique lemmas, unique forms, and unique feature combinations.
Arguments
| Argument | Description |
|---|---|
| [LANG] | Language code (ISO 639-3). Optional if default is configured. |
Options
| Option | Description |
|---|---|
| --json | Output as JSON |
Examples
Basic Statistics
unimorph stats heb
Statistics for heb:
Total entries: 33177
Unique lemmas: 1176
Unique forms: 27286
Unique features: 55
Imported at: 2024-01-15 10:30:00 UTC
JSON Output
unimorph stats heb --json
{
"total_entries": 33177,
"unique_lemmas": 1176,
"unique_forms": 27286,
"unique_features": 55
}
Compare Languages
for lang in heb ita fin deu; do
echo "=== $lang ==="
unimorph stats "$lang"
echo
done
Scripting
# Get entry count
unimorph stats heb --json | jq '.total_entries'
# Compare sizes
unimorph list | while read lang; do
count=$(unimorph stats "$lang" --json | jq '.total_entries')
echo "$lang: $count"
done | sort -t: -k2 -rn
Understanding the Statistics
| Metric | Description |
|---|---|
| Total entries | Number of (lemma, form, features) triples |
| Unique lemmas | Number of distinct dictionary forms |
| Unique forms | Number of distinct surface forms |
| Unique features | Number of distinct feature bundle combinations |
| Imported at | When the dataset was downloaded |
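The four counts can be recomputed from any tab-separated export, which is a handy cross-check. The sketch below assumes the lemma, form, features column order of the TSV output; `tsv_stats` and the key names it prints are hypothetical, chosen to mirror the JSON fields shown earlier.

```shell
# Hypothetical sketch: recompute the counts from tab-separated entries on stdin.
tsv_stats() {
    awk -F '\t' '
        NF >= 3 {
            total++
            if (!($1 in lemmas))  { lemmas[$1] = 1;  n_lemmas++ }
            if (!($2 in forms))   { forms[$2] = 1;   n_forms++ }
            if (!($3 in bundles)) { bundles[$3] = 1; n_bundles++ }
        }
        END {
            printf "total_entries=%d\n",   total
            printf "unique_lemmas=%d\n",   n_lemmas
            printf "unique_forms=%d\n",    n_forms
            printf "unique_features=%d\n", n_bundles
        }
    '
}

# Two rows of the same lemma: 2 entries, 1 lemma, 2 forms, 2 feature bundles.
printf 'hablar\thablo\tV;IND;PRS;1;SG\nhablar\thablas\tV;IND;PRS;2;SG\n' | tsv_stats
```

Note that "unique features" counts whole feature bundles such as V;IND;PRS;1;SG, not individual tags.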
info
Show detailed info about a cached language.
Alias: in
Synopsis
unimorph info [OPTIONS] [LANG]
Description
Displays detailed information about a downloaded language dataset, including source URL, local and remote commit information, update status, and statistics.
Arguments
| Argument | Description |
|---|---|
| [LANG] | Language code (ISO 639-3). Optional if default is configured. |
Options
| Option | Description |
|---|---|
| --json | Output as JSON |
Examples
Basic Info
unimorph info heb
Language: heb
Source: https://github.com/unimorph/heb
Local imported: 2024-01-15 10:30:00 UTC
Local commit: b2bff12
Remote commit: b2bff12 (2023-01-09)
Status: Up to date
Statistics:
Total entries: 33177
Unique lemmas: 1176
Unique forms: 27286
Unique features: 55
Update Available
unimorph info heb
Language: heb
Source: https://github.com/unimorph/heb
Local imported: 2024-01-15 10:30:00 UTC
Local commit: b2bff12
Remote commit: c4d8e23 (2024-02-01)
Status: Update available
Statistics:
Total entries: 33177
Unique lemmas: 1176
Unique forms: 27286
Unique features: 55
JSON Output
unimorph info heb --json
{
"language": "heb",
"source": "https://github.com/unimorph/heb",
"local_commit": "b2bff12",
"remote_commit": "c4d8e23",
"imported_at": "2024-01-15T10:30:00Z",
"update_available": true,
"stats": {
"total_entries": 33177,
"unique_lemmas": 1176,
"unique_forms": 27286,
"unique_features": 55
}
}
export
Export a language dataset to file.
Alias: x
Synopsis
unimorph export [OPTIONS]
Description
Exports a downloaded language dataset to a file in TSV or JSONL format. Useful for integrating with other tools, creating backups, or processing data with external programs.
Options
| Option | Description |
|---|---|
| -l, --lang <LANG> | Language code (ISO 639-3) |
| -o, --output <PATH> | Output file path (use - for stdout) |
| -F, --format <FORMAT> | Output format: tsv or jsonl (auto-detected from extension) |
Examples
Export to TSV
unimorph export -l heb -o hebrew.tsv
Exported 33177 entries to hebrew.tsv
Export to JSONL
unimorph export -l heb -o hebrew.jsonl
Or explicitly specify format:
unimorph export -l heb -o hebrew.json --format jsonl
Export to Stdout
Use -o - to write to stdout:
unimorph export -l heb -o - --format tsv | head -5
איבד אאבד V;1;SG;FUT
איבזר אאבזר V;1;SG;FUT
איבטח אאבטח V;1;SG;FUT
האביס אאביס V;1;SG;FUT
אבל אאבל V;1;SG;FUT
The status message goes to stderr, so piping works correctly:
unimorph export -l heb -o - 2>/dev/null | wc -l
33177
Scripting Examples
# Filter exported data
unimorph export -l heb -o - | grep "FUT" > future_forms.tsv
# Export and compress
unimorph export -l heb -o - | gzip > hebrew.tsv.gz
# Export multiple languages
for lang in heb ita deu; do
unimorph export -l "$lang" -o "${lang}.tsv"
done
# Convert to CSV
unimorph export -l heb -o - | tr '\t' ',' > hebrew.csv
Output Formats
TSV (Tab-Separated Values)
lemma<TAB>form<TAB>features
Example:
hablar hablo V;IND;PRS;1;SG
hablar hablas V;IND;PRS;2;SG
JSONL (JSON Lines)
One JSON object per line:
{"lemma":"hablar","form":"hablo","features":"V;IND;PRS;1;SG"}
{"lemma":"hablar","form":"hablas","features":"V;IND;PRS;2;SG"}
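The two formats carry the same three fields, so one can be derived from the other. As a rough sketch, the filter below turns TSV lines into the JSONL shape shown above. `tsv_to_jsonl` is a hypothetical helper; it escapes only backslashes and double quotes, and other control characters are not handled.

```shell
# Hypothetical sketch: convert lemma<TAB>form<TAB>features lines to JSON Lines.
tsv_to_jsonl() {
    # Escape backslashes first, then double quotes, so the values are valid
    # inside JSON strings; awk then wraps the three fields in an object.
    sed 's/\\/\\\\/g; s/"/\\"/g' |
    awk -F '\t' 'NF >= 3 {
        printf "{\"lemma\":\"%s\",\"form\":\"%s\",\"features\":\"%s\"}\n", $1, $2, $3
    }'
}

printf 'hablar\thablo\tV;IND;PRS;1;SG\n' | tsv_to_jsonl
```

For real exports, unimorph export with --format jsonl is the supported route; the sketch is only meant to make the mapping between the formats explicit.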
Notes
- Format is auto-detected from file extension (.tsv or .jsonl)
- Use --format to override auto-detection
- Stdout export writes status to stderr to avoid polluting data
update
Update cached language datasets.
Alias: up
Synopsis
unimorph update [OPTIONS] [LANG]
Description
Checks for and downloads updates to cached language datasets. Can update a single language or all cached languages at once.
Arguments
| Argument | Description |
|---|---|
| [LANG] | Language code to update. Omit with --all to update all. |
Options
| Option | Description |
|---|---|
| --all | Update all cached languages |
| --check | Check for updates without downloading |
| --json | Output as JSON |
Examples
Check for Updates
unimorph update heb --check
Checking for updates...
heb - update available
Or if up to date:
Checking for updates...
heb - up to date
Update a Single Language
unimorph update heb
Updating heb...
Updated heb: 33177 -> 33250 entries
Check All Languages
unimorph update --all --check
Checking for updates...
fin - up to date
heb - update available
ita - up to date
1 update(s) available.
Update All Languages
unimorph update --all
Updating all cached languages...
fin - up to date
heb - updated (33177 -> 33250 entries)
ita - up to date
1 language(s) updated.
JSON Output
unimorph update --all --check --json
{
"languages": [
{"code": "fin", "update_available": false},
{"code": "heb", "update_available": true},
{"code": "ita", "update_available": false}
],
"updates_available": 1
}
Scripting
# Check and update only if needed
if unimorph update heb --check --json | jq -e '.update_available' > /dev/null; then
unimorph update heb
fi
features
Explore morphological features in a language.
Alias: f
Synopsis
unimorph features [OPTIONS]
Description
Explore the morphological features used in a language dataset. View unique feature values, their frequencies, search for entries with specific features, or analyze feature positions.
Options
| Option | Description |
|---|---|
| -l, --lang <LANG> | Language code (ISO 639-3) |
| --list | List all unique feature values |
| --stats | Show feature value counts (histogram) |
| --search <FEATURE> | Search for entries containing a specific feature |
| --position <N> | Show values at a specific position (0-indexed) |
| --limit <N> | Limit number of results (default: 50) |
| --json | Output as JSON |
Examples
Feature Structure Overview
unimorph features -l heb
Feature structure for heb:
Position 0: 3 unique values (e.g., V, N, V.MSDR)
Position 1: 6 unique values (e.g., 2, 3, 1)
Position 2: 6 unique values (e.g., SG, PL, PRS)
Position 3: 11 unique values (e.g., FUT, PST, IMP)
Position 4: 2 unique values (e.g., FEM, MASC)
Use --list for all unique values, --stats for counts, --search <FEATURE> to find entries.
List All Features
unimorph features -l heb --list
Unique features in heb:
1
2
3
DEF
FEM
FUT
IMP
MASC
N
...
24 unique feature values.
Feature Statistics
unimorph features -l heb --stats
Feature statistics for heb:
FEATURE COUNT
----------------------------------------
V 28663
SG 16226
PL 15158
FEM 12384
MASC 12384
2 12108
FUT 10400
PST 9378
3 7286
1 4164
... and 14 more
Search by Feature
unimorph features -l heb --search FUT --limit 5
Entries with feature 'FUT':
LEMMA FORM FEATURES
------------------------------------------------------------
איבד אאבד V;1;SG;FUT
איבזר אאבזר V;1;SG;FUT
איבטח אאבטח V;1;SG;FUT
האביס אאביס V;1;SG;FUT
אבל אאבל V;1;SG;FUT
Showing 5 of 10400 results.
Analyze Feature Position
unimorph features -l heb --position 0
Feature values at position 0 in heb:
VALUE COUNT
----------------------------------------
V 28663
N 3338
V.MSDR 1176
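The per-position counts shown above can be reproduced outside the CLI with a few lines of standard-library Rust. The following is a minimal sketch, not the tool's implementation: `count_at_position` is a hypothetical helper that operates on raw feature strings rather than on the SQLite store.

```rust
use std::collections::BTreeMap;

/// Count how often each value occurs at `position` (0-indexed) across
/// semicolon-separated feature bundles. Bundles that are too short are skipped.
fn count_at_position(bundles: &[&str], position: usize) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for bundle in bundles {
        if let Some(value) = bundle.split(';').nth(position) {
            *counts.entry(value.to_string()).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    let bundles = ["V;1;SG;FUT", "V;2;SG;FUT", "N;SG", "V.MSDR;SG"];
    let mut counts: Vec<_> = count_at_position(&bundles, 0).into_iter().collect();
    // Sort by descending count, then by name, to mimic a histogram listing.
    counts.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    for (value, count) in counts {
        println!("{value}\t{count}");
    }
}
```

Because values are compared as whole features, `V` and `V.MSDR` are counted separately, which matches the separate rows in the output above.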
JSON Output
unimorph features -l heb --stats --json
{
"V": 28663,
"SG": 16226,
"PL": 15158,
...
}
Pipe-Friendly Output
When piped, outputs clean format:
# Get just feature names
unimorph features -l heb --list | head -5
1
2
3
DEF
FEM
# Feature counts as TSV
unimorph features -l heb --stats | head -5
V 28663
SG 16226
PL 15158
FEM 12384
MASC 12384
Use Cases
- Understanding a language: See what features are used
- Finding examples: Search for entries with specific features
- Data exploration: Analyze feature distribution
- Building queries: Discover feature names for search filters
See Also
- search - Search with feature filters
- UniMorph Schema - Feature definitions
delete
Delete a cached language dataset.
Alias: rm
Synopsis
unimorph delete [OPTIONS] [LANG]
Description
Removes a downloaded language dataset from the local cache. The data can be re-downloaded later with unimorph download.
Arguments
| Argument | Description |
|---|---|
| [LANG] | Language code (ISO 639-3). Optional if default is configured. |
Options
| Option | Description |
|---|---|
| --json | Output as JSON |
Examples
Delete a Language
unimorph delete heb
Deleted heb (33177 entries removed)
JSON Output
unimorph delete heb --json
{
"language": "heb",
"entries_removed": 33177,
"status": "deleted"
}
Delete Multiple Languages
for lang in heb ita deu; do
  unimorph delete "$lang"
done
Notes
- This only removes the data from the local cache
- Statistics and metadata are also removed
- Re-download anytime with unimorph download
- Use unimorph repair --clear-data to delete all languages at once
See Also
repair
Repair or reset the local data store.
Synopsis
unimorph repair [OPTIONS]
Description
Utility command for troubleshooting and resetting the local data store. Can clear the API response cache or all downloaded datasets.
Options
| Option | Description |
|---|---|
| --clear-cache | Clear cached API responses |
| --clear-data | Clear all downloaded datasets (requires re-download) |
| --json | Output as JSON |
Examples
Clear API Cache
Clears the cached list of available languages (normally cached for 24 hours):
unimorph repair --clear-cache
Cleared API cache
Clear All Data
Removes all downloaded language datasets:
unimorph repair --clear-data
Cleared all data (5 languages removed)
Clear Both
unimorph repair --clear-cache --clear-data
JSON Output
unimorph repair --clear-data --json
{
"cache_cleared": false,
"data_cleared": true,
"languages_removed": 5
}
Use Cases
- Corrupted data: If queries return unexpected results
- Stale cache: If available language list seems outdated
- Disk space: Remove all data to free space
- Fresh start: Reset everything to initial state
Notes
- --clear-cache only removes API response cache, not datasets
- --clear-data removes all downloaded languages
- Data can be re-downloaded with unimorph download
See Also
sample
Randomly sample entries from a language dataset.
Alias: rand
Synopsis
unimorph sample [OPTIONS] <N>
Description
Samples random entries from a downloaded language dataset. Useful for exploring data, creating test sets, or getting a quick overview of a language's morphology.
Arguments
| Argument | Description |
|---|---|
| <N> | Number of entries to sample |
Options
| Option | Description |
|---|---|
| -l, --lang <LANG> | Language code (ISO 639-3) |
| -s, --seed <SEED> | Seed for reproducible sampling |
| --by-lemma | Sample complete paradigms instead of random entries |
| --json | Output as JSON |
| --tsv | Output as TSV (tab-separated, no headers) |
Examples
Random Entries
unimorph sample -l spa 5
LEMMA FORM FEATURES
------------------------------------------------------------
tapiar tapiemos V;SBJV;PRS;1;PL
apilar apilando V;V.CVB;PRS
hablar hablaste V;IND;PST;PFV;2;SG;INFM
comer comieron V;IND;PST;PFV;3;PL
vivir viviremos V;IND;FUT;1;PL
5 sampled entry(ies).
Sample Complete Paradigms
Use --by-lemma to get all forms of randomly selected lemmas:
unimorph sample -l spa 2 --by-lemma
This returns complete paradigms for 2 random lemmas, showing all their inflected forms.
Reproducible Sampling
Use --seed for reproducible results:
unimorph sample -l spa 5 --seed 42
Running with the same seed always returns the same entries.
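To illustrate why a fixed seed gives repeatable output, here is a minimal sketch of seeded sampling using only the standard library. It is an assumption-laden illustration: `XorShift64` and `sample` are hypothetical names, and the random number generator the CLI actually uses is not documented here.

```rust
/// Tiny xorshift64 generator; a stand-in for whatever RNG the CLI uses.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift must not start from zero, so substitute a non-zero constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Pick up to `n` distinct items with a partial Fisher-Yates shuffle.
fn sample<T: Clone>(items: &[T], n: usize, seed: u64) -> Vec<T> {
    let mut pool: Vec<T> = items.to_vec();
    let mut rng = XorShift64::new(seed);
    let take = n.min(pool.len());
    for i in 0..take {
        // Choose an index in i..len and swap it into position i.
        let j = i + (rng.next() % (pool.len() - i) as u64) as usize;
        pool.swap(i, j);
    }
    pool.truncate(take);
    pool
}

fn main() {
    let forms = ["hablo", "hablas", "habla", "hablamos", "habláis", "hablan"];
    let first = sample(&forms, 3, 42);
    let second = sample(&forms, 3, 42);
    assert_eq!(first, second); // same seed, same sample
    println!("{first:?}");
}
```

The generator's state is fully determined by the seed, so the sequence of swaps, and therefore the sample, is identical on every run.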
JSON Output
unimorph sample -l spa 3 --json
[
{
"lemma": "hablar",
"form": "hablamos",
"features": {
"raw": "V;IND;PRS;1;PL",
"features": ["V", "IND", "PRS", "1", "PL"]
}
},
...
]
TSV for Scripting
unimorph sample -l spa 10 --tsv > sample.tsv
Scripting Examples
# Create a test set
unimorph sample -l spa 100 --seed 123 --tsv > test_set.tsv
# Sample paradigms for flashcard generation
unimorph sample -l spa 10 --by-lemma --json > flashcards.json
# Get random verbs only
unimorph sample -l spa 50 --tsv | awk -F'\t' '$3 ~ /^V;/' | head -10
Notes
- Without --seed, results are different each run
- --by-lemma returns more entries than N (all forms of N lemmas)
- Large N values may take longer for big datasets
See Also
config
Manage configuration.
Alias: cfg
Synopsis
unimorph config <COMMAND>
Subcommands
| Command | Description |
|---|---|
| show | Show current configuration |
| init | Initialize a new config file |
| path | Show the config file path |
config show
Display the current configuration, including both config file settings and defaults.
unimorph config show
Configuration
Path: /home/user/.config/unimorph/config.toml
Status: loaded
Current Settings
default_lang: heb
data_dir: (default)
output_format: (default: table)
no_color: (not set)
JSON Output
unimorph config show --json
{
"path": "/home/user/.config/unimorph/config.toml",
"exists": true,
"default_lang": "heb",
"data_dir": null,
"output_format": null,
"no_color": null
}
config init
Create a new config file with example content.
unimorph config init
Created config file at /home/user/.config/unimorph/config.toml
Force Overwrite
unimorph config init --force
Overwrites existing config file.
JSON Output
unimorph config init --json
{
"path": "/home/user/.config/unimorph/config.toml",
"created": true
}
config path
Show the config file path.
unimorph config path
/home/user/.config/unimorph/config.toml
JSON Output
unimorph config path --json
{
"path": "/home/user/.config/unimorph/config.toml"
}
Config File Format
The config file uses TOML format:
# Default language for commands
default_lang = "heb"
# Custom data directory
# data_dir = "/custom/path"
# Default output format: "table", "json", or "tsv"
# output_format = "table"
# Disable colored output
# no_color = true
# Language aliases
[languages]
hebrew = "heb"
spanish = "spa"
See Also
- Configuration Guide - Full configuration documentation
completions
Generate shell completions for your shell.
Synopsis
unimorph completions <SHELL>
Description
Generates shell completion scripts that enable tab-completion for unimorph commands, options, and arguments.
Arguments
| Argument | Description |
|---|---|
| <SHELL> | Shell to generate completions for: bash, zsh, fish, elvish, powershell |
Installation
Bash
# Add to ~/.bashrc
source <(unimorph completions bash)
# Or save to a file
unimorph completions bash > ~/.local/share/bash-completion/completions/unimorph
Zsh
# Add to ~/.zshrc (before compinit)
source <(unimorph completions zsh)
# Or save to fpath
unimorph completions zsh > ~/.zfunc/_unimorph
# Then add to ~/.zshrc: fpath=(~/.zfunc $fpath)
Fish
unimorph completions fish > ~/.config/fish/completions/unimorph.fish
PowerShell
# Add to your PowerShell profile
unimorph completions powershell | Out-String | Invoke-Expression
# Or save to a file and source it
unimorph completions powershell > unimorph.ps1
Elvish
unimorph completions elvish > ~/.elvish/lib/unimorph.elv
# Then add to ~/.elvish/rc.elv: use unimorph
Examples
After installation, you can use tab completion:
# Complete commands
unimorph inf<TAB> # completes to 'inflect'
# Complete options
unimorph inflect --<TAB> # shows available options
# Complete language codes (if supported by your shell)
unimorph inflect -l <TAB>
Notes
- Restart your shell or source your config file after installation
- Some completions may require a downloaded language list to work
See Also
- Installation - Full installation instructions
Library Overview
The unimorph-core crate provides a Rust library for working with UniMorph morphological data. Use it to integrate morphological lookups into your own applications.
Installation
Add to your Cargo.toml:
[dependencies]
unimorph-core = "0.1"
Quick Example
use unimorph_core::{Repository, LangCode};

fn main() -> anyhow::Result<()> {
    // Create a repository (uses default cache directory)
    let repo = Repository::open_default()?;

    // Parse language code
    let lang: LangCode = "heb".parse()?;

    // Look up all forms of a lemma
    let forms = repo.store().inflect(&lang, "כתב")?;
    for entry in forms {
        println!("{} -> {} ({})", entry.lemma, entry.form, entry.features);
    }

    // Analyze a surface form
    let analyses = repo.store().analyze(&lang, "כתבתי")?;
    for entry in analyses {
        println!("{} <- {} ({})", entry.form, entry.lemma, entry.features);
    }

    Ok(())
}
Core Components
Repository
The Repository manages data downloads and caching:
use unimorph_core::Repository;

// Default location (~/.cache/unimorph)
let repo = Repository::open_default()?;

// Custom location
let repo = Repository::open("/custom/path")?;

// Download a language
repo.download("heb").await?;

// List cached languages
let languages = repo.cached_languages()?;
Store
The Store provides the query interface:
let store = repo.store();

// Inflect: lemma -> forms
let forms = store.inflect("heb", "כתב")?;

// Analyze: form -> lemmas
let analyses = store.analyze("heb", "כתבתי")?;

// Statistics
let stats = store.stats("heb")?;
Query Builder
Flexible searching with the query builder:
let results = store.query("heb")
    .lemma("כת%")                     // LIKE pattern
    .pos("V")                         // Part of speech
    .features_contain(&["FUT", "1"])  // Has these features
    .limit(100)
    .execute()?;
Types
Core data types:
use unimorph_core::{Entry, LangCode, FeatureBundle};

// Language codes (validated)
let lang: LangCode = "heb".parse()?;

// Entries contain lemma, form, features
let entry = Entry {
    lemma: "כתב".to_string(),
    form: "כתבתי".to_string(),
    features: "V;1;SG;PST".parse()?,
};

// Feature bundles support pattern matching
let features: FeatureBundle = "V;1;SG;PST".parse()?;
assert!(features.matches("V;*;SG;*"));
assert!(features.contains("PST"));
Error Handling
The library uses a custom Error type:
use unimorph_core::{Error, Repository, Result};

fn example() -> Result<()> {
    let repo = Repository::open_default()?;

    match repo.store().inflect("heb", "xyz") {
        Ok(entries) => println!("Found {} entries", entries.len()),
        Err(Error::NotFound(msg)) => println!("Not found: {}", msg),
        Err(e) => return Err(e),
    }

    Ok(())
}
Feature Flags
| Flag | Description |
|---|---|
| default | Standard features |
| parquet | Parquet export support |
[dependencies]
unimorph-core = { version = "0.1", features = ["parquet"] }
Next Steps
- Types - Core data types
- Store - Query interface
- Repository - Data management
- Query Builder - Advanced searching
Types
Core data types in unimorph-core.
LangCode
A validated ISO 639-3 language code (3 lowercase ASCII letters).
use unimorph_core::LangCode;

// Parse from string
let lang: LangCode = "heb".parse()?;

// Validation happens at parse time
assert!("HEB".parse::<LangCode>().is_err()); // Must be lowercase
assert!("he".parse::<LangCode>().is_err());  // Must be 3 chars
assert!("h3b".parse::<LangCode>().is_err()); // Must be letters

// Convert to string
let s: &str = lang.as_ref();
let s: String = lang.to_string();
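The validation rule stated above (exactly three lowercase ASCII letters) is simple enough to sketch with the standard library. `is_valid_lang_code` is a hypothetical stand-in written for illustration; it is not the crate's own parser and does not check that the code is actually assigned in ISO 639-3.

```rust
/// Returns true if `code` is exactly three lowercase ASCII letters.
fn is_valid_lang_code(code: &str) -> bool {
    code.len() == 3 && code.bytes().all(|b| b.is_ascii_lowercase())
}

fn main() {
    for code in ["heb", "HEB", "he", "h3b"] {
        println!("{code}: {}", is_valid_lang_code(code));
    }
}
```

Checking the byte length first is safe here because any non-ASCII character fails the `is_ascii_lowercase` test anyway.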
Entry
A single morphological entry with lemma, form, and features.
use unimorph_core::Entry;

// Entries are returned from queries
let entries = store.inflect("heb", "כתב")?;
for entry in entries {
    println!("Lemma: {}", entry.lemma);
    println!("Form: {}", entry.form);
    println!("Features: {}", entry.features);
    println!("Features (raw): {}", entry.features.raw());
    println!("Features (list): {:?}", entry.features.as_slice());
}

// Parse from TSV line
let entry = Entry::parse_line("כתב\tכתבתי\tV;1;SG;PST", 1)?;

// Serialize to JSON
let json = serde_json::to_string(&entry)?;
Fields
| Field | Type | Description |
|---|---|---|
| lemma | String | Dictionary form |
| form | String | Inflected surface form |
| features | FeatureBundle | Morphological features |
FeatureBundle
A semicolon-separated bundle of morphological features.
use unimorph_core::FeatureBundle;

// Parse from string
let features: FeatureBundle = "V;1;SG;PST".parse()?;

// Access individual features
assert_eq!(features.as_slice(), &["V", "1", "SG", "PST"]);
assert_eq!(features.raw(), "V;1;SG;PST");
assert_eq!(features.len(), 4);

// Check if contains a feature (position-independent)
assert!(features.contains("PST"));
assert!(features.contains("V"));
assert!(!features.contains("FUT"));

// Check if contains all features
assert!(features.contains_all(&["V", "PST"]));

// Pattern matching with wildcards
assert!(features.matches("V;*;SG;*"));
assert!(features.matches("V;1;*;PST"));
assert!(!features.matches("N;*;*;*"));

// Display
println!("{}", features); // "V;1;SG;PST"
Pattern Matching
The matches method supports positional pattern matching:
| Pattern | Description |
|---|---|
| V;1;SG;PST | Exact match |
| V;*;SG;* | Wildcard at positions 1 and 3 |
| *;*;*;PST | Only check position 3 |
Note: Pattern must have same number of positions as the bundle.
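The positional rule in the table can be sketched as a slot-by-slot comparison. The helper below, `matches_pattern`, is a hypothetical illustration that works on raw strings; it is not the crate's `matches` implementation, only one way to realise the behaviour described above.

```rust
/// Compare a feature bundle against a pattern slot by slot.
/// `*` accepts any value; the number of positions must be equal.
fn matches_pattern(bundle: &str, pattern: &str) -> bool {
    let features: Vec<&str> = bundle.split(';').collect();
    let slots: Vec<&str> = pattern.split(';').collect();
    features.len() == slots.len()
        && features.iter().zip(&slots).all(|(f, s)| *s == "*" || f == s)
}

fn main() {
    let bundle = "V;1;SG;PST";
    for pattern in ["V;1;SG;PST", "V;*;SG;*", "*;*;*;PST", "N;*;*;*", "V;*"] {
        println!("{pattern}: {}", matches_pattern(bundle, pattern));
    }
}
```

The last pattern, `V;*`, is rejected because it has two positions while the bundle has four.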
Validation
- Feature bundles cannot be empty
- Individual features cannot be empty
- Features are separated by semicolons
assert!("".parse::<FeatureBundle>().is_err());      // Empty
assert!("V;;SG".parse::<FeatureBundle>().is_err()); // Empty feature
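A parser enforcing these rules might look like the following minimal sketch. `parse_bundle` and `BundleError` are hypothetical names introduced for illustration; the crate's real error type and parsing code may differ.

```rust
/// Hypothetical error type for the two validation failures listed above.
#[derive(Debug, PartialEq)]
enum BundleError {
    Empty,
    /// Index of the first empty feature.
    EmptyFeature(usize),
}

/// Split a raw bundle on semicolons, rejecting empty bundles and empty features.
fn parse_bundle(raw: &str) -> Result<Vec<String>, BundleError> {
    if raw.is_empty() {
        return Err(BundleError::Empty);
    }
    raw.split(';')
        .enumerate()
        .map(|(i, feature)| {
            if feature.is_empty() {
                Err(BundleError::EmptyFeature(i))
            } else {
                Ok(feature.to_string())
            }
        })
        .collect()
}

fn main() {
    println!("{:?}", parse_bundle("V;1;SG;PST"));
    println!("{:?}", parse_bundle(""));
    println!("{:?}", parse_bundle("V;;SG"));
}
```

Collecting an iterator of `Result` values stops at the first error, so the reported index always points at the first empty feature.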
DatasetStats
Statistics about a downloaded language dataset.
use unimorph_core::DatasetStats;

let stats = store.stats("heb")?;
if let Some(stats) = stats {
    println!("Total entries: {}", stats.total_entries);
    println!("Unique lemmas: {}", stats.unique_lemmas);
    println!("Unique forms: {}", stats.unique_forms);
    println!("Unique features: {}", stats.unique_features);
}
Fields
| Field | Type | Description |
|---|---|---|
| total_entries | usize | Number of entries |
| unique_lemmas | usize | Distinct lemmas |
| unique_forms | usize | Distinct surface forms |
| unique_features | usize | Distinct feature bundles |
Serialization
All types implement Serialize and Deserialize from serde:
use unimorph_core::Entry;

// Keep the result vector alive so the borrowed entry stays valid
let entries = store.inflect("heb", "כתב")?;
let entry = entries.first().unwrap();

// To JSON
let json = serde_json::to_string(entry)?;

// From JSON
let entry: Entry = serde_json::from_str(&json)?;
Store
The Store provides the query interface for morphological data.
Opening a Store
Usually accessed through Repository:
use unimorph_core::Repository;

let repo = Repository::open_default()?;
let store = repo.store();
Or open directly:
use unimorph_core::Store;

// Open existing database
let store = Store::open("path/to/datasets.db")?;

// In-memory store (for testing)
let store = Store::in_memory()?;
Basic Queries
Inflect (Lemma to Forms)
Look up all inflected forms of a lemma:
let forms = store.inflect("heb", "כתב")?;

for entry in &forms {
    println!("{} -> {} ({})", entry.lemma, entry.form, entry.features);
}

println!("Found {} forms", forms.len());
Analyze (Form to Lemmas)
Find all possible lemmas for a surface form:
let analyses = store.analyze("heb", "כתבו")?;

for entry in &analyses {
    println!("{} <- {} ({})", entry.form, entry.lemma, entry.features);
}

// Handle ambiguous forms
if analyses.len() > 1 {
    println!("Ambiguous: {} possible analyses", analyses.len());
}
Statistics
Get dataset statistics:
if let Some(stats) = store.stats("heb")? {
    println!("Entries: {}", stats.total_entries);
    println!("Lemmas: {}", stats.unique_lemmas);
    println!("Forms: {}", stats.unique_forms);
}
Check Language
Check if a language is loaded:
if store.has_language("heb")? {
    println!("Hebrew is available");
}

// List all languages
let languages = store.languages()?;
for lang in languages {
    println!("- {}", lang);
}
Query Builder
For flexible searching, use the query builder:
let results = store.query("heb")
    .lemma("כת%")                // LIKE pattern (% = any chars)
    .form("%ים")                 // Forms ending in ים
    .pos("V")                    // Part of speech
    .features_match("V;*;SG;*")  // Pattern match
    .features_contain(&["FUT"])  // Contains feature
    .limit(100)
    .offset(0)
    .execute()?;
See Query Builder for full documentation.
Data Management
Import Data
Import entries from TSV format:
use unimorph_core::{Entry, LangCode};

// Language codes must be exactly 3 lowercase letters
let lang: LangCode = "tst".parse()?;
let entries = vec![
    Entry::parse_line("test\tform1\tN;SG", 1)?,
    Entry::parse_line("test\tform2\tN;PL", 2)?,
];

store.import(&lang, &entries, None, None)?;
Delete Language
Remove a language from the store:
let removed = store.delete_language("heb")?;
println!("Removed {} entries", removed);
Export
Export to various formats:
// Export to TSV file
let count = store.export_tsv("heb", "hebrew.tsv")?;

// Export to JSONL file
let count = store.export_jsonl("heb", "hebrew.jsonl")?;

// Export to writer (e.g., stdout)
use std::io::stdout;
let count = store.export_tsv_to_writer("heb", stdout().lock())?;

// Parquet (with feature flag)
#[cfg(feature = "parquet")]
let count = store.export_parquet("heb", "hebrew.parquet")?;
Thread Safety
Store is Send but not Sync. For concurrent access, use a mutex or create separate store instances:
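use std::sync::{Arc, Mutex};
use unimorph_core::Store;

// Wrap the store in Arc<Mutex<..>> so it can be shared across threads
let store = Arc::new(Mutex::new(Store::open("datasets.db")?));

// In each thread, lock before querying:
let store = store.lock().unwrap();
let results = store.inflect("heb", "כתב")?;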
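The mutex approach can be demonstrated end to end without the library. In this minimal sketch, `FakeStore` is a hypothetical stand-in for `Store`: its `RefCell` field makes it `Send` but not `Sync`, which is the same constraint described above. Wrapping it in `Arc<Mutex<..>>` is what allows several threads to share it.

```rust
use std::cell::RefCell;
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for `Store`: the `RefCell` makes it `Send` but not `Sync`.
struct FakeStore {
    lookups: RefCell<u32>,
}

impl FakeStore {
    /// Dummy query that records the call and returns the lemma length.
    fn inflect(&self, lemma: &str) -> usize {
        *self.lookups.borrow_mut() += 1;
        lemma.chars().count()
    }
}

/// Run one lookup per lemma, each on its own thread, and return the call count.
fn run_lookups(lemmas: Vec<String>) -> u32 {
    let store = Arc::new(Mutex::new(FakeStore { lookups: RefCell::new(0) }));
    let handles: Vec<_> = lemmas
        .into_iter()
        .map(|lemma| {
            let store = Arc::clone(&store);
            thread::spawn(move || {
                // Hold the lock only for the duration of the query.
                let guard = store.lock().unwrap();
                guard.inflect(&lemma)
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let total = *store.lock().unwrap().lookups.borrow();
    total
}

fn main() {
    let lemmas = vec!["כתב".to_string(), "שמר".to_string()];
    println!("lookups performed: {}", run_lookups(lemmas));
}
```

A single mutex serialises all queries. When lookups dominate the workload, separate store instances per thread, as the text suggests, avoid that contention.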
Error Handling
use unimorph_core::{Store, Error};

match store.inflect("xyz", "test") {
    Ok(entries) => println!("Found {} entries", entries.len()),
    Err(Error::LanguageNotFound(lang)) => {
        println!("Language {} not downloaded", lang);
    }
    Err(e) => return Err(e.into()),
}
Repository
The Repository manages data downloads, caching, and provides access to the underlying store.
Creating a Repository
use unimorph_core::Repository;

// Default location (~/.cache/unimorph)
let repo = Repository::open_default()?;

// Custom location
let repo = Repository::open("/path/to/data")?;

// Custom location with PathBuf
use std::path::PathBuf;
let path = PathBuf::from("/path/to/data");
let repo = Repository::open(&path)?;
Downloading Data
Download a language dataset from UniMorph:
// Download (async)
repo.download("heb").await?;

// Force re-download
repo.download_with_options("heb", true).await?;
Compressed Files and Git LFS
Some large datasets are distributed differently due to GitHub file size limits:
| Format | Languages | Notes |
|---|---|---|
| .xz (LZMA) | ces, pol, slk, ukr | Best compression for text |
| .zip | rus (segmentations), san | Archive format |
| Git LFS | ces (full MorfFlex) | For files > 100MB |
The repository automatically:
- Tries compressed versions first (.xz, .gz)
- Falls back to uncompressed if not found
- Detects Git LFS pointers and fetches from media endpoint
- Decompresses transparently before importing
No special handling is needed - just call download() as usual.
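For readers curious how such detection can work, the sketch below classifies a payload by its leading bytes. It is an illustration under stated assumptions: `Payload` and `classify` are hypothetical names, and the library may well decide by file extension or HTTP metadata instead. The byte signatures themselves are the standard ones for xz, gzip, zip, and Git LFS pointer files.

```rust
/// Hypothetical classification of a downloaded payload.
#[derive(Debug, PartialEq)]
enum Payload {
    Plain,
    Xz,
    Gzip,
    Zip,
    LfsPointer,
}

/// Classify raw bytes by their leading signature.
fn classify(bytes: &[u8]) -> Payload {
    const XZ_MAGIC: &[u8] = &[0xFD, b'7', b'z', b'X', b'Z', 0x00];
    const GZIP_MAGIC: &[u8] = &[0x1F, 0x8B];
    const ZIP_MAGIC: &[u8] = b"PK\x03\x04";
    // Git LFS pointer files are small text files that start with this line.
    const LFS_PREFIX: &[u8] = b"version https://git-lfs.github.com/spec/v1";

    if bytes.starts_with(XZ_MAGIC) {
        Payload::Xz
    } else if bytes.starts_with(GZIP_MAGIC) {
        Payload::Gzip
    } else if bytes.starts_with(ZIP_MAGIC) {
        Payload::Zip
    } else if bytes.starts_with(LFS_PREFIX) {
        Payload::LfsPointer
    } else {
        Payload::Plain
    }
}

fn main() {
    let samples: [&[u8]; 3] = [
        b"\x1f\x8b\x08\x00",
        b"PK\x03\x04rest",
        b"hablar\thablo\tV;IND;PRS;1;SG\n",
    ];
    for sample in samples {
        println!("{:?}", classify(sample));
    }
}
```

Anything without a recognised signature is treated as plain text, which mirrors the fall-back-to-uncompressed step in the list above.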
Parse Reporting
When parsing downloaded data, use Entry::parse_tsv_with_report() for detailed diagnostics:
use unimorph_core::{Entry, ParseReport, CompressionFormat};

let content = "lemma\tform\tV;IND\nbad line\nlemma2\tform2\tN;SG\n";
let (entries, report) = Entry::parse_tsv_with_report(content);

println!("Valid entries: {}", report.valid_entries);
println!("Blank lines: {}", report.blank_lines);
println!("Malformed: {}", report.malformed_count);

// Inspect malformed entries (first 10 stored)
for entry in &report.malformed {
    println!("  Line {}: {} - {}", entry.line_num, entry.reason, entry.content);
}
The ParseReport includes:
| Field | Type | Description |
|---|---|---|
| valid_entries | usize | Successfully parsed entries |
| blank_lines | usize | Empty lines (not an error) |
| malformed_count | usize | Total entries that failed |
| malformed | Vec<MalformedEntry> | Details for first 10 failures |
| compression | CompressionFormat | Source file format |
| from_lfs | bool | Whether fetched via Git LFS |
| filename | Option<String> | Source filename(s) |
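The counting behaviour behind the first four fields can be sketched with the standard library alone. `Report` and `parse_tsv` below are hypothetical simplifications written for illustration, assuming a line is valid when it has exactly three non-empty tab-separated columns; the crate's own rules and types may be stricter.

```rust
/// Simplified, hypothetical counterpart of the report described above.
#[derive(Debug, Default)]
struct Report {
    valid_entries: usize,
    blank_lines: usize,
    malformed_count: usize,
    /// (line number, content) for at most the first 10 failures.
    malformed: Vec<(usize, String)>,
}

/// Parse lemma/form/features triples and tally what was skipped.
fn parse_tsv(content: &str) -> (Vec<(String, String, String)>, Report) {
    let mut entries = Vec::new();
    let mut report = Report::default();
    for (idx, line) in content.lines().enumerate() {
        if line.trim().is_empty() {
            report.blank_lines += 1;
            continue;
        }
        let cols: Vec<&str> = line.split('\t').collect();
        match cols.as_slice() {
            [lemma, form, feats]
                if !lemma.is_empty() && !form.is_empty() && !feats.is_empty() =>
            {
                entries.push((lemma.to_string(), form.to_string(), feats.to_string()));
                report.valid_entries += 1;
            }
            _ => {
                report.malformed_count += 1;
                if report.malformed.len() < 10 {
                    report.malformed.push((idx + 1, line.to_string()));
                }
            }
        }
    }
    (entries, report)
}

fn main() {
    let content = "lemma\tform\tV;IND\nbad line\n\nlemma2\tform2\tN;SG\n";
    let (entries, report) = parse_tsv(content);
    println!("{} entries, report: {:?}", entries.len(), report);
}
```

Keeping a total count separately from the capped detail list means a badly damaged file cannot inflate the report's memory use.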
The CompressionFormat enum:
pub enum CompressionFormat {
    None, // Plain text
    Xz,   // .xz (LZMA)
    Gzip, // .gz
    Zip,  // .zip archive
}
Accessing the Store
Get the underlying store for queries:
let store = repo.store();
let forms = store.inflect("heb", "כתב")?;
Checking Cached Languages
// List cached languages
let languages = repo.cached_languages()?;
for lang in &languages {
    println!("Cached: {}", lang);
}

// Check if specific language is cached
if languages.iter().any(|l| l.as_ref() == "heb") {
    println!("Hebrew is cached");
}
Data Directory
The repository manages a data directory containing:
~/.cache/unimorph/
├── datasets.db # SQLite database
└── available_languages.json # Cached API response
Get the data directory:
let data_dir = repo.data_dir();
println!("Data stored in: {}", data_dir.display());
Full Example
use unimorph_core::Repository;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Open repository
    let repo = Repository::open_default()?;

    // Download Hebrew if not cached
    let cached = repo.cached_languages()?;
    if !cached.iter().any(|l| l.as_ref() == "heb") {
        println!("Downloading Hebrew...");
        repo.download("heb").await?;
    }

    // Query the data
    let store = repo.store();
    let forms = store.inflect("heb", "כתב")?;

    println!("Found {} forms of כתב:", forms.len());
    for entry in &forms {
        println!("  {} - {}", entry.form, entry.features);
    }

    Ok(())
}
Error Handling
use unimorph_core::{Repository, Error};

async fn download_language(repo: &Repository, lang: &str) -> anyhow::Result<()> {
    match repo.download(lang).await {
        Ok(()) => println!("Downloaded {}", lang),
        Err(Error::Network(e)) => {
            println!("Network error: {}", e);
            println!("Check your connection and try again");
        }
        Err(Error::InvalidLanguage(l)) => {
            println!("Invalid language code: {}", l);
        }
        Err(e) => return Err(e.into()),
    }
    Ok(())
}
Async Runtime
Download operations are async and require a runtime:
use unimorph_core::Repository;

// With tokio
#[tokio::main]
async fn main() {
    let repo = Repository::open_default().unwrap();
    repo.download("heb").await.unwrap();
}

// Or with block_on (an alternative main, not to be combined with the one above)
fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap();
    let repo = Repository::open_default().unwrap();
    rt.block_on(repo.download("heb")).unwrap();
}
Query Builder
The query builder provides a fluent interface for flexible searching.
Basic Usage
let results = store.query("heb")
    .limit(100)
    .execute()?;
Filter Methods
By Lemma
// Exact match
.lemma("כתב")

// LIKE pattern (% = any chars, _ = single char)
.lemma("כת%")   // Starts with כת
.lemma("%ב")    // Ends with ב
.lemma("%בר%")  // Contains בר
.lemma("___")   // Exactly 3 characters
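The semantics of the two LIKE wildcards can be illustrated with a small matcher. This is a sketch for intuition only: `like` is a hypothetical helper, the real matching is done by SQLite, and unlike SQLite's default LIKE (which ignores ASCII case) this version is case-sensitive. It works on `char`s so that `_` consumes one Hebrew letter rather than one byte.

```rust
/// Match `text` against a LIKE-style pattern: `%` = any run of characters
/// (including none), `_` = exactly one character, anything else is literal.
fn like(text: &str, pattern: &str) -> bool {
    let t: Vec<char> = text.chars().collect();
    let p: Vec<char> = pattern.chars().collect();
    like_at(&t, &p)
}

fn like_at(t: &[char], p: &[char]) -> bool {
    match p.split_first() {
        // Pattern exhausted: succeed only if the text is exhausted too.
        None => t.is_empty(),
        // `%`: try every possible number of skipped characters.
        Some(('%', rest)) => (0..=t.len()).any(|skip| like_at(&t[skip..], rest)),
        // `_`: consume exactly one character.
        Some(('_', rest)) => !t.is_empty() && like_at(&t[1..], rest),
        // Literal: the next character must be equal.
        Some((c, rest)) => t.first() == Some(c) && like_at(&t[1..], rest),
    }
}

fn main() {
    for pattern in ["כת%", "%ב", "___", "____"] {
        println!("{pattern}: {}", like("כתב", pattern));
    }
}
```

The recursive search is exponential in the worst case, which is acceptable for a demonstration but not how a database engine evaluates patterns.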
By Form
// Exact match
.form("כתבתי")

// LIKE pattern
.form("%ים")  // Plural forms ending in ים
.form("ה%")   // Forms starting with ה
By Part of Speech
.pos("V")    // Verbs
.pos("N")    // Nouns
.pos("ADJ")  // Adjectives
By Features (Pattern Match)
Position-dependent matching with wildcards:
// Match specific positions
.features_match("V;1;SG;*")     // 1st person singular verbs
.features_match("V;*;*;PST;*")  // Past tense verbs
.features_match("N;*;PL;*")     // Plural nouns
By Features (Contains)
Position-independent matching:
// Has these features anywhere
.features_contain(&["FUT"])           // Future tense
.features_contain(&["PL", "MASC"])    // Plural masculine
.features_contain(&["V", "1", "SG"])  // 1st person singular verbs
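Position-independent matching amounts to a set-membership test over whole features. The helper below, `contains_all`, is a hypothetical sketch on raw strings, assuming features are compared in full rather than by substring; how the library evaluates this inside SQLite is not shown here.

```rust
/// True if every wanted feature appears somewhere in the bundle.
/// Features are compared whole, so "V" does not match "V.PTCP".
fn contains_all(bundle: &str, wanted: &[&str]) -> bool {
    let features: Vec<&str> = bundle.split(';').collect();
    wanted.iter().all(|w| features.contains(w))
}

fn main() {
    let bundle = "V;IND;PST;PFV;3;PL";
    println!("{}", contains_all(bundle, &["PL", "V"]));
    println!("{}", contains_all(bundle, &["FUT"]));
}
```

Order in the `wanted` list does not matter, which is the key difference from the positional patterns in the previous section.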
Pagination
// First page
.limit(20)
.offset(0)

// Second page
.limit(20)
.offset(20)

// All results (careful with large datasets!)
.limit(usize::MAX)
Executing Queries
Get Results
let entries: Vec<Entry> = store.query("heb")
    .pos("V")
    .limit(100)
    .execute()?;

for entry in &entries {
    println!("{} {} {}", entry.lemma, entry.form, entry.features);
}
Count Results
let count = store.query("heb")
    .pos("V")
    .count()?;

println!("Found {} verbs", count);
Check Existence
let exists = store.query("heb")
    .lemma("כתב")
    .exists()?;

if exists {
    println!("Lemma found");
}
Get First Result
if let Some(entry) = store.query("heb")
    .lemma("כתב")
    .first()?
{
    println!("First form: {}", entry.form);
}
Chaining Filters
Filters are combined with AND logic:
let results = store.query("heb")
    .lemma("כת%")                // AND
    .pos("V")                    // AND
    .features_contain(&["FUT"])  // AND
    .limit(10)
    .execute()?;
Examples
Find All Verb Infinitives
let infinitives = store.query("heb")
    .pos("V")
    .features_contain(&["NFIN"])
    .execute()?;
Find Ambiguous Forms
Forms that could be multiple parts of speech:
let form = "שמר";

let as_verb = store.query("heb")
    .form(form)
    .pos("V")
    .execute()?;

let as_noun = store.query("heb")
    .form(form)
    .pos("N")
    .execute()?;

if !as_verb.is_empty() && !as_noun.is_empty() {
    println!("{} is ambiguous (verb and noun)", form);
}
Paginate Through All Results
let page_size = 100;
let mut offset = 0;

loop {
    let results = store.query("heb")
        .pos("V")
        .limit(page_size)
        .offset(offset)
        .execute()?;

    if results.is_empty() {
        break;
    }

    for entry in &results {
        // Process entry
    }

    offset += page_size;
}
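The same limit/offset loop can be written generically and exercised against in-memory data. In this minimal sketch, `fetch_all` is a hypothetical helper and the closure stands in for a call such as `store.query(..).limit(..).offset(..).execute()`. As a small variation on the loop above, it stops as soon as a page comes back short, which saves the final empty query.

```rust
/// Drain a paged source. `fetch_page(limit, offset)` returns at most `limit` items.
fn fetch_all<T, F>(page_size: usize, mut fetch_page: F) -> Vec<T>
where
    F: FnMut(usize, usize) -> Vec<T>,
{
    assert!(page_size > 0, "page_size must be positive");
    let mut all = Vec::new();
    let mut offset = 0;
    loop {
        let page = fetch_page(page_size, offset);
        let got = page.len();
        all.extend(page);
        // A short page means the source is exhausted.
        if got < page_size {
            break;
        }
        offset += page_size;
    }
    all
}

fn main() {
    // Stand-in data source: 250 rows held in memory.
    let rows: Vec<u32> = (0..250).collect();
    let all: Vec<u32> = fetch_all(100, |limit, offset| {
        rows.iter().skip(offset).take(limit).cloned().collect()
    });
    println!("fetched {} rows", all.len());
}
```

Offset-based paging assumes the underlying data does not change between queries; otherwise rows can be skipped or repeated.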
Export Filtered Subset
use std::io::Write;

let mut file = std::fs::File::create("verbs.tsv")?;

let verbs = store.query("heb")
    .pos("V")
    .limit(usize::MAX)
    .execute()?;

for entry in &verbs {
    writeln!(file, "{}\t{}\t{}", entry.lemma, entry.form, entry.features)?;
}
Performance Tips
- Use limits: Always set a reasonable limit
- Prefer specific filters: More filters = faster queries
- Use count() first: Check result size before fetching all
- Index-friendly queries: Lemma and form queries use indexes
// Good: Uses index
.lemma("כתב")

// Good: Uses index
.form("כתבתי")

// Slower: Full scan with pattern
.lemma("%תב%")

// Slower: Feature scan
.features_contain(&["FUT"])
Python Bindings
The unimorph-rs Python package provides fast, Rust-powered access to UniMorph morphological data with native Polars DataFrame support.
Installation
pip install unimorph-rs
For Polars DataFrame support:
pip install unimorph-rs[polars]
Links:
Requirements
- Python 3.9+
- Polars (optional, for DataFrame methods)
Quick Start
from unimorph import Store, download
# Download a language dataset (one-time)
download("spa") # Spanish
# Create a store to query the data
store = Store()
# Get all inflected forms of a lemma
forms = store.inflect("spa", "hablar")
for entry in forms:
    print(f"{entry.form}: {entry.features}")
Output:
hablar: V;NFIN
hablando: V;V.CVB;PRS
hablado: V;V.PTCP;PST;MASC;SG
hablo: V;IND;PRS;1;SG
hablas: V;IND;PRS;2;SG
habla: V;IND;PRS;3;SG
...
Core API
download(lang)
Downloads a language dataset from UniMorph. Only needs to be called once per language.
from unimorph import download
download("deu") # German
download("spa") # Spanish
download("fra") # French
Store
The main interface for querying morphological data.
from unimorph import Store
store = Store()
store.inflect(lang, lemma)
Get all inflected forms for a lemma (dictionary form).
forms = store.inflect("deu", "gehen") # "to go" in German
for entry in forms:
    print(f"{entry.lemma} -> {entry.form}: {entry.features}")
store.analyze(lang, form)
Analyze a word form to find possible lemmas and features.
analyses = store.analyze("spa", "hablamos")
for entry in analyses:
    print(f"{entry.form} <- {entry.lemma}: {entry.features}")
store.search_features(lang, features, limit=None)
Search for entries containing specific morphological features.
# Find all past tense subjunctive forms in Spanish
entries = store.search_features("spa", "SBJV;PST", limit=100)
store.stats(lang)
Get statistics about a downloaded language dataset.
stats = store.stats("spa")
if stats:
    print(f"Entries: {stats.total_entries}")
    print(f"Unique lemmas: {stats.unique_lemmas}")
    print(f"Unique forms: {stats.unique_forms}")
store.languages()
List all downloaded languages.
langs = store.languages()
print(langs) # ['deu', 'ita', 'spa', ...]
store.has_language(lang)
Check if a language is downloaded.
if store.has_language("fra"):
    print("French data is available")
Polars DataFrame Support
Note: Requires pip install unimorph-rs[polars]
All query methods have _df variants that return Polars DataFrames for easy data analysis.
from unimorph import Store, download
download("spa")
store = Store()
# Get results as a DataFrame
df = store.inflect_df("spa", "ser")
print(df)
Output:
shape: (70, 3)
+-------+---------+------------------------+
| lemma | form | features |
| --- | --- | --- |
| str | str | str |
+-------+---------+------------------------+
| ser | ser | V;NFIN |
| ser | siendo | V;V.CVB;PRS |
| ser | sido | V;V.PTCP;PST;MASC;SG |
| ser | soy | V;IND;PRS;1;SG |
| ser | eres | V;IND;PRS;2;SG |
| ... | ... | ... |
+-------+---------+------------------------+
DataFrame Methods
- store.inflect_df(lang, lemma) - Inflections as DataFrame
- store.analyze_df(lang, form) - Analyses as DataFrame
- store.search_features_df(lang, features, limit=None) - Feature search as DataFrame
Working with DataFrames
import polars as pl
df = store.inflect_df("spa", "hablar")
# Filter to indicative mood only
indicative = df.filter(pl.col("features").str.contains("IND"))
# Label each indicative form with its tense
by_tense = df.filter(
    pl.col("features").str.contains("IND")
).with_columns(
    pl.when(pl.col("features").str.contains("PRS")).then(pl.lit("present"))
    .when(pl.col("features").str.contains("PST")).then(pl.lit("past"))
    .when(pl.col("features").str.contains("FUT")).then(pl.lit("future"))
    .otherwise(pl.lit("other"))
    .alias("tense")
)
print(by_tense)
Entry Objects
Query results return Entry objects with the following attributes:
| Attribute | Type | Description |
|---|---|---|
| lemma | str | Dictionary form / citation form |
| form | str | Inflected surface form |
| features | str | UniMorph feature bundle (semicolon-separated) |
entry = store.inflect("spa", "hablar")[0]
print(entry.lemma) # "hablar"
print(entry.form) # "hablar"
print(entry.features) # "V;NFIN"
print(repr(entry)) # Entry(lemma='hablar', form='hablar', features='V;NFIN')
DatasetStats Objects
Statistics returned by store.stats():
| Attribute | Type | Description |
|---|---|---|
| language | str | Language code |
| total_entries | int | Total number of entries |
| unique_lemmas | int | Number of unique lemmas |
| unique_forms | int | Number of unique forms |
| unique_features | int | Number of unique feature bundles |
Example: Building a Conjugation Table
import polars as pl
from unimorph import Store, download
download("spa")
store = Store()
# Get all forms of "hablar" (to speak)
df = store.inflect_df("spa", "hablar")
# Filter to present indicative
present = df.filter(
    pl.col("features").str.contains("IND") &
    pl.col("features").str.contains("PRS")
)
# Extract person and number
conjugation = present.with_columns([
    pl.when(pl.col("features").str.contains("1")).then(pl.lit("1st"))
    .when(pl.col("features").str.contains("2")).then(pl.lit("2nd"))
    .when(pl.col("features").str.contains("3")).then(pl.lit("3rd"))
    .alias("person"),
    pl.when(pl.col("features").str.contains("SG")).then(pl.lit("singular"))
    .when(pl.col("features").str.contains("PL")).then(pl.lit("plural"))
    .alias("number")
]).select(["person", "number", "form"])
print(conjugation)
See Also
- UniMorph Feature Schema - Understanding feature codes
- Available Languages - List of supported languages
- CLI Reference - Command-line interface
About UniMorph
UniMorph is a collaborative project that provides morphological paradigms for the world's languages in a standardized format.
What is Morphology?
Morphology is the study of word structure and how words change form to express different grammatical meanings. For example:
- English: "walk" -> "walks", "walked", "walking"
- Spanish: "hablar" -> "hablo", "hablas", "habla", "hablamos"...
- Hebrew: "כתב" -> "כותב", "כתבתי", "יכתוב"...
What UniMorph Provides
UniMorph datasets contain mappings from lemmas (dictionary forms) to their inflected forms, along with morphological features describing each form.
Data Format
Each entry is a triple:
lemma <TAB> form <TAB> features
Example (Spanish):
hablar hablo V;IND;PRS;1;SG
hablar hablas V;IND;PRS;2;SG
hablar habla V;IND;PRS;3;SG
hablar hablamos V;IND;PRS;1;PL
hablar habláis V;IND;PRS;2;PL
hablar hablan V;IND;PRS;3;PL
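The triple format is simple enough to read without any library. The following is a minimal sketch, assuming UTF-8 text with exactly three tab-separated fields per non-blank line; `Triple` and `parse_line` are illustrative names, not part of unimorph-rs.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    lemma: str
    form: str
    features: tuple  # individual features, e.g. ("V", "IND", "PRS", "1", "SG")


def parse_line(line):
    """Parse one 'lemma<TAB>form<TAB>features' line; return None for blank lines."""
    line = line.rstrip("\n")
    if not line.strip():
        return None
    parts = line.split("\t")
    if len(parts) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(parts)}: {line!r}")
    lemma, form, features = parts
    return Triple(lemma, form, tuple(features.split(";")))


raw = "hablar\thablo\tV;IND;PRS;1;SG\n\nhablar\thabláis\tV;IND;PRS;2;PL\n"
triples = [t for t in map(parse_line, raw.splitlines()) if t is not None]
for t in triples:
    print(t.lemma, t.form, ";".join(t.features))
```

Splitting on the tab character only (rather than on any whitespace) matters because forms in some languages can contain spaces.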
Coverage
UniMorph includes data for 180+ languages, including:
- High-resource languages: English, Spanish, German, French
- Medium-resource languages: Finnish, Hungarian, Turkish
- Low-resource languages: Many endangered and under-documented languages
Data Sources
UniMorph data comes from:
- Wiktionary extractions
- Linguistic databases
- Academic contributions
- Community submissions
Use Cases
Natural Language Processing
- Training morphological inflection models
- Data augmentation for NLU systems
- Lemmatization and stemming lookup tables
Language Learning
- Conjugation practice applications
- Flashcard generation
- Grammar reference tools
Linguistic Research
- Cross-linguistic typology studies
- Morphological complexity analysis
- Paradigm structure research
Lexicography
- Dictionary development
- Inflection table generation
- Coverage verification
The UniMorph Schema
UniMorph uses a standardized feature schema across all languages, making cross-linguistic comparison possible. Features are organized into dimensions:
- Part of Speech (V, N, ADJ, ...)
- Person (1, 2, 3)
- Number (SG, PL, DU)
- Tense (PST, PRS, FUT)
- And many more...
See the official UniMorph schema documentation for the complete specification, or our Feature Schema page for a quick reference.
Contributing to UniMorph
UniMorph is open source. Each language has its own GitHub repository:
- Main site: unimorph.github.io
- Organization: github.com/unimorph
Contributions are welcome:
- Report data errors
- Add missing forms
- Contribute new languages
Citation
If you use UniMorph in research, please cite:
@inproceedings{mccarthy-etal-2020-unimorph,
title = "{U}ni{M}orph 3.0: Universal Morphology",
author = "McCarthy, Arya D. and others",
booktitle = "LREC",
year = "2020",
}
Related Projects
- SIGMORPHON: Shared tasks on morphological analysis
- Universal Dependencies: Syntactic annotation
- Lexical Markup Framework: ISO standard for lexical resources
Feature Schema
UniMorph uses a standardized feature schema to annotate morphological forms. Features are semicolon-separated and position-dependent within each language.
For the complete official specification, see the UniMorph Schema documentation (PDF).
Feature Format
FEATURE1;FEATURE2;FEATURE3;...
Example: V;IND;PRS;1;SG means:
- V = Verb
- IND = Indicative mood
- PRS = Present tense
- 1 = First person
- SG = Singular number
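The decomposition above can be sketched as a small lookup. The `DESCRIPTIONS` table below is a hand-copied excerpt of the feature tables on this page, and `describe` is an illustrative helper rather than an API of unimorph-rs; features missing from the excerpt are reported as "unknown".

```python
# Hand-copied excerpt of the schema tables; extend as needed.
DESCRIPTIONS = {
    "V": "Verb",
    "N": "Noun",
    "IND": "Indicative",
    "SBJV": "Subjunctive",
    "PRS": "Present",
    "PST": "Past",
    "1": "First person",
    "2": "Second person",
    "3": "Third person",
    "SG": "Singular",
    "PL": "Plural",
}


def describe(bundle):
    """Return (feature, description) pairs for a semicolon-separated bundle."""
    return [(f, DESCRIPTIONS.get(f, "unknown")) for f in bundle.split(";")]


for feature, description in describe("V;IND;PRS;1;SG"):
    print(f"{feature} = {description}")
```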
Feature Dimensions
Part of Speech
| Feature | Description |
|---|---|
| V | Verb |
| N | Noun |
| ADJ | Adjective |
| ADV | Adverb |
| PRO | Pronoun |
| DET | Determiner |
| ADP | Adposition |
| NUM | Numeral |
| CONJ | Conjunction |
| PART | Particle |
| INTJ | Interjection |
| V.MSDR | Verbal noun / Masdar |
| V.PTCP | Participle |
| V.CVB | Converb |
Person
| Feature | Description |
|---|---|
| 1 | First person |
| 2 | Second person |
| 3 | Third person |
| 4 | Fourth person (obviative) |
| INCL | Inclusive |
| EXCL | Exclusive |
Number
| Feature | Description |
|---|---|
| SG | Singular |
| PL | Plural |
| DU | Dual |
| TRI | Trial |
| PAUC | Paucal |
| GRPL | Greater plural |
Gender
| Feature | Description |
|---|---|
| MASC | Masculine |
| FEM | Feminine |
| NEUT | Neuter |
| NAKH1-8 | Nakh-Daghestanian noun classes |
Case
| Feature | Description |
|---|---|
| NOM | Nominative |
| ACC | Accusative |
| GEN | Genitive |
| DAT | Dative |
| INS | Instrumental |
| LOC | Locative |
| ABL | Ablative |
| VOC | Vocative |
| ESS | Essive |
| TRANS | Translative |
| COM | Comitative |
| PRIV | Privative |
| PRT | Partitive |
| ... | And many more |
Tense
| Feature | Description |
|---|---|
| PRS | Present |
| PST | Past |
| FUT | Future |
| IPFV | Imperfective |
| PFV | Perfective |
| PRF | Perfect |
| PLPRF | Pluperfect |
| PROSP | Prospective |
Aspect
| Feature | Description |
|---|---|
| IPFV | Imperfective |
| PFV | Perfective |
| HAB | Habitual |
| PROG | Progressive |
| ITER | Iterative |
Mood
| Feature | Description |
|---|---|
| IND | Indicative |
| SBJV | Subjunctive |
| IMP | Imperative |
| COND | Conditional |
| OPT | Optative |
| POT | Potential |
| PURP | Purposive |
Voice
| Feature | Description |
|---|---|
| ACT | Active |
| PASS | Passive |
| MID | Middle |
| ANTIP | Antipassive |
| CAUS | Causative |
Finiteness
| Feature | Description |
|---|---|
| FIN | Finite |
| NFIN | Non-finite |
Definiteness
| Feature | Description |
|---|---|
| DEF | Definite |
| NDEF | Indefinite |
| SPEC | Specific |
| NSPEC | Non-specific |
Comparison
| Feature | Description |
|---|---|
| CMPR | Comparative |
| SPRL | Superlative |
Polarity
| Feature | Description |
|---|---|
| POS | Positive |
| NEG | Negative |
Possession
| Feature | Description |
|---|---|
| PSS1S | 1st person singular possessor |
| PSS2S | 2nd person singular possessor |
| PSS3S | 3rd person singular possessor |
| PSS1P | 1st person plural possessor |
| PSS2P | 2nd person plural possessor |
| PSS3P | 3rd person plural possessor |
| PSSD | Possessed form |
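The six possessor codes in the table follow a regular shape: PSS, then a person digit, then S or P for number. A minimal sketch of decoding them follows; `decode_possessor` is an illustrative helper, and it deliberately covers only the codes listed above, not every possessor code a dataset might contain.

```python
import re

# PSS + person (1-3) + number (S or P), exactly as in the table above.
_PSS = re.compile(r"PSS([123])([SP])")
_NUMBER = {"S": "singular", "P": "plural"}


def decode_possessor(feature):
    """Return (person, number) for PSS1S-style codes, or None for anything else."""
    m = _PSS.fullmatch(feature)
    if m is None:
        return None
    return int(m.group(1)), _NUMBER[m.group(2)]


print(decode_possessor("PSS3P"))
print(decode_possessor("PSSD"))
```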
Language-Specific Features
Some languages have additional features not listed above. Use unimorph features -l <lang> --list to see all features used in a specific language.
Feature Position
Feature positions vary by language. For example:
Hebrew verbs: V;PERSON;NUMBER;TENSE;GENDER
V;1;SG;PST (1st person singular past)
V;3;PL;FUT;MASC (3rd person plural future masculine)
Spanish verbs: V;MOOD;TENSE;PERSON;NUMBER
V;IND;PRS;1;SG (indicative present 1st singular)
V;SBJV;PST;3;PL (subjunctive past 3rd plural)
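Because positions differ between languages, code that reads features by index needs a per-language template. This is a minimal sketch for the Spanish verb layout shown above; the slot names (including "POS" for the leading V) and the `label_by_position` helper are assumptions made for illustration, and real datasets can contain bundles that do not fit a fixed template.

```python
# Slot names for the Spanish verb layout V;MOOD;TENSE;PERSON;NUMBER.
SPANISH_VERB_TEMPLATE = ("POS", "MOOD", "TENSE", "PERSON", "NUMBER")


def label_by_position(bundle, template):
    """Pair each feature with the slot name at the same position."""
    features = bundle.split(";")
    if len(features) != len(template):
        raise ValueError(
            f"bundle has {len(features)} features, template has {len(template)} slots"
        )
    return dict(zip(template, features))


labeled = label_by_position("V;SBJV;PST;3;PL", SPANISH_VERB_TEMPLATE)
print(labeled)
```

Raising on a length mismatch, rather than silently truncating, makes bundles with optional features (such as the Hebrew past-tense example without gender) visible early.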
Working with Features
CLI
# List all features in a language
unimorph features -l heb --list
# See feature statistics
unimorph features -l heb --stats
# Find entries with a feature
unimorph features -l heb --search FUT
# Search by feature pattern
unimorph search -l heb -f "V;1;SG;*"
# Search by contained features
unimorph search -l heb --contains PL,MASC
Library
use unimorph_core::FeatureBundle;

let features: FeatureBundle = "V;1;SG;PST".parse()?;

// Check for a specific feature
if features.contains("PST") {
    println!("Past tense");
}

// Pattern matching
if features.matches("V;*;SG;*") {
    println!("Singular verb");
}
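For readers who want to reason about what the two kinds of check mean, here is a minimal Python sketch of one plausible semantics: containment compares whole features, and a pattern is matched position by position with `*` accepting any single feature. This is an assumption for illustration only; the actual behavior of `--contains`, `-f`, and the `FeatureBundle` methods (for example, whether a trailing `*` may cover several features) is defined by unimorph-rs, not by this sketch.

```python
def contains_all(bundle, wanted):
    """True if every item in `wanted` appears as a whole feature in `bundle`."""
    return set(wanted) <= set(bundle.split(";"))


def matches(bundle, pattern):
    """Position-wise match; '*' in the pattern accepts any single feature."""
    features = bundle.split(";")
    slots = pattern.split(";")
    if len(features) != len(slots):
        return False
    return all(s == "*" or s == f for s, f in zip(slots, features))


print(matches("V;1;SG;PST", "V;*;SG;*"))
print(contains_all("V;3;PL;FUT;MASC", ["PL", "MASC"]))
```

Comparing whole features avoids a pitfall of plain substring tests: "PL" is a substring of "PLPRF", but it is not one of that bundle's features.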
References
- UniMorph Schema Documentation (PDF) - Official schema specification
- UniMorph Website - Main project site
- UniMorph GitHub - Language repositories
- Leipzig Glossing Rules - Standard for interlinear glossing
- SIGMORPHON - Shared tasks using UniMorph data
Available Languages
UniMorph provides morphological data for 180+ languages. Use unimorph list --available to see the current list.
For the complete list of languages with download links, see the official UniMorph languages page.
Listing Languages
# See all available languages
unimorph list --available
# See cached (downloaded) languages
unimorph list --cached
# Refresh the available list
unimorph list --available --refresh
Language Codes
UniMorph uses ISO 639-3 three-letter language codes:
| Code | Language |
|---|---|
| ara | Arabic |
| deu | German |
| ell | Greek |
| eng | English |
| fas | Persian |
| fin | Finnish |
| fra | French |
| heb | Hebrew |
| hin | Hindi |
| hun | Hungarian |
| ita | Italian |
| jpn | Japanese |
| kat | Georgian |
| kor | Korean |
| lat | Latin |
| nld | Dutch |
| pol | Polish |
| por | Portuguese |
| ron | Romanian |
| rus | Russian |
| spa | Spanish |
| swe | Swedish |
| tur | Turkish |
| ukr | Ukrainian |
| zho | Chinese |
And many more...
Dataset Sizes
Dataset sizes vary significantly:
| Language | Entries | Lemmas |
|---|---|---|
| Finnish (fin) | 2.7M+ | 50K+ |
| Spanish (spa) | 1.2M+ | 10K+ |
| German (deu) | 500K+ | 50K+ |
| Italian (ita) | 500K+ | 10K+ |
| Hebrew (heb) | 33K+ | 1K+ |
Check specific sizes with:
unimorph stats <lang>
Language Repositories
Each language has its own GitHub repository under the UniMorph organization:
https://github.com/unimorph/<code>
For example:
- Hebrew: github.com/unimorph/heb
- Italian: github.com/unimorph/ita
- Finnish: github.com/unimorph/fin
You can also browse all languages on the UniMorph website.
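The repository URL pattern can be turned into a small helper. This sketch only checks that the code looks like an ISO 639-3 code (three lowercase ASCII letters); it does not verify that UniMorph actually publishes a dataset for that code, and `repo_url` is an illustrative name.

```python
import re


def repo_url(code):
    """Build the GitHub URL for a three-letter language code."""
    if not re.fullmatch(r"[a-z]{3}", code):
        raise ValueError(f"not a three-letter lowercase code: {code!r}")
    return f"https://github.com/unimorph/{code}"


print(repo_url("heb"))
```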
Data Quality
Data quality varies by language:
- High quality: Languages with extensive Wiktionary coverage
- Medium quality: Languages with academic contributions
- Lower quality: Newer or less-resourced languages
Check the language's GitHub repository for:
- Data sources
- Known issues
- Contribution guidelines
Finding Language Codes
If you don't know a language's code:
# List all available languages and filter by name
unimorph list --available | grep -i finnish
# Output: fin
# Or use the SIL database
# https://iso639-3.sil.org/code_tables/639/data
Setting Up Aliases
Create shortcuts for frequently used languages:
# ~/.config/unimorph/config.toml
[languages]
hebrew = "heb"
spanish = "spa"
german = "deu"
finnish = "fin"
Then use:
unimorph inflect -l hebrew כתב
# Resolves to: unimorph inflect -l heb כתב
Contributing Languages
To contribute to a language or add a new one:
- Visit the language repository on GitHub
- Check existing issues
- Submit corrections or additions via pull request
See the UniMorph contribution guidelines for more information.
Contributing
Thank you for your interest in contributing to unimorph-rs!
Getting Started
Prerequisites
- Rust (latest stable)
- Git
Clone and Build
git clone https://github.com/joshrotenberg/unimorph-rs
cd unimorph-rs
cargo build
Run Tests
cargo test --all-features
Run Lints
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
Project Structure
unimorph-rs/
├── crates/
│   ├── unimorph-core/           # Core library
│   │   ├── src/
│   │   │   ├── lib.rs
│   │   │   ├── types.rs         # Core types
│   │   │   ├── store.rs         # SQLite backend
│   │   │   ├── query.rs         # Query builder
│   │   │   ├── repository.rs
│   │   │   └── export.rs
│   │   └── Cargo.toml
│   │
│   └── unimorph-cli/            # CLI application
│       ├── src/
│       │   ├── main.rs
│       │   ├── commands/        # Command implementations
│       │   ├── config.rs
│       │   └── colors.rs
│       └── Cargo.toml
│
├── docs/                        # mdBook documentation
│   ├── book.toml
│   └── src/
│
└── Cargo.toml                   # Workspace root
Making Changes
Creating a Branch
git checkout -b feat/your-feature
# or
git checkout -b fix/your-fix
Commit Messages
Use conventional commits:
feat: add new feature
fix: resolve bug in X
docs: update documentation
test: add tests for Y
refactor: restructure Z
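If you want to check a commit subject locally before pushing, a small script is enough. The sketch below accepts the five types listed above, plus the optional scope and `!` marker that the Conventional Commits format allows; the type list and the `is_conventional` helper are illustrative and are not an official check used by this project.

```python
import re

# Types taken from the examples above; adjust if the project accepts others.
TYPES = ("feat", "fix", "docs", "test", "refactor")
_TYPE_ALTERNATION = "|".join(TYPES)

# type, optional "(scope)", optional "!", then ": " and a non-empty description.
_PATTERN = re.compile(r"(?:" + _TYPE_ALTERNATION + r")(?:\([\w.-]+\))?!?: \S.*")


def is_conventional(subject):
    """True if the commit subject line follows the 'type: description' shape."""
    return _PATTERN.fullmatch(subject) is not None


print(is_conventional("feat: add new feature"))
print(is_conventional("added stuff"))
```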
Pull Requests
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests and lints
- Submit a pull request
Development Guidelines
Code Style
- Follow Rust idioms
- Use rustfmt for formatting
- Address all clippy warnings
- Document public APIs
Testing
- Add tests for new features
- Maintain test coverage
- Use meaningful test names
#[test]
fn inflect_returns_all_forms() {
    // ...
}
Error Handling
- Use thiserror for library errors
- Use anyhow for CLI errors
- Provide helpful error messages
Documentation
- Document public items
- Include examples in doc comments
- Update mdBook docs for user-facing changes
Areas for Contribution
Good First Issues
Look for issues labeled good first issue on GitHub.
Feature Ideas
- Additional export formats
- Performance optimizations
- New query capabilities
- Language-specific features
Documentation
- Fix typos
- Improve examples
- Add tutorials
- Translate documentation
Testing
- Add edge case tests
- Improve test coverage
- Add integration tests
Code of Conduct
Be respectful and constructive. We welcome contributors of all experience levels.
Getting Help
- Open a GitHub issue for bugs
- Use discussions for questions
- Check existing issues before creating new ones
License
Contributions are licensed under the same terms as the project (MIT/Apache-2.0).