Introduction

unimorph-rs is a complete Rust toolkit for working with UniMorph morphological data. It provides both a command-line interface and a Rust library for downloading, querying, and analyzing morphological inflection data across 180+ languages.

What is UniMorph?

UniMorph is a collaborative project providing morphological paradigms for the world's languages. Each language dataset contains entries mapping lemmas (dictionary forms) to their inflected forms along with morphological feature annotations.

For example, in Spanish:

Lemma	Form	Features
hablar	hablo	V;IND;PRS;1;SG
hablar	hablas	V;IND;PRS;2;SG
hablar	habla	V;IND;PRS;3;SG
hablar	hablamos	V;IND;PRS;1;PL
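Each row is a tab-separated triple, and the feature bundle in the third column is a semicolon-separated list of tags. As a minimal sketch of that layout (plain awk, independent of the unimorph tool; the function name `explode_features` is invented for illustration), the bundle can be split into individual tags like this:

```shell
# Hypothetical helper (not part of unimorph): read lemma<TAB>form<TAB>features
# rows on stdin and print one "form<TAB>tag" line per feature tag.
explode_features() {
  awk -F'\t' '{
    n = split($3, tags, ";")
    for (i = 1; i <= n; i++) print $2 "\t" tags[i]
  }'
}

# Example with the first row of the table above.
printf 'hablar\thablo\tV;IND;PRS;1;SG\n' | explode_features
```

Splitting on `;` rather than searching for substrings keeps tags such as `V` and `V.MSDR` distinct.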

Features

Fast lookups: SQLite-backed storage with indexed queries
180+ languages: Access to all UniMorph language datasets
Transparent decompression: Handles .xz, .gz, and .zip compressed datasets automatically
Flexible querying: Search by lemma, form, features, or part of speech
Multiple output formats: Table, JSON, TSV for scripting
Pipe-friendly: Output designed for Unix pipelines
Offline-first: Data cached locally after download
Library + CLI: Use as a Rust library or command-line tool

Use Cases

Language learners: Look up conjugations and declensions
NLP researchers: Training data for morphological models
Lexicographers: Verify inflection paradigms
Educators: Build conjugation practice tools
Linguists: Cross-linguistic morphological analysis

Quick Example

# Download Hebrew dataset
unimorph download heb

# Look up all forms of a verb
unimorph inflect -l heb כתב

# Analyze a surface form
unimorph analyze -l heb כתבתי

# Search for plural masculine forms
unimorph search -l heb --contains PL,MASC --limit 10

Getting Started

Head to the Installation guide to get started, or jump straight to the Quick Start for a hands-on introduction.

Installation

Command-Line Tool

Homebrew (macOS/Linux)

brew tap joshrotenberg/brew
brew install unimorph

Cargo (from crates.io)

If you have Rust installed:

cargo install unimorph

Docker

Pull the image from GitHub Container Registry:

docker pull ghcr.io/joshrotenberg/unimorph-rs:latest

Run with a persistent data cache:

# Download a dataset
docker run -v ~/.cache/unimorph:/data ghcr.io/joshrotenberg/unimorph-rs download spa

# Query the data
docker run -v ~/.cache/unimorph:/data ghcr.io/joshrotenberg/unimorph-rs inflect -l spa hablar

# Export data
docker run -v ~/.cache/unimorph:/data -v "$(pwd)":/output ghcr.io/joshrotenberg/unimorph-rs \
    export -l spa -F jsonl -o /output/spanish.jsonl

You can also create a shell alias for convenience:

alias unimorph='docker run -v ~/.cache/unimorph:/data ghcr.io/joshrotenberg/unimorph-rs'

From Source

git clone https://github.com/joshrotenberg/unimorph-rs
cd unimorph-rs
cargo install --path crates/unimorph-cli  # directory still named unimorph-cli

Rust Library

Add to your Cargo.toml:

[dependencies]
unimorph-core = "0.1"

Or with cargo:

cargo add unimorph-core

Shell Completions

Generate completions for your shell:

# Bash
unimorph completions bash > ~/.local/share/bash-completion/completions/unimorph

# Zsh
unimorph completions zsh > ~/.zfunc/_unimorph

# Fish
unimorph completions fish > ~/.config/fish/completions/unimorph.fish

# PowerShell
unimorph completions powershell > _unimorph.ps1

For Zsh, ensure ~/.zfunc is in your fpath:

# Add to ~/.zshrc before compinit
fpath=(~/.zfunc $fpath)
autoload -Uz compinit && compinit

Verifying Installation

unimorph --version
unimorph --help

Data Storage

By default, unimorph stores data in:

Linux/macOS: ~/.cache/unimorph/
Custom: Set the UNIMORPH_DATA environment variable or pass --data-dir

Configuration is stored in:

All platforms: ~/.config/unimorph/config.toml

Quick Start

This guide will get you up and running with unimorph in under 5 minutes.

Download Your First Language

Let's start by downloading a language dataset. We'll use Hebrew (heb) as an example:

unimorph download heb

You'll see output like:

Downloading heb...
Downloaded 33177 entries for heb

Look Up Inflections

Now let's look up all the forms of a Hebrew verb. The inflect command takes a lemma (dictionary form) and shows all its inflected forms:

unimorph inflect -l heb כתב

Output:

LEMMA        FORM         FEATURES
------------------------------------------------------------
כתב         אכתוב       V;1;SG;FUT
כתב         יכתבו       V;3;PL;FUT;MASC
כתב         יכתוב       V;3;SG;FUT;MASC
כתב         כותב        V;SG;PRS;MASC
כתב         כתב         V;3;SG;PST;MASC
...

29 form(s) found.

Analyze a Surface Form

What if you have a word and want to know what it is? Use analyze:

unimorph analyze -l heb כתבתי

Output:

FORM         LEMMA        FEATURES
------------------------------------------------------------
כתבתי       כתב         V;1;SG;PST

1 analysis(es) found.

Search with Filters

Find entries matching specific criteria:

# Find all first person singular future forms
unimorph search -l heb --contains 1,SG,FUT --limit 5

# Find verbs (part of speech = V)
unimorph search -l heb --pos V --limit 5

# Search by lemma pattern (SQL LIKE wildcards)
unimorph search -l heb --lemma "כת%" --limit 5

Check Dataset Statistics

unimorph stats heb

Statistics for heb:
  Total entries:    33177
  Unique lemmas:    1176
  Unique forms:     27286
  Unique features:  55
  Imported at:      2024-01-15 10:30:00 UTC

Set a Default Language

Tired of typing -l heb every time? Set a default:

export UNIMORPH_LANG=heb

Or create a config file:

unimorph config init

Then edit ~/.config/unimorph/config.toml:

default_lang = "heb"

Now you can just run:

unimorph inflect כתב
unimorph analyze כתבתי

Output Formats

JSON Output

Add --json for machine-readable output:

unimorph inflect -l heb כתב --json

TSV for Piping

Use --tsv for tab-separated output without headers:

unimorph inflect -l heb כתב --tsv | head -5

כתב	אכתוב	V;1;SG;FUT
כתב	יכתבו	V;3;PL;FUT;MASC
כתב	יכתוב	V;3;SG;FUT;MASC
כתב	כותב	V;SG;PRS;MASC
כתב	כותבות	V;PL;PRS;FEM

Export Full Dataset

Export an entire language to a file:

unimorph export -l heb -o hebrew.tsv
unimorph export -l heb -o hebrew.jsonl --format jsonl

Or to stdout for piping:

unimorph export -l heb -o - | grep "FUT" | wc -l

Next Steps

Browse available languages
Learn about the feature schema
Explore the full CLI reference
Use the Rust library in your projects

Configuration

unimorph can be configured through environment variables, a config file, or command-line flags. Settings are applied in this priority order (highest to lowest):

Command-line flags
Environment variables
Config file
Built-in defaults
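The priority order above can be sketched as a small shell function. This is only an illustration of the documented precedence for the language setting, not the tool's actual implementation; `resolve_lang` and its two arguments are hypothetical names.

```shell
# Sketch of the documented precedence: flag > environment > config file.
# $1 = value passed via -l/--lang (may be empty)
# $2 = default_lang read from the config file (may be empty)
resolve_lang() {
  flag_value=$1
  config_value=$2
  if [ -n "$flag_value" ]; then
    printf '%s\n' "$flag_value"
  elif [ -n "${UNIMORPH_LANG:-}" ]; then
    printf '%s\n' "$UNIMORPH_LANG"
  elif [ -n "$config_value" ]; then
    printf '%s\n' "$config_value"
  else
    return 1   # nothing set: the CLI reports "No language specified"
  fi
}

# Environment beats config file when no flag is given.
(UNIMORPH_LANG=spa; resolve_lang "" heb)
```

The subshell in the last line keeps the `UNIMORPH_LANG` assignment from leaking into the calling shell.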

Config File

The config file is located at ~/.config/unimorph/config.toml on all platforms.

Creating a Config File

# Create a config file with example content
unimorph config init

# View current configuration
unimorph config show

# Show config file path
unimorph config path

Config File Format

# Default language for commands (ISO 639-3 code)
default_lang = "heb"

# Custom data directory (default: ~/.cache/unimorph)
# data_dir = "/path/to/custom/data"

# Default output format: "table", "json", or "tsv"
# output_format = "table"

# Disable colored output
# no_color = true

# Language aliases for convenience
[languages]
hebrew = "heb"
spanish = "spa"
german = "deu"
italian = "ita"
finnish = "fin"

Language Aliases

Define shortcuts for language codes:

[languages]
he = "heb"
it = "ita"
de = "deu"

Then use:

unimorph inflect -l he כתב
# Resolves to: unimorph inflect -l heb כתב

Environment Variables

Variable	Description	Example
`UNIMORPH_LANG`	Default language code	`export UNIMORPH_LANG=heb`
`UNIMORPH_DATA`	Custom data directory	`export UNIMORPH_DATA=/data/unimorph`
`NO_COLOR`	Disable colored output	`export NO_COLOR=1`

Command-Line Flags

Global flags available on all commands:

Flag	Description
`-d, --data-dir <PATH>`	Custom data directory
`-v, --verbose`	Enable debug output (`-vv` for trace)
`-q, --quiet`	Suppress non-essential output

Data Storage

Default Locations

Dataset database: ~/.cache/unimorph/datasets.db
API cache: ~/.cache/unimorph/available_languages.json
Config file: ~/.config/unimorph/config.toml

Custom Data Directory

Override the data directory:

# Via environment variable
export UNIMORPH_DATA=/custom/path
unimorph download heb

# Via command-line flag
unimorph --data-dir /custom/path download heb

# Via config file
# data_dir = "/custom/path"

Resetting Data

# Clear API response cache
unimorph repair --clear-cache

# Clear all downloaded datasets (requires re-download)
unimorph repair --clear-data

Output Modes

Table (Default)

Human-readable formatted output with colors when connected to a terminal:

unimorph inflect -l heb כתב

JSON

Machine-readable JSON output:

unimorph inflect -l heb כתב --json

TSV

Tab-separated values without headers, ideal for piping:

unimorph inflect -l heb כתב --tsv

Pipe Detection

When stdout is not a terminal (e.g., piped to another command), unimorph automatically outputs in a pipe-friendly format:

# Automatically outputs just language codes, one per line
unimorph list | xargs -I{} echo "Language: {}"
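Your own wrapper scripts can make the same decision with the standard `[ -t 1 ]` test. The sketch below shows the general technique only; the `output_mode` name and the `table`/`plain` labels are placeholders, and nothing here is taken from unimorph's source.

```shell
# Sketch of terminal detection: `[ -t 1 ]` is true when file descriptor 1
# (stdout) is attached to a terminal.
output_mode() {
  if [ -t 1 ]; then
    echo "table"   # interactive: human-readable formatting
  else
    echo "plain"   # piped or redirected: one bare value per line
  fi
}

output_mode | cat   # stdout is a pipe here, so this prints "plain"
```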

CLI Overview

The unimorph command-line tool provides access to UniMorph morphological data through a set of intuitive subcommands.

Command Structure

unimorph [OPTIONS] <COMMAND> [ARGS]

Global Options

Option	Description
`-v, --verbose`	Enable debug output (`-vv` for trace)
`-q, --quiet`	Suppress non-essential output
`-d, --data-dir <PATH>`	Custom data directory
`-h, --help`	Print help
`-V, --version`	Print version

Commands at a Glance

Command	Alias	Description
download	`dl`	Download a language dataset
list	`ls`	List available/cached languages
inflect	`i`	Look up all forms of a lemma
analyze	`a`	Analyze a surface form (reverse lookup)
search	`s`	Search entries with flexible filtering
stats	`st`	Show dataset statistics
info	`in`	Show detailed info about a language
export	`x`	Export dataset to file
update	`up`	Update cached datasets
features	`f`	Explore morphological features
delete	`rm`	Delete a cached dataset
repair		Repair or reset data store
config	`cfg`	Manage configuration
completions		Generate shell completions

Common Workflows

First-Time Setup

# See what languages are available
unimorph list --available

# Download a language
unimorph download heb

# Set as default (optional)
export UNIMORPH_LANG=heb

Looking Up Words

# All forms of a lemma
unimorph inflect -l heb כתב

# What lemma does this form come from?
unimorph analyze -l heb כתבתי

Searching

# By features
unimorph search -l heb --contains PL,MASC

# By part of speech
unimorph search -l heb --pos V --limit 20

# By lemma pattern
unimorph search -l heb --lemma "כת%"

Data Management

# Check for updates
unimorph update --all --check

# Update a specific language
unimorph update heb

# Export for external use
unimorph export -l heb -o hebrew.tsv

Output Formats

Most commands support multiple output formats:

Flag	Format	Use Case
(default)	Table	Human reading in terminal
`--json`	JSON	Machine parsing, APIs
`--tsv`	TSV	Piping to other tools

Examples

# Pretty table output
unimorph inflect -l heb כתב

# JSON for parsing
unimorph inflect -l heb כתב --json | jq '.[0]'

# TSV for piping
unimorph inflect -l heb כתב --tsv | cut -f2 | sort -u

Piping and Scripting

When output is piped (not a terminal), unimorph automatically uses pipe-friendly formats:

# Get all cached language codes
unimorph list | while read lang; do
  echo "Processing $lang..."
  unimorph stats "$lang"
done

# Export to stdout and filter
unimorph export -l heb -o - | grep "FUT" > future_forms.tsv

# Count forms per lemma
unimorph search -l heb --pos V --tsv --limit 1000 | cut -f1 | sort | uniq -c | sort -rn | head

Error Handling

Commands provide helpful error messages:

$ unimorph inflect כתב
Error: No language specified.

Provide a language code as an argument, or set a default:

  export UNIMORPH_LANG=heb

Or in ~/.config/unimorph/config.toml:

  default_lang = "heb"

Run 'unimorph list --available' to see available languages.

Getting Help

# General help
unimorph --help

# Command-specific help
unimorph inflect --help
unimorph search --help

Commands

This section provides detailed documentation for each unimorph command.

Data Management

download - Download language datasets from UniMorph
list - List available and cached languages
update - Update cached datasets to latest versions
delete - Remove cached datasets
repair - Repair or reset the data store
export - Export datasets to files

Querying

inflect - Look up all inflected forms of a lemma
analyze - Analyze a surface form (reverse lookup)
search - Search with flexible filtering
features - Explore morphological features

Information

stats - Show dataset statistics
info - Show detailed language info

Configuration

config - Manage configuration settings

download

Download a language dataset from UniMorph.

Alias: dl

Synopsis

unimorph download [OPTIONS] [LANG]

Description

Downloads a UniMorph language dataset from GitHub and imports it into the local SQLite database. Datasets are cached locally, so subsequent queries don't require network access.

If the dataset is already cached, this command does nothing unless --force is specified.

Arguments

Argument	Description
`[LANG]`	Language code (ISO 639-3, e.g., `heb`, `ita`, `deu`). Optional if `UNIMORPH_LANG` is set or `default_lang` is configured.

Options

Option	Description
`-f, --force`	Force re-download even if cached
`--json`	Output as JSON
`-q, --quiet`	Suppress progress output

Examples

Basic Download

unimorph download heb

Downloading heb...
Downloaded 33177 entries for heb

Force Re-download

unimorph download heb --force

Quiet Mode

unimorph download heb --quiet

JSON Output

unimorph download heb --json

{
  "language": "heb",
  "entries": 33177,
  "status": "downloaded"
}

Download Multiple Languages

for lang in heb ita deu spa; do
  unimorph download "$lang"
done

With Default Language

export UNIMORPH_LANG=heb
unimorph download  # Downloads Hebrew

Verbose Output

Use -v for detailed import reporting:

unimorph download spa --force -v

This shows:

parsed downloaded data lang=spa filename=["spa"] compression=none from_lfs=false valid_entries=1196224 blank_lines=0 malformed=21
malformed entry lang=spa line=80710 reason=empty form
malformed entry lang=spa line=134234 reason=empty form
...
additional malformed entries not shown lang=spa additional=11

Understanding the Output

Field	Description
`filename`	Source file(s) downloaded
`compression`	Format: `none`, `xz`, `gzip`, or `zip`
`from_lfs`	Whether fetched via Git LFS (large files)
`valid_entries`	Successfully parsed entries
`blank_lines`	Empty lines skipped (not an error)
`malformed`	Entries that failed to parse

Malformed Entry Details

When entries fail to parse, the first 10 are logged with:

Line number: Where in the source file
Reason: Why it failed (e.g., "empty form", "expected at least 3 columns")

Common reasons for malformed entries:

empty form - The inflected form field is blank
empty lemma - The dictionary form field is blank
expected at least 3 columns - Line doesn't have lemma, form, and features

These indicate upstream data quality issues in the UniMorph repository.
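To pre-check a TSV file for the same problems before importing it elsewhere, a rough awk sketch follows. It mirrors the reasons listed above, but the order of the checks and the `check_rows` name are assumptions, not the importer's actual logic.

```shell
# Hypothetical validator: classify each TSV line on stdin using the reasons
# listed above. The order of the checks is an assumption.
check_rows() {
  awk -F'\t' '
    $0 == ""  { blank++; next }
    NF < 3    { bad++; print "line " NR ": expected at least 3 columns"; next }
    $1 == ""  { bad++; print "line " NR ": empty lemma"; next }
    $2 == ""  { bad++; print "line " NR ": empty form"; next }
              { ok++ }
    END { printf "valid_entries=%d blank_lines=%d malformed=%d\n", ok, blank, bad }'
}

# One valid row, one blank line, one row with an empty form.
printf 'hablar\thablo\tV;IND;PRS;1;SG\n\nhablar\t\tV;IND;PRS;2;SG\n' | check_rows
```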

Notes

Language codes are ISO 639-3 (3 lowercase letters)
Use unimorph list --available to see all available languages
Downloads are atomic: partial downloads won't corrupt your data
The first download creates the database at ~/.cache/unimorph/datasets.db
Compressed files: Large datasets (Polish, Czech, Ukrainian, Slovak) use .xz compression - handled automatically
Git LFS: Very large files (like Czech's full MorfFlex dataset) use Git LFS - also handled automatically

list

List available and cached languages.

Alias: ls

Synopsis

unimorph list [OPTIONS]

Description

Lists UniMorph languages. By default, shows cached (downloaded) languages with entry counts. Use --available to fetch the full list of available languages from GitHub.

Options

Option	Description
`--cached`	Show only cached (downloaded) languages
`--available`	Fetch available languages from GitHub
`--refresh`	Refresh the cached list of available languages
`--json`	Output as JSON

Examples

List Cached Languages

unimorph list

Cached languages:
  fin (2737048 entries)
  heb (33177 entries)

Use 'unimorph list --available' to see all available languages.

List All Available Languages

unimorph list --available

Available languages (145 total, 2 cached):

  ady
  afb
  ain
  ...
  heb [cached]
  ...
  zul

Use 'unimorph download <code>' to download a language.

JSON Output

unimorph list --json

["fin", "heb"]

unimorph list --available --json

[
  {"code": "ady", "cached": false},
  {"code": "afb", "cached": false},
  ...
  {"code": "heb", "cached": true},
  ...
]

Refresh Available List

unimorph list --available --refresh

Forces a fresh fetch from GitHub (the list is normally cached for 24 hours).

Pipe-Friendly Output

When piped, outputs just language codes:

unimorph list | head -3

fin
heb

# Download all available languages
unimorph list --available | while read lang; do
  unimorph download "$lang"
done

inflect

Look up all inflected forms of a lemma.

Alias: i

Synopsis

unimorph inflect [OPTIONS] <LEMMA>

Description

Given a lemma (dictionary form), returns all its inflected forms with their morphological features. This is the primary way to see a word's full paradigm.

Arguments

Argument	Description
`<LEMMA>`	The lemma (dictionary form) to look up

Options

Option	Description
`-l, --lang <LANG>`	Language code (ISO 639-3)
`-f, --features <PATTERN>`	Filter by feature pattern (e.g., `V;IND;*;SG`)
`--json`	Output as JSON
`--tsv`	Output as TSV (tab-separated, no headers)

Examples

Basic Lookup

unimorph inflect -l heb כתב

LEMMA        FORM         FEATURES
------------------------------------------------------------
כתב         אכתוב       V;1;SG;FUT
כתב         יכתבו       V;3;PL;FUT;MASC
כתב         יכתוב       V;3;SG;FUT;MASC
כתב         כותב        V;SG;PRS;MASC
כתב         כתב         V;3;SG;PST;MASC
...

29 form(s) found.

Filter by Features

Use wildcards (*) to match any value at a position:

# Only singular forms
unimorph inflect -l heb כתב -f "V;*;SG;*"

# Only past tense
unimorph inflect -l heb כתב -f "V;*;*;PST;*"

JSON Output

unimorph inflect -l heb כתב --json

[
  {
    "lemma": "כתב",
    "form": "אכתוב",
    "features": {
      "raw": "V;1;SG;FUT",
      "features": ["V", "1", "SG", "FUT"]
    }
  },
  ...
]

TSV for Piping

unimorph inflect -l heb כתב --tsv

כתב	אכתוב	V;1;SG;FUT
כתב	יכתבו	V;3;PL;FUT;MASC
כתב	יכתוב	V;3;SG;FUT;MASC
...

Scripting Examples

# Get unique forms only
unimorph inflect -l heb כתב --tsv | cut -f2 | sort -u

# Count forms by tense
unimorph inflect -l heb כתב --tsv | cut -f3 | grep -oE 'PST|PRS|FUT' | sort | uniq -c

# Find forms matching a pattern
unimorph inflect -l spa hablar --tsv | grep "1;SG"

Notes

The lemma must match exactly (case-sensitive for most languages)
Use search with --lemma for partial/wildcard matching
Returns empty results if the lemma doesn't exist in the dataset

analyze

Analyze a surface form (reverse lookup).

Alias: a

Synopsis

unimorph analyze [OPTIONS] <FORM>

Description

Given a surface form (inflected word), returns all possible analyses: the lemma it comes from and its morphological features. This is the reverse of inflect.

A form may have multiple analyses if it's ambiguous (e.g., same spelling for different lemmas or different grammatical analyses).

Arguments

Argument	Description
`<FORM>`	The surface form to analyze

Options

Option	Description
`-l, --lang <LANG>`	Language code (ISO 639-3)
`--json`	Output as JSON
`--tsv`	Output as TSV (tab-separated, no headers)

Examples

Basic Analysis

unimorph analyze -l heb כתבתי

FORM         LEMMA        FEATURES
------------------------------------------------------------
כתבתי       כתב         V;1;SG;PST

1 analysis(es) found.

Ambiguous Forms

Some forms have multiple possible analyses:

unimorph analyze -l heb כתבו

FORM         LEMMA        FEATURES
------------------------------------------------------------
כתבו        כתב         V;3;PL;PST
כתבו        כתב         V;2;PL;IMP;MASC

2 analysis(es) found.

JSON Output

unimorph analyze -l heb כתבתי --json

[
  {
    "lemma": "כתב",
    "form": "כתבתי",
    "features": {
      "raw": "V;1;SG;PST",
      "features": ["V", "1", "SG", "PST"]
    }
  }
]

TSV for Piping

unimorph analyze -l heb כתבתי --tsv

כתבתי	כתב	V;1;SG;PST

Form Not Found

unimorph analyze -l heb xyz

No analyses found for 'xyz'.

The form may not exist in the dataset, or it could be:
  - A proper noun or foreign word
  - A misspelling
  - A rare or archaic form

Scripting Examples

# Analyze words from a file
while IFS= read -r word; do
  echo "=== $word ==="
  unimorph analyze -l heb "$word"
done < words.txt

# Get just the lemma
unimorph analyze -l heb כתבתי --tsv | cut -f2

# Check if a word exists
if unimorph analyze -l heb כתבתי --tsv | grep -q .; then
  echo "Found"
fi

Notes

Analysis is case-sensitive for most languages
Forms must match exactly (no fuzzy matching)
Use search with --form for pattern matching

search

Search entries with flexible filtering.

Alias: s

Synopsis

unimorph search [OPTIONS]

Description

Search the dataset with flexible filtering by lemma, form, features, part of speech, and more. Supports wildcards and multiple filter combinations.

Options

Option	Description
`-l, --lang <LANG>`	Language code (ISO 639-3)
`--lemma <PATTERN>`	Filter by lemma (supports SQL LIKE wildcards: `%` and `_`)
`--form <PATTERN>`	Filter by form (supports SQL LIKE wildcards)
`-f, --features <PATTERN>`	Filter by feature pattern (e.g., `V;IND;*;1;*`)
`-c, --contains <FEATURES>`	Filter by features contained (comma-separated, position-independent)
`--pos <POS>`	Filter by part of speech (e.g., `V`, `N`, `ADJ`)
`--limit <N>`	Limit number of results (default: 100)
`--offset <N>`	Skip first N results
`--count`	Just show count of matching entries
`--json`	Output as JSON
`--tsv`	Output as TSV

Examples

Search by Lemma Pattern

# Lemmas starting with "כת"
unimorph search -l heb --lemma "כת%"

# Lemmas containing "בר"
unimorph search -l heb --lemma "%בר%"

# Exact 4-letter lemmas
unimorph search -l heb --lemma "____"

Search by Form Pattern

# Forms ending with "ים"
unimorph search -l heb --form "%ים"

Filter by Features (Position-Dependent)

Use semicolon-separated patterns with * as wildcard:

# First person singular verbs
unimorph search -l heb -f "V;1;SG;*"

# Past tense forms
unimorph search -l heb -f "V;*;*;PST;*"

Filter by Features (Position-Independent)

Use --contains for features that can be at any position:

# Plural masculine forms (regardless of position)
unimorph search -l heb --contains PL,MASC

# Future tense first person
unimorph search -l heb --contains FUT,1
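The same position-independent test can be applied to TSV output you already have on disk. The awk sketch below is an illustration only (`contains_all` is a made-up name, and it says nothing about how the CLI implements `--contains`); it keeps a row when every requested tag appears somewhere in the feature bundle.

```shell
# Hypothetical filter: keep rows whose feature bundle (column 3) contains
# every tag in the comma-separated list $1, at any position.
contains_all() {
  awk -F'\t' -v want="$1" '
    BEGIN { n = split(want, need, ",") }
    {
      split("", have)                      # portable way to clear an array
      split($3, tags, ";")
      for (i in tags) have[tags[i]] = 1
      for (i = 1; i <= n; i++) if (!(need[i] in have)) next
      print
    }'
}

# Only the first row carries both PL and MASC.
printf 'כתב\tיכתבו\tV;3;PL;FUT;MASC\nכתב\tכותב\tV;SG;PRS;MASC\n' | contains_all PL,MASC
```

Because tags are compared whole, `V` does not accidentally match `V.MSDR`, which a plain `grep V` would.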

Filter by Part of Speech

# Only verbs
unimorph search -l heb --pos V

# Only nouns
unimorph search -l heb --pos N

Combine Filters

# Verbs with plural masculine future forms
unimorph search -l heb --pos V --contains PL,MASC,FUT

# Lemmas starting with "א" that are verbs
unimorph search -l heb --lemma "א%" --pos V

Pagination

# First 20 results
unimorph search -l heb --pos V --limit 20

# Results 21-40
unimorph search -l heb --pos V --limit 20 --offset 20
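For scripted paging, the offset is simply (page - 1) × page size. A small helper along these lines (the name `page_args` is hypothetical) prints the flag pair for a 1-based page number:

```shell
# Hypothetical helper: print the --limit/--offset pair for a 1-based page.
# $1 = page number, $2 = page size
page_args() {
  page=$1
  size=$2
  offset=$(( (page - 1) * size ))
  printf '%s %d %s %d\n' --limit "$size" --offset "$offset"
}

page_args 2 20
# prints: --limit 20 --offset 20
# Possible use: unimorph search -l heb --pos V $(page_args 2 20)
```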

Count Only

unimorph search -l heb --pos V --count

28663 entries match.

Output Formats

# JSON
unimorph search -l heb --pos V --limit 5 --json

# TSV for piping
unimorph search -l heb --pos V --limit 5 --tsv

Scripting Examples

# Get unique lemmas for a part of speech
unimorph search -l heb --pos V --limit 10000 --tsv | cut -f1 | sort -u

# Count entries per lemma
unimorph search -l heb --pos V --limit 10000 --tsv | cut -f1 | sort | uniq -c | sort -rn | head

# Export filtered subset (raise --limit; the default is 100 results)
unimorph search -l heb --contains FUT --limit 100000 --tsv > future_forms.tsv

Wildcards Reference

SQL LIKE Wildcards (for `--lemma` and `--form`)

Pattern	Matches
`%`	Any sequence of characters
`_`	Any single character
`abc%`	Starts with "abc"
`%abc`	Ends with "abc"
`%abc%`	Contains "abc"
`a_c`	"a" + any char + "c"

Feature Pattern Wildcards (for `-f`)

Pattern	Matches
`*`	Any value at that position
`V;*;SG;*`	Verb, any person, singular, any tense
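To make the position-dependent matching concrete, here is a rough sketch of one way such a matcher could work. It is an assumption-laden illustration, not the CLI's code: `features_match` is a hypothetical name, and the rule that pattern and bundle must have the same number of positions is assumed (the real tool may handle length mismatches differently).

```shell
# Hypothetical matcher: succeed when feature bundle $2 matches pattern $1.
# ASSUMPTION: pattern and bundle must have the same number of positions.
features_match() {
  awk -v pat="$1" -v tag="$2" 'BEGIN {
    np = split(pat, p, ";")
    nt = split(tag, t, ";")
    if (np != nt) exit 1
    for (i = 1; i <= np; i++)
      # Appending "" forces a string comparison even for numeric tags.
      if (p[i] != "*" && (p[i] "") != (t[i] "")) exit 1
    exit 0
  }'
}

features_match 'V;*;SG;*' 'V;1;SG;FUT' && echo "match"
features_match 'V;*;SG;*' 'V;SG;PRS;MASC' || echo "no match"
```

The second call fails because `SG` sits at position 1 rather than position 2, which is exactly the case where `--contains` is the better fit.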

stats

Show dataset statistics.

Alias: st

Synopsis

unimorph stats [OPTIONS] [LANG]

Description

Displays statistics about a downloaded language dataset, including entry counts, unique lemmas, unique forms, and unique feature combinations.

Arguments

Argument	Description
`[LANG]`	Language code (ISO 639-3). Optional if default is configured.

Options

Option	Description
`--json`	Output as JSON

Examples

Basic Statistics

unimorph stats heb

Statistics for heb:
  Total entries:    33177
  Unique lemmas:    1176
  Unique forms:     27286
  Unique features:  55
  Imported at:      2024-01-15 10:30:00 UTC

JSON Output

unimorph stats heb --json

{
  "total_entries": 33177,
  "unique_lemmas": 1176,
  "unique_forms": 27286,
  "unique_features": 55
}

Compare Languages

for lang in heb ita fin deu; do
  echo "=== $lang ==="
  unimorph stats "$lang"
  echo
done

Scripting

# Get entry count
unimorph stats heb --json | jq '.total_entries'

# Compare sizes
unimorph list | while read lang; do
  count=$(unimorph stats "$lang" --json | jq '.total_entries')
  echo "$lang: $count"
done | sort -t: -k2 -rn

Understanding the Statistics

Metric	Description
Total entries	Number of (lemma, form, features) triples
Unique lemmas	Number of distinct dictionary forms
Unique forms	Number of distinct surface forms
Unique features	Number of distinct feature bundle combinations
Imported at	When the dataset was downloaded
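The four counts can be reproduced from any TSV export, which is a handy cross-check. The sketch below is not how unimorph computes them (it queries SQLite); `tsv_stats` is a hypothetical helper that counts distinct values per column with awk.

```shell
# Hypothetical helper: compute the four counts from TSV rows on stdin.
tsv_stats() {
  awk -F'\t' '
    { total++; lemmas[$1] = 1; forms[$2] = 1; feats[$3] = 1 }
    END {
      nl = 0; for (k in lemmas) nl++
      nf = 0; for (k in forms) nf++
      nb = 0; for (k in feats) nb++
      printf "total_entries\t%d\n", total
      printf "unique_lemmas\t%d\n", nl
      printf "unique_forms\t%d\n", nf
      printf "unique_features\t%d\n", nb
    }'
}

printf 'hablar\thablo\tV;IND;PRS;1;SG\nhablar\thablas\tV;IND;PRS;2;SG\n' | tsv_stats
# Possible use: unimorph export -l heb -o - 2>/dev/null | tsv_stats
```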

info

Show detailed info about a cached language.

Alias: in

Synopsis

unimorph info [OPTIONS] [LANG]

Description

Displays detailed information about a downloaded language dataset, including source URL, local and remote commit information, update status, and statistics.

Arguments

Argument	Description
`[LANG]`	Language code (ISO 639-3). Optional if default is configured.

Options

Option	Description
`--json`	Output as JSON

Examples

Basic Info

unimorph info heb

Language: heb
Source: https://github.com/unimorph/heb

Local imported:  2024-01-15 10:30:00 UTC
Local commit:    b2bff12
Remote commit:   b2bff12 (2023-01-09)

Status: Up to date

Statistics:
  Total entries:   33177
  Unique lemmas:   1176
  Unique forms:    27286
  Unique features: 55

Update Available

unimorph info heb

Language: heb
Source: https://github.com/unimorph/heb

Local imported:  2024-01-15 10:30:00 UTC
Local commit:    b2bff12
Remote commit:   c4d8e23 (2024-02-01)

Status: Update available

Statistics:
  Total entries:   33177
  Unique lemmas:   1176
  Unique forms:    27286
  Unique features: 55

JSON Output

unimorph info heb --json

{
  "language": "heb",
  "source": "https://github.com/unimorph/heb",
  "local_commit": "b2bff12",
  "remote_commit": "c4d8e23",
  "imported_at": "2024-01-15T10:30:00Z",
  "update_available": true,
  "stats": {
    "total_entries": 33177,
    "unique_lemmas": 1176,
    "unique_forms": 27286,
    "unique_features": 55
  }
}

export

Export a language dataset to file.

Alias: x

Synopsis

unimorph export [OPTIONS]

Description

Exports a downloaded language dataset to a file in TSV or JSONL format. Useful for integrating with other tools, creating backups, or processing data with external programs.

Options

Option	Description
`-l, --lang <LANG>`	Language code (ISO 639-3)
`-o, --output <PATH>`	Output file path (use `-` for stdout)
`-F, --format <FORMAT>`	Output format: `tsv` or `jsonl` (auto-detected from extension)

Examples

Export to TSV

unimorph export -l heb -o hebrew.tsv

Exported 33177 entries to hebrew.tsv

Export to JSONL

unimorph export -l heb -o hebrew.jsonl

Or explicitly specify format:

unimorph export -l heb -o hebrew.json --format jsonl

Export to Stdout

Use -o - to write to stdout:

unimorph export -l heb -o - --format tsv | head -5

איבד	אאבד	V;1;SG;FUT
איבזר	אאבזר	V;1;SG;FUT
איבטח	אאבטח	V;1;SG;FUT
האביס	אאביס	V;1;SG;FUT
אבל	אאבל	V;1;SG;FUT

The status message goes to stderr, so piping works correctly:

unimorph export -l heb -o - 2>/dev/null | wc -l
33177

Scripting Examples

# Filter exported data
unimorph export -l heb -o - | grep "FUT" > future_forms.tsv

# Export and compress
unimorph export -l heb -o - | gzip > hebrew.tsv.gz

# Export multiple languages
for lang in heb ita deu; do
  unimorph export -l "$lang" -o "${lang}.tsv"
done

# Convert to CSV
unimorph export -l heb -o - | tr '\t' ',' > hebrew.csv

Output Formats

TSV (Tab-Separated Values)

lemma<TAB>form<TAB>features

Example:

hablar	hablo	V;IND;PRS;1;SG
hablar	hablas	V;IND;PRS;2;SG

JSONL (JSON Lines)

One JSON object per line:

{"lemma":"hablar","form":"hablo","features":"V;IND;PRS;1;SG"}
{"lemma":"hablar","form":"hablas","features":"V;IND;PRS;2;SG"}
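If you only have the TSV form, the JSONL layout shown above is easy to reproduce. The following awk sketch is an illustration, not the exporter itself; `tsv_to_jsonl` is a hypothetical name, and it escapes only backslashes and double quotes on the assumption that the data contains no other characters needing JSON escapes.

```shell
# Hypothetical converter: TSV rows on stdin -> one JSON object per line.
# Only backslashes and double quotes are escaped (assumed sufficient here).
tsv_to_jsonl() {
  awk -F'\t' '
    function esc(s,   out, i, c) {
      out = ""
      for (i = 1; i <= length(s); i++) {
        c = substr(s, i, 1)
        if (c == "\\" || c == "\"") out = out "\\"
        out = out c
      }
      return out
    }
    { printf "{\"lemma\":\"%s\",\"form\":\"%s\",\"features\":\"%s\"}\n", esc($1), esc($2), esc($3) }'
}

printf 'hablar\thablo\tV;IND;PRS;1;SG\n' | tsv_to_jsonl
```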

Notes

Format is auto-detected from file extension (.tsv or .jsonl)
Use --format to override auto-detection
Stdout export writes status to stderr to avoid polluting data

update

Update cached language datasets.

Alias: up

Synopsis

unimorph update [OPTIONS] [LANG]

Description

Checks for and downloads updates to cached language datasets. Can update a single language or all cached languages at once.

Arguments

Argument	Description
`[LANG]`	Language code to update. Omit when using `--all`.

Options

Option	Description
`--all`	Update all cached languages
`--check`	Check for updates without downloading
`--json`	Output as JSON

Examples

Check for Updates

unimorph update heb --check

Checking for updates...

  heb - update available

Or if up to date:

Checking for updates...

  heb - up to date

Update a Single Language

unimorph update heb

Updating heb...
Updated heb: 33177 -> 33250 entries

Check All Languages

unimorph update --all --check

Checking for updates...

  fin - up to date
  heb - update available
  ita - up to date

1 update(s) available.

Update All Languages

unimorph update --all

Updating all cached languages...

  fin - up to date
  heb - updated (33177 -> 33250 entries)
  ita - up to date

1 language(s) updated.

JSON Output

unimorph update --all --check --json

{
  "languages": [
    {"code": "fin", "update_available": false},
    {"code": "heb", "update_available": true},
    {"code": "ita", "update_available": false}
  ],
  "updates_available": 1
}

Scripting

# Check and update only if needed
if unimorph update heb --check --json | jq -e '.update_available' > /dev/null; then
  unimorph update heb
fi

features

Explore morphological features in a language.

Alias: f

Synopsis

unimorph features [OPTIONS]

Description

Explore the morphological features used in a language dataset. View unique feature values, their frequencies, search for entries with specific features, or analyze feature positions.

Options

Option	Description
`-l, --lang <LANG>`	Language code (ISO 639-3)
`--list`	List all unique feature values
`--stats`	Show feature value counts (histogram)
`--search <FEATURE>`	Search for entries containing a specific feature
`--position <N>`	Show values at a specific position (0-indexed)
`--limit <N>`	Limit number of results (default: 50)
`--json`	Output as JSON

Examples

Feature Structure Overview

unimorph features -l heb

Feature structure for heb:

  Position 0: 3 unique values (e.g., V, N, V.MSDR)
  Position 1: 6 unique values (e.g., 2, 3, 1)
  Position 2: 6 unique values (e.g., SG, PL, PRS)
  Position 3: 11 unique values (e.g., FUT, PST, IMP)
  Position 4: 2 unique values (e.g., FEM, MASC)

Use --list for all unique values, --stats for counts, --search <FEATURE> to find entries.
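A similar per-position summary can be derived from exported TSV. This is a sketch under the assumption that positions are simply the 0-indexed slots of the semicolon-separated bundle; `position_summary` is a hypothetical helper, not a unimorph command.

```shell
# Hypothetical helper: count distinct tags at each 0-indexed position of the
# feature bundle (column 3) for TSV rows on stdin.
position_summary() {
  awk -F'\t' '
    {
      n = split($3, tags, ";")
      if (n > max) max = n
      for (i = 1; i <= n; i++)
        if (!((i, tags[i]) in seen)) { seen[i, tags[i]] = 1; uniq[i]++ }
    }
    END {
      for (i = 1; i <= max; i++)
        printf "Position %d: %d unique values\n", i - 1, uniq[i]
    }'
}

printf 'כתב\tאכתוב\tV;1;SG;FUT\nכתב\tיכתבו\tV;3;PL;FUT;MASC\n' | position_summary
```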

List All Features

unimorph features -l heb --list

Unique features in heb:

  1
  2
  3
  DEF
  FEM
  FUT
  IMP
  MASC
  N
  ...

24 unique feature values.

Feature Statistics

unimorph features -l heb --stats

Feature statistics for heb:

FEATURE              COUNT
----------------------------------------
V                    28663
SG                   16226
PL                   15158
FEM                  12384
MASC                 12384
2                    12108
FUT                  10400
PST                  9378
3                    7286
1                    4164
... and 14 more

Search by Feature

unimorph features -l heb --search FUT --limit 5

Entries with feature 'FUT':

LEMMA                FORM                 FEATURES
------------------------------------------------------------
איבד                 אאבד                 V;1;SG;FUT
איבזר                אאבזר                V;1;SG;FUT
איבטח                אאבטח                V;1;SG;FUT
האביס                אאביס                V;1;SG;FUT
אבל                  אאבל                 V;1;SG;FUT

Showing 5 of 10400 results.

Analyze Feature Position

unimorph features -l heb --position 0

Feature values at position 0 in heb:

VALUE                COUNT
----------------------------------------
V                    28663
N                    3338
V.MSDR               1176
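What --stats and --position report can be pictured as simple counting over feature bundles. The sketch below is only an illustration using the standard library; the function names `feature_counts` and `position_counts` are hypothetical and are not part of unimorph, and the tool itself may compute these numbers differently (for example in SQL).

```rust
use std::collections::BTreeMap;

/// Counts how often each feature value occurs across all bundles.
fn feature_counts(bundles: &[&str]) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for bundle in bundles {
        for feature in bundle.split(';') {
            *counts.entry(feature.to_string()).or_insert(0) += 1;
        }
    }
    counts
}

/// Counts the values found at one 0-indexed position.
/// Bundles with no feature at that position are skipped.
fn position_counts(bundles: &[&str], position: usize) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for bundle in bundles {
        if let Some(value) = bundle.split(';').nth(position) {
            *counts.entry(value.to_string()).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    // Made-up bundles for illustration, not real dataset counts.
    let bundles = ["V;1;SG;FUT", "V;2;SG;FUT", "N;SG"];
    println!("{:?}", feature_counts(&bundles));
    println!("{:?}", position_counts(&bundles, 0));
}
```

Note that a value such as SG can appear at different positions in different bundles, which is why the per-position view and the overall histogram can disagree.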

JSON Output

unimorph features -l heb --stats --json

{
  "V": 28663,
  "SG": 16226,
  "PL": 15158,
  ...
}

Pipe-Friendly Output

When output is piped, the command prints a clean format without headers or decoration:

# Get just feature names
unimorph features -l heb --list | head -5

1
2
3
DEF
FEM

# Feature counts as TSV
unimorph features -l heb --stats | head -5

V	28663
SG	16226
PL	15158
FEM	12384
MASC	12384

Use Cases

Understanding a language: See what features are used
Finding examples: Search for entries with specific features
Data exploration: Analyze feature distribution
Building queries: Discover feature names for search filters

delete

Delete a cached language dataset.

Alias: rm

Synopsis

unimorph delete [OPTIONS] [LANG]

Description

Removes a downloaded language dataset from the local cache. The data can be re-downloaded later with unimorph download.

Arguments

Argument	Description
`[LANG]`	Language code (ISO 639-3). Optional if a default language is configured.

Options

Option	Description
`--json`	Output as JSON

Examples

Delete a Language

unimorph delete heb

Deleted heb (33177 entries removed)

JSON Output

unimorph delete heb --json

{
  "language": "heb",
  "entries_removed": 33177,
  "status": "deleted"
}

Delete Multiple Languages

for lang in heb ita deu; do
  unimorph delete "$lang"
done

Notes

This only removes the data from the local cache
Statistics and metadata are also removed
Re-download anytime with unimorph download
Use unimorph repair --clear-data to delete all languages at once

repair

Repair or reset the local data store.

Synopsis

unimorph repair [OPTIONS]

Description

Utility command for troubleshooting and resetting the local data store. Can clear the API response cache or all downloaded datasets.

Options

Option	Description
`--clear-cache`	Clear cached API responses
`--clear-data`	Clear all downloaded datasets (requires re-download)
`--json`	Output as JSON

Examples

Clear API Cache

Clears the cached list of available languages (normally cached for 24 hours):

unimorph repair --clear-cache

Cleared API cache

Clear All Data

Removes all downloaded language datasets:

unimorph repair --clear-data

Cleared all data (5 languages removed)

Clear Both

unimorph repair --clear-cache --clear-data

JSON Output

unimorph repair --clear-data --json

{
  "cache_cleared": false,
  "data_cleared": true,
  "languages_removed": 5
}

Use Cases

Corrupted data: If queries return unexpected results
Stale cache: If available language list seems outdated
Disk space: Remove all data to free space
Fresh start: Reset everything to initial state

Notes

--clear-cache only removes API response cache, not datasets
--clear-data removes all downloaded languages
Data can be re-downloaded with unimorph download

sample

Randomly sample entries from a language dataset.

Alias: rand

Synopsis

unimorph sample [OPTIONS] <N>

Description

Samples random entries from a downloaded language dataset. Useful for exploring data, creating test sets, or getting a quick overview of a language's morphology.

Arguments

Argument	Description
`<N>`	Number of entries to sample

Options

Option	Description
`-l, --lang <LANG>`	Language code (ISO 639-3)
`-s, --seed <SEED>`	Seed for reproducible sampling
`--by-lemma`	Sample complete paradigms instead of random entries
`--json`	Output as JSON
`--tsv`	Output as TSV (tab-separated, no headers)

Examples

Random Entries

unimorph sample -l spa 5

LEMMA       FORM         FEATURES
------------------------------------------------------------
tapiar      tapiemos     V;SBJV;PRS;1;PL
apilar      apilando     V;V.CVB;PRS
hablar      hablaste     V;IND;PST;PFV;2;SG;INFM
comer       comieron     V;IND;PST;PFV;3;PL
vivir       viviremos    V;IND;FUT;1;PL

5 sampled entry(ies).

Sample Complete Paradigms

Use --by-lemma to get all forms of randomly selected lemmas:

unimorph sample -l spa 2 --by-lemma

This returns complete paradigms for 2 random lemmas, showing all their inflected forms.

Reproducible Sampling

Use --seed for reproducible results:

unimorph sample -l spa 5 --seed 42

Running with the same seed always returns the same entries.
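Why a seed makes sampling repeatable can be sketched with a small deterministic generator. This is an assumption-laden illustration, not the algorithm unimorph uses: `SplitMix64` and `sample` are hypothetical names, the generator is the well-known SplitMix64 mixer, and the index selection uses a plain modulo (which has a slight bias that does not matter for the illustration).

```rust
/// Tiny deterministic generator (SplitMix64); a stand-in for whatever RNG the tool uses.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Picks up to `n` distinct items using a partial Fisher-Yates shuffle.
/// The same `seed` and the same input always give the same output.
fn sample<T: Clone>(items: &[T], n: usize, seed: u64) -> Vec<T> {
    let mut pool: Vec<T> = items.to_vec();
    let mut rng = SplitMix64(seed);
    let take = n.min(pool.len());
    for i in 0..take {
        // Choose a random index from the not-yet-fixed tail and swap it into place.
        let remaining = (pool.len() - i) as u64;
        let j = i + (rng.next_u64() % remaining) as usize;
        pool.swap(i, j);
    }
    pool.truncate(take);
    pool
}

fn main() {
    let lemmas = ["hablar", "comer", "vivir", "ser", "estar", "tener"];
    let first = sample(&lemmas, 3, 42);
    let second = sample(&lemmas, 3, 42);
    println!("{:?}", first);
    println!("same seed, same result: {}", first == second);
}
```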

JSON Output

unimorph sample -l spa 3 --json

[
  {
    "lemma": "hablar",
    "form": "hablamos",
    "features": {
      "raw": "V;IND;PRS;1;PL",
      "features": ["V", "IND", "PRS", "1", "PL"]
    }
  },
  ...
]

TSV for Scripting

unimorph sample -l spa 10 --tsv > sample.tsv

Scripting Examples

# Create a test set
unimorph sample -l spa 100 --seed 123 --tsv > test_set.tsv

# Sample paradigms for flashcard generation
unimorph sample -l spa 10 --by-lemma --json > flashcards.json

# Get random verbs only (features are the third TSV column)
unimorph sample -l spa 50 --tsv | awk -F'\t' '$3 ~ /^V;/' | head -10

Notes

Without --seed, results differ on each run
--by-lemma returns more entries than N (all forms of N lemmas)
Large N values may take longer for big datasets

config

Manage configuration.

Alias: cfg

Synopsis

unimorph config <COMMAND>

Subcommands

Command	Description
`show`	Show current configuration
`init`	Initialize a new config file
`path`	Show the config file path

config show

Display the current configuration, including both config file settings and defaults.

unimorph config show

Configuration

  Path: /home/user/.config/unimorph/config.toml
  Status: loaded

Current Settings

  default_lang: heb
  data_dir: (default)
  output_format: (default: table)
  no_color: (not set)

JSON Output

unimorph config show --json

{
  "path": "/home/user/.config/unimorph/config.toml",
  "exists": true,
  "default_lang": "heb",
  "data_dir": null,
  "output_format": null,
  "no_color": null
}

config init

Create a new config file with example content.

unimorph config init

Created config file at /home/user/.config/unimorph/config.toml

Force Overwrite

unimorph config init --force

Overwrites an existing config file.

JSON Output

unimorph config init --json

{
  "path": "/home/user/.config/unimorph/config.toml",
  "created": true
}

config path

Show the config file path.

unimorph config path

/home/user/.config/unimorph/config.toml

JSON Output

unimorph config path --json

{
  "path": "/home/user/.config/unimorph/config.toml"
}

Config File Format

The config file uses TOML format:

# Default language for commands
default_lang = "heb"

# Custom data directory
# data_dir = "/custom/path"

# Default output format: "table", "json", or "tsv"
# output_format = "table"

# Disable colored output
# no_color = true

# Language aliases
[languages]
hebrew = "heb"
spanish = "spa"

completions

Generate shell completions for your shell.

Synopsis

unimorph completions <SHELL>

Description

Generates shell completion scripts that enable tab-completion for unimorph commands, options, and arguments.

Arguments

Argument	Description
`<SHELL>`	Shell to generate completions for: `bash`, `zsh`, `fish`, `elvish`, `powershell`

Installation

Bash

# Add to ~/.bashrc
source <(unimorph completions bash)

# Or save to a file
unimorph completions bash > ~/.local/share/bash-completion/completions/unimorph

Zsh

# Add to ~/.zshrc (after compinit, since the generated script calls compdef)
source <(unimorph completions zsh)

# Or save to fpath
unimorph completions zsh > ~/.zfunc/_unimorph
# Then add to ~/.zshrc before compinit: fpath=(~/.zfunc $fpath)

Fish

unimorph completions fish > ~/.config/fish/completions/unimorph.fish

PowerShell

# Add to your PowerShell profile
unimorph completions powershell | Out-String | Invoke-Expression

# Or save to a file and source it
unimorph completions powershell > unimorph.ps1

Elvish

unimorph completions elvish > ~/.elvish/lib/unimorph.elv
# Then add to ~/.elvish/rc.elv: use unimorph

Examples

After installation, you can use tab completion:

# Complete commands
unimorph inf<TAB>  # completes to 'inflect'

# Complete options
unimorph inflect --<TAB>  # shows available options

# Complete language codes (if supported by your shell)
unimorph inflect -l <TAB>

Notes

Restart your shell or source your config file after installation
Some completions may require a downloaded language list to work

Library Overview

The unimorph-core crate provides a Rust library for working with UniMorph morphological data. Use it to integrate morphological lookups into your own applications.

Installation

Add to your Cargo.toml:

[dependencies]
unimorph-core = "0.1"

Quick Example

use unimorph_core::{Repository, LangCode};

fn main() -> anyhow::Result<()> {
    // Create a repository (uses default cache directory)
    let repo = Repository::open_default()?;
    
    // Parse language code
    let lang: LangCode = "heb".parse()?;
    
    // Look up all forms of a lemma
    let forms = repo.store().inflect(&lang, "כתב")?;
    for entry in forms {
        println!("{} -> {} ({})", entry.lemma, entry.form, entry.features);
    }
    
    // Analyze a surface form
    let analyses = repo.store().analyze(&lang, "כתבתי")?;
    for entry in analyses {
        println!("{} <- {} ({})", entry.form, entry.lemma, entry.features);
    }
    
    Ok(())
}

Core Components

Repository

The Repository manages data downloads and caching:

#![allow(unused)]
fn main() {
use unimorph_core::Repository;

// Default location (~/.cache/unimorph)
let repo = Repository::open_default()?;

// Custom location
let repo = Repository::open("/custom/path")?;

// Download a language
repo.download("heb").await?;

// List cached languages
let languages = repo.cached_languages()?;
}

Store

The Store provides the query interface:

#![allow(unused)]
fn main() {
let store = repo.store();

// Inflect: lemma -> forms
let forms = store.inflect("heb", "כתב")?;

// Analyze: form -> lemmas
let analyses = store.analyze("heb", "כתבתי")?;

// Statistics
let stats = store.stats("heb")?;
}

Query Builder

Flexible searching with the query builder:

#![allow(unused)]
fn main() {
let results = store.query("heb")
    .lemma("כת%")           // LIKE pattern
    .pos("V")                // Part of speech
    .features_contain(&["FUT", "1"])  // Has these features
    .limit(100)
    .execute()?;
}

Types

Core data types:

#![allow(unused)]
fn main() {
use unimorph_core::{Entry, LangCode, FeatureBundle};

// Language codes (validated)
let lang: LangCode = "heb".parse()?;

// Entries contain lemma, form, features
let entry = Entry {
    lemma: "כתב".to_string(),
    form: "כתבתי".to_string(),
    features: "V;1;SG;PST".parse()?,
};

// Feature bundles support pattern matching
let features: FeatureBundle = "V;1;SG;PST".parse()?;
assert!(features.matches("V;*;SG;*"));
assert!(features.contains("PST"));
}

Error Handling

The library uses a custom Error type:

use unimorph_core::{Error, Repository, Result};

fn example() -> Result<()> {
    let repo = Repository::open_default()?;

    match repo.store().inflect("heb", "xyz") {
        Ok(entries) => println!("Found {} entries", entries.len()),
        Err(Error::NotFound(msg)) => println!("Not found: {}", msg),
        // Propagate any other error to the caller
        Err(e) => return Err(e),
    }

    Ok(())
}

Feature Flags

Flag	Description
`default`	Standard features
`parquet`	Parquet export support

[dependencies]
unimorph-core = { version = "0.1", features = ["parquet"] }

Next Steps

Types - Core data types
Store - Query interface
Repository - Data management
Query Builder - Advanced searching

Types

Core data types in unimorph-core.

LangCode

A validated ISO 639-3 language code (3 lowercase ASCII letters).

#![allow(unused)]
fn main() {
use unimorph_core::LangCode;

// Parse from string
let lang: LangCode = "heb".parse()?;

// Validation happens at parse time
assert!("HEB".parse::<LangCode>().is_err());  // Must be lowercase
assert!("he".parse::<LangCode>().is_err());   // Must be 3 chars
assert!("h3b".parse::<LangCode>().is_err());  // Must be letters

// Convert to string
let s: &str = lang.as_ref();
let s: String = lang.to_string();
}
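The validation rule stated above (three lowercase ASCII letters) is small enough to sketch with the standard library. The function `is_valid_lang_code` is a hypothetical stand-in written for this illustration; it is not the crate's implementation, which may apply further checks.

```rust
/// Hypothetical stand-in for the rule: exactly three lowercase ASCII letters.
fn is_valid_lang_code(code: &str) -> bool {
    code.len() == 3 && code.bytes().all(|b| b.is_ascii_lowercase())
}

fn main() {
    for candidate in ["heb", "HEB", "he", "h3b"] {
        println!("{candidate}: {}", is_valid_lang_code(candidate));
    }
}
```

Checking the byte length first also rejects non-ASCII input such as Hebrew letters, since every byte must then be in the range a to z.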

Entry

A single morphological entry with lemma, form, and features.

#![allow(unused)]
fn main() {
use unimorph_core::Entry;

// Entries are returned from queries
let entries = store.inflect("heb", "כתב")?;
for entry in entries {
    println!("Lemma: {}", entry.lemma);
    println!("Form: {}", entry.form);
    println!("Features: {}", entry.features);
    println!("Features (raw): {}", entry.features.raw());
    println!("Features (list): {:?}", entry.features.as_slice());
}

// Parse from TSV line
let entry = Entry::parse_line("כתב\tכתבתי\tV;1;SG;PST", 1)?;

// Serialize to JSON
let json = serde_json::to_string(&entry)?;
}

Fields

Field	Type	Description
`lemma`	`String`	Dictionary form
`form`	`String`	Inflected surface form
`features`	`FeatureBundle`	Morphological features

FeatureBundle

A semicolon-separated bundle of morphological features.

#![allow(unused)]
fn main() {
use unimorph_core::FeatureBundle;

// Parse from string
let features: FeatureBundle = "V;1;SG;PST".parse()?;

// Access individual features
assert_eq!(features.as_slice(), &["V", "1", "SG", "PST"]);
assert_eq!(features.raw(), "V;1;SG;PST");
assert_eq!(features.len(), 4);

// Check if contains a feature (position-independent)
assert!(features.contains("PST"));
assert!(features.contains("V"));
assert!(!features.contains("FUT"));

// Check if contains all features
assert!(features.contains_all(&["V", "PST"]));

// Pattern matching with wildcards
assert!(features.matches("V;*;SG;*"));
assert!(features.matches("V;1;*;PST"));
assert!(!features.matches("N;*;*;*"));

// Display
println!("{}", features);  // "V;1;SG;PST"
}

Pattern Matching

The matches method supports positional pattern matching:

Pattern	Description
`V;1;SG;PST`	Exact match
`V;*;SG;*`	Wildcard at positions 1 and 3
`*;*;*;PST`	Only check position 3

Note: The pattern must have the same number of positions as the bundle.
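Positional matching as described here can be sketched in a few lines. The function `matches_pattern` below is a hypothetical illustration of the semantics (a `*` accepts any single feature at its position, and the lengths must agree); it is not the crate's own code.

```rust
/// Positional wildcard match: `*` accepts any single feature at that position.
/// Returns false when the pattern and bundle have different numbers of positions.
fn matches_pattern(bundle: &str, pattern: &str) -> bool {
    let features: Vec<&str> = bundle.split(';').collect();
    let slots: Vec<&str> = pattern.split(';').collect();
    features.len() == slots.len()
        && features
            .iter()
            .zip(slots.iter())
            .all(|(feature, slot)| *slot == "*" || slot == feature)
}

fn main() {
    let bundle = "V;1;SG;PST";
    for pattern in ["V;*;SG;*", "*;*;*;PST", "N;*;*;*", "V;*;SG"] {
        println!("{pattern}: {}", matches_pattern(bundle, pattern));
    }
}
```

The last pattern in `main` has only three positions, so it does not match the four-position bundle even though each listed position agrees.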

Validation

Feature bundles cannot be empty
Individual features cannot be empty
Features are separated by semicolons

#![allow(unused)]
fn main() {
assert!("".parse::<FeatureBundle>().is_err());      // Empty
assert!("V;;SG".parse::<FeatureBundle>().is_err()); // Empty feature
}
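The validation rules above amount to splitting on semicolons and rejecting empty pieces. A minimal sketch, assuming a hypothetical `parse_bundle` helper and plain `String` error messages (the crate's real error type is not shown in this guide):

```rust
/// Splits a raw bundle into features, rejecting empty bundles and empty features.
fn parse_bundle(raw: &str) -> Result<Vec<String>, String> {
    if raw.is_empty() {
        return Err("feature bundle is empty".to_string());
    }
    let mut features = Vec::new();
    for (position, feature) in raw.split(';').enumerate() {
        if feature.is_empty() {
            return Err(format!("empty feature at position {position}"));
        }
        features.push(feature.to_string());
    }
    Ok(features)
}

fn main() {
    println!("{:?}", parse_bundle("V;1;SG;PST"));
    println!("{:?}", parse_bundle("V;;SG"));
    println!("{:?}", parse_bundle(""));
}
```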

DatasetStats

Statistics about a downloaded language dataset.

#![allow(unused)]
fn main() {
use unimorph_core::DatasetStats;

let stats = store.stats("heb")?;
if let Some(stats) = stats {
    println!("Total entries: {}", stats.total_entries);
    println!("Unique lemmas: {}", stats.unique_lemmas);
    println!("Unique forms: {}", stats.unique_forms);
    println!("Unique features: {}", stats.unique_features);
}
}

Fields

Field	Type	Description
`total_entries`	`usize`	Number of entries
`unique_lemmas`	`usize`	Distinct lemmas
`unique_forms`	`usize`	Distinct surface forms
`unique_features`	`usize`	Distinct feature bundles

Serialization

All types implement Serialize and Deserialize from serde:

#![allow(unused)]
fn main() {
use unimorph_core::Entry;

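// Keep the Vec alive so the borrowed entry does not outlive a temporary
let entries = store.inflect("heb", "כתב")?;
let entry = entries.first().expect("at least one entry");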

// To JSON
let json = serde_json::to_string(&entry)?;

// From JSON
let entry: Entry = serde_json::from_str(&json)?;
}

Store

The Store provides the query interface for morphological data.

Opening a Store

Usually accessed through Repository:

#![allow(unused)]
fn main() {
use unimorph_core::Repository;

let repo = Repository::open_default()?;
let store = repo.store();
}

Or open directly:

#![allow(unused)]
fn main() {
use unimorph_core::Store;

// Open existing database
let store = Store::open("path/to/datasets.db")?;

// In-memory store (for testing)
let store = Store::in_memory()?;
}

Basic Queries

Inflect (Lemma to Forms)

Look up all inflected forms of a lemma:

#![allow(unused)]
fn main() {
let forms = store.inflect("heb", "כתב")?;

for entry in &forms {
    println!("{} -> {} ({})", entry.lemma, entry.form, entry.features);
}

println!("Found {} forms", forms.len());
}

Analyze (Form to Lemmas)

Find all possible lemmas for a surface form:

#![allow(unused)]
fn main() {
let analyses = store.analyze("heb", "כתבו")?;

for entry in &analyses {
    println!("{} <- {} ({})", entry.form, entry.lemma, entry.features);
}

// Handle ambiguous forms
if analyses.len() > 1 {
    println!("Ambiguous: {} possible analyses", analyses.len());
}
}

Statistics

Get dataset statistics:

#![allow(unused)]
fn main() {
if let Some(stats) = store.stats("heb")? {
    println!("Entries: {}", stats.total_entries);
    println!("Lemmas: {}", stats.unique_lemmas);
    println!("Forms: {}", stats.unique_forms);
}
}

Check Language

Check if a language is loaded:

#![allow(unused)]
fn main() {
if store.has_language("heb")? {
    println!("Hebrew is available");
}

// List all languages
let languages = store.languages()?;
for lang in languages {
    println!("- {}", lang);
}
}

Query Builder

For flexible searching, use the query builder:

#![allow(unused)]
fn main() {
let results = store.query("heb")
    .lemma("כת%")           // LIKE pattern (% = any chars)
    .form("%ים")            // Forms ending in ים
    .pos("V")               // Part of speech
    .features_match("V;*;SG;*")  // Pattern match
    .features_contain(&["FUT"])  // Contains feature
    .limit(100)
    .offset(0)
    .execute()?;
}

See Query Builder for full documentation.

Data Management

Import Data

Import entries from TSV format:

#![allow(unused)]
fn main() {
use unimorph_core::{Entry, LangCode};

// Language codes must be exactly 3 lowercase letters, so use a placeholder code
let lang: LangCode = "tst".parse()?;
let entries = vec![
    Entry::parse_line("test\tform1\tN;SG", 1)?,
    Entry::parse_line("test\tform2\tN;PL", 2)?,
];

store.import(&lang, &entries, None, None)?;
}

Delete Language

Remove a language from the store:

#![allow(unused)]
fn main() {
let removed = store.delete_language("heb")?;
println!("Removed {} entries", removed);
}

Export

Export to various formats:

#![allow(unused)]
fn main() {
// Export to TSV file
let count = store.export_tsv("heb", "hebrew.tsv")?;

// Export to JSONL file
let count = store.export_jsonl("heb", "hebrew.jsonl")?;

// Export to writer (e.g., stdout)
use std::io::stdout;
let count = store.export_tsv_to_writer("heb", stdout().lock())?;

// Parquet (with feature flag)
#[cfg(feature = "parquet")]
let count = store.export_parquet("heb", "hebrew.parquet")?;
}

Thread Safety

Store is Send but not Sync. For concurrent access, use a mutex or create separate store instances:

#![allow(unused)]
fn main() {
use std::sync::Mutex;

let store = Mutex::new(Store::open("datasets.db")?);

// In threads:
let store = store.lock().unwrap();
let results = store.inflect("heb", "כתב")?;
}
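To share one handle across several threads, the mutex usually needs to sit inside an `Arc`. The sketch below shows that pattern with a hypothetical `FakeStore` type standing in for the real store, so it runs with the standard library alone; `run_lookups` is likewise an illustrative name.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a store handle that must not be used
/// from two threads at the same time.
struct FakeStore {
    lookups: usize,
}

impl FakeStore {
    fn inflect(&mut self, lemma: &str) -> String {
        self.lookups += 1;
        format!("forms of {lemma}")
    }
}

/// Runs one lookup per lemma on its own thread, sharing a single store
/// behind Arc<Mutex<..>>, and returns how many lookups the store served.
fn run_lookups(lemmas: &[&'static str]) -> usize {
    let store = Arc::new(Mutex::new(FakeStore { lookups: 0 }));
    let handles: Vec<_> = lemmas
        .iter()
        .map(|&lemma| {
            let store = Arc::clone(&store);
            thread::spawn(move || {
                // The guard is dropped when the closure returns, releasing the lock.
                let mut guard = store.lock().unwrap();
                guard.inflect(lemma)
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let served = store.lock().unwrap().lookups;
    served
}

fn main() {
    println!("lookups served: {}", run_lookups(&["hablar", "comer", "vivir"]));
}
```

A `Mutex<T>` is `Sync` whenever `T` is `Send`, which is why wrapping a `Send`-only type this way is enough to share it. Lookups are serialized by the lock, so separate store instances may be the better choice for read-heavy workloads.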

Error Handling

#![allow(unused)]
fn main() {
use unimorph_core::{Store, Error};

match store.inflect("xyz", "test") {
    Ok(entries) => println!("Found {} entries", entries.len()),
    Err(Error::LanguageNotFound(lang)) => {
        println!("Language {} not downloaded", lang);
    }
    Err(e) => return Err(e.into()),
}
}

Repository

The Repository manages data downloads, caching, and provides access to the underlying store.

Creating a Repository

#![allow(unused)]
fn main() {
use unimorph_core::Repository;

// Default location (~/.cache/unimorph)
let repo = Repository::open_default()?;

// Custom location
let repo = Repository::open("/path/to/data")?;

// Custom location with PathBuf
use std::path::PathBuf;
let path = PathBuf::from("/path/to/data");
let repo = Repository::open(&path)?;
}

Downloading Data

Download a language dataset from UniMorph:

#![allow(unused)]
fn main() {
// Download (async)
repo.download("heb").await?;

// Force re-download
repo.download_with_options("heb", true).await?;
}

Compressed Files and Git LFS

Some large datasets are distributed differently due to GitHub file size limits:

Format	Languages	Notes
`.xz` (LZMA)	ces, pol, slk, ukr	Best compression for text
`.zip`	rus (segmentations), san	Archive format
Git LFS	ces (full MorfFlex)	For files > 100MB

The repository automatically:

Tries compressed versions first (.xz, .gz)
Falls back to uncompressed if not found
Detects Git LFS pointers and fetches from media endpoint
Decompresses transparently before importing

No special handling is needed - just call download() as usual.
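The fallback order and the Git LFS check can be pictured roughly as follows. This is a sketch under stated assumptions: `candidate_names` and `is_lfs_pointer` are hypothetical helpers, and the bare `<lang>` base name is invented for illustration (the real dataset file names are not given here). The one grounded detail is that a Git LFS pointer is a small text file whose first line names the LFS spec URL.

```rust
/// Suffixes to try in order: compressed variants first, then plain (empty suffix).
const SUFFIXES: [&str; 3] = [".xz", ".gz", ""];

/// Builds candidate file names in the order they would be tried.
/// The bare `<lang>` base name is an assumption made for this sketch.
fn candidate_names(lang: &str) -> Vec<String> {
    SUFFIXES
        .iter()
        .map(|suffix| format!("{lang}{suffix}"))
        .collect()
}

/// Detects a Git LFS pointer by its first line, which names the LFS spec.
fn is_lfs_pointer(content: &str) -> bool {
    content.starts_with("version https://git-lfs.github.com/spec/v1")
}

fn main() {
    println!("{:?}", candidate_names("ces"));
    let pointer = "version https://git-lfs.github.com/spec/v1\noid sha256:abc\nsize 12\n";
    println!("pointer? {}", is_lfs_pointer(pointer));
    println!("pointer? {}", is_lfs_pointer("lemma\tform\tV;IND\n"));
}
```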

Parse Reporting

When parsing downloaded data, use Entry::parse_tsv_with_report() for detailed diagnostics:

#![allow(unused)]
fn main() {
use unimorph_core::{Entry, ParseReport, CompressionFormat};

let content = "lemma\tform\tV;IND\nbad line\nlemma2\tform2\tN;SG\n";
let (entries, report) = Entry::parse_tsv_with_report(content);

println!("Valid entries: {}", report.valid_entries);
println!("Blank lines: {}", report.blank_lines);
println!("Malformed: {}", report.malformed_count);

// Inspect malformed entries (first 10 stored)
for entry in &report.malformed {
    println!("  Line {}: {} - {}", 
        entry.line_num, 
        entry.reason,
        entry.content
    );
}
}

The ParseReport includes:

Field	Type	Description
`valid_entries`	`usize`	Successfully parsed entries
`blank_lines`	`usize`	Empty lines (not an error)
`malformed_count`	`usize`	Total entries that failed
`malformed`	`Vec<MalformedEntry>`	Details for first 10 failures
`compression`	`CompressionFormat`	Source file format
`from_lfs`	`bool`	Whether fetched via Git LFS
`filename`	`Option<String>`	Source filename(s)
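How the counters relate to the input can be sketched with a small standalone parser. The `Report` struct and `parse_with_report` function below are hypothetical mirrors of the fields described above, written with the standard library only; they are not the crate's types, and the real parser may apply stricter checks (for example on the feature bundle).

```rust
/// Hypothetical mirror of the counters described above (not the crate's type).
#[derive(Debug, Default)]
struct Report {
    valid_entries: usize,
    blank_lines: usize,
    malformed_count: usize,
    /// (1-based line number, reason), capped at MAX_STORED entries.
    malformed: Vec<(usize, String)>,
}

const MAX_STORED: usize = 10;

/// Parses lemma<TAB>form<TAB>features lines and reports what was skipped.
fn parse_with_report(content: &str) -> (Vec<(String, String, String)>, Report) {
    let mut entries = Vec::new();
    let mut report = Report::default();
    for (index, line) in content.lines().enumerate() {
        if line.trim().is_empty() {
            report.blank_lines += 1;
            continue;
        }
        let fields: Vec<&str> = line.split('\t').collect();
        if fields.len() == 3 && fields.iter().all(|f| !f.is_empty()) {
            entries.push((
                fields[0].to_string(),
                fields[1].to_string(),
                fields[2].to_string(),
            ));
            report.valid_entries += 1;
        } else {
            // Every failure is counted, but only the first few are stored.
            report.malformed_count += 1;
            if report.malformed.len() < MAX_STORED {
                let reason = "expected 3 non-empty tab-separated fields".to_string();
                report.malformed.push((index + 1, reason));
            }
        }
    }
    (entries, report)
}

fn main() {
    let content = "lemma\tform\tV;IND\nbad line\n\nlemma2\tform2\tN;SG\n";
    let (entries, report) = parse_with_report(content);
    println!("{} entries, report: {:?}", entries.len(), report);
}
```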

The CompressionFormat enum:

#![allow(unused)]
fn main() {
pub enum CompressionFormat {
    None,   // Plain text
    Xz,     // .xz (LZMA)
    Gzip,   // .gz
    Zip,    // .zip archive
}
}

Accessing the Store

Get the underlying store for queries:

#![allow(unused)]
fn main() {
let store = repo.store();

let forms = store.inflect("heb", "כתב")?;
}

Checking Cached Languages

#![allow(unused)]
fn main() {
// List cached languages
let languages = repo.cached_languages()?;
for lang in &languages {
    println!("Cached: {}", lang);
}

// Check if specific language is cached
if languages.iter().any(|l| l.as_ref() == "heb") {
    println!("Hebrew is cached");
}
}

Data Directory

The repository manages a data directory containing:

~/.cache/unimorph/
├── datasets.db              # SQLite database
└── available_languages.json # Cached API response

Get the data directory:

#![allow(unused)]
fn main() {
let data_dir = repo.data_dir();
println!("Data stored in: {}", data_dir.display());
}

Full Example

use unimorph_core::Repository;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Open repository
    let repo = Repository::open_default()?;
    
    // Download Hebrew if not cached
    let cached = repo.cached_languages()?;
    if !cached.iter().any(|l| l.as_ref() == "heb") {
        println!("Downloading Hebrew...");
        repo.download("heb").await?;
    }
    
    // Query the data
    let store = repo.store();
    let forms = store.inflect("heb", "כתב")?;
    
    println!("Found {} forms of כתב:", forms.len());
    for entry in &forms {
        println!("  {} - {}", entry.form, entry.features);
    }
    
    Ok(())
}

Error Handling

#![allow(unused)]
fn main() {
use unimorph_core::{Repository, Error};

async fn download_language(repo: &Repository, lang: &str) -> anyhow::Result<()> {
    match repo.download(lang).await {
        Ok(()) => println!("Downloaded {}", lang),
        Err(Error::Network(e)) => {
            println!("Network error: {}", e);
            println!("Check your connection and try again");
        }
        Err(Error::InvalidLanguage(l)) => {
            println!("Invalid language code: {}", l);
        }
        Err(e) => return Err(e.into()),
    }
    Ok(())
}
}

Async Runtime

Download operations are async and require a runtime:

use unimorph_core::Repository;

// With tokio
#[tokio::main]
async fn main() {
    let repo = Repository::open_default().unwrap();
    repo.download("heb").await.unwrap();
}

// Or with block_on
fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap();
    let repo = Repository::open_default().unwrap();
    rt.block_on(repo.download("heb")).unwrap();
}

Query Builder

The query builder provides a fluent interface for flexible searching.

Basic Usage

#![allow(unused)]
fn main() {
let results = store.query("heb")
    .limit(100)
    .execute()?;
}

Filter Methods

By Lemma

#![allow(unused)]
fn main() {
// Exact match
.lemma("כתב")

// LIKE pattern (% = any chars, _ = single char)
.lemma("כת%")      // Starts with כת
.lemma("%ב")       // Ends with ב
.lemma("%בר%")     // Contains בר
.lemma("___")      // Exactly 3 characters
}
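The LIKE wildcards can be illustrated with a small standalone matcher. The `like` function below is a hypothetical sketch of the pattern semantics only; the library presumably delegates matching to SQLite, whose LIKE is additionally case-insensitive for ASCII letters by default, whereas this sketch compares characters exactly.

```rust
/// Matches `text` against a LIKE-style pattern where `%` is any run of
/// characters (including none) and `_` is exactly one character.
/// Works on Unicode scalar values, so one Hebrew letter counts as one character.
fn like(text: &str, pattern: &str) -> bool {
    let text: Vec<char> = text.chars().collect();
    let pattern: Vec<char> = pattern.chars().collect();
    like_chars(&text, &pattern)
}

fn like_chars(text: &[char], pattern: &[char]) -> bool {
    match pattern.split_first() {
        // Pattern exhausted: match only if the text is exhausted too.
        None => text.is_empty(),
        // `%`: try every possible number of skipped characters.
        Some((&'%', rest)) => (0..=text.len()).any(|skip| like_chars(&text[skip..], rest)),
        // `_`: consume exactly one character.
        Some((&'_', rest)) => !text.is_empty() && like_chars(&text[1..], rest),
        // Literal: the next character must be identical.
        Some((&literal, rest)) => {
            text.first() == Some(&literal) && like_chars(&text[1..], rest)
        }
    }
}

fn main() {
    println!("{}", like("כתב", "כת%"));
    println!("{}", like("כתב", "___"));
    println!("{}", like("כתבתי", "___"));
}
```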

By Form

#![allow(unused)]
fn main() {
// Exact match
.form("כתבתי")

// LIKE pattern
.form("%ים")       // Plural forms ending in ים
.form("ה%")        // Forms starting with ה
}

By Part of Speech

#![allow(unused)]
fn main() {
.pos("V")          // Verbs
.pos("N")          // Nouns
.pos("ADJ")        // Adjectives
}

By Features (Pattern Match)

Position-dependent matching with wildcards:

#![allow(unused)]
fn main() {
// Match specific positions
.features_match("V;1;SG;*")      // 1st person singular verbs
.features_match("V;*;*;PST;*")   // Past tense verbs
.features_match("N;*;PL;*")      // Plural nouns
}

By Features (Contains)

Position-independent matching:

#![allow(unused)]
fn main() {
// Has these features anywhere
.features_contain(&["FUT"])           // Future tense
.features_contain(&["PL", "MASC"])    // Plural masculine
.features_contain(&["V", "1", "SG"])  // 1st person singular verbs
}

Pagination

#![allow(unused)]
fn main() {
// First page
.limit(20)
.offset(0)

// Second page
.limit(20)
.offset(20)

// All results (careful with large datasets!)
.limit(usize::MAX)
}

Executing Queries

Get Results

#![allow(unused)]
fn main() {
let entries: Vec<Entry> = store.query("heb")
    .pos("V")
    .limit(100)
    .execute()?;

for entry in &entries {
    println!("{} {} {}", entry.lemma, entry.form, entry.features);
}
}

Count Results

#![allow(unused)]
fn main() {
let count = store.query("heb")
    .pos("V")
    .count()?;

println!("Found {} verbs", count);
}

Check Existence

#![allow(unused)]
fn main() {
let exists = store.query("heb")
    .lemma("כתב")
    .exists()?;

if exists {
    println!("Lemma found");
}
}

Get First Result

#![allow(unused)]
fn main() {
if let Some(entry) = store.query("heb")
    .lemma("כתב")
    .first()?
{
    println!("First form: {}", entry.form);
}
}

Chaining Filters

Filters are combined with AND logic:

#![allow(unused)]
fn main() {
let results = store.query("heb")
    .lemma("כת%")                    // AND
    .pos("V")                         // AND
    .features_contain(&["FUT"])       // AND
    .limit(10)
    .execute()?;
}

Examples

Find All Verb Infinitives

#![allow(unused)]
fn main() {
let infinitives = store.query("heb")
    .pos("V")
    .features_contain(&["NFIN"])
    .execute()?;
}

Find Ambiguous Forms

Forms that could be multiple parts of speech:

#![allow(unused)]
fn main() {
let form = "שמר";

let as_verb = store.query("heb")
    .form(form)
    .pos("V")
    .execute()?;

let as_noun = store.query("heb")
    .form(form)
    .pos("N")
    .execute()?;

if !as_verb.is_empty() && !as_noun.is_empty() {
    println!("{} is ambiguous (verb and noun)", form);
}
}

Paginate Through All Results

#![allow(unused)]
fn main() {
let page_size = 100;
let mut offset = 0;

loop {
    let results = store.query("heb")
        .pos("V")
        .limit(page_size)
        .offset(offset)
        .execute()?;
    
    if results.is_empty() {
        break;
    }
    
    for entry in &results {
        // Process entry
    }
    
    offset += page_size;
}
}
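The loop above relies on limit/offset semantics: each page starts where the previous one ended, and an empty page signals the end. A minimal in-memory sketch of that behaviour, using hypothetical `page` and `visit_all` helpers over a slice rather than a database:

```rust
/// Returns the page starting at `offset` with at most `limit` items,
/// mirroring LIMIT/OFFSET semantics on an in-memory slice.
fn page<T>(items: &[T], limit: usize, offset: usize) -> &[T] {
    let start = offset.min(items.len());
    let end = start.saturating_add(limit).min(items.len());
    &items[start..end]
}

/// Walks all pages and returns how many items were visited.
fn visit_all<T>(items: &[T], page_size: usize) -> usize {
    let mut offset = 0;
    let mut visited = 0;
    loop {
        let batch = page(items, page_size, offset);
        if batch.is_empty() {
            break;
        }
        visited += batch.len();
        offset += page_size;
    }
    visited
}

fn main() {
    let items: Vec<u32> = (0..250).collect();
    println!("last page size: {}", page(&items, 100, 200).len());
    println!("visited: {}", visit_all(&items, 100));
}
```

When the total is an exact multiple of the page size, the loop issues one extra query that returns nothing; stopping as soon as a page comes back shorter than `page_size` would avoid it.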

Export Filtered Subset

#![allow(unused)]
fn main() {
use std::io::Write;

let mut file = std::fs::File::create("verbs.tsv")?;

let verbs = store.query("heb")
    .pos("V")
    .limit(usize::MAX)
    .execute()?;

for entry in &verbs {
    writeln!(file, "{}\t{}\t{}", entry.lemma, entry.form, entry.features)?;
}
}

Performance Tips

Use limits: Always set a reasonable limit
Prefer specific filters: More filters = faster queries
Use count() first: Check result size before fetching all
Index-friendly queries: Lemma and form queries use indexes

#![allow(unused)]
fn main() {
// Good: Uses index
.lemma("כתב")

// Good: Uses index
.form("כתבתי")

// Slower: Full scan with pattern
.lemma("%תב%")

// Slower: Feature scan
.features_contain(&["FUT"])
}

Python Bindings

The unimorph-rs Python package provides fast, Rust-powered access to UniMorph morphological data with native Polars DataFrame support.

Installation

pip install unimorph-rs

For Polars DataFrame support:

pip install unimorph-rs[polars]

Requirements

Python 3.9+
Polars (optional, for DataFrame methods)

Quick Start

from unimorph import Store, download

# Download a language dataset (one-time)
download("spa")  # Spanish

# Create a store to query the data
store = Store()

# Get all inflected forms of a lemma
forms = store.inflect("spa", "hablar")
for entry in forms:
    print(f"{entry.form}: {entry.features}")

Output:

hablar: V;NFIN
hablando: V;V.CVB;PRS
hablado: V;V.PTCP;PST;MASC;SG
hablo: V;IND;PRS;1;SG
hablas: V;IND;PRS;2;SG
habla: V;IND;PRS;3;SG
...

Core API

download(lang)

Downloads a language dataset from UniMorph. Only needs to be called once per language.

from unimorph import download

download("deu")  # German
download("spa")  # Spanish
download("fra")  # French

Store

The main interface for querying morphological data.

from unimorph import Store

store = Store()

store.inflect(lang, lemma)

Get all inflected forms for a lemma (dictionary form).

forms = store.inflect("deu", "gehen")  # "to go" in German
for entry in forms:
    print(f"{entry.lemma} -> {entry.form}: {entry.features}")

store.analyze(lang, form)

Analyze a word form to find possible lemmas and features.

analyses = store.analyze("spa", "hablamos")
for entry in analyses:
    print(f"{entry.form} <- {entry.lemma}: {entry.features}")

store.search_features(lang, features, limit=None)

Search for entries containing specific morphological features.

# Find all past tense subjunctive forms in Spanish
entries = store.search_features("spa", "SBJV;PST", limit=100)

store.stats(lang)

Get statistics about a downloaded language dataset.

stats = store.stats("spa")
if stats:
    print(f"Entries: {stats.total_entries}")
    print(f"Unique lemmas: {stats.unique_lemmas}")
    print(f"Unique forms: {stats.unique_forms}")

store.languages()

List all downloaded languages.

langs = store.languages()
print(langs)  # ['deu', 'ita', 'spa', ...]

store.has_language(lang)

Check if a language is downloaded.

if store.has_language("fra"):
    print("French data is available")

Polars DataFrame Support

Note: Requires pip install unimorph-rs[polars]

All query methods have _df variants that return Polars DataFrames for easy data analysis.

from unimorph import Store, download

download("spa")
store = Store()

# Get results as a DataFrame
df = store.inflect_df("spa", "ser")
print(df)

Output:

shape: (70, 3)
+-------+---------+------------------------+
| lemma | form    | features               |
| ---   | ---     | ---                    |
| str   | str     | str                    |
+-------+---------+------------------------+
| ser   | ser     | V;NFIN                 |
| ser   | siendo  | V;V.CVB;PRS            |
| ser   | sido    | V;V.PTCP;PST;MASC;SG   |
| ser   | soy     | V;IND;PRS;1;SG         |
| ser   | eres    | V;IND;PRS;2;SG         |
| ...   | ...     | ...                    |
+-------+---------+------------------------+

DataFrame Methods

store.inflect_df(lang, lemma) - Inflections as DataFrame
store.analyze_df(lang, form) - Analyses as DataFrame
store.search_features_df(lang, features, limit=None) - Feature search as DataFrame

Working with DataFrames

import polars as pl

df = store.inflect_df("spa", "hablar")

# Filter to indicative mood only
indicative = df.filter(pl.col("features").str.contains("IND"))

# Label each indicative form with its tense
by_tense = df.filter(
    pl.col("features").str.contains("IND")
).with_columns(
    pl.when(pl.col("features").str.contains("PRS")).then(pl.lit("present"))
      .when(pl.col("features").str.contains("PST")).then(pl.lit("past"))
      .when(pl.col("features").str.contains("FUT")).then(pl.lit("future"))
      .otherwise(pl.lit("other"))
      .alias("tense")
)

print(by_tense)

Entry Objects

Query methods return Entry objects with the following attributes:

Attribute	Type	Description
`lemma`	str	Dictionary form / citation form
`form`	str	Inflected surface form
`features`	str	UniMorph feature bundle (semicolon-separated)

entry = store.inflect("spa", "hablar")[0]
print(entry.lemma)     # "hablar"
print(entry.form)      # "hablar"
print(entry.features)  # "V;NFIN"
print(repr(entry))     # Entry(lemma='hablar', form='hablar', features='V;NFIN')

DatasetStats Objects

Statistics returned by store.stats():

Attribute	Type	Description
`language`	str	Language code
`total_entries`	int	Total number of entries
`unique_lemmas`	int	Number of unique lemmas
`unique_forms`	int	Number of unique forms
`unique_features`	int	Number of unique feature bundles
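To make the counts concrete, here is a minimal sketch that derives the same five values from a list of (lemma, form, features) rows. The `Stats` dataclass and `summarize` function are hypothetical stand-ins, and it is an assumption that the library counts distinct values in exactly this way.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Hypothetical stand-in mirroring the attributes listed above."""
    language: str
    total_entries: int
    unique_lemmas: int
    unique_forms: int
    unique_features: int


def summarize(language: str, triples: list[tuple[str, str, str]]) -> Stats:
    """Count rows and the distinct values in each column."""
    lemmas = {lemma for lemma, _, _ in triples}
    forms = {form for _, form, _ in triples}
    bundles = {features for _, _, features in triples}
    return Stats(language, len(triples), len(lemmas), len(forms), len(bundles))


rows = [
    ("hablar", "hablo", "V;IND;PRS;1;SG"),
    ("hablar", "hablamos", "V;IND;PRS;1;PL"),
    ("comer", "como", "V;IND;PRS;1;SG"),
]
print(summarize("spa", rows))
```

Note that unique_features here counts whole feature bundles, not individual tags, which matches the description in the table.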

Example: Building a Conjugation Table

import polars as pl
from unimorph import Store, download

download("spa")
store = Store()

# Get all forms of "hablar" (to speak)
df = store.inflect_df("spa", "hablar")

# Filter to present indicative
present = df.filter(
    pl.col("features").str.contains("IND") & 
    pl.col("features").str.contains("PRS")
)

# Extract person and number
conjugation = present.with_columns([
    pl.when(pl.col("features").str.contains("1")).then(pl.lit("1st"))
      .when(pl.col("features").str.contains("2")).then(pl.lit("2nd"))
      .when(pl.col("features").str.contains("3")).then(pl.lit("3rd"))
      .alias("person"),
    pl.when(pl.col("features").str.contains("SG")).then(pl.lit("singular"))
      .when(pl.col("features").str.contains("PL")).then(pl.lit("plural"))
      .alias("number")
]).select(["person", "number", "form"])

print(conjugation)
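The same table can be built without Polars. The following is a rough sketch in plain Python that works on (form, features) pairs and compares whole tags instead of substrings. The `conjugation_grid` helper is hypothetical and not part of the library.

```python
def conjugation_grid(rows: list[tuple[str, str]]) -> dict[tuple[str, str], str]:
    """Map (person, number) to the form for present indicative rows.

    rows holds (form, features) pairs.
    """
    grid = {}
    for form, features in rows:
        tags = set(features.split(";"))
        # Keep only rows tagged both indicative and present
        if not {"IND", "PRS"} <= tags:
            continue
        person = next((p for p in ("1", "2", "3") if p in tags), None)
        number = next((n for n in ("SG", "PL") if n in tags), None)
        if person and number:
            grid[(person, number)] = form
    return grid


rows = [
    ("hablar", "V;NFIN"),
    ("hablo", "V;IND;PRS;1;SG"),
    ("hablas", "V;IND;PRS;2;SG"),
    ("hablamos", "V;IND;PRS;1;PL"),
]
grid = conjugation_grid(rows)
print(grid[("1", "PL")])  # hablamos
```

Splitting on the semicolon compares complete tags, which avoids accidental substring hits such as "PTCP" inside "V.PTCP".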

About UniMorph

UniMorph is a collaborative project that provides morphological paradigms for the world's languages in a standardized format.

What is Morphology?

Morphology is the study of word structure and how words change form to express different grammatical meanings. For example:

English: "walk" -> "walks", "walked", "walking"
Spanish: "hablar" -> "hablo", "hablas", "habla", "hablamos"...
Hebrew: "כתב" -> "כותב", "כתבתי", "יכתוב"...

What UniMorph Provides

UniMorph datasets contain mappings from lemmas (dictionary forms) to their inflected forms, along with morphological features describing each form.

Data Format

Each entry is a triple:

lemma <TAB> form <TAB> features

Example (Spanish):

hablar	hablo	V;IND;PRS;1;SG
hablar	hablas	V;IND;PRS;2;SG
hablar	habla	V;IND;PRS;3;SG
hablar	hablamos	V;IND;PRS;1;PL
hablar	habláis	V;IND;PRS;2;PL
hablar	hablan	V;IND;PRS;3;PL
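As a minimal sketch of how this format can be read, the snippet below splits tab-separated lines into named triples. The `Triple` type and `parse_line` function are hypothetical names chosen for illustration; they are not part of unimorph-rs.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One UniMorph line: lemma, inflected form, feature bundle."""
    lemma: str
    form: str
    features: str


def parse_line(line: str) -> Triple:
    """Split one tab-separated UniMorph line into its three fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    return Triple(*fields)


sample = "hablar\thablo\tV;IND;PRS;1;SG\nhablar\thablas\tV;IND;PRS;2;SG\n"
triples = [parse_line(line) for line in sample.splitlines() if line.strip()]
print(triples[0].form)  # hablo
```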

Coverage

UniMorph includes data for 180+ languages, including:

High-resource languages: English, Spanish, German, French
Medium-resource languages: Finnish, Hungarian, Turkish
Low-resource languages: Many endangered and under-documented languages

Data Sources

UniMorph data comes from:

Wiktionary extractions
Linguistic databases
Academic contributions
Community submissions

Use Cases

Natural Language Processing

Training morphological inflection models
Data augmentation for NLU systems
Lemmatization and stemming lookup tables

Language Learning

Conjugation practice applications
Flashcard generation
Grammar reference tools

Linguistic Research

Cross-linguistic typology studies
Morphological complexity analysis
Paradigm structure research

Lexicography

Dictionary development
Inflection table generation
Coverage verification

The UniMorph Schema

UniMorph uses a standardized feature schema across all languages, making cross-linguistic comparison possible. Features are organized into dimensions:

Part of Speech (V, N, ADJ, ...)
Person (1, 2, 3)
Number (SG, PL, DU)
Tense (PST, PRS, FUT)
And many more...

See the official UniMorph schema documentation for the complete specification, or our Feature Schema page for a quick reference.

Contributing to UniMorph

UniMorph is open source. Each language has its own GitHub repository:

Main site: unimorph.github.io
Organization: github.com/unimorph

Contributions are welcome:

Report data errors
Add missing forms
Contribute new languages

Citation

If you use UniMorph in research, please cite:

@inproceedings{mccarthy-etal-2020-unimorph,
    title = "{U}ni{M}orph 3.0: Universal Morphology",
    author = "McCarthy, Arya D. and others",
    booktitle = "LREC",
    year = "2020",
}

Related Projects

SIGMORPHON: Shared tasks on morphological analysis
Universal Dependencies: Syntactic annotation
Lexical Markup Framework: ISO standard for lexical resources

Feature Schema

UniMorph uses a standardized feature schema to annotate morphological forms. Features are semicolon-separated and position-dependent within each language.

For the complete official specification, see the UniMorph Schema documentation (PDF).

Feature Format

FEATURE1;FEATURE2;FEATURE3;...

Example: V;IND;PRS;1;SG means:

V = Verb
IND = Indicative mood
PRS = Present tense
1 = First person
SG = Singular number
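The breakdown above can be reproduced with a short lookup. This is only a sketch: the `GLOSSES` table covers just the five tags from this example, and `describe` is a hypothetical helper rather than a library function.

```python
# Glosses copied from the example above; not the full schema.
GLOSSES = {
    "V": "Verb",
    "IND": "Indicative mood",
    "PRS": "Present tense",
    "1": "First person",
    "SG": "Singular number",
}


def describe(bundle: str) -> list[str]:
    """Return a readable gloss for each feature, in bundle order."""
    return [f"{tag} = {GLOSSES.get(tag, 'unknown')}" for tag in bundle.split(";")]


for line in describe("V;IND;PRS;1;SG"):
    print(line)
```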

Feature Dimensions

Part of Speech

Feature	Description
`V`	Verb
`N`	Noun
`ADJ`	Adjective
`ADV`	Adverb
`PRO`	Pronoun
`DET`	Determiner
`ADP`	Adposition
`NUM`	Numeral
`CONJ`	Conjunction
`PART`	Particle
`INTJ`	Interjection
`V.MSDR`	Verbal noun / Masdar
`V.PTCP`	Participle
`V.CVB`	Converb

Person

Feature	Description
`1`	First person
`2`	Second person
`3`	Third person
`4`	Fourth person (obviative)
`INCL`	Inclusive
`EXCL`	Exclusive

Number

Feature	Description
`SG`	Singular
`PL`	Plural
`DU`	Dual
`TRI`	Trial
`PAUC`	Paucal
`GRPL`	Greater plural

Gender

Feature	Description
`MASC`	Masculine
`FEM`	Feminine
`NEUT`	Neuter
`NAKH1`-`NAKH8`	Nakh-Daghestanian noun classes

Case

Feature	Description
`NOM`	Nominative
`ACC`	Accusative
`GEN`	Genitive
`DAT`	Dative
`INS`	Instrumental
`LOC`	Locative
`ABL`	Ablative
`VOC`	Vocative
`ESS`	Essive
`TRANS`	Translative
`COM`	Comitative
`PRIV`	Privative
`PRT`	Partitive
And many more...

Tense

Feature	Description
`PRS`	Present
`PST`	Past
`FUT`	Future
`IPFV`	Imperfective
`PFV`	Perfective
`PRF`	Perfect
`PLPRF`	Pluperfect
`PROSP`	Prospective

Aspect

Feature	Description
`IPFV`	Imperfective
`PFV`	Perfective
`HAB`	Habitual
`PROG`	Progressive
`ITER`	Iterative

Mood

Feature	Description
`IND`	Indicative
`SBJV`	Subjunctive
`IMP`	Imperative
`COND`	Conditional
`OPT`	Optative
`POT`	Potential
`PURP`	Purposive

Voice

Feature	Description
`ACT`	Active
`PASS`	Passive
`MID`	Middle
`ANTIP`	Antipassive
`CAUS`	Causative

Finiteness

Feature	Description
`FIN`	Finite
`NFIN`	Non-finite

Definiteness

Feature	Description
`DEF`	Definite
`NDEF`	Indefinite
`SPEC`	Specific
`NSPEC`	Non-specific

Comparison

Feature	Description
`CMPR`	Comparative
`SPRL`	Superlative

Polarity

Feature	Description
`POS`	Positive
`NEG`	Negative

Possession

Feature	Description
`PSS1S`	1st person singular possessor
`PSS2S`	2nd person singular possessor
`PSS3S`	3rd person singular possessor
`PSS1P`	1st person plural possessor
`PSS2P`	2nd person plural possessor
`PSS3P`	3rd person plural possessor
`PSSD`	Possessed form

Language-Specific Features

Some languages have additional features not listed above. Use unimorph features -l <lang> --list to see all features used in a specific language.

Feature Position

Feature positions vary by language. For example:

Hebrew verbs: V;PERSON;NUMBER;TENSE;GENDER

V;1;SG;PST     (1st person singular past)
V;3;PL;FUT;MASC (3rd person plural future masculine)

Spanish verbs: V;MOOD;TENSE;PERSON;NUMBER

V;IND;PRS;1;SG  (indicative present 1st singular)
V;SBJV;PST;3;PL (subjunctive past 3rd plural)
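One way to work with language-specific ordering is to pair each position with a slot name. The sketch below assumes the two slot orders shown above hold for every verb entry in those languages, which may not be true of the real data. `SLOT_ORDER` and `label_slots` are hypothetical names.

```python
# Slot orders taken from the two examples above; other languages differ.
SLOT_ORDER = {
    "heb": ["pos", "person", "number", "tense", "gender"],
    "spa": ["pos", "mood", "tense", "person", "number"],
}


def label_slots(lang: str, bundle: str) -> dict[str, str]:
    """Pair each feature with its slot name; trailing slots may be absent."""
    return dict(zip(SLOT_ORDER[lang], bundle.split(";")))


print(label_slots("spa", "V;SBJV;PST;3;PL"))
print(label_slots("heb", "V;1;SG;PST"))
```

Because zip stops at the shorter sequence, a Hebrew bundle without a gender tag simply yields no "gender" key.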

Working with Features

CLI

# List all features in a language
unimorph features -l heb --list

# See feature statistics
unimorph features -l heb --stats

# Find entries with a feature
unimorph features -l heb --search FUT

# Search by feature pattern
unimorph search -l heb -f "V;1;SG;*"

# Search by contained features
unimorph search -l heb --contains PL,MASC

Library

use unimorph_core::FeatureBundle;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse a semicolon-separated feature string into a bundle
    let features: FeatureBundle = "V;1;SG;PST".parse()?;

    // Check for a specific feature
    if features.contains("PST") {
        println!("Past tense");
    }

    // Pattern matching: `*` is a wildcard
    if features.matches("V;*;SG;*") {
        println!("Singular verb");
    }

    Ok(())
}
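To illustrate the two kinds of matching, here is a minimal Python sketch of one plausible reading. It assumes that a pattern is compared slot by slot, that `*` stands for exactly one feature, and that a "contains" search looks for whole tags. The real matcher may behave differently, for example by letting a trailing `*` cover several features.

```python
def contains_all(bundle: str, wanted: list[str]) -> bool:
    """True if every wanted feature appears as a whole tag in the bundle."""
    return set(wanted) <= set(bundle.split(";"))


def matches(bundle: str, pattern: str) -> bool:
    """Positional match; '*' accepts any single feature in that slot (assumed)."""
    tags, slots = bundle.split(";"), pattern.split(";")
    if len(tags) != len(slots):
        return False
    return all(slot in ("*", tag) for tag, slot in zip(tags, slots))


print(matches("V;1;SG;PST", "V;*;SG;*"))                 # True
print(contains_all("V;3;PL;FUT;MASC", ["PL", "MASC"]))   # True
```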

References

UniMorph Schema Documentation (PDF) - Official schema specification
UniMorph Website - Main project site
UniMorph GitHub - Language repositories
Leipzig Glossing Rules - Standard for interlinear glossing
SIGMORPHON - Shared tasks using UniMorph data

Available Languages

UniMorph provides morphological data for 180+ languages. Use unimorph list --available to see the current list.

For the complete list of languages with download links, see the official UniMorph languages page.

Listing Languages

# See all available languages
unimorph list --available

# See cached (downloaded) languages
unimorph list --cached

# Refresh the available list
unimorph list --available --refresh

Language Codes

UniMorph uses ISO 639-3 three-letter language codes:

Code	Language
`ara`	Arabic
`deu`	German
`ell`	Greek
`eng`	English
`fas`	Persian
`fin`	Finnish
`fra`	French
`heb`	Hebrew
`hin`	Hindi
`hun`	Hungarian
`ita`	Italian
`jpn`	Japanese
`kat`	Georgian
`kor`	Korean
`lat`	Latin
`nld`	Dutch
`pol`	Polish
`por`	Portuguese
`ron`	Romanian
`rus`	Russian
`spa`	Spanish
`swe`	Swedish
`tur`	Turkish
`ukr`	Ukrainian
`zho`	Chinese

And many more...

Dataset Sizes

Dataset sizes vary significantly:

Language	Entries	Lemmas
Finnish (`fin`)	2.7M+	50K+
Spanish (`spa`)	1.2M+	10K+
German (`deu`)	500K+	50K+
Italian (`ita`)	500K+	10K+
Hebrew (`heb`)	33K+	1K+

Check specific sizes with:

unimorph stats <lang>

Language Repositories

Each language has its own GitHub repository under the UniMorph organization:

https://github.com/unimorph/<code>

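For example, the Spanish data lives at https://github.com/unimorph/spa and the Hebrew data at https://github.com/unimorph/heb.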

You can also browse all languages on the UniMorph website.

Data Quality

Data quality varies by language:

High quality: Languages with extensive Wiktionary coverage
Medium quality: Languages with academic contributions
Lower quality: Newer or less-resourced languages

Check the language's GitHub repository for:

Data sources
Known issues
Contribution guidelines

Finding Language Codes

If you don't know a language's code:

# List all available and search
unimorph list --available | grep -i finnish
# Output: fin

# Or use the SIL database
# https://iso639-3.sil.org/code_tables/639/data

Setting Up Aliases

Create shortcuts for frequently used languages:

# ~/.config/unimorph/config.toml
[languages]
hebrew = "heb"
spanish = "spa"
german = "deu"
finnish = "fin"

Then use:

unimorph inflect -l hebrew כתב
# Resolves to: unimorph inflect -l heb כתב
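Conceptually, alias resolution is a dictionary lookup that falls back to the input. The sketch below mirrors the [languages] table from the config example. `resolve_language` is a hypothetical helper, and how the CLI actually resolves aliases is an assumption.

```python
# Mirrors the [languages] table from the config example above.
ALIASES = {"hebrew": "heb", "spanish": "spa", "german": "deu", "finnish": "fin"}


def resolve_language(name: str, aliases: dict[str, str]) -> str:
    """Return the code for a known alias, or the input unchanged."""
    return aliases.get(name, name)


print(resolve_language("hebrew", ALIASES))  # heb
print(resolve_language("heb", ALIASES))     # heb
```

Falling back to the input means plain ISO 639-3 codes keep working alongside the aliases.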

Contributing Languages

To contribute to a language or add a new one:

Visit the language repository on GitHub
Check existing issues
Submit corrections or additions via pull request

See the UniMorph contribution guidelines for more information.

Contributing

Thank you for your interest in contributing to unimorph-rs!

Getting Started

Prerequisites

Rust (latest stable)
Git

Clone and Build

git clone https://github.com/joshrotenberg/unimorph-rs
cd unimorph-rs
cargo build

Run Tests

cargo test --all-features

Run Lints

cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings

Project Structure

unimorph-rs/
├── crates/
│   ├── unimorph-core/     # Core library
│   │   ├── src/
│   │   │   ├── lib.rs
│   │   │   ├── types.rs   # Core types
│   │   │   ├── store.rs   # SQLite backend
│   │   │   ├── query.rs   # Query builder
│   │   │   ├── repository.rs
│   │   │   └── export.rs
│   │   └── Cargo.toml
│   │
│   └── unimorph-cli/      # CLI application
│       ├── src/
│       │   ├── main.rs
│       │   ├── commands/  # Command implementations
│       │   ├── config.rs
│       │   └── colors.rs
│       └── Cargo.toml
│
├── docs/                   # mdBook documentation
│   ├── book.toml
│   └── src/
│
└── Cargo.toml             # Workspace root

Making Changes

Creating a Branch

git checkout -b feat/your-feature
# or
git checkout -b fix/your-fix

Commit Messages

Use conventional commits:

feat: add new feature
fix: resolve bug in X
docs: update documentation
test: add tests for Y
refactor: restructure Z

Pull Requests

Fork the repository
Create a feature branch
Make your changes
Run tests and lints
Submit a pull request

Development Guidelines

Code Style

Follow Rust idioms
Use rustfmt for formatting
Address all clippy warnings
Document public APIs

Testing

Add tests for new features
Maintain test coverage
Use meaningful test names

#[test]
fn inflect_returns_all_forms() {
    // ...
}

Error Handling

Use thiserror for library errors
Use anyhow for CLI errors
Provide helpful error messages

Documentation

Document public items
Include examples in doc comments
Update mdBook docs for user-facing changes

Areas for Contribution

Code

Additional export formats
Performance optimizations
New query capabilities
Language-specific features

Documentation

Fix typos
Improve examples
Add tutorials
Translate documentation

Testing

Add edge case tests
Improve test coverage
Add integration tests

Code of Conduct

Be respectful and constructive. We welcome contributors of all experience levels.

Getting Help

Open a GitHub issue for bugs
Use discussions for questions
Check existing issues before creating new ones

License

Contributions are licensed under the same terms as the project (MIT/Apache-2.0).

UniMorph-rs