Introduction

unimorph-rs is a complete Rust toolkit for working with UniMorph morphological data. It provides both a command-line interface and a Rust library for downloading, querying, and analyzing morphological inflection data across 180+ languages.

What is UniMorph?

UniMorph is a collaborative project providing morphological paradigms for the world's languages. Each language dataset contains entries mapping lemmas (dictionary forms) to their inflected forms along with morphological feature annotations.

For example, in Spanish:

Lemma    Form       Features
hablar   hablo      V;IND;PRS;1;SG
hablar   hablas     V;IND;PRS;2;SG
hablar   habla      V;IND;PRS;3;SG
hablar   hablamos   V;IND;PRS;1;PL
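
Each row is a tab-separated triple whose third field is a semicolon-delimited feature bundle. The following is a minimal shell sketch of splitting one such entry. parse_entry is a hypothetical helper, not part of unimorph, and treating the first tag as the part of speech is an assumption based on the examples in this guide.

```shell
# parse_entry LINE
# Split one UniMorph entry (lemma<TAB>form<TAB>features) into its parts.
# Hypothetical helper for illustration; not part of the unimorph CLI.
parse_entry() {
  printf '%s\n' "$1" | awk -F '\t' '{
    n = split($3, tags, ";")   # feature bundle -> individual tags
    # Assumption: the first tag carries the part of speech.
    printf "lemma=%s form=%s pos=%s tags=%d\n", $1, $2, tags[1], n
  }'
}

parse_entry "$(printf 'hablar\thablamos\tV;IND;PRS;1;PL')"
```

The same splitting applies to the TSV output of the query commands, which use the same three columns.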

Features

  • Fast lookups: SQLite-backed storage with indexed queries
  • 180+ languages: Access to all UniMorph language datasets
  • Transparent decompression: Handles .xz, .gz, and .zip compressed datasets automatically
  • Flexible querying: Search by lemma, form, features, or part of speech
  • Multiple output formats: Table, JSON, TSV for scripting
  • Pipe-friendly: Output designed for Unix pipelines
  • Offline-first: Data cached locally after download
  • Library + CLI: Use as a Rust library or command-line tool

Use Cases

  • Language learners: Look up conjugations and declensions
  • NLP researchers: Training data for morphological models
  • Lexicographers: Verify inflection paradigms
  • Educators: Build conjugation practice tools
  • Linguists: Cross-linguistic morphological analysis

Quick Example

# Download Hebrew dataset
unimorph download heb

# Look up all forms of a verb
unimorph inflect -l heb כתב

# Analyze a surface form
unimorph analyze -l heb כתבתי

# Search for plural masculine forms
unimorph search -l heb --contains PL,MASC --limit 10

Getting Started

Head to the Installation guide to get started, or jump straight to the Quick Start for a hands-on introduction.

Installation

Command-Line Tool

Homebrew (macOS/Linux)

brew tap joshrotenberg/brew
brew install unimorph

Cargo (from crates.io)

If you have Rust installed:

cargo install unimorph

Docker

Pull the image from GitHub Container Registry:

docker pull ghcr.io/joshrotenberg/unimorph-rs:latest

Run with a persistent data cache:

# Download a dataset
docker run -v ~/.cache/unimorph:/data ghcr.io/joshrotenberg/unimorph-rs download spa

# Query the data
docker run -v ~/.cache/unimorph:/data ghcr.io/joshrotenberg/unimorph-rs inflect -l spa hablar

# Export data
docker run -v ~/.cache/unimorph:/data -v "$(pwd)":/output ghcr.io/joshrotenberg/unimorph-rs \
    export -l spa --format jsonl -o /output/spanish.jsonl

You can also create a shell alias for convenience:

alias unimorph='docker run -v ~/.cache/unimorph:/data ghcr.io/joshrotenberg/unimorph-rs'

From Source

git clone https://github.com/joshrotenberg/unimorph-rs
cd unimorph-rs
cargo install --path crates/unimorph-cli  # directory still named unimorph-cli

Rust Library

Add to your Cargo.toml:

[dependencies]
unimorph-core = "0.1"

Or with cargo:

cargo add unimorph-core

Shell Completions

Generate completions for your shell:

# Bash
unimorph completions bash > ~/.local/share/bash-completion/completions/unimorph

# Zsh
unimorph completions zsh > ~/.zfunc/_unimorph

# Fish
unimorph completions fish > ~/.config/fish/completions/unimorph.fish

# PowerShell
unimorph completions powershell > _unimorph.ps1

For Zsh, ensure ~/.zfunc is in your fpath:

# Add to ~/.zshrc before compinit
fpath=(~/.zfunc $fpath)
autoload -Uz compinit && compinit

Verifying Installation

unimorph --version
unimorph --help

Data Storage

By default, unimorph stores data in:

  • Linux/macOS: ~/.cache/unimorph/
  • Custom: Set UNIMORPH_DATA environment variable or use --data-dir

Configuration is stored in:

  • All platforms: ~/.config/unimorph/config.toml

Quick Start

This guide will get you up and running with unimorph in under 5 minutes.

Download Your First Language

Let's start by downloading a language dataset. We'll use Hebrew (heb) as an example:

unimorph download heb

You'll see output like:

Downloading heb...
Downloaded 33177 entries for heb

Look Up Inflections

Now let's look up all the forms of a Hebrew verb. The inflect command takes a lemma (dictionary form) and shows all its inflected forms:

unimorph inflect -l heb כתב

Output:

LEMMA        FORM         FEATURES
------------------------------------------------------------
כתב         אכתוב       V;1;SG;FUT
כתב         יכתבו       V;3;PL;FUT;MASC
כתב         יכתוב       V;3;SG;FUT;MASC
כתב         כותב        V;SG;PRS;MASC
כתב         כתב         V;3;SG;PST;MASC
...

29 form(s) found.

Analyze a Surface Form

What if you have a word and want to know what it is? Use analyze:

unimorph analyze -l heb כתבתי

Output:

FORM         LEMMA        FEATURES
------------------------------------------------------------
כתבתי       כתב         V;1;SG;PST

1 analysis(es) found.

Search with Filters

Find entries matching specific criteria:

# Find all first person singular future forms
unimorph search -l heb --contains 1,SG,FUT --limit 5

# Find verbs (part of speech = V)
unimorph search -l heb --pos V --limit 5

# Search by lemma pattern (SQL LIKE wildcards)
unimorph search -l heb --lemma "כת%" --limit 5

Check Dataset Statistics

unimorph stats heb
Statistics for heb:
  Total entries:    33177
  Unique lemmas:    1176
  Unique forms:     27286
  Unique features:  55
  Imported at:      2024-01-15 10:30:00 UTC

Set a Default Language

Tired of typing -l heb every time? Set a default:

export UNIMORPH_LANG=heb

Or create a config file:

unimorph config init

Then edit ~/.config/unimorph/config.toml:

default_lang = "heb"

Now you can just run:

unimorph inflect כתב
unimorph analyze כתבתי

Output Formats

JSON Output

Add --json for machine-readable output:

unimorph inflect -l heb כתב --json

TSV for Piping

Use --tsv for tab-separated output without headers:

unimorph inflect -l heb כתב --tsv | head -5
כתב	אכתוב	V;1;SG;FUT
כתב	יכתבו	V;3;PL;FUT;MASC
כתב	יכתוב	V;3;SG;FUT;MASC
כתב	כותב	V;SG;PRS;MASC
כתב	כותבות	V;PL;PRS;FEM

Export Full Dataset

Export an entire language to a file:

unimorph export -l heb -o hebrew.tsv
unimorph export -l heb -o hebrew.jsonl --format jsonl

Or to stdout for piping:

unimorph export -l heb -o - | grep "FUT" | wc -l

Configuration

unimorph can be configured through environment variables, a config file, or command-line flags. Settings are applied in this priority order (highest to lowest):

  1. Command-line flags
  2. Environment variables
  3. Config file
  4. Built-in defaults
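
The priority order above amounts to "first non-empty value wins". A minimal sketch of that rule follows; resolve_setting is a hypothetical helper written for illustration and is not how unimorph itself is implemented.

```shell
# resolve_setting FLAG ENV CONFIG DEFAULT
# Print the first non-empty value, mirroring the documented priority order.
# Hypothetical helper for illustration; not the tool's actual code.
resolve_setting() {
  for candidate in "$1" "$2" "$3" "$4"; do
    if [ -n "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1   # nothing was set at any level
}

resolve_setting "" "heb" "spa" ""   # no flag, env=heb, config=spa; prints: heb
```

Under this rule, a value exported as UNIMORPH_LANG would override default_lang from the config file, and an explicit flag would override both.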

Config File

The config file is located at ~/.config/unimorph/config.toml on all platforms.

Creating a Config File

# Create a config file with example content
unimorph config init

# View current configuration
unimorph config show

# Show config file path
unimorph config path

Config File Format

# Default language for commands (ISO 639-3 code)
default_lang = "heb"

# Custom data directory (default: ~/.cache/unimorph)
# data_dir = "/path/to/custom/data"

# Default output format: "table", "json", or "tsv"
# output_format = "table"

# Disable colored output
# no_color = true

# Language aliases for convenience
[languages]
hebrew = "heb"
spanish = "spa"
german = "deu"
finnish = "fin"

Language Aliases

Define shortcuts for language codes:

[languages]
he = "heb"
it = "ita"
de = "deu"

Then use:

unimorph inflect -l he כתב
# Resolves to: unimorph inflect -l heb כתב
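
Alias resolution can be pictured as a plain key lookup that falls back to the input when no alias matches. The sketch below assumes alias lines in the `name = "code"` form shown above; resolve_alias is a hypothetical helper, and unimorph's own lookup may differ.

```shell
# resolve_alias NAME < alias-lines
# Print the code mapped to NAME, or NAME itself when no alias matches.
# Hypothetical helper for illustration; not part of the unimorph CLI.
resolve_alias() {
  awk -v name="$1" '
    {
      line = $0
      gsub(/[ "]/, "", line)            # strip spaces and quotes
      eq = index(line, "=")
      if (eq > 0 && substr(line, 1, eq - 1) == name) {
        print substr(line, eq + 1)
        found = 1
        exit
      }
    }
    END { if (!found) print name }      # unknown names pass through unchanged
  '
}

printf 'he = "heb"\nit = "ita"\nde = "deu"\n' | resolve_alias it   # prints: ita
```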

Environment Variables

Variable        Description              Example
UNIMORPH_LANG   Default language code    export UNIMORPH_LANG=heb
UNIMORPH_DATA   Custom data directory    export UNIMORPH_DATA=/data/unimorph
NO_COLOR        Disable colored output   export NO_COLOR=1

Command-Line Flags

Global flags available on all commands:

Flag                    Description
-d, --data-dir <PATH>   Custom data directory
-v, --verbose           Enable debug output (-vv for trace)
-q, --quiet             Suppress non-essential output

Data Storage

Default Locations

  • Dataset database: ~/.cache/unimorph/datasets.db
  • API cache: ~/.cache/unimorph/available_languages.json
  • Config file: ~/.config/unimorph/config.toml

Custom Data Directory

Override the data directory:

# Via environment variable
export UNIMORPH_DATA=/custom/path
unimorph download heb

# Via command-line flag
unimorph --data-dir /custom/path download heb

# Via config file
# data_dir = "/custom/path"

Resetting Data

# Clear API response cache
unimorph repair --clear-cache

# Clear all downloaded datasets (requires re-download)
unimorph repair --clear-data

Output Modes

Table (Default)

Human-readable formatted output with colors when connected to a terminal:

unimorph inflect -l heb כתב

JSON

Machine-readable JSON output:

unimorph inflect -l heb כתב --json

TSV

Tab-separated values without headers, ideal for piping:

unimorph inflect -l heb כתב --tsv

Pipe Detection

When stdout is not a terminal (e.g., piped to another command), unimorph automatically outputs in a pipe-friendly format:

# Automatically outputs just language codes, one per line
unimorph list | xargs -I{} echo "Language: {}"
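
Scripts that wrap unimorph can apply the same idea with the standard `[ -t 1 ]` test, which succeeds only when stdout is a terminal. This is a sketch of the general technique; output_mode is a hypothetical helper, and unimorph's internal detection is not shown in this guide.

```shell
# output_mode
# Choose an output style based on whether stdout is a terminal.
# Hypothetical helper for illustration; not part of the unimorph CLI.
output_mode() {
  if [ -t 1 ]; then
    echo "table"   # interactive terminal: human-readable table
  else
    echo "plain"   # pipe or file: one record per line
  fi
}

output_mode | cat   # stdout is a pipe here, so this prints: plain
```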

CLI Overview

The unimorph command-line tool provides access to UniMorph morphological data through a set of intuitive subcommands.

Command Structure

unimorph [OPTIONS] <COMMAND> [ARGS]

Global Options

Option                  Description
-v, --verbose           Enable debug output (-vv for trace)
-q, --quiet             Suppress non-essential output
-d, --data-dir <PATH>   Custom data directory
-h, --help              Print help
-V, --version           Print version

Commands at a Glance

Command       Alias   Description
download      dl      Download a language dataset
list          ls      List available/cached languages
inflect       i       Look up all forms of a lemma
analyze       a       Analyze a surface form (reverse lookup)
search        s       Search entries with flexible filtering
stats         st      Show dataset statistics
info          in      Show detailed info about a language
export        x       Export dataset to file
update        up      Update cached datasets
features      f       Explore morphological features
delete        rm      Delete a cached dataset
repair        -       Repair or reset data store
config        cfg     Manage configuration
completions   -       Generate shell completions

Common Workflows

First-Time Setup

# See what languages are available
unimorph list --available

# Download a language
unimorph download heb

# Set as default (optional)
export UNIMORPH_LANG=heb

Looking Up Words

# All forms of a lemma
unimorph inflect -l heb כתב

# What lemma does this form come from?
unimorph analyze -l heb כתבתי

Searching

# By features
unimorph search -l heb --contains PL,MASC

# By part of speech
unimorph search -l heb --pos V --limit 20

# By lemma pattern
unimorph search -l heb --lemma "כת%"

Data Management

# Check for updates
unimorph update --all --check

# Update a specific language
unimorph update heb

# Export for external use
unimorph export -l heb -o hebrew.tsv

Output Formats

Most commands support multiple output formats:

Flag        Format   Use Case
(default)   Table    Human reading in terminal
--json      JSON     Machine parsing, APIs
--tsv       TSV      Piping to other tools

Examples

# Pretty table output
unimorph inflect -l heb כתב

# JSON for parsing
unimorph inflect -l heb כתב --json | jq '.[0]'

# TSV for piping
unimorph inflect -l heb כתב --tsv | cut -f2 | sort -u

Piping and Scripting

When output is piped (not a terminal), unimorph automatically uses pipe-friendly formats:

# Get all cached language codes
unimorph list | while read lang; do
  echo "Processing $lang..."
  unimorph stats "$lang"
done

# Export to stdout and filter
unimorph export -l heb -o - | grep "FUT" > future_forms.tsv

# Count forms per lemma
unimorph search -l heb --pos V --tsv --limit 1000 | cut -f1 | sort | uniq -c | sort -rn | head

Error Handling

Commands provide helpful error messages:

$ unimorph inflect כתב
Error: No language specified.

Provide a language code as an argument, or set a default:

  export UNIMORPH_LANG=heb

Or in ~/.config/unimorph/config.toml:

  default_lang = "heb"

Run 'unimorph list --available' to see available languages.

Getting Help

# General help
unimorph --help

# Command-specific help
unimorph inflect --help
unimorph search --help

Commands

This section provides detailed documentation for each unimorph command.

Data Management

  • download - Download language datasets from UniMorph
  • list - List available and cached languages
  • update - Update cached datasets to latest versions
  • delete - Remove cached datasets
  • repair - Repair or reset the data store
  • export - Export datasets to files

Querying

  • inflect - Look up all inflected forms of a lemma
  • analyze - Analyze a surface form (reverse lookup)
  • search - Search with flexible filtering
  • features - Explore morphological features

Information

  • stats - Show dataset statistics
  • info - Show detailed language info

Configuration

  • config - Manage configuration settings

download

Download a language dataset from UniMorph.

Alias: dl

Synopsis

unimorph download [OPTIONS] [LANG]

Description

Downloads a UniMorph language dataset from GitHub and imports it into the local SQLite database. Datasets are cached locally, so subsequent queries don't require network access.

If the dataset is already cached, this command does nothing unless --force is specified.

Arguments

Argument   Description
[LANG]     Language code (ISO 639-3, e.g., heb, ita, deu). Optional if UNIMORPH_LANG is set or configured.

Options

Option        Description
-f, --force   Force re-download even if cached
--json        Output as JSON
-q, --quiet   Suppress progress output

Examples

Basic Download

unimorph download heb
Downloading heb...
Downloaded 33177 entries for heb

Force Re-download

unimorph download heb --force

Quiet Mode

unimorph download heb --quiet

JSON Output

unimorph download heb --json
{
  "language": "heb",
  "entries": 33177,
  "status": "downloaded"
}

Download Multiple Languages

for lang in heb ita deu spa; do
  unimorph download "$lang"
done

With Default Language

export UNIMORPH_LANG=heb
unimorph download  # Downloads Hebrew

Verbose Output

Use -v for detailed import reporting:

unimorph download spa --force -v

This shows:

parsed downloaded data lang=spa filename=["spa"] compression=none from_lfs=false valid_entries=1196224 blank_lines=0 malformed=21
malformed entry lang=spa line=80710 reason=empty form
malformed entry lang=spa line=134234 reason=empty form
...
additional malformed entries not shown lang=spa additional=11

Understanding the Output

Field           Description
filename        Source file(s) downloaded
compression     Format: none, xz, gzip, or zip
from_lfs        Whether fetched via Git LFS (large files)
valid_entries   Successfully parsed entries
blank_lines     Empty lines skipped (not an error)
malformed       Entries that failed to parse

Malformed Entry Details

When entries fail to parse, the first 10 are logged with:

  • Line number: Where in the source file
  • Reason: Why it failed (e.g., "empty form", "expected at least 3 columns")

Common reasons for malformed entries:

  • empty form - The inflected form field is blank
  • empty lemma - The dictionary form field is blank
  • expected at least 3 columns - Line doesn't have lemma, form, and features

These indicate upstream data quality issues in the UniMorph repository.
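
To check a raw UniMorph file for the same problems before importing it, the checks can be approximated with awk. This is a sketch under assumptions: check_entries is a hypothetical helper, the reason strings simply mirror the ones listed above, and the order in which the real importer applies its checks is not documented here.

```shell
# check_entries < entries.tsv
# Report malformed lines in UniMorph TSV input, one "line=N reason=..." per problem.
# Hypothetical helper for illustration; not part of the unimorph CLI.
check_entries() {
  awk -F '\t' '
    $0 == ""  { next }   # blank lines are skipped, not errors
    NF < 3    { printf "line=%d reason=expected at least 3 columns\n", NR; next }
    $1 == ""  { printf "line=%d reason=empty lemma\n", NR; next }
    $2 == ""  { printf "line=%d reason=empty form\n", NR; next }
  '
}

# Line 2 has an empty form, line 3 is blank, line 4 has only two columns.
printf 'hablar\thablo\tV;IND;PRS;1;SG\nhablar\t\tV;IND;PRS;2;SG\n\nhablar\thabla\n' | check_entries
```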

Notes

  • Language codes are ISO 639-3 (3 lowercase letters)
  • Use unimorph list --available to see all available languages
  • Downloads are atomic: partial downloads won't corrupt your data
  • The first download creates the database at ~/.cache/unimorph/datasets.db
  • Compressed files: Large datasets (Polish, Czech, Ukrainian, Slovak) use .xz compression - handled automatically
  • Git LFS: Very large files (like Czech's full MorfFlex dataset) use Git LFS - also handled automatically

See Also

  • list - List available languages
  • update - Update existing downloads
  • delete - Remove downloaded data

list

List available and cached languages.

Alias: ls

Synopsis

unimorph list [OPTIONS]

Description

Lists UniMorph languages. By default, shows cached (downloaded) languages with entry counts. Use --available to fetch the full list of available languages from GitHub.

Options

Option        Description
--cached      Show only cached (downloaded) languages
--available   Fetch available languages from GitHub
--refresh     Refresh the cached list of available languages
--json        Output as JSON

Examples

List Cached Languages

unimorph list
Cached languages:
  fin (2737048 entries)
  heb (33177 entries)

Use 'unimorph list --available' to see all available languages.

List All Available Languages

unimorph list --available
Available languages (145 total, 2 cached):

  ady
  afb
  ain
  ...
  heb [cached]
  ...
  zul

Use 'unimorph download <code>' to download a language.

JSON Output

unimorph list --json
["fin", "heb"]

unimorph list --available --json
[
  {"code": "ady", "cached": false},
  {"code": "afb", "cached": false},
  ...
  {"code": "heb", "cached": true},
  ...
]

Refresh Available List

unimorph list --available --refresh

Forces a fresh fetch from GitHub (the list is normally cached for 24 hours).

Pipe-Friendly Output

When piped, outputs just language codes:

unimorph list | head -3
fin
heb

# Download all available languages
unimorph list --available | while read lang; do
  unimorph download "$lang"
done

See Also

  • download - Download a language
  • stats - Show statistics for a language

inflect

Look up all inflected forms of a lemma.

Alias: i

Synopsis

unimorph inflect [OPTIONS] <LEMMA>

Description

Given a lemma (dictionary form), returns all its inflected forms with their morphological features. This is the primary way to see a word's full paradigm.

Arguments

Argument   Description
<LEMMA>    The lemma (dictionary form) to look up

Options

Option                     Description
-l, --lang <LANG>          Language code (ISO 639-3)
-f, --features <PATTERN>   Filter by feature pattern (e.g., V;IND;*;SG)
--json                     Output as JSON
--tsv                      Output as TSV (tab-separated, no headers)

Examples

Basic Lookup

unimorph inflect -l heb כתב
LEMMA        FORM         FEATURES
------------------------------------------------------------
כתב         אכתוב       V;1;SG;FUT
כתב         יכתבו       V;3;PL;FUT;MASC
כתב         יכתוב       V;3;SG;FUT;MASC
כתב         כותב        V;SG;PRS;MASC
כתב         כתב         V;3;SG;PST;MASC
...

29 form(s) found.

Filter by Features

Use wildcards (*) to match any value at a position:

# Only singular forms
unimorph inflect -l heb כתב -f "V;*;SG;*"

# Only past tense
unimorph inflect -l heb כתב -f "V;*;*;PST;*"

JSON Output

unimorph inflect -l heb כתב --json
[
  {
    "lemma": "כתב",
    "form": "אכתוב",
    "features": {
      "raw": "V;1;SG;FUT",
      "features": ["V", "1", "SG", "FUT"]
    }
  },
  ...
]

TSV for Piping

unimorph inflect -l heb כתב --tsv
כתב	אכתוב	V;1;SG;FUT
כתב	יכתבו	V;3;PL;FUT;MASC
כתב	יכתוב	V;3;SG;FUT;MASC
...

Scripting Examples

# Get unique forms only
unimorph inflect -l heb כתב --tsv | cut -f2 | sort -u

# Count forms by tense
unimorph inflect -l heb כתב --tsv | cut -f3 | grep -o 'PST\|PRS\|FUT' | sort | uniq -c

# Find forms matching a pattern
unimorph inflect -l spa hablar --tsv | grep "1;SG"

Notes

  • The lemma must match exactly (case-sensitive for most languages)
  • Use search with --lemma for partial/wildcard matching
  • Returns empty results if the lemma doesn't exist in the dataset

See Also

  • analyze - Reverse lookup (form to lemma)
  • search - Flexible searching

analyze

Analyze a surface form (reverse lookup).

Alias: a

Synopsis

unimorph analyze [OPTIONS] <FORM>

Description

Given a surface form (inflected word), returns all possible analyses: the lemma it comes from and its morphological features. This is the reverse of inflect.

A form may have multiple analyses if it's ambiguous (e.g., same spelling for different lemmas or different grammatical analyses).
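
Conceptually, inflect filters entries on the lemma column and analyze filters on the form column. The sketch below shows that relationship on a few TSV rows taken from the examples in this guide. lookup is a hypothetical helper, and the real commands query the SQLite store rather than scanning a file.

```shell
# lookup COLUMN VALUE < entries.tsv
# Print entries whose COLUMN (1 = lemma, 2 = form) equals VALUE exactly.
# Hypothetical helper for illustration; not part of the unimorph CLI.
lookup() {
  awk -F '\t' -v col="$1" -v val="$2" '$col == val'
}

entries=$(printf 'כתב\tכתבתי\tV;1;SG;PST\nכתב\tכתבו\tV;3;PL;PST\nכתב\tכתבו\tV;2;PL;IMP;MASC\n')

printf '%s\n' "$entries" | lookup 1 כתב     # inflect-style: all three entries
printf '%s\n' "$entries" | lookup 2 כתבו    # analyze-style: two analyses of one form
```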

Arguments

Argument   Description
<FORM>     The surface form to analyze

Options

Option              Description
-l, --lang <LANG>   Language code (ISO 639-3)
--json              Output as JSON
--tsv               Output as TSV (tab-separated, no headers)

Examples

Basic Analysis

unimorph analyze -l heb כתבתי
FORM         LEMMA        FEATURES
------------------------------------------------------------
כתבתי       כתב         V;1;SG;PST

1 analysis(es) found.

Ambiguous Forms

Some forms have multiple possible analyses:

unimorph analyze -l heb כתבו
FORM         LEMMA        FEATURES
------------------------------------------------------------
כתבו        כתב         V;3;PL;PST
כתבו        כתב         V;2;PL;IMP;MASC

2 analysis(es) found.

JSON Output

unimorph analyze -l heb כתבתי --json
[
  {
    "lemma": "כתב",
    "form": "כתבתי",
    "features": {
      "raw": "V;1;SG;PST",
      "features": ["V", "1", "SG", "PST"]
    }
  }
]

TSV for Piping

unimorph analyze -l heb כתבתי --tsv
כתבתי	כתב	V;1;SG;PST

Form Not Found

unimorph analyze -l heb xyz
No analyses found for 'xyz'.

The form may not exist in the dataset, or it could be:
  - A proper noun or foreign word
  - A misspelling
  - A rare or archaic form

Scripting Examples

# Analyze words from a file
cat words.txt | while read word; do
  echo "=== $word ==="
  unimorph analyze -l heb "$word"
done

# Get just the lemma
unimorph analyze -l heb כתבתי --tsv | cut -f2

# Check if a word exists
if unimorph analyze -l heb כתבתי --tsv | grep -q .; then
  echo "Found"
fi

Notes

  • Analysis is case-sensitive for most languages
  • Forms must match exactly (no fuzzy matching)
  • Use search with --form for pattern matching

See Also

  • inflect - Forward lookup (lemma to forms)
  • search - Flexible searching

search

Search entries with flexible filtering.

Alias: s

Synopsis

unimorph search [OPTIONS]

Description

Search the dataset with flexible filtering by lemma, form, features, part of speech, and more. Supports wildcards and multiple filter combinations.

Options

Option                      Description
-l, --lang <LANG>           Language code (ISO 639-3)
--lemma <PATTERN>           Filter by lemma (supports SQL LIKE wildcards: % and _)
--form <PATTERN>            Filter by form (supports SQL LIKE wildcards)
-f, --features <PATTERN>    Filter by feature pattern (e.g., V;IND;*;1;*)
-c, --contains <FEATURES>   Filter by features contained (comma-separated, position-independent)
--pos <POS>                 Filter by part of speech (e.g., V, N, ADJ)
--limit <N>                 Limit number of results (default: 100)
--offset <N>                Skip first N results
--count                     Just show count of matching entries
--json                      Output as JSON
--tsv                       Output as TSV

Examples

Search by Lemma Pattern

# Lemmas starting with "כת"
unimorph search -l heb --lemma "כת%"

# Lemmas containing "בר"
unimorph search -l heb --lemma "%בר%"

# Exact 4-letter lemmas
unimorph search -l heb --lemma "____"

Search by Form Pattern

# Forms ending with "ים"
unimorph search -l heb --form "%ים"

Filter by Features (Position-Dependent)

Use semicolon-separated patterns with * as wildcard:

# First person singular verbs
unimorph search -l heb -f "V;1;SG;*"

# Past tense forms
unimorph search -l heb -f "V;*;*;PST;*"

Filter by Features (Position-Independent)

Use --contains for features that can be at any position:

# Plural masculine forms (regardless of position)
unimorph search -l heb --contains PL,MASC

# Future tense first person
unimorph search -l heb --contains FUT,1
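
Position-independent matching can be read as a subset test: every requested tag must occur somewhere in the entry's feature bundle. The following is a sketch of that reading, assuming whole-tag comparison. contains_all is a hypothetical helper and not the query unimorph runs internally.

```shell
# contains_all FEATURES TAGS
# Succeed when every comma-separated tag in TAGS occurs somewhere in the
# semicolon-separated FEATURES bundle.
# Hypothetical helper for illustration; not part of the unimorph CLI.
contains_all() {
  awk -v feats="$1" -v want="$2" 'BEGIN {
    n = split(feats, f, ";")
    for (i = 1; i <= n; i++) have[f[i]] = 1            # index the bundle by tag
    m = split(want, w, ",")
    for (j = 1; j <= m; j++) if (!(w[j] in have)) exit 1
    exit 0
  }'
}

contains_all "V;3;PL;FUT;MASC" "PL,MASC" && echo "match"      # prints: match
contains_all "V;1;SG;FUT" "PL,MASC" || echo "no match"        # prints: no match
```

Because tags are compared whole, a request for V in this sketch does not match a bundle that only carries V.MSDR.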

Filter by Part of Speech

# Only verbs
unimorph search -l heb --pos V

# Only nouns
unimorph search -l heb --pos N

Combine Filters

# Verbs with plural masculine future forms
unimorph search -l heb --pos V --contains PL,MASC,FUT

# Lemmas starting with "א" that are verbs
unimorph search -l heb --lemma "א%" --pos V

Pagination

# First 20 results
unimorph search -l heb --pos V --limit 20

# Results 21-40
unimorph search -l heb --pos V --limit 20 --offset 20
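
For a fixed page size, the offset for a given 1-based page is (page - 1) * limit. A small sketch of that arithmetic follows; page_offset is a hypothetical helper and not a unimorph feature.

```shell
# page_offset PAGE LIMIT
# Print the --offset value for a 1-based PAGE with LIMIT results per page.
# Hypothetical helper for illustration; not part of the unimorph CLI.
page_offset() {
  echo $(( ($1 - 1) * $2 ))
}

page_offset 2 20   # prints: 20 (results 21-40)
page_offset 3 20   # prints: 40 (results 41-60)
```

The result can be passed straight to the search command, for example with --offset "$(page_offset 3 20)".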

Count Only

unimorph search -l heb --pos V --count
15234 entries match.

Output Formats

# JSON
unimorph search -l heb --pos V --limit 5 --json

# TSV for piping
unimorph search -l heb --pos V --limit 5 --tsv

Scripting Examples

# Get unique lemmas for a part of speech
unimorph search -l heb --pos V --limit 10000 --tsv | cut -f1 | sort -u

# Count entries per lemma
unimorph search -l heb --pos V --limit 10000 --tsv | cut -f1 | sort | uniq -c | sort -rn | head

# Export filtered subset
unimorph search -l heb --contains FUT --tsv > future_forms.tsv

Wildcards Reference

SQL LIKE Wildcards (for --lemma and --form)

Pattern   Matches
%         Any sequence of characters
_         Any single character
abc%      Starts with "abc"
%abc      Ends with "abc"
%abc%     Contains "abc"
a_c       "a" + any char + "c"

Feature Pattern Wildcards (for -f)

Pattern    Matches
*          Any value at that position
V;*;SG;*   Verb, any person, singular, any tense

stats

Show dataset statistics.

Alias: st

Synopsis

unimorph stats [OPTIONS] [LANG]

Description

Displays statistics about a downloaded language dataset, including entry counts, unique lemmas, unique forms, and unique feature combinations.

Arguments

Argument   Description
[LANG]     Language code (ISO 639-3). Optional if default is configured.

Options

Option   Description
--json   Output as JSON

Examples

Basic Statistics

unimorph stats heb
Statistics for heb:
  Total entries:    33177
  Unique lemmas:    1176
  Unique forms:     27286
  Unique features:  55
  Imported at:      2024-01-15 10:30:00 UTC

JSON Output

unimorph stats heb --json
{
  "total_entries": 33177,
  "unique_lemmas": 1176,
  "unique_forms": 27286,
  "unique_features": 55
}

Compare Languages

for lang in heb ita fin deu; do
  echo "=== $lang ==="
  unimorph stats "$lang"
  echo
done

Scripting

# Get entry count
unimorph stats heb --json | jq '.total_entries'

# Compare sizes
unimorph list | while read lang; do
  count=$(unimorph stats "$lang" --json | jq '.total_entries')
  echo "$lang: $count"
done | sort -t: -k2 -rn

Understanding the Statistics

Metric            Description
Total entries     Number of (lemma, form, features) triples
Unique lemmas     Number of distinct dictionary forms
Unique forms      Number of distinct surface forms
Unique features   Number of distinct feature bundle combinations
Imported at       When the dataset was downloaded
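
The four counts can be reproduced from any TSV export, which is a handy cross-check. The sketch below assumes three-column TSV input as produced by export. tsv_stats is a hypothetical helper, and the sample rows come from the Hebrew examples in this guide.

```shell
# tsv_stats < entries.tsv
# Count entries plus distinct lemmas, forms, and feature bundles.
# Hypothetical helper for illustration; not part of the unimorph CLI.
tsv_stats() {
  awk -F '\t' '
    NF >= 3 {
      total++
      if (!($1 in lemmas)) { lemmas[$1] = 1; nl++ }
      if (!($2 in forms))  { forms[$2] = 1;  nf++ }
      if (!($3 in feats))  { feats[$3] = 1;  nt++ }
    }
    END {
      printf "total_entries=%d unique_lemmas=%d unique_forms=%d unique_features=%d\n", total, nl, nf, nt
    }
  '
}

# Three entries, one lemma, two distinct forms (one is ambiguous), three bundles.
printf 'כתב\tכתבתי\tV;1;SG;PST\nכתב\tכתבו\tV;3;PL;PST\nכתב\tכתבו\tV;2;PL;IMP;MASC\n' | tsv_stats
```

Unique forms can be lower than total entries because one surface form may carry several analyses.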

See Also

  • info - More detailed language information
  • list - List all cached languages

info

Show detailed info about a cached language.

Alias: in

Synopsis

unimorph info [OPTIONS] [LANG]

Description

Displays detailed information about a downloaded language dataset, including source URL, local and remote commit information, update status, and statistics.

Arguments

Argument   Description
[LANG]     Language code (ISO 639-3). Optional if default is configured.

Options

Option   Description
--json   Output as JSON

Examples

Basic Info

unimorph info heb
Language: heb
Source: https://github.com/unimorph/heb

Local imported:  2024-01-15 10:30:00 UTC
Local commit:    b2bff12
Remote commit:   b2bff12 (2023-01-09)

Status: Up to date

Statistics:
  Total entries:   33177
  Unique lemmas:   1176
  Unique forms:    27286
  Unique features: 55

Update Available

unimorph info heb
Language: heb
Source: https://github.com/unimorph/heb

Local imported:  2024-01-15 10:30:00 UTC
Local commit:    b2bff12
Remote commit:   c4d8e23 (2024-02-01)

Status: Update available

Statistics:
  Total entries:   33177
  Unique lemmas:   1176
  Unique forms:    27286
  Unique features: 55

JSON Output

unimorph info heb --json
{
  "language": "heb",
  "source": "https://github.com/unimorph/heb",
  "local_commit": "b2bff12",
  "remote_commit": "c4d8e23",
  "imported_at": "2024-01-15T10:30:00Z",
  "update_available": true,
  "stats": {
    "total_entries": 33177,
    "unique_lemmas": 1176,
    "unique_forms": 27286,
    "unique_features": 55
  }
}

export

Export a language dataset to file.

Alias: x

Synopsis

unimorph export [OPTIONS]

Description

Exports a downloaded language dataset to a file in TSV or JSONL format. Useful for integrating with other tools, creating backups, or processing data with external programs.

Options

Option                  Description
-l, --lang <LANG>       Language code (ISO 639-3)
-o, --output <PATH>     Output file path (use - for stdout)
-F, --format <FORMAT>   Output format: tsv or jsonl (auto-detected from extension)

Examples

Export to TSV

unimorph export -l heb -o hebrew.tsv
Exported 33177 entries to hebrew.tsv

Export to JSONL

unimorph export -l heb -o hebrew.jsonl

Or explicitly specify format:

unimorph export -l heb -o hebrew.json --format jsonl

Export to Stdout

Use -o - to write to stdout:

unimorph export -l heb -o - --format tsv | head -5
איבד	אאבד	V;1;SG;FUT
איבזר	אאבזר	V;1;SG;FUT
איבטח	אאבטח	V;1;SG;FUT
האביס	אאביס	V;1;SG;FUT
אבל	אאבל	V;1;SG;FUT

The status message goes to stderr, so piping works correctly:

unimorph export -l heb -o - 2>/dev/null | wc -l
33177

Scripting Examples

# Filter exported data
unimorph export -l heb -o - | grep "FUT" > future_forms.tsv

# Export and compress
unimorph export -l heb -o - | gzip > hebrew.tsv.gz

# Export multiple languages
for lang in heb ita deu; do
  unimorph export -l "$lang" -o "${lang}.tsv"
done

# Convert to CSV
unimorph export -l heb -o - | tr '\t' ',' > hebrew.csv

Output Formats

TSV (Tab-Separated Values)

lemma<TAB>form<TAB>features

Example:

hablar	hablo	V;IND;PRS;1;SG
hablar	hablas	V;IND;PRS;2;SG

JSONL (JSON Lines)

One JSON object per line:

{"lemma":"hablar","form":"hablo","features":"V;IND;PRS;1;SG"}
{"lemma":"hablar","form":"hablas","features":"V;IND;PRS;2;SG"}

Notes

  • Format is auto-detected from file extension (.tsv or .jsonl)
  • Use --format to override auto-detection
  • Stdout export writes status to stderr to avoid polluting data

See Also

  • search - Export filtered subsets with --tsv
  • download - Download datasets

update

Update cached language datasets.

Alias: up

Synopsis

unimorph update [OPTIONS] [LANG]

Description

Checks for and downloads updates to cached language datasets. Can update a single language or all cached languages at once.

Arguments

Argument   Description
[LANG]     Language code to update. Omit when using --all.

Options

Option    Description
--all     Update all cached languages
--check   Check for updates without downloading
--json    Output as JSON

Examples

Check for Updates

unimorph update heb --check
Checking for updates...

  heb - update available

Or if up to date:

Checking for updates...

  heb - up to date

Update a Single Language

unimorph update heb
Updating heb...
Updated heb: 33177 -> 33250 entries

Check All Languages

unimorph update --all --check
Checking for updates...

  fin - up to date
  heb - update available
  ita - up to date

1 update(s) available.

Update All Languages

unimorph update --all
Updating all cached languages...

  fin - up to date
  heb - updated (33177 -> 33250 entries)
  ita - up to date

1 language(s) updated.

JSON Output

unimorph update --all --check --json
{
  "languages": [
    {"code": "fin", "update_available": false},
    {"code": "heb", "update_available": true},
    {"code": "ita", "update_available": false}
  ],
  "updates_available": 1
}

Scripting

# Check and update only if needed
if unimorph update heb --check --json | jq -e '.update_available' > /dev/null; then
  unimorph update heb
fi

features

Explore morphological features in a language.

Alias: f

Synopsis

unimorph features [OPTIONS]

Description

Explore the morphological features used in a language dataset. View unique feature values, their frequencies, search for entries with specific features, or analyze feature positions.

Options

Option               Description
-l, --lang <LANG>    Language code (ISO 639-3)
--list               List all unique feature values
--stats              Show feature value counts (histogram)
--search <FEATURE>   Search for entries containing a specific feature
--position <N>       Show values at a specific position (0-indexed)
--limit <N>          Limit number of results (default: 50)
--json               Output as JSON

Examples

Feature Structure Overview

unimorph features -l heb
Feature structure for heb:

  Position 0: 3 unique values (e.g., V, N, V.MSDR)
  Position 1: 6 unique values (e.g., 2, 3, 1)
  Position 2: 6 unique values (e.g., SG, PL, PRS)
  Position 3: 11 unique values (e.g., FUT, PST, IMP)
  Position 4: 2 unique values (e.g., FEM, MASC)

Use --list for all unique values, --stats for counts, --search <FEATURE> to find entries.

List All Features

unimorph features -l heb --list
Unique features in heb:

  1
  2
  3
  DEF
  FEM
  FUT
  IMP
  MASC
  N
  ...

24 unique feature values.

Feature Statistics

unimorph features -l heb --stats
Feature statistics for heb:

FEATURE              COUNT
----------------------------------------
V                    28663
SG                   16226
PL                   15158
FEM                  12384
MASC                 12384
2                    12108
FUT                  10400
PST                  9378
3                    7286
1                    4164
... and 14 more

Search by Feature

unimorph features -l heb --search FUT --limit 5
Entries with feature 'FUT':

LEMMA                FORM                 FEATURES
------------------------------------------------------------
איבד                 אאבד                 V;1;SG;FUT
איבזר                אאבזר                V;1;SG;FUT
איבטח                אאבטח                V;1;SG;FUT
האביס                אאביס                V;1;SG;FUT
אבל                  אאבל                 V;1;SG;FUT

Showing 5 of 10400 results.

Analyze Feature Position

unimorph features -l heb --position 0
Feature values at position 0 in heb:

VALUE                COUNT
----------------------------------------
V                    28663
N                    3338
V.MSDR               1176
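
The --stats and --position views are both frequency counts over the semicolon-separated feature bundles. A minimal standalone sketch of that counting follows; the function names are hypothetical, it uses only the standard library, and it is not the tool's actual implementation.

```rust
use std::collections::HashMap;

/// Hypothetical helper: counts every feature across all bundles,
/// most frequent first (ties broken alphabetically).
fn feature_counts(bundles: &[&str]) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for bundle in bundles {
        for feature in bundle.split(';') {
            *counts.entry(feature).or_insert(0) += 1;
        }
    }
    sort_rows(counts)
}

/// Hypothetical helper: counts only the feature found at a 0-indexed position.
/// Bundles that are too short to have that position are skipped.
fn position_counts(bundles: &[&str], position: usize) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for bundle in bundles {
        if let Some(feature) = bundle.split(';').nth(position) {
            *counts.entry(feature).or_insert(0) += 1;
        }
    }
    sort_rows(counts)
}

/// Sorts by descending count, then by feature name.
fn sort_rows(counts: HashMap<&str, usize>) -> Vec<(String, usize)> {
    let mut rows: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(feature, n)| (feature.to_string(), n))
        .collect();
    rows.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    rows
}
```

Calling position_counts with position 0 over a dataset's bundles would give the part-of-speech breakdown shown above, assuming the part of speech always comes first.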

JSON Output

unimorph features -l heb --stats --json
{
  "V": 28663,
  "SG": 16226,
  "PL": 15158,
  ...
}

Pipe-Friendly Output

When the output is piped, the command prints a clean format without headers or decoration:

# Get just feature names
unimorph features -l heb --list | head -5
1
2
3
DEF
FEM

# Feature counts as TSV
unimorph features -l heb --stats | head -5
V	28663
SG	16226
PL	15158
FEM	12384
MASC	12384

Use Cases

  • Understanding a language: See what features are used
  • Finding examples: Search for entries with specific features
  • Data exploration: Analyze feature distribution
  • Building queries: Discover feature names for search filters

delete

Delete a cached language dataset.

Alias: rm

Synopsis

unimorph delete [OPTIONS] [LANG]

Description

Removes a downloaded language dataset from the local cache. The data can be re-downloaded later with unimorph download.

Arguments

Argument  Description
[LANG]    Language code (ISO 639-3). Optional if default is configured.

Options

Option  Description
--json  Output as JSON

Examples

Delete a Language

unimorph delete heb
Deleted heb (33177 entries removed)

JSON Output

unimorph delete heb --json
{
  "language": "heb",
  "entries_removed": 33177,
  "status": "deleted"
}

Delete Multiple Languages

for lang in heb ita deu; do
  unimorph delete "$lang"
done

Notes

  • This only removes the data from the local cache
  • Statistics and metadata are also removed
  • Re-download anytime with unimorph download
  • Use unimorph repair --clear-data to delete all languages at once

repair

Repair or reset the local data store.

Synopsis

unimorph repair [OPTIONS]

Description

Utility command for troubleshooting and resetting the local data store. Can clear the API response cache or all downloaded datasets.

Options

Option         Description
--clear-cache  Clear cached API responses
--clear-data   Clear all downloaded datasets (requires re-download)
--json         Output as JSON

Examples

Clear API Cache

Clears the cached list of available languages (normally cached for 24 hours):

unimorph repair --clear-cache
Cleared API cache

Clear All Data

Removes all downloaded language datasets:

unimorph repair --clear-data
Cleared all data (5 languages removed)

Clear Both

unimorph repair --clear-cache --clear-data

JSON Output

unimorph repair --clear-data --json
{
  "cache_cleared": false,
  "data_cleared": true,
  "languages_removed": 5
}

Use Cases

  • Corrupted data: If queries return unexpected results
  • Stale cache: If available language list seems outdated
  • Disk space: Remove all data to free space
  • Fresh start: Reset everything to initial state

Notes

  • --clear-cache only removes API response cache, not datasets
  • --clear-data removes all downloaded languages
  • Data can be re-downloaded with unimorph download

sample

Randomly sample entries from a language dataset.

Alias: rand

Synopsis

unimorph sample [OPTIONS] <N>

Description

Samples random entries from a downloaded language dataset. Useful for exploring data, creating test sets, or getting a quick overview of a language's morphology.

Arguments

Argument  Description
<N>       Number of entries to sample

Options

Option             Description
-l, --lang <LANG>  Language code (ISO 639-3)
-s, --seed <SEED>  Seed for reproducible sampling
--by-lemma         Sample complete paradigms instead of random entries
--json             Output as JSON
--tsv              Output as TSV (tab-separated, no headers)

Examples

Random Entries

unimorph sample -l spa 5
LEMMA        FORM         FEATURES
------------------------------------------------------------
tapiar       tapiemos     V;SBJV;PRS;1;PL
apilar       apilando     V;V.CVB;PRS
hablar       hablaste     V;IND;PST;PFV;2;SG;INFM
comer        comieron     V;IND;PST;PFV;3;PL
vivir        viviremos    V;IND;FUT;1;PL

5 sampled entry(ies).

Sample Complete Paradigms

Use --by-lemma to get all forms of randomly selected lemmas:

unimorph sample -l spa 2 --by-lemma

This returns complete paradigms for 2 random lemmas, showing all their inflected forms.

Reproducible Sampling

Use --seed for reproducible results:

unimorph sample -l spa 5 --seed 42

Running with the same seed always returns the same entries.
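
Seeded sampling is reproducible because the seed fully determines the random number stream. The sketch below illustrates the idea with a small SplitMix64 generator and a partial Fisher-Yates shuffle. Both the generator choice and the sample_indices name are assumptions made for illustration; the tool's actual sampling method is not documented here.

```rust
/// SplitMix64: a tiny deterministic PRNG, used here only for illustration.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical helper: picks `n` distinct indices out of `0..total`.
/// The same `(total, n, seed)` always yields the same indices.
fn sample_indices(total: usize, n: usize, seed: u64) -> Vec<usize> {
    let mut rng = SplitMix64(seed);
    let mut indices: Vec<usize> = (0..total).collect();
    let n = n.min(total);

    // Partial Fisher-Yates: only the first `n` slots need to be shuffled.
    for i in 0..n {
        // Modulo introduces a slight bias, which is acceptable for a sketch.
        let j = i + (rng.next_u64() % (total - i) as u64) as usize;
        indices.swap(i, j);
    }

    indices.truncate(n);
    indices
}
```

Mapping the returned indices onto dataset rows gives the sampled entries. Asking for more entries than exist simply returns every index once.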

JSON Output

unimorph sample -l spa 3 --json
[
  {
    "lemma": "hablar",
    "form": "hablamos",
    "features": {
      "raw": "V;IND;PRS;1;PL",
      "features": ["V", "IND", "PRS", "1", "PL"]
    }
  },
  ...
]

TSV for Scripting

unimorph sample -l spa 10 --tsv > sample.tsv

Scripting Examples

# Create a test set
unimorph sample -l spa 100 --seed 123 --tsv > test_set.tsv

# Sample paradigms for flashcard generation
unimorph sample -l spa 10 --by-lemma --json > flashcards.json

# Get random verbs only (features are the third TSV column)
unimorph sample -l spa 50 --tsv | awk -F'\t' '$3 ~ /^V;/' | head -10

Notes

  • Without --seed, results are different each run
  • --by-lemma returns more entries than N (all forms of N lemmas)
  • Large N values may take longer for big datasets

See Also

  • search - Find specific entries
  • inflect - Look up forms for a known lemma

config

Manage configuration.

Alias: cfg

Synopsis

unimorph config <COMMAND>

Subcommands

Command  Description
show     Show current configuration
init     Initialize a new config file
path     Show the config file path

config show

Display the current configuration, including both config file settings and defaults.

unimorph config show
Configuration

  Path: /home/user/.config/unimorph/config.toml
  Status: loaded

Current Settings

  default_lang: heb
  data_dir: (default)
  output_format: (default: table)
  no_color: (not set)

JSON Output

unimorph config show --json
{
  "path": "/home/user/.config/unimorph/config.toml",
  "exists": true,
  "default_lang": "heb",
  "data_dir": null,
  "output_format": null,
  "no_color": null
}

config init

Create a new config file with example content.

unimorph config init
Created config file at /home/user/.config/unimorph/config.toml

Force Overwrite

unimorph config init --force

Overwrites existing config file.

JSON Output

unimorph config init --json
{
  "path": "/home/user/.config/unimorph/config.toml",
  "created": true
}

config path

Show the config file path.

unimorph config path
/home/user/.config/unimorph/config.toml

JSON Output

unimorph config path --json
{
  "path": "/home/user/.config/unimorph/config.toml"
}

Config File Format

The config file uses TOML format:

# Default language for commands
default_lang = "heb"

# Custom data directory
# data_dir = "/custom/path"

# Default output format: "table", "json", or "tsv"
# output_format = "table"

# Disable colored output
# no_color = true

# Language aliases
[languages]
hebrew = "heb"
spanish = "spa"

completions

Generate shell completions for your shell.

Synopsis

unimorph completions <SHELL>

Description

Generates shell completion scripts that enable tab-completion for unimorph commands, options, and arguments.

Arguments

Argument  Description
<SHELL>   Shell to generate completions for: bash, zsh, fish, elvish, powershell

Installation

Bash

# Add to ~/.bashrc
source <(unimorph completions bash)

# Or save to a file
unimorph completions bash > ~/.local/share/bash-completion/completions/unimorph

Zsh

# Add to ~/.zshrc (before compinit)
source <(unimorph completions zsh)

# Or save to fpath
unimorph completions zsh > ~/.zfunc/_unimorph
# Then add to ~/.zshrc: fpath=(~/.zfunc $fpath)

Fish

unimorph completions fish > ~/.config/fish/completions/unimorph.fish

PowerShell

# Add to your PowerShell profile
unimorph completions powershell | Out-String | Invoke-Expression

# Or save to a file and source it
unimorph completions powershell > unimorph.ps1

Elvish

unimorph completions elvish > ~/.elvish/lib/unimorph.elv
# Then add to ~/.elvish/rc.elv: use unimorph

Examples

After installation, you can use tab completion:

# Complete commands
unimorph inf<TAB>  # completes to 'inflect'

# Complete options
unimorph inflect --<TAB>  # shows available options

# Complete language codes (if supported by your shell)
unimorph inflect -l <TAB>

Notes

  • Restart your shell or source your config file after installation
  • Some completions may require a downloaded language list to work

Library Overview

The unimorph-core crate provides a Rust library for working with UniMorph morphological data. Use it to integrate morphological lookups into your own applications.

Installation

Add to your Cargo.toml:

[dependencies]
unimorph-core = "0.1"

Quick Example

use unimorph_core::{Repository, LangCode};

fn main() -> anyhow::Result<()> {
    // Create a repository (uses default cache directory)
    let repo = Repository::open_default()?;
    
    // Parse language code
    let lang: LangCode = "heb".parse()?;
    
    // Look up all forms of a lemma
    let forms = repo.store().inflect(&lang, "כתב")?;
    for entry in forms {
        println!("{} -> {} ({})", entry.lemma, entry.form, entry.features);
    }
    
    // Analyze a surface form
    let analyses = repo.store().analyze(&lang, "כתבתי")?;
    for entry in analyses {
        println!("{} <- {} ({})", entry.form, entry.lemma, entry.features);
    }
    
    Ok(())
}

Core Components

Repository

The Repository manages data downloads and caching:

use unimorph_core::Repository;

// Default location (~/.cache/unimorph)
let repo = Repository::open_default()?;

// Custom location
let repo = Repository::open("/custom/path")?;

// Download a language
repo.download("heb").await?;

// List cached languages
let languages = repo.cached_languages()?;

Store

The Store provides the query interface:

let store = repo.store();

// Inflect: lemma -> forms
let forms = store.inflect("heb", "כתב")?;

// Analyze: form -> lemmas
let analyses = store.analyze("heb", "כתבתי")?;

// Statistics
let stats = store.stats("heb")?;

Query Builder

Flexible searching with the query builder:

let results = store.query("heb")
    .lemma("כת%")           // LIKE pattern
    .pos("V")                // Part of speech
    .features_contain(&["FUT", "1"])  // Has these features
    .limit(100)
    .execute()?;

Types

Core data types:

use unimorph_core::{Entry, LangCode, FeatureBundle};

// Language codes (validated)
let lang: LangCode = "heb".parse()?;

// Entries contain lemma, form, features
let entry = Entry {
    lemma: "כתב".to_string(),
    form: "כתבתי".to_string(),
    features: "V;1;SG;PST".parse()?,
};

// Feature bundles support pattern matching
let features: FeatureBundle = "V;1;SG;PST".parse()?;
assert!(features.matches("V;*;SG;*"));
assert!(features.contains("PST"));

Error Handling

The library uses a custom Error type:

use unimorph_core::{Error, Repository, Result};

fn example() -> Result<()> {
    let repo = Repository::open_default()?;

    match repo.store().inflect("heb", "xyz") {
        Ok(entries) => println!("Found {} entries", entries.len()),
        Err(Error::NotFound(msg)) => println!("Not found: {}", msg),
        Err(e) => return Err(e),
    }

    Ok(())
}

Feature Flags

Flag     Description
default  Standard features
parquet  Parquet export support

[dependencies]
unimorph-core = { version = "0.1", features = ["parquet"] }

Types

Core data types in unimorph-core.

LangCode

A validated ISO 639-3 language code (3 lowercase ASCII letters).

use unimorph_core::LangCode;

// Parse from string
let lang: LangCode = "heb".parse()?;

// Validation happens at parse time
assert!("HEB".parse::<LangCode>().is_err());  // Must be lowercase
assert!("he".parse::<LangCode>().is_err());   // Must be 3 chars
assert!("h3b".parse::<LangCode>().is_err());  // Must be letters

// Convert to string
let s: &str = lang.as_ref();
let s: String = lang.to_string();

Entry

A single morphological entry with lemma, form, and features.

use unimorph_core::Entry;

// Entries are returned from queries
let entries = store.inflect("heb", "כתב")?;
for entry in entries {
    println!("Lemma: {}", entry.lemma);
    println!("Form: {}", entry.form);
    println!("Features: {}", entry.features);
    println!("Features (raw): {}", entry.features.raw());
    println!("Features (list): {:?}", entry.features.as_slice());
}

// Parse from TSV line
let entry = Entry::parse_line("כתב\tכתבתי\tV;1;SG;PST", 1)?;

// Serialize to JSON
let json = serde_json::to_string(&entry)?;

Fields

Field     Type           Description
lemma     String         Dictionary form
form      String         Inflected surface form
features  FeatureBundle  Morphological features

FeatureBundle

A semicolon-separated bundle of morphological features.

use unimorph_core::FeatureBundle;

// Parse from string
let features: FeatureBundle = "V;1;SG;PST".parse()?;

// Access individual features
assert_eq!(features.as_slice(), &["V", "1", "SG", "PST"]);
assert_eq!(features.raw(), "V;1;SG;PST");
assert_eq!(features.len(), 4);

// Check if contains a feature (position-independent)
assert!(features.contains("PST"));
assert!(features.contains("V"));
assert!(!features.contains("FUT"));

// Check if contains all features
assert!(features.contains_all(&["V", "PST"]));

// Pattern matching with wildcards
assert!(features.matches("V;*;SG;*"));
assert!(features.matches("V;1;*;PST"));
assert!(!features.matches("N;*;*;*"));

// Display
println!("{}", features);  // "V;1;SG;PST"

Pattern Matching

The matches method supports positional pattern matching:

Pattern     Description
V;1;SG;PST  Exact match
V;*;SG;*    Wildcard at positions 1 and 3
*;*;*;PST   Only check position 3

Note: The pattern must have the same number of positions as the bundle.
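
The positional rule can be illustrated without the crate. The following is a minimal sketch of the semantics described above, written against plain strings. The matches_pattern function is a hypothetical stand-in and not the library's implementation of FeatureBundle::matches.

```rust
/// Hypothetical stand-in for positional matching; not the crate's code.
/// Returns true when `pattern` and `bundle` have the same number of
/// semicolon-separated positions and every non-`*` position is equal.
fn matches_pattern(bundle: &str, pattern: &str) -> bool {
    let features: Vec<&str> = bundle.split(';').collect();
    let slots: Vec<&str> = pattern.split(';').collect();

    // A pattern with a different number of positions never matches.
    if features.len() != slots.len() {
        return false;
    }

    features
        .iter()
        .zip(slots.iter())
        .all(|(feature, slot)| *slot == "*" || slot == feature)
}
```

Under this reading, "V;*" does not match "V;1;SG;PST" because the position counts differ. For position-independent checks, contains and contains_all are the better fit.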

Validation

  • Feature bundles cannot be empty
  • Individual features cannot be empty
  • Features are separated by semicolons

assert!("".parse::<FeatureBundle>().is_err());      // Empty
assert!("V;;SG".parse::<FeatureBundle>().is_err()); // Empty feature

DatasetStats

Statistics about a downloaded language dataset.

use unimorph_core::DatasetStats;

let stats = store.stats("heb")?;
if let Some(stats) = stats {
    println!("Total entries: {}", stats.total_entries);
    println!("Unique lemmas: {}", stats.unique_lemmas);
    println!("Unique forms: {}", stats.unique_forms);
    println!("Unique features: {}", stats.unique_features);
}

Fields

Field            Type   Description
total_entries    usize  Number of entries
unique_lemmas    usize  Distinct lemmas
unique_forms     usize  Distinct surface forms
unique_features  usize  Distinct feature bundles

Serialization

All types implement Serialize and Deserialize from serde:

use unimorph_core::Entry;

// Keep the Vec alive so the borrowed first entry stays valid
let entries = store.inflect("heb", "כתב")?;
let entry = entries.first().unwrap();

// To JSON
let json = serde_json::to_string(entry)?;

// From JSON
let entry: Entry = serde_json::from_str(&json)?;

Store

The Store provides the query interface for morphological data.

Opening a Store

Usually accessed through Repository:

use unimorph_core::Repository;

let repo = Repository::open_default()?;
let store = repo.store();

Or open directly:

use unimorph_core::Store;

// Open existing database
let store = Store::open("path/to/datasets.db")?;

// In-memory store (for testing)
let store = Store::in_memory()?;

Basic Queries

Inflect (Lemma to Forms)

Look up all inflected forms of a lemma:

let forms = store.inflect("heb", "כתב")?;

for entry in &forms {
    println!("{} -> {} ({})", entry.lemma, entry.form, entry.features);
}

println!("Found {} forms", forms.len());

Analyze (Form to Lemmas)

Find all possible lemmas for a surface form:

let analyses = store.analyze("heb", "כתבו")?;

for entry in &analyses {
    println!("{} <- {} ({})", entry.form, entry.lemma, entry.features);
}

// Handle ambiguous forms
if analyses.len() > 1 {
    println!("Ambiguous: {} possible analyses", analyses.len());
}

Statistics

Get dataset statistics:

if let Some(stats) = store.stats("heb")? {
    println!("Entries: {}", stats.total_entries);
    println!("Lemmas: {}", stats.unique_lemmas);
    println!("Forms: {}", stats.unique_forms);
}

Check Language

Check if a language is loaded:

if store.has_language("heb")? {
    println!("Hebrew is available");
}

// List all languages
let languages = store.languages()?;
for lang in languages {
    println!("- {}", lang);
}

Query Builder

For flexible searching, use the query builder:

let results = store.query("heb")
    .lemma("כת%")           // LIKE pattern (% = any chars)
    .form("%ים")            // Forms ending in ים
    .pos("V")               // Part of speech
    .features_match("V;*;SG;*")  // Pattern match
    .features_contain(&["FUT"])  // Contains feature
    .limit(100)
    .offset(0)
    .execute()?;

See Query Builder for full documentation.

Data Management

Import Data

Import entries from TSV format:

use unimorph_core::{Entry, LangCode};

// Language codes must be exactly 3 lowercase ASCII letters
let lang: LangCode = "tst".parse()?;
let entries = vec![
    Entry::parse_line("test\tform1\tN;SG", 1)?,
    Entry::parse_line("test\tform2\tN;PL", 2)?,
];

store.import(&lang, &entries, None, None)?;

Delete Language

Remove a language from the store:

let removed = store.delete_language("heb")?;
println!("Removed {} entries", removed);

Export

Export to various formats:

// Export to TSV file
let count = store.export_tsv("heb", "hebrew.tsv")?;

// Export to JSONL file
let count = store.export_jsonl("heb", "hebrew.jsonl")?;

// Export to writer (e.g., stdout)
use std::io::stdout;
let count = store.export_tsv_to_writer("heb", stdout().lock())?;

// Parquet (with feature flag)
#[cfg(feature = "parquet")]
let count = store.export_parquet("heb", "hebrew.parquet")?;

Thread Safety

Store is Send but not Sync. For concurrent access, use a mutex or create separate store instances:

use std::sync::Mutex;

let store = Mutex::new(Store::open("datasets.db")?);

// In threads:
let store = store.lock().unwrap();
let results = store.inflect("heb", "כתב")?;
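
To hand one mutex-guarded value to several threads, the usual standard-library pattern is to wrap it in an Arc. The sketch below shows that pattern with FakeStore, a hypothetical stand-in type that is Send but not Sync. It does not use the real Store, whose constructor and query signatures are only shown elsewhere in this guide.

```rust
use std::cell::Cell;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a type that is `Send` but not `Sync`.
struct FakeStore {
    lookups: Cell<usize>, // `Cell` is what makes this type !Sync
}

impl FakeStore {
    /// Pretend query: records the call and returns the lemma length.
    fn lookup(&self, lemma: &str) -> usize {
        self.lookups.set(self.lookups.get() + 1);
        lemma.chars().count()
    }
}

/// Runs one lookup per lemma on its own thread, sharing a single store,
/// and returns how many lookups the store recorded.
fn lookup_in_threads(lemmas: Vec<String>) -> usize {
    let store = Arc::new(Mutex::new(FakeStore { lookups: Cell::new(0) }));

    let handles: Vec<_> = lemmas
        .into_iter()
        .map(|lemma| {
            let store = Arc::clone(&store);
            thread::spawn(move || {
                // The lock serializes access, so only one thread queries at a time.
                let guard = store.lock().unwrap();
                guard.lookup(&lemma)
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    let total = store.lock().unwrap().lookups.get();
    total
}
```

Because the mutex serializes every query, separate store instances per thread may scale better for read-heavy workloads.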

Error Handling

use unimorph_core::{Store, Error};

match store.inflect("xyz", "test") {
    Ok(entries) => println!("Found {} entries", entries.len()),
    Err(Error::LanguageNotFound(lang)) => {
        println!("Language {} not downloaded", lang);
    }
    Err(e) => return Err(e.into()),
}

Repository

The Repository manages data downloads, caching, and provides access to the underlying store.

Creating a Repository

use unimorph_core::Repository;

// Default location (~/.cache/unimorph)
let repo = Repository::open_default()?;

// Custom location
let repo = Repository::open("/path/to/data")?;

// Custom location with PathBuf
use std::path::PathBuf;
let path = PathBuf::from("/path/to/data");
let repo = Repository::open(&path)?;

Downloading Data

Download a language dataset from UniMorph:

// Download (async)
repo.download("heb").await?;

// Force re-download
repo.download_with_options("heb", true).await?;

Compressed Files and Git LFS

Some large datasets are distributed differently due to GitHub file size limits:

Format      Languages                 Notes
.xz (LZMA)  ces, pol, slk, ukr        Best compression for text
.zip        rus (segmentations), san  Archive format
Git LFS     ces (full MorfFlex)       For files > 100MB

The repository automatically:

  1. Tries compressed versions first (.xz, .gz)
  2. Falls back to uncompressed if not found
  3. Detects Git LFS pointers and fetches from media endpoint
  4. Decompresses transparently before importing

No special handling is needed - just call download() as usual.
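
Step 3 relies on the fact that a Git LFS pointer is a small text file whose first line is a fixed version string, followed by oid and size lines. The following is a minimal sketch of such a check. The parse_lfs_pointer helper is hypothetical and only illustrates the pointer format; it is not the repository's own detection code.

```rust
/// First line of a Git LFS pointer file, per the Git LFS specification.
const LFS_HEADER: &str = "version https://git-lfs.github.com/spec/v1";

/// Hypothetical helper: returns `(oid, size)` when `content` looks like a
/// Git LFS pointer rather than real dataset text.
fn parse_lfs_pointer(content: &str) -> Option<(String, u64)> {
    let mut lines = content.lines();
    if lines.next()?.trim() != LFS_HEADER {
        return None;
    }

    let mut oid = None;
    let mut size = None;
    for line in lines {
        if let Some(rest) = line.strip_prefix("oid sha256:") {
            oid = Some(rest.trim().to_string());
        } else if let Some(rest) = line.strip_prefix("size ") {
            size = rest.trim().parse::<u64>().ok();
        }
    }

    // Both fields are required for a usable pointer.
    Some((oid?, size?))
}
```

A downloader that receives a pointer instead of data would then fetch the real file from the LFS media endpoint, as described in the list above.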

Parse Reporting

When parsing downloaded data, use Entry::parse_tsv_with_report() for detailed diagnostics:

use unimorph_core::{Entry, ParseReport, CompressionFormat};

let content = "lemma\tform\tV;IND\nbad line\nlemma2\tform2\tN;SG\n";
let (entries, report) = Entry::parse_tsv_with_report(content);

println!("Valid entries: {}", report.valid_entries);
println!("Blank lines: {}", report.blank_lines);
println!("Malformed: {}", report.malformed_count);

// Inspect malformed entries (first 10 stored)
for entry in &report.malformed {
    println!("  Line {}: {} - {}", 
        entry.line_num, 
        entry.reason,
        entry.content
    );
}

The ParseReport includes:

Field            Type                 Description
valid_entries    usize                Successfully parsed entries
blank_lines      usize                Empty lines (not an error)
malformed_count  usize                Total entries that failed
malformed        Vec<MalformedEntry>  Details for first 10 failures
compression      CompressionFormat    Source file format
from_lfs         bool                 Whether fetched via Git LFS
filename         Option<String>       Source filename(s)
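
To make the counting fields concrete, here is a standalone sketch of a tolerant TSV pass that tallies valid, blank, and malformed lines and keeps details for only the first 10 failures. The Report struct and parse_tsv function are simplified, hypothetical stand-ins and do not reproduce the library's ParseReport or its exact validation rules.

```rust
/// Simplified, hypothetical stand-in for the report described above.
#[derive(Debug, Default)]
struct Report {
    valid_entries: usize,
    blank_lines: usize,
    malformed_count: usize,
    malformed: Vec<(usize, String)>, // (1-based line number, raw content)
}

/// Parses `lemma<TAB>form<TAB>features` lines, collecting diagnostics
/// instead of stopping at the first bad line.
fn parse_tsv(content: &str) -> (Vec<(String, String, String)>, Report) {
    const MAX_STORED: usize = 10;
    let mut entries = Vec::new();
    let mut report = Report::default();

    for (index, line) in content.lines().enumerate() {
        if line.trim().is_empty() {
            report.blank_lines += 1;
            continue;
        }

        let fields: Vec<&str> = line.split('\t').collect();
        match fields.as_slice() {
            [lemma, form, features]
                if !lemma.is_empty() && !form.is_empty() && !features.is_empty() =>
            {
                entries.push((lemma.to_string(), form.to_string(), features.to_string()));
                report.valid_entries += 1;
            }
            _ => {
                // Every failure is counted, but only the first few are stored.
                report.malformed_count += 1;
                if report.malformed.len() < MAX_STORED {
                    report.malformed.push((index + 1, line.to_string()));
                }
            }
        }
    }

    (entries, report)
}
```

Capping the stored details keeps the report small even when a large file is badly damaged, while malformed_count still reflects the full total.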

The CompressionFormat enum:

pub enum CompressionFormat {
    None,   // Plain text
    Xz,     // .xz (LZMA)
    Gzip,   // .gz
    Zip,    // .zip archive
}
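
One common way to tell these formats apart is to inspect the leading magic bytes of a file. The sketch below does that with a local copy of the enum so it compiles on its own. Whether the library detects formats by magic bytes or by file extension is not stated here, so treat detect_compression as a hypothetical illustration.

```rust
/// Local mirror of the enum above so this sketch is self-contained.
#[derive(Debug, PartialEq)]
enum CompressionFormat {
    None,
    Xz,
    Gzip,
    Zip,
}

/// Hypothetical helper: guesses the format from well-known magic bytes.
fn detect_compression(bytes: &[u8]) -> CompressionFormat {
    const XZ_MAGIC: [u8; 6] = [0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00];
    const GZIP_MAGIC: [u8; 2] = [0x1F, 0x8B];
    // Local file header signature "PK\x03\x04" (empty archives start differently).
    const ZIP_MAGIC: [u8; 4] = [0x50, 0x4B, 0x03, 0x04];

    if bytes.starts_with(&XZ_MAGIC) {
        CompressionFormat::Xz
    } else if bytes.starts_with(&GZIP_MAGIC) {
        CompressionFormat::Gzip
    } else if bytes.starts_with(&ZIP_MAGIC) {
        CompressionFormat::Zip
    } else {
        // Anything else is treated as plain text.
        CompressionFormat::None
    }
}
```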

Accessing the Store

Get the underlying store for queries:

let store = repo.store();

let forms = store.inflect("heb", "כתב")?;

Checking Cached Languages

// List cached languages
let languages = repo.cached_languages()?;
for lang in &languages {
    println!("Cached: {}", lang);
}

// Check if specific language is cached
if languages.iter().any(|l| l.as_ref() == "heb") {
    println!("Hebrew is cached");
}

Data Directory

The repository manages a data directory containing:

~/.cache/unimorph/
├── datasets.db              # SQLite database
└── available_languages.json # Cached API response

Get the data directory:

let data_dir = repo.data_dir();
println!("Data stored in: {}", data_dir.display());

Full Example

use unimorph_core::Repository;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Open repository
    let repo = Repository::open_default()?;
    
    // Download Hebrew if not cached
    let cached = repo.cached_languages()?;
    if !cached.iter().any(|l| l.as_ref() == "heb") {
        println!("Downloading Hebrew...");
        repo.download("heb").await?;
    }
    
    // Query the data
    let store = repo.store();
    let forms = store.inflect("heb", "כתב")?;
    
    println!("Found {} forms of כתב:", forms.len());
    for entry in &forms {
        println!("  {} - {}", entry.form, entry.features);
    }
    
    Ok(())
}

Error Handling

use unimorph_core::{Repository, Error};

async fn download_language(repo: &Repository, lang: &str) -> anyhow::Result<()> {
    match repo.download(lang).await {
        Ok(()) => println!("Downloaded {}", lang),
        Err(Error::Network(e)) => {
            println!("Network error: {}", e);
            println!("Check your connection and try again");
        }
        Err(Error::InvalidLanguage(l)) => {
            println!("Invalid language code: {}", l);
        }
        Err(e) => return Err(e.into()),
    }
    Ok(())
}

Async Runtime

Download operations are async and require a runtime:

// With tokio
#[tokio::main]
async fn main() {
    let repo = Repository::open_default().unwrap();
    repo.download("heb").await.unwrap();
}

// Or with block_on
fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap();
    let repo = Repository::open_default().unwrap();
    rt.block_on(repo.download("heb")).unwrap();
}

Query Builder

The query builder provides a fluent interface for flexible searching.

Basic Usage

let results = store.query("heb")
    .limit(100)
    .execute()?;

Filter Methods

By Lemma

// Exact match
.lemma("כתב")

// LIKE pattern (% = any chars, _ = single char)
.lemma("כת%")      // Starts with כת
.lemma("%ב")       // Ends with ב
.lemma("%בר%")     // Contains בר
.lemma("___")      // Exactly 3 characters
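
The two wildcards follow SQL LIKE rules: % matches any run of characters (including none) and _ matches exactly one. As a rough illustration of those rules, here is a standalone matcher. The like_match function is hypothetical and differs from SQLite in one known way: it is case-sensitive, whereas SQLite's LIKE ignores case for ASCII letters by default.

```rust
/// Hypothetical illustration of LIKE wildcard semantics (case-sensitive).
/// `%` matches zero or more characters, `_` matches exactly one.
fn like_match(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    let (mut pi, mut ti) = (0, 0);
    // Pattern index just after the last `%`, and the text index it was tried at.
    let mut backtrack: Option<(usize, usize)> = None;

    while ti < t.len() {
        if pi < p.len() && p[pi] == '%' {
            // First try letting `%` match nothing; remember where to retry.
            backtrack = Some((pi + 1, ti));
            pi += 1;
        } else if pi < p.len() && (p[pi] == '_' || p[pi] == t[ti]) {
            pi += 1;
            ti += 1;
        } else if let Some((bp, bt)) = backtrack {
            // Mismatch: let the last `%` absorb one more character.
            pi = bp;
            ti = bt + 1;
            backtrack = Some((bp, bt + 1));
        } else {
            return false;
        }
    }

    // Any pattern left over must consist of `%` only.
    p[pi..].iter().all(|&c| c == '%')
}
```

Matching works on Unicode scalar values, so "___" matches a three-letter Hebrew lemma such as כתב.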

By Form

// Exact match
.form("כתבתי")

// LIKE pattern
.form("%ים")       // Plural forms ending in ים
.form("ה%")        // Forms starting with ה

By Part of Speech

.pos("V")          // Verbs
.pos("N")          // Nouns
.pos("ADJ")        // Adjectives

By Features (Pattern Match)

Position-dependent matching with wildcards:

// Match specific positions
.features_match("V;1;SG;*")      // 1st person singular verbs
.features_match("V;*;*;PST;*")   // Past tense verbs
.features_match("N;*;PL;*")      // Plural nouns

By Features (Contains)

Position-independent matching:

// Has these features anywhere
.features_contain(&["FUT"])           // Future tense
.features_contain(&["PL", "MASC"])    // Plural masculine
.features_contain(&["V", "1", "SG"])  // 1st person singular verbs

Pagination

// First page
.limit(20)
.offset(0)

// Second page
.limit(20)
.offset(20)

// All results (careful with large datasets!)
.limit(usize::MAX)

Executing Queries

Get Results

let entries: Vec<Entry> = store.query("heb")
    .pos("V")
    .limit(100)
    .execute()?;

for entry in &entries {
    println!("{} {} {}", entry.lemma, entry.form, entry.features);
}

Count Results

let count = store.query("heb")
    .pos("V")
    .count()?;

println!("Found {} verbs", count);

Check Existence

let exists = store.query("heb")
    .lemma("כתב")
    .exists()?;

if exists {
    println!("Lemma found");
}

Get First Result

if let Some(entry) = store.query("heb")
    .lemma("כתב")
    .first()?
{
    println!("First form: {}", entry.form);
}

Chaining Filters

Filters are combined with AND logic:

let results = store.query("heb")
    .lemma("כת%")                    // AND
    .pos("V")                         // AND
    .features_contain(&["FUT"])       // AND
    .limit(10)
    .execute()?;

Examples

Find All Verb Infinitives

let infinitives = store.query("heb")
    .pos("V")
    .features_contain(&["NFIN"])
    .execute()?;

Find Ambiguous Forms

Forms that could be multiple parts of speech:

let form = "שמר";

let as_verb = store.query("heb")
    .form(form)
    .pos("V")
    .execute()?;

let as_noun = store.query("heb")
    .form(form)
    .pos("N")
    .execute()?;

if !as_verb.is_empty() && !as_noun.is_empty() {
    println!("{} is ambiguous (verb and noun)", form);
}

Paginate Through All Results

let page_size = 100;
let mut offset = 0;

loop {
    let results = store.query("heb")
        .pos("V")
        .limit(page_size)
        .offset(offset)
        .execute()?;

    if results.is_empty() {
        break;
    }

    for entry in &results {
        // Process entry
    }

    offset += page_size;
}

Export Filtered Subset

use std::io::Write;

let mut file = std::fs::File::create("verbs.tsv")?;

let verbs = store.query("heb")
    .pos("V")
    .limit(usize::MAX)
    .execute()?;

for entry in &verbs {
    writeln!(file, "{}\t{}\t{}", entry.lemma, entry.form, entry.features)?;
}

Performance Tips

  1. Use limits: Always set a reasonable limit
  2. Prefer specific filters: More filters = faster queries
  3. Use count() first: Check result size before fetching all
  4. Index-friendly queries: Lemma and form queries use indexes

// Good: Uses index
.lemma("כתב")

// Good: Uses index
.form("כתבתי")

// Slower: Full scan with pattern
.lemma("%תב%")

// Slower: Feature scan
.features_contain(&["FUT"])

Python Bindings

The unimorph-rs Python package provides fast, Rust-powered access to UniMorph morphological data with native Polars DataFrame support.

Installation

pip install unimorph-rs

For Polars DataFrame support:

pip install unimorph-rs[polars]

Requirements

  • Python 3.9+
  • Polars (optional, for DataFrame methods)

Quick Start

from unimorph import Store, download

# Download a language dataset (one-time)
download("spa")  # Spanish

# Create a store to query the data
store = Store()

# Get all inflected forms of a lemma
forms = store.inflect("spa", "hablar")
for entry in forms:
    print(f"{entry.form}: {entry.features}")

Output:

hablar: V;NFIN
hablando: V;V.CVB;PRS
hablado: V;V.PTCP;PST;MASC;SG
hablo: V;IND;PRS;1;SG
hablas: V;IND;PRS;2;SG
habla: V;IND;PRS;3;SG
...

Core API

download(lang)

Downloads a language dataset from UniMorph. Only needs to be called once per language.

from unimorph import download

download("deu")  # German
download("spa")  # Spanish
download("fra")  # French

Store

The main interface for querying morphological data.

from unimorph import Store

store = Store()

store.inflect(lang, lemma)

Get all inflected forms for a lemma (dictionary form).

forms = store.inflect("deu", "gehen")  # "to go" in German
for entry in forms:
    print(f"{entry.lemma} -> {entry.form}: {entry.features}")

store.analyze(lang, form)

Analyze a word form to find possible lemmas and features.

analyses = store.analyze("spa", "hablamos")
for entry in analyses:
    print(f"{entry.form} <- {entry.lemma}: {entry.features}")

store.search_features(lang, features, limit=None)

Search for entries containing specific morphological features.

# Find all past tense subjunctive forms in Spanish
entries = store.search_features("spa", "SBJV;PST", limit=100)

store.stats(lang)

Get statistics about a downloaded language dataset.

stats = store.stats("spa")
if stats:
    print(f"Entries: {stats.total_entries}")
    print(f"Unique lemmas: {stats.unique_lemmas}")
    print(f"Unique forms: {stats.unique_forms}")

store.languages()

List all downloaded languages.

langs = store.languages()
print(langs)  # ['deu', 'ita', 'spa', ...]

store.has_language(lang)

Check if a language is downloaded.

if store.has_language("fra"):
    print("French data is available")

Polars DataFrame Support

Note: Requires pip install unimorph-rs[polars]

All query methods have _df variants that return Polars DataFrames for easy data analysis.

from unimorph import Store, download

download("spa")
store = Store()

# Get results as a DataFrame
df = store.inflect_df("spa", "ser")
print(df)

Output:

shape: (70, 3)
+-------+---------+------------------------+
| lemma | form    | features               |
| ---   | ---     | ---                    |
| str   | str     | str                    |
+-------+---------+------------------------+
| ser   | ser     | V;NFIN                 |
| ser   | siendo  | V;V.CVB;PRS            |
| ser   | sido    | V;V.PTCP;PST;MASC;SG   |
| ser   | soy     | V;IND;PRS;1;SG         |
| ser   | eres    | V;IND;PRS;2;SG         |
| ...   | ...     | ...                    |
+-------+---------+------------------------+

DataFrame Methods

  • store.inflect_df(lang, lemma) - Inflections as DataFrame
  • store.analyze_df(lang, form) - Analyses as DataFrame
  • store.search_features_df(lang, features, limit=None) - Feature search as DataFrame

Working with DataFrames

import polars as pl

df = store.inflect_df("spa", "hablar")

# Filter to indicative mood only
indicative = df.filter(pl.col("features").str.contains("IND"))

# Label each indicative form with its tense (adds a "tense" column)
by_tense = df.filter(
    pl.col("features").str.contains("IND")
).with_columns(
    pl.when(pl.col("features").str.contains("PRS")).then(pl.lit("present"))
      .when(pl.col("features").str.contains("PST")).then(pl.lit("past"))
      .when(pl.col("features").str.contains("FUT")).then(pl.lit("future"))
      .otherwise(pl.lit("other"))
      .alias("tense")
)

print(by_tense)

Entry Objects

Query results return Entry objects with the following attributes:

| Attribute | Type | Description |
| --- | --- | --- |
| lemma | str | Dictionary form / citation form |
| form | str | Inflected surface form |
| features | str | UniMorph feature bundle (semicolon-separated) |

entry = store.inflect("spa", "hablar")[0]
print(entry.lemma)     # "hablar"
print(entry.form)      # "hablar"
print(entry.features)  # "V;NFIN"
print(repr(entry))     # Entry(lemma='hablar', form='hablar', features='V;NFIN')

DatasetStats Objects

Statistics returned by store.stats():

| Attribute | Type | Description |
| --- | --- | --- |
| language | str | Language code |
| total_entries | int | Total number of entries |
| unique_lemmas | int | Number of unique lemmas |
| unique_forms | int | Number of unique forms |
| unique_features | int | Number of unique feature bundles |
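
To make the meaning of each counter concrete, here is a small sketch that derives the same five values from a list of (lemma, form, features) triples. The DatasetStats class below is a local stand-in whose field names mirror the attribute table; compute_stats is a hypothetical helper, not a library function.

```python
from dataclasses import dataclass


@dataclass
class DatasetStats:
    # Stand-in type; field names follow the attribute table above.
    language: str
    total_entries: int
    unique_lemmas: int
    unique_forms: int
    unique_features: int


def compute_stats(language, entries):
    """Count rows and the distinct values in each column of the triples."""
    return DatasetStats(
        language=language,
        total_entries=len(entries),
        unique_lemmas=len({lemma for lemma, _, _ in entries}),
        unique_forms=len({form for _, form, _ in entries}),
        unique_features=len({features for _, _, features in entries}),
    )


entries = [
    ("hablar", "hablo", "V;IND;PRS;1;SG"),
    ("hablar", "hablas", "V;IND;PRS;2;SG"),
    ("hablar", "habla", "V;IND;PRS;3;SG"),
    ("hablar", "hablamos", "V;IND;PRS;1;PL"),
]
stats = compute_stats("spa", entries)
print(stats)
```

Note that unique_features counts whole bundles such as V;IND;PRS;1;SG, not individual tags.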

Example: Building a Conjugation Table

import polars as pl
from unimorph import Store, download

download("spa")
store = Store()

# Get all forms of "hablar" (to speak)
df = store.inflect_df("spa", "hablar")

# Filter to present indicative
present = df.filter(
    pl.col("features").str.contains("IND") & 
    pl.col("features").str.contains("PRS")
)

# Extract person and number
conjugation = present.with_columns([
    pl.when(pl.col("features").str.contains("1")).then(pl.lit("1st"))
      .when(pl.col("features").str.contains("2")).then(pl.lit("2nd"))
      .when(pl.col("features").str.contains("3")).then(pl.lit("3rd"))
      .alias("person"),
    pl.when(pl.col("features").str.contains("SG")).then(pl.lit("singular"))
      .when(pl.col("features").str.contains("PL")).then(pl.lit("plural"))
      .alias("number")
]).select(["person", "number", "form"])

print(conjugation)
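
The same table can be built without Polars. The sketch below uses only the standard library and splits each bundle on ";" so that tags are compared as whole features rather than as substrings. The conjugation_table helper and the two label dictionaries are illustrative names, not part of the package.

```python
# Sample rows in (lemma, form, features) order.
entries = [
    ("hablar", "hablo", "V;IND;PRS;1;SG"),
    ("hablar", "hablas", "V;IND;PRS;2;SG"),
    ("hablar", "habla", "V;IND;PRS;3;SG"),
    ("hablar", "hablamos", "V;IND;PRS;1;PL"),
    ("hablar", "habláis", "V;IND;PRS;2;PL"),
    ("hablar", "hablan", "V;IND;PRS;3;PL"),
]

PERSONS = {"1": "1st", "2": "2nd", "3": "3rd"}
NUMBERS = {"SG": "singular", "PL": "plural"}


def conjugation_table(entries, required=("IND", "PRS")):
    """Map (person, number) to the form that carries all required tags."""
    table = {}
    for _lemma, form, features in entries:
        tags = features.split(";")
        if not all(tag in tags for tag in required):
            continue
        person = next((PERSONS[t] for t in tags if t in PERSONS), None)
        number = next((NUMBERS[t] for t in tags if t in NUMBERS), None)
        if person and number:
            table[(person, number)] = form
    return table


table = conjugation_table(entries)
for person in ("1st", "2nd", "3rd"):
    singular = table[(person, "singular")]
    plural = table[(person, "plural")]
    print(f"{person:<4} {singular:<10} {plural}")
```

Matching on whole tags avoids accidental hits when one feature name happens to be a substring of another.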

About UniMorph

UniMorph is a collaborative project that provides morphological paradigms for the world's languages in a standardized format.

What is Morphology?

Morphology is the study of word structure and how words change form to express different grammatical meanings. For example:

  • English: "walk" -> "walks", "walked", "walking"
  • Spanish: "hablar" -> "hablo", "hablas", "habla", "hablamos"...
  • Hebrew: "כתב" -> "כותב", "כתבתי", "יכתוב"...

What UniMorph Provides

UniMorph datasets contain mappings from lemmas (dictionary forms) to their inflected forms, along with morphological features describing each form.

Data Format

Each entry is a triple:

lemma <TAB> form <TAB> features

Example (Spanish):

hablar	hablo	V;IND;PRS;1;SG
hablar	hablas	V;IND;PRS;2;SG
hablar	habla	V;IND;PRS;3;SG
hablar	hablamos	V;IND;PRS;1;PL
hablar	habláis	V;IND;PRS;2;PL
hablar	hablan	V;IND;PRS;3;PL
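
Reading this format takes very little code. The following sketch parses tab-separated text into named triples; Triple, parse_line, and parse_tsv are names chosen for this example, and skipping blank lines is an assumption about how a raw file might be laid out.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    lemma: str
    form: str
    features: str


def parse_line(line):
    """Split one 'lemma<TAB>form<TAB>features' line into a Triple."""
    lemma, form, features = line.rstrip("\n").split("\t")
    return Triple(lemma, form, features)


def parse_tsv(text):
    """Parse every non-blank line of a tab-separated dataset."""
    return [parse_line(line) for line in text.splitlines() if line.strip()]


sample = (
    "hablar\thablo\tV;IND;PRS;1;SG\n"
    "hablar\thablas\tV;IND;PRS;2;SG\n"
    "\n"
    "hablar\thabla\tV;IND;PRS;3;SG\n"
)
triples = parse_tsv(sample)
for t in triples:
    print(t.lemma, t.form, t.features.split(";"))
```

A line with the wrong number of columns raises ValueError from the tuple unpacking, which is usually what you want when validating a dataset.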

Coverage

UniMorph includes data for 180+ languages, spanning:

  • High-resource languages: English, Spanish, German, French
  • Medium-resource languages: Finnish, Hungarian, Turkish
  • Low-resource languages: Many endangered and under-documented languages

Data Sources

UniMorph data comes from:

  • Wiktionary extractions
  • Linguistic databases
  • Academic contributions
  • Community submissions

Use Cases

Natural Language Processing

  • Training morphological inflection models
  • Data augmentation for NLU systems
  • Lemmatization and stemming lookup tables

Language Learning

  • Conjugation practice applications
  • Flashcard generation
  • Grammar reference tools

Linguistic Research

  • Cross-linguistic typology studies
  • Morphological complexity analysis
  • Paradigm structure research

Lexicography

  • Dictionary development
  • Inflection table generation
  • Coverage verification

The UniMorph Schema

UniMorph uses a standardized feature schema across all languages, making cross-linguistic comparison possible. Features are organized into dimensions:

  • Part of Speech (V, N, ADJ, ...)
  • Person (1, 2, 3)
  • Number (SG, PL, DU)
  • Tense (PST, PRS, FUT)
  • And many more...

See the official UniMorph schema documentation for the complete specification, or our Feature Schema page for a quick reference.

Contributing to UniMorph

UniMorph is open source. Each language has its own GitHub repository under the UniMorph organization (https://github.com/unimorph/<code>).

Contributions welcome:

  • Report data errors
  • Add missing forms
  • Contribute new languages

Citation

If you use UniMorph in research, please cite:

@inproceedings{mccarthy-etal-2020-unimorph,
    title = "{U}ni{M}orph 3.0: Universal Morphology",
    author = "McCarthy, Arya D. and others",
    booktitle = "LREC",
    year = "2020",
}

Feature Schema

UniMorph uses a standardized feature schema to annotate morphological forms. Features are separated by semicolons; their order is consistent within a language but differs between languages.

For the complete official specification, see the UniMorph Schema documentation (PDF).

Feature Format

FEATURE1;FEATURE2;FEATURE3;...

Example: V;IND;PRS;1;SG means:

  • V = Verb
  • IND = Indicative mood
  • PRS = Present tense
  • 1 = First person
  • SG = Singular number
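
The breakdown above amounts to splitting on ";" and looking each tag up in a table. A minimal sketch, where LABELS covers only the five tags from this example and describe is a hypothetical helper:

```python
# Only the tags used in the example; a real table would cover the full schema.
LABELS = {
    "V": "Verb",
    "IND": "Indicative mood",
    "PRS": "Present tense",
    "1": "First person",
    "SG": "Singular number",
}


def describe(bundle):
    """Return (tag, description) pairs for a semicolon-separated bundle."""
    return [(tag, LABELS.get(tag, "unknown")) for tag in bundle.split(";")]


for tag, label in describe("V;IND;PRS;1;SG"):
    print(f"{tag} = {label}")
```

Tags missing from the table are reported as "unknown" instead of raising, so the helper can be pointed at bundles from any language.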

Feature Dimensions

Part of Speech

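| Feature | Description |
| --- | --- |
| V | Verb |
| N | Noun |
| ADJ | Adjective |
| ADV | Adverb |
| PRO | Pronoun |
| DET | Determiner |
| ADP | Adposition |
| NUM | Numeral |
| CONJ | Conjunction |
| PART | Particle |
| INTJ | Interjection |
| V.MSDR | Verbal noun / Masdar |
| V.PTCP | Participle |
| V.CVB | Converb |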

Person

| Feature | Description |
| --- | --- |
| 1 | First person |
| 2 | Second person |
| 3 | Third person |
| 4 | Fourth person (obviative) |
| INCL | Inclusive |
| EXCL | Exclusive |

Number

| Feature | Description |
| --- | --- |
| SG | Singular |
| PL | Plural |
| DU | Dual |
| TRI | Trial |
| PAUC | Paucal |
| GRPL | Greater plural |

Gender

| Feature | Description |
| --- | --- |
| MASC | Masculine |
| FEM | Feminine |
| NEUT | Neuter |
| NAKH1-NAKH8 | Nakh noun classes |

Case

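| Feature | Description |
| --- | --- |
| NOM | Nominative |
| ACC | Accusative |
| GEN | Genitive |
| DAT | Dative |
| INS | Instrumental |
| LOC | Locative |
| ABL | Ablative |
| VOC | Vocative |
| ESS | Essive |
| TRANS | Translative |
| COM | Comitative |
| PRIV | Privative |
| PRT | Partitive |

And many more...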

Tense

| Feature | Description |
| --- | --- |
| PRS | Present |
| PST | Past |
| FUT | Future |
| IPFV | Imperfective |
| PFV | Perfective |
| PRF | Perfect |
| PLPRF | Pluperfect |
| PROSP | Prospective |

Aspect

| Feature | Description |
| --- | --- |
| IPFV | Imperfective |
| PFV | Perfective |
| HAB | Habitual |
| PROG | Progressive |
| ITER | Iterative |

Mood

| Feature | Description |
| --- | --- |
| IND | Indicative |
| SBJV | Subjunctive |
| IMP | Imperative |
| COND | Conditional |
| OPT | Optative |
| POT | Potential |
| PURP | Purposive |

Voice

| Feature | Description |
| --- | --- |
| ACT | Active |
| PASS | Passive |
| MID | Middle |
| ANTIP | Antipassive |
| CAUS | Causative |

Finiteness

| Feature | Description |
| --- | --- |
| FIN | Finite |
| NFIN | Non-finite |

Definiteness

| Feature | Description |
| --- | --- |
| DEF | Definite |
| NDEF | Indefinite |
| SPEC | Specific |
| NSPEC | Non-specific |

Comparison

| Feature | Description |
| --- | --- |
| CMPR | Comparative |
| SPRL | Superlative |

Polarity

| Feature | Description |
| --- | --- |
| POS | Positive |
| NEG | Negative |

Possession

| Feature | Description |
| --- | --- |
| PSS1S | 1st person singular possessor |
| PSS2S | 2nd person singular possessor |
| PSS3S | 3rd person singular possessor |
| PSS1P | 1st person plural possessor |
| PSS2P | 2nd person plural possessor |
| PSS3P | 3rd person plural possessor |
| PSSD | Possessed form |

Language-Specific Features

Some languages have additional features not listed above. Use unimorph features -l <lang> --list to see all features used in a specific language.

Feature Position

Feature positions vary by language. For example:

Hebrew verbs: V;PERSON;NUMBER;TENSE;GENDER

V;1;SG;PST     (1st person singular past)
V;3;PL;FUT;MASC (3rd person plural future masculine)

Spanish verbs: V;MOOD;TENSE;PERSON;NUMBER

V;IND;PRS;1;SG  (indicative present 1st singular)
V;SBJV;PST;3;PL (subjunctive past 3rd plural)
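
Because the order differs between languages, code that compares bundles across languages is safer looking features up by dimension than by index. The sketch below shows one way to do that. DIMENSIONS is a deliberately tiny, hand-written subset of the schema, and by_dimension is a hypothetical helper.

```python
# Small illustrative subset of the schema; not a complete inventory.
DIMENSIONS = {
    "person": {"1", "2", "3"},
    "number": {"SG", "PL", "DU"},
    "tense": {"PST", "PRS", "FUT"},
    "mood": {"IND", "SBJV"},
    "gender": {"MASC", "FEM"},
}


def by_dimension(bundle):
    """Map dimension name -> tag, regardless of where the tag sits."""
    found = {}
    for tag in bundle.split(";"):
        for dimension, values in DIMENSIONS.items():
            if tag in values:
                found[dimension] = tag
    return found


hebrew = by_dimension("V;1;SG;PST")
spanish = by_dimension("V;IND;PRS;1;SG")
print(hebrew)
print(spanish)
print(hebrew["person"] == spanish["person"])
```

Both bundles report person 1 and number SG even though those tags occupy different positions in each language.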

Working with Features

CLI

# List all features in a language
unimorph features -l heb --list

# See feature statistics
unimorph features -l heb --stats

# Find entries with a feature
unimorph features -l heb --search FUT

# Search by feature pattern
unimorph search -l heb -f "V;1;SG;*"

# Search by contained features
unimorph search -l heb --contains PL,MASC
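
The difference between the two search styles above can be sketched in a few lines of Python. This is one plausible reading, not the tool's documented behavior: it assumes a pattern is compared position by position with each * standing for exactly one feature, and that --contains requires every listed feature to be present in any order. Both function names are hypothetical.

```python
def matches(bundle, pattern):
    """Positional match; '*' is assumed to stand for exactly one feature."""
    tags = bundle.split(";")
    parts = pattern.split(";")
    if len(tags) != len(parts):
        return False
    return all(p == "*" or p == t for p, t in zip(parts, tags))


def contains_all(bundle, wanted):
    """Order-independent match; 'wanted' is a comma-separated feature list."""
    tags = set(bundle.split(";"))
    return all(feature in tags for feature in wanted.split(","))


print(matches("V;1;SG;PST", "V;1;SG;*"))
print(matches("V;3;PL;FUT;MASC", "V;1;SG;*"))
print(contains_all("V;3;PL;FUT;MASC", "PL,MASC"))
```

Under these assumptions a pattern is the stricter tool (order and length matter), while a contains query is the more forgiving one.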

Library

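use unimorph_core::FeatureBundle;

// Assumes the parse error implements std::error::Error so `?` can box it.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse a semicolon-separated feature string into a bundle
    let features: FeatureBundle = "V;1;SG;PST".parse()?;

    // Check for a specific feature
    if features.contains("PST") {
        println!("Past tense");
    }

    // Match against a pattern; `*` is a wildcard
    if features.matches("V;*;SG;*") {
        println!("Singular verb");
    }

    Ok(())
}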

Available Languages

UniMorph provides morphological data for 180+ languages. Use unimorph list --available to see the current list.

For the complete list of languages with download links, see the official UniMorph languages page.

Listing Languages

# See all available languages
unimorph list --available

# See cached (downloaded) languages
unimorph list --cached

# Refresh the available list
unimorph list --available --refresh

Language Codes

UniMorph uses ISO 639-3 three-letter language codes:

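| Code | Language |
| --- | --- |
| ara | Arabic |
| deu | German |
| ell | Greek |
| eng | English |
| fas | Persian |
| fin | Finnish |
| fra | French |
| heb | Hebrew |
| hin | Hindi |
| hun | Hungarian |
| ita | Italian |
| jpn | Japanese |
| kat | Georgian |
| kor | Korean |
| lat | Latin |
| nld | Dutch |
| pol | Polish |
| por | Portuguese |
| ron | Romanian |
| rus | Russian |
| spa | Spanish |
| swe | Swedish |
| tur | Turkish |
| ukr | Ukrainian |
| zho | Chinese |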

And many more...

Dataset Sizes

Dataset sizes vary significantly:

| Language | Entries | Lemmas |
| --- | --- | --- |
| Finnish (fin) | 2.7M+ | 50K+ |
| Spanish (spa) | 1.2M+ | 10K+ |
| German (deu) | 500K+ | 50K+ |
| Italian (ita) | 500K+ | 10K+ |
| Hebrew (heb) | 33K+ | 1K+ |

Check specific sizes with:

unimorph stats <lang>

Language Repositories

Each language has its own GitHub repository under the UniMorph organization:

https://github.com/unimorph/<code>

For example, the Spanish (spa) data lives at https://github.com/unimorph/spa.

You can also browse all languages on the UniMorph website.

Data Quality

Data quality varies by language:

  • High quality: Languages with extensive Wiktionary coverage
  • Medium quality: Languages with academic contributions
  • Lower quality: Newer or less-resourced languages

Check the language's GitHub repository for:

  • Data sources
  • Known issues
  • Contribution guidelines

Finding Language Codes

If you don't know a language's code:

# List all available and search
unimorph list --available | grep -i finnish
# Output: fin

# Or use the SIL database
# https://iso639-3.sil.org/code_tables/639/data

Setting Up Aliases

Create shortcuts for frequently used languages:

# ~/.config/unimorph/config.toml
[languages]
hebrew = "heb"
spanish = "spa"
german = "deu"
finnish = "fin"

Then use:

unimorph inflect -l hebrew כתב
# Resolves to: unimorph inflect -l heb כתב
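
Conceptually, alias resolution is a dictionary lookup with a fallback to the name as typed. The sketch below hard-codes the [languages] table from the config above instead of reading the file, and resolve_language is a hypothetical helper rather than the CLI's own code.

```python
# Mirrors the [languages] table from the example config.
ALIASES = {
    "hebrew": "heb",
    "spanish": "spa",
    "german": "deu",
    "finnish": "fin",
}


def resolve_language(name, aliases=ALIASES):
    """Return the code for an alias, or the input unchanged if it is not one."""
    return aliases.get(name, name)


print(resolve_language("hebrew"))
print(resolve_language("heb"))
```

Falling back to the input means plain ISO 639-3 codes keep working alongside any aliases you define.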

Contributing Languages

To contribute to a language or add a new one:

  1. Visit the language repository on GitHub
  2. Check existing issues
  3. Submit corrections or additions via pull request

See the UniMorph contribution guidelines for more information.

Contributing

Thank you for your interest in contributing to unimorph-rs!

Getting Started

Prerequisites

  • Rust (latest stable)
  • Git

Clone and Build

git clone https://github.com/joshrotenberg/unimorph-rs
cd unimorph-rs
cargo build

Run Tests

cargo test --all-features

Run Lints

cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings

Project Structure

unimorph-rs/
├── crates/
│   ├── unimorph-core/     # Core library
│   │   ├── src/
│   │   │   ├── lib.rs
│   │   │   ├── types.rs   # Core types
│   │   │   ├── store.rs   # SQLite backend
│   │   │   ├── query.rs   # Query builder
│   │   │   ├── repository.rs
│   │   │   └── export.rs
│   │   └── Cargo.toml
│   │
│   └── unimorph-cli/      # CLI application
│       ├── src/
│       │   ├── main.rs
│       │   ├── commands/  # Command implementations
│       │   ├── config.rs
│       │   └── colors.rs
│       └── Cargo.toml
│
├── docs/                   # mdBook documentation
│   ├── book.toml
│   └── src/
│
└── Cargo.toml             # Workspace root

Making Changes

Creating a Branch

git checkout -b feat/your-feature
# or
git checkout -b fix/your-fix

Commit Messages

Use conventional commits:

feat: add new feature
fix: resolve bug in X
docs: update documentation
test: add tests for Y
refactor: restructure Z

Pull Requests

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests and lints
  5. Submit a pull request

Development Guidelines

Code Style

  • Follow Rust idioms
  • Use rustfmt for formatting
  • Address all clippy warnings
  • Document public APIs

Testing

  • Add tests for new features
  • Maintain test coverage
  • Use meaningful test names

#[test]
fn inflect_returns_all_forms() {
    // ...
}

Error Handling

  • Use thiserror for library errors
  • Use anyhow for CLI errors
  • Provide helpful error messages

Documentation

  • Document public items
  • Include examples in doc comments
  • Update mdBook docs for user-facing changes

Areas for Contribution

Good First Issues

Look for issues labeled good first issue on GitHub.

Feature Ideas

  • Additional export formats
  • Performance optimizations
  • New query capabilities
  • Language-specific features

Documentation

  • Fix typos
  • Improve examples
  • Add tutorials
  • Translate documentation

Testing

  • Add edge case tests
  • Improve test coverage
  • Add integration tests

Code of Conduct

Be respectful and constructive. We welcome contributors of all experience levels.

Getting Help

  • Open a GitHub issue for bugs
  • Use discussions for questions
  • Check existing issues before creating new ones

License

Contributions are licensed under the same terms as the project (MIT/Apache-2.0).