download

Download a language dataset from UniMorph.

Alias: dl

Synopsis

unimorph download [OPTIONS] [LANG]

Description

Downloads a UniMorph language dataset from GitHub and imports it into the local SQLite database. Datasets are cached locally, so subsequent queries don't require network access.

If the dataset is already cached, this command does nothing unless --force is specified.

Arguments

Argument  Description
[LANG]    Language code (ISO 639-3, e.g., heb, ita, deu). Optional if a default language is set via the UNIMORPH_LANG environment variable or in the configuration.

Options

Option       Description
-f, --force  Force re-download even if cached
--json       Output as JSON
-q, --quiet  Suppress progress output

Examples

Basic Download

unimorph download heb
Downloading heb...
Downloaded 33177 entries for heb

Force Re-download

unimorph download heb --force

Quiet Mode

unimorph download heb --quiet

JSON Output

unimorph download heb --json
{
  "language": "heb",
  "entries": 33177,
  "status": "downloaded"
}
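
For scripting, individual fields can be picked out of the JSON output. The following is a minimal sketch, assuming the payload has exactly the shape shown above. The sample is hard-coded so the snippet runs without network access, and a real JSON parser such as jq would be more robust than sed:

```shell
# Sample payload copied from the example above. In a real script this would be
# captured with: json=$(unimorph download heb --json)
json='{
  "language": "heb",
  "entries": 33177,
  "status": "downloaded"
}'

# Pull out the numeric "entries" field and the "status" string with sed.
entries=$(printf '%s\n' "$json" | sed -n 's/.*"entries": *\([0-9][0-9]*\).*/\1/p')
status=$(printf '%s\n' "$json" | sed -n 's/.*"status": *"\([^"]*\)".*/\1/p')

echo "entries=$entries status=$status"
```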

Download Multiple Languages

for lang in heb ita deu spa; do
  unimorph download "$lang"
done
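
The loop above stops reporting nothing if one language fails partway through. A possible variant that keeps going and reports failures at the end is sketched below. It assumes that unimorph download exits with a non-zero status on failure, which this page does not state; download_all is a hypothetical helper name, not part of the CLI:

```shell
# Download each language in turn and keep going if one fails.
# Assumption: `unimorph download` exits non-zero on failure.
download_all() {
  failed=""
  for lang in "$@"; do
    if ! unimorph download "$lang"; then
      failed="$failed $lang"
    fi
  done
  if [ -n "$failed" ]; then
    # Report the languages that could not be downloaded.
    echo "failed:$failed" >&2
    return 1
  fi
  return 0
}
```

It would be called as download_all heb ita deu spa.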

With Default Language

export UNIMORPH_LANG=heb
unimorph download  # Downloads Hebrew

Verbose Output

Use -v for detailed import reporting:

unimorph download spa --force -v

This prints a summary of the import, followed by details of the first malformed entries:

parsed downloaded data lang=spa filename=["spa"] compression=none from_lfs=false valid_entries=1196224 blank_lines=0 malformed=21
malformed entry lang=spa line=80710 reason=empty form
malformed entry lang=spa line=134234 reason=empty form
...
additional malformed entries not shown lang=spa additional=11

Understanding the Output

Field          Description
filename       Source file(s) downloaded
compression    Format: none, xz, gzip, or zip
from_lfs       Whether fetched via Git LFS (large files)
valid_entries  Successfully parsed entries
blank_lines    Empty lines skipped (not an error)
malformed      Entries that failed to parse
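
Because the summary line is a sequence of key=value fields, it can be split with standard tools. The sketch below works on the sample line from the verbose example rather than on live output, since this page does not say which stream the log is written to. The field helper is a hypothetical name, and treating valid_entries plus malformed as the total number of non-blank lines is an assumption:

```shell
# Summary line copied from the verbose example above.
line='parsed downloaded data lang=spa filename=["spa"] compression=none from_lfs=false valid_entries=1196224 blank_lines=0 malformed=21'

# Print the value of a key=value field from a log line.
# Usage: field KEY LINE
field() {
  printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1=//p"
}

valid=$(field valid_entries "$line")
malformed=$(field malformed "$line")

# Share of malformed lines among all non-blank lines, as a percentage.
# Assumption: valid_entries + malformed covers every non-blank line.
awk -v v="$valid" -v m="$malformed" 'BEGIN { printf "%.4f%%\n", 100 * m / (v + m) }'
```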

Malformed Entry Details

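When entries fail to parse, the first 10 are logged individually; any remaining ones are only counted in the final "additional malformed entries not shown" line. Each logged entry includes: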

  • Line number: Where in the source file
  • Reason: Why it failed (e.g., "empty form", "expected at least 3 columns")

Common reasons for malformed entries:

  • empty form - The inflected form field is blank
  • empty lemma - The dictionary form field is blank
  • expected at least 3 columns - Line doesn't have lemma, form, and features

These indicate upstream data quality issues in the UniMorph repository.
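
To see which reasons dominate, the logged entries can be tallied. This is a rough sketch over sample lines, not live output: the first two lines come from the verbose example, and the third is a hypothetical line added purely for illustration. Only the individually logged entries can be counted this way:

```shell
# First two lines are copied from the verbose example; the third is a
# hypothetical line added only to show more than one reason in the tally.
log='malformed entry lang=spa line=80710 reason=empty form
malformed entry lang=spa line=134234 reason=empty form
malformed entry lang=spa line=200000 reason=empty lemma'

# Keep everything after "reason=" (reasons contain spaces), then count
# identical reasons and list the most frequent first.
tally=$(printf '%s\n' "$log" | sed -n 's/^malformed entry .*reason=//p' | sort | uniq -c | sort -rn)
echo "$tally"
```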

Notes

  • Language codes are ISO 639-3 (3 lowercase letters)
  • Use unimorph list --available to see all available languages
  • Downloads are atomic: partial downloads won't corrupt your data
  • The first download creates the database at ~/.cache/unimorph/datasets.db
  • Compressed files: Large datasets (Polish, Czech, Ukrainian, Slovak) use .xz compression - handled automatically
  • Git LFS: Very large files (like Czech's full MorfFlex dataset) use Git LFS - also handled automatically
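
A script that accepts language codes from users may want to check their shape before calling the CLI. The following is a small sketch of such a check, based only on the "3 lowercase letters" rule above; is_lang_code is a hypothetical helper, and it does not verify that the code actually exists in UniMorph:

```shell
# Succeeds only for exactly three lowercase ASCII letters (the shape of an
# ISO 639-3 code). It does not check that the code exists in UniMorph.
is_lang_code() {
  # LC_ALL=C keeps the [a-z] ranges limited to ASCII lowercase letters.
  printf '%s\n' "$1" | LC_ALL=C grep -qx '[a-z][a-z][a-z]'
}
```

For example, is_lang_code heb succeeds, while is_lang_code HEB and is_lang_code he both fail.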

See Also

  • list - List available languages
  • update - Update existing downloads
  • delete - Remove downloaded data