sample

Randomly sample entries from a language dataset.

Alias: rand

Synopsis

unimorph sample [OPTIONS] <N>

Description

Samples random entries from a downloaded language dataset. Useful for exploring data, creating test sets, or getting a quick overview of a language's morphology.

Arguments

Argument    Description
<N>         Number of entries to sample

Options

Option               Description
-l, --lang <LANG>    Language code (ISO 639-3)
-s, --seed <SEED>    Seed for reproducible sampling
--by-lemma           Sample complete paradigms instead of random entries
--json               Output as JSON
--tsv                Output as TSV (tab-separated, no headers)

Examples

Random Entries

unimorph sample -l spa 5
LEMMA        FORM         FEATURES
------------------------------------------------------------
tapiar      tapiemos     V;SBJV;PRS;1;PL
apilar      apilando     V;V.CVB;PRS
hablar      hablaste     V;IND;PST;PFV;2;SG;INFM
comer       comieron     V;IND;PST;PFV;3;PL
vivir       viviremos    V;IND;FUT;1;PL

5 sampled entry(ies).

Sample Complete Paradigms

Use --by-lemma to get all forms of randomly selected lemmas:

unimorph sample -l spa 2 --by-lemma

This returns complete paradigms for 2 random lemmas, showing all their inflected forms.
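
To see how many entries each sampled paradigm contributes, the output can be grouped by lemma. The sketch below is a minimal illustration, not part of the tool: forms_per_lemma is a hypothetical helper name, and it assumes that TSV columns follow the table order (lemma, form, features).

```shell
# Hypothetical helper: count forms per lemma in TSV read from stdin.
# Assumption: column 1 is the lemma, as in the table output.
forms_per_lemma() {
  awk -F'\t' '
    NF >= 2 { count[$1]++ }    # tally one form per row
    END { for (lemma in count) printf "%s\t%d\n", lemma, count[lemma] }
  ' | LC_ALL=C sort
}
```

If --tsv can be combined with --by-lemma (an assumption; the examples on this page only show --by-lemma with --json), it could be used as:

    unimorph sample -l spa 2 --by-lemma --tsv | forms_per_lemma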

Reproducible Sampling

Use --seed for reproducible results:

unimorph sample -l spa 5 --seed 42

Running the same command again with the same seed returns the same entries.
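
One way to check this in a script is to run the command twice and compare the output. The helper below is a sketch; is_reproducible is a hypothetical name, and it works with any command, not only unimorph.

```shell
# Hypothetical helper: run the given command twice and succeed only if
# both runs print identical output.
is_reproducible() {
  first=$("$@") || return 2     # 2 = the command itself failed
  second=$("$@") || return 2
  [ "$first" = "$second" ]      # 0 = identical, 1 = different
}
```

A possible invocation, using only options documented above:

    is_reproducible unimorph sample -l spa 5 --seed 42 --tsv && echo "same output"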

JSON Output

unimorph sample -l spa 3 --json
[
  {
    "lemma": "hablar",
    "form": "hablamos",
    "features": {
      "raw": "V;IND;PRS;1;PL",
      "features": ["V", "IND", "PRS", "1", "PL"]
    }
  },
  ...
]

TSV for Scripting

unimorph sample -l spa 10 --tsv > sample.tsv
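
Because the TSV has no header row, each line can be processed directly. As a rough sketch, the function below counts sampled entries by their leading feature tag. It assumes the columns follow the table order (lemma, form, features) and that the first tag is the part of speech, as in the samples on this page. pos_counts is a hypothetical helper name.

```shell
# Hypothetical helper: count TSV rows by the first tag of the feature bundle.
# Assumption: column 3 holds the semicolon-separated features.
pos_counts() {
  awk -F'\t' '
    NF >= 3 {
      split($3, tags, ";")      # tags[1] is the leading tag, e.g. V
      count[tags[1]]++
    }
    END { for (pos in count) printf "%s\t%d\n", pos, count[pos] }
  ' | LC_ALL=C sort
}
```

For instance, it could be applied to the file written above:

    pos_counts < sample.tsv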

Scripting Examples

# Create a test set
unimorph sample -l spa 100 --seed 123 --tsv > test_set.tsv

# Sample paradigms for flashcard generation
unimorph sample -l spa 10 --by-lemma --json > flashcards.json

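# Get random verbs only (assumes TSV columns follow the table order:
# lemma, form, features, so the features are in column 3)
unimorph sample -l spa 50 --tsv | awk -F'\t' '$3 ~ /^V;/' | head -10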
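
Filtering on a tag that is not at the start of the bundle needs a whole-item match, so that a short tag does not match inside a longer one. The following is a sketch under the same assumption that the features are the third TSV column. filter_by_tag is a hypothetical helper name.

```shell
# Hypothetical helper: keep TSV rows whose feature bundle contains the
# given tag as a whole semicolon-separated item.
filter_by_tag() {
  awk -F'\t' -v tag="$1" '
    {
      n = split($3, tags, ";")
      for (i = 1; i <= n; i++)
        if (tags[i] == tag) { print; break }   # print each row at most once
    }
  '
}
```

It could be chained after a sample, for example to keep only plural forms:

    unimorph sample -l spa 50 --tsv | filter_by_tag PL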

Notes

  • Without --seed, results differ on each run
  • --by-lemma usually returns more than N entries, because it returns every form of N lemmas
  • Large values of N may take longer on big datasets

See Also

  • search - Find specific entries
  • inflect - Look up forms for a known lemma