sample
Randomly sample entries from a language dataset.
Alias: rand
Synopsis
unimorph sample [OPTIONS] <N>
Description
Samples random entries from a downloaded language dataset. Useful for exploring data, creating test sets, or getting a quick overview of a language's morphology.
Arguments
| Argument | Description |
|---|---|
| <N> | Number of entries to sample |
Options
| Option | Description |
|---|---|
| -l, --lang <LANG> | Language code (ISO 639-3) |
| -s, --seed <SEED> | Seed for reproducible sampling |
| --by-lemma | Sample complete paradigms instead of random entries |
| --json | Output as JSON |
| --tsv | Output as TSV (tab-separated, no headers) |
Examples
Random Entries
unimorph sample -l spa 5
LEMMA FORM FEATURES
------------------------------------------------------------
tapiar tapiemos V;SBJV;PRS;1;PL
apilar apilando V;V.CVB;PRS
hablar hablaste V;IND;PST;PFV;2;SG;INFM
comer comieron V;IND;PST;PFV;3;PL
vivir viviremos V;IND;FUT;1;PL
5 sampled entry(ies).
Sample Complete Paradigms
Use --by-lemma to get all forms of randomly selected lemmas:
unimorph sample -l spa 2 --by-lemma
This returns complete paradigms for 2 random lemmas, showing all their inflected forms.
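To make the difference between sampling entries and sampling lemmas concrete, here is a minimal Python sketch of the idea. It is an illustration only, not unimorph's actual implementation: the function name `sample_by_lemma` and the in-memory list of (lemma, form, features) tuples are assumptions made for this example.

```python
import random
from collections import defaultdict


def sample_by_lemma(entries, n, seed=None):
    """Pick n lemmas at random and return every entry that belongs to them.

    Hypothetical helper for illustration; not part of unimorph.
    """
    by_lemma = defaultdict(list)
    for lemma, form, features in entries:
        by_lemma[lemma].append((lemma, form, features))

    rng = random.Random(seed)
    # Sort the lemmas so the candidate order does not depend on input order.
    chosen = rng.sample(sorted(by_lemma), min(n, len(by_lemma)))
    return [entry for lemma in chosen for entry in by_lemma[lemma]]


# Tiny in-memory dataset reusing forms from the examples on this page.
entries = [
    ("hablar", "hablamos", "V;IND;PRS;1;PL"),
    ("hablar", "hablaste", "V;IND;PST;PFV;2;SG;INFM"),
    ("comer", "comieron", "V;IND;PST;PFV;3;PL"),
    ("vivir", "viviremos", "V;IND;FUT;1;PL"),
]

picked = sample_by_lemma(entries, 2, seed=42)
print(len({lemma for lemma, _, _ in picked}))  # 2 distinct lemmas
```

Because every form of each chosen lemma is returned, the number of entries can exceed the number of lemmas requested.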
Reproducible Sampling
Use --seed for reproducible results:
unimorph sample -l spa 5 --seed 42
Running the command again with the same seed, language, and N returns the same entries, as long as the downloaded dataset has not changed.
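The general principle behind seeded sampling can be shown with Python's standard `random` module. This is a sketch of the concept only; how unimorph seeds its own generator is not documented here, and `seeded_sample` is a hypothetical helper.

```python
import random


def seeded_sample(items, n, seed):
    """Return n items chosen with a private generator seeded by `seed`."""
    rng = random.Random(seed)  # separate generator; global random state is untouched
    return rng.sample(items, n)


forms = ["tapiemos", "apilando", "hablaste", "comieron", "viviremos"]

first = seeded_sample(forms, 3, seed=42)
second = seeded_sample(forms, 3, seed=42)
print(first == second)  # True
```

The two calls agree because each starts from an identically seeded generator and draws from the same input in the same order.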
JSON Output
unimorph sample -l spa 3 --json
[
{
"lemma": "hablar",
"form": "hablamos",
"features": {
"raw": "V;IND;PRS;1;PL",
"features": ["V", "IND", "PRS", "1", "PL"]
}
},
...
]
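The JSON structure above can be consumed from a script. The following Python sketch parses one entry shaped like the output shown and extracts the lemma, form, and feature tags. The `summarize` helper is hypothetical, and the sample string simply repeats the values from the example rather than real tool output.

```python
import json

# One entry shaped like the --json output shown above.
raw = """
[
  {
    "lemma": "hablar",
    "form": "hablamos",
    "features": {
      "raw": "V;IND;PRS;1;PL",
      "features": ["V", "IND", "PRS", "1", "PL"]
    }
  }
]
"""


def summarize(entries):
    """Return (lemma, form, tag list) tuples from parsed JSON entries."""
    return [(e["lemma"], e["form"], e["features"]["features"]) for e in entries]


rows = summarize(json.loads(raw))
for lemma, form, tags in rows:
    print(f"{lemma} -> {form} [{', '.join(tags)}]")
# prints: hablar -> hablamos [V, IND, PRS, 1, PL]
```

To read a saved file instead, replace `json.loads(raw)` with `json.load(open("flashcards.json", encoding="utf-8"))`.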
TSV for Scripting
unimorph sample -l spa 10 --tsv > sample.tsv
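A saved TSV file can then be loaded with Python's `csv` module. This sketch assumes the three columns appear in the same order as the table output (lemma, form, features), which this page does not state explicitly. `read_tsv` and `with_tag` are hypothetical helpers.

```python
import csv
import io


def read_tsv(text):
    """Parse headerless TSV text into (lemma, form, features) tuples.

    Assumes three tab-separated columns in that order.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [tuple(row) for row in reader if len(row) == 3]


def with_tag(rows, tag):
    """Keep rows whose semicolon-separated feature string contains `tag`."""
    return [row for row in rows if tag in row[2].split(";")]


# Sample text reusing two entries from the table example on this page.
sample = (
    "hablar\thablaste\tV;IND;PST;PFV;2;SG;INFM\n"
    "comer\tcomieron\tV;IND;PST;PFV;3;PL\n"
)

rows = read_tsv(sample)
print(len(with_tag(rows, "PL")))  # 1
```

For a real file, pass `open("sample.tsv", encoding="utf-8").read()` to `read_tsv`.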
Scripting Examples
# Create a test set
unimorph sample -l spa 100 --seed 123 --tsv > test_set.tsv
# Sample paradigms for flashcard generation
unimorph sample -l spa 10 --by-lemma --json > flashcards.json
# Get random verbs only (match the features column, which is last, not the start of the line)
unimorph sample -l spa 50 --tsv | awk -F'\t' '$NF ~ /^V;/' | head -10
Notes
- Without --seed, results differ on each run
- --by-lemma returns more entries than N (all forms of N lemmas)
- Large N values may take longer for big datasets