Repository

The Repository manages data downloads and caching, and provides access to the underlying store.

Creating a Repository

use unimorph_core::Repository;

// Default location (~/.cache/unimorph)
let repo = Repository::open_default()?;

// Custom location
let repo = Repository::open("/path/to/data")?;

// Custom location with PathBuf
use std::path::PathBuf;
let path = PathBuf::from("/path/to/data");
let repo = Repository::open(&path)?;

Downloading Data

Download a language dataset from UniMorph:

// Download (async)
repo.download("heb").await?;

// Force re-download
repo.download_with_options("heb", true).await?;

Compressed Files and Git LFS

Because of GitHub's file size limits, some large datasets are distributed as compressed files or through Git LFS:

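| Format     | Languages                 | Notes                     |
|------------|---------------------------|---------------------------|
| .xz (LZMA) | ces, pol, slk, ukr        | Best compression for text |
| .zip       | rus (segmentations), san  | Archive format            |
| Git LFS    | ces (full MorfFlex)       | For files > 100MB         |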

The repository automatically:

  1. Tries compressed versions first (.xz, .gz)
  2. Falls back to uncompressed if not found
  3. Detects Git LFS pointers and fetches the actual file from the media endpoint
  4. Decompresses transparently before importing

No special handling is needed; just call download() as usual.
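The pointer detection in step 3 can be illustrated with a small standalone check. Under the Git LFS specification, a pointer is a short text file whose first line is a `version` key, followed by `oid sha256:...` and `size ...` lines. The code below is a minimal sketch under that assumption, not the crate's actual implementation; `LfsPointer` and `parse_lfs_pointer` are hypothetical names introduced here for illustration.

```rust
/// Fields of a Git LFS pointer file (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub struct LfsPointer {
    /// Hex-encoded SHA-256 of the real file.
    pub oid: String,
    /// Size of the real file in bytes.
    pub size: u64,
}

/// Returns `Some` if `content` looks like a Git LFS pointer rather than real data.
pub fn parse_lfs_pointer(content: &str) -> Option<LfsPointer> {
    let mut lines = content.lines();

    // The spec requires the `version` key to be the first line.
    let first = lines.next()?;
    if !first.starts_with("version https://git-lfs.github.com/spec/") {
        return None;
    }

    let mut oid = None;
    let mut size = None;
    for line in lines {
        if let Some(rest) = line.strip_prefix("oid sha256:") {
            oid = Some(rest.trim().to_string());
        } else if let Some(rest) = line.strip_prefix("size ") {
            size = rest.trim().parse::<u64>().ok();
        }
    }

    // Both keys must be present for the content to count as a pointer.
    Some(LfsPointer { oid: oid?, size: size? })
}
```

A downloader could run a check like this on a fetched body and, when it returns `Some`, request the real file from the LFS media endpoint instead of importing the pointer text.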

Parse Reporting

When parsing downloaded data, use Entry::parse_tsv_with_report() for detailed diagnostics:

use unimorph_core::{Entry, ParseReport, CompressionFormat};

let content = "lemma\tform\tV;IND\nbad line\nlemma2\tform2\tN;SG\n";
let (entries, report) = Entry::parse_tsv_with_report(content);

println!("Valid entries: {}", report.valid_entries);
println!("Blank lines: {}", report.blank_lines);
println!("Malformed: {}", report.malformed_count);

// Inspect malformed entries (first 10 stored)
for entry in &report.malformed {
    println!(
        "  Line {}: {} - {}",
        entry.line_num, entry.reason, entry.content
    );
}

The ParseReport includes:

| Field           | Type                | Description                   |
|-----------------|---------------------|-------------------------------|
| valid_entries   | usize               | Successfully parsed entries   |
| blank_lines     | usize               | Empty lines (not an error)    |
| malformed_count | usize               | Total entries that failed     |
| malformed       | Vec<MalformedEntry> | Details for first 10 failures |
| compression     | CompressionFormat   | Source file format            |
| from_lfs        | bool                | Whether fetched via Git LFS   |
| filename        | Option<String>      | Source filename(s)            |
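To show how the counters relate to each other, here is a minimal standalone sketch of a line scanner that fills in a similar report. It is not the crate's parser: `Report`, `MalformedLine`, and `scan_tsv` are hypothetical names, and the rules (exactly three tab-separated columns, 1-based line numbers, at most 10 stored failures) are assumptions based on the description above.

```rust
/// Assumed cap on stored failure details, matching "first 10 failures".
const MAX_STORED: usize = 10;

/// Details about one line that failed to parse (hypothetical type).
#[derive(Debug)]
pub struct MalformedLine {
    pub line_num: usize,
    pub reason: String,
    pub content: String,
}

/// Simplified stand-in for the counters in the table above.
#[derive(Debug, Default)]
pub struct Report {
    pub valid_entries: usize,
    pub blank_lines: usize,
    pub malformed_count: usize,
    pub malformed: Vec<MalformedLine>,
}

/// Scans TSV content and classifies each line as valid, blank, or malformed.
pub fn scan_tsv(content: &str) -> Report {
    let mut report = Report::default();
    for (idx, line) in content.lines().enumerate() {
        if line.trim().is_empty() {
            report.blank_lines += 1;
            continue;
        }
        // Assumption: a valid entry is lemma<TAB>form<TAB>features.
        let columns = line.split('\t').count();
        if columns == 3 {
            report.valid_entries += 1;
            continue;
        }
        // Every failure is counted, but only the first few are stored.
        report.malformed_count += 1;
        if report.malformed.len() < MAX_STORED {
            report.malformed.push(MalformedLine {
                line_num: idx + 1,
                reason: format!("expected 3 tab-separated columns, found {}", columns),
                content: line.to_string(),
            });
        }
    }
    report
}
```

In this sketch `malformed_count` can exceed `malformed.len()`, which is why the two fields are reported separately.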

The CompressionFormat enum:

pub enum CompressionFormat {
    None,   // Plain text
    Xz,     // .xz (LZMA)
    Gzip,   // .gz
    Zip,    // .zip archive
}
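One common way to decide which variant applies is to inspect the leading magic bytes of the downloaded data instead of trusting the file extension. The sketch below mirrors the enum locally so that it compiles on its own; `sniff_format` is a hypothetical helper, and the crate may detect formats differently (for example, from the file name).

```rust
/// Local mirror of the enum above, so the sketch is self-contained.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CompressionFormat {
    None,
    Xz,
    Gzip,
    Zip,
}

/// Guesses the compression format from the first bytes of a file.
pub fn sniff_format(bytes: &[u8]) -> CompressionFormat {
    // Well-known file signatures for each container format.
    const XZ_MAGIC: &[u8] = &[0xFD, b'7', b'z', b'X', b'Z', 0x00];
    const GZIP_MAGIC: &[u8] = &[0x1F, 0x8B];
    const ZIP_MAGIC: &[u8] = b"PK\x03\x04";

    if bytes.starts_with(XZ_MAGIC) {
        CompressionFormat::Xz
    } else if bytes.starts_with(GZIP_MAGIC) {
        CompressionFormat::Gzip
    } else if bytes.starts_with(ZIP_MAGIC) {
        CompressionFormat::Zip
    } else {
        // Anything else is treated as plain text.
        CompressionFormat::None
    }
}
```

Checking the signature first lets the same code path handle a file whose extension does not match its actual contents.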

Accessing the Store

Get the underlying store for queries:

let store = repo.store();

let forms = store.inflect("heb", "כתב")?;

Checking Cached Languages

// List cached languages
let languages = repo.cached_languages()?;
for lang in &languages {
    println!("Cached: {}", lang);
}

// Check if specific language is cached
if languages.iter().any(|l| l.as_ref() == "heb") {
    println!("Hebrew is cached");
}

Data Directory

The repository manages a data directory containing:

~/.cache/unimorph/
├── datasets.db              # SQLite database
└── available_languages.json # Cached API response

Get the data directory:

let data_dir = repo.data_dir();
println!("Data stored in: {}", data_dir.display());

Full Example

use unimorph_core::Repository;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Open repository
    let repo = Repository::open_default()?;
    
    // Download Hebrew if not cached
    let cached = repo.cached_languages()?;
    if !cached.iter().any(|l| l.as_ref() == "heb") {
        println!("Downloading Hebrew...");
        repo.download("heb").await?;
    }
    
    // Query the data
    let store = repo.store();
    let forms = store.inflect("heb", "כתב")?;
    
    println!("Found {} forms of כתב:", forms.len());
    for entry in &forms {
        println!("  {} - {}", entry.form, entry.features);
    }
    
    Ok(())
}

Error Handling

use unimorph_core::{Repository, Error};

async fn download_language(repo: &Repository, lang: &str) -> anyhow::Result<()> {
    match repo.download(lang).await {
        Ok(()) => println!("Downloaded {}", lang),
        Err(Error::Network(e)) => {
            println!("Network error: {}", e);
            println!("Check your connection and try again");
        }
        Err(Error::InvalidLanguage(l)) => {
            println!("Invalid language code: {}", l);
        }
        Err(e) => return Err(e.into()),
    }
    Ok(())
}

Async Runtime

Download operations are async and require a runtime:

use unimorph_core::Repository;

// Option 1: with the tokio::main attribute
#[tokio::main]
async fn main() {
    let repo = Repository::open_default().unwrap();
    repo.download("heb").await.unwrap();
}

// Option 2: with an explicit runtime and block_on (use instead of Option 1)
fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap();
    let repo = Repository::open_default().unwrap();
    rt.block_on(repo.download("heb")).unwrap();
}