finntk.omor.extract

Functions for extracting lemmas from OMorFi analyses.

finntk.omor.extract.extract_lemmas(word_form)[source]

Extract lemmas specifically mentioned by OMorFi.

finntk.omor.extract.extract_lemmas_combs(word_form)[source]

Works like extract_lemmas, but also tries to combine adjacent subwords to make lemmas which may be out of vocabulary for OMorFi.

Note that this will overgenerate (by design). For example, voileipäkakku will generate voi, voileipä and voileipäkakku as desired, but will also spuriously generate leipäkakku.
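The overgeneration follows directly from combining every run of adjacent subwords. The sketch below illustrates that idea only. It does not call finntk or OMorFi, and `adjacent_combinations` and the hand-written segmentation are hypothetical names and inputs chosen for illustration, not the library's implementation.

```python
def adjacent_combinations(subwords):
    """Return every concatenation of one or more adjacent subwords.

    Illustrative helper, not part of finntk.
    """
    combos = []
    for start in range(len(subwords)):
        for end in range(start + 1, len(subwords) + 1):
            # Join the contiguous slice subwords[start:end] into one candidate.
            combos.append("".join(subwords[start:end]))
    return combos


# Assumed segmentation of voileipäkakku, written by hand for this example.
candidates = adjacent_combinations(["voi", "leipä", "kakku"])
for candidate in candidates:
    print(candidate)
```

Because every contiguous run is kept, the candidates include the desired voileipä and voileipäkakku as well as the spurious leipäkakku.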

finntk.omor.extract.extract_lemmas_recurs(word_form)[source]

Works like extract_lemmas, but also tries to expand each lemma into more lemmas. This helps in some cases, but can overgenerate even more. For example, synnyinkaupunkini will generate synty, kaupunki, synnyinkaupunki, synnyin and syntyä.

finntk.omor.extract.extract_lemmas_span(word_form)[source]

Works like extract_lemmas, but doesn’t extract individual subwords. However, if a word is recognised by OMorFi only as a compound word, it will glue the parts together, lemmatising only the last subword. This means it extracts only lemmas which span the whole word form.
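The gluing step can be pictured as keeping the surface form of every subword except the last and appending the last subword's lemma. The following is a minimal sketch of that idea under stated assumptions: `glue_span_lemma` is a hypothetical helper, and the segmentation and lemma are supplied by hand rather than taken from an OMorFi analysis.

```python
def glue_span_lemma(subword_forms, last_lemma):
    """Build a whole-word lemma from compound parts.

    Illustrative helper, not part of finntk. All parts except the last
    keep their surface form; the last part is replaced by its lemma.
    """
    return "".join(subword_forms[:-1]) + last_lemma


# Assumed analysis of synnyinkaupunkini: synnyin + kaupunkini,
# where kaupunkini lemmatises to kaupunki.
span_lemma = glue_span_lemma(["synnyin", "kaupunkini"], "kaupunki")
print(span_lemma)
```

Under these assumptions the result is synnyinkaupunki, a single lemma spanning the whole word form.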

finntk.omor.extract.extract_true_lemmas_span(word_form, norm_func=<function iden_func>, return_pos=False)[source]

Works like extract_lemmas_span, but uses true_lemmatise. It also returns some of the features associated with each lemma.

finntk.omor.extract.lemma_intersect(toks1, toks2)[source]

Given two iterables of tokens, return the intersection of their lemmas. This can work as a simple, high-recall method of matching, for example, two inflected noun phrases.
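To show why intersecting lemmas matches differently inflected phrases, here is a self-contained sketch that replaces OMorFi with a tiny hand-written lookup table. `TOY_LEMMAS`, `toy_lemmas` and `toy_lemma_intersect` are hypothetical names, and treating each side as a single bag of lemmas is an assumption about the behaviour, not a description of finntk's implementation.

```python
# Hand-written stand-in for a morphological analyser (assumption).
TOY_LEMMAS = {
    "punaisessa": {"punainen"},
    "punaisen": {"punainen"},
    "talossa": {"talo"},
    "talon": {"talo"},
    "autossa": {"auto"},
}


def toy_lemmas(token):
    """Return the set of candidate lemmas for a token (toy lookup)."""
    return TOY_LEMMAS.get(token, {token})


def toy_lemma_intersect(toks1, toks2):
    """Intersect the lemma sets of two token sequences."""
    lemmas1 = set().union(*(toy_lemmas(tok) for tok in toks1))
    lemmas2 = set().union(*(toy_lemmas(tok) for tok in toks2))
    return lemmas1 & lemmas2


# "punaisessa talossa" (inessive) vs "punaisen talon" (genitive).
shared = toy_lemma_intersect(["punaisessa", "talossa"], ["punaisen", "talon"])
print(sorted(shared))
```

The two phrases share no surface tokens, yet their lemma sets overlap completely, which is what makes this kind of matching high recall.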