It depends on what you mean by «has an exact form». If you mean «has a single, invariant form», the answer would most likely be «no». According to several definitions of «stem», Latin words (including nouns) may have more than one stem (or alternatively, you could say that the stem of a word may have more than one form: it’s a bit difficult to distinguish these two possibilities). The use of different stems for different forms of the word is called «heteroclisy».
Different definitions of «stem»
The concept of a «stem» is not completely clear/unambiguous. I’m aware of at least three distinct definitions:
1. Etymological/historical linguistics/comparative IE linguistics: The stem of līber in a historical sense is lībero-; the nominative singular form is historically derived from lībero-s via syncope and loss of s after r (as explained in the «Type A.» section of Alex B.’s answer to Why do some 2nd decl. «-er» adjectives and nouns drop the «e» in the stem?).
   This word happens to be built on a single stem/single form in a historical sense, but that is not true for all words. IE «heteroclite» nouns existed; an example of a descendant in Latin is iecur «liver». This type of heteroclite noun is mentioned in the answers to Does any Latin noun originally end in -r?
2. Synchronic analysis of Latin morphophonology according to certain modern linguistic theories: The stem of līber is a bit difficult to analyze; one option would be to treat this as a synchronically heteroclite adjective in Classical Latin, with the masculine nominative being built on a consonant-final stem līber- and the other masculine forms being built on a vowel-final stem lībero-. This seems to be the analysis that would be adopted by Cser 2016, who refers to the masculine nominative singular piger as being built on a stem pigr-, with no stem-final vowel (p. 129), despite the use of [ō] in the ablative singular form pigrō.
3. Draconis’s answer indicates that «stem» has been used in some sources to refer to a different concept that is similar to what the other two approaches would call the «root» or «base» of a noun. I’m not sure which specific sources use that definition. It might have lasted longer in pedagogical contexts than in linguistic texts.
Heteroclisy in more detail
One clear example of a heteroclite Latin noun is vās: the ablative singular vāse is built on the stem vās-, while the genitive plural vāsōrum is built on the stem vāso-. (Historically, the explanation in this case is that the different stem is derived from a «collateral» form of the noun; Lewis and Short says that in anteclassical Latin, vāsum existed as a separate nominative singular form. But in Classical Latin, the stems vās- and vāso- came to be used in complementary contexts, which allows vāse and vāsōrum to be analyzed as forms of a single noun. I’ve generally seen this noun described as having the stem vās- in the singular and vāso- in the plural, although the nominative/accusative plural vāsa is formally ambiguous: it would look the same regardless of whether it represented vās- + -a or vāso- + -a).
A number of other Latin nouns may or may not be analyzed as «heteroclite» depending on how you think of the idea of a «stem». This issue came up in a discussion I had with Joonas Ilmavirta beneath the answer that he posted to the question What consonants can a noun stem end in? Many third-declension nouns show an i in the genitive plural (before the ending -um) but not in most of the other forms (e.g. vermis has the genitive plural vermium, which seems to be built on the stem vermi-, but the accusative vermem and the ablative verme, which seem to be built on the stem verm-).
The stems of nominative singular forms
The situation gets even more complicated if you try to include nominative singular forms in your analysis; for this reason, I’ve seen a number of pedagogical texts that don’t attempt to relate the nominative singular form to a stem, but that just present the nominative singular as a form that must be memorized as a whole. Nonetheless, from an etymological and possibly even from a synchronic perspective, almost all Latin nominative singular forms can in fact be analyzed as being composed of a stem and a suffix.
According to the theories that use definitions 1 or 2 of the term «stem», Latin has a nominative singular suffix with a fairly small number of variant forms or allomorphs: mainly -s, -∅, and -m. (Cser 2016 lists one additional allomorph, -ēs, to account for certain third-declension nouns (p. 127)). The form of this suffix is conditioned morphologically by the gender of the noun, and conditioned phonologically by the form of the end of the noun’s stem (for example, nouns with a stem ending in a consistently take -∅).
Note that I did not include -r in my list of forms of the nominative singular suffix. As far as I know, Latin has no words ending in -er that would be analyzed as having a stem ending in -e. If you’re familiar with that fact, you can determine that līber does not have the stem lībe.
For some words, the stem of the nominative singular can be analyzed as being the same in an «underlying» way as the stem of the other forms, even though it appears different on the «surface». For example, the nominative singular form rēx can be seen as being built synchronically on the same stem rēg- as the other forms of this noun. The explanation is that rēg- + -s becomes rēx according to known rules for combining Latin sounds: g turns into k before s.
I remember encountering theories that take a different approach to explaining variant phonetic forms like this (I forget where, and which language was being discussed): we could say instead that the stem of the noun rēx is «stored with» two allomorphs, rēg and rēk, and that there are rules that take these two allomorphs as an input and select the appropriate allomorph (rather than taking rēg as an input and turning its g into a k). This is a bit of a hairsplitting distinction in most contexts, however. Cser 2016 says that the «rule-based» framework adopted in that dissertation is simply for convenience.
The «stem» (def 1 or 2) is viewed as including the «theme vowel»
According to definitions 1 or 2, the «stem» of a noun includes the «theme vowel»/«thematic vowel». For example, servus is an «o-stem» noun, so the stem would end in o: servo-. (If I remember correctly, from an etymological perspective, I think only second-declension nouns have stems that end in a PIE «theme vowel»; the first-declension a-stem nouns and fourth-declension u-stem nouns come from PIE forms that ended in a consonant that was vocalized in the history of Latin). Allen and Greenough’s grammar, which seems to use a very etymology-based definition of «stem», says that servus has the «base» serv- and the «stem» servo- (27a).
The form we get when we truncate the vowel (serv-) would be called the «root».
According to definition 2, Latin stems in this sense do show some allomorphy: for example, in the nominative singular form of servus we find [servu] (with short u) rather than [servo]. This is explainable in terms of diachronic sound changes, but the correct synchronic analysis isn’t necessarily the same as the diachronic explanation. The stem vowel may (appear to) disappear entirely in some forms, such as the ablative/dative plural servīs. One possible synchronic analysis (given by Cser 2016) is that servīs is divisible into a vowel-final stem servo- and a vowel-initial suffix -īs, with a morphophonological process of vowel deletion resulting in the lack of [o] or [ō] in the surface form servīs. If I remember correctly, the historical development is a bit different: I think that the ī in forms like this comes from earlier diphthongs oi (or for a-stem nouns, ai), which were weakened outside of the first syllable to ei, which later monophthongized to [ī].
A similar analysis is often used in the linguistic analysis of Romance languages, the modern descendants of Latin. You can see an overview of this approach applied to Spanish in Bermudez-Otero 2012.
Sources
Cser, András. 2016. «Aspects of the phonology and morphology of Classical Latin».
Bermudez-Otero, Ricardo. 2012. «The Spanish lexicon stores stems with theme vowels, not roots with inflectional class features».
Based on various answers on Stack Overflow and blogs I’ve come across, this is the method I’m using, and it seems to return real words quite well. The idea is to split the incoming text into an array of words (use whichever method you’d like), and then find the parts of speech (POS) for those words and use that to help stem and lemmatize the words.
Your sample above doesn’t work too well, because the POS can’t be determined. However, if we use a real sentence, things work much better.
import nltk
from nltk.corpus import wordnet

lmtzr = nltk.WordNetLemmatizer().lemmatize


def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to the matching WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Anything else is treated as a noun.
        return wordnet.NOUN


def normalize_text(text):
    # Tag each token, then lemmatize it using its part of speech.
    word_pos = nltk.pos_tag(nltk.word_tokenize(text))
    lemm_words = [lmtzr(sw[0], get_wordnet_pos(sw[1])) for sw in word_pos]
    return [x.lower() for x in lemm_words]


print(normalize_text('cats running ran cactus cactuses cacti community communities'))
# ['cat', 'run', 'ran', 'cactus', 'cactuses', 'cacti', 'community', 'community']
print(normalize_text('The cactus ran to the community to see the cats running around cacti between communities.'))
# ['the', 'cactus', 'run', 'to', 'the', 'community', 'to', 'see', 'the', 'cat', 'run', 'around', 'cactus', 'between', 'community', '.']
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.
A computer program or subroutine that stems words may be called a stemming program, stemming algorithm, or stemmer.
Examples
A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish. The stem need not be a word; for example, the Porter algorithm reduces argue, argued, argues, arguing, and argus to the stem argu.
History
The first published stemmer was written by Julie Beth Lovins in 1968.[1] This paper was remarkable for its early date and had great influence on later work in this area. Her paper refers to three earlier major attempts at stemming algorithms: one by Professor John W. Tukey of Princeton University; one developed at Harvard University by Michael Lesk, under the direction of Professor Gerard Salton; and a third developed by James L. Dolby of R and D Consultants, Los Altos, California.
A later stemmer was written by Martin Porter and was published in the July 1980 issue of the journal Program. This stemmer was very widely used and became the de facto standard algorithm used for English stemming. Dr. Porter received the Tony Kent Strix award in 2000 for his work on stemming and information retrieval.
Many implementations of the Porter stemming algorithm were written and freely distributed; however, many of these implementations contained subtle flaws. As a result, these stemmers did not match their potential. To eliminate this source of error, Martin Porter released an official free software (mostly BSD-licensed) implementation[2] of the algorithm around the year 2000. He extended this work over the next few years by building Snowball, a framework for writing stemming algorithms, and implemented an improved English stemmer together with stemmers for several other languages.
The Paice-Husk Stemmer was developed by Chris D Paice at Lancaster University in the late 1980s. It is an iterative stemmer and features an externally stored set of stemming rules. The standard set of rules provides a ‘strong’ stemmer and may specify the removal or replacement of an ending. The replacement technique avoids the need for a separate stage in the process to recode or provide partial matching. Paice also developed a direct measurement for comparing stemmers based on counting the over-stemming and under-stemming errors.
Algorithms
Unsolved problem in computer science:
Is there a perfect stemming algorithm for the English language?
There are several types of stemming algorithms, which differ with respect to performance and accuracy and in how certain stemming obstacles are overcome.
A simple stemmer looks up the inflected form in a lookup table. The advantages of this approach are that it is simple, fast, and easily handles exceptions. The disadvantages are that all inflected forms must be explicitly listed in the table: new or unfamiliar words are not handled, even if they are perfectly regular (e.g. cats ~ cat), and the table may be large. For languages with simple morphology, like English, table sizes are modest, but highly inflected languages like Turkish may have hundreds of potential inflected forms for each root.
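As a minimal sketch of the lookup approach, the table can be an ordinary dictionary. The entries and the function name `lookup_stem` below are hypothetical and chosen only for illustration; a real table would be far larger.

```python
# Hypothetical, hand-written table mapping inflected forms to stems.
LOOKUP_TABLE = {
    "cats": "cat",
    "ran": "run",
    "running": "run",
    "fishes": "fish",
}


def lookup_stem(word, table=LOOKUP_TABLE):
    """Return the stem listed for `word`, or None if the word is not in the table."""
    return table.get(word.lower())


print(lookup_stem("ran"))   # run   (an exception, handled because it is listed)
print(lookup_stem("dogs"))  # None  (perfectly regular, but not listed)
```

The second call shows the drawback described above: a regular form that is missing from the table is simply not handled.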
A lookup approach may use preliminary part-of-speech tagging to avoid overstemming.[3]
The production technique
The lookup table used by a stemmer is generally produced semi-automatically. For example, if the word is «run», then the inverted algorithm might automatically generate the forms «running», «runs», «runned», and «runly». The last two forms are valid constructions, but they are unlikely.
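One way such a table might be generated is sketched below. The suffix list and function names are assumptions made for illustration, and the generator is deliberately naive: it only appends suffixes, so it does not double consonants as «running» would require.

```python
# Hypothetical suffix list; a real generator would encode much more morphology.
SUFFIXES = ["s", "ed", "ing", "ly"]


def generate_forms(stem):
    """Naively generate candidate inflected forms by appending each suffix."""
    return [stem + suffix for suffix in SUFFIXES]


def build_lookup_table(stems):
    """Invert the generated forms into an inflected-form -> stem table."""
    table = {}
    for stem in stems:
        for form in generate_forms(stem):
            table.setdefault(form, stem)  # keep the first stem seen for a form
    return table


table = build_lookup_table(["fish", "walk"])
print(table["walked"])    # walk
print("walkly" in table)  # True: an unlikely form that was generated anyway
```

Unlikely forms such as «walkly» end up in the table, which is one reason the process is semi-automatic rather than fully automatic.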
Suffix-stripping algorithms
Suffix stripping algorithms do not rely on a lookup table that consists of inflected forms and root form relations. Instead, a typically smaller list of «rules» is stored which provides a path for the algorithm, given an input word form, to find its root form. Some examples of the rules include:
- if the word ends in ‘ed’, remove the ‘ed’
- if the word ends in ‘ing’, remove the ‘ing’
- if the word ends in ‘ly’, remove the ‘ly’
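The three rules listed above can be sketched as follows. The function name `strip_suffix` and the check that something remains after stripping are assumptions added for illustration, not part of any particular published stemmer.

```python
# The three example suffixes, tried in the order listed.
RULES = ["ed", "ing", "ly"]


def strip_suffix(word):
    """Remove the first matching suffix, provided something is left over."""
    for suffix in RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


print(strip_suffix("walked"))   # walk
print(strip_suffix("quickly"))  # quick
print(strip_suffix("running"))  # runn  (the doubled consonant is not repaired)
print(strip_suffix("ran"))      # ran   (exceptional forms are left untouched)
```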
Suffix stripping approaches enjoy the benefit of being much simpler to maintain than brute force algorithms, assuming the maintainer is sufficiently knowledgeable in the challenges of linguistics and morphology and encoding suffix stripping rules. Suffix stripping algorithms are sometimes regarded as crude given the poor performance when dealing with exceptional relations (like ‘ran’ and ‘run’). The solutions produced by suffix stripping algorithms are limited to those lexical categories which have well known suffixes with few exceptions. This, however, is a problem, as not all parts of speech have such a well formulated set of rules. Lemmatisation attempts to improve upon this challenge.
Prefix stripping may also be implemented. Of course, not all languages use prefixing or suffixing.
Additional algorithm criteria
Suffix stripping algorithms may differ in results for a variety of reasons. One such reason is whether the algorithm requires the output word to be a real word in the given language. Some approaches do not require the word to actually exist in the language lexicon (the set of all words in the language). Alternatively, some suffix stripping approaches maintain a database (a large list) of all known morphological word roots that exist as real words. These approaches check the list for the existence of the term prior to making a decision. Typically, if the term does not exist, alternate action is taken. This alternate action may involve several other criteria. The non-existence of an output term may serve to cause the algorithm to try alternate suffix stripping rules.
It can be the case that two or more suffix stripping rules apply to the same input term, which creates an ambiguity as to which rule to apply. The algorithm may assign (by human hand or stochastically) a priority to one rule or another. Or the algorithm may reject one rule application because it results in a non-existent term whereas the other overlapping rule does not. For example, given the English term friendlies, the algorithm may identify the ies suffix and apply the appropriate rule and achieve the result of friendl. Friendl is likely not found in the lexicon, and therefore the rule is rejected.
One improvement upon basic suffix stripping is the use of suffix substitution. Similar to a stripping rule, a substitution rule replaces a suffix with an alternate suffix. For example, there could exist a rule that replaces ies with y. How this affects the algorithm depends on the algorithm’s design. To illustrate, the algorithm may identify that both the ies suffix stripping rule and the suffix substitution rule apply. Since the stripping rule results in a term that does not exist in the lexicon, but the substitution rule does not, the substitution rule is applied instead. In this example, friendlies becomes friendly instead of friendl.
Diving further into the details, a common technique is to apply rules in a cyclical fashion (recursively, as computer scientists would say). After applying the suffix substitution rule in this example scenario, a second pass is made to identify matching rules on the term friendly, where the ly stripping rule is likely identified and accepted. In summary, friendlies becomes (via substitution) friendly which becomes (via stripping) friend.
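The friendlies example can be traced with a small sketch. The two-word lexicon and the rule list are hypothetical stand-ins for a real lexicon and rule set, and accepting the first rule whose output is in the lexicon is only one possible design.

```python
# Hypothetical mini-lexicon; a real system would use a large word list.
LEXICON = {"friend", "friendly"}

# Each rule is (suffix, replacement); an empty replacement is plain stripping.
RULES = [
    ("ies", ""),   # stripping rule:    friendlies -> friendl
    ("ies", "y"),  # substitution rule: friendlies -> friendly
    ("ly", ""),    # stripping rule:    friendly   -> friend
]


def apply_once(word):
    """Apply the first rule whose output is in the lexicon; None if none is accepted."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate
    return None


def stem(word):
    """Repeat rule application until no rule yields a lexicon word."""
    while True:
        result = apply_once(word)
        if result is None:
            return word
        word = result


print(apply_once("friendlies"))  # friendly (friendl was rejected)
print(stem("friendlies"))        # friend
```

Every accepted rule shortens the word, so the loop in `stem` always terminates.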
This example also helps illustrate the difference between a rule-based approach and a brute force approach. In a brute force approach, the algorithm would search for friendlies in the set of hundreds of thousands of inflected word forms and ideally find the corresponding root form friend. In the rule-based approach, the three rules mentioned above would be applied in succession to converge on the same solution. Chances are that the rule-based approach would be slower, as lookup algorithms have direct access to the solution, while a rule-based algorithm must try several options, and combinations of them, and then choose which result seems to be the best.
Lemmatisation algorithms
A more complex approach to the problem of determining a stem of a word is lemmatisation. This process involves first determining the part of speech of a word and then applying different normalization rules for each part of speech. The part of speech is detected before attempting to find the root because, for some languages, the stemming rules change depending on a word’s part of speech.
This approach is highly conditional upon obtaining the correct lexical category (part of speech). While there is overlap between the normalization rules for certain categories, identifying the wrong category or being unable to produce the right category limits the added benefit of this approach over suffix stripping algorithms. The basic idea is that, if the stemmer is able to grasp more information about the word being stemmed, then it can apply more accurate normalization rules (which unlike suffix stripping rules can also modify the stem).
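A toy illustration of part-of-speech-dependent rules follows. The rule sets and the function name `lemmatize` are hypothetical, and the part of speech is passed in by the caller here, whereas a real lemmatiser would obtain it from a tagger.

```python
# Hypothetical normalization rules keyed by part of speech.
# Each rule is (suffix, replacement).
RULES_BY_POS = {
    "VERB": [("ing", ""), ("ed", ""), ("s", "")],
    "NOUN": [("ies", "y"), ("s", "")],
}


def lemmatize(word, pos):
    """Apply the first matching rule for the given part of speech."""
    for suffix, replacement in RULES_BY_POS.get(pos, []):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    return word


print(lemmatize("meeting", "VERB"))      # meet
print(lemmatize("meeting", "NOUN"))      # meeting
print(lemmatize("communities", "NOUN"))  # community
```

The first two calls show why the category matters: the same string is normalized differently depending on the part of speech assigned to it.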
Stochastic algorithms
Stochastic algorithms involve using probability to identify the root form of a word. Stochastic algorithms are trained (they «learn») on a table of root form to inflected form relations to develop a probabilistic model. This model is typically expressed in the form of complex linguistic rules, similar in nature to those in suffix stripping or lemmatisation. Stemming is performed by inputting an inflected form to the trained model and having the model produce the root form according to its internal ruleset. This again is similar to suffix stripping and lemmatisation, except that the decisions involved (which rule is most appropriate, whether to stem the word at all or just return it unchanged, and whether to apply two different rules sequentially) are made on the grounds that the output word will have the highest probability of being correct (which is to say, the smallest probability of being incorrect, which is how it is typically measured).
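The idea of learning rules from a table of inflected-form and root-form pairs can be sketched with a toy frequency model. Everything here is an assumption made for illustration: the training pairs, the function names, the use of relative frequency as the probability estimate, and the tie-break in favour of the longer suffix. Real stochastic stemmers are considerably more elaborate.

```python
from collections import Counter, defaultdict


def extract_rule(inflected, root):
    """Describe a pair as (suffix to remove, suffix to add) after the common prefix."""
    i = 0
    while i < min(len(inflected), len(root)) and inflected[i] == root[i]:
        i += 1
    return inflected[i:], root[i:]


def train(pairs):
    """Count how often each replacement is seen for each removed suffix."""
    counts = defaultdict(Counter)
    for inflected, root in pairs:
        removed, added = extract_rule(inflected, root)
        counts[removed][added] += 1
    return counts


def stem(word, counts):
    """Apply the learned rewrite with the highest relative frequency.

    Ties are broken in favour of the longer removed suffix (an assumption).
    Words matching no learned suffix are returned unchanged.
    """
    best_score, best_candidate = None, word
    for removed, added_counts in counts.items():
        if not removed or not word.endswith(removed):
            continue
        added, n = added_counts.most_common(1)[0]
        score = (n / sum(added_counts.values()), len(removed))
        if best_score is None or score > best_score:
            best_score = score
            best_candidate = word[: -len(removed)] + added
    return best_candidate


# Hypothetical training table of (inflected form, root form) pairs.
PAIRS = [("cats", "cat"), ("dogs", "dog"), ("parties", "party"),
         ("ponies", "pony"), ("walked", "walk")]
model = train(PAIRS)
print(stem("cities", model))  # city
print(stem("jumped", model))  # jump
print(stem("tree", model))    # tree
```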
Some lemmatisation algorithms are stochastic in that, given a word which may belong to multiple parts of speech, a probability is assigned to each possible part. This may take into account the surrounding words, called the context, or not. Context-free grammars do not take into account any additional information. In either case, after assigning the probabilities to each possible part of speech, the most likely part of speech is chosen, and from there the appropriate normalization rules are applied to the input word to produce the normalized (root) form.
n-gram analysis
Some stemming techniques use the n-gram context of a word to choose the correct stem for a word.[4]
Hybrid approaches
Hybrid approaches use two or more of the approaches described above in unison. A simple example is a suffix tree algorithm which first consults a lookup table using brute force. However, instead of trying to store the entire set of relations between words in a given language, the lookup table is kept small and is only used to store a small number of «frequent exceptions» like «ran => run». If the word is not in the exception list, suffix stripping or lemmatisation is applied and the result is output.
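A minimal sketch of this exception-list-then-rules design is given below. The exception entries (apart from «ran => run», which comes from the text), the suffix list, and the function name are hypothetical.

```python
# Small table of frequent exceptions; only "ran" comes from the text above.
EXCEPTIONS = {"ran": "run", "went": "go", "mice": "mouse"}

# Hypothetical fallback suffixes, tried in order.
SUFFIXES = ["ing", "ed", "s"]


def hybrid_stem(word):
    """Consult the exception table first, then fall back to suffix stripping."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


print(hybrid_stem("ran"))     # run   (from the exception table)
print(hybrid_stem("walked"))  # walk  (from suffix stripping)
```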
Affix stemmers
In linguistics, the term affix refers to either a prefix or a suffix. In addition to dealing with suffixes, several approaches also attempt to remove common prefixes. For example, given the word indefinitely, the algorithm identifies that the leading «in» is a prefix that can be removed. Many of the same approaches mentioned earlier apply, but go by the name affix stripping. A study of affix stemming for several European languages is reported by Jongejan and Dalianis.[5]
Matching algorithms
Such algorithms use a stem database (for example a set of documents that contain stem words). These stems, as mentioned above, are not necessarily valid words themselves (but rather common sub-strings, as the «brows» in «browse» and in «browsing»). In order to stem a word the algorithm tries to match it with stems from the database, applying various constraints, such as on the relative length of the candidate stem within the word (so that, for example, the short prefix «be», which is the stem of such words as «be», «been» and «being», would not be considered as the stem of the word «beside»).
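The relative-length constraint can be sketched as follows. The stem list, the threshold value, and the function name are assumptions made for illustration; the threshold was picked only so that the «be» examples from the text behave as described.

```python
# Hypothetical stem database.
STEMS = ["be", "brows", "fish"]

# Hypothetical minimum share of the word that the stem must cover.
MIN_RATIO = 0.35


def match_stem(word):
    """Return the longest database stem that prefixes `word` and is long enough."""
    candidates = [
        s for s in STEMS
        if word.startswith(s) and len(s) / len(word) >= MIN_RATIO
    ]
    return max(candidates, key=len) if candidates else None


print(match_stem("being"))     # be     (2 of 5 letters)
print(match_stem("beside"))    # None   (2 of 6 letters is below the threshold)
print(match_stem("browsing"))  # brows
```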
Language challenges
While much of the early academic work in this area was focused on the English language (with significant use of the Porter Stemmer algorithm), many other languages have been investigated.[6][7][8][9][10]
Hebrew and Arabic are still considered difficult research languages for stemming. English stemmers are fairly trivial (with only occasional problems, such as «dries» being the third-person singular present form of the verb «dry», «axes» being the plural of «axe» as well as «axis»); but stemmers become harder to design as the morphology, orthography, and character encoding of the target language become more complex. For example, an Italian stemmer is more complex than an English one (because of a greater number of verb inflections), a Russian one is more complex (more noun declensions), a Hebrew one is even more complex (due to nonconcatenative morphology, a writing system without vowels, and the requirement of prefix stripping: Hebrew stems can be two, three or four characters, but not more), and so on.
Multilingual stemming
Multilingual stemming applies morphological rules of two or more languages simultaneously instead of rules for only a single language when interpreting a search query. Commercial systems using multilingual stemming exist.
Error metrics
There are two error measurements in stemming algorithms, overstemming and understemming. Overstemming is an error where two separate inflected words are stemmed to the same root, but should not have been—a false positive. Understemming is an error where two separate inflected words should be stemmed to the same root, but are not—a false negative. Stemming algorithms attempt to minimize each type of error, although reducing one type can lead to increasing the other.
For example, the widely used Porter stemmer stems «universal», «university», and «universe» to «univers». This is a case of overstemming: though these three words are etymologically related, their modern meanings are in widely different domains, so treating them as synonyms in a search engine will likely reduce the relevance of the search results.
An example of understemming in the Porter stemmer is «alumnus» → «alumnu», «alumni» → «alumni», «alumna»/«alumnae» → «alumna». This English word keeps Latin morphology, and so these near-synonyms are not conflated.
Applications
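The two error types can be counted over word pairs, as in the simplified sketch below. This is not Paice's published measure; the function name, the concept labels, and the truncating stand-in stemmer are all hypothetical and serve only to show how the definitions translate into counts.

```python
from itertools import combinations


def count_errors(words, stemmer, concept_of):
    """Return (overstemming, understemming) counts over all word pairs."""
    over = under = 0
    for a, b in combinations(words, 2):
        same_stem = stemmer(a) == stemmer(b)
        same_concept = concept_of[a] == concept_of[b]
        if same_stem and not same_concept:
            over += 1   # conflated, but should not have been
        elif same_concept and not same_stem:
            under += 1  # should have been conflated, but was not
    return over, under


# Hypothetical gold labels: words sharing a label should share a stem.
CONCEPT_OF = {
    "universal": "universal",
    "university": "university",
    "universe": "universe",
    "alumnus": "alumnus",
    "alumni": "alumnus",
}


def toy_stemmer(word):
    """Stand-in stemmer that keeps the first seven letters (not the Porter stemmer)."""
    return word[:7]


print(count_errors(list(CONCEPT_OF), toy_stemmer, CONCEPT_OF))  # (3, 1)
```

The three overstemming errors are the pairs among «universal», «university», and «universe»; the single understemming error is the pair «alumnus» and «alumni».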
Stemming is used as an approximate method for grouping words with a similar basic meaning together. For example, a text mentioning «daffodils» is probably closely related to a text mentioning «daffodil» (without the s). But in some cases, words with the same morphological stem have idiomatic meanings which are not closely related: a user searching for «marketing» will not be satisfied by most documents mentioning «markets» but not «marketing».
Information retrieval
Stemmers are common elements in query systems such as Web search engines. The effectiveness of stemming for English query systems was soon found to be rather limited, however, and this led early information retrieval researchers to deem stemming irrelevant in general.[11] An alternative approach, based on searching for n-grams rather than stems, may be used instead. Also, stemmers may provide greater benefits in languages other than English.[12][13]
Domain analysis
Stemming is used to determine domain vocabularies in domain analysis.[14]
Use in commercial products
Many commercial companies have been using stemming since at least the 1980s and have produced algorithmic and lexical stemmers in many languages.[15][16]
The Snowball stemmers have been compared with commercial lexical stemmers with varying results.[17][18]
Google Search adopted word stemming in 2003.[19] Previously a search for «fish» would not have returned «fishing». Other software search algorithms vary in their use of word stemming. Programs that simply search for substrings will obviously find «fish» in «fishing» but when searching for «fishes» will not find occurrences of the word «fish».
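The asymmetry of plain substring search described above is easy to reproduce. The sample sentence and the helper name `substring_search` are invented for illustration.

```python
def substring_search(query, text):
    """Plain substring search with no stemming."""
    return query in text


text = "we went fishing and caught a fish"
print(substring_search("fish", text))    # True: «fish» occurs inside «fishing»
print(substring_search("fishes", text))  # False: «fish» is not matched by «fishes»
```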
Text mining
Stemming is used as a task in pre-processing texts before performing text mining analyses on them.
See also
- Computational linguistics
- Derivation—stemming is a form of reverse derivation
- Inflection
- Lemma (morphology)—linguistic definition
- Lemmatization
- Lexeme
- Morphology (linguistics)
- Natural language processing—stemming is generally regarded as a form of NLP
- NLTK—implements several stemming algorithms in Python
- Root (linguistics)—linguistic definition of the term «root»
- Snowball (programming language)—designed for creating stemming algorithms
- Stem (linguistics)—linguistic definition of the term «stem»
- Text mining—stemming algorithms play a major role in commercial NLP software
References
- [1] Lovins, Julie Beth (1968). «Development of a Stemming Algorithm». Mechanical Translation and Computational Linguistics. 11: 22–31.
- [2] «Porter Stemming Algorithm».
- [3] Yatsko, V. A.; Y-stemmer
- [4] McNamee, Paul (September 2005). «Exploring New Languages with HAIRCUT at CLEF 2005». CEUR Workshop Proceedings. 1171. Retrieved 2017-12-21.
- [5] Jongejan, B.; and Dalianis, H.; Automatic Training of Lemmatization Rules that Handle Morphological Changes in pre-, in- and Suffixes Alike, in the Proceedings of the ACL-2009, Joint conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Singapore, August 2–7, 2009, pp. 145–153
- [6] Dolamic, Ljiljana; and Savoy, Jacques; Stemming Approaches for East European Languages (CLEF 2007)
- [7] Savoy, Jacques; Light Stemming Approaches for the French, Portuguese, German and Hungarian Languages, ACM Symposium on Applied Computing, SAC 2006, ISBN 1-59593-108-2
- [8] Popovič, Mirko; and Willett, Peter (1992); The Effectiveness of Stemming for Natural-Language Access to Slovene Textual Data, Journal of the American Society for Information Science, Volume 43, Issue 5 (June), pp. 384–390
- [9] Stemming in Hungarian at CLEF 2005
- [10] Viera, A. F. G. & Virgil, J. (2007); Uma revisão dos algoritmos de radicalização em língua portuguesa, Information Research, 12(3), paper 315
- [11] Baeza-Yates, Ricardo; and Ribeiro-Neto, Berthier (1999); Modern Information Retrieval, ACM Press/Addison Wesley
- [12] Kamps, Jaap; Monz, Christof; de Rijke, Maarten; and Sigurbjörnsson, Börkur (2004); Language-Dependent and Language-Independent Approaches to Cross-Lingual Text Retrieval, in Peters, C.; Gonzalo, J.; Braschler, M.; and Kluck, M. (eds.); Comparative Evaluation of Multilingual Information Access Systems, Springer Verlag, pp. 152–165
- [13] Airio, Eija (2006); Word Normalization and Decompounding in Mono- and Bilingual IR, Information Retrieval 9:249–271
- [14] Frakes, W.; Prieto-Diaz, R.; & Fox, C. (1998). «DARE: Domain Analysis and Reuse Environment», Annals of Software Engineering (5), pp. 125–141
- [15] Language Extension Packs, dtSearch. Archived 14 September 2011 at the Wayback Machine.
- [16] Building Multilingual Solutions by using Sharepoint Products and Technologies, Microsoft Technet. Archived 17 January 2008 at the Wayback Machine.
- [17] CLEF 2003: Stephen Tomlinson compared the Snowball stemmers with the Hummingbird lexical stemming (lemmatization) system
- [18] CLEF 2004: Stephen Tomlinson «Finnish, Portuguese and Russian Retrieval with Hummingbird SearchServer»
- [19] The Essentials of Google Search, Web Search Help Center, Google Inc.
Further reading
- Dawson, J. L. (1974); Suffix Removal for Word Conflation, Bulletin of the Association for Literary and Linguistic Computing, 2(3): 33–46
- Frakes, W. B. (1984); Term Conflation for Information Retrieval, Cambridge University Press
- Frakes, W. B. & Fox, C. J. (2003); Strength and Similarity of Affix Removal Stemming Algorithms, SIGIR Forum, 37: 26–30
- Frakes, W. B. (1992); Stemming algorithms, Information retrieval: data structures and algorithms, Upper Saddle River, NJ: Prentice-Hall, Inc.
- Hafer, M. A. & Weiss, S. F. (1974); Word segmentation by letter successor varieties, Information Processing & Management 10 (11/12), 371–386
- Harman, D. (1991); How Effective is Suffixing?, Journal of the American Society for Information Science 42 (1), 7–15
- Hull, D. A. (1996); Stemming Algorithms – A Case Study for Detailed Evaluation, JASIS, 47(1): 70–84
- Hull, D. A. & Grefenstette, G. (1996); A Detailed Analysis of English Stemming Algorithms, Xerox Technical Report
- Kraaij, W. & Pohlmann, R. (1996); Viewing Stemming as Recall Enhancement, in Frei, H.-P.; Harman, D.; Schauble, P.; and Wilkinson, R. (eds.); Proceedings of the 17th ACM SIGIR conference held at Zurich, August 18–22, pp. 40–48
- Krovetz, R. (1993); Viewing Morphology as an Inference Process, in Proceedings of ACM-SIGIR93, pp. 191–203
- Lennon, M.; Pierce, D. S.; Tarry, B. D.; & Willett, P. (1981); An Evaluation of some Conflation Algorithms for Information Retrieval, Journal of Information Science, 3: 177–183
- Lovins, J. (1971); Error Evaluation for Stemming Algorithms as Clustering Algorithms, JASIS, 22: 28–40
- Lovins, J. B. (1968); Development of a Stemming Algorithm, Mechanical Translation and Computational Linguistics, 11, 22–31
- Jenkins, Marie-Claire; and Smith, Dan (2005); Conservative Stemming for Search and Indexing
- Paice, C. D. (1990); Another Stemmer, SIGIR Forum, 24: 56–61
- Paice, C. D. (1996); Method for Evaluation of Stemming Algorithms based on Error Counting, JASIS, 47(8): 632–649
- Popovič, Mirko; and Willett, Peter (1992); The Effectiveness of Stemming for Natural-Language Access to Slovene Textual Data, Journal of the American Society for Information Science, Volume 43, Issue 5 (June), pp. 384–390
- Porter, Martin F. (1980); An Algorithm for Suffix Stripping, Program, 14(3): 130–137
- Savoy, J. (1993); Stemming of French Words Based on Grammatical Categories, Journal of the American Society for Information Science, 44(1), 1–9
- Ulmschneider, John E.; & Doszkocs, Tamas (1983); A Practical Stemming Algorithm for Online Search Assistance, Online Review, 7(4), 301–318
- Xu, J.; & Croft, W. B. (1998); Corpus-Based Stemming Using Cooccurrence of Word Variants, ACM Transactions on Information Systems, 16(1), 61–81
External links
- Apache OpenNLP—includes Porter and Snowball stemmers
- SMILE Stemmer—free online service, includes Porter and Paice/Husk’ Lancaster stemmers (Java API)
- Themis—open source IR framework, includes Porter stemmer implementation (PostgreSQL, Java API)
- Snowball—free stemming algorithms for many languages, includes source code, including stemmers for five romance languages
- Snowball on C#—port of Snowball stemmers for C# (14 languages)
- Python bindings to Snowball API
- Ruby-Stemmer—Ruby extension to Snowball API
- PECL—PHP extension to the Snowball API
- Oleander Porter’s algorithm—stemming library in C++ released under BSD
- Unofficial home page of the Lovins stemming algorithm—with source code in a couple of languages
- Official home page of the Porter stemming algorithm—including source code in several languages
- Official home page of the Lancaster stemming algorithm—Lancaster University, UK
- Official home page of the UEA-Lite Stemmer —University of East Anglia, UK
- Overview of stemming algorithms
- PTStemmer—A Java/Python/.Net stemming toolkit for the Portuguese language
- jsSnowball—open source JavaScript implementation of Snowball stemming algorithms for many languages
- Snowball Stemmer—implementation for Java
- hindi_stemmer—open source stemmer for Hindi
- czech_stemmer—open source stemmer for Czech
- Comparative Evaluation of Arabic Language Morphological Analysers and Stemmers
- Tamil Stemmer
You need to know the stem of a Latin noun in order to be able to decline it. In this post, learn how to find the stem of any Latin noun!
First of all, let’s review what a stem is. The stem is the part of the noun that the case endings are added to. It is the basic form of the word, and it appears in every case form with two exceptions: the nominative singular of third declension nouns and of a few second declension nouns, and the accusative singular of third declension neuter nouns.
The stem is also what allows you to identify the word and establish its meaning. So being able to determine the stem of a noun is crucial.
(Some people call this the base instead of the stem. But don’t worry – the process of finding the stem/base is the same, no matter what you call it.)
Let’s get started!
Fortunately, finding the stem of a Latin noun is quite simple. You look at the genitive singular and remove the case ending. Whatever you have left is the stem.
Here are the genitive singular endings for the different declensions:
FIRST: -ae
SECOND: -ī
THIRD: -is
FOURTH: -ūs
FIFTH: -eī / -ēī
When you see the genitive singular of a noun, simply remove the ending and you will have the stem. (You also use the genitive singular to determine the declension of a Latin noun.)
Now let’s look at an example noun. The dictionary form of the noun is the nominative singular followed by the genitive singular. So you want to look at the second word to determine the stem.
mater, matris
What is the stem of mater, matris?
Well, mater, matris is third declension. So we take the -is off of the genitive singular.
matris – is = matr-
This means that our stem is matr-!
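If you like to see a rule written out as code, here is a minimal Python sketch of the same step: look up the genitive singular ending for the declension and remove it. The names `GENITIVE_SINGULAR_ENDINGS` and `find_stem` are made up for this illustration, and the function assumes you already know which declension the noun belongs to.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so a macron vowel is always a single character."""
    return unicodedata.normalize("NFC", text)


# Genitive singular endings by declension, as listed above.
# (Hypothetical table name; the endings themselves come from the post.)
GENITIVE_SINGULAR_ENDINGS = {
    1: ["ae"],
    2: ["ī"],
    3: ["is"],
    4: ["ūs"],
    5: ["eī", "ēī"],
}


def find_stem(genitive_singular, declension):
    """Return the stem by removing the genitive singular ending."""
    word = nfc(genitive_singular)
    for ending in GENITIVE_SINGULAR_ENDINGS[declension]:
        ending = nfc(ending)
        if word.endswith(ending):
            return word[: -len(ending)]
    raise ValueError(
        f"{genitive_singular!r} does not end like a declension {declension} genitive"
    )


print(find_stem("matris", 3))  # matr
```

The function raises an error instead of guessing when the ending does not match, which is a design choice for this sketch: a mismatch usually means the declension was identified incorrectly.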
Okay, time to practice. What is the stem of each of the following nouns?
Exercise 1
1. mōns, montis
2. exercitus, exercitūs
3. verbum, verbī
4. iter, itineris
5. sella, sellae
6. rēs, reī
7. mūrus, mūrī
8. agricola, agricolae
Answers
1. montis – is = mont-
2. exercitūs – ūs = exercit-
3. verbī – ī = verb-
4. itineris – is = itiner-
5. sellae – ae = sell-
6. reī – eī = r-
7. mūrī – ī = mūr-
8. agricolae – ae = agricol-
Still with me? Let’s address a few potential issues.
Why can’t we find the stem of a Latin noun from the nominative?
You may have noticed that, in most of the words above, the stem was also present in the nominative singular. So, you ask, why can’t we just take the ending off the nominative? Why do we need to look at the genitive singular?
Well, you can take the ending off the nominative for every declension EXCEPT the third (and a few -er nouns in the second declension).
If we look back at #1 and #4 in Exercise 1, we notice something. The stem of mōns is mont- and the stem of iter is itiner-. The stem is NOT present in the nominative. So we really do need to know the genitive in order to find the stems of third declension nouns.
Because we already need to memorize the genitive anyway to determine the declension of a noun, it makes sense to remove the genitive ending to find the stem, because this works for every noun that has singular forms. (Plural-only nouns need a small adjustment, covered at the end of this post.)
TIP: Be extra careful with third declension nouns. Their stems can be weird and unpredictable.
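To see the problem concretely, the short Python sketch below checks whether a noun’s stem is visible at the start of its nominative singular. The helper name `stem_shows_in_nominative` is hypothetical; the stems are the ones worked out earlier in this post.

```python
import unicodedata


def stem_shows_in_nominative(nominative, stem):
    """Return True if the nominative singular begins with the stem."""
    nom = unicodedata.normalize("NFC", nominative)
    return nom.startswith(unicodedata.normalize("NFC", stem))


# (nominative singular, stem) pairs taken from the examples above.
nouns = [
    ("sella", "sell"),
    ("mūrus", "mūr"),
    ("mōns", "mont"),
    ("iter", "itiner"),
    ("mater", "matr"),
]

for nominative, stem in nouns:
    print(nominative, stem + "-", stem_shows_in_nominative(nominative, stem))
```

For the first and second declension nouns the check succeeds, while for the three third declension nouns it fails. That is why stripping the nominative ending is not a safe shortcut there.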
What if the dictionary doesn’t give the full genitive singular?
Most beginning Latin textbooks and resources will give you the full nominative and genitive singular for each noun. But since first, second, fourth, and fifth declension stems are very regular, often dictionaries use shorthand.
FIRST DECLENSION: sella, sellae becomes sella, ae. When this happens, to get the stem you can remove the nominative singular ending -a.
SECOND DECLENSION: mūrus, mūrī becomes mūrus, ī; so remove the -us.
FOURTH DECLENSION: exercitus, exercitūs becomes exercitus, ūs; so remove the -us.
FIFTH DECLENSION: rēs, reī becomes rēs, eī; so remove the -ēs.
But for third declension, dictionaries will include the full genitive singular.
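The shorthand rules above can be sketched in Python as a lookup from the abbreviated genitive to the nominative ending that has to be removed. The names `NOMINATIVE_ENDINGS` and `stem_from_shorthand` are invented for this example. The -um entry for second declension neuters such as verbum is an addition that the list above does not mention, and second declension -er nouns are not handled at all.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so a macron vowel is always a single character."""
    return unicodedata.normalize("NFC", text)


# Abbreviated genitive -> nominative singular endings to remove.
# "um" is an assumption added for neuters like verbum, ī.
_RAW_ENDINGS = {
    "ae": ["a"],         # first declension:  sella, ae
    "ī": ["us", "um"],   # second declension: mūrus, ī
    "ūs": ["us"],        # fourth declension: exercitus, ūs
    "eī": ["ēs"],        # fifth declension:  rēs, eī
}
NOMINATIVE_ENDINGS = {
    nfc(gen): [nfc(nom) for nom in noms] for gen, noms in _RAW_ENDINGS.items()
}


def stem_from_shorthand(nominative, genitive_shorthand):
    """Find the stem from an abbreviated dictionary entry like 'sella, ae'."""
    nom = nfc(nominative)
    key = nfc(genitive_shorthand).lstrip("-")
    if key not in NOMINATIVE_ENDINGS:
        raise ValueError(f"unsupported genitive shorthand: {genitive_shorthand!r}")
    for ending in NOMINATIVE_ENDINGS[key]:
        if nom.endswith(ending):
            return nom[: -len(ending)]
    raise ValueError(f"{nominative!r} does not match shorthand {genitive_shorthand!r}")


print(stem_from_shorthand("sella", "ae"))  # sell
```

Third declension entries are deliberately left out of the table, since their dictionary form gives the full genitive and the stem should be read from that.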
More Practice With 3rd Declension Nouns
Since 3rd declension stems can be so weird, let’s do another quick exercise. Find the stems of the following four nouns.
Exercise 2
1. nox, noctis
2. flūmen, flūminis
3. iuventūs, iuventūtis
4. missiō, missiōnis
Answers
1. noctis – is = noct-
2. flūminis – is = flūmin-
3. iuventūtis – is = iuventūt-
4. missiōnis – is = missiōn-
Okay, how are you feeling? We just have one more topic to address, and then you will know how to find the stem of any Latin noun.
How to Find the Stem of a Plural-Only Noun
Some Latin nouns (called pluralia tantum or plural-only) have no singular forms. In other words, they only exist in the plural. In consequence, you cannot remove the genitive singular ending to find the stem. Instead, you have to remove the genitive plural ending.
Or, if the dictionary form is abbreviated (as it typically will be for plural-only nouns), you must remove the nominative plural ending.
Here are the nominative and genitive plural endings of the five declensions.
FIRST: -ae, -ārum
SECOND: -ī, -ōrum (masculine); -a, -ōrum (neuter)
THIRD: -ēs, -um/-ium (masculine/feminine); -a/-ia, -um/-ium (neuter)
FOURTH: -ūs, -uum (masculine/feminine); -ua, -uum (neuter)
FIFTH: -ēs, -ērum
Practice identifying the stems of the following plural-only Latin nouns.
Exercise 3
1. moenia, ium
2. Quinquātrūs, uum
3. angustiae, ārum
4. arma, ōrum
Answers
1. moenia – ia = moen-
2. Quinquātrūs – ūs = Quinquātr-
3. angustiae – ae = angusti-
4. arma – a = arm-
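The plural-only procedure can be sketched in Python in the same spirit: use the abbreviated genitive plural to decide which nominative plural ending to remove. The names `PLURAL_ENDINGS` and `plural_only_stem` are hypothetical, and the table only covers the endings listed above.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so a macron vowel is always a single character."""
    return unicodedata.normalize("NFC", text)


# Abbreviated genitive plural -> possible nominative plural endings.
_RAW_PLURAL_ENDINGS = {
    "ārum": ["ae"],       # first declension
    "ōrum": ["ī", "a"],   # second declension (masculine, neuter)
    "um": ["ēs", "a"],    # third declension
    "ium": ["ēs", "ia"],  # third declension
    "uum": ["ūs", "ua"],  # fourth declension
    "ērum": ["ēs"],       # fifth declension
}
PLURAL_ENDINGS = {
    nfc(gen): [nfc(nom) for nom in noms]
    for gen, noms in _RAW_PLURAL_ENDINGS.items()
}


def plural_only_stem(nominative_plural, genitive_plural_shorthand):
    """Find the stem of a plural-only noun from an entry like 'arma, ōrum'."""
    nom = nfc(nominative_plural)
    key = nfc(genitive_plural_shorthand).lstrip("-")
    if key not in PLURAL_ENDINGS:
        raise ValueError(f"unsupported genitive plural: {genitive_plural_shorthand!r}")
    for ending in PLURAL_ENDINGS[key]:
        if nom.endswith(ending):
            return nom[: -len(ending)]
    raise ValueError(
        f"{nominative_plural!r} does not match {genitive_plural_shorthand!r}"
    )


for entry in [("moenia", "ium"), ("Quinquātrūs", "uum"),
              ("angustiae", "ārum"), ("arma", "ōrum")]:
    print(entry[0], "->", plural_only_stem(*entry) + "-")
```

Keying the lookup on the genitive plural rather than on the nominative avoids ambiguity, because -ēs and -a each appear as a nominative plural ending in more than one declension.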
And there you have it. Now you know how to find the stem of any Latin noun. I hope that you have found my explanations and exercises helpful. And good luck on your Latin journey!
Oh, and are you wondering what the genitive case even does? I know we have talked about it a lot in this post! If you’re curious, you can read all about the genitive case here.