type SnowballStemmer
object --+
         |
api.StemmerI --+
               |
              SnowballStemmer
A word stemmer based on the Snowball stemming algorithms.
At the moment, this port is able to stem words from fourteen
languages: Danish, Dutch, English, Finnish, French, German,
Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian,
Spanish and Swedish.
Furthermore, there is also the original English Porter algorithm:
Porter, M. "An algorithm for suffix stripping."
Program 14.3 (1980): 130-137.
The algorithms were developed by
U{Dr Martin Porter<http://tartarus.org/~martin/>}.
These stemmers are called Snowball because he invented
a programming language of that name for creating
new stemming algorithms. More information is available
on the U{Snowball Website<http://snowball.tartarus.org/>}.
The stemmer is invoked as shown below:
>>> from nltk import SnowballStemmer
>>> SnowballStemmer.languages # See which languages are supported
('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian',
'italian', 'norwegian', 'porter', 'portuguese', 'romanian',
'russian', 'spanish', 'swedish')
>>> stemmer = SnowballStemmer("german") # Choose a language
>>> stemmer.stem(u"Autobahnen") # Stem a word
u'autobahn'
Invoking the stemmer this way is useful if the language to be stemmed
is not known until runtime. Alternatively, if you already know
the language, you can invoke the language-specific stemmer directly:
>>> from nltk.stem.snowball import GermanStemmer
>>> stemmer = GermanStemmer()
>>> stemmer.stem(u"Autobahnen")
u'autobahn'
@author: Peter Michael Stahl
@contact: pemistahl@gmail.com
@contact: U{http://twitter.com/pemistahl}
@cvar languages: A tuple that contains the available language names
@type languages: C{tuple}
@ivar stopwords: A list that contains stopwords for the respective language
in Unicode format.
@type stopwords: C{list}
languages = ('danish', 'dutch', 'english', 'finnish', 'french'...
__init__(self, language, ignore_stopwords=False)
(Constructor)
Create a language-specific instance of the Snowball stemmer.
- Parameters:
    language (str, unicode) - The language whose subclass is instantiated.
    ignore_stopwords (bool) - If True, stopwords are not stemmed but
      returned unchanged. Defaults to False.
- Raises:
    ValueError - If there is no stemmer for the specified language.
- Overrides:
    object.__init__
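
The constructor behaviour described above (selecting a language, optionally
leaving stopwords untouched, and raising ValueError for an unknown language)
can be illustrated with a small self-contained sketch. This is not the NLTK
implementation: the class name ToyStemmer, its suffix rules and its
stopword sets are hypothetical and exist only to make the control flow
visible.

```python
class ToyStemmer(object):
    """Hypothetical stand-in for a language-aware stemmer (not NLTK code)."""

    # Illustrative data only; real stopword lists are much longer.
    stopwords_by_language = {
        "english": {"the", "having", "being"},
        "german": {"der", "die", "das"},
    }
    # Toy suffix rules, far simpler than the real Snowball algorithms.
    suffixes_by_language = {
        "english": ("ing", "s"),
        "german": ("en", "e"),
    }
    languages = tuple(sorted(suffixes_by_language))

    def __init__(self, language, ignore_stopwords=False):
        # Reject languages for which no rules exist.
        if language not in self.languages:
            raise ValueError("The language '%s' is not supported." % language)
        self.language = language
        self.suffixes = self.suffixes_by_language[language]
        # Stopwords are only consulted when ignore_stopwords is True.
        self.stopwords = set()
        if ignore_stopwords:
            self.stopwords = self.stopwords_by_language[language]

    def stem(self, word):
        word = word.lower()
        # Stopwords are returned unchanged instead of being stemmed.
        if word in self.stopwords:
            return word
        # Strip the first matching suffix, keeping at least three letters.
        for suffix in self.suffixes:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word


plain = ToyStemmer("english")
keeping = ToyStemmer("english", ignore_stopwords=True)
print(plain.stem("having"))    # stemmed by the toy rules
print(keeping.stem("having"))  # returned unchanged, it is a toy stopword

try:
    ToyStemmer("klingon")
except ValueError as err:
    print(err)
```

With the documented class, the analogous calls would pass the same
arguments, for example SnowballStemmer("english", ignore_stopwords=True);
the stems it returns come from the real Snowball rules and will differ
from this toy.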
languages
- Value:
('danish',
'dutch',
'english',
'finnish',
'french',
'german',
'hungarian',
'italian',
...