Package nltk :: Module data

Module data

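Functions to find and load NLTK resource files, such as corpora, grammars, and saved processing objects. Resource files are identified using URLs, such as "nltk:corpora/abc/rural.txt" or "http://nltk.org/sample/toy.cfg". The following URL protocols are supported:

  • "nltk:" - The resource is located in the NLTK data package, by searching the directories listed in nltk.data.path.
  • "file:" - The resource is opened directly from the given file path.
  • Any other protocol, such as "http:", is delegated to urllib2.urlopen.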

If no protocol is specified, then the default protocol "nltk:" will be used.

This module provides two functions that can be used to access a resource file, given its URL: load() loads a given resource and adds it to a resource cache; retrieve() copies a given resource to a local file.

Classes
PathPointer
An abstract base class for 'path pointers,' used by NLTK's data package to identify specific paths.
FileSystemPathPointer
A path pointer that identifies a file which can be accessed directly via a given absolute path.
BufferedGzipFile
A GzipFile subclass that buffers calls to read() and write().
GzipFileSystemPathPointer
A subclass of FileSystemPathPointer that identifies a gzip-compressed file located at a given absolute path.
ZipFilePathPointer
A path pointer that identifies a file contained within a zipfile, which can be accessed by reading that zipfile.
LazyLoader
OpenOnDemandZipFile
A subclass of zipfile.ZipFile that closes its file pointer whenever it is not using it; and re-opens it when it needs to read data from the zipfile.
    Seekable Unicode Stream Reader
SeekableUnicodeStreamReader
A stream reader that automatically decodes the source byte stream into unicode (like codecs.StreamReader), but still supports the seek() and tell() operations correctly.
Functions
str
find(resource_name)
Find the given resource by searching through the directories and zip files in nltk.data.path, and return a corresponding path name.
 
retrieve(resource_url, filename=None, verbose=True)
Copy the given resource to a local file.
 
load(resource_url, format='auto', cache=True, verbose=False, logic_parser=None, fstruct_parser=None)
Load a given resource from the NLTK data package.
 
show_cfg(resource_url, escape='##')
Write out a grammar file, ignoring escaped and empty lines.
 
clear_cache()
Remove all objects from the resource cache.
 
_open(resource_url)
Helper function that returns an open file object for a resource, given its resource URL.
Variables
  path = ['/users/sb/nltk/data', '/Users/sb/nltk_data', '/usr/sh...
A list of directories where the NLTK data package might reside.
  _resource_cache = {}
A dictionary used to cache resources so that they won't need to be loaded more than once.
  FORMATS = {'cfg': 'A context free grammar, parsed by nltk.pars...
A dictionary describing the formats that are supported by NLTK's load() method.
  AUTO_FORMATS = {'cfg': 'cfg', 'fcfg': 'fcfg', 'fol': 'fol', 'l...
A dictionary mapping from file extensions to format names, used by load() when format="auto" to decide the format for a given resource url.
  d = '/users/sb/nltk/data'
Function Details

find(resource_name)

Find the given resource by searching through the directories and zip files in nltk.data.path, and return a corresponding path name. If the given resource is not found, raise a LookupError, whose message gives a pointer to the installation instructions for the NLTK downloader.

Zip File Handling:

  • If resource_name contains a component with a .zip extension, then it is assumed to be a zipfile; and the remaining path components are used to look inside the zipfile.
  • If any element of nltk.data.path has a .zip extension, then it is assumed to be a zipfile.
  • If a given resource name that does not contain any zipfile component is not found initially, then find() will make a second attempt to find that resource, by replacing each component p in the path with p.zip/p. For example, this allows find() to map the resource name corpora/chat80/cities.pl to a zip file path pointer to corpora/chat80.zip/chat80/cities.pl.
  • When using find() to locate a directory contained in a zipfile, the resource name must end with the '/' character. Otherwise, find() will not locate the directory.
Parameters:
  • resource_name (str) - The name of the resource to search for. Resource names are posix-style relative path names, such as 'corpora/brown'. In particular, directory names should always be separated by the '/' character, which will be automatically converted to a platform-appropriate path separator.
Returns: str
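
The second-attempt rule in the zip file handling notes can be sketched as follows. This is a minimal illustration under stated assumptions, not NLTK's implementation: the helper name zip_candidates is hypothetical, it rewrites one path component at a time, and it only generates candidate names without touching the filesystem.

```python
def zip_candidates(resource_name):
    """Return alternative names built by replacing a path component p
    with p.zip/p, one component at a time.

    Hypothetical helper for illustration only.
    """
    pieces = resource_name.split('/')
    candidates = []
    for i, piece in enumerate(pieces):
        # Insert "<piece>.zip" in front of the i-th component.
        modified = pieces[:i] + [piece + '.zip'] + pieces[i:]
        candidates.append('/'.join(modified))
    return candidates


candidates = zip_candidates('corpora/chat80/cities.pl')
for name in candidates:
    print(name)
```

One of the printed candidates is corpora/chat80.zip/chat80/cities.pl, the mapping given in the example above. A real lookup would try each candidate in turn and keep the first one that exists.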

retrieve(resource_url, filename=None, verbose=True)

Copy the given resource to a local file. If no filename is specified, then use the URL's filename. If there is already a file named filename, then raise a ValueError.

Parameters:
  • resource_url (str) - A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package.
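
The filename rules for retrieve() can be illustrated with a short sketch. The helpers default_filename and choose_target are hypothetical names, and the regular expression is an assumption about how "the URL's filename" might be derived; only the behaviour described above (default to the URL's filename, raise ValueError if the file exists) comes from the documentation.

```python
import os
import re
import tempfile


def default_filename(resource_url):
    """Strip an optional protocol prefix and everything up to the last '/'."""
    return re.sub(r'^(\w+:)?(.*/)?', '', resource_url)


def choose_target(resource_url, filename=None):
    """Pick the local filename for the copy and refuse to overwrite."""
    if filename is None:
        filename = default_filename(resource_url)
    if os.path.exists(filename):
        raise ValueError('File %r already exists!' % filename)
    return filename


name_from_url = default_filename('nltk:corpora/abc/rural.txt')

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'toy.cfg')
    # The target does not exist yet, so it is accepted.
    accepted = choose_target('http://nltk.org/sample/toy.cfg', filename=target)
    open(target, 'w').close()  # simulate an earlier copy
    try:
        choose_target('http://nltk.org/sample/toy.cfg', filename=target)
        refused = False
    except ValueError:
        refused = True
```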

load(resource_url, format='auto', cache=True, verbose=False, logic_parser=None, fstruct_parser=None)

Load a given resource from the NLTK data package. The following resource formats are currently supported:

  • 'pickle'
  • 'yaml'
  • 'cfg' (context free grammars)
  • 'pcfg' (probabilistic CFGs)
  • 'fcfg' (feature-based CFGs)
  • 'fol' (formulas of First Order Logic)
  • 'logic' (Logical formulas to be parsed by the given logic_parser)
  • 'val' (valuation of First Order Logic model)
  • 'raw'

If no format is specified, load() will attempt to determine a format based on the resource name's file extension. If that fails, load() will raise a ValueError exception.
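
The extension-based format detection can be sketched as below. The AUTO_FORMATS mapping is copied from the value listed on this page; the function guess_format is a hypothetical stand-in written for illustration, and the exact error messages are assumptions.

```python
AUTO_FORMATS = {
    'cfg': 'cfg', 'fcfg': 'fcfg', 'fol': 'fol', 'logic': 'logic',
    'pcfg': 'pcfg', 'pickle': 'pickle', 'val': 'val', 'yaml': 'yaml',
}


def guess_format(resource_url, format='auto'):
    """Return the format name to use for resource_url."""
    if format != 'auto':
        return format  # an explicit format always wins
    # Look at the text after the last '.' in the final path component.
    basename = resource_url.rsplit('/', 1)[-1]
    if '.' not in basename:
        raise ValueError('No file extension in %r' % resource_url)
    extension = basename.rsplit('.', 1)[-1]
    if extension not in AUTO_FORMATS:
        raise ValueError('Unknown extension %r; specify a format' % extension)
    return AUTO_FORMATS[extension]


detected = guess_format('http://nltk.org/sample/toy.cfg')
try:
    guess_format('nltk:corpora/abc/rural.txt')
    failed = False
except ValueError:
    failed = True  # 'txt' is not in AUTO_FORMATS
explicit = guess_format('nltk:corpora/abc/rural.txt', format='raw')
```

In this sketch a plain text file such as rural.txt needs an explicit format (for example 'raw'), because its extension has no entry in the mapping.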

Parameters:
  • resource_url (str) - A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package.
  • cache (bool) - If true, add this resource to a cache. If load() finds a resource in its cache, then it will return it from the cache rather than loading it. The cache uses weak references, so a resource will automatically be expunged from the cache when no more objects are using it.
  • verbose (bool) - If true, print a message when loading a resource. Messages are not displayed when a resource is retrieved from the cache.
  • logic_parser (LogicParser) - The parser that will be used to parse logical expressions.
  • fstruct_parser (FeatStructParser) - The parser that will be used to parse the feature structure of an fcfg.
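
The weak-reference behaviour described for the cache parameter can be demonstrated with the standard weakref module. This is a generic sketch of a weak-value cache, not NLTK's own code; Resource and cached_load are hypothetical names, and the immediate expiry shown relies on CPython's reference counting.

```python
import gc
import weakref


class Resource(object):
    """Stand-in for a loaded resource (hypothetical)."""

    def __init__(self, url):
        self.url = url


cache = weakref.WeakValueDictionary()
load_count = 0


def cached_load(url):
    """Return the cached object for url, loading it only on a cache miss."""
    global load_count
    resource = cache.get(url)
    if resource is None:
        load_count += 1          # simulate the expensive load
        resource = Resource(url)
        cache[url] = resource    # the cache holds only a weak reference
    return resource


first = cached_load('nltk:corpora/abc/rural.txt')
second = cached_load('nltk:corpora/abc/rural.txt')
same_object = first is second    # the second call is served from the cache

# Once no strong references remain, the entry disappears from the cache.
del first, second
gc.collect()
still_cached = 'nltk:corpora/abc/rural.txt' in cache
```

Note that WeakValueDictionary can only hold objects that support weak references, which excludes built-in types such as str, list and dict.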

show_cfg(resource_url, escape='##')

Write out a grammar file, ignoring escaped and empty lines.

Parameters:
  • resource_url (str) - A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package.
  • escape (str) - Prepended string that signals lines to be ignored
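
The line filtering performed by show_cfg() can be sketched on an in-memory string. The function grammar_lines and the sample grammar are hypothetical; the sketch assumes that "empty" means a line with no characters at all, so whitespace-only lines are kept.

```python
def grammar_lines(text, escape='##'):
    """Return the lines of text that are neither escaped nor empty."""
    kept = []
    for line in text.splitlines():
        if line.startswith(escape):
            continue  # escaped line, e.g. a comment
        if line == '':
            continue  # empty line
        kept.append(line)
    return kept


sample = "## A toy grammar\nS -> NP VP\n\nNP -> 'John'\n"
for line in grammar_lines(sample):
    print(line)
```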

clear_cache()

Remove all objects from the resource cache.

See Also: load()

_open(resource_url)

Helper function that returns an open file object for a resource, given its resource URL. If the given resource URL uses the 'nltk' protocol, or uses no protocol, then use nltk.data.find to find its path, and open it with the given mode; if the resource URL uses the 'file' protocol, then open the file with the given mode; otherwise, delegate to urllib2.urlopen.

Parameters:
  • resource_url (str) - A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package.
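
The three-way dispatch described for _open() can be sketched without opening anything. The function pick_opener is a hypothetical helper that only reports which branch would be taken; the regular expression used to split off the protocol is an assumption.

```python
import re


def pick_opener(resource_url):
    """Return (opener, argument), where opener is 'find', 'file' or 'urlopen'."""
    match = re.match(r'^(\w+):(.*)$', resource_url)
    if match is None:
        # No protocol given: fall back to the default "nltk:" behaviour.
        return ('find', resource_url)
    protocol, path = match.groups()
    if protocol == 'nltk':
        return ('find', path)        # look the path up in the data package
    if protocol == 'file':
        return ('file', path)        # open the file directly
    # Anything else is handed to the URL opener unchanged
    # (urllib2.urlopen in Python 2, urllib.request.urlopen in Python 3).
    return ('urlopen', resource_url)


for url in ('corpora/abc/rural.txt',
            'nltk:corpora/abc/rural.txt',
            'file:/tmp/toy.cfg',
            'http://nltk.org/sample/toy.cfg'):
    print(url, '->', pick_opener(url))
```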

Variables Details

path

A list of directories where the NLTK data package might reside. These directories will be checked in order when looking for a resource in the data package. Note that this allows users to substitute in their own versions of resources, if they have them (e.g., in their home directory under ~/nltk_data).

Value:
['/users/sb/nltk/data',
 '/Users/sb/nltk_data',
 '/usr/share/nltk_data',
 '/usr/local/share/nltk_data',
 '/usr/lib/nltk_data',
 '/usr/local/lib/nltk_data']
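
The in-order search, and the way it lets a user copy shadow a system copy, can be sketched with two temporary directories. The helper first_match is hypothetical and handles plain directories only; zip files on the path are ignored in this sketch.

```python
import os
import tempfile


def first_match(search_path, resource_name):
    """Return the first existing file for resource_name along search_path."""
    for directory in search_path:
        # Resource names use '/', so convert to a platform-appropriate path.
        candidate = os.path.join(directory, *resource_name.split('/'))
        if os.path.exists(candidate):
            return candidate
    raise LookupError('Resource %r not found' % resource_name)


with tempfile.TemporaryDirectory() as user_dir, \
        tempfile.TemporaryDirectory() as system_dir:
    # Place a copy of the same resource in both directories.
    for directory in (user_dir, system_dir):
        os.makedirs(os.path.join(directory, 'corpora', 'abc'))
        with open(os.path.join(directory, 'corpora', 'abc', 'rural.txt'), 'w') as f:
            f.write('copy stored in ' + directory)

    # The user directory is listed first, so its copy wins.
    found = first_match([user_dir, system_dir], 'corpora/abc/rural.txt')
    expected = os.path.join(user_dir, 'corpora', 'abc', 'rural.txt')
    found_user_copy = (found == expected)
```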

FORMATS

A dictionary describing the formats that are supported by NLTK's load() method. Keys are format names, and values are format descriptions.

Value:
{'cfg': 'A context free grammar, parsed by nltk.parse_cfg().',
 'fcfg': 'A feature CFG, parsed by nltk.parse_fcfg().',
 'fol': 'A list of first order logic expressions, parsed by nltk.sem.parse_fol() using nltk.sem.logic.LogicParser.',
 'logic': 'A list of first order logic expressions, parsed by nltk.sem.parse_logic().  Requires an additional logic_parser parameter',
 'pcfg': 'A probabilistic CFG, parsed by nltk.parse_pcfg().',
 'pickle': 'A serialized python object, stored using the pickle module\
...

AUTO_FORMATS

A dictionary mapping from file extensions to format names, used by load() when format="auto" to decide the format for a given resource url.

Value:
{'cfg': 'cfg',
 'fcfg': 'fcfg',
 'fol': 'fol',
 'logic': 'logic',
 'pcfg': 'pcfg',
 'pickle': 'pickle',
 'val': 'val',
 'yaml': 'yaml'}