Module downloader


The NLTK corpus and module downloader.  This module defines several
interfaces which can be used to download corpora, models, and other
data packages that can be used with NLTK.

Downloading Packages
====================
If called with no arguments, the L{download() <Downloader.download>}
function displays an interactive interface which can be used to
download and install new packages.  If Tkinter is available, a
graphical interface is shown; otherwise, a simple text interface
is provided.

Individual packages can be downloaded by calling the C{download()}
function with a single argument, giving the package identifier for the
package that should be downloaded:

  >>> download('treebank') # doctest: +SKIP
  [nltk_data] Downloading package 'treebank'...
  [nltk_data]   Unzipping corpora/treebank.zip.

NLTK also provides a number of "package collections", each consisting
of a group of related packages.  To download all packages in a
collection, simply call C{download()} with the collection's
identifier:

  >>> download('all-corpora') # doctest: +SKIP
  [nltk_data] Downloading package 'abc'...
  [nltk_data]   Unzipping corpora/abc.zip.
  [nltk_data] Downloading package 'alpino'...
  [nltk_data]   Unzipping corpora/alpino.zip.
    ...
  [nltk_data] Downloading package 'words'...
  [nltk_data]   Unzipping corpora/words.zip.

Download Directory
==================
By default, packages are installed either in a system-wide directory
(if Python has sufficient access to write to it) or in the current
user's home directory.  However, the C{download_dir} argument may be
used to specify a different installation target, if desired.

See L{Downloader.default_download_dir()} for a more detailed
description of how the default download directory is chosen.
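
As a minimal sketch of the C{download_dir} argument described above, the
helper below downloads one package into an explicit directory.  The names
C{fetch_package} and C{download_fn} are hypothetical and belong to this
sketch only; C{download_fn} is a hook that lets the helper be tried without
network access.  The keyword arguments C{download_dir} and C{quiet} are taken
from the C{download()} signature listed on this page.

```python
import os


def fetch_package(package_id, target_dir, download_fn=None):
    """Download one package into target_dir instead of the default location.

    fetch_package and download_fn are hypothetical names for this sketch;
    download_fn exists only so the helper can be tried without network access.
    """
    if download_fn is None:
        # Imported lazily so the sketch can be loaded without NLTK installed.
        from nltk.downloader import download as download_fn
    # Creating the directory up front is an assumption of this sketch.
    os.makedirs(target_dir, exist_ok=True)
    return download_fn(package_id, download_dir=target_dir, quiet=True)
```

A call such as C{fetch_package('treebank', '/tmp/nltk_data')} would then
place the package under the given directory rather than the default one.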

NLTK Download Server
====================
Before downloading any packages, the corpus and module downloader
contacts the NLTK download server to retrieve an index file
describing the available packages.  By default, this index file is
loaded from C{<http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml>}.
To use a different package index, create a new L{Downloader} object
and specify the URL of its index file.

Usage::

    python nltk/downloader.py [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS

or, with Python 2.5 or later:

    python -m nltk.downloader [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
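
The command line above can also be assembled from a script.  The following
is a minimal sketch, not part of NLTK: C{downloader_command} is a
hypothetical helper that only builds the argument list, and C{extra_flags}
is passed through unchanged, so it should contain only flags that appear in
the usage line (C{-q}, C{-f}, C{-k}).

```python
import sys


def downloader_command(package_ids, datadir=None, extra_flags=()):
    """Build the argv list for `python -m nltk.downloader`.

    downloader_command is a hypothetical helper for this sketch; it does
    not run anything, it only returns the command as a list of strings.
    """
    # Use the running interpreter so the same Python environment is used.
    argv = [sys.executable, "-m", "nltk.downloader"]
    if datadir is not None:
        argv += ["-d", datadir]
    argv += list(extra_flags)
    argv += list(package_ids)
    return argv
```

The resulting list can be handed to C{subprocess.run(..., check=True)} to
perform the download in a separate process.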

Classes
Package
A directory entry for a downloadable package.
Collection
A directory entry for a collection of downloadable packages.
DownloaderMessage
A status message object, used by incr_download to communicate its progress.
StartCollectionMessage
Data server has started working on a collection of packages.
FinishCollectionMessage
Data server has finished working on a collection of packages.
StartPackageMessage
Data server has started working on a package.
FinishPackageMessage
Data server has finished working on a package.
StartDownloadMessage
Data server has started downloading a package.
FinishDownloadMessage
Data server has finished downloading a package.
StartUnzipMessage
Data server has started unzipping a package.
FinishUnzipMessage
Data server has finished unzipping a package.
UpToDateMessage
The package download file is already up-to-date.
StaleMessage
The package download file is out-of-date or corrupt.
ErrorMessage
Data server encountered an error.
ProgressMessage
Indicates how much progress the data server has made.
SelectDownloadDirMessage
Indicates what download directory the data server is using.
Downloader
A class used to access the NLTK data server, which can be used to download corpora and other data packages.
DownloaderShell
DownloaderGUI
Graphical interface for downloading packages from the NLTK data server.
Functions
md5_hexdigest(file)
Calculate and return the MD5 checksum for a given file.
unzip(filename, root, verbose=True)
Extract the contents of the zip file filename into the directory root.
_unzip_iter(filename, root, verbose=True)
build_index(root, base_url)
Create a new data.xml index file, by combining the xml description files for various packages and collections.
_indent_xml(xml, prefix='')
Helper for build_index(): Given an XML ElementTree, modify the text and tail attributes of it and its descendants to generate an indented tree, where each nested element is indented by 2 spaces with respect to its parent.
_check_package(pkg_xml, zipfilename, zf)
Helper for build_index(): Perform some checks to make sure that the given package is consistent.
_svn_revision(filename)
Helper for build_index(): Calculate the subversion revision number for a given file (by using subprocess to run svn).
_find_collections(root)
Helper for build_index(): Yield a list of ElementTree.Element objects, each holding the xml for a single package collection.
_find_packages(root)
Helper for build_index(): Yield a list of tuples (pkg_xml, zf, subdir), described under Function Details.
download(info_or_id=None, download_dir=None, quiet=False, force=False, prefix='[nltk_data] ', halt_on_error=True, raise_on_error=False)
download_shell()
download_gui()
update()
Variables
  TKINTER = True
  _downloader = Downloader()
Function Details

md5_hexdigest(file)

Calculate and return the MD5 checksum for a given file. file may either be a filename or an open stream.
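
The behaviour described above (accepting either a filename or an open stream) can be sketched with the standard hashlib module. This is an illustration under stated assumptions, not NLTK's own source: the names md5_hexdigest_sketch and _md5_of_stream and the block_size parameter are hypothetical, and streams are assumed to be opened in binary mode.

```python
import hashlib


def md5_hexdigest_sketch(file, block_size=1 << 20):
    """Return the MD5 hex digest of a filename or an open binary stream."""
    if isinstance(file, (str, bytes)):
        # A path was given: open it in binary mode and hash its contents.
        with open(file, "rb") as stream:
            return _md5_of_stream(stream, block_size)
    # Otherwise assume an already open binary stream.
    return _md5_of_stream(file, block_size)


def _md5_of_stream(stream, block_size):
    digest = hashlib.md5()
    # Read fixed-size blocks so a large file is never held in memory whole.
    for block in iter(lambda: stream.read(block_size), b""):
        digest.update(block)
    return digest.hexdigest()
```

Reading in blocks keeps memory use bounded, which matters for large corpus archives.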

build_index(root, base_url)

Create a new data.xml index file, by combining the xml description files for various packages and collections. root should be the path to a directory containing the package xml and zip files, as well as the collection xml files. The root directory is expected to have the following subdirectories:

 root/
   packages/ .................. subdirectory for packages
     corpora/ ................. zip & xml files for corpora
     grammars/ ................ zip & xml files for grammars
     taggers/ ................. zip & xml files for taggers
     tokenizers/ .............. zip & xml files for tokenizers
     etc.
   collections/ ............... xml files for collections

For each package, there should be two files: package.zip contains the package itself, as a compressed zip file; and package.xml is an xml description of the package. The zipfile package.zip should expand to a single subdirectory named package/. The base filename package must match the identifier given in the package's xml file.

For each collection, there should be a single file collection.xml, describing the collection.

All identifiers (for both packages and collections) must be unique.
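
Two of the per-package requirements above (a matching xml file next to each zip file, and a zip file that expands to a single package/ subdirectory) can be checked with a short script. The sketch below is an illustration only: check_package_files is a hypothetical helper, not NLTK's _check_package(), and it does not verify that the identifier inside the xml file matches the base filename.

```python
import os
import zipfile


def check_package_files(zip_path):
    """Return a list of problems found for one package zip (empty if none).

    check_package_files is a hypothetical helper for this sketch.
    """
    problems = []
    base, _ = os.path.splitext(zip_path)
    package = os.path.basename(base)
    # Each package.zip needs a sibling package.xml description.
    if not os.path.isfile(base + ".xml"):
        problems.append("missing %s.xml" % package)
    # Every archive member must live under the single directory package/.
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.startswith(package + "/"):
                problems.append("unexpected member %s" % name)
    return problems
```

Running such a check over every zip file under root/packages/ before building the index would surface layout mistakes early.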

_find_packages(root)

Helper for build_index(): Yield a list of tuples (pkg_xml, zf, subdir), where:

  • pkg_xml is an ElementTree.Element holding the xml for a package
  • zf is a zipfile.ZipFile for the package's contents.
  • subdir is the subdirectory (relative to root) where the package was found (e.g. 'corpora' or 'grammars').
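
A generator with the shape described above might look like the following sketch. It is not NLTK's implementation: find_packages_sketch is a hypothetical name, the packages are assumed to live under root/packages/ as in the build_index() layout, each xml file is assumed to have a zip file with the same base name beside it, and subdir is reported as in the examples above (for instance 'corpora').

```python
import os
import zipfile
from xml.etree import ElementTree


def find_packages_sketch(root):
    """Yield (pkg_xml, zf, subdir) for each package under root/packages/."""
    packages_dir = os.path.join(root, "packages")
    for subdir in sorted(os.listdir(packages_dir)):
        subdir_path = os.path.join(packages_dir, subdir)
        if not os.path.isdir(subdir_path):
            continue
        for filename in sorted(os.listdir(subdir_path)):
            if not filename.endswith(".xml"):
                continue
            xml_path = os.path.join(subdir_path, filename)
            # Assumes package.zip sits next to package.xml.
            zip_path = xml_path[:-len(".xml")] + ".zip"
            pkg_xml = ElementTree.parse(xml_path).getroot()
            # The caller is responsible for closing the ZipFile.
            yield pkg_xml, zipfile.ZipFile(zip_path), subdir
```

Because the zip files are opened lazily, one per yielded tuple, the caller can process and close each archive before the next one is opened.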