
Source Code for Module nltk.downloader

# Natural Language Toolkit: Corpus & Model Downloader
#
# Copyright (C) 2001-2011 NLTK Project
# Author: Edward Loper <edloper@gradient.cis.upenn.edu>
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT

"""
The NLTK corpus and module downloader.  This module defines several
interfaces which can be used to download corpora, models, and other
data packages that can be used with NLTK.

Downloading Packages
====================
If called with no arguments, the L{download() <Downloader.download>}
function will display an interactive interface which can be used to
download and install new packages.  If Tkinter is available, then a
graphical interface will be shown; otherwise, a simple text interface
will be provided.

Individual packages can be downloaded by calling the C{download()}
function with a single argument, giving the package identifier for the
package that should be downloaded:

  >>> download('treebank') # doctest: +SKIP
  [nltk_data] Downloading package 'treebank'...
  [nltk_data]   Unzipping corpora/treebank.zip.

NLTK also provides a number of "package collections", each consisting
of a group of related packages.  To download all packages in a
collection, simply call C{download()} with the collection's
identifier:

  >>> download('all-corpora') # doctest: +SKIP
  [nltk_data] Downloading package 'abc'...
  [nltk_data]   Unzipping corpora/abc.zip.
  [nltk_data] Downloading package 'alpino'...
  [nltk_data]   Unzipping corpora/alpino.zip.
    ...
  [nltk_data] Downloading package 'words'...
  [nltk_data]   Unzipping corpora/words.zip.

Download Directory
==================
By default, packages are installed in either a system-wide directory
(if Python has sufficient access to write to it), or in the current
user's home directory.  However, the C{download_dir} argument may be
used to specify a different installation target, if desired.

See L{Downloader.default_download_dir()} for a more detailed
description of how the default download directory is chosen.

NLTK Download Server
====================
Before downloading any packages, the corpus and module downloader
contacts the NLTK download server, to retrieve an index file
describing the available packages.  By default, this index file is
loaded from C{<http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml>}.
If necessary, it is possible to create a new L{Downloader} object,
specifying a different URL for the package index file.

Usage::

    python nltk/downloader.py [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS

or with Python 2.5+::

    python -m nltk.downloader [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
"""
#----------------------------------------------------------------------
"""

  0     1  2    3
[label][----][label][----]
[column  ][column     ]

Notes
=====
Handling data files..  Some questions:

* Should the data files be kept zipped or unzipped?  I say zipped.

* Should the data files be kept in svn at all?  Advantages: history;
  automatic version numbers; 'svn up' could be used rather than the
  downloader to update the corpora.  Disadvantages: they're big,
  which makes working from svn a bit of a pain.  And we're planning
  to potentially make them much bigger.  I don't think we want
  people to have to download 400mb corpora just to use nltk from
  svn.

* Compromise: keep the data files in trunk/data rather than in
  trunk/nltk.  That way you can check them out in svn if you want
  to; but you don't need to, and you can use the downloader instead.

* Also: keep models in mind.  When we change the code, we'd
  potentially like the models to get updated.  This could require a
  little thought.

* So.. let's assume we have a trunk/data directory, containing a bunch
  of packages.  The packages should be kept as zip files, because we
  really shouldn't be editing them much (well -- we may edit models
  more, but they tend to be binary-ish files anyway, where diffs
  aren't that helpful).  So we'll have trunk/data, with a bunch of
  files like abc.zip and treebank.zip and propbank.zip.  For each
  package we could also have eg treebank.xml and propbank.xml,
  describing the contents of the package (name, copyright, license,
  etc).  Collections would also have .xml files.  Finally, we would
  pull all these together to form a single index.xml file.  Some
  directory structure wouldn't hurt.  So how about::

    /trunk/data/ ....................... root of data svn
      index.xml ........................ main index file
      src/ ............................. python scripts
      packages/ ........................ dir for packages
        corpora/ ....................... zip & xml files for corpora
        grammars/ ...................... zip & xml files for grammars
        taggers/ ....................... zip & xml files for taggers
        tokenizers/ .................... zip & xml files for tokenizers
        etc.
      collections/ ..................... xml files for collections

  Where the root (/trunk/data) would contain a makefile; and src/
  would contain a script to update the info.xml file.  It could also
  contain scripts to rebuild some of the various model files.  The
  script that builds index.xml should probably check that each zip
  file expands entirely into a single subdir, whose name matches the
  package's uid.

Changes I need to make:
  - in index: change "size" to "filesize" or "compressed-size"
  - in index: add "unzipped-size"
  - when checking status: check both compressed & uncompressed size.
    uncompressed size is important to make sure we detect a problem
    if something got partially unzipped.  define new status values
    to differentiate stale vs corrupt vs corruptly-uncompressed??
    (we shouldn't need to re-download the file if the zip file is ok
    but it didn't get uncompressed fully.)
  - add other fields to the index: author, license, copyright, contact,
    etc.

the current grammars/ package would become a single new package (eg
toy-grammars or book-grammars).

xml file should have:
  - authorship info
  - license info
  - copyright info
  - contact info
  - info about what type of data/annotation it contains?
  - recommended corpus reader?

collections can contain other collections.  they can also contain
multiple package types (corpora & models).  Have a single 'basics'
package that includes everything we talk about in the book?

n.b.: there will have to be a fallback to the punkt tokenizer, in case
they didn't download that model.

default: unzip or not?

"""
import time, re, os, zipfile, sys, textwrap, threading, itertools
from cStringIO import StringIO
try:
    from hashlib import md5
except ImportError:
    # hashlib is unavailable before Python 2.5; fall back to the md5 module.
    from md5 import md5

try:
    TKINTER = True
    from Tkinter import *
    from tkMessageBox import *
    from nltk.draw.table import Table
    from nltk.draw import ShowText
except:
    # Any failure here (not only ImportError) disables the graphical
    # interface, so the bare except is kept.
    TKINTER = False
    TclError = ValueError

from nltk.etree import ElementTree
import nltk
urllib2 = nltk.internals.import_from_stdlib('urllib2')


######################################################################
# Directory entry objects (from the data server's index file)
######################################################################

class Package(object):
    """
    A directory entry for a downloadable package.  These entries are
    extracted from the XML index file that is downloaded by
    L{Downloader}.  Each package consists of a single file; but if
    that file is a zip file, then it can be automatically decompressed
    when the package is installed.
    """
    def __init__(self, id, url, name=None, subdir='',
                 size=None, unzipped_size=None,
                 checksum=None, svn_revision=None,
                 copyright='Unknown', contact='Unknown',
                 license='Unknown', author='Unknown',
                 unzip=True,
                 **kw):
        self.id = id
        """A unique identifier for this package."""

        self.name = name or id
        """A string name for this package."""

        self.subdir = subdir
        """The subdirectory where this package should be installed.
        E.g., C{'corpora'} or C{'taggers'}."""

        self.url = url
        """A URL that can be used to download this package's file."""

        self.size = int(size)
        """The filesize (in bytes) of the package file."""

        self.unzipped_size = int(unzipped_size)
        """The total filesize of the files contained in the package's
        zipfile."""

        self.checksum = checksum
        """The MD-5 checksum of the package file."""

        self.svn_revision = svn_revision
        """A subversion revision number for this package."""

        self.copyright = copyright
        """Copyright holder for this package."""

        self.contact = contact
        """Name & email of the person who should be contacted with
        questions about this package."""

        self.license = license
        """License information for this package."""

        self.author = author
        """Author of this package."""

        ext = os.path.splitext(url.split('/')[-1])[1]
        self.filename = os.path.join(subdir, id+ext)
        """The filename that should be used for this package's file.  It
        is formed by joining C{self.subdir} with C{self.id}, and
        using the same extension as C{url}."""

        self.unzip = bool(int(unzip)) # '0' or '1'
        """A flag indicating whether this corpus should be unzipped by
        default."""

        # Include any other attributes provided by the XML file.
        self.__dict__.update(kw)

    @staticmethod
    def fromxml(xml):
        if isinstance(xml, basestring):
            xml = ElementTree.parse(xml)
        return Package(**xml.attrib)

    def __repr__(self):
        return '<Package %s>' % self.id

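# Illustrative sketch (not part of nltk.downloader): how Package derives
# its local filename.  The helper name package_filename and the
# example.org URL are assumptions made up for this example; the two
# expressions inside the helper mirror the ones in Package.__init__.

```python
import os

def package_filename(pkg_id, url, subdir=''):
    """Join subdir with pkg_id, reusing the extension of the URL's last component."""
    ext = os.path.splitext(url.split('/')[-1])[1]
    return os.path.join(subdir, pkg_id + ext)

# example.org is a placeholder, not a real NLTK data location.
zipped = package_filename('treebank',
                          'http://example.org/corpora/treebank.zip',
                          'corpora')
# A URL whose last component has no extension yields a bare identifier.
bare = package_filename('readme', 'http://example.org/docs/readme')
print(zipped)
print(bare)
```

# The extension is taken from the URL rather than from the identifier,
# so a package is only treated as a zip file when its URL ends in .zip.
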
class Collection(object):
    """
    A directory entry for a collection of downloadable packages.
    These entries are extracted from the XML index file that is
    downloaded by L{Downloader}.
    """
    def __init__(self, id, children, name=None, **kw):
        self.id = id
        """A unique identifier for this collection."""

        self.name = name or id
        """A string name for this collection."""

        self.children = children
        """A list of the L{Collections} or L{Packages} directly
        contained by this collection."""

        self.packages = None
        """A list of L{Packages} contained by this collection or any
        collections it recursively contains."""

        # Include any other attributes provided by the XML file.
        self.__dict__.update(kw)

    @staticmethod
    def fromxml(xml):
        if isinstance(xml, basestring):
            xml = ElementTree.parse(xml)
        children = [child.get('ref') for child in xml.findall('item')]
        return Collection(children=children, **xml.attrib)

    def __repr__(self):
        return '<Collection %s>' % self.id

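# Illustrative sketch (not part of nltk.downloader): the kind of XML
# element that Collection.fromxml reads.  The sample element is
# hypothetical; only the 'item' child tag and its 'ref' attribute come
# from the code above.  The standard library's xml.etree is used here in
# place of nltk.etree.

```python
from xml.etree import ElementTree

# Hypothetical index fragment, made up for this example.
SAMPLE = """
<collection id="demo" name="Demo collection">
  <item ref="abc" />
  <item ref="treebank" />
</collection>
"""

element = ElementTree.fromstring(SAMPLE.strip())

# Child identifiers come from the 'ref' attribute of each <item>.
children = [child.get('ref') for child in element.findall('item')]

# The element's own attributes become constructor keyword arguments.
attributes = dict(element.attrib)

print(children)
print(sorted(attributes.items()))
```

# At this stage the children are still plain identifier strings; they
# are replaced by Package and Collection objects only once the whole
# index has been read.
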
######################################################################
# Message Passing Objects
######################################################################

class DownloaderMessage(object):
    """A status message object, used by L{incr_download} to
    communicate its progress."""
class StartCollectionMessage(DownloaderMessage):
    """Data server has started working on a collection of packages."""
    def __init__(self, collection): self.collection = collection
class FinishCollectionMessage(DownloaderMessage):
    """Data server has finished working on a collection of packages."""
    def __init__(self, collection): self.collection = collection
class StartPackageMessage(DownloaderMessage):
    """Data server has started working on a package."""
    def __init__(self, package): self.package = package
class FinishPackageMessage(DownloaderMessage):
    """Data server has finished working on a package."""
    def __init__(self, package): self.package = package
class StartDownloadMessage(DownloaderMessage):
    """Data server has started downloading a package."""
    def __init__(self, package): self.package = package
class FinishDownloadMessage(DownloaderMessage):
    """Data server has finished downloading a package."""
    def __init__(self, package): self.package = package
class StartUnzipMessage(DownloaderMessage):
    """Data server has started unzipping a package."""
    def __init__(self, package): self.package = package
class FinishUnzipMessage(DownloaderMessage):
    """Data server has finished unzipping a package."""
    def __init__(self, package): self.package = package
class UpToDateMessage(DownloaderMessage):
    """The package download file is already up-to-date"""
    def __init__(self, package): self.package = package
class StaleMessage(DownloaderMessage):
    """The package download file is out-of-date or corrupt"""
    def __init__(self, package): self.package = package
class ErrorMessage(DownloaderMessage):
    """Data server encountered an error"""
    def __init__(self, package, message):
        self.package = package
        if isinstance(message, Exception):
            self.message = str(message)
        else:
            self.message = message

class ProgressMessage(DownloaderMessage):
    """Indicates how much progress the data server has made"""
    def __init__(self, progress): self.progress = progress
class SelectDownloadDirMessage(DownloaderMessage):
    """Indicates what download directory the data server is using"""
    def __init__(self, download_dir): self.download_dir = download_dir

######################################################################
# NLTK Data Server
######################################################################

class Downloader(object):
    """
    A class used to access the NLTK data server, which can be used to
    download corpora and other data packages.
    """

    #/////////////////////////////////////////////////////////////////
    # Configuration
    #/////////////////////////////////////////////////////////////////

    INDEX_TIMEOUT = 60*60 # 1 hour
    """The amount of time after which the cached copy of the data
    server index will be considered 'stale,' and will be
    re-downloaded."""

    DEFAULT_URL = 'http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml'
    """The default URL for the NLTK data server's index.  An
    alternative URL can be specified when creating a new
    C{Downloader} object."""

    #/////////////////////////////////////////////////////////////////
    # Status Constants
    #/////////////////////////////////////////////////////////////////

    INSTALLED = 'installed'
    """A status string indicating that a package or collection is
    installed and up-to-date."""
    NOT_INSTALLED = 'not installed'
    """A status string indicating that a package or collection is
    not installed."""
    STALE = 'out of date'
    """A status string indicating that a package or collection is
    corrupt or out-of-date."""
    PARTIAL = 'partial'
    """A status string indicating that a collection is partially
    installed (i.e., only some of its packages are installed.)"""

    #/////////////////////////////////////////////////////////////////
    # Constructor
    #/////////////////////////////////////////////////////////////////

    def __init__(self, server_index_url=None, download_dir=None):
        self._url = server_index_url or self.DEFAULT_URL
        """The URL for the data server's index file."""

        self._collections = {}
        """Dictionary from collection identifier to L{Collection}"""

        self._packages = {}
        """Dictionary from package identifier to L{Package}"""

        self._download_dir = download_dir
        """The default directory to which packages will be downloaded."""

        self._index = None
        """The XML index file downloaded from the data server"""

        self._index_timestamp = None
        """Time at which L{self._index} was downloaded.  If it is more
        than L{INDEX_TIMEOUT} seconds old, it will be re-downloaded."""

        self._status_cache = {}
        """Dictionary from package/collection identifier to status
        string (L{INSTALLED}, L{NOT_INSTALLED}, L{STALE}, or
        L{PARTIAL}).  Cache is used for packages only, not
        collections."""

        self._errors = None
        """Flag for telling if all packages got successfully downloaded or not."""

        # decide where we're going to save things to.
        if self._download_dir is None:
            self._download_dir = self.default_download_dir()

    #/////////////////////////////////////////////////////////////////
    # Information
    #/////////////////////////////////////////////////////////////////

    def list(self, download_dir=None, show_packages=True,
             show_collections=True, header=True, more_prompt=False,
             skip_installed=False):
        lines = 0 # for more_prompt
        if download_dir is None:
            download_dir = self._download_dir
            print 'Using default data directory (%s)' % download_dir
        if header:
            print '='*(26+len(self._url))
            print ' Data server index for <%s>' % self._url
            print '='*(26+len(self._url))
            lines += 3 # for more_prompt
        stale = partial = False

        categories = []
        if show_packages: categories.append('packages')
        if show_collections: categories.append('collections')
        for category in categories:
            print '%s:' % category.capitalize()
            lines += 1 # for more_prompt
            for info in sorted(getattr(self, category)()):
                status = self.status(info, download_dir)
                if status == self.INSTALLED and skip_installed: continue
                if status == self.STALE: stale = True
                if status == self.PARTIAL: partial = True
                prefix = {self.INSTALLED:'*', self.STALE:'-',
                          self.PARTIAL:'P', self.NOT_INSTALLED: ' '}[status]
                name = textwrap.fill('-'*27 + (info.name or info.id),
                                     75, subsequent_indent=27*' ')[27:]
                print ' [%s] %s %s' % (prefix, info.id.ljust(20, '.'), name)
                lines += len(name.split('\n')) # for more_prompt
                if more_prompt and lines > 20:
                    user_input = raw_input("Hit Enter to continue: ")
                    if (user_input.lower() in ('x', 'q')): return
                    lines = 0
            print
        msg = '([*] marks installed packages'
        if stale: msg += '; [-] marks out-of-date or corrupt packages'
        if partial: msg += '; [P] marks partially installed collections'
        print textwrap.fill(msg+')', subsequent_indent=' ', width=76)

    def packages(self):
        self._update_index()
        return self._packages.values()

    def corpora(self):
        self._update_index()
        return [pkg for (id,pkg) in self._packages.items()
                if pkg.subdir == 'corpora']

    def models(self):
        self._update_index()
        return [pkg for (id,pkg) in self._packages.items()
                if pkg.subdir != 'corpora']

    def collections(self):
        self._update_index()
        return self._collections.values()

    #/////////////////////////////////////////////////////////////////
    # Downloading
    #/////////////////////////////////////////////////////////////////

    def _info_or_id(self, info_or_id):
        if isinstance(info_or_id, basestring):
            return self.info(info_or_id)
        else:
            return info_or_id

    # [xx] When during downloading is it 'safe' to abort?  Only unsafe
    # time is *during* an unzip -- we don't want to leave a
    # partially-unzipped corpus in place because we wouldn't notice
    # it.  But if we had the exact total size of the unzipped corpus,
    # then that would be fine.  Then we could abort anytime we want!
    # So this is really what we should do.  That way the threaded
    # downloader in the gui can just kill the download thread anytime
    # it wants.

    def incr_download(self, info_or_id, download_dir=None, force=False):
        # If they didn't specify a download_dir, then use the default one.
        if download_dir is None:
            download_dir = self._download_dir
            yield SelectDownloadDirMessage(download_dir)

        # If they gave us a list of ids, then download each one.
        if isinstance(info_or_id, (list,tuple)):
            for msg in self._download_list(info_or_id, download_dir, force):
                yield msg
            return

        # Look up the requested collection or package.
        try: info = self._info_or_id(info_or_id)
        except (IOError, ValueError), e:
            yield ErrorMessage(None, 'Error loading %s: %s' %
                               (info_or_id, e))
            return

        # Handle collections.
        if isinstance(info, Collection):
            yield StartCollectionMessage(info)
            for msg in self.incr_download(info.children, download_dir, force):
                yield msg
            yield FinishCollectionMessage(info)

        # Handle Packages (delegate to a helper function).
        else:
            for msg in self._download_package(info, download_dir, force):
                yield msg

    def _num_packages(self, item):
        if isinstance(item, Package): return 1
        else: return len(item.packages)

    def _download_list(self, items, download_dir, force):
        # Look up the requested items.
        for i in range(len(items)):
            try: items[i] = self._info_or_id(items[i])
            except (IOError, ValueError), e:
                yield ErrorMessage(items[i], e)
                return

        # Download each item, re-scaling their progress.
        num_packages = sum(self._num_packages(item) for item in items)
        progress = 0
        for i, item in enumerate(items):
            if isinstance(item, Package): delta = 1./num_packages
            else: delta = float(len(item.packages))/num_packages
            for msg in self.incr_download(item, download_dir, force):
                if isinstance(msg, ProgressMessage):
                    yield ProgressMessage(progress + msg.progress*delta)
                else:
                    yield msg

            progress += 100*delta

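# Illustrative sketch (not part of nltk.downloader): the progress
# re-scaling done by _download_list, reduced to plain numbers.  The
# function name rescale_progress and its two arguments are assumptions
# for this example; the arithmetic follows the loop above, where each
# item reports its own progress on a 0-100 scale.

```python
def rescale_progress(item_sizes, per_item_progress):
    """Yield overall progress (0-100) from each item's own 0-100 progress.

    item_sizes: number of packages in each item (1 for a single package).
    per_item_progress: for each item, the progress values it reports.
    """
    total = float(sum(item_sizes))
    done = 0.0
    for size, updates in zip(item_sizes, per_item_progress):
        delta = size / total            # this item's share of the whole job
        for value in updates:
            yield done + value * delta  # shrink 0-100 into the item's share
        done += 100 * delta             # the item is complete

# One single package followed by a three-package collection.
overall = list(rescale_progress([1, 3], [[0, 100], [0, 50, 100]]))
print(overall)
```

# Weighting by package count treats every package as equally costly,
# whatever its size in bytes.
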
    def _download_package(self, info, download_dir, force):
        yield StartPackageMessage(info)
        yield ProgressMessage(0)

        # Do we already have the current version?
        status = self.status(info, download_dir)
        if not force and status == self.INSTALLED:
            yield UpToDateMessage(info)
            yield ProgressMessage(100)
            yield FinishPackageMessage(info)
            return

        # Remove the package from our status cache
        self._status_cache.pop(info.id, None)

        # Check for (and remove) any old/stale version.
        filepath = os.path.join(download_dir, info.filename)
        if os.path.exists(filepath):
            if status == self.STALE:
                yield StaleMessage(info)
            os.remove(filepath)

        # Ensure the download_dir exists
        if not os.path.exists(download_dir):
            os.mkdir(download_dir)
        if not os.path.exists(os.path.join(download_dir, info.subdir)):
            os.mkdir(os.path.join(download_dir, info.subdir))

        # Download the file.  This will raise an IOError if the url
        # is not found.
        yield StartDownloadMessage(info)
        yield ProgressMessage(5)
        try:
            infile = urllib2.urlopen(info.url)
            outfile = open(filepath, 'wb')
            #print info.size
            num_blocks = max(1, float(info.size)/(1024*16))
            for block in itertools.count():
                s = infile.read(1024*16) # 16k blocks.
                outfile.write(s)
                if not s: break
                if block % 2 == 0: # how often?
                    yield ProgressMessage(min(80, 5+75*(block/num_blocks)))
            infile.close()
            outfile.close()
        except IOError, e:
            yield ErrorMessage(info, 'Error downloading %r from <%s>:'
                               '\n %s' % (info.id, info.url, e))
            return
        yield FinishDownloadMessage(info)
        yield ProgressMessage(80)

        # If it's a zipfile, uncompress it.
        if info.filename.endswith('.zip'):
            zipdir = os.path.join(download_dir, info.subdir)
            # Unzip if we're unzipping by default; *or* if it's already
            # been unzipped (presumably a previous version).
            if info.unzip or os.path.exists(os.path.join(zipdir, info.id)):
                yield StartUnzipMessage(info)
                for msg in _unzip_iter(filepath, zipdir, verbose=False):
                    # Somewhat of a hack, but we need a proper package reference
                    msg.package = info
                    yield msg
                yield FinishUnzipMessage(info)

        yield FinishPackageMessage(info)

    def download(self, info_or_id=None, download_dir=None, quiet=False,
                 force=False, prefix='[nltk_data] ', halt_on_error=True,
                 raise_on_error=False):
        # If no info or id is given, then use the interactive shell.
        if info_or_id is None:
            # [xx] hmm -- changing self._download_dir here seems like
            # the wrong thing to do.  Maybe the _interactive_download
            # function should make a new copy of self to use?
            if download_dir is not None: self._download_dir = download_dir
            self._interactive_download()
            return True

        else:
            # Define a helper function for displaying output:
            def show(s, prefix2=''):
                print textwrap.fill(s, initial_indent=prefix+prefix2,
                                    subsequent_indent=prefix+prefix2+' '*4)

            for msg in self.incr_download(info_or_id, download_dir, force):
                # Error messages
                if isinstance(msg, ErrorMessage):
                    show(msg.message)
                    if raise_on_error:
                        raise ValueError(msg.message)
                    if halt_on_error:
                        return False
                    self._errors = True
                    if not quiet:
                        print "Error installing package. Retry? [n/y/e]"
                        choice = raw_input().strip()
                        if choice in ['y', 'Y']:
                            if not self.download(msg.package.id, download_dir,
                                                 quiet, force, prefix,
                                                 halt_on_error, raise_on_error):
                                return False
                        elif choice in ['e', 'E']:
                            return False

                # All other messages
                if not quiet:
                    # Collection downloading messages:
                    if isinstance(msg, StartCollectionMessage):
                        show('Downloading collection %r' % msg.collection.id)
                        prefix += ' | '
                        print prefix
                    elif isinstance(msg, FinishCollectionMessage):
                        print prefix
                        prefix = prefix[:-4]
                        if self._errors:
                            show('Downloaded collection %r with errors' %
                                 msg.collection.id)
                        else:
                            show('Done downloading collection %r' %
                                 msg.collection.id)

                    # Package downloading messages:
                    elif isinstance(msg, StartPackageMessage):
                        show('Downloading package %r to %s...' %
                             (msg.package.id, download_dir))
                    elif isinstance(msg, UpToDateMessage):
                        show('Package %s is already up-to-date!' %
                             msg.package.id, ' ')
                    #elif isinstance(msg, StaleMessage):
                    #    show('Package %s is out-of-date or corrupt' %
                    #         msg.package.id, ' ')
                    elif isinstance(msg, StartUnzipMessage):
                        show('Unzipping %s.' % msg.package.filename, ' ')

                    # Data directory message:
                    elif isinstance(msg, SelectDownloadDirMessage):
                        download_dir = msg.download_dir
        return True

    def is_stale(self, info_or_id, download_dir=None):
        return self.status(info_or_id, download_dir) == self.STALE

    def is_installed(self, info_or_id, download_dir=None):
        return self.status(info_or_id, download_dir) == self.INSTALLED

    def clear_status_cache(self, id=None):
        if id is None:
            self._status_cache.clear()
        else:
            self._status_cache.pop(id, None)

    def status(self, info_or_id, download_dir=None):
        """
        Return a constant describing the status of the given package
        or collection.  Status can be one of L{INSTALLED},
        L{NOT_INSTALLED}, L{STALE}, or L{PARTIAL}.
        """
        if download_dir is None: download_dir = self._download_dir
        info = self._info_or_id(info_or_id)

        # Handle collections:
        if isinstance(info, Collection):
            pkg_status = [self.status(pkg.id) for pkg in info.packages]
            if self.STALE in pkg_status:
                return self.STALE
            elif self.PARTIAL in pkg_status:
                return self.PARTIAL
            elif (self.INSTALLED in pkg_status and
                  self.NOT_INSTALLED in pkg_status):
                return self.PARTIAL
            elif self.NOT_INSTALLED in pkg_status:
                return self.NOT_INSTALLED
            else:
                return self.INSTALLED

        # Handle packages:
        else:
            filepath = os.path.join(download_dir, info.filename)
            if download_dir != self._download_dir:
                # The cache only describes the default download
                # directory, so compute (and return) the status directly.
                return self._pkg_status(info, filepath)
            else:
                if info.id not in self._status_cache:
                    self._status_cache[info.id] = self._pkg_status(info,
                                                                   filepath)
                return self._status_cache[info.id]

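# Illustrative sketch (not part of nltk.downloader): the rule that
# status() applies to a collection, written as a standalone function.
# The name collection_status is an assumption for this example; the
# string constants copy the values of the Downloader status constants.

```python
INSTALLED = 'installed'
NOT_INSTALLED = 'not installed'
STALE = 'out of date'
PARTIAL = 'partial'

def collection_status(pkg_statuses):
    """Combine the statuses of a collection's packages into one status."""
    if STALE in pkg_statuses:
        return STALE            # any stale package makes the collection stale
    if PARTIAL in pkg_statuses:
        return PARTIAL
    if INSTALLED in pkg_statuses and NOT_INSTALLED in pkg_statuses:
        return PARTIAL          # a mix of installed and missing packages
    if NOT_INSTALLED in pkg_statuses:
        return NOT_INSTALLED    # nothing installed at all
    return INSTALLED            # also reached for an empty list

print(collection_status([INSTALLED, NOT_INSTALLED]))
print(collection_status([INSTALLED, STALE, NOT_INSTALLED]))
```

# Checking STALE first means a single corrupt package is reported even
# when the rest of the collection is only partially installed.
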
    def _pkg_status(self, info, filepath):
        if not os.path.exists(filepath):
            return self.NOT_INSTALLED

        # Check if the file has the correct size.
        try: filestat = os.stat(filepath)
        except OSError: return self.NOT_INSTALLED
        if filestat.st_size != int(info.size):
            return self.STALE

        # Check if the file's checksum matches
        if md5_hexdigest(filepath) != info.checksum:
            return self.STALE

        # If it's a zipfile, and it's been at least partially
        # unzipped, then check if it's been fully unzipped.
        if filepath.endswith('.zip'):
            unzipdir = filepath[:-4]
            if not os.path.exists(unzipdir):
                return self.INSTALLED # but not unzipped -- ok!
            if not os.path.isdir(unzipdir):
                return self.STALE

            unzipped_size = sum(os.stat(os.path.join(d, f)).st_size
                                for d, _, files in os.walk(unzipdir)
                                for f in files)
            if unzipped_size != info.unzipped_size:
                return self.STALE

        # Otherwise, everything looks good.
        return self.INSTALLED

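# Illustrative sketch (not part of nltk.downloader): the size check
# followed by the MD5 check that _pkg_status performs on a downloaded
# file.  md5_hexdigest is referenced above but defined outside this
# listing; file_matches below is a hypothetical stand-in built on
# hashlib, not the module's own helper.

```python
import hashlib
import os
import tempfile

def file_matches(path, expected_size, expected_md5):
    """Return True if the file has the expected size and MD5 hex digest."""
    if not os.path.exists(path):
        return False
    # Cheap test first: a size mismatch avoids hashing the file at all.
    if os.stat(path).st_size != expected_size:
        return False
    digest = hashlib.md5()
    with open(path, 'rb') as stream:
        # Read in 16k blocks so large files are never held in memory whole.
        for block in iter(lambda: stream.read(16 * 1024), b''):
            digest.update(block)
    return digest.hexdigest() == expected_md5

# Demonstrate on a temporary file with made-up contents.
payload = b'hello nltk'
handle, path = tempfile.mkstemp()
with os.fdopen(handle, 'wb') as stream:
    stream.write(payload)
checksum = hashlib.md5(payload).hexdigest()
good = file_matches(path, len(payload), checksum)
wrong_size = file_matches(path, len(payload) + 1, checksum)
wrong_sum = file_matches(path, len(payload), '0' * 32)
os.remove(path)
print(good, wrong_size, wrong_sum)
```
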
    def update(self, quiet=False, prefix='[nltk_data] '):
        """
        Re-download any packages whose status is STALE.
        """
        self.clear_status_cache()
        for pkg in self.packages():
            if self.status(pkg) == self.STALE:
                self.download(pkg, quiet=quiet, prefix=prefix)

    #/////////////////////////////////////////////////////////////////
    # Index
    #/////////////////////////////////////////////////////////////////

    def _update_index(self, url=None):
        """A helper function that ensures that self._index is
        up-to-date.  If the index is older than self.INDEX_TIMEOUT,
        then download it again."""
        # Check if the index is already up-to-date.  If so, do nothing.
        if not (self._index is None or url is not None or
                time.time()-self._index_timestamp > self.INDEX_TIMEOUT):
            return

        # If a URL was specified, then update our URL.
        self._url = url or self._url

        # Download the index file.
        self._index = nltk.internals.ElementWrapper(
            ElementTree.parse(urllib2.urlopen(self._url)).getroot())
        self._index_timestamp = time.time()

        # Build a dictionary of packages.
        packages = [Package.fromxml(p) for p in
                    self._index.findall('packages/package')]
        self._packages = dict((p.id, p) for p in packages)

        # Build a dictionary of collections.
        collections = [Collection.fromxml(c) for c in
                       self._index.findall('collections/collection')]
        self._collections = dict((c.id, c) for c in collections)

        # Replace identifiers with actual children in collection.children.
        for collection in self._collections.values():
            for i, child_id in enumerate(collection.children):
                if child_id in self._packages:
                    collection.children[i] = self._packages[child_id]
                if child_id in self._collections:
                    collection.children[i] = self._collections[child_id]

        # Fill in collection.packages for each collection.
        for collection in self._collections.values():
            packages = {}
            queue = [collection]
            for child in queue:
                if isinstance(child, Collection):
                    queue.extend(child.children)
                else:
                    packages[child.id] = child
            collection.packages = packages.values()

        # Flush the status cache
        self._status_cache.clear()

849 - def index(self):
850 """ 851 Return the XML index describing the packages available from 852 the data server. If necessary, this index will be downloaded 853 from the data server. 854 """ 855 self._update_index() 856 return self._index
857
858 - def info(self, id):
859 """Return the L{Package} or L{Collection} record for the 860 given item.""" 861 self._update_index() 862 if id in self._packages: return self._packages[id] 863 if id in self._collections: return self._collections[id] 864 raise ValueError('Package %r not found in index' % id)
865
866 - def xmlinfo(self, id):
867 """Return the XML info record for the given item""" 868 self._update_index() 869 for package in self._index.findall('packages/package'): 870 if package.get('id') == id: 871 return package 872 for collection in self._index.findall('collections/collection'): 873 if collection.get('id') == id: 874 return collection 875 raise ValueError('Package %r not found in index' % id)
876 877 #///////////////////////////////////////////////////////////////// 878 # URL & Data Directory 879 #///////////////////////////////////////////////////////////////// 880
881 - def _set_url(self, url):
882 # If we're unable to contact the given url, then keep the 883 # original url. 884 original_url = self._url 885 try: 886 self._update_index(url) 887 except: 888 self._url = original_url 889 raise
890 891 url = property(lambda self: self._url, _set_url, doc=""" 892 The URL for the data server's index file.""") 893
    def default_download_dir(self):
        """
        Return the directory to which packages will be downloaded by
        default.  This value can be overridden using the constructor,
        or on a case-by-case basis using the C{download_dir} argument when
        calling L{download()}.

        The default directory is determined as follows:

          - If any directory listed in C{nltk.data.path} exists and is
            writable, then return the first such directory.
          - Otherwise, on Windows, if the C{APPDATA} environment
            variable is set, then return C{I{APPDATA}/nltk_data}.
          - Otherwise, return C{~/nltk_data}, where C{~} is the
            current user's home directory.

        If the home directory cannot be determined, then a
        C{ValueError} is raised.
        """
        # Check if we have sufficient permissions to install in a
        # variety of system-wide locations.
        for nltkdir in nltk.data.path:
            if (os.path.exists(nltkdir) and
                nltk.internals.is_writable(nltkdir)):
                return nltkdir

        # On Windows, use %APPDATA%
        if sys.platform == 'win32' and 'APPDATA' in os.environ:
            homedir = os.environ['APPDATA']

        # Otherwise, install in the user's home directory.
        else:
            homedir = os.path.expanduser('~/')
            if homedir == '~/':
                raise ValueError("Could not find a default download directory")

        # append "nltk_data" to the home directory
        return os.path.join(homedir, 'nltk_data')
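The selection order used by C{default_download_dir} can be sketched as a pure function whose inputs are passed in explicitly, which makes it easy to try out. This is an illustration only: the function name C{pick_download_dir} and its parameters are hypothetical, and C{os.access} is used here as an assumed substitute for C{nltk.internals.is_writable}.

```python
import os


def pick_download_dir(search_path, platform, environ, home):
    """Return the first existing, writable directory in `search_path`;
    otherwise fall back to an 'nltk_data' directory under %APPDATA%
    (on Windows) or under the user's home directory."""
    for candidate in search_path:
        # Assumption: os.access stands in for nltk.internals.is_writable.
        if os.path.exists(candidate) and os.access(candidate, os.W_OK):
            return candidate

    if platform == 'win32' and 'APPDATA' in environ:
        base = environ['APPDATA']
    else:
        base = home
        # expanduser('~/') returns its argument unchanged when no
        # home directory can be determined.
        if base == '~/':
            raise ValueError("Could not find a default download directory")

    return os.path.join(base, 'nltk_data')
```

A call such as C{pick_download_dir(nltk.data.path, sys.platform, os.environ, os.path.expanduser('~/'))} would mirror the inputs the method reads.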
938
939 - def _set_download_dir(self, download_dir):
940 self._download_dir = download_dir 941 # Clear the status cache. 942 self._status_cache.clear()
943 944 download_dir = property(lambda self: self._download_dir, 945 _set_download_dir, doc=""" 946 The default directory to which packages will be downloaded. 947 This defaults to the value returned by L{default_download_dir()}. 948 To override this default on a case-by-case basis, use the 949 C{download_dir} argument when calling L{download()}.""") 950 951 #///////////////////////////////////////////////////////////////// 952 # Interactive Shell 953 #///////////////////////////////////////////////////////////////// 954
955 - def _interactive_download(self):
956 # Try the GUI first; if that doesn't work, try the simple 957 # interactive shell. 958 if TKINTER: 959 try: 960 DownloaderGUI(self).mainloop() 961 except TclError: 962 DownloaderShell(self).run() 963 else: 964 DownloaderShell(self).run()
965
966 -class DownloaderShell(object):
967 - def __init__(self, dataserver):
968 self._ds = dataserver
969
970 - def _simple_interactive_menu(self, *options):
971 print '-'*75 972 spc = (68 - sum(len(o) for o in options))/(len(options)-1)*' ' 973 print ' ' + spc.join(options) 974 #w = 76/len(options) 975 #fmt = ' ' + ('%-'+str(w)+'s')*(len(options)-1) + '%s' 976 #print fmt % options 977 print '-'*75
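The menu line above spreads its options evenly by dividing the leftover width by the number of gaps between options. A small sketch of that arithmetic follows. The helper name C{menu_line} is hypothetical, the leading indent of the original is omitted, and C{//} is used so that the integer division behaves the same on Python 2 and Python 3.

```python
def menu_line(options, width=68):
    """Join `options` with equal runs of spaces so that the line is
    at most `width` characters wide.  Needs at least two options."""
    leftover = width - sum(len(o) for o in options)
    gap = leftover // (len(options) - 1)
    return (' ' * gap).join(options)
```

With a single option the divisor is zero, so callers (like the original method) must always pass two or more options.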
978 979
    def run(self):
        print 'NLTK Downloader'
        while True:
            self._simple_interactive_menu(
                'd) Download', 'l) List', 'c) Config', 'h) Help', 'q) Quit')
            user_input = raw_input('Downloader> ').strip()
            if not user_input: print; continue
            command = user_input.lower().split()[0]
            args = user_input.split()[1:]
            try:
                if command == 'l':
                    print
                    self._ds.list(self._ds.download_dir, header=False,
                                  more_prompt=True)
                elif command == 'h':
                    self._simple_interactive_help()
                elif command == 'c':
                    self._simple_interactive_config()
                elif command in ('q', 'x'):
                    return
                elif command == 'd':
                    self._simple_interactive_download(args)
                else:
                    print 'Command %r unrecognized' % user_input
            except urllib2.HTTPError, e:
                print 'Error reading from server: %s'%e
            except urllib2.URLError, e:
                print 'Error connecting to server: %s'%e.reason
            # try checking if user_input is a package name, &
            # downloading it?
            print
1011
1012 - def _simple_interactive_download(self, args):
1013 if args: 1014 for arg in args: 1015 try: self._ds.download(arg, prefix=' ') 1016 except (IOError, ValueError), e: print e 1017 else: 1018 while True: 1019 print 1020 print 'Download which package (l=list; x=cancel)?' 1021 user_input = raw_input(' Identifier> ') 1022 if user_input.lower()=='l': 1023 self._ds.list(self._ds.download_dir, header=False, 1024 more_prompt=True, skip_installed=True) 1025 continue 1026 elif user_input.lower() in ('x', 'q', ''): 1027 return 1028 elif user_input: 1029 for id in user_input.split(): 1030 try: self._ds.download(id, prefix=' ') 1031 except (IOError, ValueError), e: print e 1032 break
1033
1034 - def _simple_interactive_help(self):
1035 print 1036 print 'Commands:' 1037 print ' d) Download a package or collection h) Help' 1038 print ' l) List packages & collections q) Quit' 1039 print ' c) View & Modify Configuration'
1040
1041 - def _show_config(self):
1042 print 1043 print 'Data Server:' 1044 print ' - URL: <%s>' % self._ds.url 1045 print (' - %d Package Collections Available' % 1046 len(self._ds.collections())) 1047 print (' - %d Individual Packages Available' % 1048 len(self._ds.packages())) 1049 print 1050 print 'Local Machine:' 1051 print ' - Data directory: %s' % self._ds.download_dir
1052
    def _simple_interactive_config(self):
        self._show_config()
        while True:
            print
            self._simple_interactive_menu(
                's) Show Config', 'u) Set Server URL',
                'd) Set Data Dir', 'm) Main Menu')
            user_input = raw_input('Config> ').strip().lower()
            if user_input == 's':
                self._show_config()
            elif user_input == 'd':
                # Keep the case as typed: directory names are
                # case-sensitive on many filesystems.
                new_dl_dir = raw_input(' New Directory> ').strip()
                if new_dl_dir.lower() in ('', 'x', 'q'):
                    print ' Cancelled!'
                elif os.path.isdir(new_dl_dir):
                    self._ds.download_dir = new_dl_dir
                else:
                    print ('Directory %r not found! Create it first.' %
                           new_dl_dir)
            elif user_input == 'u':
                # Keep the case as typed: URL paths are case-sensitive.
                new_url = raw_input(' New URL> ').strip()
                if new_url.lower() in ('', 'x', 'q'):
                    print ' Cancelled!'
                else:
                    if not new_url.lower().startswith('http://'):
                        new_url = 'http://'+new_url
                    try: self._ds.url = new_url
                    except Exception, e:
                        print 'Error reading <%r>:\n %s' % (new_url, e)
            elif user_input == 'm':
                break
1084
1085 -class DownloaderGUI(object):
1086 """ 1087 Graphical interface for downloading packages from the NLTK data 1088 server. 1089 """ 1090 1091 #///////////////////////////////////////////////////////////////// 1092 # Column Configuration 1093 #///////////////////////////////////////////////////////////////// 1094 1095 COLUMNS = ['', 'Identifier', 'Name', 'Size', 'Status', 1096 'Unzipped Size', 1097 'Copyright', 'Contact', 'License', 'Author', 1098 'SVN Revision', 'Subdir', 'Checksum'] 1099 """A list of the names of columns. This controls the order in 1100 which the columns will appear. If this is edited, then 1101 L{_package_to_columns()} may need to be edited to match.""" 1102 1103 COLUMN_WEIGHTS = {'': 0, 'Name': 5, 'Size': 0, 'Status': 0} 1104 """A dictionary specifying how columns should be resized when the 1105 table is resized. Columns with weight 0 will not be resized at 1106 all; and columns with high weight will be resized more. 1107 Default weight (for columns not explicitly listed) is 1.""" 1108 1109 COLUMN_WIDTHS = {'':1, 'Identifier':20, 'Name':45, 1110 'Size': 10, 'Unzipped Size': 10, 1111 'Status': 12} 1112 """A dictionary specifying how wide each column should be, in 1113 characters. 
The default width (for columns not explicitly 1114 listed) is specified by L{DEFAULT_COLUMN_WIDTH}.""" 1115 1116 DEFAULT_COLUMN_WIDTH = 30 1117 """The default width for columns that are not explicitly listed 1118 in C{COLUMN_WIDTHS}.""" 1119 1120 INITIAL_COLUMNS = ['', 'Identifier', 'Name', 'Size', 'Status'] 1121 """The set of columns that should be displayed by default.""" 1122 1123 # Perform a few import-time sanity checks to make sure that the 1124 # column configuration variables are defined consistently: 1125 for c in COLUMN_WEIGHTS: assert c in COLUMNS 1126 for c in COLUMN_WIDTHS: assert c in COLUMNS 1127 for c in INITIAL_COLUMNS: assert c in COLUMNS 1128 1129 #///////////////////////////////////////////////////////////////// 1130 # Color Configuration 1131 #///////////////////////////////////////////////////////////////// 1132 1133 _BACKDROP_COLOR = ('#000', '#ccc') 1134 1135 _ROW_COLOR = {Downloader.INSTALLED: ('#afa', '#080'), 1136 Downloader.PARTIAL: ('#ffa', '#880'), 1137 Downloader.STALE: ('#faa', '#800'), 1138 Downloader.NOT_INSTALLED: ('#fff', '#888')} 1139 1140 _MARK_COLOR = ('#000', '#ccc') 1141 1142 #_FRONT_TAB_COLOR = ('#ccf', '#008') 1143 #_BACK_TAB_COLOR = ('#88a', '#448') 1144 _FRONT_TAB_COLOR = ('#fff', '#45c') 1145 _BACK_TAB_COLOR = ('#aaa', '#67a') 1146 1147 _PROGRESS_COLOR = ('#f00', '#aaa') 1148 1149 _TAB_FONT = 'helvetica -16 bold' 1150 1151 #///////////////////////////////////////////////////////////////// 1152 # Constructor 1153 #///////////////////////////////////////////////////////////////// 1154
1155 - def __init__(self, dataserver, use_threads=True):
1156 self._ds = dataserver 1157 self._use_threads = use_threads 1158 1159 # For the threaded downloader: 1160 self._download_lock = threading.Lock() 1161 self._download_msg_queue = [] 1162 self._download_abort_queue = [] 1163 self._downloading = False 1164 1165 # For tkinter after callbacks: 1166 self._afterid = {} 1167 1168 # A message log. 1169 self._log_messages = [] 1170 self._log_indent = 0 1171 self._log('NLTK Downloader Started!') 1172 1173 # Create the main window. 1174 top = self.top = Tk() 1175 top.geometry('+50+50') 1176 top.title('NLTK Downloader') 1177 top.configure(background=self._BACKDROP_COLOR[1]) 1178 1179 # Set up some bindings now, in case anything goes wrong. 1180 top.bind('<Control-q>', self.destroy) 1181 top.bind('<Control-x>', self.destroy) 1182 self._destroyed = False 1183 1184 self._column_vars = {} 1185 1186 # Initialize the GUI. 1187 self._init_widgets() 1188 self._init_menu() 1189 try: 1190 self._fill_table() 1191 except urllib2.HTTPError, e: 1192 showerror('Error reading from server', e) 1193 except urllib2.URLError, e: 1194 showerror('Error connecting to server', e.reason) 1195 1196 self._show_info() 1197 self._select_columns() 1198 self._table.select(0) 1199 1200 # Make sure we get notified when we're destroyed, so we can 1201 # cancel any download in progress. 1202 self._table.bind('<Destroy>', self._destroy)
1203
1204 - def _log(self, msg):
1205 self._log_messages.append('%s %s%s' % (time.ctime(), 1206 ' | '*self._log_indent, msg))
1207 1208 #///////////////////////////////////////////////////////////////// 1209 # Internals 1210 #///////////////////////////////////////////////////////////////// 1211
1212 - def _init_widgets(self):
1213 # Create the top-level frame structures 1214 f1 = Frame(self.top, relief='raised', border=2, padx=8, pady=0) 1215 f1.pack(sid='top', expand=True, fill='both') 1216 f1.grid_rowconfigure(2, weight=1) 1217 f1.grid_columnconfigure(0, weight=1) 1218 Frame(f1, height=8).grid(column=0, row=0) # spacer 1219 tabframe = Frame(f1) 1220 tabframe.grid(column=0, row=1, sticky='news') 1221 tableframe = Frame(f1) 1222 tableframe.grid(column=0, row=2, sticky='news') 1223 buttonframe = Frame(f1) 1224 buttonframe.grid(column=0, row=3, sticky='news') 1225 Frame(f1, height=8).grid(column=0, row=4) # spacer 1226 infoframe = Frame(f1) 1227 infoframe.grid(column=0, row=5, sticky='news') 1228 Frame(f1, height=8).grid(column=0, row=6) # spacer 1229 progressframe = Frame(self.top, padx=3, pady=3, 1230 background=self._BACKDROP_COLOR[1]) 1231 progressframe.pack(side='bottom', fill='x') 1232 self.top['border'] = 0 1233 self.top['highlightthickness'] = 0 1234 1235 # Create the tabs 1236 self._tab_names = ['Collections', 'Corpora', 1237 'Models', 'All Packages',] 1238 self._tabs = {} 1239 for i, tab in enumerate(self._tab_names): 1240 label = Label(tabframe, text=tab, font=self._TAB_FONT) 1241 label.pack(side='left', padx=((i+1)%2)*10) 1242 label.bind('<Button-1>', self._select_tab) 1243 self._tabs[tab.lower()] = label 1244 1245 # Create the table. 
1246 column_weights = [self.COLUMN_WEIGHTS.get(column, 1) 1247 for column in self.COLUMNS] 1248 self._table = Table(tableframe, self.COLUMNS, 1249 column_weights=column_weights, 1250 highlightthickness=0, listbox_height=16, 1251 reprfunc=self._table_reprfunc) 1252 self._table.columnconfig(0, foreground=self._MARK_COLOR[0]) # marked 1253 for i, column in enumerate(self.COLUMNS): 1254 width = self.COLUMN_WIDTHS.get(column, self.DEFAULT_COLUMN_WIDTH) 1255 self._table.columnconfig(i, width=width) 1256 self._table.pack(expand=True, fill='both') 1257 self._table.focus() 1258 self._table.bind_to_listboxes('<Double-Button-1>', 1259 self._download) 1260 self._table.bind('<space>', self._table_mark) 1261 self._table.bind('<Return>', self._download) 1262 self._table.bind('<Left>', self._prev_tab) 1263 self._table.bind('<Right>', self._next_tab) 1264 self._table.bind('<Control-a>', self._mark_all) 1265 1266 # Create entry boxes for URL & download_dir 1267 infoframe.grid_columnconfigure(1, weight=1) 1268 1269 info = [('url', 'Server Index:', self._set_url), 1270 ('download_dir','Download Directory:',self._set_download_dir)] 1271 self._info = {} 1272 for (i, (key, label, callback)) in enumerate(info): 1273 Label(infoframe, text=label).grid(column=0, row=i, sticky='e') 1274 entry = Entry(infoframe, font='courier', relief='groove', 1275 disabledforeground='black') 1276 self._info[key] = (entry, callback) 1277 entry.bind('<Return>', self._info_save) 1278 entry.bind('<Button-1>', lambda e,key=key: self._info_edit(key)) 1279 entry.grid(column=1, row=i, sticky='ew') 1280 1281 # If the user edits url or download_dir, and then clicks outside 1282 # the entry box, then save their results. 1283 self.top.bind('<Button-1>', self._info_save) 1284 1285 # Create Download & Refresh buttons. 
1286 self._download_button = Button( 1287 buttonframe, text='Download', command=self._download, width=8) 1288 self._download_button.pack(side='left') 1289 self._refresh_button = Button( 1290 buttonframe, text='Refresh', command=self._refresh, width=8) 1291 self._refresh_button.pack(side='right') 1292 1293 # Create Progress bar 1294 self._progresslabel = Label(progressframe, text='', 1295 foreground=self._BACKDROP_COLOR[0], 1296 background=self._BACKDROP_COLOR[1]) 1297 self._progressbar = Canvas(progressframe, width=200, height=16, 1298 background=self._PROGRESS_COLOR[1], 1299 relief='sunken', border=1) 1300 self._init_progressbar() 1301 self._progressbar.pack(side='right') 1302 self._progresslabel.pack(side='left')
1303
1304 - def _init_menu(self):
1305 menubar = Menu(self.top) 1306 1307 filemenu = Menu(menubar, tearoff=0) 1308 filemenu.add_command(label='Download', underline=0, 1309 command=self._download, accelerator='Return') 1310 filemenu.add_separator() 1311 filemenu.add_command(label='Change Server Index', underline=7, 1312 command=lambda: self._info_edit('url')) 1313 filemenu.add_command(label='Change Download Directory', underline=0, 1314 command=lambda: self._info_edit('download_dir')) 1315 filemenu.add_separator() 1316 filemenu.add_command(label='Show Log', underline=5, 1317 command=self._show_log) 1318 filemenu.add_separator() 1319 filemenu.add_command(label='Exit', underline=1, 1320 command=self.destroy, accelerator='Ctrl-x') 1321 menubar.add_cascade(label='File', underline=0, menu=filemenu) 1322 1323 # Create a menu to control which columns of the table are 1324 # shown. n.b.: we never hide the first two columns (mark and 1325 # identifier). 1326 viewmenu = Menu(menubar, tearoff=0) 1327 for column in self._table.column_names[2:]: 1328 var = IntVar(self.top) 1329 assert column not in self._column_vars 1330 self._column_vars[column] = var 1331 if column in self.INITIAL_COLUMNS: var.set(1) 1332 viewmenu.add_checkbutton(label=column, underline=0, variable=var, 1333 command=self._select_columns) 1334 menubar.add_cascade(label='View', underline=0, menu=viewmenu) 1335 1336 # Create a sort menu 1337 # [xx] this should be selectbuttons; and it should include 1338 # reversed sorts as options. 
1339 sortmenu = Menu(menubar, tearoff=0) 1340 for column in self._table.column_names[1:]: 1341 sortmenu.add_command(label='Sort by %s' % column, 1342 command=(lambda c=column: 1343 self._table.sort_by(c, 'ascending'))) 1344 sortmenu.add_separator() 1345 #sortmenu.add_command(label='Descending Sort:') 1346 for column in self._table.column_names[1:]: 1347 sortmenu.add_command(label='Reverse sort by %s' % column, 1348 command=(lambda c=column: 1349 self._table.sort_by(c, 'descending'))) 1350 menubar.add_cascade(label='Sort', underline=0, menu=sortmenu) 1351 1352 helpmenu = Menu(menubar, tearoff=0) 1353 helpmenu.add_command(label='About', underline=0, 1354 command=self.about) 1355 helpmenu.add_command(label='Instructions', underline=0, 1356 command=self.help, accelerator='F1') 1357 menubar.add_cascade(label='Help', underline=0, menu=helpmenu) 1358 self.top.bind('<F1>', self.help) 1359 1360 self.top.config(menu=menubar)
1361
1362 - def _select_columns(self):
1363 for (column, var) in self._column_vars.items(): 1364 if var.get(): 1365 self._table.show_column(column) 1366 else: 1367 self._table.hide_column(column)
1368
1369 - def _refresh(self):
1370 self._ds.clear_status_cache() 1371 try: 1372 self._fill_table() 1373 except urllib2.HTTPError, e: 1374 showerror('Error reading from server', e) 1375 except urllib2.URLError, e: 1376 showerror('Error connecting to server', e.reason) 1377 self._table.select(0)
1378
1379 - def _info_edit(self, info_key):
1380 self._info_save() # just in case. 1381 (entry, callback) = self._info[info_key] 1382 entry['state'] = 'normal' 1383 entry['relief'] = 'sunken' 1384 entry.focus()
1385
1386 - def _info_save(self, e=None):
1387 focus = self._table 1388 for entry, callback in self._info.values(): 1389 if entry['state'] == 'disabled': continue 1390 if e is not None and e.widget is entry and e.keysym != 'Return': 1391 focus = entry 1392 else: 1393 entry['state'] = 'disabled' 1394 entry['relief'] = 'groove' 1395 callback(entry.get()) 1396 focus.focus()
1397
1398 - def _table_reprfunc(self, row, col, val):
1399 if self._table.column_names[col].endswith('Size'): 1400 if isinstance(val, basestring): return ' %s' % val 1401 elif val < 1024**2: return ' %.1f KB' % (val/1024.**1) 1402 elif val < 1024**3: return ' %.1f MB' % (val/1024.**2) 1403 else: return ' %.1f GB' % (val/1024.**3) 1404 1405 if col in (0, ''): return str(val) 1406 else: return ' %s' % val
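The size columns are rendered with a simple threshold scheme: anything below 1 MB is shown in KB, anything below 1 GB in MB, and the rest in GB. A standalone sketch of that formatting follows. The name C{format_size} is hypothetical, the leading padding of the original is omitted, and C{str} is checked in place of Python 2's C{basestring}.

```python
def format_size(val):
    """Format a byte count as KB, MB or GB with one decimal place.
    Strings (such as 'n/a') are passed through unchanged."""
    if isinstance(val, str):
        return val
    elif val < 1024 ** 2:
        return '%.1f KB' % (val / 1024.0)
    elif val < 1024 ** 3:
        return '%.1f MB' % (val / 1024.0 ** 2)
    else:
        return '%.1f GB' % (val / 1024.0 ** 3)
```

Note that there is no separate branch for bytes, so a value under 1 KB is shown as a fraction of a KB.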
1407
1408 - def _set_url(self, url):
1409 if url == self._ds.url: return 1410 try: 1411 self._ds.url = url 1412 self._fill_table() 1413 except IOError, e: 1414 showerror('Error Setting Server Index', str(e)) 1415 self._show_info()
1416 1417
1418 - def _set_download_dir(self, download_dir):
1419 if self._ds.download_dir == download_dir: return 1420 # check if the dir exists, and if not, ask if we should create it? 1421 1422 # Clear our status cache, & re-check what's installed 1423 self._ds.download_dir = download_dir 1424 try: 1425 self._fill_table() 1426 except urllib2.HTTPError, e: 1427 showerror('Error reading from server', e) 1428 except urllib2.URLError, e: 1429 showerror('Error connecting to server', e.reason) 1430 self._show_info()
1431
    def _show_info(self):
        for entry,cb in self._info.values():
            entry['state'] = 'normal'
            entry.delete(0, 'end')
        self._info['url'][0].insert(0, self._ds.url)
        self._info['download_dir'][0].insert(0, self._ds.download_dir)
        for entry,cb in self._info.values():
            entry['state'] = 'disabled'
1441
1442 - def _prev_tab(self, *e):
1443 for i, tab in enumerate(self._tab_names): 1444 if tab.lower() == self._tab and i > 0: 1445 self._tab = self._tab_names[i-1].lower() 1446 try: 1447 return self._fill_table() 1448 except urllib2.HTTPError, e: 1449 showerror('Error reading from server', e) 1450 except urllib2.URLError, e: 1451 showerror('Error connecting to server', e.reason)
1452
1453 - def _next_tab(self, *e):
1454 for i, tab in enumerate(self._tab_names): 1455 if tab.lower() == self._tab and i < (len(self._tabs)-1): 1456 self._tab = self._tab_names[i+1].lower() 1457 try: 1458 return self._fill_table() 1459 except urllib2.HTTPError, e: 1460 showerror('Error reading from server', e) 1461 except urllib2.URLError, e: 1462 showerror('Error connecting to server', e.reason)
1463
1464 - def _select_tab(self, event):
1465 self._tab = event.widget['text'].lower() 1466 try: 1467 self._fill_table() 1468 except urllib2.HTTPError, e: 1469 showerror('Error reading from server', e) 1470 except urllib2.URLError, e: 1471 showerror('Error connecting to server', e.reason)
1472 1473 _tab = 'collections' 1474 #_tab = 'corpora' 1475 _rows = None
    def _fill_table(self):
        selected_row = self._table.selected_row()
        self._table.clear()
        if self._tab == 'all packages':
            items = self._ds.packages()
        elif self._tab == 'corpora':
            items = self._ds.corpora()
        elif self._tab == 'models':
            items = self._ds.models()
        elif self._tab == 'collections':
            items = self._ds.collections()
        else:
            assert 0, 'bad tab value %r' % self._tab
        rows = [self._package_to_columns(item) for item in items]
        self._table.extend(rows)

        # Highlight the active tab.
        for tab, label in self._tabs.items():
            if tab == self._tab:
                label.configure(foreground=self._FRONT_TAB_COLOR[0],
                                background=self._FRONT_TAB_COLOR[1])
            else:
                label.configure(foreground=self._BACK_TAB_COLOR[0],
                                background=self._BACK_TAB_COLOR[1])

        self._table.sort_by('Identifier', order='ascending')
        self._color_table()
        self._table.select(selected_row)

        # This is a hack, because the scrollbar isn't updating its
        # position right -- I'm not sure what the underlying cause is
        # though.  (This is on OS X w/ python 2.5)  The length of
        # delay that's necessary seems to depend on how fast the
        # computer is.  :-/
        self.top.after(150, self._table._scrollbar.set,
                       *self._table._mlb.yview())
        self.top.after(300, self._table._scrollbar.set,
                       *self._table._mlb.yview())
1514
1515 - def _update_table_status(self):
1516 for row_num in range(len(self._table)): 1517 status = self._ds.status(self._table[row_num, 'Identifier']) 1518 self._table[row_num, 'Status'] = status 1519 self._color_table()
1520
1521 - def _download(self, *e):
1522 # If we're using threads, then delegate to the threaded 1523 # downloader instead. 1524 if self._use_threads: 1525 return self._download_threaded(*e) 1526 1527 marked = [self._table[row, 'Identifier'] 1528 for row in range(len(self._table)) 1529 if self._table[row, 0] != ''] 1530 selection = self._table.selected_row() 1531 if not marked and selection is not None: 1532 marked = [self._table[selection, 'Identifier']] 1533 1534 download_iter = self._ds.incr_download(marked, self._ds.download_dir) 1535 self._log_indent = 0 1536 self._download_cb(download_iter, marked)
1537 1538 _DL_DELAY=10
1539 - def _download_cb(self, download_iter, ids):
1540 try: msg = download_iter.next() 1541 except StopIteration: 1542 #self._fill_table(sort=False) 1543 self._update_table_status() 1544 afterid = self.top.after(10, self._show_progress, 0) 1545 self._afterid['_download_cb'] = afterid 1546 return 1547 1548 def show(s): 1549 self._progresslabel['text'] = s 1550 self._log(s)
1551 if isinstance(msg, ProgressMessage): 1552 self._show_progress(msg.progress) 1553 elif isinstance(msg, ErrorMessage): 1554 show(msg.message) 1555 if msg.package is not None: 1556 self._select(msg.package.id) 1557 self._show_progress(None) 1558 return # halt progress. 1559 elif isinstance(msg, StartCollectionMessage): 1560 show('Downloading collection %r' % msg.collection.id) 1561 self._log_indent += 1 1562 elif isinstance(msg, StartPackageMessage): 1563 show('Downloading package %r' % msg.package.id) 1564 elif isinstance(msg, UpToDateMessage): 1565 show('Package %s is up-to-date!' % msg.package.id) 1566 #elif isinstance(msg, StaleMessage): 1567 # show('Package %s is out-of-date or corrupt' % msg.package.id) 1568 elif isinstance(msg, FinishDownloadMessage): 1569 show('Finished downloading %r.' % msg.package.id) 1570 elif isinstance(msg, StartUnzipMessage): 1571 show('Unzipping %s' % msg.package.filename) 1572 elif isinstance(msg, FinishCollectionMessage): 1573 self._log_indent -= 1 1574 show('Finished downloading collection %r.' % msg.collection.id) 1575 self._clear_mark(msg.collection.id) 1576 elif isinstance(msg, FinishPackageMessage): 1577 self._clear_mark(msg.package.id) 1578 afterid = self.top.after(self._DL_DELAY, self._download_cb, 1579 download_iter, ids) 1580 self._afterid['_download_cb'] = afterid
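C{_download_cb} consumes one message per call from the iterator returned by C{incr_download()}, dispatches on the message type, and then reschedules itself. The sketch below shows the same one-message-per-step pattern without Tkinter. The message classes, C{fake_incr_download} and C{step} are all hypothetical simplifications, not the module's real classes.

```python
class StartCollection(object):
    def __init__(self, id): self.id = id

class StartPackage(object):
    def __init__(self, id): self.id = id

class Progress(object):
    def __init__(self, progress): self.progress = progress

class FinishCollection(object):
    def __init__(self, id): self.id = id


def fake_incr_download():
    """Hypothetical stand-in for an incremental download iterator."""
    yield StartCollection('book')
    yield StartPackage('abc')
    yield Progress(50)
    yield Progress(100)
    yield FinishCollection('book')


def step(download_iter, log, state):
    """Handle a single message.  Return False once the iterator is
    exhausted, True if the caller should schedule another step."""
    try:
        msg = next(download_iter)
    except StopIteration:
        return False

    def show(text):
        # Indent log lines by the current collection nesting depth.
        log.append('| ' * state['indent'] + text)

    if isinstance(msg, Progress):
        state['progress'] = msg.progress
    elif isinstance(msg, StartCollection):
        show('Downloading collection %s' % msg.id)
        state['indent'] += 1
    elif isinstance(msg, StartPackage):
        show('Downloading package %s' % msg.id)
    elif isinstance(msg, FinishCollection):
        state['indent'] -= 1
        show('Finished downloading collection %s' % msg.id)
    return True
```

In the GUI, the "schedule another step" part is done with C{self.top.after(...)}, which keeps the event loop responsive between messages; a plain C{while step(...)} loop gives the same result in a script.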
1581
1582 - def _select(self, id):
1583 for row in range(len(self._table)): 1584 if self._table[row, 'Identifier'] == id: 1585 self._table.select(row) 1586 return
1587
1588 - def _color_table(self):
1589 # Color rows according to status. 1590 for row in range(len(self._table)): 1591 bg, sbg = self._ROW_COLOR[self._table[row, 'Status']] 1592 fg, sfg = ('black', 'white') 1593 self._table.rowconfig(row, foreground=fg, selectforeground=sfg, 1594 background=bg, selectbackground=sbg) 1595 # Color the marked column 1596 self._table.itemconfigure(row, 0, 1597 foreground=self._MARK_COLOR[0], 1598 background=self._MARK_COLOR[1])
1599 1600
1601 - def _clear_mark(self, id):
1602 for row in range(len(self._table)): 1603 if self._table[row, 'Identifier'] == id: 1604 self._table[row, 0] = ''
1605
1606 - def _mark_all(self, *e):
1607 for row in range(len(self._table)): 1608 self._table[row,0] = 'X'
1609
1610 - def _table_mark(self, *e):
1611 selection = self._table.selected_row() 1612 if selection >= 0: 1613 if self._table[selection][0] != '': 1614 self._table[selection,0] = '' 1615 else: 1616 self._table[selection,0] = 'X' 1617 self._table.select(delta=1)
1618
1619 - def _show_log(self):
1620 text = '\n'.join(self._log_messages) 1621 ShowText(self.top, 'NLTK Downloader Log', text)
1622
1623 - def _package_to_columns(self, pkg):
1624 """ 1625 Given a package, return a list of values describing that 1626 package, one for each column in L{self.COLUMNS}. 1627 """ 1628 row = [] 1629 for column_index, column_name in enumerate(self.COLUMNS): 1630 if column_index == 0: # Mark: 1631 row.append('') 1632 elif column_name == 'Identifier': 1633 row.append(pkg.id) 1634 elif column_name == 'Status': 1635 row.append(self._ds.status(pkg)) 1636 else: 1637 attr = column_name.lower().replace(' ', '_') 1638 row.append(getattr(pkg, attr, 'n/a')) 1639 return row
1640 1641 #///////////////////////////////////////////////////////////////// 1642 # External Interface 1643 #///////////////////////////////////////////////////////////////// 1644
1645 - def destroy(self, *e):
1646 if self._destroyed: return 1647 self.top.destroy() 1648 self._destroyed = True
1649
1650 - def _destroy(self, *e):
1651 if self.top is not None: 1652 for afterid in self._afterid.values(): 1653 self.top.after_cancel(afterid) 1654 1655 # Abort any download in progress. 1656 if self._downloading and self._use_threads: 1657 self._abort_download() 1658 1659 # Make sure the garbage collector destroys these now; 1660 # otherwise, they may get destroyed when we're not in the main 1661 # thread, which would make Tkinter unhappy. 1662 self._column_vars.clear()
1663
1664 - def mainloop(self, *args, **kwargs):
1665 self.top.mainloop(*args, **kwargs)

    #/////////////////////////////////////////////////////////////////
    # HELP
    #/////////////////////////////////////////////////////////////////

    HELP = textwrap.dedent("""\
    This tool can be used to download a variety of corpora and models
    that can be used with NLTK.  Each corpus or model is distributed
    in a single zip file, known as a \"package file.\"  You can
    download packages individually, or you can download pre-defined
    collections of packages.

    When you download a package, it will be saved to the \"download
    directory.\"  A default download directory is chosen when you run
    the downloader; but you may also select a different download
    directory.  On Windows, the default download directory is


    \"package.\"

    The NLTK downloader can be used to download a variety of corpora,
    models, and other data packages.

    Keyboard shortcuts::
      [return]\t Download
      [up]\t Select previous package
      [down]\t Select next package
      [left]\t Select previous tab
      [right]\t Select next tab
    """)

    def help(self, *e):
        # The default font's not very legible; try using 'fixed' instead.
        try:
            ShowText(self.top, 'Help: NLTK Downloader',
                     self.HELP.strip(), width=75, font='fixed')
        except:
            ShowText(self.top, 'Help: NLTK Downloader',
                     self.HELP.strip(), width=75)
1706
    def about(self, *e):
        ABOUT = ("NLTK Downloader\n"+
                 "Written by Edward Loper")
        TITLE = 'About: NLTK Downloader'
        try:
            from tkMessageBox import Message
            Message(message=ABOUT, title=TITLE).show()
        except ImportError:
            # The main window is stored as self.top (there is no
            # self._top attribute).
            ShowText(self.top, TITLE, ABOUT)
1716 1717 #///////////////////////////////////////////////////////////////// 1718 # Progress Bar 1719 #///////////////////////////////////////////////////////////////// 1720 1721 _gradient_width = 5
1722 - def _init_progressbar(self):
1723 c = self._progressbar 1724 width, height = int(c['width']), int(c['height']) 1725 for i in range(0, (int(c['width'])*2)/self._gradient_width): 1726 c.create_line(i*self._gradient_width+20, -20, 1727 i*self._gradient_width-height-20, height+20, 1728 width=self._gradient_width, 1729 fill='#%02x0000' % (80 + abs(i%6-3)*12)) 1730 c.addtag_all('gradient') 1731 c.itemconfig('gradient', state='hidden') 1732 1733 # This is used to display progress 1734 c.addtag_withtag('redbox', c.create_rectangle( 1735 0, 0, 0, 0, fill=self._PROGRESS_COLOR[0]))
1736
1737 - def _show_progress(self, percent):
1738 c = self._progressbar 1739 if percent is None: 1740 c.coords('redbox', 0, 0, 0, 0) 1741 c.itemconfig('gradient', state='hidden') 1742 else: 1743 width, height = int(c['width']), int(c['height']) 1744 x = percent * int(width) / 100 + 1 1745 c.coords('redbox', 0, 0, x, height+1)
1746
    def _progress_alive(self):
        c = self._progressbar
        if not self._downloading:
            c.itemconfig('gradient', state='hidden')
        else:
            c.itemconfig('gradient', state='normal')
            x1, y1, x2, y2 = c.bbox('gradient')
            if x1 <= -100:
                c.move('gradient', (self._gradient_width*6)-4, 0)
            else:
                c.move('gradient', -4, 0)
            afterid = self.top.after(200, self._progress_alive)
            self._afterid['_progress_alive'] = afterid

    #/////////////////////////////////////////////////////////////////
    # Threaded downloader
    #/////////////////////////////////////////////////////////////////

    def _download_threaded(self, *e):
        # If the user tries to start a new download while we're already
        # downloading something, then abort the current download instead.
        if self._downloading:
            self._abort_download()
            return

        # Change the 'Download' button to a 'Cancel' button.
        self._download_button['text'] = 'Cancel'

        marked = [self._table[row, 'Identifier']
                  for row in range(len(self._table))
                  if self._table[row, 0] != '']
        selection = self._table.selected_row()
        if not marked and selection is not None:
            marked = [self._table[selection, 'Identifier']]

        # Create a new data server object for the download operation,
        # just in case the user modifies our data server during the
        # download (e.g., clicking 'refresh' or editing the index url).
        ds = Downloader(self._ds.url, self._ds.download_dir)

        # Start downloading in a separate thread.
        assert self._download_msg_queue == []
        assert self._download_abort_queue == []
        self._DownloadThread(ds, marked, self._download_lock,
                             self._download_msg_queue,
                             self._download_abort_queue).start()

        # Monitor the download message queue & display its progress.
        self._log_indent = 0
        self._downloading = True
        self._monitor_message_queue()

        # Display an indication that we're still alive and well by
        # cycling the progress bar.
        self._progress_alive()

    def _abort_download(self):
        if self._downloading:
            self._download_lock.acquire()
            self._download_abort_queue.append('abort')
            self._download_lock.release()

    class _DownloadThread(threading.Thread):
        def __init__(self, data_server, items, lock, message_queue, abort):
            self.data_server = data_server
            self.items = items
            self.lock = lock
            self.message_queue = message_queue
            self.abort = abort
            threading.Thread.__init__(self)

        def run(self):
            for msg in self.data_server.incr_download(self.items):
                self.lock.acquire()
                self.message_queue.append(msg)
                # Check if we've been told to kill ourselves:
                if self.abort:
                    self.message_queue.append('aborted')
                    self.lock.release()
                    return
                self.lock.release()
            self.lock.acquire()
            self.message_queue.append('finished')
            self.lock.release()
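The worker-thread pattern above can be sketched in isolation. This is a minimal Python 3 illustration, not the NLTK implementation: `WorkerThread` and its message strings are hypothetical names, and plain lists guarded by a lock stand in for the GUI's message and abort queues.

```python
import threading


class WorkerThread(threading.Thread):
    """Hypothetical worker that reports progress and honors abort requests."""

    def __init__(self, items, lock, message_queue, abort_queue):
        threading.Thread.__init__(self)
        self.items = items
        self.lock = lock
        self.message_queue = message_queue
        self.abort_queue = abort_queue

    def run(self):
        for item in self.items:
            with self.lock:
                self.message_queue.append('processed %s' % item)
                # A non-empty abort list means the caller asked us to stop.
                if self.abort_queue:
                    self.message_queue.append('aborted')
                    return
        with self.lock:
            self.message_queue.append('finished')


lock = threading.Lock()
messages, abort_requests = [], []
worker = WorkerThread(['a', 'b'], lock, messages, abort_requests)
worker.start()
worker.join()
print(messages)
```

As in the original, the abort check happens only between work items, so a request to stop takes effect at the next message boundary rather than immediately.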

    _MONITOR_QUEUE_DELAY=100
    def _monitor_message_queue(self):
        def show(s):
            self._progresslabel['text'] = s
            self._log(s)

        # Try to acquire the lock; if it's busy, then just try again later.
        if not self._download_lock.acquire():
            return
        for msg in self._download_msg_queue:

            # Done downloading?
            if msg == 'finished' or msg == 'aborted':
                #self._fill_table(sort=False)
                self._update_table_status()
                self._downloading = False
                self._download_button['text'] = 'Download'
                del self._download_msg_queue[:]
                del self._download_abort_queue[:]
                self._download_lock.release()
                if msg == 'aborted':
                    show('Download aborted!')
                    self._show_progress(None)
                else:
                    afterid = self.top.after(100, self._show_progress, None)
                    self._afterid['_monitor_message_queue'] = afterid
                return

            # All other messages
            elif isinstance(msg, ProgressMessage):
                self._show_progress(msg.progress)
            elif isinstance(msg, ErrorMessage):
                show(msg.message)
                if msg.package is not None:
                    self._select(msg.package.id)
                self._show_progress(None)
                self._downloading = False
                return # halt progress.
            elif isinstance(msg, StartCollectionMessage):
                show('Downloading collection %r' % msg.collection.id)
                self._log_indent += 1
            elif isinstance(msg, StartPackageMessage):
                self._ds.clear_status_cache(msg.package.id)
                show('Downloading package %r' % msg.package.id)
            elif isinstance(msg, UpToDateMessage):
                show('Package %s is up-to-date!' % msg.package.id)
            #elif isinstance(msg, StaleMessage):
            #    show('Package %s is out-of-date or corrupt; updating it' %
            #         msg.package.id)
            elif isinstance(msg, FinishDownloadMessage):
                show('Finished downloading %r.' % msg.package.id)
            elif isinstance(msg, StartUnzipMessage):
                show('Unzipping %s' % msg.package.filename)
            elif isinstance(msg, FinishUnzipMessage):
                show('Finished installing %s' % msg.package.id)
            elif isinstance(msg, FinishCollectionMessage):
                self._log_indent -= 1
                show('Finished downloading collection %r.' % msg.collection.id)
                self._clear_mark(msg.collection.id)
            elif isinstance(msg, FinishPackageMessage):
                self._update_table_status()
                self._clear_mark(msg.package.id)

        # Let the user know when we're aborting a download (but
        # waiting for a good point to abort it, so we don't end up
        # with a partially unzipped package or anything like that).
        if self._download_abort_queue:
            self._progresslabel['text'] = 'Aborting download...'

        # Clear the message queue and then release the lock
        del self._download_msg_queue[:]
        self._download_lock.release()

        # Check the queue again after MONITOR_QUEUE_DELAY msec.
        afterid = self.top.after(self._MONITOR_QUEUE_DELAY,
                                 self._monitor_message_queue)
        self._afterid['_monitor_message_queue'] = afterid

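The comment in `_monitor_message_queue` describes a "try the lock, and come back later if it is busy" policy. A call to `Lock.acquire()` with no arguments blocks until the lock is free, so a poller that must never stall would pass `False` to request a non-blocking attempt. The sketch below shows that variant in Python 3. It is an illustration under assumptions, not the code the GUI runs: `drain_messages` and `handle` are hypothetical names.

```python
import threading


def drain_messages(lock, queue, handle):
    """Process and clear queued messages.

    Returns True if the queue was drained, or False if the lock was
    busy and the caller should simply try again later.
    """
    # acquire(False) returns immediately instead of blocking the caller.
    if not lock.acquire(False):
        return False
    try:
        for msg in queue:
            handle(msg)
        # Clear the shared list in place so the producer keeps its reference.
        del queue[:]
    finally:
        lock.release()
    return True


lock = threading.Lock()
queue = ['start', 'finished']
seen = []
drained = drain_messages(lock, queue, seen.append)
print(drained, seen, queue)
```

Releasing the lock in a `finally` clause guarantees it is freed on every exit path, including early returns and exceptions raised by a handler.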
######################################################################
# Helper Functions
######################################################################
# [xx] It may make sense to move these to nltk.internals.

def md5_hexdigest(file):
    """
    Calculate and return the MD5 checksum for a given file. C{file}
    may either be a filename or an open stream.
    """
    if isinstance(file, basestring):
        file = open(file, 'rb')

    md5_digest = md5()
    while True:
        block = file.read(1024*16) # 16k blocks
        if not block: break
        md5_digest.update(block)
    return md5_digest.hexdigest()

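For readers on Python 3, where `basestring` no longer exists, the same block-wise checksum can be sketched with `hashlib`. The name `md5_hexdigest_sketch` and the `block_size` parameter are illustrative assumptions, not part of NLTK.

```python
import hashlib
import io


def md5_hexdigest_sketch(file, block_size=16 * 1024):
    """Return the MD5 hex digest of a filename or an open binary stream."""
    if isinstance(file, str):
        # Open (and reliably close) the file when given a path.
        with open(file, 'rb') as stream:
            return md5_hexdigest_sketch(stream, block_size)
    digest = hashlib.md5()
    # Read fixed-size blocks until read() returns an empty bytes object.
    for block in iter(lambda: file.read(block_size), b''):
        digest.update(block)
    return digest.hexdigest()


checksum = md5_hexdigest_sketch(io.BytesIO(b'hello'))
print(checksum)
```

Reading in blocks keeps memory use constant regardless of the package size.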
# change this to periodically yield progress messages?
# [xx] get rid of topdir parameter -- we should be checking
# this when we build the index, anyway.
def unzip(filename, root, verbose=True):
    """
    Extract the contents of the zip file C{filename} into the
    directory C{root}.
    """
    for message in _unzip_iter(filename, root, verbose):
        if isinstance(message, ErrorMessage):
            raise Exception(message)

def _unzip_iter(filename, root, verbose=True):
    if verbose:
        sys.stdout.write('Unzipping %s' % os.path.split(filename)[1])
        sys.stdout.flush()

    try: zf = zipfile.ZipFile(filename)
    except zipfile.error, e:
        yield ErrorMessage(filename, 'Error with downloaded zip file')
        return
    except Exception, e:
        yield ErrorMessage(filename, e)
        return

    # Get lists of directories & files
    namelist = zf.namelist()
    dirlist = [x for x in namelist if x.endswith('/')]
    filelist = [x for x in namelist if not x.endswith('/')]

    # Create the target directory if it doesn't exist
    if not os.path.exists(root):
        os.mkdir(root)

    # Create the directory structure
    for dirname in sorted(dirlist):
        pieces = dirname[:-1].split('/')
        for i in range(len(pieces)):
            dirpath = os.path.join(root, *pieces[:i+1])
            if not os.path.exists(dirpath):
                os.mkdir(dirpath)

    # Extract files.
    for i, filename in enumerate(filelist):
        filepath = os.path.join(root, *filename.split('/'))
        out = open(filepath, 'wb')
        try: contents = zf.read(filename)
        except Exception, e:
            yield ErrorMessage(filename, e)
            return
        out.write(contents)
        out.close()
        if verbose and (i*10/len(filelist) > (i-1)*10/len(filelist)):
            sys.stdout.write('.')
            sys.stdout.flush()
    if verbose:
        print

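`_unzip_iter` reports failures by yielding a message object rather than raising, which lets a caller such as the GUI decide how to react. The following Python 3 sketch shows that generator style on an in-memory archive. `unzip_iter_sketch` is a hypothetical name, it yields the exception object itself in place of NLTK's `ErrorMessage`, and it does not guard against archive members with unsafe paths.

```python
import io
import os
import tempfile
import zipfile


def unzip_iter_sketch(zip_source, root):
    """Yield each extracted file path, or an exception object on failure."""
    try:
        zf = zipfile.ZipFile(zip_source)
    except zipfile.BadZipFile as e:
        # Report the problem to the caller instead of raising.
        yield e
        return
    with zf:
        for name in zf.namelist():
            parts = [p for p in name.split('/') if p]
            target = os.path.join(root, *parts)
            if name.endswith('/'):
                # Directory entry: just make sure it exists.
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, 'wb') as out:
                out.write(zf.read(name))
            yield target


# Build a tiny archive in memory and extract it to a temporary directory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('demo/readme.txt', 'hello')
buffer.seek(0)
root = tempfile.mkdtemp()
extracted = list(unzip_iter_sketch(buffer, root))
print(extracted)
```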
######################################################################
# Index Builder
######################################################################
# This may move to a different file sometime.
import subprocess, zipfile

def build_index(root, base_url):
    """
    Create a new data.xml index file, by combining the xml description
    files for various packages and collections. C{root} should be the
    path to a directory containing the package xml and zip files; and
    the collection xml files. The C{root} directory is expected to
    have the following subdirectories::

      root/
        packages/ .................. subdirectory for packages
          corpora/ ................. zip & xml files for corpora
          grammars/ ................ zip & xml files for grammars
          taggers/ ................. zip & xml files for taggers
          tokenizers/ .............. zip & xml files for tokenizers
          etc.
        collections/ ............... xml files for collections

    For each package, there should be two files: C{I{package}.zip}
    contains the package itself, as a compressed zip file; and
    C{I{package}.xml} is an xml description of the package. The
    zipfile C{I{package}.zip} should expand to a single subdirectory
    named C{I{package/}}. The base filename C{I{package}} must match
    the identifier given in the package's xml file.

    For each collection, there should be a single file
    C{I{collection}.xml}, describing the collection.

    All identifiers (for both packages and collections) must be unique.
    """
    # Find all packages.
    packages = []
    for pkg_xml, zf, subdir in _find_packages(os.path.join(root, 'packages')):
        zipstat = os.stat(zf.filename)
        url = '%s/%s/%s' % (base_url, subdir, os.path.split(zf.filename)[1])
        unzipped_size = sum(zf_info.file_size for zf_info in zf.infolist())

        # Fill in several fields of the package xml with calculated values.
        pkg_xml.set('unzipped_size', '%s' % unzipped_size)
        pkg_xml.set('size', '%s' % zipstat.st_size)
        pkg_xml.set('checksum', '%s' % md5_hexdigest(zf.filename))
        pkg_xml.set('subdir', subdir)
        #pkg_xml.set('svn_revision', _svn_revision(zf.filename))
        pkg_xml.set('url', url)

        # Record the package.
        packages.append(pkg_xml)

    # Find all collections
    collections = list(_find_collections(os.path.join(root, 'collections')))

    # Check that all UIDs are unique
    uids = set()
    for item in packages+collections:
        if item.get('id') in uids:
            raise ValueError('Duplicate UID: %s' % item.get('id'))
        uids.add(item.get('id'))

    # Put it all together
    top_elt = ElementTree.Element('nltk_data')
    top_elt.append(ElementTree.Element('packages'))
    for package in packages: top_elt[0].append(package)
    top_elt.append(ElementTree.Element('collections'))
    for collection in collections: top_elt[1].append(collection)

    _indent_xml(top_elt)
    return top_elt

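The calculated fields that `build_index` writes into each package element can be reproduced on a small in-memory archive. The Python 3 sketch below is illustrative only: `describe_package`, `assemble_index`, the `demo` package, and the `http://example.com/packages` base URL are all assumptions made for the example.

```python
import hashlib
import io
import zipfile
from xml.etree import ElementTree


def describe_package(pkg_id, zip_bytes, subdir, base_url):
    """Return a <package> element carrying the calculated fields."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        unzipped_size = sum(info.file_size for info in zf.infolist())
    elt = ElementTree.Element('package', id=pkg_id)
    elt.set('unzipped_size', str(unzipped_size))
    elt.set('size', str(len(zip_bytes)))
    elt.set('checksum', hashlib.md5(zip_bytes).hexdigest())
    elt.set('subdir', subdir)
    elt.set('url', '%s/%s/%s.zip' % (base_url, subdir, pkg_id))
    return elt


def assemble_index(packages, collections):
    """Combine elements under <nltk_data>, rejecting duplicate ids."""
    uids = set()
    for item in packages + collections:
        if item.get('id') in uids:
            raise ValueError('Duplicate UID: %s' % item.get('id'))
        uids.add(item.get('id'))
    top = ElementTree.Element('nltk_data')
    ElementTree.SubElement(top, 'packages').extend(packages)
    ElementTree.SubElement(top, 'collections').extend(collections)
    return top


# Build a one-file archive in memory to stand in for a package zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('demo/readme.txt', 'hello')
zip_bytes = buffer.getvalue()

package = describe_package('demo', zip_bytes, 'corpora',
                           'http://example.com/packages')
index = assemble_index([package], [ElementTree.Element('collection', id='all')])
print(ElementTree.tostring(index, encoding='unicode'))
```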
def _indent_xml(xml, prefix=''):
    """
    Helper for L{build_index()}: Given an XML ElementTree, modify its
    (and its descendants') C{text} and C{tail} attributes to generate
    an indented tree, where each nested element is indented by 2
    spaces with respect to its parent.
    """
    if len(xml) > 0:
        xml.text = (xml.text or '').strip() + '\n' + prefix + '  '
        for child in xml:
            _indent_xml(child, prefix+'  ')
        for child in xml[:-1]:
            child.tail = (child.tail or '').strip() + '\n' + prefix + '  '
        xml[-1].tail = (xml[-1].tail or '').strip() + '\n' + prefix

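To see what the `text` and `tail` rewriting produces, the same recursion can be run on a small tree. This is a Python 3 sketch. `indent_xml` mirrors the helper above under the assumption of a two-space indent step, as its docstring states, and the element names are chosen only for the demonstration.

```python
from xml.etree import ElementTree


def indent_xml(elt, prefix=''):
    """Indent elt in place, two spaces per nesting level."""
    if len(elt) > 0:
        # Text before the first child: newline plus the child's indent.
        elt.text = (elt.text or '').strip() + '\n' + prefix + '  '
        for child in elt:
            indent_xml(child, prefix + '  ')
        # Every child but the last is followed by a sibling at child indent.
        for child in elt[:-1]:
            child.tail = (child.tail or '').strip() + '\n' + prefix + '  '
        # The last child is followed by the parent's closing tag.
        elt[-1].tail = (elt[-1].tail or '').strip() + '\n' + prefix


root = ElementTree.Element('nltk_data')
packages = ElementTree.SubElement(root, 'packages')
ElementTree.SubElement(packages, 'package', id='treebank')
ElementTree.SubElement(root, 'collections')
indent_xml(root)
xml_text = ElementTree.tostring(root, encoding='unicode')
print(xml_text)
```

Python 3.9 and later also ship `xml.etree.ElementTree.indent`, which serves a similar purpose.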
def _check_package(pkg_xml, zipfilename, zf):
    """
    Helper for L{build_index()}: Perform some checks to make sure that
    the given package is consistent.
    """
    # The filename must match the id given in the XML file.
    uid = os.path.splitext(os.path.split(zipfilename)[1])[0]
    if pkg_xml.get('id') != uid:
        raise ValueError('package identifier mismatch (%s vs %s)' %
                         (pkg_xml.get('id'), uid))

    # Zip file must expand to a subdir whose name matches uid.
    if sum( (name!=uid and not name.startswith(uid+'/'))
            for name in zf.namelist() ):
        raise ValueError('Zipfile %s.zip does not expand to a single '
                         'subdirectory %s/' % (uid, uid))

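The second check counts archive members that fall outside the expected top-level directory. The same condition can be written as a predicate, as in this small sketch. `expands_to_single_subdir` is a hypothetical name, and it takes a plain list of member names instead of a `ZipFile`.

```python
def expands_to_single_subdir(names, uid):
    """Return True if every member is uid itself or lives under uid/."""
    return all(name == uid or name.startswith(uid + '/') for name in names)


good = expands_to_single_subdir(['treebank/', 'treebank/README'], 'treebank')
bad = expands_to_single_subdir(['treebank/README', 'other.txt'], 'treebank')
print(good, bad)
```

Checking for the `uid + '/'` prefix, not just `uid`, keeps a sibling such as `treebank2/` from passing.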

def _svn_revision(filename):
    """
    Helper for L{build_index()}: Calculate the subversion revision
    number for a given file (by using C{subprocess} to run C{svn}).
    """
    p = subprocess.Popen(['svn', 'status', '-v', filename],
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    (stdout, stderr) = p.communicate()
    if p.returncode != 0 or stderr or not stdout:
        raise ValueError('Error determining svn_revision for %s: %s' %
                         (os.path.split(filename)[1], textwrap.fill(stderr)))
    return stdout.split()[2]

def _find_collections(root):
    """
    Helper for L{build_index()}: Yield a list of ElementTree.Element
    objects, each holding the xml for a single package collection.
    """
    packages = []
    for dirname, subdirs, files in os.walk(root):
        for filename in files:
            if filename.endswith('.xml'):
                xmlfile = os.path.join(dirname, filename)
                yield ElementTree.parse(xmlfile).getroot()

def _find_packages(root):
    """
    Helper for L{build_index()}: Yield a list of tuples C{(pkg_xml,
    zf, subdir)}, where:
      - C{pkg_xml} is an ElementTree.Element holding the xml for a
        package
      - C{zf} is a zipfile.ZipFile for the package's contents.
      - C{subdir} is the subdirectory (relative to C{root}) where
        the package was found (e.g. 'corpora' or 'grammars').
    """
    from nltk.corpus.reader.util import _path_from
    # Find all packages.
    packages = []
    for dirname, subdirs, files in os.walk(root):
        relpath = '/'.join(_path_from(root, dirname))
        for filename in files:
            if filename.endswith('.xml'):
                xmlfilename = os.path.join(dirname, filename)
                zipfilename = xmlfilename[:-4]+'.zip'
                try: zf = zipfile.ZipFile(zipfilename)
                except Exception, e:
                    raise ValueError('Error reading file %r!\n%s' %
                                     (zipfilename, e))
                try: pkg_xml = ElementTree.parse(xmlfilename).getroot()
                except Exception, e:
                    raise ValueError('Error reading file %r!\n%s' %
                                     (xmlfilename, e))

                # Check that the UID matches the filename
                uid = os.path.split(xmlfilename[:-4])[1]
                if pkg_xml.get('id') != uid:
                    raise ValueError('package identifier mismatch (%s '
                                     'vs %s)' % (pkg_xml.get('id'), uid))

                # Check that the zipfile expands to a subdir whose
                # name matches the uid.
                if sum( (name!=uid and not name.startswith(uid+'/'))
                        for name in zf.namelist() ):
                    raise ValueError('Zipfile %s.zip does not expand to a '
                                     'single subdirectory %s/' % (uid, uid))

                yield pkg_xml, zf, relpath
        # Don't recurse into svn subdirectories:
        try: subdirs.remove('.svn')
        except ValueError: pass
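`_find_packages` relies on a property of `os.walk`: in top-down mode, removing a name from the `subdirs` list prevents the walk from descending into that directory. The Python 3 sketch below demonstrates the pruning on a temporary tree. `find_xml_files` and the file names are assumptions for the example, and `os.path.relpath` stands in for NLTK's private `_path_from` helper.

```python
import os
import tempfile


def find_xml_files(root, skip_dirs=('.svn',)):
    """Yield (relative_dir, filename) for each .xml file under root."""
    for dirname, subdirs, files in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories.
        subdirs[:] = [d for d in subdirs if d not in skip_dirs]
        relpath = os.path.relpath(dirname, root)
        for filename in sorted(files):
            if filename.endswith('.xml'):
                yield relpath, filename


# Build a small tree: one real package description and one svn artifact.
root = tempfile.mkdtemp()
for subdir, name in [('corpora', 'abc.xml'), ('.svn', 'entries.xml')]:
    os.makedirs(os.path.join(root, subdir))
    with open(os.path.join(root, subdir, name), 'w') as f:
        f.write('<package id="abc"/>')

found = sorted(find_xml_files(root))
print(found)
```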

######################################################################
# Main:
######################################################################

# There should be a command-line interface

# Aliases
_downloader = Downloader()
download = _downloader.download
def download_shell(): DownloaderShell(_downloader).run()
def download_gui(): DownloaderGUI(_downloader).mainloop()
def update(): _downloader.update()

if __name__ == '__main__':
    from optparse import OptionParser
    parser = OptionParser()
    parser.add_option("-d", "--dir", dest="dir",
        help="download package to directory DIR", metavar="DIR")
    parser.add_option("-q", "--quiet", dest="quiet", action="store_true",
        default=False, help="work quietly")
    parser.add_option("-f", "--force", dest="force", action="store_true",
        default=False, help="download even if already installed")
    parser.add_option("-e", "--exit-on-error", dest="halt_on_error", action="store_true",
        default=False, help="exit if an error occurs")

    (options, args) = parser.parse_args()

    if args:
        for pkg_id in args:
            rv = download(info_or_id=pkg_id, download_dir=options.dir,
                quiet=options.quiet, force=options.force,
                halt_on_error=options.halt_on_error)
            if rv==False and options.halt_on_error:
                break
    else:
        download(download_dir=options.dir,
            quiet=options.quiet, force=options.force,
            halt_on_error=options.halt_on_error)

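The option set above maps directly onto `argparse`, the standard-library successor to `optparse`. The sketch below only parses arguments and does not call the downloader. `parse_cli`, the `packages` positional, and the sample path are illustrative assumptions.

```python
import argparse


def parse_cli(argv):
    """Parse downloader-style options from a list of argument strings."""
    parser = argparse.ArgumentParser(prog='downloader')
    parser.add_argument('-d', '--dir', dest='dir', metavar='DIR',
                        help='download package to directory DIR')
    parser.add_argument('-q', '--quiet', dest='quiet', action='store_true',
                        default=False, help='work quietly')
    parser.add_argument('-f', '--force', dest='force', action='store_true',
                        default=False, help='download even if already installed')
    parser.add_argument('-e', '--exit-on-error', dest='halt_on_error',
                        action='store_true', default=False,
                        help='exit if an error occurs')
    # Zero or more package identifiers; none means "run interactively".
    parser.add_argument('packages', nargs='*', help='package identifiers')
    return parser.parse_args(argv)


options = parse_cli(['-d', '/tmp/nltk_data', '-q', 'treebank', 'words'])
print(options.dir, options.quiet, options.packages)
```

With no package identifiers on the command line, `options.packages` is an empty list, which corresponds to the branch above that calls `download()` without an identifier.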