view README.txt @ 0:b0942f44413f

import from git://github.com/mozilla/toolbox.git
author Jeff Hammel <k0scist@gmail.com>
date Sun, 11 May 2014 09:15:35 -0700
parents
children 2ba55733b788
line wrap: on
line source

The Story of Toolbox
====================

Toolbox is fundamentally a document-oriented approach to resource
indexing.  A "tool" consists three mandatory string fields -- name,
description, and URL -- that are generic to the large class of problems
of web resources, as well as classifiers, such as author, usage, type,
etc. A tool may have an arbitrary number of classifier fields as
needed.  Each classifier consists of a set of values with which a tool
is tagged. This gives toolbox the flexibility to fit a large number of
data models, such as PYPI, DOAP, and others.


Running Toolbox
---------------

You can download and run the toolbox software yourself:
http://github.com/k0s/toolbox

To serve in baseline mode, install the software and run::

 paster serve paste.ini

This will serve the handlers and static content using the paste
(http://pythonpaste.org) webserver using ``README.txt`` as the
``/about`` page and serving the data in ``sample``.

The dispatcher (``toolbox.dispatcher:Dispatcher``) is the central (WSGI)
webapp that designates per-request to a number of handlers (from
``handlers.py``).  The dispatcher has a few options:

* about: path to a restructured text file to serve at ``/about``
* model_type: name of the backend to use (memory_cache, file_cache, or couch)
* template_dir: extra directory to look for templates

These may be configured in the ``paste.ini`` file in the
``[app:toolbox]`` section by prepending with the namespace
``toolbox.``. It is advisable that you copy the example ``paste.ini``
file for your own usage needs.  Additional ``toolbox.``-namespaced
arguments will be passed to the model.  For instance, to specify the
directory for the ``file_cache`` model, the provided ``paste.ini`` uses
``toolbox.directory = %(here)s/sample``.


Architecture
------------

Toolbox uses a fairly simple architecture with a single abstract data
model allowing an arbitrary number of implementations to be constructed::

 Interfaces            Implementations

 +----+              +-+-----+
 |HTTP|              | |files|
 +----+---\  +-----+ | +-----+
           |-|model|-+-+-----+
 +------+-/  +-----+ | |couch|
 |script|            | +-----+
 +------+            +-+------+
                     | |memory|
                     | +------+
                     +-+---+
                       |...|
                       +---+

Toolbox was originally intended to use a directory of files, one per project,
as the backend. These were originally intended to be HTML files as the
above model may be clearly mapped as HTML::

 <div class="project"><h1><a href="{{url}}">{{name}}</a></h1>
 <p class="description">{{description}}</p>
 {{for field in fields}}
  <ul class="{{field}}">
  {{for value in values[field]}}
   <li>{{value}}</li>
  {{endfor}}
 {{endfor}}
 </div>

This microformat approach allows not only easy editing of the HTML
documents, but the documents may be indepently served and displayed
without the toolbox server-side. 

The HTML microformat was never implemented (though, since the model
backend is pluggable, it easily could be). Instead, the original
implementation used JSON blobs stored in one file per tool. This
approach loses the displayable aspect, though since JSON is a defined
format with several good tools for exploring and manipulating the data
perhaps this disavantage is offset.

A couch backend was also written.

      +------------+-----------+------------+
      |Displayable?|File-based?|Concurrency?|
+-----+------------+-----------+------------+
|HTML |Yes         |Yes        |No          |
+-----+------------+-----------+------------+
|JSON |Not really  |Yes        |No          |
+-----+------------+-----------+------------+
|Couch|No          |No         |Yes?        |
+-----+------------+-----------+------------+

The concurrency issue with file-based documennt backends may be
overcome by using locked files.  Ideally, this is accomplished at the
filesystem level.  If your filesystem does not promote this
functionality, it may be introduced programmatically.  A rough cartoon
of a good implementation is as follows:

1. A worker thread is spawned to write the data asynchronously. The
data is sent to the worker thread.

2. The worker checks for the presence of a lockfile (herein further
detailed). If the lockfile exists and is owned by an active process,
the worker waits until said process is done with it. (For a more
robust implementation, the worker sends a request to write the file to
some controller.)

3. The worker owns a lockfile based on its PID in some directory
parallel to the directory root under consideration (for example,
``/tmp/toolbox/lock/${PID}-${filename}.lck``).

4. The worker writes to the file.

5. The worker removes the lock

The toolbox web service uses a dispatcher->handler framework.  The
handlers are loosely pluggable (they are assigned in the dispatcher),
but could (and probably should) be made completely pluggable.  That
said, the toolbox web system features an integration of templates,
static resources (javascript, css, images), and handlers, so true
pluggability is further away than just supporting pluggable handlers
in the dispatcher.

Deployment, however, may be tailored as desired.  Any of the given
templates may be overridden via passing a ``template_dir`` parameter
with a path to a directory that have templates of the appropriate
names as found in toolbox's ``templates`` directory. 

Likewise, the static files (css, js, etc.) are served using ``paste``'s 
``StaticURLParser`` out of toolbox's ``static`` directory. (See
toolbox's ``factory.py``.) Notably this is *not* done using the WSGI
app itself.  Doing it with middleware allows the deployment to be
customizable by writing your own factory.  For example, instead of
using the ``paste`` webserver and the included ``paste.ini``, you
could use nginx or apache and ``mod_wsgi`` with a factory file
invoking ``Dispatcher`` with the desired arguments and serving the
static files with an arbitrary static file server.

It is common sense, if rarely followed, that deployment should be
simple.  If you want to get toolbox running on your desktop and/or for
testing, you should be able to do this easily (see the ``INSTALL.sh``
for a simple installation using ``bash``; you'll probably want to
perform these steps by hand for any sort of real-world deployment).
If you want a highly customized deployment, then this will require
more expertise and manual setup.

The template data and the JSON are closely tied together.  This has the
distinct advantage of avoiding data translation steps and avoiding
code duplication.

Toolbox uses several light-footprint libraries:

* webob for Request/Response handling: http://pythonpaste.org/webob/

* tempita for (HTML) templates: http://pythonpaste.org/tempita/

* whoosh for search.  This pure-python implementation of full-text
  search is relatively fast (for python) and should scale decently to
  the target scale of toolbox (1000s or 10000s of tools). While not as
  fast as lucene, whoosh is easy to deploy and has a good API and
  preserves toolbox as a deployable software product versus an
  instance that requires the expert configuration, maintainence, and
  tuning of several disparate software products that is both
  non-automatable (cannot be installed with a script) and
  time-consuming. http://packages.python.org/Whoosh/

* jQuery: jQuery is the best JavaScript library and everyone
  should use it. http://jquery.com/

* jeditable for AJAXy editing: http://www.appelsiini.net/projects/jeditable

* jquery-token for autocomplete: http://loopj.com/jquery-tokeninput/

* less for dynamic stylesheets: http://lesscss.org/


User Interaction
----------------

A user will typically interact with Toolbox through the AJAX web
interface.  The server side returns relatively simple (HTML) markup,
but structured in such a way that JavaScript may be utilized to
promote rich interaction.  The simple HTML + complex JS manifests
several things:

1. The document is a document. The tools HTML presented to the user (with
the current objectionable exception of the per-project Delete button)
is a document form of the data. It can be clearly and easily
translated to data (for e.g. import/export) or simply marked up using
(e.g.) JS to add functionality. By keeping concerns seperate
(presentation layer vs. interaction layer) a self-evident clarity is
maintained.

2. Computation is shifted client-side. Often, an otherwise lightweight
webapp loses considerable performance rendering complex templates. By
keeping the templates light-weight and doing control presentation and
handling in JS, high performance is preserved.


What Toolbox Doesn't Do
-----------------------

* versioning: toolbox exposes editing towards a canonical document.
  It doesn't do versioning.  A model instance may do whatever
  versioning it desires, and since the models are pluggable, it would
  be relatively painless to subclass e.g. the file-based model and
  have a post-save hook which does an e.g. ``hg commit``. Customized
  templates could be used to display this information.

* authentication: the information presented by toolbox is freely
  readable and editable. This is by intention, as by going to a "wiki"
  model and presenting a easy to use, context-switching-free interface
  curation is encouraged (ignoring the possibly imaginary problem of
  wiki-spam). Access-level auth could be implemented using WSGI
  middleware (e.g. repoze.who or bitsyauth) or through a front end
  "webserver" integration layer such as Apache or nginx. Finer grained
  control of the presentation layer could be realized by using custom
  templates.


What Toolbox Would Like To Do
-----------------------------

Ultimately, toolbox should be as federated as possible.  The basic
architecture of toolbox as a web service + supporting scripts makes
this feasible and more self-contained than most proposed federated
services.  The basic federated model has proved, in practice,
difficult to achieve through purely the (HTTP) client-server model, as
without complete federation and adherence to protocol offline cron
jobs should be utilized to pull external data sources. If a webservice
only desires to talk to others of its own type and are willing to keep
a queue of requests for when hosts are offline, entire HTTP federation
may be implemented with only a configuration-specified discovery
service to find the nodes.


Evolution
---------

Often, a piece software is presented as a state out of context (that
is minus the evolution which led it to be and led it to look further
out towards beyond the horizon).  While this is an interesting special
effect for an art project, software being communication this
is only conducive to software in the darkest of black-box approaches.

"Beers are like web frameworks: if they're not micro, you don't know
what you're talking about." - hipsterhacker

For sites that fit the architecture of a given framework, it may be
advisable to make use of them.  However, for most webapp/webservice
categories which have a finite scope and definitive intent, it is
often easier, more maintainable, and more legible to build a complete
HTTP->WSGI->app architecture than to try to hammer a framework into
fitting your problem or redefining the problem to fit the framework.
This approach was used for toolbox.

The GenshiView template, http://k0s.org/hg/GenshiView, was invoked to
generate a basic dispatcher->handler system.  The cruft was removed,
leaving only the basic structure and the TempitaHandler since tempita
is lightweight and it was envisioned that filesystem tempita templates
(MakeItSo!) would be used elsewhere in the project.  The basic
handlers (projects views, field-sorted view, new, etc.) were written
and soon a usable interface was constructed.

A ``sample`` directory was created to hold the JSON blobs. Because
this was done early on, two goals were achieved: 

1. the software could be dogfooded immediately using actual applicable
data. This helped expose a number of issues concerning the data format
right away.

2. There was a place to put tools before the project reached a
deployable state (previously, a few had lived in a static state using
a rough sketch of the HTML microformat discussed above on
k0s.org). Since the main point of toolbox is to record Mozilla tools,
the wealth of references mentioned in passing could be put somewhere,
instead of passed by and forgotten.  One wishes that they do not miss
the train while purchasing a ticket.

The original intent, when the file-based JSON blob approach was to be
the deployed backend, was to have two repositories: one for the code
and one for the JSON blobs.  When this approach was scrapped, the
file-based JSON blobs were relegated to the ``sample`` directory, with
the intent to be to import them into e.g. a couch database on actual
deployment (using an import script). The samples could then be used
for testing.

The model has a single "setter" function, ``def update``, used for
both creating and updating projects.  Due to this and due to the fact
the model was ABC/pluggable from the beginning, a converter ``export``
function could be trivially written at the ABC-level::

    def export(self, other):
        """export the current model to another model instance"""
        for project in self.get():
            other.update(project)

This with an accompanying CLI utility was used to migrate from JSON
blob files in the ``sample`` directory to the couch instance.  This
particular methodology as applied to an unexpected problem (the
unanticipated switch from JSON blobs to couch) is a good example of
the power of using a problem to drive the software forward (in this
case, creation of a universal export function and associated command
line utility). The alternative, a one-off manual data migration, would
have been just as time consuming, would not be repeatable, would not
have extended toolbox, and may have (like many one-offs do) infected
the code base with associated semi-permanant vestiges.  In general,
problems should be used to drive innovation.  This can only be done if
the software is kept in a reasonably good state.  Otherwise
considerable (though probably worthwhile) refactoring should be done
prior to feature extension which will become cost-prohibitive in
time-critical situations where a one-off is (more) likely to be employed.


Use Cases
---------

The target use-case is software tools for Mozilla, or, more generally,
a software index.  For this case, the default fields uses are given in
the paste.ini file: usage, author, type, language. More fields may be
added to the running instance in the future.

However, the classifier classification can be used for a wide variety
of web-locatable resources.  A few examples:

* songs: artist, album, genre, instruments
* de.li.cio.us: type, media, author, site


Resources
---------

* http://readthedocs.org/