Webber: Harmonizing web contents and metadata
The Webber Development Team webbones@rediris.es
10 Jan 2001. version 1.3.0
Introduction
Webber is a component-oriented tool for harmonizing and providing metadata for Web server contents. Originally conceived as an HTML preprocessor, it has evolved into a full-featured environment for producing and maintaining the contents and metadata of Web pages. It uses a set of components (independent pieces of code) to process a tree of source files and directs them to cooperate in producing the final version of Web pages, according to user specifications.
This way, content writers can concentrate on writing information without worrying about its final appearance and location, while site administrators have a powerful and easy-to-use tool for ensuring a coherent view of this information. The use of components simplifies the way contents are processed, since the cooperation of small communicating elements eases the introduction of new and ad-hoc processing capabilities and the reuse of pre-existing modules.
Webber provides a tuple space for component communication. Each variable in this tuple space can be read or written by the executing component, leaving the appropriate information for further use by other components. The execution of Webber itself is controlled via some of these variables. Initial values for variables in the tuple space are defined either inside template files or inside individual source files. An inheritance mechanism along URL paths is provided: values for variables are inherited from one directory by those contained in it, unless the variable is overwritten. This way, site- or branch-wide default values can be applied. This inheritance mechanism is especially important for metadata, where hierarchically derived information (think of data such as author, publisher, knowledge area, etc.) can be especially relevant.
Strictly speaking, Webber acts as the framework that maintains the values in the tuple space and keeps the inheritance through different directories prior to invoking the document processing components (processors), which are the pieces that perform the actual work of producing the outcome of the whole Webber execution. These processors are called at different moments by Webber, and the only requirements on them concern their interface with Webber. This interface must be written in Perl (as Webber itself is) and the processors must access variables in the Webber tuple space through it. A Webber processor may vary from a very simple procedure (essentially, a series of print statements where variables are replaced) to a very sophisticated one (for example, a keyword extraction program). The Webber distribution includes a certain number of these processors and we hope to offer more of them in the near future.
Running Webber
Once installed, Webber can be executed by simply calling it from the shell. The simplest ways to invoke Webber are:
$ webber File
and
$ webber Directory
In both cases, Webber will run using the default configuration file and will process, in the first case, the source file identified by File, and, in the second case, all source files contained in Directory.
You can use several file or directory names when calling Webber. This way, the following command:
$ webber File1 Dir2 File3
will process source file File1, then all source files in directory Dir2 and, finally, source file File3.
Webber accepts a series of arguments that control the way it works. These arguments are:
- -C to define an alternate configuration file.
- -d to turn debugging on. The program will output information about
its status as it goes through source files.
- -f to force target updating. By default, Webber compares
modification times and skips every source that is older than its
corresponding target; with -f, all sources are processed regardless.
- -h to show a short help message and exit the program.
- -H to show the help available for the processors currently known
to Webber.
- -i to make Webber work using standard input instead of processing
a file. You must bear in mind that, in this case, no template is read.
- -I to list the processors Webber knows, including a short
description of them.
- -m to create target directories as needed. By default, Webber
assumes that target directories have already been created.
- -r to go recursively through source directories, starting (of
course) at the one(s) identified in the command line.
- -t to make Webber work as with -i, but reading
a template file (identified by the parameter provided with -t) prior to
processing the standard input.
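The timestamp test behind the -f option can be pictured with a few lines of Perl. This is only a minimal sketch, not Webber's actual code: the function name needs_processing is hypothetical, and it assumes that a missing target is always built and that a source with the same timestamp as its target is processed.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether a source must be (re)processed.
# $force mirrors the -f flag.
sub needs_processing {
    my ($source, $target, $force) = @_;
    return 1 if $force;                  # -f: always rebuild
    return 1 unless -e $target;          # assumption: a missing target is always built
    my $src_mtime = (stat $source)[9];   # element 9 of stat is the modification time
    my $tgt_mtime = (stat $target)[9];
    # Sources older than their targets are skipped.
    return $src_mtime >= $tgt_mtime ? 1 : 0;
}
```

A driver would call needs_processing once per source file and skip the file whenever it returns 0.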
Variables, source files and templates
Webber source files provide initial values for variables in the tuple space using the following syntax:
#VariableName = VariableValue
#VariableName + VariableValue
#VariableName * VariableValue
In the first case, the value is assigned to the variable, overwriting its previous content, if any. If the second form is used, the value is appended to the previous value of the variable, with a single space (` ') joining both values. The third form prepends the value to the previous value of the variable, also joined by a single space. The `+' and `*' assignments have the same effect as `=' if the variable has not been previously defined and no default value has been provided for it in the configuration file.
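The three assignment forms can be summarized in a short Perl sketch. It is an illustration of the rules just described, not Webber's own parser code; the function name assign and the variables Keywords and Author are hypothetical, and a plain hash stands in for the tuple space.

```perl
use strict;
use warnings;

# Apply one assignment to a hash standing in for the tuple space.
# $op is '=', '+' or '*', as in the source-file syntax.
sub assign {
    my ($vars, $name, $op, $value) = @_;
    die "Unknown operator '$op'\n" unless $op =~ /^[=+*]$/;
    if ($op eq '=' or !defined $vars->{$name}) {
        $vars->{$name} = $value;                    # overwrite, or first definition
    } elsif ($op eq '+') {
        $vars->{$name} = "$vars->{$name} $value";   # append, joined by a space
    } else {
        $vars->{$name} = "$value $vars->{$name}";   # prepend, joined by a space
    }
    return $vars->{$name};
}

my %vars;
assign(\%vars, 'Keywords', '=', 'web');
assign(\%vars, 'Keywords', '+', 'metadata');
assign(\%vars, 'Keywords', '*', 'tools');
assign(\%vars, 'Author',   '+', 'Team');   # not yet defined: behaves like '='
print "$vars{Keywords}\n";   # prints "tools web metadata"
print "$vars{Author}\n";     # prints "Team"
```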
Any string can be used as a variable name, but bear in mind that Webber sets certain variables at runtime, all of them with the prefix ``wbb'', so it is advisable to avoid this prefix when naming your local variables. If a variable is intended to be used mainly by a certain processor, we recommend prefixing its name with a string that identifies the processor (we usually employ the string processorName. for this). This does not prevent any other processor from reading or writing the variable (an access control model for the tuple space is currently being defined), but can help other processor writers avoid conflicts.
In the rest of this text, we will refer to Webber variables as #VarName, denoting that we are talking about a variable in the Webber tuple space and not in the context of Perl.
Variable values can have any length and span multiple lines. Webber's built-in parser detects the end of a variable assignment by the start of another assignment (a line that starts with the character `#'), by reading a comment (a line starting with the character sequence `##'), and (of course) when it reaches the end of the file. So you can use multiline variable values with any characters in them, as long as you do not use lines starting with `#'.
Processors access the variables in the Webber tuple space through a Perl hash, called %var, using the variable name as its key. Since the %var hash is inside the Perl namespace of the Webber package, a processor interface to the Webber tuple space must use the Webber:: prefix to reference it, or use any available Perl mechanism for implicitly referencing the hash from its own namespace. For example, let's consider the following lines:
#One = This is one
#Another = This is another value
#Thirdone = third form
#One + value
#Thirdone * This is the
This way, the value of the variable One can be accessed by means of the Perl construct $Webber::var{One}, yielding the string ``This is one value'', while $Webber::var{Another} will yield ``This is another value''. Last, $Webber::var{Thirdone} will yield ``This is the third form''.
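The parsing rules described in this section can be sketched as a small Perl function. This is an approximation written for illustration, not the parser shipped with Webber: the name parse_source is hypothetical, a plain hash reference stands in for %Webber::var, and the sketch assumes that whitespace around a value is trimmed and that text outside any assignment is ignored.

```perl
use strict;
use warnings;

# Parse Webber-style source text into a hash reference.
sub parse_source {
    my ($text, $vars) = @_;
    $vars ||= {};
    my ($name, $op, @lines);

    # Store the assignment collected so far, if any.
    my $flush = sub {
        return unless defined $name;
        my $value = join "\n", @lines;
        $value =~ s/^\s+|\s+$//g;            # assumption: outer whitespace is trimmed
        if ($op eq '=' or !defined $vars->{$name}) {
            $vars->{$name} = $value;
        } elsif ($op eq '+') {
            $vars->{$name} .= " $value";
        } else {                             # '*'
            $vars->{$name} = "$value $vars->{$name}";
        }
        ($name, $op, @lines) = ();
    };

    for my $line (split /\n/, $text) {
        if ($line =~ /^#([^\s=+*#]+)\s*([=+*])(.*)$/) {
            $flush->();                      # a new assignment ends the previous one
            ($name, $op, @lines) = ($1, $2, $3);
        } elsif ($line =~ /^#/) {
            $flush->();                      # '##' comment (or any other '#' line)
        } elsif (defined $name) {
            push @lines, $line;              # continuation of a multi-line value
        }
    }
    $flush->();                              # end of file
    return $vars;
}

my $vars = parse_source(<<'END');
#One = This is one
#Another = This is another value
#Thirdone = third form
#One + value
#Thirdone * This is the
END
print "$_: $vars->{$_}\n" for sort keys %$vars;
```

Run on the example lines above, the sketch prints the three values quoted in the text: ``This is another value'', ``This is one value'' and ``This is the third form''.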
Webber source files consist of a set of variable assignments, which will be used by processors to produce the target file. Templates are special source files containing a series of variable assignments that are applicable to all the files in a directory and, through inheritance, to all files in its subdirectories. The purpose of templates is to provide default values that can be overwritten or extended by individual source files.
Configuration
To start working, Webber needs to read certain values that define its behavior and to insert the values of some fundamental variables into the tuple space. These values are stored in the configuration file. By default, Webber uses an installation-defined configuration file, although it can be overridden by the use of the -C flag when calling the program (see above).
The Webber configuration file is a Perl source file (it is called from Webber by means of a do statement), so you can put in it whatever actions you need to initialize Webber. In most cases, however, the configuration file will contain a list of assignments for Webber variables.
We can consider three different sections in a typical Webber configuration file:
- Webber main variables
- Initial values for fundamental variables
- Initial values for site-specific variables
None of these sections is mandatory, although it is a sensible practice to keep, at least, the first two. Webber can work without any tuple space initialization, but built-in default values are not very comfortable to deal with.
Webber main variables
There are a few variables that Webber accesses to know how it has to behave. These variables always have the prefix ``wbb'' and correspond to:
- $wbbSourceRoot is the directory where the Webber source tree is
rooted. Its value determines the starting point of inheritance for variables.
Whenever a file is to be processed, all variable assignments in templates from
the source root to the directory where the file lives are applied. For example,
if you set the source root to /wbbsrc and you are processing a file named
/wbbsrc/docs/tech/report1, then variable definitions made in the
templates for /wbbsrc/docs and /wbbsrc/docs/tech are applicable.
If Webber is called with the -i flag, no templates are used. When
Webber is called with the -t flag, a single template (as identified
by the parameter to -t) is used.
- $wbbTargetRoot is the directory where the Webber target tree is
rooted. It is used to determine the name of the file where the Webber
processors will write their output, i. e., the file that will contain the
processed information. A one-to-one mapping is made from directories into the
source and the target trees and Webber will not create the directories unless
the -m flag is used, as described in the previous section. Of course, the
source and target trees can be the same. When the -i or -t flags
are used, output is written to standard output.
- @wbbProcLib is a list of directories where Webber will look for
processors, apart from other directories that can be defined by means of
a specific variable (see #wbbProcs below). This list should be used
for defining a library of ``standard'' processors.
- %wbbLangs uses as key the language identifiers known to Webber
and provides processors with a longer description of each of these
languages. We recommend having a look inside the processors
(for example, the one this documentation is produced with) included with the
Webber distribution for some possible uses of this framework.
- $wbbDefLang is the default language assumed when Webber cannot
determine it (from the file name, see #wbbFileLangRegExp below).
It can be used, for example, when producing metadata.
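The way $wbbSourceRoot bounds template inheritance can be illustrated with a short Perl sketch that lists the directories whose templates apply to a given file. The function name template_dirs is hypothetical, and the sketch assumes that a template placed in the source root itself is applied as well, before those of the directories below it.

```perl
use strict;
use warnings;

# List the directories whose templates apply to $file, from the source
# root down to the directory containing the file.
sub template_dirs {
    my ($root, $file) = @_;
    $root =~ s{/+$}{};                            # normalize a trailing slash
    die "$file is not under $root\n" unless index($file, "$root/") == 0;
    my $rel   = substr $file, length($root) + 1;  # e.g. docs/tech/report1
    my @parts = split m{/+}, $rel;
    pop @parts;                                   # drop the file name itself
    my @dirs = ($root);                           # assumption: the root template applies too
    my $dir  = $root;
    for my $part (@parts) {
        $dir .= "/$part";
        push @dirs, $dir;
    }
    return @dirs;
}

print "$_\n" for template_dirs('/wbbsrc', '/wbbsrc/docs/tech/report1');
# prints /wbbsrc, /wbbsrc/docs and /wbbsrc/docs/tech, one per line
```

Reading the templates in this order, with later assignments applied over earlier ones, gives the inheritance behavior described for the source tree.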
Initial values for fundamental variables
There are a number of Webber fundamental variables, which must always exist, since Webber makes use of them to direct source file processing. In addition, there are certain variables that are set by Webber in order to provide processors with information about the file being processed and its characteristics. This section in the configuration file gives default values for them, through a Perl hash called %wbbDef, that uses the variable name as key. As for all other variables, their values are accessible through the hash %Webber::var, using the variable name as key. Remember that the variable name does not include the # character we use to represent it.
Since we are dealing with the initial values for variables in the Webber tuple space, their values can be assigned either by the templates or by the individual source files. The values included in the configuration file are intended to provide a default value, acting as an ``initial template''. These variables are:
- #wbbVersion contains the version of Webber that is running.
- #wbbTemplateName is the name of the Webber template file in each
directory. It is not mandatory to have a template file per directory but,
if one exists, its name must be the one identified by this variable.
- #wbbSourceRegExp is a regular expression that determines which
files are to be considered sources for Webber in a directory. This makes it
possible to share source and target trees and to avoid the processing of
non-source files, such as images or raw data. You do not have to worry about
the template files matching this regular expression: Webber automatically
skips processing template files as source files.
- #wbbFileNameRegExp is a regular expression used by Webber to form
the name of the target file. This regular expression is applied to the name
of the source file and the result of the first parenthesized match is composed
with the current extension to build target file name. Bear in mind that
it must be a parenthesized regular expression, so Perl internal variable
$1 can be used, and that only the first match (that is, only $1) will be
used.
- #wbbFileLangRegExp is a regular expression used by Webber for
extracting the language identifier applicable to a source file from its
name, using the same mechanism described for #wbbFileNameRegExp above.
Another alternative is setting the language by
means of the #wbbLang variable (see below) inside a template or a
source file. In fact, what Webber does is assign a value to #wbbLang
by means of #wbbFileLangRegExp.
- #wbbExtension defines the extension to be used when creating the
target file name, appended to the name extracted by #wbbFileNameRegExp.
- #wbbTargetFileMode defines the mode in which target files will
be created by Webber. It must be a numeric string coding the file permissions
as accepted by chmod. If this variable is not set, the default value
'0444' (that is, read-only permission for everyone) is used.
- #wbbTarget defines the name of the target file, with the language
identifier (if any) removed. It is intended (for example) for linking
different language versions inside the processors.
- #wbbProcs defines additional location(s) to look for processors.
- #wbbPre defines the pre-processor(s) to be used for building the
target file.
- #wbbProc defines the processor(s) to be used for building the
target file.
- #wbbPost defines the post-processor(s) to be used for building the
target file.
- #wbbExtParser tells Webber not to use its built-in parser for
reading initial variable values from the source. If this
variable is set to a value other than zero, Webber will not parse the source.
An external parser (typically, one or more pre-processors)
will be used to extract variables from the source. If this variable is not
set, the default behavior is applied, and Webber will parse the source.
- #wbbDateMeta defines the date for the target file, as it should
be shown in its metadata.
- #wbbDateWeb defines the date for the target file, as it should
be shown in its content.
- #wbbLang defines the language identifier for the target file.
- #wbbIn defines the main content of the source file. The
``main content'' concept is intentionally fuzzy. The only definition for it
is that, as we will see when discussing processors, this variable has a
special default processing by Webber.
- #wbbOut defines the main content of the target. We are dealing
again with a fuzzy concept, that is also defined by the special default
processing that Webber applies to it.
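How the file-related variables in this list combine can be sketched in Perl. Everything concrete here is an assumption made for illustration: the .wbb source suffix, the two regular expressions, the idea that #wbbExtension carries its own leading dot, and the function name target_info are not taken from the Webber distribution.

```perl
use strict;
use warnings;

# Hypothetical defaults, in the style of %wbbDef entries.
my %def = (
    wbbFileNameRegExp => '^(.+)\.wbb$',          # $1 = name without the source suffix
    wbbFileLangRegExp => '\.([a-z]{2})\.wbb$',   # $1 = two-letter language identifier
    wbbExtension      => '.html',                # assumption: the value includes the dot
    wbbTargetFileMode => '0444',
);

sub target_info {
    my ($source_name, $def) = @_;
    my ($base) = $source_name =~ /$def->{wbbFileNameRegExp}/
        or return;                               # not a recognizable source name
    my ($lang) = $source_name =~ /$def->{wbbFileLangRegExp}/;
    my $plain = $base;
    $plain =~ s/\.\Q$lang\E$// if defined $lang; # name without the language identifier
    return {
        file   => $base  . $def->{wbbExtension}, # target file name
        target => $plain . $def->{wbbExtension}, # what #wbbTarget might hold
        lang   => $lang,                         # undef if the name carries no language
        mode   => oct($def->{wbbTargetFileMode}),# numeric mode usable with chmod
    };
}

my $info = target_info('index.en.wbb', \%def);
printf "%s (%s), mode %04o\n", $info->{file}, $info->{lang}, $info->{mode};
# prints "index.en.html (en), mode 0444"
```

Note that the mode is stored as a string and converted with oct before being handed to chmod, since chmod expects a number rather than an octal-looking string.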
Initial values for site-specific variables
In addition to the values discussed so far, the Webber configuration file may contain default values for any other variables in the tuple space that you consider necessary, through the %wbbDef mechanism described above. In this sense, the configuration file acts like an initial local root template.
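Putting the three sections together, a configuration file might look roughly like the following. This is a hypothetical sketch, not the file shipped with Webber: every path, regular expression, template name and site variable below is an invented example value, and only the variable names come from the descriptions above.

```perl
# Hypothetical Webber configuration file; all values are examples.

# 1. Webber main variables
$wbbSourceRoot = '/wbbsrc';
$wbbTargetRoot = '/var/www/html';
@wbbProcLib    = ('/usr/local/lib/webber/proc');
%wbbLangs      = (en => 'English', es => 'Spanish');
$wbbDefLang    = 'en';

# 2. Initial values for fundamental variables
%wbbDef = (
    wbbTemplateName   => 'wbbdir.cfg',           # invented template file name
    wbbSourceRegExp   => '\.wbb$',
    wbbFileNameRegExp => '^(.+)\.wbb$',
    wbbFileLangRegExp => '\.([a-z]{2})\.wbb$',
    wbbExtension      => '.html',
    wbbTargetFileMode => '0444',
    wbbProc           => 'PrintIn::printIn',
);

# 3. Initial values for site-specific variables
$wbbDef{Publisher} = 'Example Organization';     # invented site variable

1;
```

Since the file is plain Perl, it could equally compute values at load time, for instance by deriving the target root from an environment variable.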
Processors
Processors constitute the core of the information processing features of Webber. The only requirements on them are that they be Perl procedures included in a Perl module, and that they access variables in the Webber tuple space in the prescribed way. Modules that include processors are loaded inside Webber by means of a Perl require statement, so processor references inside a template or a source file require:
- Identifying the processor by means of the standard Perl mechanism
Module::procedure. This way, to use the processor printIn
included in module PrintIn, a Webber source file must use the
following variable assignment:
#wbbProc = PrintIn::printIn
- Using the Perl standard extension ``.pm'' to identify files defining
modules. In the example above, module PrintIn should be defined in
a file called PrintIn.pm.
- Files defining modules must be located either in the standard Perl
library directories or in the directories defined by means of
@wbbProcLib and #wbbProcs.
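The three conventions above amount to a standard Perl module lookup, which can be sketched as follows. The function load_processor is hypothetical and not part of Webber; it takes the extra library directories as arguments where Webber would use @wbbProcLib and #wbbProcs. So that the sketch runs without a Webber installation, the usage line resolves a procedure from the core module File::Basename instead of a real processor.

```perl
use strict;
use warnings;

# Resolve a "Module::procedure" reference into a code reference,
# searching the given library directories before the standard ones.
sub load_processor {
    my ($ref, @lib_dirs) = @_;
    my ($module, $proc) = $ref =~ /^(.+)::([^:]+)$/
        or die "Bad processor reference '$ref'\n";
    (my $file = "$module.pm") =~ s{::}{/}g;   # PrintIn -> PrintIn.pm
    local @INC = (@lib_dirs, @INC);           # extra directories are searched first
    require $file;
    my $code = $module->can($proc)
        or die "$module does not define $proc\n";
    return $code;
}

# Stand-in for a processor: a procedure from a core module.
my $code = load_processor('File::Basename::basename');
print $code->('/wbbsrc/docs/tech/report1'), "\n";   # prints "report1"
```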
As we have said before, variables in the Webber tuple space are made available to processor interfaces by means of a hash named %Webber::var, using the variable name (without the # character) as key.
Currently, processors are invoked sequentially by Webber (a model for the use of concurrent components is under development). Webber distinguishes three categories of processors, which are invoked at different moments during the production of the target. Processors are activated according to their category and, within a given category, in the order in which they appear in the corresponding variable. The applicable processor categories, the variables in the Webber tuple space that define them, and the moment at which they are called are as follows:
- Pre-processors are called at the beginning of the processing of the
source. At this point, no special content variable is set by Webber, nor any
special action is taken by Webber after calling them. Pre-processors are
defined by means of the Webber variable #wbbPre.
- Processors are called immediately after the last pre-processor returns.
If no processor in this category is defined, Webber provides a default behavior:
the content of #wbbIn is copied to #wbbOut. Processors are defined
by means of the Webber variable #wbbProc.
- Post-processors are called immediately after the last processor returns
(or after copying #wbbIn to #wbbOut if no processor was defined).
Once the last post-processor returns, Webber prints the content of
#wbbOut, whatever it is, to the target. Post-processors are defined
by means of the Webber variable #wbbPost.
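The calling sequence for the three categories can be sketched as a small driver. This is an illustration of the order described above, not Webber's real main loop: the names run_category, build_target and the Demo package are hypothetical, processor names are assumed to be separated by whitespace, and the redirection of standard output to the target is not modeled.

```perl
use strict;
use warnings;

# Two toy processors standing in for real modules (hypothetical names).
package Demo;
sub upper  { my $v = \%Webber::var; $v->{wbbIn}  = uc $v->{wbbIn}; }
sub footer { my $v = \%Webber::var; $v->{wbbOut} .= "\n-- end --"; }

package main;

# Call every processor named in a variable, in the order listed.
# Returns the number of processors that were called.
sub run_category {
    my ($var_name) = @_;
    my @names = split ' ', $Webber::var{$var_name} // '';
    for my $name (@names) {
        no strict 'refs';                     # processors are looked up by name
        die "Unknown processor $name\n" unless defined &{$name};
        &{$name}();
    }
    return scalar @names;
}

sub build_target {
    run_category('wbbPre');
    my $ran = run_category('wbbProc');
    # Default behavior when no processor is defined: copy #wbbIn to #wbbOut.
    $Webber::var{wbbOut} = $Webber::var{wbbIn} unless $ran;
    run_category('wbbPost');
    return $Webber::var{wbbOut};              # Webber would print this to the target
}

%Webber::var = (wbbIn => 'hello', wbbPre => 'Demo::upper', wbbPost => 'Demo::footer');
print build_target(), "\n";
```

In this run the pre-processor upper-cases #wbbIn, the default copy fills #wbbOut because #wbbProc is empty, and the post-processor appends a footer line.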
During the execution of a processor of any category, anything the processor writes to standard output is directed to the target. The simplest and/or ad-hoc processors can therefore consist of a series of print statements.
A sample processor
The code included below is a very simple processor included in the current Webber distribution. It will help us illustrate how a Webber processor works, and give some recommendations for writing one. We do not expect you to follow these recommendations for every Webber processor you write. If you are making a processor for doing some ad-hoc formatting in a certain directory, following these guidelines can unnecessarily complicate your code (we have not followed them in the specific processors that you can find under the samples directory of the distribution). Nevertheless, we think these recommendations are sensible practices for coding general-purpose or site-wide Webber processors.
These are the contents of the file proc/PrintIn.pm included in the current Webber distribution:
#!/usr/bin/perl
#
# Webber processor for simply printing the contents of #wbbIn
#
package PrintIn;
my $name= "PrintIn";
my $version= "1.0";

sub info {
   print "$name v$version: Copy #wbbIn into #wbbOut\n";
}

sub help {
   print <<FINAL
$name

Webber processor, version $version
This program must run inside Webber.

This is the simplest Webber processor. It just passes to #wbbOut the
current contents of #wbbIn.
$name must be used as (one of) the last processor(s).

$name uses the following Webber variables:
 #wbbIn: Its value is passed to #wbbOut by the processor.
 #wbbOut: Its current value is concatenated with #wbbIn by the processor.
FINAL
}

sub printIn {
   $var = \%Webber::var;
   $$var{'wbbOut'} .= "<!-- Webber proc $name v$version -->\n";
   $$var{'wbbOut'} .= $$var{"wbbIn"};
}

if ($0 =~ /$name/) {
   &help;
   die ("\n");
}

1;
The first three lines of code define the module namespace by means of a package statement, and a pair of variables containing the module name and its current version. It is advisable to include two procedures that give information about the processor(s) included within the module. These procedures must be called help and info, to be accessible by the -H and -I options of Webber, respectively. If you look at the end of the code, the help procedure is also called whenever the module is run directly, providing a simple way of obtaining information about the processor(s), and of auto-documenting the module (using Webber itself).
The procedure printIn constitutes the only processor included in this module. Although the body consists of only three lines, they illustrate the mechanisms that a well-behaved Webber processor should use. The first line provides a reference to the Webber tuple space hash, so it is shorter (and simpler) to reference variables in the rest of the code. The second line adds a line to #wbbOut that records the use of this processor: this way, it will be possible to identify the processors used when a page was built by reading its content. The third line performs the actual processing, in this case simply including the content of #wbbIn in #wbbOut.
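The same three-step pattern carries over to processors that do more than copy. The following is a hypothetical module, WrapHtml, written for illustration only and not included in the Webber distribution; the variable #title is likewise an invented site variable. The last lines fill %Webber::var by hand so that the sketch can be run outside Webber.

```perl
package WrapHtml;    # hypothetical module; would live in WrapHtml.pm
use strict;
use warnings;

my $name    = "WrapHtml";
my $version = "0.1";

sub info {
    print "$name v$version: Wrap #wbbIn in a minimal HTML page\n";
}

sub help {
    print <<"FINAL";
$name Webber processor, version $version
$name uses the following Webber variables:
 #title:  text for the title element (invented variable)
 #wbbIn:  its value becomes the page body
 #wbbOut: receives the generated page
FINAL
}

sub wrapHtml {
    my $var   = \%Webber::var;                        # reference to the tuple space
    my $title = defined $$var{'title'} ? $$var{'title'} : 'Untitled';
    $$var{'wbbOut'} .= "<!-- Webber proc $name v$version -->\n";   # trace of the processor
    $$var{'wbbOut'} .= "<html><head><title>$title</title></head>\n"
                     . "<body>\n$$var{'wbbIn'}\n</body></html>\n";
}

# Stand-alone demonstration: simulate the tuple space by hand.
package main;
%Webber::var = (title => 'Hello', wbbIn => '<p>Hi</p>');
WrapHtml::wrapHtml();
print $Webber::var{wbbOut};
```

A source file would select such a processor with an assignment like #wbbProc = WrapHtml::wrapHtml, provided the module file is in one of the directories Webber searches.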