Figure Excerpt of a PRD and its HTML code

The purpose of a wrapper is to convert information stored implicitly in an HTML document into information stored explicitly in a data structure for further processing (e.g. conversion into XML).

According to Sahuguet et al. [1999a], a wrapper must actually fulfil three roles:

1. retrieving the Web document

2. extracting information from it

3. exporting it in a structured way

The work of a wrapper is impeded by the fact that the Web is not a stable environment: network failures occur, documents are ill-formed, and layout changes are common [Sahuguet et al. 1999a]. For example, the structure of a PRD document will gradually evolve over time. The project managers who write those documents and the programme managers who base their decisions upon them may decide that certain parts are no longer relevant and can be omitted, or may want to insert new attributes (such as the names of team members and their tasks within the project). It may also make sense to structure certain parts of the document better: for later comparisons of deliverables, for instance, a PRD and its corresponding HR should use the same deliverable structure, which is not the case at the moment.

To make a wrapper robust in such an environment, Sahuguet et al. [1999a] identify the following key requirements:

• Modular layered architecture:

(1) retrieval rules, (2) extraction rules, (3) mapping rules

• Declarative specification of each layer

• Multi-granularity extraction:

It must be possible to extract not only inter-tag information but also intra-tag information, such as a comma-separated enumeration inside a table cell.

• Precise semantics

e.g. use of support tools like wizards
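The multi-granularity requirement above can be illustrated with a minimal Python sketch. It is not W4F code: the class `CellCollector`, the function `split_enumeration`, and the sample HTML fragment are hypothetical names and data chosen for illustration. The first step works at tag level (the text of each table cell), the second inside a single cell (the items of a comma-separated enumeration).

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td> element (inter-tag level)."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_cell:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self.cells.append("".join(self._buffer).strip())
            self._in_cell = False


def split_enumeration(cell_text):
    """Splits a comma-separated cell into its items (intra-tag level)."""
    return [item.strip() for item in cell_text.split(",") if item.strip()]


# Hypothetical PRD fragment; the field name and values are invented.
SAMPLE_HTML = "<table><tr><td>Team</td><td>Alice, Bob , Carol</td></tr></table>"

collector = CellCollector()
collector.feed(SAMPLE_HTML)
collector.close()

# The second cell holds the enumeration; split it into single names.
team_members = split_enumeration(collector.cells[1])
```

Keeping the two steps separate means the intra-tag rule can change (for example, to a semicolon separator) without touching the code that locates the cell.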

In the case study, the W4F toolkit [Sahuguet et al. 1999a] is used to wrap PRD and HR documents (HTML) into XML documents.

The architecture of the W4F toolkit and the wrapping process are visualised in Figure 10. The process starts with retrieving the document from the intranet, which is then parsed into a DOM-like tree. In the next step, the information is extracted from this tree into an NSL (Nested String List) and from there mapped into the desired XML structure. The following section shows the individual steps of this process, how the corresponding layers of the toolkit work, and what they look like in the case study.

Figure 10 - W4F Wrapper Architecture, adapted from Sahuguet et al. [1999b]
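The stages of this process can be sketched in plain Python to show how the layers hand data to one another. This is an illustration under stated assumptions, not the W4F API: W4F uses its own declarative rules, whereas every function name here (`retrieve`, `parse`, `extract`, `to_xml`), the two-column table layout, and the sample document are hypothetical. Ordinary Python lists stand in for the NSL.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.request import urlopen


def retrieve(url):
    """Retrieval layer: fetch the raw HTML (assumes a UTF-8 page)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class Node:
    """One element of a simplified, DOM-like tree."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""


class TreeBuilder(HTMLParser):
    """Parsing step: turn HTML into a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self._current)
        self._current.children.append(node)
        self._current = node

    def handle_endtag(self, tag):
        # Assumes well-formed input without void elements such as <br>;
        # a real wrapper has to tolerate ill-formed HTML.
        if self._current.parent is not None:
            self._current = self._current.parent

    def handle_data(self, data):
        self._current.text += data


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Depth-first search for all descendants with the given tag."""
    found = []
    for child in node.children:
        if child.tag == tag:
            found.append(child)
        found.extend(find_all(child, tag))
    return found


def extract(tree):
    """Extraction layer: one [name, value] string list per table row."""
    rows = []
    for row in find_all(tree, "tr"):
        cells = [cell.text.strip() for cell in find_all(row, "td")]
        if len(cells) == 2:
            rows.append(cells)
    return rows


def to_xml(nested, root_name="prd"):
    """Mapping layer: turn the nested list into an XML string."""
    root = ET.Element(root_name)
    for name, value in nested:
        ET.SubElement(root, name.lower().replace(" ", "_")).text = value
    return ET.tostring(root, encoding="unicode")


# Invented sample document standing in for a retrieved PRD page.
SAMPLE = ("<html><body><table>"
          "<tr><td>Project Name</td><td>Intranet Search</td></tr>"
          "<tr><td>Manager</td><td>J. Doe</td></tr>"
          "</table></body></html>")

nested = extract(parse(SAMPLE))
xml_text = to_xml(nested)
```

Because each layer only consumes the output of the previous one, a change in the document layout would affect `extract` alone, which is the point of the modular layered architecture.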