Extraction Engine

Extraction rules are then applied to the parse tree to extract information. The rules are expressed in the HTML Extraction Language (HEL) [Sahuguet et al. 1999a]. Sahuguet et al. [1999a] have also developed an extraction wizard (see section 6.2.4) that constructs the extraction rules automatically.

Figure 11 - HTML Document parsed into the DOM

These extraction rules are described as paths along the tree that always return text values. In Figure 12, the extraction rule first follows the document hierarchy ("html.body[0]") and then skips along the flow of the document ("->table[h0].tr[0].td[1].txt"). The index variable h0 signals that the constraint on h0 must be fulfilled before the information is extracted into an NSL. The constraint states that, for the path "html.body[0]->table[h0].tr[0].td[0].b[0].pcdata[0].txt", the text token must equal "Project No:". If the condition is true, the NSL is filled with data according to the extraction rule.
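The difference between the two navigation steps and the role of the constraint can be illustrated with a minimal Python sketch. It is not the HEL implementation: the Node class, the helper names (child, flow, txt, extract_project_no) and the sample document with its value "P-42" are hypothetical, and the sketch assumes that "." selects among direct children while "->" selects among all descendants in document (depth-first) order. The pcdata step of the original path is omitted because the simplified nodes store their text directly.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """A simplified parse-tree node (hypothetical stand-in for a DOM element)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def child(node: Optional[Node], tag: str, index: int) -> Optional[Node]:
    """Hierarchy step ('.'): the index-th direct child with the given tag."""
    if node is None:
        return None
    matches = [c for c in node.children if c.tag == tag]
    return matches[index] if index < len(matches) else None


def descendants(node: Node) -> Iterator[Node]:
    """Yield every node below `node` in document (depth-first) order."""
    for c in node.children:
        yield c
        yield from descendants(c)


def flow(node: Optional[Node], tag: str, index: int) -> Optional[Node]:
    """Flow step ('->'): the index-th node with the given tag in document order."""
    if node is None:
        return None
    matches = [d for d in descendants(node) if d.tag == tag]
    return matches[index] if index < len(matches) else None


def txt(node: Optional[Node]) -> Optional[str]:
    """Text of a node and everything below it, or None if the path failed."""
    if node is None:
        return None
    return node.text + "".join(txt(c) or "" for c in node.children)


def extract_project_no(html: Node) -> Optional[str]:
    """Try each table index h0 until the constraint holds, then read td[1]."""
    body = child(html, "body", 0)
    h0 = 0
    while flow(body, "table", h0) is not None:
        row = child(flow(body, "table", h0), "tr", 0)
        label = txt(child(child(row, "td", 0), "b", 0))
        if label == "Project No:":  # the constraint on h0
            return txt(child(row, "td", 1))
        h0 += 1
    return None


# Hypothetical document: a layout table nested in a div, then the data table.
doc = Node("html", children=[Node("body", children=[
    Node("div", children=[Node("table", children=[Node("tr", children=[
        Node("td", text="Menu"), Node("td", text="Home")])])]),
    Node("table", children=[Node("tr", children=[
        Node("td", children=[Node("b", text="Project No:")]),
        Node("td", text="P-42")])]),
])])

print(extract_project_no(doc))  # prints P-42
```

In this sample the first table in document order is a layout table, so the constraint fails for h0 = 0 and succeeds for h0 = 1. The rule therefore does not depend on the absolute position of the data table in the page.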

Because the required information is not always captured entirely by the HTML structure exposed through the document object model (e.g. an enumeration inside a table cell), HEL provides two regular expression operators, match and split, which follow Perl syntax [Wall et al. 1996] and capture finer granularities.
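The effect of these two operators on the text of a single cell can be approximated with Python's re module, whose syntax is close to, but not identical with, Perl's. The function names hel_match and hel_split and the sample cell contents are hypothetical, and the return conventions (a list of captured groups, or the whole match when the pattern has no groups) are an assumption of this sketch rather than documented HEL behaviour.

```python
import re
from typing import List


def hel_match(text: str, pattern: str) -> List[str]:
    """Rough analogue of match: the captured groups of the first match,
    or the whole match if the pattern defines no groups."""
    m = re.search(pattern, text)
    if m is None:
        return []
    return list(m.groups()) if m.groups() else [m.group(0)]


def hel_split(text: str, pattern: str) -> List[str]:
    """Rough analogue of split: cut the text at every occurrence of the pattern."""
    return re.split(pattern, text)


# Hypothetical cell contents, for illustration only.
print(hel_match("Issue 3", r"[0-9]"))                  # prints ['3']
print(hel_split("Energy Transport Materials", r" "))   # prints ['Energy', 'Transport', 'Materials']
```

A match with the pattern [0-9] isolates a digit from a cell that mixes a label and a number, while a split on a blank turns an enumeration inside one cell into a list of separate items.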

EXTRACTION_RULES

all = html.body[0](
    ->table[h0].tr[0].td[1].txt                //Project No - Identifier
  # ->tr[0].td[2].tt[0].txt, match/[0-9]/      //Issue
  # ->tr[1].td[1].tt[0].txt                    //Title
  # ->tr[2].td[1].tt[0].txt                    //StartDate
  # ->tr[2].td[3].tt[0].txt                    //EndDate
  # ->tr[3].td[1].tt[0].txt                    //CorporateProgramme
  # ->tr[3].td[3].tt[0].txt, split/ /          //ProgrammeManager
  # ->tr[4].td[1].tt[0].txt                    //Themes
)

// Constraints
WHERE html.body[0]->table[h0].tr[0].td[0].b[0].pcdata[0].txt =~ "Project No:"
AND ...
