Keywords:
-
Summary:
The paper describes the architecture of XTRACT, a system for inferring an
accurate, meaningful, near-optimal DTD schema for a repository of XML
documents. It presents some very interesting ideas on an important and
challenging subject.
The XTRACT system executes three steps:
1. Generalization (finding patterns in the input sequences and replacing them
with regular expressions to generate general candidate DTDs)
2. Factoring (factoring candidate DTDs using adaptations of algorithms for the
optimization of Boolean functions)
3. Applying the MDL principle (applying the Minimum Description Length
principle to find the near-optimal DTD among the candidates)
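To make step 1 concrete, here is a minimal sketch of one very simple kind of generalization: collapsing a run of the same child element into a starred term. The function name `generalize` and the `min_repeats` parameter are hypothetical, and this toy rule is only an illustration of the idea; it is not the algorithm used by XTRACT.

```python
from itertools import groupby


def generalize(sequence, min_repeats=2):
    """Collapse runs of a repeated element name into `name*`.

    Toy stand-in for a generalization step (an assumption for
    illustration, not XTRACT's actual pattern-finding algorithm).
    """
    parts = []
    for name, run in groupby(sequence):
        count = len(list(run))
        if count >= min_repeats:
            # A long enough run is replaced by a Kleene-star term.
            parts.append(f"{name}*")
        else:
            # Shorter runs are kept exactly as they appeared.
            parts.extend([name] * count)
    return ", ".join(parts)


children = ["title", "author", "author", "author", "year"]
candidate = generalize(children)
print(candidate)
```

A real system would also have to detect repeated subsequences and optional or alternative elements, which this sketch does not attempt.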
The authors provide experimental results in comparison with DDbE (Data
Descriptors by Example, a tool from IBM alphaWorks(R)).
The paper's key contribution lies in applying the MDL principle to define an
information-theoretic measure that quantifies and resolves the tradeoff between
the conciseness and the precision of DTDs. This is indeed a reasonable and
intriguing first cut at this difficult problem, but I am not fully convinced
that this should be the final word. It could well be that conciseness achieved
through general regular expressions reduces the readability and intuitiveness
of a DTD. Still, this paper should be an excellent starting point for more
intensive work along these lines.
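The conciseness versus precision tradeoff can be illustrated with a toy two-part MDL cost: the bits needed to write down the candidate expression, plus the bits needed to identify each input sequence among all strings the expression accepts. Everything in this sketch is an assumption made for illustration (single-character element names, a uniform code over accepted strings up to the longest input, the name `mdl_cost`); it is not the encoding scheme defined in the paper.

```python
import math
import re
from itertools import product


def mdl_cost(regex, sequences, alphabet):
    """Toy two-part MDL cost in bits (illustrative assumption only)."""
    pattern = re.compile(regex)
    # A candidate that rejects any input sequence is not a valid schema.
    if not all(pattern.fullmatch(s) for s in sequences):
        return math.inf
    # Theory cost: each regex character is drawn from the element
    # alphabet plus the four metacharacters "(", ")", "|", "*".
    theory_bits = len(regex) * math.log2(len(alphabet) + 4)
    # Data cost: uniform code over every accepted string whose length
    # does not exceed that of the longest input sequence.
    max_len = max((len(s) for s in sequences), default=0)
    accepted = sum(
        1
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if pattern.fullmatch("".join(chars))
    )
    data_bits = len(sequences) * math.log2(accepted) if sequences else 0.0
    return theory_bits + data_bits


sequences = ["ab", "abab", "ababab"]
candidates = ["ab|abab|ababab", "(ab)*", "(a|b)*"]
costs = {c: mdl_cost(c, sequences, "ab") for c in candidates}
best = min(costs, key=costs.get)
print(best, costs)
```

Under these assumptions the exact enumeration pays heavily for its own length, the fully general `(a|b)*` pays for describing the data, and `(ab)*` has the lowest total. Note that the cost says nothing about whether the chosen expression is the one a human would find most readable, which is the concern raised above.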