In this chapter I discuss the model on which the PBD system is based. To extract specific information records from a web page, two steps are involved:
1. Segregate the records in the web page by:
• providing hints from the listing of information records, or
• using the repetition of tags in the web page as an important source of information.
2. Extract the individual fields of each record by:
• rules-based extraction,
• appearance-based extraction, or
• sample-based extraction.
Before explaining each of the methods for segregating records from a web page and then identifying the individual fields in a record, it is important to state the considerations on which the model was based.
The model has to be simple enough for a novice user. Here, a novice means a regular Internet surfer, or a person who usually writes his or her documents in a word processor such as Word. We identified a trade-off involved in the design of the system: if finer controls are given to the user, the system becomes much more complicated to use. In designing PBD systems for novice users, a crucial assumption is that the system should follow the workings of the novice's mind. The novice should not have to think like a computer in order to program; it should be the other way around. The system should be simple and intuitive for the novice to understand. Since we cannot make the system more robust without increasing the user's control over it, we instead need to increase the sophistication of the system. Increasing the sophistication of the system essentially means using artificial-intelligence methods, which take more computation time. Such methods are also based mostly on heuristics about the Web, and with the changing character of the Web the system would become too brittle. The system model developed below is the result of an iterative exercise that kept the trade-off between user control and system sophistication in mind. Further, the system was tested with subjects to affirm its accuracy and usability, and thus to show that appropriate intelligence has been built into the system. The chapter on usability presents the results in more detail.
Before information about a single record can be extracted (such as a book's price, its title, etc.), the system needs to know what constitutes a record. This is done by the user providing a piece of text from the area of the web page that contains the list of records. The system then proposes candidate record samples to the user, who can accept one or ask the system to try for a better one. This continues until the user is satisfied with the sample record. The following figure illustrates segregating records by providing hints.




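This accept-or-retry interaction can be summarized in a short sketch. The Java code below is an illustration only, not the Scriptor implementation; the CandidateFinder interface and its nextCandidate method are hypothetical stand-ins for whatever proposes the next candidate record around the user's hint.

import java.util.Scanner;

public class HintLoopSketch {

    // Hypothetical interface: proposes the next candidate record that
    // encloses the user's hint, or null when no further candidates exist.
    interface CandidateFinder {
        String nextCandidate(String page, String hint, int attempt);
    }

    // Keep proposing candidate sample records until the user accepts one.
    static String chooseSampleRecord(String page, String hint, CandidateFinder finder) {
        Scanner in = new Scanner(System.in);
        for (int attempt = 0; ; attempt++) {
            String candidate = finder.nextCandidate(page, hint, attempt);
            if (candidate == null) return null;          // nothing left to propose
            System.out.println("Proposed sample record:\n" + candidate);
            System.out.print("Accept this sample? (y/n): ");
            if (in.nextLine().trim().equalsIgnoreCase("y")) {
                return candidate;                        // user is satisfied
            }
        }
    }
}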
In our PBD system, the web page is modeled as a string containing tags and text. When the user accepts the sample record, the system finds records similar in appearance to the sample. We thus obtain a set of records,
Record Set S = {R1, R2, R3, ...},
where Ri is the text and tags of the i-th individual record.
In the majority of the web sites I observed, the record information was contained within a single enclosing tag, and thus I needed only those tags to identify what is in a record. However, on a few sites the information for a single record was split between two tags. For example, on the Amazon web site the title and author were in one enclosing tag and the pricing information was in another. For such web sites I found the following technique helpful, but note that it is costly in terms of computation and its results are not always correct either.
On sites such as book sites and auto sites that display listings, there is important tag information that can be used to segregate records from the web page. Although the information within the tags changes, the tags themselves remain the same across all records. This means that if one removes all the textual information from the web page and leaves only the tags, there remains a string of tags that keeps repeating itself. This is a heuristic, and it does not work in all cases: if only a few records are returned, such a repeating tag string cannot be detected. The more records a search on a web site returns, the more likely it is that the tag string repeats itself often. A simple example illustrates the idea: suppose that, after removing all textual information from a web page, a particular string of tags is left
repeating itself four times; that string most likely encloses the record information. Note that a threshold on the number of repetitions needs to be defined as well. If the threshold is low, many tag strings come up as candidates for the tags enclosing record information. If the threshold is set higher than the number of times the actual enclosing tag string repeats itself, no tag string is returned at all. Thus an optimum is required, which can be identified and re-identified by the user.
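As a rough illustration only, and not the actual Scriptor code, the following Java sketch shows this heuristic under the assumption that the page is first reduced to its sequence of tag names and that occurrences are counted without overlap. The class and method names are invented for this sketch; the threshold parameter plays the role of the repeatThreshold variable declared in Init.agt in the appendix.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagRepetitionSketch {

    // Remove all textual information, keeping only the tag names in order
    // (closing tags contribute their name as well).
    static List<String> tagSequence(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = Pattern.compile("</?([a-zA-Z][a-zA-Z0-9]*)").matcher(html);
        while (m.find()) tags.add(m.group(1).toLowerCase());
        return tags;
    }

    // Count non-overlapping occurrences of a candidate tag string.
    static int count(List<String> tags, List<String> cand) {
        int n = 0, i = 0;
        while (i + cand.size() <= tags.size()) {
            if (tags.subList(i, i + cand.size()).equals(cand)) { n++; i += cand.size(); }
            else i++;
        }
        return n;
    }

    // Brute-force search for the longest tag string that repeats at least
    // 'threshold' times; it is the most likely enclosure of a record.
    static List<String> repeatingTagString(List<String> tags, int threshold) {
        List<String> best = Collections.emptyList();
        for (int start = 0; start < tags.size(); start++) {
            for (int len = best.size() + 1; start + len <= tags.size(); len++) {
                List<String> cand = tags.subList(start, start + len);
                if (count(tags, cand) >= threshold) best = new ArrayList<>(cand);
                else break;   // a longer string at this start cannot repeat more often
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String page = "<table><tr><td>A</td></tr><tr><td>B</td></tr>"
                    + "<tr><td>C</td></tr><tr><td>D</td></tr></table>";
        // Prints [tr, td, td, tr], which repeats four times in this page.
        System.out.println(repeatingTagString(tagSequence(page), 3));
    }
}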
The individual fields of a record are retrieved using the following tools.


This tool works as follows: anything that precedes or follows a user-specified string in a record can be extracted. For example, if the discount on a book is desired, precedes % could be specified; the system would return 20, 7 and 13.
Follows and precedes rules can be combined with AND or OR to create more complex extraction rules.
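As a rough sketch of these rules, and not the Scriptor implementation, the following Java code treats a record as plain text with its tags already stripped and returns the token immediately after or before a user-supplied anchor string. The method names and the sample record are invented for illustration, reusing values from the book-listing example above.

public class RulesSketch {

    // Return the token that immediately follows the first occurrence of
    // 'anchor' in the record, or null if the anchor is not present.
    static String follows(String record, String anchor) {
        int i = record.indexOf(anchor);
        if (i < 0) return null;
        String rest = record.substring(i + anchor.length()).trim();
        return rest.isEmpty() ? null : rest.split("\\s+")[0];
    }

    // Return the token that immediately precedes the first occurrence of
    // 'anchor' in the record, or null if the anchor is not present.
    static String precedes(String record, String anchor) {
        int i = record.indexOf(anchor);
        if (i <= 0) return null;
        String before = record.substring(0, i).trim();
        if (before.isEmpty()) return null;
        String[] tokens = before.split("\\s+");
        return tokens[tokens.length - 1];
    }

    public static void main(String[] args) {
        String record = "ABC of Java $ 30.09 save 20 %";
        System.out.println(follows(record, "$"));   // 30.09 -> the price
        System.out.println(precedes(record, "%"));  // 20    -> the discount
        // Such rules could be AND-ed or OR-ed, e.g. accept a token only when
        // both a follows rule and a precedes rule select it.
    }
}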
Appearance-based extraction is based on the idea that certain fields in a record are formatted differently (e.g. underlined, bold, or in a different color) from the other fields. This is valuable in cases where distinguishing attributes such as % and $ are not present for a field. For example, suppose we want to extract the title of each book. There is no distinguishing character to anchor a rule on, but the title of every book is underlined. The user selects ABC of Java and indicates this to the system, and the system returns Java 1.1, ABC of Java and Programming by Example. We can see from this simple example that the formatting of a field can by itself identify that field across records.
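The following toy sketch, which is not the Scriptor code, illustrates the idea: each piece of text in the listing carries a formatting label, and whatever shares the formatting of the user's example is returned. The Piece class and extractByAppearance method are hypothetical names, and the listing data simply restates the book example above.

import java.util.ArrayList;
import java.util.List;

public class AppearanceSketch {

    // A piece of text together with its formatting (e.g. "u" = underlined).
    static class Piece {
        final String text, formatting;
        Piece(String text, String formatting) { this.text = text; this.formatting = formatting; }
    }

    // Find the formatting of the user's example, then return every piece of
    // text in the listing that is formatted the same way.
    static List<String> extractByAppearance(List<Piece> pieces, String example) {
        String wanted = null;
        for (Piece p : pieces) {
            if (p.text.equals(example)) { wanted = p.formatting; break; }
        }
        List<String> result = new ArrayList<>();
        if (wanted == null) return result;
        for (Piece p : pieces) {
            if (p.formatting.equals(wanted)) result.add(p.text);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Piece> listing = List.of(
            new Piece("Java 1.1", "u"),                new Piece("19.95", "plain"),
            new Piece("ABC of Java", "u"),             new Piece("30.09", "plain"),
            new Piece("Programming by Example", "u"),  new Piece("49.99", "plain"));
        // The user points at "ABC of Java"; everything underlined is returned:
        // [Java 1.1, ABC of Java, Programming by Example]
        System.out.println(extractByAppearance(listing, "ABC of Java"));
    }
}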



Sample-based extraction works as follows: the user gives the system samples, and the system constructs a generic pattern for extraction. In the example above, if the system is given the samples 19.95 and 30.09, it constructs the pattern dd.dd, where d stands for a digit. Anything that matches this pattern is a price; the system would return 19.95, 30.09 and 49.99 as the prices of the books.
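A minimal sketch of this generalization follows, assuming the simple digit-to-d abstraction described above; the method names are invented for this sketch and do not correspond to the actual Scriptor code.

import java.util.List;

public class PatternSketch {

    // Abstract a string by replacing every digit with 'd' and leaving other
    // characters unchanged, e.g. "19.95" -> "dd.dd".
    static String shape(String s) {
        StringBuilder b = new StringBuilder();
        for (char c : s.toCharArray()) b.append(Character.isDigit(c) ? 'd' : c);
        return b.toString();
    }

    // Generalize a set of samples into a pattern; here all samples must share
    // the same shape (a fuller system would generalize further when they differ).
    static String generalize(List<String> samples) {
        String pattern = shape(samples.get(0));
        for (String s : samples) {
            if (!shape(s).equals(pattern)) {
                throw new IllegalArgumentException("samples do not share a common shape");
            }
        }
        return pattern;
    }

    // A candidate string matches the pattern if it has the same shape.
    static boolean matches(String candidate, String pattern) {
        return shape(candidate).equals(pattern);
    }

    public static void main(String[] args) {
        String pattern = generalize(List.of("19.95", "30.09"));
        System.out.println(pattern);                   // dd.dd
        System.out.println(matches("49.99", pattern)); // true  -> extracted as a price
        System.out.println(matches("Java", pattern));  // false
    }
}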
The following figure illustrates the generalization of the publishing year from a set of samples taken from the web site.


The following illustrates Scriptor, a practical implementation of the system model. The source code is given in the appendix. The system was developed in Java and WebL, a web query language developed by the Compaq Systems Lab.
You provide the system with a hint taken from the listing of information records.




The system shows a sample so that you can confirm what a typical record looks like.

You specify to the system that anything with the same structural appearance as Advanced Java (a hyperlink, blue in color) is a title.

You specify $ in the attribute box and tell the system that anything in a record that follows a $ sign is the price of the book. If there are multiple occurrences of $, the first one is used.

You specify samples to the system from two or more records in the listing of information records. The system generalizes from these samples.






One can then step through the other records extracted as a result of the tools applied to the web site.
Software was developed to experiment with the usability of the PBD system model. An instructor spent 20 minutes acquainting the subjects with the software.
Novice subjects were selected for this experiment. All subjects were given the same web pages and asked to extract fields using rule-based, appearance-based or pattern-based extraction. The pages were drawn from diverse sites and were diverse in appearance.
The following summarizes the data gathered from the experiment for different web sites.
                                                Part   Description   Price
Number of subjects getting correct extraction     8        8           2
Total subjects                                    8        8           8
Percent correct                                 100%     100%         25%

                                                CRN   Description   Prerequisites
Number of subjects getting correct extraction    10       10             10
Total subjects                                   10       10             10
Percent correct                                 100%     100%           100%
                                                Item   Description   Price
Number of subjects getting correct extraction     9        6           9
Total subjects                                   10       10          10
Percent correct                                  90%      60%         90%

                                                Ticker   Price   Change
Number of subjects getting correct extraction      9       8        9
Total subjects                                    10      10       10
Percent correct                                   90%     80%      90%

                                                Year and Car   Mileage   Color   Price
Number of subjects getting correct extraction         7           10       10      10
Total subjects                                       10           10       10      10
Percent correct                                      70%         100%     100%    100%

The subjects were given web pages in the following order: Autotrader, Secapl, E-bay, School of Information Science Course List-University of Pittsburgh, Microwarehouse.
Time taken to complete each web site (minutes):

Subject    Autotrader   Secapl   E-bay   SIS   Microwarehouse
   1           11          9       10     7          11
   2            6          7        4     3           7
   3            7          3        4     3           4
   4            5          4       10     4           4
   5            6          5        4     3           5
   6           10          6        5     5           7
   7            9          7        6     3           7
   8           10          3        9     3           7
Average         8         5.5      6.5   3.9         6.5
Subjects found rule-based extraction using follows and precedes quite intuitive; they tried these rules first and fell back on hints when the rules failed. Very few subjects used learned-pattern-based extraction. The price field in the Microwarehouse records could be obtained correctly only through pattern-based extraction, which is why we see lower accuracy there.
• Script generation time is quite impressive in comparison with manually scripting the web sites in Perl or another language. Moreover, these are times for novice users: a novice subject generated a script in less than 12 minutes on any site, far less than the time it takes for a programmer to code the script manually.
• Accuracy is quite impressive too. Subjects obtained good scores in extracting fields, with accuracy above 80% in most cases.
There are various issues that were not considered here, such as the separate modes that could be made available for novice and expert users. I believe the learning curve of a PBD system for an expert would still be far lower than that of a comparable scripting language such as Perl. This could be verified by conducting a further usability study.
In the end, I conclude that it is feasible, and recommended, to involve a novice user in the design of information agents, and this approach helps solve the problem of extracting specific information from a web site independently of its underlying structure. Certain considerations must be kept in mind when designing PBD systems for such one-of-a-kind web sites, namely the trade-off between user control and the sophistication built into the system. The finer the controls built into the system, the less appealing a PBD system becomes, because its intuitiveness suffers. However, if one tries to build sophistication into the system based mainly on heuristics, the system becomes brittle: as web sites change over time, the heuristics become invalidated.
• WeblEngine.java is the code that makes calls to WebL.jar, the library provided by the Compaq Systems Lab.
• GenericAgent.java is the code that makes calls to WeblEngine.java.
• EventHandler.java handles events for the GUI files and makes subsequent calls either directly to WeblEngine.java or indirectly through GenericAgent.java.
• GuiUser.java initializes all variables and is the starting point of the program.
• All files with the agt extension are WebL files loaded dynamically by GenericAgent.java or EventHandler.java.
GenericAgent.java
import java.io.*;
import java.util.*;
import java.lang.*;
import webl.lang.*;
import webl.lang.expr.*;
import webl.util.*;
EventHandler.java
import java.awt.event.*;
import java.awt.*;
import java.util.*;
import java.io.*;
public class EventHandler {
    // ... (class body omitted in this excerpt)
}
WeblEngine.java
import webl.lang.*;
import webl.lang.expr.*;
import webl.util.*;
import java.io.*;
import java.util.*;
GuiUser.java
import java.util.*;
import java.io.*;
public class GuiUser {
    // ... (class body omitted in this excerpt)
}
LabelGui.java
import java.awt.*;
import java.awt.event.*;
import java.io.*;
import java.util.*;
PatternGui.java
import java.awt.*;
import java.awt.event.*;
import java.io.*;
import java.util.*;
RulesGui.java
import java.awt.*;
import java.awt.event.*;
import java.io.*;
import java.util.*;
public class RulesGui {
    // ... (class body omitted in this excerpt)
}
SearchGui.java
import java.awt.*;
import java.awt.event.*;
import java.util.*;
import java.io.*;
SelectGui.java
import java.awt.*;
import java.awt.event.*;
import java.util.*;
import java.io.*;
public class SelectGui {
}
ViewGui.java
import java.awt.*;
import java.awt.event.*;
import java.util.*;
import java.io.*;
public class ViewGui {
    // ... (class body omitted in this excerpt)
}
Init.agt
var pathname = "";
var P = nil;
var website = "default";
var markuptxt = "";
var pieceset = nil;
var piece = nil;
var str = nil;
var pieceptr = 0;
var currList = nil;
var currptr = 0;
var stackstmt = [..];
var level = 0;
var lastlevel = -1;
var index = 0;
var patternindex = 0;
var weblobj = [..];
var minweblobj = -1;
var weblmarkuptxt = "";
var txt = [];
var emptyObj = [..];
emptyObj[0]:="";
emptyObj[1]:="";
emptyObj[2]:="";
var resultList = [];
var repeatThreshold = 3;
var weblstmt = [..];
var searchString = "";
var searchStringPiece = nil;
var taggedPiece = nil;
var queryStmt = nil;
var queryPiece = nil;
var startQueryPiece = nil;
var childTag = "";
var records = nil;
var CurrentRec = nil;
var labelSearchStringPiece = nil;
var TempRecord = nil;
var labelSearchString = "";
var labelTaggedPiece = nil;
var labelQueryPiece = nil;
var labelQueryStmt = "";
var CheckNilStmt = "";
var dummy = "";
var TempPage = nil;
var DummyRec = nil;
var PatStmt = "";
InitQuery.agt
TempPage = NewPage(Markup(P), "text/html");
searchStringPiece = Pat(TempPage, searchString)[0];
Replace(searchStringPiece, taggedPiece);
startQueryPiece = Parent(Elem(TempPage, "searchtag")[0]);
Pattern.agt
import Str;
stack = [..];
stackstmt = [..];
level = 0;
lastlevel = -1;
index = 0;
weblmarkuptxt = "";
weblstmt = [..];
end;
PieceSearch.agt
PrintLn("PieceSearch begins");
Replace(labelSearchStringPiece, labelTaggedPiece);
labelQueryPiece = Parent(Elem(TempRecord, "newlabel")[0]);
while (Name(Parent(labelQueryPiece)) != "body") do
"Elem(CurrentRec, \"" + Name(labelQueryPiece) + "\")";
end;
PrintLn("--------------------");
PrintLn(labelQueryStmt);
PrintLn("--------------------");
labelQueryStmt;
PieceSearch2.agt
PrintLn("PieceSearch2 begins");
dummy = "";
PatStmt = labelQueryStmt;
CheckNilStmt = "if (Size(" + PatStmt +
PrintLn(labelQueryStmt);
PrintLn("--------------------");
labelQueryStmt;
Query.agt
queryPiece = startQueryPiece;
queryStmt = "Elem(P, \"" + Name(queryPiece) + "\")";
PrintLn(queryStmt);
queryStmt;
Select.agt
import Str;
txt = [];
PrintLn("Select.agt starts...");
every record in records do
Size(txt);
ShowRec.agt