Evaluating Information Extraction using xROC Curves
Suppose that you have an information extraction system that behaves
like a black box. The box takes a document as input and outputs a set
of relation tuples. For example, the information extraction system may
process newspaper articles and identify mergers and acquisitions
(Company1 bought Company2), management succession events (Person1
succeeds Person2 in the CEO position), and so on.
As expected, such systems are inherently noisy and generate imperfect
output. Sometimes they miss tuples that appear in the documents and
sometimes they generate spurious tuples. One of the important
questions is how to evaluate such a system objectively and with the
minimum amount of effort.
A common evaluation strategy is to use precision-recall curves to show
how the system behaves under different settings. The precision of the
output is defined as the number of correct tuples in the output over
the total number of generated tuples; recall is defined as the
number of correct tuples in the output over the total number of
correct tuples that can be extracted from the documents.
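For concreteness, here is a minimal sketch of how the two metrics are
computed over a set of extracted tuples; the tuples below are
hypothetical, purely for illustration:

    # Minimal sketch: precision and recall over sets of extracted tuples.
    # The tuple values below are hypothetical, purely for illustration.
    extracted = {("Google", "YouTube"), ("Yahoo", "Flickr"), ("Acme", "FooCorp")}
    gold = {("Google", "YouTube"), ("Yahoo", "Flickr"), ("eBay", "Skype")}

    correct = extracted & gold                  # tuples that are actually in the documents
    precision = len(correct) / len(extracted)   # correct / all generated
    recall = len(correct) / len(gold)           # correct / all extractable

    print(f"precision = {precision:.2f}, recall = {recall:.2f}")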
Unfortunately, precision is problematic due to its dependence on the
input class distribution, as the following example illustrates:
* Example: Consider an extraction system E that generates a table of
companies and their headquarters locations, Headquarters(Company,
Location), from news articles in the "Business" and "Sports" sections
of The New York Times. The "Business" documents contain many tuples
for the target relation, while the "Sports" documents contain none.
The information extraction system works well, but occasionally
extracts spurious tuples from some documents, independently of their
topic. If the test set contains a large number of "Sports" documents,
then the extraction system will also generate a large number of
incorrect tuples from these "bad" documents, bringing down the
precision of the output. In fact, the more "Sports" documents in the
test set, the worse the reported precision, even though the underlying
extraction system remains the same. Notice, though, that recall is not
affected by the document distribution in the test set and remains
constant, independently of the number of "Sports" documents. The small
simulation below illustrates the effect numerically.
The fact that precision depends on the distribution of good and bad
documents in the test set is well-known in machine learning, from the
task of classifier evaluation. To evaluate classifiers, it is
preferable to use ROC curves, which are independent of the class
distribution in the test set. The ROC curves summarize graphically the
tradeoffs between the different types of errors. When characterizing a
binary decision process with ROC curves, we plot the true positive
rate (the fraction of actual positives correctly classified as
positives, i.e., recall) as the ordinate, and the false positive rate
(the fraction of actual negatives incorrectly classified as positives)
as the abscissa.
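For a standard binary classifier, where true negatives are well
defined, one point on the ROC curve is computed as follows; the
confusion-matrix counts below are hypothetical:

    # Minimal sketch: true positive rate and false positive rate for a
    # binary classifier, given hypothetical confusion-matrix counts.
    tp, fn = 80, 20    # actual positives classified correctly / incorrectly
    fp, tn = 30, 170   # actual negatives classified incorrectly / correctly

    tpr = tp / (tp + fn)   # recall: fraction of positives caught
    fpr = fp / (fp + tn)   # fraction of negatives mistakenly flagged

    print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")   # one point on the ROC curve

Sweeping the classifier's decision threshold produces one (false
positive rate, true positive rate) pair per threshold, which together
trace out the ROC curve.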
The standard application of ROC curves for information extraction is
unfortunately problematic, for two reasons.
First reason: We typically do not know what a "true negative" is.
Unlike document classification, a "bad tuple" does not exist a priori
in a document. It only exists because the extraction system can
extract it.
* Solution 1: One way to overcome this problem is to measure the
number of all bad tuples that can be extracted from a document
using all possible settings and all available extraction systems.
Then, we can use this number as the normalizing factor to define
the false positive rate. This solution works when dealing with a
static set of extraction systems. Alas, the definition of the false
positive rate becomes unstable if we later introduce another
system (or another setting) that generates previously unseen noisy
tuples; this changes the number of all bad tuples, which serves as
the normalizing constant, and forces recomputation of all false
positive rates.
* Solution 2: Another way to avoid this problem is by having an
un-normalized x-axis (abscissa). Instead of having the false
positive rate, we can have the average number of bad tuples
generated. In this case, the new curve is called the "Free
Response Operating Characteristic" (FROC) curve. Such techniques
are widely used in radiology to evaluate the performance of
systems that detect nodules in MRI and CAT scans. (A nodule refers
to a small aggregation of cells, indicative of a disease.) A
problem with this approach is the lack of a "probabilistic"
interpretation of the x-axis; a probabilistic interpretation can be
convenient when analyzing or integrating the extraction system as
part of a bigger system, rather than simply measuring its performance
in a vacuum. (The sketch after this list computes both variants of
the abscissa.)
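To make the two variants of the x-axis concrete, here is a small
sketch with hypothetical per-document counts; it computes both the
pooled false positive rate of Solution 1 and the FROC-style average
number of bad tuples per document of Solution 2:

    # Hypothetical per-document counts for one extraction system.
    # bad_extracted[d]: spurious tuples this system produced for document d.
    # bad_pool[d]: all distinct spurious tuples any system/setting produced
    #              for d (the pooled normalizer of Solution 1).
    bad_extracted = [2, 0, 1, 3, 0]
    bad_pool      = [4, 1, 2, 5, 2]

    # Solution 1: false positive rate normalized by the pooled bad tuples.
    fpr_pooled = sum(bad_extracted) / sum(bad_pool)

    # Solution 2: FROC-style abscissa, average bad tuples per document.
    avg_bad_per_doc = sum(bad_extracted) / len(bad_extracted)

    print(f"pooled FPR = {fpr_pooled:.2f}, avg bad tuples/doc = {avg_bad_per_doc:.2f}")

Note that only the pooled normalizer changes if a new system later
contributes previously unseen bad tuples; the FROC-style average is
unaffected.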
Second reason: It is too expensive to know all the "true positives".
You may have noticed that recall is the y-axis in the ROC/FROC curves.
Unfortunately, computing recall is a rather expensive task. If we want
to be correct, we need to have annotators read a document and infer
which tuples can be extracted from each document; the number of all
good tuples will be used as the normalizing constant to compute the
recall of each extraction system. This is an expensive task,
especially compared to the computation of precision or compared to the
computation of the false positive rate. To compute the false positives
we need only to evaluate the extracted tuples, a presumably much
easier task than reading a lengthy document.
* Solution 1: We can play the same trick as above, to avoid the
problem of reading/annotating the documents. We process each
document multiple times, using all possible settings and all
possible extraction systems. The union of the extracted tuples can
then be validated to identify the set of all correct tuples. As in
the case of true negatives, though, the definition becomes
unstable if we have a dynamic set of extraction systems that can
identify more good tuples at some point in the future, forcing a
re-calculation of recall metrics for all systems.
* Solution 2: We can also have an un-normalized y-axis. For
instance, we can have as the ordinate (y-axis) the average number
of good tuples extracted per document. (I have not seen an analogue
of the FROC curves that leaves the true positive rate unnormalized,
though.) The downside is that by leaving recall unnormalized, the
values now depend on the distribution of good and bad documents in
the input: the more bad documents with no good tuples in the test
set, the lower the unnormalized value will be. Therefore, this
definition seems to go against the spirit of ROC curves. (The sketch
after this list illustrates both options.)
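The corresponding sketch for the y-axis, again with hypothetical
per-document counts, computes the pooled recall of Solution 1 and the
un-normalized average number of good tuples per document of
Solution 2:

    # Hypothetical per-document counts for one extraction system.
    # good_extracted[d]: correct tuples this system produced for document d.
    # good_pool[d]: all distinct correct tuples found by any system/setting
    #               for d (the pooled stand-in for exhaustive annotation).
    good_extracted = [5, 0, 3, 7, 1]
    good_pool      = [6, 2, 4, 9, 1]

    # Solution 1: recall against the pooled set of good tuples.
    recall_pooled = sum(good_extracted) / sum(good_pool)

    # Solution 2: un-normalized ordinate, average good tuples per document;
    # adding documents with no good tuples would drag this value down.
    avg_good_per_doc = sum(good_extracted) / len(good_extracted)

    print(f"pooled recall = {recall_pooled:.2f}, avg good tuples/doc = {avg_good_per_doc:.2f}")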
I am not aware of any alternative definitions of the ROC curves that
do not have the problems of the solutions that I described above. If
you have any ideas or pointers, post them in the comments.
Labels: evaluation, FROC, information extraction, research, ROC
Posted by Panos Ipeirotis at 7:50 PM
Monday, July 9, 2007
Blending Mind Maps with Wikipedia
Following up on the topic of how to use wikis to organize the
research literature on a given topic: I have been thinking about the
idea of using wikis to keep track of the state of the art in a field.
One way to organize the literature
is to transform an existing survey article into a wiki and let people
edit at-will (e.g., see this article on duplicate record detection).
Another alternative is to keep a set of tables in a wiki, which
summarize the results of a paper in a single line (e.g., see here and
here for NLP-related tasks).
A problem with both approaches is the lack of a network that will
allow people to mark and annotate the relations across different
papers. The idea of using CMapTools (http://cmap.ihmc.us/) was
interesting but I could not manage to install the software to see how
it really works.
Today, I ran into WikiMindMap, a web-based service that uses the wiki
structure to convert Wikipedia articles into a mind map. The
heuristics that it uses are rather basic, but one could use this tool
to organize the literature in a wiki (keeping the benefit of
collaboration) and still have all the advantages of a visual tool
that can show connections across entities. See, for example, the
wikified article mentioned above transformed into a mind map.
Labels: mind maps, surveys, wikipedia, wikis
Posted by Panos Ipeirotis at 4:49 PM
Friday, July 6, 2007
Data Cleaning Service
Often, I get questions from colleagues and students on how to match
"dirty" data, i.e., string entries in a database that are similar but
not exactly identical. Although there is a very significant amount of
research literature, there are not that many software packages
available for free. For this reason, I decided to implement the
technique that we described in our WWW2003 paper, and make it
available on the web for anyone to use:
http://hyperion.stern.nyu.edu/SMWebSite/Default.aspx
So, if you have two lists of names that need to be matched, you can
try this service. It works well for small datasets with a few
thousand entries in each. I am pretty sure that it can scale to much
bigger datasets, but it will take some time for the service to finish
the job. One of the neat features of the service is that you can
submit the job, keep the id of the submitted job, and come back to
retrieve the results later.
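For readers who only need a rough, do-it-yourself approximation of
such matching, here is a minimal sketch based on plain character-level
similarity from the Python standard library; it is just an
illustration of the general idea, not the technique from the WWW2003
paper that the service implements, and the names and threshold are
made up:

    import difflib

    def similarity(a, b):
        """Simple character-level similarity in [0, 1]."""
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Two hypothetical lists of "dirty" names to be matched.
    list_a = ["Internatl. Business Machines", "Microsft Corp", "Google Inc."]
    list_b = ["International Business Machines", "Microsoft Corporation", "Google"]

    # Match each name in list_a to its most similar counterpart in list_b,
    # keeping the pair only if the score clears a hand-picked threshold.
    for name in list_a:
        best = max(list_b, key=lambda cand: similarity(name, cand))
        if similarity(name, best) >= 0.6:
            print(f"{name} -> {best} ({similarity(name, best):.2f})")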
If you have any ideas, or any bug reports, feel free to contact me.
Labels: data cleaning, deduplication, demo, research, string matching
Posted by Panos Ipeirotis at 7:38 PM
Wikipedia-based Web Services
We have launched a beta version of some web services that use
Wikipedia data for named entity extraction, for term identification,
and for document expansion. You can take a look and try the services
at:
http://wikinet.stern.nyu.edu/Wikinet/Wikinet.aspx
We are actively working on this, so if you see a bug or something
that does not work, please let us know.