



Evaluating Information Extraction using xROC Curves

Suppose that you have an information extraction system that behaves like a black box. The box takes a document as input and outputs a set of relation tuples. For example, the information extraction system may process newspaper articles and identify mergers and acquisitions (Company1 bought Company2), management succession events (Person1 succeeds Person2 in the CEO position), and so on.

As expected, such systems are inherently noisy and generate imperfect

output. Sometimes they miss tuples that appear in the documents and

sometimes they generate spurious tuples. One of the important

questions is how to evaluate such a system objectively and with the

minimum amount of effort.

A common evaluation strategy is to use precision-recall curves to show how the system behaves under different settings. The precision of the output is defined as the number of correct tuples in the output over the total number of generated tuples; recall is defined as the number of correct tuples in the output over the total number of correct tuples that can be extracted from the documents.
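
As a minimal concrete illustration (the tuples below are made up for the example, not the output of any real system), both metrics are just ratios over the set of extracted tuples and a manually verified gold set:

    # Made-up acquisition tuples, compared against a manually verified gold set.
    extracted = {("Google", "YouTube"), ("Yahoo", "Flickr"), ("IBM", "Microsoft")}
    gold      = {("Google", "YouTube"), ("Yahoo", "Flickr"), ("News Corp", "MySpace")}

    correct = extracted & gold
    precision = len(correct) / len(extracted)  # correct output / all generated tuples
    recall    = len(correct) / len(gold)       # correct output / all correct tuples

    print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.67, recall=0.67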

Unfortunately, precision is problematic due to its dependence on the

input class distribution, as the following example illustrates:

* Example: Consider an extraction system E that generates a table of companies and their headquarters locations, Headquarters(Company, Location), from news articles in The New York Times "Business" and "Sports" sections. The "Business" documents contain many tuples for the target relation, while the "Sports" documents do not contain any. The information extraction system works well, but occasionally extracts spurious tuples from some documents, independently of their topic. If the test set contains a large number of "Sports" documents, then the extraction system will also generate a large number of incorrect tuples from these "bad" documents, bringing down the precision of the output. Actually, the more "Sports" documents in the test set, the worse the reported precision, even though the underlying extraction system remains the same. Notice, though, that recall is not affected by the document distribution in the test set and remains constant, independently of the number of "Sports" documents in the test set. (The small simulation after this example illustrates the effect numerically.)
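
To put some (entirely made-up) numbers behind this: assume that each "Business" document contains 10 correct tuples of which the system extracts 8, and that every document, "Business" or "Sports", also yields 2 spurious tuples on average. A few lines of code show precision collapsing while recall stays put:

    # Made-up rates: 8 of 10 correct tuples extracted per "Business" document;
    # 2 spurious tuples extracted per document, regardless of topic.
    def precision_recall(num_business, num_sports):
        correct_extracted = 8 * num_business
        spurious = 2 * (num_business + num_sports)
        total_correct = 10 * num_business
        precision = correct_extracted / (correct_extracted + spurious)
        recall = correct_extracted / total_correct
        return precision, recall

    for sports in (0, 100, 1000):
        p, r = precision_recall(num_business=100, num_sports=sports)
        print(f"sports docs={sports:5d}  precision={p:.2f}  recall={r:.2f}")

    # sports docs=    0  precision=0.80  recall=0.80
    # sports docs=  100  precision=0.67  recall=0.80
    # sports docs= 1000  precision=0.27  recall=0.80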

The fact that precision depends on the distribution of good and bad

documents in the test set is well-known in machine learning, from the

task of classifier evaluation. To evaluate classifiers, it is

preferable to use ROC curves, which are independent of the class

distribution in the test set. The ROC curves summarize graphically the

tradeoffs between the different types of errors. When characterizing a

binary decision process with ROC curves, we plot the true positive

rate (the fraction of true positives correctly classified as

positives, i.e., recall) as the ordinate, and the false positive rate

(the fraction of true negatives incorrectly classified as positives)

as the abscissa.
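
For a binary classifier, each point on the ROC curve comes from a confusion matrix like the following (the counts here are hypothetical):

    # One ROC operating point from a (hypothetical) confusion matrix.
    tp, fn = 80, 20    # positives classified correctly / incorrectly
    fp, tn = 30, 170   # negatives classified incorrectly / correctly

    tpr = tp / (tp + fn)   # true positive rate (recall), the ordinate (y-axis)
    fpr = fp / (fp + tn)   # false positive rate, the abscissa (x-axis)

    print(f"TPR={tpr:.2f}, FPR={fpr:.2f}")  # TPR=0.80, FPR=0.15

Both rates are normalized by the size of the corresponding true class, which is exactly why they do not depend on the mix of positives and negatives in the test set.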

The standard application of ROC curves for information extraction is

unfortunately problematic, for two reasons.

First reason: We typically do not know what a "true negative" is.

Unlike document classification, a "bad tuple" does not exist a priori

in a document. It only exists because the extraction system can

extract it.

* Solution 1: One way to overcome this problem is to measure the number of all bad tuples that can be extracted from a document using all possible settings and all available extraction systems. Then, we can use this number as the normalizing factor to define the false positive rate. This solution works when dealing with a static set of extraction systems. Alas, the definition of the false positive rate becomes unstable if we later introduce another system (or another setting) that generates previously unseen noisy tuples; this changes the number of all bad tuples, which serves as the normalizing constant, and forces recomputation of all false positive rates. (See the sketch after this list.)

* Solution 2: Another way to avoid this problem is to use an un-normalized x-axis (abscissa). Instead of the false positive rate, we can plot the average number of bad tuples generated per document. In this case, the new curve is called the "Free Response Operating Characteristic" (FROC) curve. Such techniques are widely used in radiology to evaluate the performance of systems that detect nodules in MRI and CAT scans. (A nodule refers to a small aggregation of cells, indicative of a disease.) A problem with this approach is the lack of a "probabilistic" interpretation of the x-axis; a probabilistic interpretation can be convenient when we analyze or integrate the extraction system as part of a bigger system, rather than simply measure its performance in a vacuum.
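
The difference between the two options is just the choice of denominator on the x-axis. A minimal sketch with hypothetical per-document counts of bad tuples:

    # Hypothetical counts: bad tuples output by the system under evaluation,
    # and bad tuples in the pooled union over all systems/settings (Solution 1).
    bad_extracted = [2, 0, 1, 3, 0]   # per document, this system
    bad_pool      = [5, 2, 4, 6, 1]   # per document, union over all systems

    # Solution 1: false positive rate, normalized by the pooled bad tuples.
    fpr = sum(bad_extracted) / sum(bad_pool)

    # Solution 2 (FROC): un-normalized x-axis, average bad tuples per document.
    avg_bad_per_doc = sum(bad_extracted) / len(bad_extracted)

    print(f"FPR={fpr:.2f}, avg bad tuples/doc={avg_bad_per_doc:.2f}")
    # FPR=0.33, avg bad tuples/doc=1.20

If a new system later produces previously unseen bad tuples, only the bad_pool numbers change; this is exactly the instability of Solution 1, while the Solution 2 number is unaffected.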

Second reason: It is too expensive to know all the "true positives". You may have noticed that recall is the y-axis in the ROC/FROC curves. Unfortunately, computing recall is a rather expensive task. To do it correctly, we need annotators to read each document and infer which tuples can be extracted from it; the number of all good tuples will then be used as the normalizing constant to compute the recall of each extraction system. This is an expensive task, especially compared to the computation of precision or of the false positive rate: to compute the false positives, we only need to evaluate the extracted tuples, a presumably much easier task than reading a lengthy document.

* Solution 1: We can play the same trick as above to avoid the problem of reading/annotating the documents. We process each document multiple times, using all possible settings and all possible extraction systems. The union of the extracted tuples can then be validated to identify the set of all correct tuples. As in the case of true negatives, though, the definition becomes unstable if we have a dynamic set of extraction systems that can identify more good tuples at some point in the future, forcing a re-calculation of the recall metrics for all systems. (See the sketch after this list.)

* Solution 2: We can also use an un-normalized y-axis. For instance, we can plot as the ordinate (y-axis) the average number of good tuples extracted per document. (I have not seen an analogue of the FROC curves that leaves the true positive rate unnormalized, though.) The downside is that, by leaving recall unnormalized, the values now depend on the distribution of good and bad documents in the input: the more bad documents with no good tuples in the test set, the lower the unnormalized value will be. Therefore, this definition seems to go against the spirit of ROC curves.
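
The same two choices, applied to the y-axis, with hypothetical per-document counts of good tuples:

    # Hypothetical counts: good tuples output by the system under evaluation,
    # and good tuples in the pooled, validated union over all systems (Solution 1).
    good_extracted = [4, 0, 2, 5, 1]   # per document, this system
    good_pool      = [6, 1, 3, 6, 2]   # per document, union over all systems

    # Solution 1: recall against the pooled set of correct tuples.
    recall = sum(good_extracted) / sum(good_pool)

    # Solution 2: un-normalized y-axis, average good tuples per document.
    # Adding documents with no good tuples lowers this number.
    avg_good_per_doc = sum(good_extracted) / len(good_extracted)

    print(f"recall={recall:.2f}, avg good tuples/doc={avg_good_per_doc:.2f}")
    # recall=0.67, avg good tuples/doc=2.40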

I am not aware of any alternative definitions of the ROC curves that

do not have the problems of the solutions that I described above. If

you have any ideas or pointers, post them in the comments.

Labels: evaluation, FROC, information extraction, research, ROC

Posted by Panos Ipeirotis at 7:50 PM


Monday, July 9, 2007

Blending Mind Maps with Wikipedia

Following up on the earlier discussion of how to use wikis to organize the research literature about a given topic.

I have been thinking about the idea of using wikis to keep track of the state of the art in any field. One way to organize the literature is to transform an existing survey article into a wiki and let people edit it at will (e.g., see this article on duplicate record detection).

Another alternative is to keep a set of tables in a wiki, which

summarize the results of a paper in a single line (e.g., see here and

here for NLP-related tasks).

A problem with both approaches is the lack of a network that will

allow people to mark and annotate the relations across different

papers. The idea of using CMapTools (http://cmap.ihmc.us/) was

interesting but I could not manage to install the software to see how

it really works.

Today, I ran into WikiMindMap, a web-based service that uses the wiki structure to convert Wikipedia articles into a mind map. The heuristics that it uses are rather basic, but one could use this tool to organize the literature in a wiki (keeping the advantage of collaboration) and still have all the advantages of a visual tool that can show connections across entities. See, for example, the wikified article mentioned above, transformed into a mind map.

Labels: mind maps, surveys, wikipedia, wikis

Posted by Panos Ipeirotis at 4:49 PM


Friday, July 6, 2007

Data Cleaning Service

Often, I get questions from colleagues and students on how to match

"dirty" data, i.e., string database entries that are similar but not

exactly identical. Although there is a very significant amount of

research literature, there are not that many software packages

available for free. For this reason, I decided to implement the

technique that we described in our WWW2003 paper, and make it

available on the web for anyone to use:

http://hyperion.stern.nyu.edu/SMWebSite/Default.aspx

So, now if you have two lists of names that need to be matched, you can try this service. It works well for small datasets with a few thousand entries each. I am pretty sure that it can scale to much bigger datasets, but it will take some time for the service to finish the job. One of the neat features of the service is that you can submit the job, keep the id of the submitted job, and come back to retrieve the results later.
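
If you just want a quick, local approximation of this kind of matching (this is not the technique from the WWW2003 paper behind the service, only a generic sketch using Python's standard library):

    # Generic approximate matching of two name lists with difflib; NOT the
    # WWW2003 technique, just an illustration of fuzzy string matching.
    from difflib import SequenceMatcher

    list_a = ["Internatl. Business Machines", "Microsoft Corp.", "Google Inc"]
    list_b = ["International Business Machines", "Microsoft Corporation", "Gogle"]

    def best_match(name, candidates):
        scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
                  for c in candidates]
        return max(scored)  # (similarity in [0, 1], best matching candidate)

    for name in list_a:
        score, match = best_match(name, list_b)
        print(f"{name!r:35} -> {match!r} (similarity {score:.2f})")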

If you have any ideas, or any bug reports, feel free to contact me.

Labels: data cleaning, deduplication, demo, research, string matching

Posted by Panos Ipeirotis at 7:38 PM


Wikipedia-based Web Services

We have launched a beta version of some web services that use

Wikipedia data for named entity extraction, for term identification,

and for document expansion. You can take a look and try the services

at:

http://wikinet.stern.nyu.edu/Wikinet/Wikinet.aspx

We are actively working on this, so if you see a bug or some

