How to cite this?

     Rodriguez-Esteban, Raul. Methods in biomedical text mining. Ph.D. Thesis, Columbia University, 2008.

Download the pdf version. It looks nicer.

For any comments, email me: raul AT ee columbia edu


Methods in biomedical text mining
Raul Rodriguez-Esteban
Columbia University, December 2007

Contents

1  Overview of text mining of biomedical interactions
    1.1  Text mining
    1.2  Biomedical text mining
    1.3  Interactions from text
        1.3.1  GENIES
        1.3.2  GeneWays
    1.4  Curation and evaluation
2  Automatic curation of text-mined facts
    2.1  Introduction
    2.2  Approach
    2.3  Methods
        2.3.1  Training data
    2.4  Mathematical background
        2.4.1  Machine-learning algorithms
        2.4.2  Features used in our analysis
        2.4.3  Separating data into training and testing: Cross-validation
        2.4.4  Comparison of methods: Receiver operating characteristic (ROC) scores
    2.5  Results
    2.6  Discussion
3  Overview of biomedical term recognition and classification
    3.1  Introduction
        3.1.1  Term recognition
        3.1.2  Named entity expressions
        3.1.3  Term classification
        3.1.4  Biomedical term recognition and classification
    3.2  Mathematical Background
        3.2.1  Recognition and classification framework
        3.2.2  Conditional random fields
4  Biomedical term recognition and classification using large corpora and search engines
    4.1  Introduction
    4.2  Term recognition
        4.2.1  Text pre-processing and indexing
        4.2.2  Syntactical model
        4.2.3  Term recognition process
    4.3  Term classification
        4.3.1  Features
        4.3.2  Local, regional, global: Word sense disambiguation
5  Six senses: the bleak sensory landscape of biomedical texts
6  A recipe for high impact
    6.1  Ingredients of a scholarly study
    6.2  Information flow through publication-type niches
    6.3  Additional information
        6.3.1  Data
        6.3.2  Analysis
7  How many scientific papers should be retracted?
    7.1  Analyzing retraction patterns
    7.2  Mathematical model to calculate the number of articles that should have been retracted
    7.3  Retraction rates are on the rise
8  Future work and conclusions
    8.1  Future work
    8.2  Conclusion
    8.3  Papers that resulted from the work in this thesis
9  Bibliography

Chapter 1
Overview of text mining of biomedical interactions

The purpose of this introduction is to give a historical overview and background to the project of automatic curation of text-mined data described in Chapter 2. This introduction describes the beginnings of text mining as a discipline and its arrival to the biomedical domain. It also describes the first interaction text mining projects, which precede and set a path to the development of GeneWays. This introduction is necessary to understand the architectural and structural choices made for the design of GeneWays. Finally, this introduction reviews the different efforts made in evaluating text-mined interaction data before the automatic curation project was developed. The aim is not only to contextualize the state of the art and decisions made during the project but also to present the basis on which it stood and the challenges it faced.

1.1  Text mining

The field of text mining is a relatively new discipline born of the knowledge discovery in databases (KDD) and data mining (DM) community. As it is often the case when a discipline is born, it borrowed techniques and approaches from similar, more established fields before establishing its own identity 1.
Alessandro Zanasi claims that the first time he heard the term "text mining" was when it was spoken by Charles Huot in 1994, during the IBM-ECAM (European Centre for Applied Mathematics) [1] workshop in Paris. Whether it was used with the same meaning that it has today, in the context of applications such as information extraction [2] or document classification [3], is unclear. In 1995 and 1996, Ronen Feldman and colleagues offered the first contributions to the field that can be called text mining with more certainty [4,5,6], originally called knowledge discovery in text (KDT). The word mining was soon introduced in 1996 in the context of KDT, followed by the coinage of the name "text mining", a variation of the name data mining. By 1997 the expression text mining had become an accepted name for the new discipline. The new discipline quickly spawned courses, workshops, and books and opened new avenues of research and notable subfields, such as web text mining (1998) and biomedical text mining (1998). Text mining brought together researchers from the KDD and DM communities and from the fields of natural language processing (NLP), automatic knowledge acquisition, information retrieval, and information extraction, to name a few. Text mining became the predominant name for the discipline, widely replacing other names such as KDT, KDT and text mining, textual data mining, and text data mining. That some of these names are still in use reflects not only a stylistic choice but also, in some cases, differences in understanding of aims, scope, and methods.
Marti Hearst [7] was one of the first to summarize the state of the nascent discipline in 1999, attempting to define its scope with respect to other fields such as data mining, computational linguistics, or information retrieval. Hearst stressed that the defining quality of text mining is that its goal is to discover novel information, unlike fields such as information retrieval and data mining. In this respect, text mining is indebted to literature-based discovery, a field championed by Don Swanson beginning with his seminal paper in 1986, "Undiscovered public knowledge" [8,9]. Literature-based discovery was intended to be a systematic search for pieces of knowledge that could be combined to create a novel discovery. Originally, literature-based discovery was largely a non-automated process. Swanson recalled stumbling onto the idea through a serendipitous finding of two unrelated articles that could be combined to answer a question that no other single article answered. His acceptance speech upon receiving the American Society for Information Science and Technology (ASIST) 2000 Award of Merit is worth quoting because it addresses core principles of the text-mining field:
"More than 40 years ago the fragmentation of scientific knowledge was a problem actively discussed but without much visible progress toward a solution; perhaps people then had the consummate wisdom to know that no problem is so big that you can't run away from it. Three aspects of the context and nature of this fragmentation seem notable:
1. The disparity between the total quantity of recorded knowledge, however it might be measured, and the limited human capacity to assimilate it, is not only enormous now but grows unremittingly. Exactly how the limitations of the human intellect and life span affect the growth of knowledge is unknown. Metaphorically, how can the frontiers of science be pushed forward if, someday, it will take a lifetime just to reach them? [...]
2. In response to the information explosion, specialties are somehow spontaneously created, then grow too large and split further into subspecialties without even a declaration of independence. One unintended result is the fragmentation of knowledge owing to inadequate cross-specialty communication. And as knowledge continues to grow, fragmentation will inevitably get worse because it is driven by the human imperative to escape inundation.
3. Of particular interest to me is the possibility that information in one specialty might be of value in another without anyone becoming aware of the fact. Specialized literatures, or other "units" of knowledge, that do not intercommunicate by citing one another may nonetheless have many implicit textual interconnections based on meaning. Indeed the number of unintended or implicit text-based connections within the literature of science may greatly exceed the number that are explicit, because there are far more possible combinations of units (that potentially could be related) than there are units. The connection explosion may be more portentous than the information explosion."
Heart's opinion is shared by Ananiadou and McNaught [10] and others [11]: "The primary goal of text mining is to retrieve knowledge that is hidden in text, and to present the distilled knowledge to users in a concise form". However, a more common point of view, first proposed by Ronen Feldman, defines text mining as different from data mining only because it deals with data that by its nature is unstructured, unlike data organized in databases, which are the primary source for data mining [4,12,13,14,15]. Kao and Poteet [16] go even further, stating that "Text mining is the discovery and extraction of interesting, non-trivial knowledge from free or unstructured text. This encompasses everything from information retrieval (i.e., document or web site retrieval) to text classification and clustering, to (somewhat more recently) entity, relation, and event extraction." In practice, this expansive view of text mining is not shared by many others, especially considering that information retrieval or text classification predate text mining by many years. Kao and Poteet's opinion implies that text mining is an umbrella term covering a laundry list of textual processing methods. A more common view seems to be that the aim of text mining is to find interesting, useful, or valuable patterns-that are not necessary novel-in text collections. This perspective places text mining closer to knowledge acquisition and information extraction.
Given the fuzzy lines that separate text mining from similar fields, it is not clear whether it can be defined meaningfully beyond a mix of different conceptions held by different researchers. The confusion is compounded further because applications from related fields may be regarded as necessary processing steps for effective text mining. In other words, text-mining projects might require sub-tasks from other fields. Therefore, text mining in some contexts might be used for the sole purpose of indicating the scientific agenda in which the study should be considered, not for defining the task itself as "text mining". Furthermore, as other fields have built on advances in text mining, text mining also has become an intermediate step in projects of different nature.
Related disciplines such as semantic analysis, text analysis, information retrieval, information extraction, and knowledge acquisition have a much older pedigree within the computation and information sciences than does text mining. Like text mining, they derive from activities that originally could be handled by human intellect and rudimentary record-keeping but became more complex with the progressive accumulation of knowledge and information. Fielden [17] plotted the evolution of the size of information repositories over the course of human history, showing an exponential growth in the last decades. More comprehensively, Peter Lyman and Hal Varian led a study designed to estimate the quantity of information produced worldwide every year [18,19]; they estimated a grand total of 5 exabytes2 , or 800 megabytes per person per year, of which 92% were in magnetic storage. Printed text represented 33 terabytes, whereas the "surface internet" accounted for 167 terabytes and the "deep internet" (or database, dynamically-generated pages) for about 92 petabytes). This unparalleled growth has been accompanied by extraordinary improvements in the devices and methods in the different computation and information sciences. Text mining, a late arrival, has the advantage of drawing from an extensive set of diverse techniques developed not only in the related disciplines, but also in other fields such as machine learning, artificial intelligence, probabilistic analysis, statistics, pattern recognition, data management, and information theory. While other disciplines, like information retrieval, fledged out before the current pervasive use and availability of electronic text, text mining was born in a seemingly limitless and growing frontier of resources and opportunities. Text miners, in turn, have acted like they have a hammer and see a nail in everything. Perhaps this is the best explanation for the success of text mining: Applications have driven its evolution [1].
Given the fragmentary state of the field, it is not surprising that there is not currently a journal that specializes in text mining. The door is open for further transformation of the text-mining domain, whether in terms of its buzz or its consolidation in the spectrum of computation and information sciences.

1.2  Biomedical text mining

The first attempts at text mining of the biomedical literature date back to 1998. As explained in Section 1.1, the label "text mining" may have consumed some areas that formerly went by a different name, such as knowledge acquisition and information extraction. Text mining builds on previous informatics and computational work on semantic analysis, dictionary creation, knowledge acquisition, classification, etc. Its application to the biomedical realm is a natural extension given the existing opportunities: exponential growth of the literature-both in size and in electronic availability-; the gradual shift to electronic medical records; the on-going work in annotated resources (e.g., Gene Ontology (GO), Online Mendelian Inheritance in Man (OMIM), Swissprot); and the increasing need for integration between information sources of disparate origin, also known as integromics [20]. The internet is the main engine that has fuelled this growth. Even though computers and electronic communications long predated the internet, it is the internet that has crystallized change because it has dramatically lowered the cost of information access and exchange and brought to the social forefront the challenges and opportunities of biomedical electronic information (e.g., the Health Insurance Portability and Accountability Act of 1996; the Open Access movement).
Integromics is proving to be of crucial importance in current developments as more data are becoming available in different formats in electronic and on-line form, including supplementary information tables, genome linkage maps, DNA sequences, taxonomies, ontologies, hospital medical records, and semi-structured forms (e.g., questionnaires), etc. An example is Medline [21], an exponentially growing biomedical bibliography that accounts for upwards of 16 million articles. In many cases, Medline has references to full-text articles that may be retrieved with the appropriate licenses. However, information related to those articles, like on-line repositories or supplementary text and tables, is harder to access. Medline's growth can be considered even more dramatic if we include the "deep Medline" trove of additional resources that are ready for mining.
Biomedical text miners may claim the superiority of text-mined data over other resources, especially over manually curated data. Text mining casts a wide net over the biomedical spectrum, allowing individual researchers to deal with Swanson's three arguments for library knowledge discovery (see Section 1.1). The resulting catch is larger than can typically be gathered manually. As of October 2007, the hand-curated Database of Interacting Proteins (DIP, [22]) held 56,186 interactions. Perhaps the most extensive effort in manual literature-derived extraction of interactions is BioGRID [23], with 70,000 interactions. In comparison, some text-mining interaction repositories hold over half a million interactions (see Section 1.3.2). The NCBI Gene Expression Omnibus repository of microarray expression datasets contains about half a billion data samples [24]. Hence, text-mined biomedical data has a place within the suite of tools available to biomedical informaticians and researchers. For the examples given, this place lies somewhere between high throughput methods and manually hand-curated sets, each with its own niche applications. The challenge for biomedical text mining is to assert its usefulness both for acquiring information with quality that approaches (or surpasses) hand-curated data and for reaching the widest coverage for system-wide analysis (e.g., characterizing complex diseases [25]).
Some applications in biomedical text mining have mirrored those of text mining at large, like document classification, data integration, literature-based discovery, and literature analysis (e.g., scientific trends and emerging topics [26]). Others have been more specific to biomedicine, such as biomedical annotation, phenome/phenotype analysis, public health informatics (e.g., news analysis [27], hospital rankings [28]), clinical informatics, and nursing informatics. The most flourishing areas, however, may be loosely defined as those closely linked to systems biology [29,30] and medical text mining. The former deals with such topics as biomedical interaction extraction, functional analysis, or genome annotation among others (see a list of main tools and repositories in [31]). The latter deals with the range of narratives found in the textual supports associated with clinical settings, from the ICU bedside to the clinical trials desk. Biomedical text-mining articles are published mostly in journals and conferences in biomedical informatics and computational biology, and sometimes in non-informatics journals like Genome Biology.

1.3  Interactions from text

Systems biology has been a hotbed for developments in biomedical text mining, as mentioned in Section 1.2. One of the focuses has been on interactions between different types of molecules, especially proteins (i.e., PPI, protein-protein interactions). The success of PPI can be seen in its application for integrative studies, the popularity of its tools, and its use as support for public databases like DIP [32], MINT [33], and BIND [34]. The interactions, taken from a functional genomics point of view, range from physical (e.g. protein binding) to indirect (e.g., proteins in the same pathway but not physically interacting) interactions to other phenomena, such as co-expression. The interaction triplet has been an important text-mining interaction model since its inception. This triplet consists of the two elements that are involved in the interaction and the verb or action word that relates them. In the GeneWays ontology [35], the elements of the triplet are called the upstream term, downstream term, and action. Triplets usually are taken from single sentences. For example, in the sentence "Gene A activates gene B.", "gene A" is the upstream term, "gene B" is the downstream term, and "activate" is the action. This model was first introduced in a preliminary study by Sekimizu and colleagues [36], in which they sought verbs that could characterize gene-gene interactions. Rindflesch and colleagues [37] experimented with sentences that included the verb "bind".
An alternative model to triplets is used in co-occurrence studies. Stapley and Benoit [38], for example, used co-occurrence in a study of selected PubMed abstracts, followed later by the larger-scale project PubGene [39]. With this method, interactions are inferred from co-occurrence statistics of two terms in documents. If the terms co-occur in text more often than could be expected, it is argued, that there is basis to suggest that they may be related. Co-occurrence is a statistical method used early in information retrieval. It has been used in different types of analyses but, due to its limitations, it has not become a method of general use in the text mining of interactions. Co-occurrence, for example, is of limited help in distinguishing interactions of very low frequency. Another drawback of this method is that the nature of the interaction is lost.
Some of the problems that biomedical interaction extraction entails are common to biomedical texts at large, such as extensive and open-ended vocabulary, erratic abbreviations, word sense ambiguity, and convoluted sentences. Others are more specific. For example, negative particles or words with negative meaning may completely change the meaning of an interaction, e.g., "We could not find any interaction between gene A and gene B" (for a study on a negative interactome, see [40]). Anaphora is another challenging problem that rarely is tackled (although see [41]). Anaphora refers to situations in which the name of an object is elided, generally because a pronoun is used to avoid repetition, e.g. "It activates gene B." The challenge is to identify the object to which "it" refers. More generally speaking, biomedical interaction extraction faces the hurdles of the different pre-processing steps plus the complexity of identifying the interactions themselves.
Blaschke and colleagues [42] proposed an early rule-based model for interaction extraction that tried to capture a simple lexical pattern in sentences: "protein A - action - protein B". The names of the proteins and the action were identified using controlled vocabularies. Thomas and colleagues [43] used syntactic instead of lexical patterns (an example of a syntactic pattern is noun phrase - verb - noun phrase) to find triplet candidates that were then narrowed down through a hand-crafted scoring system. The syntactic analysis performed was of the shallow type, which can be done more quickly than deep or full parsing and it only identifies units at the syntagma level of the sentence (e.g., noun phrases and verb phrases). Proux and colleagues [44] developed the approaches used in [43] and [42] by using first syntactic parsing and then applying lexical patterns to find interactions. Similar approaches were explored by [45] and [46], although they did not report to have fully implemented them. Blaschke and colleagues created a generalized pattern approach, calling these patterns "frames" [47,48]. Frames are flexible patterns that may include additional information to enrich the analysis (e.g., the distance in words between the interaction terms of the sentence). Park and colleagues [49] and Yakushiji and colleagues [50] went further by including full syntactic parsing in their systems. Full parsing allows for categorization of all syntactic dependencies among the words of a sentence. The GENIES parser [51] was born within this context of incipient improvement and tests of new approaches.

1.3.1  GENIES

GENIES [51] evolved from MedLee [52], a medical natural-language processing application in use at the Clinical Information System (CIS) of New York Presbyterian Hospital. MedLee was inspired by the sub-language theory of Zellig Harris [53], its main trait was its semantic grammar, which combined grammatical patterns and lexical information to capture different structures. In contrast to other semantic grammar systems, pattern matching in MedLee must be exact. Only the sequences that conform precisely to a specific pattern both grammatically and semantically are considered. This approach was chosen to extract information both efficiently and reliably.
The MedLee processing pipeline consists of several parts:
GENIES uses the pre-processor and parser modules from MedLee adapting them to the systems biology domain. The semantic categories in GENIES (e.g., amino acid, cell, complex, domain, DNA region, etc.) mostly differ from those in MedLee (e.g., body location, finding, device, disease, procedure, etc.) although a few overlap (e.g., certainty, connective). The grammar rules of the parser module were adapted to the new semantic categories. An important difference from Harris's sub-language theory is that the patterns were constructed manually from perceived patterns of interest in biomedical texts, whereas Harris proposed using statistical methods. Furthermore, GENIES receives its input from a term tagger that uses BLAST [54] pattern-matching algorithm to recognize a term even if it is written with slight variations [55].

1.3.2  GeneWays

GeneWays [35] is a system designed for the automatic extraction of signal transduction pathways, although, more broadly, it can be characterized as a system designed to capture gene, protein, and small molecule interactions. GeneWays 6.0 stores 4,035,759 relationships (of which 2,652,916 are unique) from 232,265 full-length articles published between 1994 and 2004 in 78 journals, it is the largest repository in its class. GeneWays's major strength is its use of an extensive collection of full-text articles. Other prime examples of large-scale interaction repositories are:
  1. iHOP [56], with 30,000 different genes and half a million sentences (and 500,000 website hits per month [57])
  2. PubGene [39], with 1,087,757 relationships (139,756 unique) and 13,712 genes
  3. PRIME, with 920,000 unique protein interactions [58]
At its core, GeneWays is GENIES, and it has built around GENIES the following set of modules for extra processing and display:
Hence, GeneWays covers a number of steps including retrieval, processing, storage, and display, that allow end users to select their information of interest. GeneWays data can be used in multiple ways to furthering research of different issues in systems biology, text mining, and scientometrics [61,62,63,64,65].

1.4  Curation and evaluation

The idea of automatically curating a text-mined knowledge base was first proposed in the original GeneWays paper as an extra module called the AI curator. It was presented in the following way [35]:
"Note that the automatically generated knowledge base is of necessity noisy: the GeneWays system extracts some percentage of statements incorrectly, and, even among correctly extracted statements, we should expect redundancy and contradictions. Therefore, the database requires curation, a process in which the original statements are annotated with statements regarding confidence in the corresponding information. The traditional way to perform such curation is through manual labor of human experts-a monumental task even for the database at its current size of roughly 3 million redundant statements extracted from 150,000 articles. To reduce the manual work, we are implementing a Curator module that would allow GeneWays to compute the estimates of reliability automatically."
Curation is considered a step in the process of ascertaining the truth of certain facts, especially for the scientist who is confronted with multiple, sometimes conflicting, pieces of information and who needs to make decisions within the knowledge pocket of her particular scientific specialization [66,63,65]. Curation also provides a way to assign a value to our degree of confidence about a fact within a continuous scale of truth. The value of truth assigned is an attempt to represent our limited ability to completely understand a text, or even the limited ability of the writer to express what she wants to say.
Evaluation is a central part of curation. Systems biology studies face the difficult task of measuring recall in a broad and intricate search space combined with the limitations of manual evaluation of precision. Many evaluations in the literature, including many of those cited herein, were not described in detail, which makes it hard to establish their characteristics. Often, they entail in-house evaluations in which unnamed experts follow protocols that are not detailed. This is understandable from the point of view that, in most cases, evaluations are considered a necessary, but not central, contribution. Friedman and Hripcsak [67] exposed a number of pitfalls in the task of evaluating NLP systems and defined 20 criteria to avoid them (see Table 1.4).
Minimizing Bias
1. The developer should not see the test set of documents.
2. If domain experts are used to determine the reference standard, they should not be developers of the system or designers of the study.
3. The developer should not perform the evaluation.
4. The NLP system should be frozen prior to the testing phase.
5. If generalizability of the processor is being tested, the developer should not know details of the study beforehand.
6. Ideally, the person designing the evaluation study should not be a developer of the system.
Establishing a Reference Standard
7. If domain experts are used to determine the reference standard, there should be a sufficient number to assess variability of the reference standard.
8. The test set should be large enough in that there is sufficient power to distinguish levels of performance.
9. The choice of the reference standard should be based on the objectives of the study (e.g. extraction capability vs. performance in an application).
10. If domain experts are used to determine the reference standard, the type of expert should be appropriate (e.g. radiologist vs. internist).
Describing the Evaluation Methods
11. The method used to determine the reference standard should be clearly described, particularly if domain experts were used.
12. The manner in which the test documents were chosen should be described.
13. Methods used to calculate performance measures should be clearly presented and if non-standard measures are used, they should be described.
Presenting Results
14. Performance measures should relate to the complete test set.
15. If human experts are used, inter-rater and intra-rater agreement should be given.
16. Confidence intervals should be given for all measures.
Discussing Conclusions
17. Limitations of the study should be discussed.
18. Results should be presented in light of requirements of the target application.
19. Overgeneralization of the results should be avoided.
20. An analysis of system failures should be given along with a discussion concerning the degree of difficulty of needed corrections.
Table 1.1: Criteria for Evaluating Performance of NLP Systems. [67]
In Sekimizu and colleagues' seminal paper [36], two experts evaluated a random set of several hundred assertions with typical interaction verb connectors (activate, interact, encode, regulate, prevent, contain, inhibit, or bind). The researchers identified a different precision associated with each action type, a phenomenon also noted by Ono and colleagues [68]. Rindflesch and colleagues [37] used a test set of manually annotated sentences as gold standard to evaluate their application's ability to identify the action type "bind". Blaschke and colleagues [42] used two networks of known protein interactions as prediction targets for their system. They trained their system with selected Medline abstracts that included information about these networks, and then tested to see whether it had learned the network correctly. Stapley and Benoit [38] focused on relationships extracted from Medline documents with the MeSH term "DNA repair". Reducing the search space to a specific domain eliminated the need to choose random documents from a repository (e.g. Medline), as was done initially in [36]. A limited evaluation, however, reduced the generalizability of the results. Thomas and colleagues [43] proposed, but did not implement, a scoring system to measure the level of certainty that a relationship has been well extracted by using as a template a pattern of co-occurrence. The scoring system would assign a score to each template based on three factors:
  1. Textual context (i.e., neighborhood of words and sentences).
  2. Degree of confidence that the term is a protein.
  3. Frequency in which the relationship appears.
However, they implemented a score based on the degree of likelihood that the terms of a given template are proper names. A template was considered more reliable if it identified relationships whose terms are proper names. Their rationale was that proper names are more likely to indicate protein names. Their scoring point scale (or scoring strategy) was, otherwise, arbitrary and it was used to make a preliminary filter of results. Proux and colleagues [44] used 200 sentences pre-evaluated by experts (evaluated as correct, incorrect, or undecided) as evaluation. This method became the most commonly used. Blaschke and Valencia [47] used word distance between terms and actions to yield estimated likelihoods of precision. They used a heuristic approach to compute probabilities for different word distances in order to give each result an estimated precision likelihood. Jenssen and colleagues [39] used actual micro-array expression data as a gold-standard for their gene co-occurrence text mining. The relationships found were compared to micro-array co-expression results.
In-house expert evaluation has been the most common method both for result evaluation and for gold-standard generation. Efforts like the Critical Assessment of Information Extraction (BioCreAtIvE) [69], which followed the lead of the successful Critical Assessment of PRediction of Interactions (CAPRI) [70], have raised awareness in the biomedical text-mining community about the use of common evaluation test sets and standards. Daraselia and colleagues [71] performed a manual evaluation supplemented by cross-comparisons with the DIP and BIND databases. These databases are manually curated repositories that use co-occurrences as a pre-screening filter. Chen and Sharp [72] went further in this approach and used a set of interactions selected from DIP as a prediction target for their system. Hakenberg and colleagues [73] created evaluation sets from articles referenced in DIP and the annotated BioCreAtIvE corpus. Koike and colleagues [58] used abstracts from GO term annotations. The main strength of evaluation techniques that use publicly available, manually curated corpora is that comparisons can be made between evaluation results of different applications.
The state of evaluation in the text mining of biomedical interactions sets the stage for the automatic curation project in Chapter 2, the aim of which is to go beyond existing evaluation schemes as so far described.

Chapter 2
Automatic curation of text-mined facts

...he will throughly purge his floor, and gather his wheat into the garner; but he will burn up the chaff with unquenchable fire.
Matthew 3:12 [74]
Synopsis
Current automated approaches for extracting biologically important facts from scientific articles are imperfect: while being capable of efficient, fast and inexpensive analysis of enormous quantities of scientific prose, they make errors. In order to emulate the human experts evaluating the quality of the automatically extracted facts, we have developed an artificial intelligence program ("a robotic curator") that closely approaches human experts in the quality of distinguishing the correctly extracted facts from the incorrectly extracted ones.
Figure 2.1: Cocaine: the predicted accuracy of individual text-mined facts involving semantic relation stimulate. Each directed arc from an entity A to an entity B in this figure should be interpreted as a statement "A stimulates B", where, for example, A is cocaine and B is progesterone. The predicted accuracy of individual statements is indicated both in color and in width of the corresponding arc. Note that, for example, the relation between cocaine and progesterone was derived from multiple sentences, and different instances of extraction output had markedly different accuracy. Altogether we collected 3,910 individual facts involving cocaine. So long as the same fact can be repeated in different sentences, only 1,820 facts out of 3,910 were unique. The facts cover 80 distinct semantic relations, out of which stimulate is just one example.

2.1  Introduction

Information extraction uses computer-aided methods to recover and structure meaning that is locked in natural-language texts. The assertions uncovered in this way are amenable to computational processing that approximates human reasoning. In the special case of biomedical applications, the texts are represented by books and research articles, and the extracted meaning comprises diverse classes of facts, such as relations between molecules, cells, anatomical structures, and maladies.
Unfortunately, the current tools of information extraction produce imperfect, noisy results. Although even imperfect results are useful, it is highly desirable for most applications to have the ability to rank the text-derived facts by the confidence in the quality of their extraction (as we did for relations involving cocaine, see Figure 2.1). We focus on automatically extracted statements about molecular interactions, such as small molecule A binds protein B, protein B activates gene C, or protein D phosphorylates small molecule E. (In the following description we refer to phrases that represent biological entities (such as small molecule A, protein B, and gene C) as terms, and to biological relations between these entities (such as activate or phosphorylate) as relations or verbs.)
Several earlier studies have examined aspects of evaluating the quality of text-mined facts, as explained in Section 1.4. For example, Sekimizu et al. and Ono et al. attempted to attribute different confidence values to different verbs that are associated with extracted relations, such as activate, regulate, and inhibit [36,68]. Thomas et al. proposed to attach a quality value to each extracted statement about molecular interactions [43], although the researchers did not implement the suggested scoring system in practice. In an independent study [47], Blaschke and Valencia used word-distances between biological terms in a given sentence as an indicator of the precision of extracted facts. In our present analysis we applied several machine-learning techniques to a large training set of 98,679 manually evaluated examples (pairs of extracted facts and corresponding sentences) to design a tool that mimics the work of a human curator who manually cleans the output of an information-extraction program.

2.2  Approach

Our goal is to design a tool that can be used with any information-extraction system developed for molecular biology. In this study, our training data came from the GeneWays project (specifically, GeneWays 6.0 database, [51,35]) and thus our approach is biased toward relationships that are captured by that specific system3. We believe that the spectrum of relationships represented in the GeneWays ontology is sufficiently broad that our results will prove useful for other information-extraction projects.
Our approach followed the path of supervised machine-learning. First, we generated a large training set of facts that were originally gathered by our information-extraction system, and then manually labeled as "correct" or "incorrect" by a team of human curators. Second, we used a battery of machine-learning tools to imitate computationally the work of the human evaluators. Third, we split the training set into ten parts, so that we could evaluate the significance of performance differences among the several competing machine-learning approaches.
Figure 2.2: Accuracy of the raw (non-curated) extracted relations in the GeneWays 6.0 database. The accuracy was computed by averaging over all individual specific information extraction examples manually evaluated by the human curators. The plot compactly represents both the per-relation accuracy of the extraction process (indicated with the length of the corresponding bar) and the abundance of the corresponding relations in the database (represented by the bar color). There are relations extracted with a high precision; there are also many noisy relationships. The database accuracy was markedly increased by the automated curation outlined in this study, see Figure 2.3.
Figure 2.3: Accuracy and abundance of the extracted and automatically curated relations. This plot represents both the per-relation accuracy after both information extraction and automated curation were done. Accuracy is indicated with the length of the relation-specific bars, while the abundance of the corresponding relations in the manually curated data set is represented by color. Here, the MaxEnt 2 method was used for the automated curation. The results shown correspond to a score-based decision threshold set to zero; that is, all negative-score predictions were treated as "incorrect." An increase in the score-based decision boundary can raise the precision of the output at the expense of a decrease in the recall-see Figure 2.10.

2.3  Methods

2.3.1  Training data

With the help of a text-annotation company, ForScience Inc., we generated a training set of approximately 45,000 repeatedly-annotated unique facts, or almost 100,000 independent evaluations. These facts were originally extracted by the GeneWays pipeline, then were annotated by biology-savvy doctoral-level curators as "correct" or "incorrect," referring to quality of information extraction. Examples of automatically extracted relations, sentences corresponding to each relation, and the labels provided by three evaluators are shown in Table 2.1.
Each extracted fact was evaluated by one, two, or three different curators. The complete evaluation set comprised 98,679 individual evaluations performed by four different people, so most of the statement-sentence pairs were evaluated multiple times, with each person evaluating a given pair at most once. In total, 13,502 statement/sentence pairs were evaluated by just one person, 10,457 by two people, 21,421 by three people, and 57 by all four people. Examples of both high inter-annotator agreement and low-agreement sentences are shown in Table 2.2.
The statements in the training data set were grouped into chunks; each chunk was associated with a specific biological project, such as analysis of interactions in Drosophila melanogaster. Pair-wise agreement between evaluators was high (92%) in most chunks4, with the exception of a chunk of 5,271 relations where agreement was only 74%. These relatively low-agreement evaluations were not included in the training data for our analysis5.
Figure 2.4: Sentence Evaluation Tool. Evaluators choose from different options. A triplet can be either correctly extracted, incorrectly extracted or the evaluator is "unable to decide". Correctly extracted triplets may be hypothetical. Triplets may have been incorrectly extracted for several reasons: incorrect upstream term, incorrect downstream term, incorrect action verb (action type), missing or extra negation, wrong upstream vs. downstream order (upstream term is downstream or vice versa) or that the sentence does not support the action presented. Other possibilities are: sentence boundary error (e.g. two sentences were presented as one) or the action is incorrect biologically.
To facilitate evaluation, we developed a Sentence Evaluation Tool implemented in Java programming language by Mitzi Morris and Ivan Iossifov, Figure 2.4. This tool presented to an evaluator a set of annotation choices regarding each extracted fact; the choices are listed in Table 2.2. The tool also presented in a single window the fact itself and the sentence it was derived from. In the case a broader context was required for the judgment, the evaluator had a choice to retrieve the complete journal article containing this sentence by clicking a single button on the program interface.
For convenience of representing the results of manual evaluation, we computed an evaluation score for each statement as follows. Each sentence-statement score was computed as a sum of the scores assigned by individual evaluators; for each evaluator, -1 was added if the expert believed that the presented information was extracted incorrectly, and +1 was added if he or she believed that extraction was correct. For a set of three experts, this method permitted four possible scores: 3 (1,1,1), 1 (1,1,-1), -1 (1,-1,-1), and -3. Similarly, for just two experts, the possible scores are 2 (1, 1), 0 (1,-1), and -2(-1,-1).6
Sentence [Source] Extracted relation Evaluation (Confidence)
NIK binds to Nck in cultured cells. [76] nik bind nck Correct (High)
One is that presenilin is required for the proper trafficking of Notch and APP to their proteases, which may reside in an intracellular compartment. [77] presenilin required for notch Correct (High)
Serine 732 phosphorylation of FAK by Cdk5 is important for microtubule organization, nuclear movement, and neuronal migration. [78] cdk5 phosphorylate fak Correct (High)
Histogram quantifying the percent of Arr2 bound to rhodopsin-containing membranes after treatment with blue light (B) or blue light followed by orange light (BO). [79] arr2 bind rhodopsin Correct (Low)
It is now generally accepted that a shift from monomer to dimer and cadherin clustering activates classic cadherins at the surface into an adhesively competent conformation. [80] cadherin activate cadherins Correct (Low)
Binding of G to CSP was four times greater than binding to syntaxin. [81] csp bind syntaxin Incorrect (Low)
Treatment with NEM applied with cGMP made activation by cAMP more favorable by about 2.5 kcal/mol. [82] camp activate cgmp Incorrect (Low)
This matrix is likely to consist of actin filaments, as similar filaments can be induced by actin-stabilizing toxins (O. S. et al., unpublished data). [83] actin induce actin Incorrect (High)
A ligand-gated association between cytoplasmic domains of UNC5 and DCC family receptors converts netrin-induced growth cone attraction to repulsion. [84] cytoplasmic domains associate unc5 Incorrect (High)
Table 2.1: Sentence examples.
A sample of sentences that were used as an input to automated information extraction (the first column), biological relations extracted from these sentences (either correctly or incorrectly, the second column), and the corresponding evaluations provided by 3 human experts (the third column). A high-confidence label corresponds to a perfect agreement among all experts; a low-confidence label indicates that one of the experts disagreed with the other two.
Term level
Upstream term is a junk substance
Action is incorrect biologically
Downstream term is a junk substance
Relation level
Correctly extracted
Sentence is hypothesis, not fact
Unable to decide
Incorrectly extracted
Incorrect upstream
Incorrect downstream
Incorrect action type
Missing or extra negation
Wrong action direction
Sentence does not support the action
Sentence level
Wrong sentence boundary
Table 2.2: List of annotation choices available to the evaluators. The term "action" refers to the type of the extracted relation. For example, in statement A binds B "binds" is the action, "A" is the upstream term, and "B" is the downstream term. Action direction is defined as upstream to downstream, and "junk substance" is an obviously incorrectly identified term/entity.

2.4  Mathematical background

2.4.1  Machine-learning algorithms

General framework

The objects that we want to classify, the fact-sentence pairs, have complex properties. We want to place each of them into one of two classes, correct or incorrect. In the training data, each extracted fact is matched to a unique sentence from which it was extracted, even though multiple sentences can express the same fact and a single sentence can contain multiple facts. The ith object (the ith fact-sentence pair) comes with a set of known features or properties that we encode into a feature vector, Fi:

Fi = (fi,1, fi,2, ¼, fi,n).
(2.1)
In the following description we use C to indicate the random variable that represents class (with possible values ccorrect and cincorrect), and F to represent a 1 ×n random vector of feature values (also often called attributes), such that Fj is the jth element of F. For example, for fact p53 activates JAK, feature F1 would have value 1 because the upstream term p53 is found in a dictionary derived from the GenBank database [85]; otherwise, it would have value 0.

Full Bayesian inference

The full Bayesian classifier assigns the ith object to the kth class if the posterior probability P(C=ck|F=Fi) is greater for the kth class than for any alternative class. This posterior probability is computed in the following way (a re-stated version of Bayes' theorem).

P(C=ck|F=Fi)=P(C=ck) × P(F=Fi|C=ck)

P(F=Fi)
.
(2.2)
In the real-life applications, we estimate probability P(F=Fi|C=ck) from the training data as a ratio of the number of objects that belong to the class ck and have the same set of feature values as specified by the vector Fi to the total number of objects in class ck in the training data.
In other words, we estimate the conditional probability for every possible value of the feature vector F for every value of class C. Assuming that all features can be discretized, we have to estimate

(v1 ×v2 ×¼vn - 1) ×m
(2.3)
parameters, where vi is the number of discrete values observed for the ith feature and m is the number of classes.
Clearly, even for a space of only 20 binary features7 the number of parameters that we would need to estimate is (220-1) ×2 = 2,097,150, which exceeds several times the number of data points in our training set.

Naïve Bayes classifier

The most affordable approximation to the full Bayesian analysis is the Naïve Bayes classifier. It is based on the assumption of conditional independence of features:

P(F=Fi|C=ck)
=
P(F1=fi,1|C=ck)
×
P(F2=fi,2|C=ck) ¼
×
P(Fn=fi,n|C=ck).
(2.4)
Obviously, we can estimate P(Fj=fi,j|C=ck)'s reasonably well with a relatively small set of training data, but the assumption of conditional independence (Equation 2.5) comes at a price: the Naïve Bayes classifier is usually markedly less successful in its job than are its more sophisticated relatives.8
In an application with m classes and n features (given that the ith feature has vi admissible discrete values), a Naïve Bayes algorithm requires estimation of m ×åi=1,n (vi -1) parameters (which value, in our case, is equal to 4,208).
Figure 2.5: The correlation matrix for the features used by the classification algorithms.
The half-matrix below the diagonal was derived from analysis of the whole GeneWays 6.0 database; the half-matrix above the diagonal represents a correlation matrix estimated from only the manually annotated data set. The white dotted lines outline clusters of features, suggested by analysis of the annotated data set; we used these clusters in implementation of the Clustered Bayes classifier. We used two versions of the Clustered Bayes classifier: with all 68 features (Clustered Bayes 68), and with a subset of only 44 features, but higher number of discrete values allowed for non-binary features (Clustered Bayes 44). The Clustered Bayes 44 classifier did not use features 1, 6, 7, 8, 9, 12, 27, 28, 31, 34, 37, 40, 42, 47, 48, 49, 52, 54, 55, 60, 62, 63, and 65.

Middle ground between the full and Naïve Bayes: Clustered Bayes

We can find an intermediate ground between the full and Naïve Bayes classifiers by assuming that features in the random vector F are arranged into groups or clusters, such that all features within the same cluster are dependent on one another (conditionally on the class), and all features from different classes are conditionally independent. That is, we can assume that the feature random vector (F) and the observed feature vector for the ith object (Fi) can be partitioned into sub-vectors:

F
=
(F1, F2, ¼, FM), and
(2.5)
Fi
=
(fi,1, fi,2, ¼, fi,M),
(2.6)
respectively, where Fj is the jth cluster of features; fi,j is the set of values for this cluster with respect to the ith object, and M is the total number of clusters of features.
The Clustered Bayes classifier is based on the following assumption about conditional independence of clusters of features:

P(F=Fi|C=ck)
=
P(F1=fi,1|C=ck)
×
P(F2=fi,2|C=ck) ¼
×
P(FM=fi,M|C=ck).
(2.7)
We tested two versions of the Clustered Bayes classifier: one version used all 68 features (Clustered Bayes 68) with a coarser discretization of feature values; another version used a subset of 44 features (Clustered Bayes 44) but allowed for more discrete values for each continuous-valued feature, see legend to Figure 2.5.

Linear and quadratic discriminants

Another method that can be viewed as an approximation to full Bayesian analysis is Discriminant Analysis invented by Sir Ronald A. Fisher [86]. This method requires no assumption about conditional independence of features; instead, it assumes that the conditional probability P(F=Fi|C=ck) is a multivariate normal distribution.

P(F=Fi|C=ck)= e-[1/2](Fi-mk)¢Vk-1(Fi-mk)


Ö

(2p)n |Vk|
,
(2.8)
where n is the total number of features/variables in the class-specific multivariate distributions. The method has two variations. The first, Linear Discriminant Analysis, assumes that different classes have different mean values for features (vectors mk), but the same variance-covariance matrix, V = Vk for all k.9 In the second variation, Quadratic Discriminant Analysis (QDA), the assumption of the common variance-covariance matrix for all classes, is relaxed, such that every class is assumed to have a distinct variance-covariance matrix, Vk.10
In this study we present results for QDA; the difference from the linear discriminant analysis was insignificant for our data (not shown). In terms of the number of parameters to estimate, QDA uses only two symmetrical class-specific covariance matrices and the two class-specific mean vectors. For 68 features the method requires estimation of 2 ×(68 ×69)/2 + 2 ×68 = 4,828 parameters.

Maximum-entropy method

The current version of the maximum-entropy method was formulated by E.T. Jaynes [87,88]; the method can be traced to earlier work by J. Willard Gibbs. The idea behind the approach is as follows. Imagine that we need to estimate a probability distribution from an incomplete or small data set-this problem is the same as that of estimating the probability of the class given the feature vector, P(C=ck|F=Fi), from a relatively small training set. Although we have no hope of estimating the distribution completely, we can estimate with sufficient reliability the first (and, potentially, the second) moments of the distribution. Then, we can try to find a probability distribution that has the same moments as our unknown distribution and the highest possible Shannon's entropy-the intuition behind this approach being that the maximum-entropy distribution will minimize unnecessary assumptions about the unknown distribution. The maximum-entropy distribution with constraints imposed by the first-order feature moments alone (the mean values of features) is known to have the form of an exponential distribution [89]:

P(C=ck|F=Fj)=
exp æ
è
- n
å
i=1 
li,k fj,i ö
ø

2
å
l=1 
exp æ
è
- n
å
i=1 
li,l fj,i ö
ø
,
(2.9)
and the maximum-entropy distribution for the case when both the first- and the second-order moments of the unknown distribution are fixed has the form of a multidimensional normal distribution [89]. The conditional distribution that we are trying to estimate can be written in the following exponential form:

P(C=ck|F=Fj) =
exp æ
è
- n
å
i=1 
li,k fj,i - n
å
x=1 
n
å
y=x 
nx,y,k fj,x fj,y ö
ø

2
å
l=1 
exp æ
è
- n
å
i=1 
li,l fj,i - n
å
x=1 
n
å
y=x 
nx,y,l fj,x fj,y ö
ø
.
(2.10)
Parameters li,k's and nx,y,k's are k-class-specific weights of individual features and feature pairs, respectively, and in principle can be expressed in terms of the first and second moments of the distributions. The values of parameters in Equations 2.10 and 2.11 are estimated by maximizing the product of probabilities for the individual training examples.
We tested two versions of the maximum-entropy classifier. MaxEnt 1 uses only information about the first moments of features in the training data (Equation 2.10); MaxEnt 2 uses the set of all individual features and the products of feature pairs (Equation 2.11). To select the most informative pairs of features we used a mutual information approach, as described in the subsection dealing with classification features.
For two classes (correct and incorrect) and 68 features MaxEnt 1 requires estimation of 136 parameters. In contrast, MaxEnt 2 requires estimation of 4,828 parameters: weight parameters for all first moments for two classes, plus weights for the second moments for two classes. MaxEnt 2-v is a version of MaxEnt 2 classifier where the squared values of features are not used, so that the classifier requires estimation of only 4,692 weight parameters.
Figure 2.6: A hypothetical three-layered feed-forward neural network. We used a similar network with 68 input units (one unit per classification feature) and 10 hidden-layer units.

Feed-forward neural network

A typical feed-forward artificial neural network is a directed acyclic graph organized into three (or more) layers. In our case, we chose a three-layered network, with a set of nodes of the input layer, {xi}i=1, ¼, Nx, nodes of the hidden layer, {yj}j=1, ¼, Ny, and a single node representing the output layer, z1, see Figure 2.6. The number of input nodes, Nx, is determined by the number of features used in the analysis (68 in our case). The number of hidden nodes, Ny, determines both the network's expressive power and its ability to generalize. Too small a number of hidden nodes makes a simplistic network that cannot learn from complex data. Too large a number makes a network that tends to overtrain-that works perfectly on the training data, but poorly on new data. We experimented with different values of Ny and settled on Ny = 10.
The values of the input nodes, {xi}i=1, ¼, Nx, are feature values of the object that we need to classify. The value of each node, yj, in the hidden layer is determined in the following way:

yj = F( wj,1 x1 + wj,2 x2 + ¼+ wj,Nx xNx),
(2.11)
where F(x) is a hyperbolic tangent function that creates an S-shaped curve:

F(x) = ex - e-x

ex + e-x
,
(2.12)
and {wj,k} are weight parameters. Finally, the value of the output node, z1 is determined as a linear combination of the values of all hidden nodes:

z1 = a1 y1 + a2 y2 + ¼+ aNy yNy,
(2.13)
where {ak} are additional weight parameters. We trained our network, using a back-propagation algorithm [90], to distinguish two classes, correct and incorrect, where positive values of z1 corresponded to the class correct.
The feed-forward neural network that we used in our analysis can be thought of as a model with Nx ×Ny + Ny parameters (690 in our case).

Support vector machines

The Support Vector Machines (SVM, [91,92]) algorithm solves a binary classification problem by dividing two sets of data geometrically, by finding a hyperplane that separates the two classes of objects in the training data in an optimum way (maximizing the margin between the two classes).
The SVM is a kernel-based algorithm, where the kernel is an inner product of two feature vectors (function/transformation of the original data). In this study, we used three of the most popular kernels: the linear, polynomial and Rbf (radial basis function) kernels. The linear kernel KL(x1, x2) = áx1, x2 ñ is simply the inner product of the two input feature vectors; an SVM with the linear kernel searches for a class-separating hyperplane in the original space of the data. Using a polynomial kernel, KPd(x1, x2) = (1+áx1, x2 ñ)d, is equivalent to transforming the data into a higher-dimensional space and searching for a separating plane there.11 Finally, using an Rbf kernel, KRbfg(x1, x2) = e-g ||x1-x2 ||2, corresponds to finding a separating hyperplane in an infinite-dimensional space.
In the most real-world cases the two classes cannot be separated perfectly by a hyperplane, and some classification errors are unavoidable. SVM algorithms use the C-parameter to control the error rate during the training phase (if the error is not constrained, the margin of every hyperplane can be extended infinitely). In this study, we used the default values for the C-parameter suggested by the SVM Light tool. Table 2.3 lists the SVM models and C-parameter values that we used in this study.
Model Kernel Kernel parameter C-parameter
SVM (OSU SVM) Linear 1
SVM-t0 (SVM Light) Linear 1
SVM-t1-d2 Polynomial d=2 0.3333
SVM-t1-d3 Polynomial d=3 0.1429
SVM-t2-g0.5 Rbf g=0.5 1.2707
SVM-t2-g1 Rbf g=1 0.7910
SVM-t2-g2 Rbf g=2 0.5783
Table 2.3: Parameter values used for various SVM classifiers in this study.
The output of an SVM analysis is not probabilistic, but there are tools to convert an SVM classification output into "posterior probabilities," see chapter by J. Platt in [93]. (A similar comment is applicable to the artificial neural network.)
The number of support vectors used by the SVM classifier depends on the size and properties of the training data set. The average number of (1 ×68-dimensional) support vectors used in 10 cross-validation experiments was 12,757.5, 11,994.4, 12,092, 12,289.9, 12,679.7, and 14,163.8, for SVM, SVM-t1-d2, SVM-t1-d3, SVM-t2-g0.5, SVM-t2-g1, and SVM-t2-g2 classifiers, respectively. The total number of data-derived values (which we loosely call "parameters") used by the SVM in our cross-validation experiments was therefore, on average, between 827,614 and 880,270 for various SVM versions.
Method Implementation URL Number of parameters
Naïve Bayes this study, WEKA http://www.cs.waikato.ac.nz/ml/weka/ 4,208
Clustered Bayes 68 this study N/A 276,432
Clustered Bayes 44 this study N/A 361,270
Discriminant Analysis this study N/A 4,828
SVM OSU SVM Toolbox for Matlab http://sourceforge.net/projects/svm 827,614
SVM-t* SVM light [94] http://svmlight.joachims.org/ 827,614 to 880,270
Neural Network Neural Network toolbox for Matlab N/A 690
MaxEnt 1 Maximum Entropy Modeling Toolkit for Python and C++ http://homepages.inf.ed.ac.uk /s0450736/maxent_toolkit.html 136
MaxEnt 2 same as the MaxEnt 1 same as the MaxEnt 1 4,828
MaxEnt 2-v same as the MaxEnt 1 same as the MaxEnt 1 4,692
Meta-Classifier OSU SVM Toolbox for Matlab http://sourceforge.net/projects/svm > 11,560
Table 2.4: Machine learning methods used in this study and their implementations.

Meta-method

We implemented the meta-classifier on the basis of the SVM algorithm (linear kernel with C=1) applied to predictions (converted into probabilities that the object belongs to the class correct) provided by the individual "simple" classifiers. The meta-method used 1,445 support vectors (1 ×7-dimensional), in addition to combined parameters of the seven individual classifiers used as input to the meta-classifier.

Implementation

A summary of the sources of software used in our study is shown in Table 2.4.

2.4.2  Features used in our analysis

We selected 68 individual features covering a range of characteristics that could help in the classification, see Table 2.5. To capture the flow of information in a molecular interaction graph (the edge direction), in each extracted relation we identified an "upstream term" (corresponding to the graph node with the outgoing directed edge) and a "downstream term" (the node with the incoming directed edge): for example, in the phrase "JAK phosphorylates p53," JAK is the upstream term, and p53 is the downstream term. Features in the group keywords represent a list of tokens that may signal that the sentence is hypothetical, interrogative, negative, or that there is a confusion in the relation extraction (e.g. the particle "by" in passive-voice sentences). We eventually abandoned keywords as we found them to be uninformative features, but they are still listed for the sake of completeness.
Group of features Feature(s) Values Number of features
Dictionary look-ups {Upstream, downstream} term can be found in {GeneBank, NCBI taxonomy, LocusLink, SwissProt, FlyBase, drug list, disease list, Specialist Lexicon, Bacteria, English Dictionary} Binary 20
Word metrics Length of the sentence (word count) Positive integer 1
Distance between the upstream and the downstream term Integer 1
Minimum non-negative word distance between the upstream and the downstream term Non-negative Integer 1
Distance between the upstream term and the action Integer 1
Distance between the downstream term and the action Integer 1
Previous scores Average score of relationships with the same {upstream term, downstream term, action} Real 3
Count of evaluated relationships with the same {upstream term, downstream term, action} Positive integer 3
Total count of relationships with the same {upstream term, downstream term, action} Positive integer 3
Average score of relationships that share the same pair of upstream and downstream terms Real 1
Total count of evaluated relationships that share the same pair of upstream and downstream terms Positive integer 1
Total count of relationships with both the same upstream and downstream terms Positive integer 1
Number of relations extracted from the same sentence Positive integer 1
Number of evaluated relations extracted from the same sentence Positive integer 1
Average score of relations from the same sentence Real 1
Number of relations sharing upstream term in same sentence Positive integer 1
Number of evaluated relations sharing upstream term in the same sentence Positive integer 1
Average score of relations sharing upstream term in same sentence Real 1
Relations sharing downstream term in the same sentence Positive integer 1
Evaluated relations sharing downstream term in the same sentence Positive integer 1
Average score of relations sharing downstream term in the same sentence Real 1
Number of relations sharing same action in the same sentence Positive integer 1
Number of evaluated relations sharing action in the same sentence Positive integer 1
Average score of relations sharing action in the same sentence Real 1
Punctuation Number of {periods, commas, semi-colons, colons} in the sentence Non-negative integer 4
Number of {periods, commas, semi-colons, colons} between upstream and downstream terms Non-negative integer 4
Terms Semantic sub-class category of the {upstream, downstream} term Integer 2
Probability that the {upstream, downstream} term has been correctly recognized Real 2
Probability that the {upstream, downstream} term has been correctly mapped Real 2
Part-of-speech tags {Upstream, downstream} term is a noun phrase Binary 2
Action is a verb Binary 1
Other Relationship is negative Binary 1
Action index Positive integer 1
Keyword is present Binary (not used)
Table 2.5: List of the features that we used in the present study. Dictionary lookups are binary features indicating absence or presence of a term in a specific dictionary. Previous scores are the average scores that a term or an action has in other relations evaluated. Term-recognition probabilities are generated by the GeneWays pipeline and reflect the likelihood that a term had been correctly recognized and mapped. Sharing of the same action (verb) by two different facts within the same sentence occurs in phrases such as A and B were shown to phosphorylate C. In this example, two individual relations, A phosphorylates C and B phosphorylates C, share the same verb, phosphorylate. Semantic categories are entities (semantic classes) in the GeneWays ontology (e.g. gene, protein, geneorprotein). Part-of-speech tags were generated by the Maximum Entropy tagger, MXPOST [95].
To represent the second-order features (pairs of features), we defined a new feature as a product of the normalized values of two features. We obtained the normalized values of features by subtracting the mean value from each feature value, then dividing the result by the standard deviation for this feature.

2.4.3  Separating data into training and testing: Cross-validation

To evaluate the success of our classifiers we used a 10-fold cross-validation approach, where we used [9/10] of data for training and [1/10] for testing. More precisely, given a partition of the manually evaluated data into 10 equal portions, we created 10 different pairs of training-test subsets, where we used each of the 10 equal data subsets in turn as testing data set, and used the larger remaining data subset as the training set. We then used 10 training-test set pairs to compare all algorithms.

2.4.4  Comparison of methods: Receiver operating characteristic (ROC) scores

To quantify and compare success of the various classification methods we used receiver operating characteristic (ROC) scores, also called areas under ROC curve [96].
An ROC score is computed in the following way. All test-set predictions of a particular classification method are ordered by the decreasing quality score provided by this method; for example, in the case of the Clustered Bayes algorithm, the quality score is the posterior probability that the test object belongs to the class correct. The ranked list is then converted into binary predictions by applying a decision threshold, T: All test objects with a quality score above T are classified as correct and all test objects with low-than-threshold scores are classified as incorrect. The ROC score is then computed by plotting the proportion of true-positive predictions (in the test set we know both the correct label and the quality score of each object) against false-positive predictions for the whole spectrum of possible values of T, then integrating the area under the curve obtained in this way, see Figure 2.7.
The ROC score is an estimate of the probability that the classifier under scrutiny will label correctly a pair of statements, one of which is from the class correct and one from the class incorrect [96]. A completely random classifier therefore would have an ROC score of 0.5, whereas a hypothetical perfect classifier would have an ROC score of 1. It is also possible to design a classifier that performs less accurately than would one that is completely random; in this case the ROC score is less than 0.5, which indicates that we can improve the accuracy of the classifier by simply reversing all predictions.

2.5  Results

The raw extracted facts produced by our system are noisy. Although many relation types are extracted with accuracy above 80%, and even above 90% (see Figure 2.2), there are particularly noisy verbs/relations that bring the average accuracy of the "raw" data to about 65%.12 Therefore, additional purification of text-mining output, either computational or manual, is indeed important.
The classification problem of separating correctly and incorrectly extracted facts appears to belong to a class of easier problems. Even the simplest Naïve Bayes method, had an average ROC score of 0.84 which improved to almost 0.95 with more sophisticated approaches. Judging by the average ROC score, the quality of prediction increased in the following order of methods: Clustered Bayes 68, Naïve Bayes, MaxEnt 1, Clustered Bayes 44, Quadratic Discriminant Analysis, artificial neural network, support vector machines, and MaxEnt 2/MaxEnt 2-v (see Table 2.8). The Meta-method was always slightly more accurate than MaxEnt 2, as explained in legend to Table 2.8 and shown in Figure 2.7.
Figure 2.7: Receiver-operating characteristic (ROC) curves for the classification methods that we used in the present study. We show only the linear-kernel SVM and the Clustered Bayes 44 ROC curves to avoid excessive data clutter.
Table 2.8 provides a somewhat misleading impression that MaxEnt 2 and MaxEnt 2-v are not significantly more accurate than their closest competitors (the SVM family), because of the overlapping confidence intervals. However, when we trace the performance of all classifiers in individual cross-validation experiments (see Figure 2.9) it becomes clear that MaxEnt 2 and MaxEnt 2-v outperformed their rivals in every cross-validation experiment. The SVM and artificial neural network methods performed essentially identically, and were always more accurate than three other methods: QDA, Clustered Bayes 44, and MaxEnt 1. Finally, the performance of the Clustered Bayes 68 and the Naïve Bayes methods was reliably the least accurate of all methods studied.
Figure 2.8: Correlation matrix. Comparison of a correlation matrix for the features (colored half of the matrix) computed using only the annotated set of data and a matrix of mutual information between all feature pairs and the statement class (correct or incorrect). The plot indicates that a significant amount of information critical for classification is encoded in pairs of weakly correlated features. The white dotted lines outline clusters of features, suggested by analysis of the annotated data set; we used these clusters in implementation of the Clustered Bayes classifier.
It is a matter of both academic curiosity and of practical importance to know how the performance of our artificial intelligence curator compares to that of humans. If we define the correct answer as a majority-vote of the three human evaluators (see Table 2.6), the average accuracy of MaxEnt 2 is slightly lower than, but statistically indistinguishable from humans (at the 99% level of significance, see Table 2.6; capital letters "A", "L", "S", and "M" hide the real names of the human evaluators). If, however, in the spirit of Turing's test of machine intelligence [97], we treat the MaxEnt 2 algorithm on an equal footing with the human evaluators, compute the average over predictions of all four anonymous evaluators, and compare the quality of the performance of each evaluator with regard to the average, MaxEnt 2 always performs slightly more accurately than one of the human evaluators.13 (In all cases we compared performance of the algorithm on data that was not used for its training; see Tables 2.7 and 2.6.)
The features that we used in our analysis are obviously not all equally important. To elucidate the relative importance of the individual features and of feature pairs, we computed the mutual information between all pairs of features and the class variable, (see Figure 2.8). The mutual information of class variable, C, and a pair of feature variables, (Fi,Fj) is defined in the following way (e.g., see [98]).

I(C;Fi,Fj)
=
I(Fi,Fj;C)
=
H(Fi,Fj)+H(C)-H(C,Fi,Fj),
(2.14)
where function H(P[x]) is Claude E. Shannon's entropy of distribution P(x) (see p. 14 of [99]), defined in the following way:

H(P) = -
å
x 
P(x) logP(x),
(2.15)
where summation is done over all admissible values of x. Figure 2.8 shows that the most informative standalone features, as expected, are those that contain information about the manually evaluated terms and relations of each type, and about properties of the sentence that was used to extract the corresponding fact. In addition, some dictionary-related features, such as finding a term in the LocusLink, are fairly informative. Some features, however, become informative only in combination with other features. For example, the minimum positive distance between two terms in a sentence is not very informative by itself, but becomes fairly useful in combination with other features, such as the number of commas in the sentence, or the length of the sentence (see Figure 2.8). Similarly, while finding a term in GenBank does not help the classifier by itself, the feature becomes informative in combination with syntactic properties of the sentence and statistics about the manually evaluated data.
Figure 2.9: Ranks of all classification methods used in this study in 10 cross-validation experiments.
Assignment of facts to classes correct and incorrect by evaluators is subject to random errors: Facts that were seen by many evaluators would be assigned to the appropriate class with higher probability than facts that were seen by only one evaluator. This introduction of noise affects directly the estimate of the accuracy of an artificial intelligence curator: If the gold standard is noisy, the apparent accuracy of the algorithm compared to the gold standard is lower than the real accuracy. Indeed, the three-evaluator gold standard, see Table 2.6, indicated that the actual optimum accuracy of the MaxEnt 2 classifier is higher than 88% percent. (The 88% accuracy estimate came from comparison of MaxEnt 2 predictions to the whole set of annotated facts, half of which were seen by only one or two evaluators, see Figure 2.10.) When MaxEnt 2 was compared with the three-human gold standard, the estimated accuracy was about 91% (see Table 2.6).
Evaluator Correct Incorrect Accuracy
[99% CI]
Batch A
A. 10,981 208 (11,189) 0.981410
[0.978014 0.984628]
L. 10,547 642 (11,189) 0.942622
[0.936902 0.948253]
M. 10,867 322 (11,189) 0.971222
[0.967111 0.975244]
MaxEnt 2 10,537 652 (11,189) 0.941728
[0.935919 0.947359]
Batch B
A. 9,796 430 (10,226) 0.957950
[0.952767 0.962938]
M. 9,898 328 (10,226) 0.967925
[0.963329 0.972325]