How to cite this?
Rodriguez-Esteban, Raul. Methods in biomedical text mining. Ph.D. Thesis, Columbia University, 2008. Download the pdf version. It looks nicer. For any comments, email me: raul AT ee columbia edu
"More than 40 years ago the fragmentation of scientific knowledge was a problem actively discussed but without much visible progress toward a solution; perhaps people then had the consummate wisdom to know that no problem is so big that you can't run away from it. Three aspects of the context and nature of this fragmentation seem notable: 1. The disparity between the total quantity of recorded knowledge, however it might be measured, and the limited human capacity to assimilate it, is not only enormous now but grows unremittingly. Exactly how the limitations of the human intellect and life span affect the growth of knowledge is unknown. Metaphorically, how can the frontiers of science be pushed forward if, someday, it will take a lifetime just to reach them? [...] 2. In response to the information explosion, specialties are somehow spontaneously created, then grow too large and split further into subspecialties without even a declaration of independence. One unintended result is the fragmentation of knowledge owing to inadequate cross-specialty communication. And as knowledge continues to grow, fragmentation will inevitably get worse because it is driven by the human imperative to escape inundation. 3. Of particular interest to me is the possibility that information in one specialty might be of value in another without anyone becoming aware of the fact. Specialized literatures, or other "units" of knowledge, that do not intercommunicate by citing one another may nonetheless have many implicit textual interconnections based on meaning. Indeed the number of unintended or implicit text-based connections within the literature of science may greatly exceed the number that are explicit, because there are far more possible combinations of units (that potentially could be related) than there are units. 
The connection explosion may be more portentous than the information explosion."
Hearst's opinion is shared by Ananiadou and McNaught [10] and others [11]: "The primary goal of text mining is to retrieve knowledge that is hidden in text, and to present the distilled knowledge to users in a concise form". However, a more common point of view, first proposed by Ronen Feldman, defines text mining as different from data mining only because it deals with data that by its nature is unstructured, unlike data organized in databases, which are the primary source for data mining [4,12,13,14,15]. Kao and Poteet [16] go even further, stating that "Text mining is the discovery and extraction of interesting, non-trivial knowledge from free or unstructured text. This encompasses everything from information retrieval (i.e., document or web site retrieval) to text classification and clustering, to (somewhat more recently) entity, relation, and event extraction." In practice, this expansive view of text mining is not shared by many others, especially considering that information retrieval and text classification predate text mining by many years. Kao and Poteet's opinion implies that text mining is an umbrella term covering a laundry list of textual processing methods. A more common view seems to be that the aim of text mining is to find interesting, useful, or valuable patterns (not necessarily novel) in text collections. This perspective places text mining closer to knowledge acquisition and information extraction. Given the fuzzy lines that separate text mining from similar fields, it is not clear whether it can be defined meaningfully beyond a mix of different conceptions held by different researchers. The confusion is compounded further because applications from related fields may be regarded as necessary processing steps for effective text mining. In other words, text-mining projects might require sub-tasks from other fields.
Therefore, text mining in some contexts might be used for the sole purpose of indicating the scientific agenda in which the study should be considered, not for defining the task itself as "text mining". Furthermore, as other fields have built on advances in text mining, text mining also has become an intermediate step in projects of a different nature. Related disciplines such as semantic analysis, text analysis, information retrieval, information extraction, and knowledge acquisition have a much older pedigree within the computation and information sciences than does text mining. Like text mining, they derive from activities that originally could be handled by human intellect and rudimentary record-keeping but became more complex with the progressive accumulation of knowledge and information. Fielden [17] plotted the evolution of the size of information repositories over the course of human history, showing exponential growth in recent decades. More comprehensively, Peter Lyman and Hal Varian led a study designed to estimate the quantity of information produced worldwide every year [18,19]; they estimated a grand total of 5 exabytes, or 800 megabytes per person per year, of which 92% were in magnetic storage. Printed text represented 33 terabytes, whereas the "surface internet" accounted for 167 terabytes and the "deep internet" (database-backed, dynamically generated pages) for about 92 petabytes. This unparalleled growth has been accompanied by extraordinary improvements in the devices and methods in the different computation and information sciences. Text mining, a late arrival, has the advantage of drawing from an extensive set of diverse techniques developed not only in the related disciplines, but also in other fields such as machine learning, artificial intelligence, probabilistic analysis, statistics, pattern recognition, data management, and information theory.
While other disciplines, like information retrieval, matured before electronic text became pervasive and widely available, text mining was born in a seemingly limitless and growing frontier of resources and opportunities. Text miners, in turn, have acted as if they held a hammer and saw a nail in everything. Perhaps this is the best explanation for the success of text mining: applications have driven its evolution [1]. Given the fragmentary state of the field, it is not surprising that no journal currently specializes in text mining. The door is open for further transformation of the text-mining domain, whether in terms of its buzz or its consolidation in the spectrum of computation and information sciences.
"Note that the automatically generated knowledge base is of necessity noisy: the GeneWays system extracts some percentage of statements incorrectly, and, even among correctly extracted statements, we should expect redundancy and contradictions. Therefore, the database requires curation, a process in which the original statements are annotated with statements regarding confidence in the corresponding information. The traditional way to perform such curation is through manual labor of human experts, a monumental task even for the database at its current size of roughly 3 million redundant statements extracted from 150,000 articles. To reduce the manual work, we are implementing a Curator module that would allow GeneWays to compute the estimates of reliability automatically."
Curation is considered a step in the process of ascertaining the truth of certain facts, especially for the scientist who is confronted with multiple, sometimes conflicting, pieces of information and who needs to make decisions within the knowledge pocket of her particular scientific specialization [66,63,65]. Curation also provides a way to assign a value to our degree of confidence about a fact within a continuous scale of truth. The value of truth assigned is an attempt to represent our limited ability to completely understand a text, or even the limited ability of the writer to express what she wants to say. Evaluation is a central part of curation. Systems biology studies face the difficult task of measuring recall in a broad and intricate search space combined with the limitations of manual evaluation of precision. Many evaluations in the literature, including many of those cited herein, were not described in detail, which makes it hard to establish their characteristics. Often, they entail in-house evaluations in which unnamed experts follow protocols that are not detailed.
This is understandable from the point of view that, in most cases, evaluations are considered a necessary, but not central, contribution. Friedman and Hripcsak [67] exposed a number of pitfalls in the task of evaluating NLP systems and defined 20 criteria to avoid them (see Table 1.4).
Minimizing Bias |
1. The developer should not see the test set of documents. |
2. If domain experts are used to determine the reference standard, they should not be developers of the system or designers of the study. |
3. The developer should not perform the evaluation. |
4. The NLP system should be frozen prior to the testing phase. |
5. If generalizability of the processor is being tested, the developer should not know details of the study beforehand. |
6. Ideally, the person designing the evaluation study should not be a developer of the system. |
Establishing a Reference Standard |
7. If domain experts are used to determine the reference standard, there should be a sufficient number to assess variability of the reference standard. |
8. The test set should be large enough in that there is sufficient power to distinguish levels of performance. |
9. The choice of the reference standard should be based on the objectives of the study (e.g. extraction capability vs. performance in an application). |
10. If domain experts are used to determine the reference standard, the type of expert should be appropriate (e.g. radiologist vs. internist). |
Describing the Evaluation Methods |
11. The method used to determine the reference standard should be clearly described, particularly if domain experts were used. |
12. The manner in which the test documents were chosen should be described. |
13. Methods used to calculate performance measures should be clearly presented and if non-standard measures are used, they should be described. |
Presenting Results |
14. Performance measures should relate to the complete test set. |
15. If human experts are used, inter-rater and intra-rater agreement should be given. |
16. Confidence intervals should be given for all measures. |
Discussing Conclusions |
17. Limitations of the study should be discussed. |
18. Results should be presented in light of requirements of the target application. |
19. Overgeneralization of the results should be avoided. |
20. An analysis of system failures should be given along with a discussion concerning the degree of difficulty of needed corrections. |
...he will throughly purge his floor, and gather his wheat into the garner; but he will burn up the chaff with unquenchable fire.
Matthew 3:12 [74]
Synopsis
Sentence [Source] | Extracted relation | Evaluation (Confidence) |
NIK binds to Nck in cultured cells. [76] | nik bind nck | Correct (High) |
One is that presenilin is required for the proper trafficking of Notch and APP to their proteases, which may reside in an intracellular compartment. [77] | presenilin required for notch | Correct (High) |
Serine 732 phosphorylation of FAK by Cdk5 is important for microtubule organization, nuclear movement, and neuronal migration. [78] | cdk5 phosphorylate fak | Correct (High) |
Histogram quantifying the percent of Arr2 bound to rhodopsin-containing membranes after treatment with blue light (B) or blue light followed by orange light (BO). [79] | arr2 bind rhodopsin | Correct (Low) |
It is now generally accepted that a shift from monomer to dimer and cadherin clustering activates classic cadherins at the surface into an adhesively competent conformation. [80] | cadherin activate cadherins | Correct (Low) |
Binding of G to CSP was four times greater than binding to syntaxin. [81] | csp bind syntaxin | Incorrect (Low) |
Treatment with NEM applied with cGMP made activation by cAMP more favorable by about 2.5 kcal/mol. [82] | camp activate cgmp | Incorrect (Low) |
This matrix is likely to consist of actin filaments, as similar filaments can be induced by actin-stabilizing toxins (O. S. et al., unpublished data). [83] | actin induce actin | Incorrect (High) |
A ligand-gated association between cytoplasmic domains of UNC5 and DCC family receptors converts netrin-induced growth cone attraction to repulsion. [84] | cytoplasmic domains associate unc5 | Incorrect (High) |
Term level | |
Upstream term is a junk substance | |
Action is incorrect biologically | |
Downstream term is a junk substance | |
Relation level | |
Correctly extracted | |
Sentence is hypothesis, not fact | |
Unable to decide | |
Incorrectly extracted | |
Incorrect upstream | |
Incorrect downstream | |
Incorrect action type | |
Missing or extra negation | |
Wrong action direction | |
Sentence does not support the action | |
Sentence level | |
Wrong sentence boundary |
Model | Kernel | Kernel parameter | C-parameter |
SVM (OSU SVM) | Linear | 1 | |
SVM-t0 (SVM Light) | Linear | 1 | |
SVM-t1-d2 | Polynomial | d=2 | 0.3333 |
SVM-t1-d3 | Polynomial | d=3 | 0.1429 |
SVM-t2-g0.5 | RBF | g=0.5 | 1.2707 |
SVM-t2-g1 | RBF | g=1 | 0.7910 |
SVM-t2-g2 | RBF | g=2 | 0.5783 |
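The kernel choices in the table above can be made concrete with a small sketch. It assumes the usual textbook forms of the linear, polynomial, and RBF kernels, with the polynomial scale `s` and offset `c` both set to 1; those two values and the function names are assumptions for illustration and are not taken from the study. The C-parameter is a separate regularization setting and is not modeled here.

```python
import math


def linear_kernel(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def polynomial_kernel(a, b, d, s=1.0, c=1.0):
    """(s * <a, b> + c) ** d; s and c are assumed values, not from the table."""
    return (s * linear_kernel(a, b) + c) ** d


def rbf_kernel(a, b, g):
    """exp(-g * ||a - b||^2), with g playing the role of the table's g."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-g * squared_distance)


# Model names follow the table; the mapping itself is illustrative.
KERNELS = {
    "SVM-t0": linear_kernel,
    "SVM-t1-d2": lambda a, b: polynomial_kernel(a, b, d=2),
    "SVM-t1-d3": lambda a, b: polynomial_kernel(a, b, d=3),
    "SVM-t2-g0.5": lambda a, b: rbf_kernel(a, b, g=0.5),
    "SVM-t2-g1": lambda a, b: rbf_kernel(a, b, g=1.0),
    "SVM-t2-g2": lambda a, b: rbf_kernel(a, b, g=2.0),
}
```

Larger values of `g` make the RBF kernel decay faster with distance, so each training example influences a smaller neighborhood.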
Method | Implementation | URL | Number of parameters |
Naïve Bayes | this study, WEKA | http://www.cs.waikato.ac.nz/ml/weka/ | 4,208 |
Clustered Bayes 68 | this study | N/A | 276,432 |
Clustered Bayes 44 | this study | N/A | 361,270 |
Discriminant Analysis | this study | N/A | 4,828 |
SVM | OSU SVM Toolbox for Matlab | http://sourceforge.net/projects/svm | 827,614 |
SVM-t* | SVM light [94] | http://svmlight.joachims.org/ | 827,614 to 880,270 |
Neural Network | Neural Network toolbox for Matlab | N/A | 690 |
MaxEnt 1 | Maximum Entropy Modeling Toolkit for Python and C++ | http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html | 136 |
MaxEnt 2 | same as the MaxEnt 1 | same as the MaxEnt 1 | 4,828 |
MaxEnt 2-v | same as the MaxEnt 1 | same as the MaxEnt 1 | 4,692 |
Meta-Classifier | OSU SVM Toolbox for Matlab | http://sourceforge.net/projects/svm | > 11,560 |
Group of features | Feature(s) | Values | Number of features |
Dictionary look-ups | {Upstream, downstream} term can be found in {GenBank, NCBI taxonomy, LocusLink, SwissProt, FlyBase, drug list, disease list, Specialist Lexicon, Bacteria, English Dictionary} | Binary | 20 |
Word metrics | Length of the sentence (word count) | Positive integer | 1 |
Word metrics | Distance between the upstream and the downstream term | Integer | 1 |
Word metrics | Minimum non-negative word distance between the upstream and the downstream term | Non-negative integer | 1 |
Word metrics | Distance between the upstream term and the action | Integer | 1 |
Word metrics | Distance between the downstream term and the action | Integer | 1 |
Previous scores | Average score of relationships with the same {upstream term, downstream term, action} | Real | 3 |
Previous scores | Count of evaluated relationships with the same {upstream term, downstream term, action} | Positive integer | 3 |
Previous scores | Total count of relationships with the same {upstream term, downstream term, action} | Positive integer | 3 |
Previous scores | Average score of relationships that share the same pair of upstream and downstream terms | Real | 1 |
Previous scores | Total count of evaluated relationships that share the same pair of upstream and downstream terms | Positive integer | 1 |
Previous scores | Total count of relationships with both the same upstream and downstream terms | Positive integer | 1 |
Previous scores | Number of relations extracted from the same sentence | Positive integer | 1 |
Previous scores | Number of evaluated relations extracted from the same sentence | Positive integer | 1 |
Previous scores | Average score of relations from the same sentence | Real | 1 |
Previous scores | Number of relations sharing upstream term in the same sentence | Positive integer | 1 |
Previous scores | Number of evaluated relations sharing upstream term in the same sentence | Positive integer | 1 |
Previous scores | Average score of relations sharing upstream term in the same sentence | Real | 1 |
Previous scores | Number of relations sharing downstream term in the same sentence | Positive integer | 1 |
Previous scores | Number of evaluated relations sharing downstream term in the same sentence | Positive integer | 1 |
Previous scores | Average score of relations sharing downstream term in the same sentence | Real | 1 |
Previous scores | Number of relations sharing action in the same sentence | Positive integer | 1 |
Previous scores | Number of evaluated relations sharing action in the same sentence | Positive integer | 1 |
Previous scores | Average score of relations sharing action in the same sentence | Real | 1 |
Punctuation | Number of {periods, commas, semi-colons, colons} in the sentence | Non-negative integer | 4 |
Punctuation | Number of {periods, commas, semi-colons, colons} between upstream and downstream terms | Non-negative integer | 4 |
Terms | Semantic sub-class category of the {upstream, downstream} term | Integer | 2 |
Terms | Probability that the {upstream, downstream} term has been correctly recognized | Real | 2 |
Terms | Probability that the {upstream, downstream} term has been correctly mapped | Real | 2 |
Part-of-speech tags | {Upstream, downstream} term is a noun phrase | Binary | 2 |
Part-of-speech tags | Action is a verb | Binary | 1 |
Other | Relationship is negative | Binary | 1 |
Other | Action index | Positive integer | 1 |
Other | Keyword is present | Binary | (not used) |
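As an illustration of the "Word metrics" and "Punctuation" groups in the table above, the following sketch derives a few such features from a tokenized sentence. The tokenization, the sign convention for distances, and all function and key names are assumptions made for this example; the table does not specify how the original features were computed.

```python
# Punctuation marks counted by the (hypothetical) punctuation features.
PUNCTUATION = {"periods": ".", "commas": ",", "semicolons": ";", "colons": ":"}


def word_metric_features(tokens, upstream_idx, downstream_idx, action_idx):
    """Sentence length and signed token distances between relation parts."""
    marks = set(PUNCTUATION.values())
    return {
        # Word count excludes stand-alone punctuation tokens (an assumption).
        "sentence_length": sum(1 for t in tokens if t not in marks),
        "term_distance": downstream_idx - upstream_idx,
        "upstream_action_distance": action_idx - upstream_idx,
        "downstream_action_distance": action_idx - downstream_idx,
    }


def punctuation_features(tokens, upstream_idx, downstream_idx):
    """Counts of each punctuation mark in the sentence and between the terms."""
    low, high = sorted((upstream_idx, downstream_idx))
    between = tokens[low + 1:high]
    features = {}
    for name, mark in PUNCTUATION.items():
        features[f"{name}_in_sentence"] = tokens.count(mark)
        features[f"{name}_between_terms"] = between.count(mark)
    return features


# Example sentence from the evaluation table: "NIK binds to Nck in cultured cells."
tokens = "NIK binds to Nck in cultured cells .".split()
word_metrics = word_metric_features(tokens, upstream_idx=0, downstream_idx=3, action_idx=1)
punctuation = punctuation_features(tokens, upstream_idx=0, downstream_idx=3)
```

Each function yields a flat dictionary, so the outputs can be merged into a single feature vector per extracted relation.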
Evaluator | Correct | Incorrect (Total) | Accuracy [99% CI] |
Batch A | | | |
A. | 10,981 | 208 (11,189) | 0.981410 [0.978014, 0.984628] |
L. | 10,547 | 642 (11,189) | 0.942622 [0.936902, 0.948253] |
M. | 10,867 | 322 (11,189) | 0.971222 [0.967111, 0.975244] |
MaxEnt 2 | 10,537 | 652 (11,189) | 0.941728 [0.935919, 0.947359] |
Batch B | | | |
A. | 9,796 | 430 (10,226) | 0.957950 [0.952767, 0.962938] |
M. | 9,898 | 328 (10,226) | 0.967925 [0.963329, 0.972325] |
S. | 9,501 | 725 (10,226) | 0.929102 [0.922453, 0.935556] |
MaxEnt 2 | 9,379 | 847 (10,226) | 0.917172 [0.910033, 0.924115] |
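The accuracy column above is the number of correct judgments divided by the total. The sketch below recomputes it and attaches a 99% confidence interval using the normal (Wald) approximation to the binomial. The interval method used for the table is not stated in this section, so this choice is an assumption, and the bounds it produces can differ slightly from the tabulated ones.

```python
import math
from statistics import NormalDist


def accuracy_with_ci(correct, total, level=0.99):
    """Return (accuracy, lower, upper) using a normal-approximation interval."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    # Two-sided critical value, e.g. about 2.576 for a 99% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Evaluator A., Batch A, first table: 10,981 correct out of 11,189.
accuracy, lower, upper = accuracy_with_ci(10981, 11189)
```

With more than ten thousand judgments per batch the interval is narrow, which is why differences of one or two percentage points between evaluators are distinguishable.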
Evaluator | Correct | Incorrect (Total) | Accuracy [99% CI] |
Batch A | | | |
A. | 10,700 | 182 (10,882) | 0.983275 [0.980059, 0.986400] |
L. | 10,452 | 430 (10,882) | 0.960485 [0.955615, 0.965172] |
M. | 10,629 | 253 (10,882) | 0.976751 [0.972983, 0.980426] |
MaxEnt 2 | 10,537 | 345 (10,882) | 0.968296 [0.963885, 0.972523] |
Batch B | | | |
A. | 9,499 | 363 (9,862) | 0.963192 [0.958223, 0.967958] |
M. | 9,636 | 226 (9,862) | 0.977084 [0.973130, 0.980836] |
S. | 9,332 | 530 (9,862) | 0.946258 [0.940276, 0.952038] |
MaxEnt 2 | 9,379 | 483 (9,862) | 0.951024 [0.945346, 0.956500] |
Method | ROC score ± 2 s |
Clustered Bayes 68 | 0.8115 ± 0.0679 |
Naïve Bayes | 0.8409 ± 0.0543 |
MaxEnt 1 | 0.8647 ± 0.0412 |
Clustered Bayes 44 | 0.8751 ± 0.0414 |
QDA | 0.8826 ± 0.0445 |
SVM-t0 | 0.9203 ± 0.0317 |
SVM | 0.9222 ± 0.0299 |
Neural Network | 0.9236 ± 0.0314 |
SVM-t1-d2 | 0.9277 ± 0.0285 |
SVM-t2-g2 | 0.9280 ± 0.0285 |
SVM-t1-d3 | 0.9281 ± 0.0280 |
SVM-t2-g1 | 0.9286 ± 0.0283 |
SVM-t2-g0.5 | 0.9287 ± 0.0285 |
MaxEnt 2 | 0.9480 ± 0.0178 |
MaxEnt 2-v | 0.9492 ± 0.0156 |
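Assuming the ROC score in the table above is the area under the ROC curve, it can be computed directly as the probability that a randomly chosen correct relation receives a higher classifier score than a randomly chosen incorrect one, with ties counted as one half. The sketch below is a plain illustration of that definition; the function name and the 0/1 label encoding are assumptions, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    scores: classifier outputs, higher meaning "more likely correct".
    labels: 1 for a correct relation, 0 for an incorrect one.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the values in the table.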
"For example, there is a gene name "bride of sevenless" (FlyBase ID FBgn0000206) with its acronym "boss", as well as a protein that has been named after a Chinese breakfast noodle "yotiao" (Swiss-Prot ID Q99996). Even if biologists start to use exclusively "well-formed" and approved names, there are still a huge number of documents containing "legacy" and ad hoc terms." [101]
"For example, we have four ways to tag the name in the phrase `yeast YSY6 protein': `yeast YSY6 protein', `yeast YSY6', `YSY6 protein' or `YSY6'. This ambiguity implies that annotators may include yeast today and may exclude it a year later, unless given some `annotation rules'. [...] To make things worse, protein names are often derived from descriptive terms (signal transducer and activator of transcription, STAT) and only later become accepted by the research community through repetition (STAT-4). Protein names also overlap with gene names (myc-c gene and myc-c protein), cell cultures (CD4+-cells and CD4 protein), and may be rather similar to chemical compounds (Caeridin and Cantharidin)." [141]
"The choice of a gene name can have unforeseen consequences in addition to infringement of trademark ("Pokemon blocks gene name" Nature 438, 897; 2005). The quirky sense of humour that researchers display in choosing a gene name often loses much in translation when people facing serious illness or disability are told that they or their child have a mutation in a gene such as Sonic hedgehog, Slug or Pokemon. As with the acronym CATCH22 (from `cardiac anomaly, T-cell deficit, clefting and hypocalcaemia') for chromosome 22q11.2 microdeletions, which was abandoned because of its no-win connotations (J. Burn J. Med. Genet. 36, 737–738; 1999), researchers need to be mindful when naming genes and syndromes." [142]
"1. Authors often use the original words instead of abbreviations, change letter cases, and ignore implicit name generating rules.
- epidermal growth factor receptor or EGF receptor or EGFR
- cycline D1-cdk4 complex or cycline D1-Cdk4 complex
- c-Jun or c-jun or c jun
2. Below, the name explains its function.
- the Ras guanine nucleotide exchange factor Sos
- the Ras guanine nucleotide releasing protein Sos
- the Ras exchanger Sos
- the GDP-GTP exchange factor Sos
- Sos(mSos), a GDP/GTP exchange protein for Ras" [124]
Biomedical term recognition stresses the importance of morphological features (letter case, numbers, Greek letters, hyphens, etc.) [143,144] and infixes for term recognition, as a result of the formation patterns in some term classes in biomedicine. Different methodological strategies have been used for term recognition and classification ranging from straightforward dictionary matching to black box setups based on machine learning. Here they are described separately, although they more commonly are used in conjunction.
Name | Description | Examples |
Modifier | Semantic-modifying tokens | receptor, inhibitor |
Non-descriptive | Annotating tokens | fragment, precursor |
Specifier | Numbers and Greek letters | 1, V1, alpha, gamma |
Common | Common English words | and, was, killer |
Delimiter | Separator tokens | ( ) , . ; |
Standard | Standard tokens | TNF, BMP, IL |
"The biology domain offers a prime example of this multiplicity of meanings, since every protein has an associated gene with often the same name. Further, genes and their transcripts (mRNA, rRNA, tRNA and the like) often share the same name as well. Often an article will refer to the protein, gene, and RNA senses of a term in close proximity, relying on the reader's expertise and the surrounding context for disambiguation. For example, SBP2 is listed as a gene/protein in the GenBank [...] database. In one of our source articles [...] we find the following sentences:
- `By UV cross-linking and immunoprecipitaion, we show that SBP2 specially binds selenoprotein mRNAs both in vitro and in vivo.'
- `The SBP2 clone used in this study generates a 3173 nt transcript (2541 nt of coding sequence plus a 632 nt 3' UTR truncated at the plyadenylation site).'
In the first sentence the highlighted occurrence of SBP2 is a protein, while in the second sentence is a gene." [59]
Moreover, in the biomedical domain word sense might be harder for humans to distinguish. The pairwise agreement between human annotators for classification of gene and protein names has been measured to be about 78% [59,176], compared to 88-100% for generic word senses [180,176]. Abbreviation or acronym resolution is a special case of WSD, for which a specific set of techniques has been developed [181,182,183,184,185,186,187]. The processing entails mapping an abbreviation to a definition for the purpose of disambiguating its meaning. Definitions and abbreviations may appear appositionally (e.g., the abbreviation DNA and the definition deoxyribonucleic acid in: Deoxyribonucleic acid (DNA)), but abbreviations may be found anywhere in a text, as definitions commonly are written only the first time the abbreviation is used. A third case occurs when the author considers that an abbreviation is so well known within a scientific field (e.g., DNA) that the definition is unnecessary; in this case the definition must be sought in a different text. Although abbreviation resolution faces challenges similar to those of other biomedical text-mining tasks (e.g., large, open vocabulary; few rules; plurality of domains), it actually is a success story and almost a solved problem. This is, perhaps, because it is a well-defined task in terms of inputs and outcomes. Okazaki and Ananiadou [185] reported 99% precision and 82-95% recall. Yu and colleagues [187] reported 92% precision and 91% coverage.
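The appositional case of abbreviation resolution, in which a definition is immediately followed by its abbreviation in parentheses as in "Deoxyribonucleic acid (DNA)", can be sketched with a simple character-matching heuristic. The sketch below follows the general idea of the well-known Schwartz and Hearst matching procedure and is not claimed to be the method of any of the systems cited in this section; the function names, the regular expression, and the size of the candidate window are assumptions.

```python
import re


def best_long_form(short, candidate):
    """Match the abbreviation's characters right to left inside the candidate.

    Returns the shortest suffix of `candidate` that contains every letter or
    digit of `short` in order, with the first one starting a word, else None.
    """
    s = len(short) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # Move left until the character matches; the first character of the
        # abbreviation must additionally sit at the start of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def find_definitions(text):
    """Return (abbreviation, definition) pairs for 'definition (ABBR)' patterns."""
    pairs = []
    for match in re.finditer(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)", text):
        short = match.group(1)
        words = text[:match.start()].split()
        # Candidate window size is a heuristic assumption.
        window = min(len(short) + 5, len(short) * 2)
        long_form = best_long_form(short, " ".join(words[-window:]))
        if long_form:
            pairs.append((short, long_form))
    return pairs
```

This only covers the appositional case; abbreviations whose definitions appear elsewhere in the text, or not at all, need the additional resources discussed above.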
Feature | Example |
GreekLetter | kappa |
CapsDigitHyphen | Oct-1 |
CapsAndDigits | STAT1 |
SingleCap | B |
LettersAndDigits | p105 |
LowCaps | pre-BI |
OneDigit | 2 |
TwoCaps | EBV |
InitCap | Sox |
HyphenDigit | 95- |
LowerCase | kinases |
HyphenBacklash | - |
Punctuation | ( |
DigitSequence | 98401159 |
TwoDigit | 37 |
FourDigit | 1997 |
NucleotideSequence |
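Orthographic features such as those in the table above are typically assigned with ordered pattern rules. The sketch below covers only a subset of the listed classes, and its patterns, rule order, and partial Greek-letter list are assumptions inferred from the feature names and examples rather than the definitions used in the study.

```python
import re

# Partial, illustrative list of Greek letter names.
GREEK_LETTERS = {"alpha", "beta", "gamma", "delta", "epsilon", "kappa", "lambda", "sigma"}

# Ordered rules: the first pattern that matches the whole token wins.
ORTHOGRAPHIC_RULES = [
    ("FourDigit", re.compile(r"\d{4}")),
    ("TwoDigit", re.compile(r"\d{2}")),
    ("OneDigit", re.compile(r"\d")),
    ("DigitSequence", re.compile(r"\d+")),
    ("SingleCap", re.compile(r"[A-Z]")),
    ("CapsDigitHyphen", re.compile(r"[A-Z][A-Za-z]*-\d+")),
    ("CapsAndDigits", re.compile(r"(?=.*[A-Z])(?=.*\d)[A-Z\d]+")),
    ("TwoCaps", re.compile(r"[A-Z]{2,}")),
    ("InitCap", re.compile(r"[A-Z][a-z]+")),
    ("LettersAndDigits", re.compile(r"(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]+")),
    ("LowerCase", re.compile(r"[a-z]+")),
    ("Punctuation", re.compile(r"[^\w\s]")),
]


def orthographic_feature(token):
    """Return the name of the first matching orthographic class, or 'Other'."""
    if token.lower() in GREEK_LETTERS:
        return "GreekLetter"
    for name, pattern in ORTHOGRAPHIC_RULES:
        if pattern.fullmatch(token):
            return name
    return "Other"
```

Rule order matters: more specific classes (for example FourDigit) must be tested before the general ones that would also match (DigitSequence).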
"Animals can be divided into:
a. belonging to the Emperor
b. embalmed
c. trained
d. pigs
e. sirens
f. fabulous
g. stray dogs
h. included in this classification
i. trembling like crazy
j. innumerable
k. drawn with a very fine camelhair brush
l. et cetera
m. just broke the vase
n. from a distance look like flies." 22
We would prefer a system with more flexibility, like a machine learning approach. Additionally, we would like our system to have other characteristics that we have tried to implement:
Term | Labels | Nested terms |
Jak tyrosine kinases | O U O | tyrosine |
NFATp / AP-1 complex formation | B I I E O | NFATp / AP-1 complex |
GM-CSF receptor alpha promoter | UB I E O | GM-CSF, GM-CSF receptor alpha |
Class | F-measure (this study) | F-measure [214] |
Protein | 94% | 91% |
DNA | 94% | 85% |
Cell type | 82% | 84% |
Other organic compound | 85% | 70% |
Cell line | 81% | 65% |
Multi cell | 96% | 85% |
Lipid | 76% | 87% |
Virus | 88% | 88% |
Cell component | 96% | 84% |
RNA | 90% | 77% |
Total | 90% | 86% |
It is beyond our power to fathom,
Which way the word we utter resonates.
Fedor Tyutchev
We analyzed the frequencies of use of sensory words (describing touch, smell, sight, taste, and sound) and time-related terms in a very large collection of biomedical texts. We then compared the results with similar analyses of a collection of news articles, a large encyclopedia, and a body of literary prose and poetry. We found that, unlike literary compositions and newswire articles, biomedical texts are extremely sensory-poor, but rich in overall vocabulary. It is likely that the sensory-deprived writing style that dominates the biomedical literature impedes text comprehension and numbs the reader's senses.
When we read technical or literary prose, chains of words flowing through our minds invoke sensory responses that can be surprising (unexpected) even for the writer. Even a very technical text typically affects the reader on multiple levels, in addition to transmitting the author-intended content. Prose can profoundly alter the physiological and emotional states of an unsuspecting reader. The semantic priming test in modern psychology exploits this phenomenon. For example, people start feeling and behaving as if they have suddenly grown older after reading a scrambled sequence of words enriched with aging-related connotations [215]. The priming effect is largely independent of our conscious understanding of a text: autistic children whose text comprehension is mildly impaired respond to semantic priming similarly to non-autistic kids [216]. Furthermore, our emotional response to a sequence of words depends on our genetic background: for example, children of parents with bipolar disorder react much more vividly to words that have undertones of a social threat than do children in a control group [217]. Semantic priming can profoundly affect the model of the outside world reported by our senses: merely naming an odor ("cheddar cheese" vs. "body odor") can determine our perception of the odor as pleasant or nauseating [218].
The selection of words in a composition also reveals deep personality traits of its author to the reader. For example, a person's color preferences and idiosyncrasies provide information relevant to her psychological evaluation [219]. Schizophrenia patients, who are especially susceptible to semantic priming, have a characteristic utterance pattern: the patients' own words generate diverse secondary associations in their minds. These self-inflicted associations surface in the patients' utterances and disturb the clarity of their messages [220]. Computational analysis of scientific language typically serves as the groundwork for engineering text-mining tools [221,222]. This analysis also provides us with a unique glimpse into the "collective unconsciousness" of a scientific community. In this study we compare the frequencies of sensory terms, such as those related to the perception of color, smell, taste, touch, sound, and time, across multiple large corpora. We use this comparison to infer a "collective sensory landscape" of the biomedical literature and the hypothetical priming that biomedical texts exert on their readers. We analyzed a large collection of scientific texts (Journals, including almost 250,000 full-text articles) representing 78 biomedical journals. We compared the properties of biomedical texts with those of news reports (Reuters), the open-access encyclopedia Wikipedia (Wiki), and complete collections of the compositions of Edgar Allan Poe (Poe), William Shakespeare (Shakespeare), and Walt Whitman (Whitman). We grouped these corpora into those that are collective (Journals, Reuters, and Wiki) and those that are individual (Poe, Shakespeare, and Whitman). When discussing time (Figure 5.1 A), all six corpora most frequently mention days and years. In individual corpora, days predominate over all other time terms, followed by hours and years. In collective corpora, years are most often mentioned, followed by days and seconds.
While individual corpora remain exclusively within the second-to-century range, collective corpora reach into picoseconds on the short-term side, and into millennia (and even millions of years, not shown) on the long-term side of the range of time-scales. Within individual corpora, Whitman is the most concerned with centuries, and Shakespeare the least. Reuters is almost twice as time-obsessed as Whitman; all the other corpora are several-fold poorer in time-conscious words (see Figure 5.1 E). Biomedical texts are among the poorest in time-related terms, although Wiki and Shakespeare are even poorer.
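The corpus comparison described in this chapter rests on relative frequencies: the number of occurrences of the terms in a category divided by the total number of tokens in the corpus. The sketch below shows that calculation for a toy lexicon. The lexicon entries, the crude lowercase tokenizer, and the function name are assumptions for illustration; they are not the term lists or the preprocessing used in the study.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: category name -> surface forms counted for it.
LEXICON = {
    "day": {"day", "days"},
    "year": {"year", "years"},
    "color": {"red", "green", "blue"},
}


def category_frequencies(text, lexicon):
    """Relative frequency of each lexicon category among all word tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {
        category: (sum(counts[form] for form in forms) / total if total else 0.0)
        for category, forms in lexicon.items()
    }


sample = "Three days and two years later, the day was red."
frequencies = category_frequencies(sample, LEXICON)
```

Normalizing by corpus size is what makes a collection of 250,000 articles comparable with the collected works of a single author. A real analysis would also have to deal with ambiguous forms such as "second", which can be a time unit or an ordinal.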
I had a scheme, which I still use today when somebody is explaining something that I'm trying to understand: I keep making up examples. For instance, the mathematicians would come in with a terrific theorem, and they're all excited. As they're telling me the conditions of the theorem, I construct something which fits all the conditions. You know, you have a set (one ball)-disjoint (two balls). Then the ball turns colors, grow hairs, or whatever, in my head as they put more conditions on. Finally they state the theorem, which is some dumb thing about the ball which isn't true for my hairy green ball thing, so I say "False!" If it's true, they get all excited, and I let them go on for a while. Then I point out my counterexample. "Oh. We forgot to tell you that it's Class 2 Hausdorff homomorphic." "Well, then," I say, "It's trivial! It's trivial!" By that time I know which way it goes, even though I don't know what Hausdorff homomorphic means. [227]
(Credit for isolating this quote is due to Daniel Dennett [228].) Our brain was shaped by a chain of evolutionary adaptations, each invoked by an acute necessity to address a concrete survival problem posed by our changing environment. Our neural system is therefore an eclectic ensemble of disparate pieces of hardware, perfected for solving specialized problems, such as the detection of potentially threatening bilateral vertical symmetry (a lurking predator) in the chaotic environment, or prompt recognition of the faces of the numerous members of our own tribe. To make more efficient use of our neural machinery, we need to translate abstract problems into concrete sensory-grounded symbols that can be efficiently processed by our brains. (This is like trying to do a general computation using graphics-oriented hardware: to make the computation efficient, we have to translate our task into spatial translations of three-dimensional primitives.)
When we read and compose sensory-deprived prose, we probably leave a large portion of our nervous system uninvolved, since different words and meanings are processed by distinct brain areas [229]. We conjecture that a piece of sensory-poor prose does, on average, a poorer job of engaging the reader's imagination than a sensory-rich one, although the former can be much more precise and concise than the latter. Within a narrow scientific subfield, an expert would undoubtedly prefer to read a concise technical text rather than a longer one replete with metaphors and analogies. However, the situation is different for a scientist trying to read a paper from a neighboring subfield: a dry technical description may require a prohibitive amount of a non-expert's time to read and grasp. It is in the writer's best interest to ensure that her work is as widely accessible as possible. We believe that scientific prose should be enriched with sensory words (provided that they clarify the meaning rather than obscure it), in much the same way as a good statistical data visualization involves the mapping of abstract data into colors and three-dimensional shapes, thus aiding the discovery of patterns.
"Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?"
3. The current version of the GeneWays database contains 4,035,759 redundant interactions (2,652,916 of them are unique) that involve 1,299,146 unique substance terms (with 17,903,358 redundant terms identified in total) from 232,265 full-text articles representing 78 major research journals. The spectrum of relations represented in the database is shown in Figures 2.2 and 2.3.
4. We also computed the k-score for the inter-annotator agreement in the following way.
| (8.1) |
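The formula for the k-score did not survive in this copy of the text. Assuming it refers to the kappa statistic commonly used for inter-annotator agreement, a minimal sketch of Cohen's kappa for two annotators is given below. This is the standard textbook definition, (observed agreement minus chance agreement) divided by (1 minus chance agreement), and is not necessarily the exact variant used in the thesis; the function name and label values are assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Unlike raw pairwise agreement, kappa discounts the agreement two annotators would reach by chance, which matters when one label (for example "correct") dominates the data.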