TFIDF In Libraries: Part I of III (For Librarians)
This is the first of a three-part series called TFIDF In Libraries, where “relevancy ranking” will be introduced. In this part, term frequency/inverse document frequency (TFIDF), a common mathematical method of weighting terms for automatic classification and for sorting search results, will be described. Part II will illustrate an automatic classification system and simple search engine using TFIDF through a computer program written in Perl. Part III will explore the possibility of filtering search results by applying TFIDF against sets of pre-defined “Big Names” and/or “Big Ideas”, an idea apparently called “champion lists”.
The problem, straight Boolean logic
To many of us the phrase “relevancy ranked search results” is a mystery. What does it mean to be “relevant”? How can anybody determine relevance for me? Well, a better phrase might have been “statistically significant search results”. Taking such an approach, the application of statistical analysis against texts, does have its information retrieval advantages over straight Boolean logic. Take, for example, the following three documents, each consisting of a dozen words, Table #1:
| Document #1 | Document #2 | Document #3 |
|-------------|-------------|-------------|
| airplane | book | building |
| blue | car | car |
| chair | chair | carpet |
| computer | justice | ceiling |
| forest | milton | chair |
| justice | newton | cleaning |
| love | pond | justice |
| might | rose | libraries |
| perl | shakespeare | newton |
| rose | slavery | perl |
| shoe | thesis | rose |
| thesis | truck | science |
A search for “rose” against the corpus will return three hits, but which one should I start reading? The newest document? The document by a particular author or in a particular format? Even if the corpus contained 2,000,000 documents and a search for “rose” returned a mere 100, the problem would remain. Which ones should I spend my valuable time accessing? Yes, I could limit my search in any number of ways, but unless I am doing a known item search it is quite likely the search results will return more than I can use, and information literacy skills will only go so far. Ranked search results, a list of hits ordered by term weighting, have proven to be an effective way of addressing this problem. All it requires is the application of basic arithmetic against the documents being searched.
Simple counting
We can begin by counting the number of times each word appears in each of the documents, Table #2:
| Word (Doc #1) | C | Word (Doc #2) | C | Word (Doc #3) | C |
|---------------|---|---------------|---|---------------|---|
| airplane | 5 | book | 3 | building | 6 |
| blue | 1 | car | 7 | car | 1 |
| chair | 7 | chair | 4 | carpet | 3 |
| computer | 3 | justice | 2 | ceiling | 4 |
| forest | 2 | milton | 6 | chair | 6 |
| justice | 7 | newton | 3 | cleaning | 4 |
| love | 2 | pond | 2 | justice | 8 |
| might | 2 | rose | 5 | libraries | 2 |
| perl | 5 | shakespeare | 4 | newton | 2 |
| rose | 6 | slavery | 2 | perl | 5 |
| shoe | 4 | thesis | 2 | rose | 7 |
| thesis | 2 | truck | 1 | science | 1 |
| Totals (T) | 46 | Totals (T) | 41 | Totals (T) | 49 |
Given this simple counting method, searches for “rose” can be sorted by “term frequency” (TF): the quotient of the number of times a word appears in a given document (C) and the total number of words in that document (T), or TF = C / T. In Document #1, “rose” has a TF value of 0.13; in Document #2 it is 0.12; and in Document #3 it is 0.14. Thus, by this rudimentary analysis, Document #3 is the most significant in terms of the word “rose” and Document #2 the least: Document #3 has the highest proportion of its content devoted to the word “rose”.
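Part II will implement all of this in Perl; as a quick sketch in Python (with the counts for “rose” copied from Table #2), the term frequency calculation looks like this:

```python
# Term frequency: TF = C / T.
# (C, T) pairs for the word "rose", taken from Table #2.
counts = {
    "Document #1": (6, 46),
    "Document #2": (5, 41),
    "Document #3": (7, 49),
}

tf = {doc: c / t for doc, (c, t) in counts.items()}

for doc, score in tf.items():
    print(f"{doc}: TF = {score:.2f}")
```

Rounded to two places, the scores come out to 0.13, 0.12, and 0.14, matching the text above.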
Accounting for common words
Unfortunately, this simple analysis needs to be offset to account for terms occurring frequently across the entire corpus. Good examples are stop words or the word “human” in MEDLINE. Such words are nearly meaningless because they appear so often. Consider Table #3, which includes the number of documents in which each word is found (DF), and the quotient of the total number of documents (D, or in this case 3) and DF: IDF = D / DF. Words with higher scores are more significant across the entire corpus. Search terms whose IDF (“inverse document frequency”) score approaches 1 are close to useless because they exist in just about every document:
| Word (Doc #1) | DF | IDF | Word (Doc #2) | DF | IDF | Word (Doc #3) | DF | IDF |
|---------------|----|-----|---------------|----|-----|---------------|----|-----|
| airplane | 1 | 3.0 | book | 1 | 3.0 | building | 1 | 3.0 |
| blue | 1 | 3.0 | car | 2 | 1.5 | car | 2 | 1.5 |
| chair | 3 | 1.0 | chair | 3 | 1.0 | carpet | 1 | 3.0 |
| computer | 1 | 3.0 | justice | 3 | 1.0 | ceiling | 1 | 3.0 |
| forest | 1 | 3.0 | milton | 1 | 3.0 | chair | 3 | 1.0 |
| justice | 3 | 1.0 | newton | 2 | 1.5 | cleaning | 1 | 3.0 |
| love | 1 | 3.0 | pond | 1 | 3.0 | justice | 3 | 1.0 |
| might | 1 | 3.0 | rose | 3 | 1.0 | libraries | 1 | 3.0 |
| perl | 2 | 1.5 | shakespeare | 1 | 3.0 | newton | 2 | 1.5 |
| rose | 3 | 1.0 | slavery | 1 | 3.0 | perl | 2 | 1.5 |
| shoe | 1 | 3.0 | thesis | 2 | 1.5 | rose | 3 | 1.0 |
| thesis | 2 | 1.5 | truck | 1 | 3.0 | science | 1 | 3.0 |
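The IDF column can be computed directly from the word lists in Table #1. A minimal sketch (again in Python rather than the series’ Perl) using the article’s definition, IDF = D / DF:

```python
# Inverse document frequency as defined in this article: IDF = D / DF.
# Each set holds the distinct words of one document (from Table #1).
docs = [
    {"airplane", "blue", "chair", "computer", "forest", "justice",
     "love", "might", "perl", "rose", "shoe", "thesis"},
    {"book", "car", "chair", "justice", "milton", "newton",
     "pond", "rose", "shakespeare", "slavery", "thesis", "truck"},
    {"building", "car", "carpet", "ceiling", "chair", "cleaning",
     "justice", "libraries", "newton", "perl", "rose", "science"},
]

D = len(docs)  # total number of documents in the corpus

def idf(word):
    """Return D / DF, where DF counts the documents containing word."""
    df = sum(word in doc for doc in docs)
    return D / df

print(idf("rose"))    # in every document, so close to useless
print(idf("milton"))  # in only one document, so highly discriminating
```

Note that “rose”, “chair”, and “justice” all bottom out at 1.0, exactly the near-useless case described above.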
Term frequency/inverse document frequency (TFIDF)
By taking into account these two factors, term frequency (TF) and inverse document frequency (IDF), it is possible to assign “weights” to search results and therefore order them statistically. Put another way, a search result’s score (“ranking”) is the product of TF and IDF:
TFIDF = TF * IDF where:
- TF = C / T where C = number of times a given word appears in a document and T = total number of words in a document
- IDF = D / DF where D = total number of documents in a corpus, and DF = total number of documents containing a given word
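Putting the two formulas together is a one-liner. The following sketch (Python standing in for the series’ Perl; the counts are copied from Table #2) reproduces individual cells of Table #4:

```python
# TFIDF = (C / T) * (D / DF), computed from raw word counts.
corpus = {
    "Document #1": {"airplane": 5, "blue": 1, "chair": 7, "computer": 3,
                    "forest": 2, "justice": 7, "love": 2, "might": 2,
                    "perl": 5, "rose": 6, "shoe": 4, "thesis": 2},
    "Document #2": {"book": 3, "car": 7, "chair": 4, "justice": 2,
                    "milton": 6, "newton": 3, "pond": 2, "rose": 5,
                    "shakespeare": 4, "slavery": 2, "thesis": 2, "truck": 1},
    "Document #3": {"building": 6, "car": 1, "carpet": 3, "ceiling": 4,
                    "chair": 6, "cleaning": 4, "justice": 8, "libraries": 2,
                    "newton": 2, "perl": 5, "rose": 7, "science": 1},
}

D = len(corpus)

def tfidf(word, doc):
    """Score one word in one document: (C / T) * (D / DF)."""
    counts = corpus[doc]
    tf = counts.get(word, 0) / sum(counts.values())          # C / T
    df = sum(word in words for words in corpus.values())     # DF
    return tf * (D / df) if df else 0.0

print(round(tfidf("milton", "Document #2"), 3))  # the table's 0.439
```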
Table #4 is a combination of all the previous tables with the addition of the TFIDF score for each term:
Document #1:

| Word | C | T | TF | D | DF | IDF | TFIDF |
|------|---|---|----|---|----|-----|-------|
| airplane | 5 | 46 | 0.109 | 3 | 1 | 3.0 | 0.326 |
| blue | 1 | 46 | 0.022 | 3 | 1 | 3.0 | 0.065 |
| chair | 7 | 46 | 0.152 | 3 | 3 | 1.0 | 0.152 |
| computer | 3 | 46 | 0.065 | 3 | 1 | 3.0 | 0.196 |
| forest | 2 | 46 | 0.043 | 3 | 1 | 3.0 | 0.130 |
| justice | 7 | 46 | 0.152 | 3 | 3 | 1.0 | 0.152 |
| love | 2 | 46 | 0.043 | 3 | 1 | 3.0 | 0.130 |
| might | 2 | 46 | 0.043 | 3 | 1 | 3.0 | 0.130 |
| perl | 5 | 46 | 0.109 | 3 | 2 | 1.5 | 0.163 |
| rose | 6 | 46 | 0.130 | 3 | 3 | 1.0 | 0.130 |
| shoe | 4 | 46 | 0.087 | 3 | 1 | 3.0 | 0.261 |
| thesis | 2 | 46 | 0.043 | 3 | 2 | 1.5 | 0.065 |

Document #2:

| Word | C | T | TF | D | DF | IDF | TFIDF |
|------|---|---|----|---|----|-----|-------|
| book | 3 | 41 | 0.073 | 3 | 1 | 3.0 | 0.220 |
| car | 7 | 41 | 0.171 | 3 | 2 | 1.5 | 0.256 |
| chair | 4 | 41 | 0.098 | 3 | 3 | 1.0 | 0.098 |
| justice | 2 | 41 | 0.049 | 3 | 3 | 1.0 | 0.049 |
| milton | 6 | 41 | 0.146 | 3 | 1 | 3.0 | 0.439 |
| newton | 3 | 41 | 0.073 | 3 | 2 | 1.5 | 0.110 |
| pond | 2 | 41 | 0.049 | 3 | 1 | 3.0 | 0.146 |
| rose | 5 | 41 | 0.122 | 3 | 3 | 1.0 | 0.122 |
| shakespeare | 4 | 41 | 0.098 | 3 | 1 | 3.0 | 0.293 |
| slavery | 2 | 41 | 0.049 | 3 | 1 | 3.0 | 0.146 |
| thesis | 2 | 41 | 0.049 | 3 | 2 | 1.5 | 0.073 |
| truck | 1 | 41 | 0.024 | 3 | 1 | 3.0 | 0.073 |

Document #3:

| Word | C | T | TF | D | DF | IDF | TFIDF |
|------|---|---|----|---|----|-----|-------|
| building | 6 | 49 | 0.122 | 3 | 1 | 3.0 | 0.367 |
| car | 1 | 49 | 0.020 | 3 | 2 | 1.5 | 0.031 |
| carpet | 3 | 49 | 0.061 | 3 | 1 | 3.0 | 0.184 |
| ceiling | 4 | 49 | 0.082 | 3 | 1 | 3.0 | 0.245 |
| chair | 6 | 49 | 0.122 | 3 | 3 | 1.0 | 0.122 |
| cleaning | 4 | 49 | 0.082 | 3 | 1 | 3.0 | 0.245 |
| justice | 8 | 49 | 0.163 | 3 | 3 | 1.0 | 0.163 |
| libraries | 2 | 49 | 0.041 | 3 | 1 | 3.0 | 0.122 |
| newton | 2 | 49 | 0.041 | 3 | 2 | 1.5 | 0.061 |
| perl | 5 | 49 | 0.102 | 3 | 2 | 1.5 | 0.153 |
| rose | 7 | 49 | 0.143 | 3 | 3 | 1.0 | 0.143 |
| science | 1 | 49 | 0.020 | 3 | 1 | 3.0 | 0.061 |
Given TFIDF, a search for “rose” still returns three documents, now ordered Document #3, #1, and #2. A search for “newton” returns only two items, ordered Document #2 (0.110) and Document #3 (0.061). In the latter case, Document #2 is nearly twice as “relevant” as Document #3. TFIDF scores can be summed to take into account Boolean unions (or) or intersections (and).
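That summing idea is what turns per-word scores into a ranked hit list. A small sketch (Python rather than the series’ Perl; the per-term scores are copied straight from Table #4) of a Boolean-union query:

```python
# Rank documents for a multi-word query by summing per-term TFIDF scores.
# The scores below are copied from Table #4 for a handful of terms.
scores = {
    "Document #1": {"rose": 0.130, "justice": 0.152, "perl": 0.163},
    "Document #2": {"rose": 0.122, "justice": 0.049, "newton": 0.110},
    "Document #3": {"rose": 0.143, "justice": 0.163, "newton": 0.061},
}

def rank(query):
    """Return documents matching any query term, best score first."""
    totals = {doc: sum(s.get(term, 0.0) for term in query)
              for doc, s in scores.items()}
    return sorted((doc for doc in totals if totals[doc] > 0),
                  key=totals.get, reverse=True)

print(rank(["rose"]))             # Document #3 first, #2 last
print(rank(["rose", "justice"]))  # the union query keeps the same order here
```

A Boolean intersection would simply drop documents missing any query term before sorting.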
Automatic classification
TFIDF can also be applied a priori, before any search is entered, to create browsable lists; hence, automatic classification. Consider Table #5, where each word is listed in sorted TFIDF order:
| Word (Doc #1) | TFIDF | Word (Doc #2) | TFIDF | Word (Doc #3) | TFIDF |
|---------------|-------|---------------|-------|---------------|-------|
| airplane | 0.326 | milton | 0.439 | building | 0.367 |
| shoe | 0.261 | shakespeare | 0.293 | ceiling | 0.245 |
| computer | 0.196 | car | 0.256 | cleaning | 0.245 |
| perl | 0.163 | book | 0.220 | carpet | 0.184 |
| chair | 0.152 | pond | 0.146 | justice | 0.163 |
| justice | 0.152 | slavery | 0.146 | perl | 0.153 |
| forest | 0.130 | rose | 0.122 | rose | 0.143 |
| love | 0.130 | newton | 0.110 | chair | 0.122 |
| might | 0.130 | chair | 0.098 | libraries | 0.122 |
| rose | 0.130 | thesis | 0.073 | newton | 0.061 |
| blue | 0.065 | truck | 0.073 | science | 0.061 |
| thesis | 0.065 | justice | 0.049 | car | 0.031 |
Given such a list it would be possible to take the first three terms from each document and call them the most significant subject “tags”. Thus, Document #1 is about airplanes, shoes, and computers. Document #2 is about Milton, Shakespeare, and cars. Document #3 is about buildings, ceilings, and cleaning.
Probably a better way to assign “aboutness” to each document is to first choose a TFIDF lower bound and then assign to each document the terms scoring above it. Assuming a lower bound of 0.2, Document #1 is about airplanes and shoes. Document #2 is about Milton, Shakespeare, cars, and books. Document #3 is about buildings, ceilings, and cleaning.
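The lower-bound approach can be sketched in a few lines (Python standing in for the series’ Perl; the scores are the top rows of Table #5, and the 0.2 cutoff is the one assumed above):

```python
# Automatic classification: keep the terms whose TFIDF score exceeds a
# chosen lower bound. Scores copied from the top of Table #5.
tfidf_scores = {
    "Document #1": {"airplane": 0.326, "shoe": 0.261, "computer": 0.196,
                    "perl": 0.163, "chair": 0.152},
    "Document #2": {"milton": 0.439, "shakespeare": 0.293, "car": 0.256,
                    "book": 0.220, "pond": 0.146},
    "Document #3": {"building": 0.367, "ceiling": 0.245, "cleaning": 0.245,
                    "carpet": 0.184, "justice": 0.163},
}

def tags(doc, lower_bound=0.2):
    """Return the subject 'tags' for a document, alphabetized."""
    return sorted(word for word, score in tfidf_scores[doc].items()
                  if score > lower_bound)

for doc in tfidf_scores:
    print(doc, tags(doc))
```

Raising or lowering the bound trades precision of the tags against how many documents get any tags at all.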
Discussion and conclusion
Since the beginning, librarianship has focused on the semantics of words in order to create a cosmos from an apparent chaos. “What is this work about? Read the descriptive information regarding a work (author, title, publisher, date, notes, etc.) to work out in your mind its importance.” Unfortunately, this approach leaves much up to interpretation. One person says this document is about horses, and the next person says it is about husbandry.
The mathematical approach is more objective and much more scalable. While not perfect, much less interpretation is required with TFIDF. It is just mathematics. Moreover, it is language independent; it is possible to weight terms and provide relevancy ranking without knowing the meaning of a single word in the index.
In actuality, the whole thing is not an either/or sort of question, but instead a both/and sort of question. Human interpretation definitely provides added value. At the same time, the application of mathematics (“Can you say ‘science’?”) proves to be quite useful too. The approaches complement each other; they are arscient. Much of how we have used computers in libraries has simply been to automate existing processes. We have yet to learn how to truly take advantage of a computer’s functionality. It can remember things a whole lot better than we can. It can add a whole lot faster than we can. Because of this it is almost trivial to calculate ( C / T ) * ( D / DF ) over an entire corpus of 2,000,000 MARC records or even 1,000,000 full text documents.
None of these ideas are new. It is possible to read articles describing these techniques going back about forty years. Why has our profession not used them to our advantage? Why is it taking us so long? If you have an answer, then enter it in the comment box below.
This first posting has focused on the fundamentals of TFIDF. Part II will describe a Perl program implementing relevancy ranking and automatic classification against sets of given text files. Part III will explore the idea of using TFIDF to enable users to find documents alluding to “great ideas” or “great people”.
April 14th, 2009 at 11:58 am
Hi, there:
I read your article with interest. Here are a few points worth mentioning:
1. IDF is defined as log(D/d), where D is the number of documents in a collection and d is the number of documents mentioning a given term, regardless of whether those documents are relevant to said term. The base of the log does not matter (it can be base 10, 2, etc.). The reason for taking logs is that most scoring functions in IR are assumed to be additive and terms are assumed to be independent from one another (even when this is often not exactly the case).
2. IDF is a measure of the discriminatory power of a term (term specificity), but it does not measure relevancy. Indeed, IDF is a term weight score in the absence of relevance information.
3. IDF is a small pixel in the bigger picture of the Robertson-Sparck Jones Probabilistic Model (RSJ-PM). A tutorial explaining this model is available at http://www.miislita.com.
4. With unstructured, unfocused, and generic collections at the scale of the Web (e.g., commercial search engines like Google), the stability of IDF, and thus its reliability as a scoring function, has been put into question by several authors.
Regards
Dr. Edel Garcia
http://www.miislita.com