TFIDF In Libraries: Part I of III (For Librarians)
This is the first of a three-part series called TFIDF In Libraries, where “relevancy ranking” will be introduced. In this part, term frequency/inverse document frequency (TFIDF), a common mathematical method of weighting terms for automatic classification and for sorting search results, will be described. Part II will illustrate an automatic classification system and simple search engine using TFIDF through a computer program written in Perl. Part III will explore the possibility of filtering search results by applying TFIDF against sets of pre-defined “Big Names” and/or “Big Ideas”, an idea apparently called “champion lists”.
The problem, straight Boolean logic
To many of us the phrase “relevancy ranked search results” is a mystery. What does it mean to be “relevant”? How can anybody determine relevance for me? Well, a better phrase might have been “statistically significant search results”. Taking such an approach, the application of statistical analysis against texts, does have its information retrieval advantages over straight Boolean logic. Take, for example, the following three documents, each consisting of a dozen words, Table #1:
| Document #1 | Document #2 | Document #3 |
|-------------|-------------|-------------|
| airplane | book | building |
| blue | car | car |
| chair | chair | carpet |
| computer | justice | ceiling |
| forest | milton | chair |
| justice | newton | cleaning |
| love | pond | justice |
| might | rose | libraries |
| perl | shakespeare | newton |
| rose | slavery | perl |
| shoe | thesis | rose |
| thesis | truck | science |
A search for “rose” against the corpus will return three hits, but which one should I start reading? The newest document? The document by a particular author or in a particular format? Even if the corpus contained 2,000,000 documents and a search for “rose” returned a mere 100, the problem would remain. Which ones should I spend my valuable time accessing? Yes, I could limit my search in any number of ways, but unless I am doing a known item search it is quite likely the search results will return more than I can use, and information literacy skills will only go so far. Ranked search results, a list of hits ordered by term weighting, have proven to be an effective way of addressing this problem. All it requires is the application of basic arithmetic against the documents being searched.
Simple counting
We can begin by counting the number of times each word appears in each of the documents, Table #2:
| Word (Doc #1) | C | Word (Doc #2) | C | Word (Doc #3) | C |
|---------------|---|---------------|---|---------------|---|
| airplane | 5 | book | 3 | building | 6 |
| blue | 1 | car | 7 | car | 1 |
| chair | 7 | chair | 4 | carpet | 3 |
| computer | 3 | justice | 2 | ceiling | 4 |
| forest | 2 | milton | 6 | chair | 6 |
| justice | 7 | newton | 3 | cleaning | 4 |
| love | 2 | pond | 2 | justice | 8 |
| might | 2 | rose | 5 | libraries | 2 |
| perl | 5 | shakespeare | 4 | newton | 2 |
| rose | 6 | slavery | 2 | perl | 5 |
| shoe | 4 | thesis | 2 | rose | 7 |
| thesis | 2 | truck | 1 | science | 1 |
| Totals (T) | 46 | Totals (T) | 41 | Totals (T) | 49 |
Given this simple counting method, searches for “rose” can be sorted by “term frequency” (TF): the quotient of the number of times a word appears in a given document (C) and the total number of words in that document (T), or TF = C / T. In Document #1, “rose” has a TF value of 0.13; in Document #2 it is 0.12; and in Document #3 it is 0.14. Thus, by this rudimentary analysis, Document #3 is the most significant in terms of the word “rose” and Document #2 the least: Document #3 has the highest proportion of its content devoted to the word “rose”.
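Part II will implement all of this in Perl; as a quick sketch in Python (with the counts for “rose” copied from Table #2), the term frequency calculation looks like this:

```python
# Term frequency: TF = C / T.
# (C, T) pairs for the word "rose", taken from Table #2.
counts = {
    "Document #1": (6, 46),
    "Document #2": (5, 41),
    "Document #3": (7, 49),
}

tf = {doc: c / t for doc, (c, t) in counts.items()}

for doc, score in tf.items():
    print(f"{doc}: TF = {score:.2f}")
```

Rounded to two places, the scores come out to 0.13, 0.12, and 0.14, matching the text above.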
Accounting for common words
Unfortunately, this simple analysis needs to be offset to account for terms occurring frequently across the entire corpus. Good examples are stop words or the word “human” in MEDLINE. Such words are nearly meaningless because they appear so often. Consider Table #3, which includes the number of documents in which each word is found (DF), and the quotient of the total number of documents (D, or in this case 3) and DF: IDF = D / DF. Words with higher scores are more significant across the entire corpus. Search terms whose IDF (“inverse document frequency”) score approaches 1 are close to useless because they exist in just about every document:
| Word (Doc #1) | DF | IDF | Word (Doc #2) | DF | IDF | Word (Doc #3) | DF | IDF |
|---------------|----|-----|---------------|----|-----|---------------|----|-----|
| airplane | 1 | 3.0 | book | 1 | 3.0 | building | 1 | 3.0 |
| blue | 1 | 3.0 | car | 2 | 1.5 | car | 2 | 1.5 |
| chair | 3 | 1.0 | chair | 3 | 1.0 | carpet | 1 | 3.0 |
| computer | 1 | 3.0 | justice | 3 | 1.0 | ceiling | 1 | 3.0 |
| forest | 1 | 3.0 | milton | 1 | 3.0 | chair | 3 | 1.0 |
| justice | 3 | 1.0 | newton | 2 | 1.5 | cleaning | 1 | 3.0 |
| love | 1 | 3.0 | pond | 1 | 3.0 | justice | 3 | 1.0 |
| might | 1 | 3.0 | rose | 3 | 1.0 | libraries | 1 | 3.0 |
| perl | 2 | 1.5 | shakespeare | 1 | 3.0 | newton | 2 | 1.5 |
| rose | 3 | 1.0 | slavery | 1 | 3.0 | perl | 2 | 1.5 |
| shoe | 1 | 3.0 | thesis | 2 | 1.5 | rose | 3 | 1.0 |
| thesis | 2 | 1.5 | truck | 1 | 3.0 | science | 1 | 3.0 |
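The IDF column can be computed directly from the word lists in Table #1. A minimal sketch (again in Python rather than the series’ Perl) using the article’s definition, IDF = D / DF:

```python
# Inverse document frequency as defined in this article: IDF = D / DF.
# Each set holds the distinct words of one document (from Table #1).
docs = [
    {"airplane", "blue", "chair", "computer", "forest", "justice",
     "love", "might", "perl", "rose", "shoe", "thesis"},
    {"book", "car", "chair", "justice", "milton", "newton",
     "pond", "rose", "shakespeare", "slavery", "thesis", "truck"},
    {"building", "car", "carpet", "ceiling", "chair", "cleaning",
     "justice", "libraries", "newton", "perl", "rose", "science"},
]

D = len(docs)  # total number of documents in the corpus

def idf(word):
    """Return D / DF, where DF counts the documents containing word."""
    df = sum(word in doc for doc in docs)
    return D / df

print(idf("rose"))    # in every document, so close to useless
print(idf("milton"))  # in only one document, so highly discriminating
```

Note that “rose”, “chair”, and “justice” all bottom out at 1.0, exactly the near-useless case described above.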
Term frequency/inverse document frequency (TFIDF)
By taking into account these two factors, term frequency (TF) and inverse document frequency (IDF), it is possible to assign “weights” to search results and therefore order them statistically. Put another way, a search result’s score (“ranking”) is the product of TF and IDF:
TFIDF = TF * IDF where:
- TF = C / T where C = number of times a given word appears in a document and T = total number of words in a document
- IDF = D / DF where D = total number of documents in a corpus, and DF = total number of documents containing a given word
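Putting the two formulas together is a one-liner. The following sketch (Python standing in for the series’ Perl; the counts are copied from Table #2) reproduces individual cells of Table #4:

```python
# TFIDF = (C / T) * (D / DF), computed from raw word counts.
corpus = {
    "Document #1": {"airplane": 5, "blue": 1, "chair": 7, "computer": 3,
                    "forest": 2, "justice": 7, "love": 2, "might": 2,
                    "perl": 5, "rose": 6, "shoe": 4, "thesis": 2},
    "Document #2": {"book": 3, "car": 7, "chair": 4, "justice": 2,
                    "milton": 6, "newton": 3, "pond": 2, "rose": 5,
                    "shakespeare": 4, "slavery": 2, "thesis": 2, "truck": 1},
    "Document #3": {"building": 6, "car": 1, "carpet": 3, "ceiling": 4,
                    "chair": 6, "cleaning": 4, "justice": 8, "libraries": 2,
                    "newton": 2, "perl": 5, "rose": 7, "science": 1},
}

D = len(corpus)

def tfidf(word, doc):
    """Score one word in one document: (C / T) * (D / DF)."""
    counts = corpus[doc]
    tf = counts.get(word, 0) / sum(counts.values())          # C / T
    df = sum(word in words for words in corpus.values())     # DF
    return tf * (D / df) if df else 0.0

print(round(tfidf("milton", "Document #2"), 3))  # the table's 0.439
```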
Table #4 is a combination of all the previous tables with the addition of the TFIDF score for each term:
Document #1:

| Word | C | T | TF | D | DF | IDF | TFIDF |
|------|---|---|----|---|----|-----|-------|
| airplane | 5 | 46 | 0.109 | 3 | 1 | 3.0 | 0.326 |
| blue | 1 | 46 | 0.022 | 3 | 1 | 3.0 | 0.065 |
| chair | 7 | 46 | 0.152 | 3 | 3 | 1.0 | 0.152 |
| computer | 3 | 46 | 0.065 | 3 | 1 | 3.0 | 0.196 |
| forest | 2 | 46 | 0.043 | 3 | 1 | 3.0 | 0.130 |
| justice | 7 | 46 | 0.152 | 3 | 3 | 1.0 | 0.152 |
| love | 2 | 46 | 0.043 | 3 | 1 | 3.0 | 0.130 |
| might | 2 | 46 | 0.043 | 3 | 1 | 3.0 | 0.130 |
| perl | 5 | 46 | 0.109 | 3 | 2 | 1.5 | 0.163 |
| rose | 6 | 46 | 0.130 | 3 | 3 | 1.0 | 0.130 |
| shoe | 4 | 46 | 0.087 | 3 | 1 | 3.0 | 0.261 |
| thesis | 2 | 46 | 0.043 | 3 | 2 | 1.5 | 0.065 |

Document #2:

| Word | C | T | TF | D | DF | IDF | TFIDF |
|------|---|---|----|---|----|-----|-------|
| book | 3 | 41 | 0.073 | 3 | 1 | 3.0 | 0.220 |
| car | 7 | 41 | 0.171 | 3 | 2 | 1.5 | 0.256 |
| chair | 4 | 41 | 0.098 | 3 | 3 | 1.0 | 0.098 |
| justice | 2 | 41 | 0.049 | 3 | 3 | 1.0 | 0.049 |
| milton | 6 | 41 | 0.146 | 3 | 1 | 3.0 | 0.439 |
| newton | 3 | 41 | 0.073 | 3 | 2 | 1.5 | 0.110 |
| pond | 2 | 41 | 0.049 | 3 | 1 | 3.0 | 0.146 |
| rose | 5 | 41 | 0.122 | 3 | 3 | 1.0 | 0.122 |
| shakespeare | 4 | 41 | 0.098 | 3 | 1 | 3.0 | 0.293 |
| slavery | 2 | 41 | 0.049 | 3 | 1 | 3.0 | 0.146 |
| thesis | 2 | 41 | 0.049 | 3 | 2 | 1.5 | 0.073 |
| truck | 1 | 41 | 0.024 | 3 | 1 | 3.0 | 0.073 |

Document #3:

| Word | C | T | TF | D | DF | IDF | TFIDF |
|------|---|---|----|---|----|-----|-------|
| building | 6 | 49 | 0.122 | 3 | 1 | 3.0 | 0.367 |
| car | 1 | 49 | 0.020 | 3 | 2 | 1.5 | 0.031 |
| carpet | 3 | 49 | 0.061 | 3 | 1 | 3.0 | 0.184 |
| ceiling | 4 | 49 | 0.082 | 3 | 1 | 3.0 | 0.245 |
| chair | 6 | 49 | 0.122 | 3 | 3 | 1.0 | 0.122 |
| cleaning | 4 | 49 | 0.082 | 3 | 1 | 3.0 | 0.245 |
| justice | 8 | 49 | 0.163 | 3 | 3 | 1.0 | 0.163 |
| libraries | 2 | 49 | 0.041 | 3 | 1 | 3.0 | 0.122 |
| newton | 2 | 49 | 0.041 | 3 | 2 | 1.5 | 0.061 |
| perl | 5 | 49 | 0.102 | 3 | 2 | 1.5 | 0.153 |
| rose | 7 | 49 | 0.143 | 3 | 3 | 1.0 | 0.143 |
| science | 1 | 49 | 0.020 | 3 | 1 | 3.0 | 0.061 |
Given TFIDF, a search for “rose” still returns three documents, now ordered Document #3, #1, and #2. A search for “newton” returns only two items, ordered Document #2 (0.110) and Document #3 (0.061). In the latter case, Document #2 is nearly twice as “relevant” as Document #3. TFIDF scores can be summed to take into account Boolean unions (or) or intersections (and).
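That summing idea is what turns per-word scores into a ranked hit list. A small sketch (Python rather than the series’ Perl; the per-term scores are copied straight from Table #4) of a Boolean-union query:

```python
# Rank documents for a multi-word query by summing per-term TFIDF scores.
# The scores below are copied from Table #4 for a handful of terms.
scores = {
    "Document #1": {"rose": 0.130, "justice": 0.152, "perl": 0.163},
    "Document #2": {"rose": 0.122, "justice": 0.049, "newton": 0.110},
    "Document #3": {"rose": 0.143, "justice": 0.163, "newton": 0.061},
}

def rank(query):
    """Return documents matching any query term, best score first."""
    totals = {doc: sum(s.get(term, 0.0) for term in query)
              for doc, s in scores.items()}
    return sorted((doc for doc in totals if totals[doc] > 0),
                  key=totals.get, reverse=True)

print(rank(["rose"]))             # Document #3 first, #2 last
print(rank(["rose", "justice"]))  # the union query keeps the same order here
```

A Boolean intersection would simply drop documents missing any query term before sorting.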
Automatic classification
TFIDF can also be applied a priori, before any search is entered, to create browsable lists; hence, automatic classification. Consider Table #5, where each word is listed in sorted TFIDF order:
| Word (Doc #1) | TFIDF | Word (Doc #2) | TFIDF | Word (Doc #3) | TFIDF |
|---------------|-------|---------------|-------|---------------|-------|
| airplane | 0.326 | milton | 0.439 | building | 0.367 |
| shoe | 0.261 | shakespeare | 0.293 | ceiling | 0.245 |
| computer | 0.196 | car | 0.256 | cleaning | 0.245 |
| perl | 0.163 | book | 0.220 | carpet | 0.184 |
| chair | 0.152 | pond | 0.146 | justice | 0.163 |
| justice | 0.152 | slavery | 0.146 | perl | 0.153 |
| forest | 0.130 | rose | 0.122 | rose | 0.143 |
| love | 0.130 | newton | 0.110 | chair | 0.122 |
| might | 0.130 | chair | 0.098 | libraries | 0.122 |
| rose | 0.130 | thesis | 0.073 | newton | 0.061 |
| blue | 0.065 | truck | 0.073 | science | 0.061 |
| thesis | 0.065 | justice | 0.049 | car | 0.031 |
Given such a list it would be possible to take the first three terms from each document and call them the most significant subject “tags”. Thus, Document #1 is about airplanes, shoes, and computers. Document #2 is about Milton, Shakespeare, and cars. Document #3 is about buildings, ceilings, and cleaning.
Probably a better way to assign “aboutness” to each document is to first choose a TFIDF lower bound and then assign to each document the terms scoring above it. Assuming a lower bound of 0.2, Document #1 is about airplanes and shoes. Document #2 is about Milton, Shakespeare, cars, and books. Document #3 is about buildings, ceilings, and cleaning.
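The lower-bound approach can be sketched in a few lines (Python standing in for the series’ Perl; the scores are the top rows of Table #5, and the 0.2 cutoff is the one assumed above):

```python
# Automatic classification: keep the terms whose TFIDF score exceeds a
# chosen lower bound. Scores copied from the top of Table #5.
tfidf_scores = {
    "Document #1": {"airplane": 0.326, "shoe": 0.261, "computer": 0.196,
                    "perl": 0.163, "chair": 0.152},
    "Document #2": {"milton": 0.439, "shakespeare": 0.293, "car": 0.256,
                    "book": 0.220, "pond": 0.146},
    "Document #3": {"building": 0.367, "ceiling": 0.245, "cleaning": 0.245,
                    "carpet": 0.184, "justice": 0.163},
}

def tags(doc, lower_bound=0.2):
    """Return the subject 'tags' for a document, alphabetized."""
    return sorted(word for word, score in tfidf_scores[doc].items()
                  if score > lower_bound)

for doc in tfidf_scores:
    print(doc, tags(doc))
```

Raising or lowering the bound trades precision of the tags against how many documents get any tags at all.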
Discussion and conclusion
Since the beginning, librarianship has focused on the semantics of words in order to create a cosmos from an apparent chaos. “What is this work about? Read the descriptive information regarding a work (author, title, publisher, date, notes, etc.) to work out in your mind its importance.” Unfortunately, this approach leaves much up to interpretation. One person says this document is about horses, and the next person says it is about husbandry.
The mathematical approach is more objective and much more scalable. While not perfect, much less interpretation is required with TFIDF. It is just mathematics. Moreover, it is language independent; it is possible to weight terms and provide relevancy ranking without knowing the meaning of a single word in the index.
In actuality, the whole thing is not an either/or sort of question, but instead a both/and sort of question. Human interpretation definitely provides added value. At the same time, the application of mathematics (“Can you say ‘science’?”) proves to be quite useful too. The approaches complement each other; they are arscient. Much of how we have used computers in libraries has simply been to automate existing processes. We have yet to learn how to truly take advantage of a computer’s functionality. It can remember things a whole lot better than we can. It can add a whole lot faster than we can. Because of this it is almost trivial to calculate ( C / T ) * ( D / DF ) over an entire corpus of 2,000,000 MARC records or even 1,000,000 full text documents.
None of these ideas are new. It is possible to read articles describing these techniques going back about forty years. Why has our profession not used them to our advantage? Why is it taking us so long? If you have an answer, then enter it in the comment box below.
This first posting has focused on the fundamentals of TFIDF. Part II will describe a Perl program implementing relevancy ranking and automatic classification against sets of given text files. Part III will explore the idea of using TFIDF to enable users to find documents alluding to “great ideas” or “great people”.
April 14th, 2009 at 11:58 am
Hi, there:
I read your article with interest. Here are a few points worth mentioning:
1. IDF is defined as log(D/d), where D is the number of documents in a collection and d is the number of documents mentioning a given term, regardless of whether those documents are relevant to said term. The base of the log does not matter (it can be base 10, 2, etc.). The reason for taking logs is that most scoring functions in IR are assumed to be additive and terms are assumed to be independent from one another (even when this is often not exactly the case).
2. IDF is a measure of the discriminatory power of a term (term specificity), but it does not measure relevancy. Indeed, IDF is a term weight score in the absence of relevance information.
3. IDF is a small pixel in the bigger picture of the Robertson-Sparck Jones Probabilistic Model (RSJ-PM). A tutorial explaining this model is available at http://www.miislita.com.
4. With unstructured, unfocused, and generic collections at the scale of the Web (e.g., commercial search engines like Google), the stability of IDF, and thus its reliability as a scoring function, has been put into question by several authors.
Regards
Dr. Edel Garcia
http://www.miislita.com