MLKD - Learning From Multi-Label Data

Machine Learning &
Knowledge Discovery Group

Learning from Multi-Label Data

Introduction

Traditional single-label classification is concerned with learning from a set of examples, each of which is associated with a single label l from a set of disjoint labels L, |L| > 1. In multi-label classification, each example is associated with a set of labels Y ⊆ L. In the past, multi-label classification was mainly motivated by the tasks of text categorization and medical diagnosis. Nowadays, multi-label classification methods are increasingly required by modern applications such as protein function classification, music categorization and semantic scene classification.
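
To make the distinction concrete, the following is a minimal sketch of how a set of labels Y ⊆ L per example can be reduced to |L| single-label (binary) problems, a decomposition usually called binary relevance in the literature. The class name, method name and toy labels are hypothetical and chosen only for illustration; this is not Mulan code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: turns per-example label sets into one binary target per label. */
final class BinaryRelevanceSketch {

    /**
     * For each label in allLabels, builds a boolean column with one entry per example:
     * true if that example's label set Y contains the label, false otherwise.
     */
    static Map<String, boolean[]> toBinaryTargets(List<String> allLabels,
                                                  List<Set<String>> labelSets) {
        Map<String, boolean[]> targets = new LinkedHashMap<>();
        for (String label : allLabels) {
            boolean[] column = new boolean[labelSets.size()];
            for (int i = 0; i < labelSets.size(); i++) {
                column[i] = labelSets.get(i).contains(label);
            }
            targets.put(label, column);
        }
        return targets;
    }

    public static void main(String[] args) {
        // Hypothetical label set L and three examples with their label sets Y.
        List<String> labels = List.of("beach", "sunset", "urban");
        List<Set<String>> y = List.of(
                Set.of("beach", "sunset"),
                Set.of("urban"),
                Set.of("beach"));

        // Prints one boolean column per label, in the order of L.
        toBinaryTargets(labels, y).forEach((label, column) ->
                System.out.println(label + " -> " + Arrays.toString(column)));
    }
}
```

Each resulting column can be handed to any single-label binary classifier. The sketch ignores correlations between labels, which is the main limitation of this decomposition.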

Mulan: An Open Source Library for Multi-Label Learning

We have developed and are constantly enriching a Java library for multi-label learning, called Mulan. Mulan contains several problem transformation and algorithm adaptation methods for multi-label classification and ranking, an evaluation framework that computes several multi-label classification evaluation measures, and a class providing dataset statistics. It also contains an algorithm and support for hierarchical multi-label classification. Mulan is built on top of Weka and therefore utilizes its award-winning code base. It is open source and distributed under the GNU GPL licence. Please contact Grigorios Tsoumakas for bug reports, comments, suggestions or requests for help with the library.

Mulan is hosted at SourceForge, so you can grab the latest releases from there, as well as the latest development source code from the project's public SVN repository.

There is a Wiki that serves as a manual for Mulan. API documentation is included with each release. The API documentation for the latest release is also available online.
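
As an illustration of what a multi-label evaluation measure looks like, here is a small self-contained sketch of Hamming loss, a measure commonly used in the multi-label literature: the fraction of example-label pairs on which the prediction disagrees with the ground truth. The class and method names are hypothetical, and the code does not use or reproduce Mulan's evaluation framework.

```java
/** Illustrative only: Hamming loss over boolean label-indicator matrices. */
final class HammingLossSketch {

    /**
     * truth[i][j] is true if example i really has label j;
     * predicted[i][j] is true if the classifier assigned label j to example i.
     * Returns the fraction of (example, label) pairs where the two disagree.
     */
    static double hammingLoss(boolean[][] truth, boolean[][] predicted) {
        if (truth.length == 0 || truth.length != predicted.length) {
            throw new IllegalArgumentException("need the same non-zero number of examples");
        }
        long mismatches = 0;
        long decisions = 0;
        for (int i = 0; i < truth.length; i++) {
            if (truth[i].length != predicted[i].length) {
                throw new IllegalArgumentException("label count differs for example " + i);
            }
            for (int j = 0; j < truth[i].length; j++) {
                if (truth[i][j] != predicted[i][j]) {
                    mismatches++;
                }
                decisions++;
            }
        }
        return (double) mismatches / decisions;
    }

    public static void main(String[] args) {
        // Hypothetical toy data: 2 examples, 3 labels.
        boolean[][] truth     = {{true, false, true}, {false, true, false}};
        boolean[][] predicted = {{true, true,  true}, {false, true, true }};
        // Two of the six label decisions are wrong.
        System.out.println("Hamming loss = " + hammingLoss(truth, predicted));
    }
}
```

Lower values are better; a value of 0 means every label decision matches the ground truth.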

Datasets

This is a collection of multi-label datasets, formatted for use with Mulan. We first provide a table with some statistics of the datasets, followed by the actual files and their sources.

Statistics

name	domain	instances	nominal attributes	numeric attributes	labels	cardinality	density	distinct
delicious	text (web)	16105	500	0	983	19.020	0.019	15806
emotions	music	593	0	72	6	1.869	0.311	27
genbase	biology	662	1186	0	27	1.252	0.046	32
mediamill	multimedia	43907	0	120	101	4.376	0.043	6555
rcv1v2 (subset1)	text	6000	0	47236	101	2.880	0.029	1028
rcv1v2 (subset2)	text	6000	0	47236	101	2.634	0.026	954
rcv1v2 (subset3)	text	6000	0	47236	101	2.614	0.026	939
rcv1v2 (subset4)	text	6000	0	47229	101	2.484	0.025	816
rcv1v2 (subset5)	text	6000	0	47235	101	2.642	0.026	946
scene	multimedia	2407	0	294	6	1.074	0.179	15
tmc2007	text	28596	49060	0	22	2.158	0.098	1341
yeast	biology	2417	0	103	14	4.237	0.303	198
bibtex	text	7395	1836	0	159	2.402	0.015	2856
bookmarks	text	87856	2150	0	208	2.028	0.010	18716
enron	text	1702	1001	0	53	3.378	0.064	753
medical	text	978	1449	0	45	1.245	0.028	94
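
The last three columns can be reproduced from the label matrix of a dataset. The sketch below assumes the usual definitions: cardinality is the average number of labels per instance, density is cardinality divided by the number of labels (for example, 1.869 / 6 ≈ 0.311 for emotions), and distinct is the number of different label sets that occur in the data. The class name and toy data are hypothetical, and this is not the statistics class shipped with Mulan.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative only: label statistics for a dataset given as a non-empty,
 * rectangular boolean matrix (rows = instances, columns = labels).
 */
final class LabelStats {

    /** Average number of labels per instance. */
    static double cardinality(boolean[][] labels) {
        long total = 0;
        for (boolean[] row : labels) {
            for (boolean on : row) {
                if (on) {
                    total++;
                }
            }
        }
        return (double) total / labels.length;
    }

    /** Cardinality divided by the number of labels |L|. */
    static double density(boolean[][] labels) {
        return cardinality(labels) / labels[0].length;
    }

    /** Number of different label sets that occur at least once. */
    static int distinctLabelSets(boolean[][] labels) {
        Set<String> seen = new HashSet<>();
        for (boolean[] row : labels) {
            seen.add(Arrays.toString(row)); // string form serves as a set key
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Hypothetical toy data: 4 instances, 3 labels.
        boolean[][] y = {
                {true,  false, true },
                {true,  false, false},
                {true,  false, true },
                {false, true,  true }
        };
        System.out.println("cardinality = " + cardinality(y));
        System.out.println("density     = " + density(y));
        System.out.println("distinct    = " + distinctLabelSets(y));
    }
}
```

For the toy matrix, 7 labels are set across 4 instances, so cardinality is 1.75, density is 1.75 / 3, and 3 distinct label sets occur.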

Files and Sources

delicious
files (sparse): [delicious-train.rar] [delicious-test.rar]
source: G. Tsoumakas, I. Katakis, I. Vlahavas, “Effective and Efficient Multilabel Classification in Domains with Large Number of Labels”, Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD'08), Antwerp, Belgium, 2008.

emotions
files: [emotions.rar] [emotions-train.rar] [emotions-test.rar] [emotions.xml]
source: K. Trohidis, G. Tsoumakas, G. Kalliris, I. Vlahavas. "Multilabel Classification of Music into Emotions". Proc. 9th International Conference on Music Information Retrieval (ISMIR 2008), pp. 325-330, Philadelphia, PA, USA, 2008.

genbase
files: [genbase.rar] [genbase-train.rar] [genbase-test.rar]
source: S. Diplaris, G. Tsoumakas, P. Mitkas and I. Vlahavas. Protein Classification with Multiple Algorithms, Proc. 10th Panhellenic Conference on Informatics (PCI 2005), pp. 448-456, Volos, Greece, November 2005.

mediamill
files: [mediamill.rar] [mediamill-train.rar] [mediamill-test.rar] [mediamill.xml]
source: C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, and A.W.M. Smeulders. The Challenge Problem for Automated Detection of 101 Semantic Concepts in Multimedia. In Proceedings of ACM Multimedia, pp. 421-430, Santa Barbara, USA, October 2006.
related URL: The Mediamill challenge

rcv1v2 subsets
files (sparse):
[rcv1subset1.rar] [rcv1subset1-train.rar] [rcv1subset1-test.rar]
[rcv1subset2.rar] [rcv1subset2-train.rar] [rcv1subset2-test.rar]
[rcv1subset3.rar] [rcv1subset3-train.rar] [rcv1subset3-test.rar]
[rcv1subset4.rar] [rcv1subset4-train.rar] [rcv1subset4-test.rar]
[rcv1subset5.rar] [rcv1subset5-train.rar] [rcv1subset5-test.rar]
source: David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.

scene
files: [scene.rar] [scene-train.rar] [scene-test.rar] [scene.xml]
source: M.R. Boutell, J. Luo, X. Shen, and C.M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757-1771, 2004.

tmc2007
files (sparse): [tmc2007.rar] [tmc2007-train.rar] [tmc2007-test.rar] [tmc2007.xml]
A reduced version of this dataset, with the top 500 features retained after feature selection, as used in [3], is also available:
files: [tmc-2007-500-train.rar] [tmc-2007-500-test.rar]
source: A. Srivastava, B. Zane-Ulman: Discovering recurring anomalies in text reports regarding complex space systems. In: 2005 IEEE Aerospace Conference. (2005)
related URL: SIAM Text Mining Workshop 2007

yeast
files: [yeast.rar] [yeast-train.rar] [yeast-test.rar] [yeast.xml]
source: A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In T.G. Dietterich, S. Becker, and Z. Ghahramani, (eds), Advances in Neural Information Processing Systems 14, 2002.

bibtex
files (sparse): [bibtex.rar] [bibtex-train.rar] [bibtex-test.rar]
source: I. Katakis, G. Tsoumakas, I. Vlahavas, “Multilabel Text Classification for Automated Tag Suggestion”, Proceedings of the ECML/PKDD 2008 Discovery Challenge, Antwerp, Belgium, 2008.

bookmarks
files (sparse): [bookmarks.rar]
source: I. Katakis, G. Tsoumakas, I. Vlahavas, “Multilabel Text Classification for Automated Tag Suggestion”, Proceedings of the ECML/PKDD 2008 Discovery Challenge, Antwerp, Belgium, 2008.

enron
files: [enron-train.rar] [enron-test.rar] [enron.xml]
sources: a) Jesse Read's Web Page, b) UC Berkeley Enron Email Analysis Project

medical
files: [medical-train.rar] [medical-test.rar] [medical.xml]
sources: a) Jesse Read's Web Page, b) Computational Medicine Center's 2007 Medical Natural Language Processing Challenge

Publications

G. Tsoumakas, I. Katakis, I. Vlahavas, "A Review of Multi-Label Classification Methods", in: Proceedings of the 2nd ADBIS Workshop on Data Mining and Knowledge Discovery (ADMKD 2006), pp. 99-109, September 2006, Thessaloniki, Greece.
G. Tsoumakas, I. Katakis, "Multi-Label Classification: An Overview", International Journal of Data Warehousing and Mining, 3(3):1-13, 2007.
G. Tsoumakas, I. Vlahavas, "Random k-Labelsets: An Ensemble Method for Multilabel Classification", Proc. 18th European Conference on Machine Learning (ECML 2007), pp. 406-417, Warsaw, Poland, 17-21 September 2007.
K. Trohidis, G. Tsoumakas, G. Kalliris, I. Vlahavas. "Multilabel Classification of Music into Emotions". Proc. 9th International Conference on Music Information Retrieval (ISMIR 2008), pp. 325-330, Philadelphia, PA, USA, 2008.
E. Spyromitros, G. Tsoumakas, I. Vlahavas, “An Empirical Study of Lazy Multilabel Classification Algorithms”, Proc. 5th Hellenic Conference on Artificial Intelligence (SETN 2008), Springer, Syros, Greece, 2008.
G. Tsoumakas, I. Katakis, I. Vlahavas, “Effective and Efficient Multilabel Classification in Domains with Large Number of Labels”, Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD'08), Antwerp, Belgium, 2008.
I. Katakis, G. Tsoumakas, I. Vlahavas, “Multilabel Text Classification for Automated Tag Suggestion”, Proceedings of the ECML/PKDD 2008 Discovery Challenge, Antwerp, Belgium, 2008.
A. Dimou, G. Tsoumakas, V. Mezaris, I. Kompatsiaris, I. Vlahavas, “An Empirical Study of Multi-Label Learning Methods for Video Annotation”, 7th International Workshop on Content-Based Multimedia Indexing, IEEE, Chania, Crete, 2009.
- [cbmi09-bow.rar] [cbmi09-mpeg.rar]
G. Tsoumakas, I. Katakis, I. Vlahavas, "Mining Multi-label Data", Data Mining and Knowledge Discovery Handbook (draft of preliminary accepted chapter), O. Maimon, L. Rokach (Eds.), Springer, 2nd edition, 2009.

Bibliography

Have a look at our new online multi-label learning bibliography at CiteULike (100 papers as of September 2009). It is much more useful, as you can grab BibTeX and RIS records, subscribe to the corresponding RSS feed, follow links to the full PDFs of the papers (which may require access to digital libraries) and export the complete bibliography for use with BibTeX or EndNote (which requires a CiteULike account).
