A Brief Survey of Text Mining

The enormous amount of information stored in unstructured texts cannot simply be used for further processing by computers, which typically handle text as simple sequences of character strings. Therefore, specific (pre-)processing methods and algorithms are required in order to extract useful patterns. Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text. In this article, we discuss text mining as a young and interdisciplinary field in the intersection of the related areas information retrieval, machine learning, statistics, computational linguistics and especially data mining. We describe the main analysis tasks preprocessing, classification, clustering, information extraction and visualization. In addition, we briefly discuss a number of successful applications of text mining.


Introduction
As computer networks become the backbones of science and economy, enormous quantities of machine-readable documents become available. There are estimates that 85% of business information lives in the form of text (TMS05 2005). Unfortunately, the usual logic-based programming paradigm has great difficulties in capturing the fuzzy and often ambiguous relations in text documents. Text mining aims at disclosing the concealed information by means of methods which on the one hand are able to cope with the large number of words and structures in natural language and on the other hand can handle vagueness, uncertainty and fuzziness.
In this paper we describe text mining as a truly interdisciplinary method drawing on information retrieval, machine learning, statistics, computational linguistics and especially data mining. We first give a short sketch of these methods and then define text mining in relation to them. Later sections survey state-of-the-art approaches for the main analysis tasks preprocessing, classification, clustering, information extraction and visualization. The last section exemplifies text mining in the context of a number of successful applications.

Knowledge Discovery
In the literature we can find different definitions of the terms knowledge discovery or knowledge discovery in databases (KDD) and data mining. In order to distinguish data mining from KDD we define KDD according to Fayyad as follows (Fayyad et al. 1996): Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
The analysis of data in KDD aims at finding hidden patterns and connections in these data. By data we understand a quantity of facts, which can be, for instance, data in a database, but also data in a simple text file. Characteristics that can be used to measure the quality of the patterns found in the data are comprehensibility for humans, validity in the context of given statistical measures, novelty and usefulness. Furthermore, different methods are able not only to discover new patterns but also to produce generalized models which represent the connections found. In this context, the expression "potentially useful" means that the patterns to be found for an application generate a benefit for the user. Thus the definition couples knowledge discovery with a specific application.
Knowledge discovery in databases is a process that is defined by several processing steps that have to be applied to a data set of interest in order to extract useful patterns. These steps have to be performed iteratively, and several steps usually require interactive feedback from a user. As defined by the CRoss Industry Standard Process for Data Mining (Crisp DM; homepage http://www.crisp-dm.org/, accessed May 2005) model (crispdm and CRISP99 1999), the main steps are: (1) business understanding, (2) data understanding, (3) data preparation, (4) modelling, (5) evaluation, (6) deployment (cf. fig. 1). Business understanding could be defined as understanding the problem we need to solve; in the context of text mining, for example, that we are looking for groups of similar documents in a given document collection. Besides the initial problem of analyzing and understanding the overall task (first two steps), one of the most time consuming steps is data preparation. This is especially of interest for text mining, which needs special preprocessing methods to convert textual data into a format which is suitable for data mining algorithms. The application of data mining algorithms in the modelling step, the evaluation of the obtained model and the deployment of the application (if necessary) close the process cycle. Here the modelling step is of main interest, as text mining frequently requires the development of new or the adaptation of existing algorithms.
Research in the area of data mining and knowledge discovery is still in a state of great flux. One indicator for this is the sometimes confusing use of terms. On the one hand, data mining is used as a synonym for KDD, meaning that data mining contains all aspects of the knowledge discovery process. This definition is particularly common in practice and frequently makes it difficult to distinguish the terms clearly. The second way of looking at it considers data mining as part of the KDD process (see Fayyad et al. (1996)) and describes the modelling phase, i.e. the application of algorithms and methods for the calculation of the searched patterns or models. Other authors, for instance Kumar & Joshi (2003), additionally consider data mining as the search for valuable information in large quantities of data. In this article, we equate data mining with the modelling phase of the KDD process.
The roots of data mining lie in diverse areas of research, which underlines the interdisciplinary character of this field. In the following we briefly discuss the relations to three of the addressed research areas: databases, machine learning and statistics.
Databases are necessary in order to analyze large quantities of data efficiently. In this connection, a database is not only the medium for consistent storage and access, but has itself become a subject of research, since the analysis of the data with data mining algorithms can be supported by databases and thus the use of database technology in the data mining process might be useful. An overview of data mining from the database perspective can be found in Chen et al. (1996).
Machine Learning (ML) is an area of artificial intelligence concerned with the development of techniques which allow computers to "learn" by the analysis of data sets. The focus of most machine learning methods is on symbolic data. ML is also concerned with the algorithmic complexity of computational implementations. Mitchell presents many of the commonly used ML methods in Mitchell (1997).
Statistics has its roots in mathematics and deals with the science and practice of analyzing empirical data. It is based on statistical theory, which is a branch of applied mathematics. Within statistical theory, randomness and uncertainty are modelled by probability theory. Today many methods of statistics are used in the field of KDD. Good overviews are given in Hastie et al. (2001); Berthold & Hand (1999); Maitra (2002).

Definition of Text Mining
Text mining or knowledge discovery from text (KDT), first mentioned in Feldman & Dagan (1995), deals with the machine-supported analysis of text. It uses techniques from information retrieval, information extraction as well as natural language processing (NLP) and connects them with the algorithms and methods of KDD, data mining, machine learning and statistics. Thus, one selects a procedure similar to the KDD process, whereby not data in general but text documents are the focus of the analysis. From this, new questions for the data mining methods used arise. One problem is that we now have to deal with data sets that are unstructured from the data modelling perspective.
If we try to define text mining, we can refer to related research areas. For each of them, we can give a different definition of text mining, which is motivated by the specific perspective of the area:
Text Mining = Information Extraction. The first approach assumes that text mining essentially corresponds to information extraction (cf. section 3.3), the extraction of facts from texts.
Text Mining = Text Data Mining. Text mining can also be defined, similar to data mining, as the application of algorithms and methods from the fields of machine learning and statistics to texts with the goal of finding useful patterns. For this purpose it is necessary to pre-process the texts accordingly. Many authors use information extraction methods, natural language processing or some simple preprocessing steps in order to extract data from texts. Data mining algorithms can then be applied to the extracted data (see Nahm & Mooney (2002); Gaizauskas (2003)).
Text Mining = KDD Process. Following the knowledge discovery process model (crispdm and CRISP99 1999), we frequently find in the literature text mining described as a process with a series of partial steps, among them information extraction as well as the use of data mining or statistical procedures. Hearst summarizes this in Hearst (1999) in a general manner as the extraction of not yet discovered information in large collections of texts. Kodratoff (1999) and Gomez in Hidalgo (2002) also consider text mining as a process-oriented approach on texts.
In this article, we consider text mining mainly as text data mining. Thus, our focus is on methods that extract useful patterns from texts in order to, e.g., categorize or structure text collections or to extract useful information.

Related Research Areas
Current research in the area of text mining tackles problems of text representation, classification, clustering, information extraction or the search for and modelling of hidden patterns. In this context the selection of characteristics and also the influence of domain knowledge and domain-specific procedures play an important role. Therefore, an adaptation of the known data mining algorithms to text data is usually necessary. In order to achieve this, one frequently relies on the experience and results of research in information retrieval, natural language processing and information extraction. In all of these areas we also apply data mining methods and statistics to handle their specific tasks:
Information Retrieval (IR). Information retrieval is the finding of documents which contain answers to questions and not the finding of answers itself (Hearst 1999). In order to achieve this goal, statistical measures and methods are used for the automatic processing of text data and comparison to the given question.
Information retrieval in the broader sense deals with the entire range of information processing, from data retrieval to knowledge retrieval (see Sparck-Jones & Willett (1997) for an overview). Although information retrieval is a relatively old research area, where first attempts at automatic indexing were made in 1975 (Salton et al. 1975), it gained increased attention with the rise of the World Wide Web and the need for sophisticated search engines.
Even though the definition of information retrieval is based on the idea of questions and answers, systems that retrieve documents based on keywords, i.e. systems that perform document retrieval like most search engines, are frequently also called information retrieval systems.
Natural Language Processing (NLP). The general goal of NLP is to achieve a better understanding of natural language by use of computers (Kodratoff 1999). Others also include the use of simple and robust techniques for the fast processing of text, as presented e.g. in Abney (1991). The techniques employed range from the simple manipulation of strings to the automatic processing of natural language inquiries. In addition, linguistic analysis techniques are used, among other things, for the processing of text.

Information Extraction (IE). The goal of information extraction methods is the extraction of specific information from text documents. These are stored in database-like patterns (see Wilks (1997)) and are then available for further use. For further details see section 3.3.
In the following, we will frequently refer to the above-mentioned related areas of research. We will especially provide examples for the use of machine learning methods in information extraction and information retrieval.

Text Encoding
For mining large document collections it is necessary to pre-process the text documents and store the information in a data structure which is more appropriate for further processing than a plain text file. Even though several methods now exist that also try to exploit the syntactic structure and semantics of text, most text mining approaches are based on the idea that a text document can be represented by a set of words, i.e. a text document is described based on the set of words contained in it (bag-of-words representation). However, in order to be able to define at least the importance of a word within a given document, usually a vector representation is used, where for each word a numerical "importance" value is stored. The currently predominant approaches based on this idea are the vector space model (Salton et al. 1975), the probabilistic model (Robertson 1977) and the logical model (van Rijsbergen 1986).
In the following we briefly describe how a bag-of-words representation can be obtained. Furthermore, we describe the vector space model and corresponding similarity measures in more detail, since this model will be used by several text mining approaches discussed in this article.

Text Preprocessing
In order to obtain all words that are used in a given text, a tokenization process is required, i.e. a text document is split into a stream of words by removing all punctuation marks and by replacing tabs and other non-text characters by single white spaces. This tokenized representation is then used for further processing. The set of different words obtained by merging all text documents of a collection is called the dictionary of a document collection.
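The tokenization step can be sketched in a few lines of Python. This is a minimal illustration only: the function names `tokenize` and `build_dictionary` are hypothetical, and lowercasing the tokens is an added assumption that the text does not prescribe.

```python
import re

def tokenize(text):
    """Split a text into word tokens.

    Every run of non-alphanumeric characters (punctuation, tabs, ...) is
    replaced by a single space. Lowercasing is an assumption of this sketch.
    """
    return re.sub(r"[^0-9A-Za-z]+", " ", text).lower().split()

def build_dictionary(documents):
    """Return the sorted list of distinct tokens over all documents."""
    return sorted({tok for doc in documents for tok in tokenize(doc)})

docs = ["Text mining, briefly.", "Mining text\tcollections!"]
tokens = [tokenize(d) for d in docs]
dictionary = build_dictionary(docs)
```

Note that the character class used here only keeps ASCII letters and digits; a collection in another language would need a different pattern.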
In order to allow a more formal description of the algorithms, we first define some terms and variables that will be frequently used in the following: Let $D$ be the set of documents and $T = \{t_1, \ldots, t_m\}$ be the dictionary, i.e. the set of all different terms occurring in $D$; then the absolute frequency of term $t \in T$ in document $d \in D$ is given by $\mathrm{tf}(d, t)$. We denote the term vectors $\vec{t}_d = (\mathrm{tf}(d, t_1), \ldots, \mathrm{tf}(d, t_m))$. Later on, we will also need the notion of the centroid of a set $X$ of term vectors. It is defined as the mean value $\vec{t}_X := \frac{1}{|X|} \sum_{\vec{t}_d \in X} \vec{t}_d$ of its term vectors. In the sequel, we will apply tf also on subsets of terms: For $T' \subseteq T$, we let $\mathrm{tf}(d, T') := \sum_{t \in T'} \mathrm{tf}(d, t)$.
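As an illustration of these definitions, the following sketch builds term frequency vectors over a fixed dictionary and computes the centroid of a set of such vectors. The helper names `term_vector` and `centroid` and the toy data are assumptions made for this example.

```python
from collections import Counter

def term_vector(tokens, dictionary):
    """Absolute term frequencies tf(d, t) for every term t of the dictionary."""
    counts = Counter(tokens)
    return [counts[t] for t in dictionary]

def centroid(vectors):
    """Component-wise mean of a non-empty list of equally long term vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

dictionary = ["mining", "text", "web"]
docs = [["text", "mining", "text"], ["web", "mining"]]
vectors = [term_vector(d, dictionary) for d in docs]
center = centroid(vectors)
```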

Filtering, Lemmatization and Stemming
In order to reduce the size of the dictionary and thus the dimensionality of the description of documents within the collection, the set of words describing the documents can be reduced by filtering and lemmatization or stemming methods.
Filtering methods remove words from the dictionary and thus from the documents. A standard filtering method is stop word filtering. The idea of stop word filtering is to remove words that bear little or no content information, like articles, conjunctions, prepositions, etc. Furthermore, words that occur extremely often can be said to be of little information content to distinguish between documents, and also words that occur very seldom are likely to be of no particular statistical relevance and can be removed from the dictionary (Frakes & Baeza-Yates 1992). In order to further reduce the number of words in the dictionary, (index) term selection methods can also be used (see Sect. 2.1.2).
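A possible implementation of such a filter is sketched below. The stop word list is a tiny illustrative sample, and the thresholds `min_df` and `max_df_ratio` are hypothetical parameters of this sketch rather than values recommended in the literature cited above.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in"}  # tiny illustrative sample

def filter_dictionary(tokenized_docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that are not stop words and are neither too rare nor too frequent.

    min_df: minimum number of documents a term must occur in (assumed threshold).
    max_df_ratio: maximum fraction of documents a term may occur in (assumed threshold).
    """
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    return sorted(
        t for t, n in df.items()
        if t not in STOP_WORDS and n >= min_df and n / n_docs <= max_df_ratio
    )

docs = [
    ["the", "mining", "of", "text"],
    ["text", "mining", "and", "clustering"],
    ["the", "clustering", "of", "text", "documents"],
]
kept = filter_dictionary(docs)
```

In this toy collection "text" is dropped because it occurs in every document, and "documents" because it occurs only once.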
Lemmatization methods try to map verb forms to the infinitive and nouns to the singular form. However, in order to achieve this, the word form has to be known, i.e. the part of speech of every word in the text document has to be assigned. Since this tagging process is usually quite time consuming and still error-prone, in practice stemming methods are frequently applied.
Stemming methods try to build the basic forms of words, i.e. strip the plural 's' from nouns, the 'ing' from verbs, or other affixes. A stem is a natural group of words with equal (or very similar) meaning. After the stemming process, every word is represented by its stem. A well-known rule-based stemming algorithm was originally proposed by Porter (Porter 1980). He defined a set of production rules to iteratively transform (English) words into their stems.
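To make the idea of suffix stripping concrete, here is a deliberately naive sketch. It is not Porter's algorithm: the suffix list, the minimum stem length and the function name `naive_stem` are all assumptions chosen for illustration, and the function will over- or under-stem many English words.

```python
SUFFIXES = ("ing", "es", "s")  # checked in this order; illustrative only

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["documents", "clustering", "is"]]
```

A rule-based stemmer such as Porter's applies many more rules, with conditions on the remaining stem, precisely to avoid the errors this toy version makes.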

Index Term Selection
To further decrease the number of words that should be used, indexing or keyword selection algorithms can also be used (see, e.g. Deerwester et al. (1990); Witten et al. (1999)). In this case, only the selected keywords are used to describe the documents. A simple method for keyword selection is to extract keywords based on their entropy. E.g. for each word $t$ in the vocabulary the entropy as defined by Lochbaum & Streeter (1989) can be computed:
\[ W(t) = 1 + \frac{1}{\log_2 |D|} \sum_{d \in D} P(d|t) \log_2 P(d|t) \quad \text{with} \quad P(d|t) = \frac{\mathrm{tf}(d, t)}{\sum_{l=1}^{n} \mathrm{tf}(d_l, t)} \]
Here the entropy gives a measure of how well a word is suited to separate documents by keyword search. For instance, words that occur in many documents will have low entropy. The entropy can be seen as a measure of the importance of a word in the given domain context. As index words a number of words that have a high entropy relative to their overall frequency can be chosen, i.e. of words occurring equally often those with the higher entropy can be preferred.
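The following sketch computes such an entropy-based weight for a single word. It assumes the form $W(t) = 1 + \frac{1}{\log_2 |D|} \sum_{d} P(d|t) \log_2 P(d|t)$ with $P(d|t)$ the share of the word's occurrences that fall into document $d$; the function name `entropy_weight` and the handling of degenerate inputs are choices made for this example.

```python
import math

def entropy_weight(tf_per_doc):
    """Entropy-based weight of one word, given its frequency in every document.

    Returns a value near 1 for a word concentrated in one document and
    a value near 0 for a word spread evenly over all documents.
    """
    n_docs = len(tf_per_doc)
    total = sum(tf_per_doc)
    if total == 0 or n_docs < 2:
        return 0.0  # assumption of this sketch: undefined cases map to 0
    acc = 0.0
    for tf in tf_per_doc:
        if tf > 0:
            p = tf / total          # P(d|t)
            acc += p * math.log2(p)
    return 1.0 + acc / math.log2(n_docs)

spread = entropy_weight([2, 2, 2, 2])   # evenly spread over four documents
focused = entropy_weight([8, 0, 0, 0])  # concentrated in one document
```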
In order to obtain a fixed number of index terms that appropriately cover the documents, a simple greedy strategy can be applied: From the first document in the collection select the term with the highest relative entropy (or information gain as described in Sect. 3.1.1) as an index term. Then mark this document and all other documents containing this term. From the first of the remaining unmarked documents select again the term with the highest relative entropy as an index term. Then mark again this document and all other documents containing this term. Repeat this process until all documents are marked, then unmark them all and start again. The process can be terminated when the desired number of index terms have been selected. A more detailed discussion of the benefits of this approach for clustering, with respect to the reduction of words required in order to obtain a good clustering performance, can be found in Borgelt & Nürnberger (2004).
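The greedy strategy can be sketched as follows. Documents are represented as sets of terms and the relative entropy scores are passed in as a dictionary. Two details are assumptions of this sketch, since the description leaves them open: a term that has already been selected is skipped in later passes, and ties are broken alphabetically.

```python
def select_index_terms(docs, score, k):
    """Greedily select up to k index terms.

    docs:  list of sets of terms (one set per document)
    score: dict mapping a term to its relative entropy (or information gain)
    """
    selected, chosen = [], set()
    while len(selected) < k:
        marked = [False] * len(docs)   # "unmark them all and start again"
        progress = False
        for i, doc in enumerate(docs):
            if marked[i]:
                continue
            marked[i] = True
            candidates = [t for t in doc if t not in chosen]
            if not candidates:
                continue
            best = max(candidates, key=lambda t: (score.get(t, 0.0), t))
            selected.append(best)
            chosen.add(best)
            progress = True
            # Mark every document that contains the new index term.
            for j, other in enumerate(docs):
                if best in other:
                    marked[j] = True
            if len(selected) == k:
                break
        if not progress:               # no unselected terms are left
            break
    return selected

docs = [{"text", "mining"}, {"text", "cluster"}, {"web", "cluster"}]
score = {"text": 0.2, "mining": 0.9, "cluster": 0.5, "web": 0.7}
terms = select_index_terms(docs, score, k=3)
```

In the example the first pass picks "mining" and "cluster", which together mark all three documents, and the second pass adds "text".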
An index term selection method that is more appropriate if we have to learn a classifier for documents is discussed in Sect. 3.1.1. This approach also considers the word distributions within the classes.

The Vector Space Model
Despite its simple data structure without using any explicit semantic information, the vector space model enables very efficient analysis of huge document collections. It was originally introduced for indexing and information retrieval (Salton et al. 1975) but is now also used in several text mining approaches as well as in most of the currently available document retrieval systems.
The vector space model represents documents as vectors in $m$-dimensional space, i.e. each document $d$ is described by a numerical feature vector $\vec{w}(d) = (x(d, t_1), \ldots, x(d, t_m))$. Thus, documents can be compared by use of simple vector operations and even queries can be performed by encoding the query terms similar to the documents in a query vector. The query vector can then be compared to each document and a result list can be obtained by ordering the documents according to the computed similarity (Salton et al. 1994). The main task of the vector space representation of documents is to find an appropriate encoding of the feature vector.
Each element of the vector usually represents a word (or a group of words) of the document collection, i.e. the size of the vector is defined by the number of words (or groups of words) of the complete document collection. The simplest way of document encoding is to use binary term vectors, i.e. a vector element is set to one if the corresponding word is used in the document and to zero if the word is not. This encoding will result in a simple Boolean comparison or search if a query is encoded in a vector. Using Boolean encoding, the importance of all terms for a specific query or comparison is treated as equal. To improve the performance, usually term weighting schemes are used, where the weights reflect the importance of a word in a specific document of the considered collection. Large weights are assigned to terms that are used frequently in relevant documents but rarely in the whole document collection (Salton & Buckley 1988). Thus a weight $w(d, t)$ for a term $t$ in document $d$ is computed by term frequency $\mathrm{tf}(d, t)$ times inverse document frequency $\mathrm{idf}(t)$, which describes the term specificity within the document collection. In Salton et al. (1994) a weighting scheme was proposed that has since proven its usability in practice. Besides term frequency and inverse document frequency, defined as $\mathrm{idf}(t) := \log(N/n_t)$, a length normalization factor is used to ensure that all documents have equal chances of being retrieved independent of their lengths:
\[ w(d, t) = \frac{\mathrm{tf}(d, t)\, \log(N/n_t)}{\sqrt{\sum_{j=1}^{m} \mathrm{tf}(d, t_j)^2 \left(\log(N/n_{t_j})\right)^2}} \]
where $N$ is the size of the document collection $D$ and $n_t$ is the number of documents in $D$ that contain term $t$.
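A weighting of this kind can be sketched as below. The sketch assumes that the weight of a term is tf times $\log(N/n_t)$, divided by the Euclidean norm of the document's unnormalized weights, so that every non-empty document vector has length one. The function name `tfidf_weights` and the toy frequencies are hypothetical.

```python
import math

def tfidf_weights(tf_vectors):
    """Length-normalized tf-idf weights for a list of term frequency vectors."""
    n_docs = len(tf_vectors)
    m = len(tf_vectors[0])
    # n_t: number of documents that contain term t
    n_t = [sum(1 for v in tf_vectors if v[j] > 0) for j in range(m)]
    idf = [math.log(n_docs / n) if n > 0 else 0.0 for n in n_t]
    weighted = []
    for v in tf_vectors:
        raw = [tf * w for tf, w in zip(v, idf)]
        norm = math.sqrt(sum(x * x for x in raw))
        weighted.append([x / norm if norm > 0 else 0.0 for x in raw])
    return weighted

tf_vectors = [[2, 1, 0], [0, 1, 3], [1, 1, 0]]
w = tfidf_weights(tf_vectors)
```

The second term occurs in every document, so its idf is $\log(1) = 0$ and it contributes nothing to any document vector.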
Based on a weighting scheme a document $d$ is defined by a vector of term weights $\vec{w}(d) = (w(d, t_1), \ldots, w(d, t_m))$ and the similarity $S$ of two documents $d_1$ and $d_2$ (or the similarity of a document and a query vector) can be computed based on the inner product of the vectors (by which, if we assume normalized vectors, the cosine between the two document vectors is computed), i.e.
\[ S(d_1, d_2) = \sum_{k=1}^{m} w(d_1, t_k) \cdot w(d_2, t_k) \qquad (3) \]
A frequently used distance measure is the Euclidean distance. We calculate the distance between two text documents $d_1, d_2 \in D$ as follows:
\[ \mathrm{dist}(d_1, d_2) = \sqrt{\sum_{k=1}^{m} \left| w(d_1, t_k) - w(d_2, t_k) \right|^2} \]
However, the Euclidean distance should only be used for normalized vectors, since otherwise the different lengths of documents can result in a smaller distance between documents that share fewer words than between documents that have more words in common and should therefore be considered more similar.
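Note that for normalized vectors the scalar product is not much different in behavior from the Euclidean distance, since for two vectors $\vec{x}$ and $\vec{y}$ it is
\[ \cos\varphi = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}| \cdot |\vec{y}|} = 1 - \frac{1}{2}\, d^2\!\left( \frac{\vec{x}}{|\vec{x}|}, \frac{\vec{y}}{|\vec{y}|} \right) \]
For a more detailed discussion of the vector space model and weighting schemes see, e.g. Baeza-Yates & Ribeiro-Neto (1999); Greiff (1998); Salton & Buckley (1988); Salton et al. (1975).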
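The relation between the scalar product and the Euclidean distance of normalized vectors can be checked numerically. The sketch below uses two arbitrary example vectors and verifies that the cosine equals one minus half the squared distance of the normalized vectors; all function names are illustrative.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def normalize(x):
    """Scale a non-zero vector to unit Euclidean length."""
    n = math.sqrt(dot(x, x))
    return [a / n for a in x]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = normalize([3.0, 4.0, 0.0])
y = normalize([1.0, 2.0, 2.0])

cosine = dot(x, y)                                # inner product of unit vectors
from_distance = 1.0 - 0.5 * euclidean(x, y) ** 2  # same value via the distance
```

This follows from expanding $|\vec{x} - \vec{y}|^2 = |\vec{x}|^2 + |\vec{y}|^2 - 2\,\vec{x} \cdot \vec{y}$, which for unit vectors equals $2 - 2\cos\varphi$.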

Linguistic Preprocessing
Often text mining methods may be applied without further preprocessing. Sometimes, however, additional linguistic preprocessing (cf. Manning & Schütze (2001)) may be used to enhance the available information about terms. For this, the following approaches are frequently applied:
Part-of-speech tagging (POS) determines the part of speech tag, e.g. noun, verb, adjective, etc. for each term.
Text chunking aims at grouping adjacent words in a sentence. An example of a chunk is the noun phrase "the current account deficit".
Word Sense Disambiguation (WSD) tries to resolve the ambiguity in the meaning of single words or phrases. An example is 'bank' which may have, among others, the senses 'financial institution' or the 'border of a river or lake'. Thus, instead of terms the specific meanings could be stored in the vector space representation. This leads to a bigger dictionary but considers the semantics of a term in the representation.
Parsing produces a full parse tree of a sentence. From the parse, we can find the relation of each word in the sentence to all the others, and typically also its function in the sentence (e.g. subject, object, etc.).
Linguistic processing uses lexica and other resources as well as handcrafted rules. If a set of examples is available, machine learning methods as described in section 3, especially in section 3.3, may be employed to learn the desired tags.
It turned out, however, that for many text mining tasks linguistic preprocessing is of limited value compared to the simple bag-of-words approach with basic preprocessing. The reason is that the co-occurrence of terms in the vector representation serves as an automatic disambiguation, e.g. for classification (Leopold & Kindermann 2002). Recently some progress was made by enhancing bag of words with linguistic features for text clustering and classification (Hotho et al. 2003; Bloehdorn & Hotho 2004).

Data Mining Methods for Text
One main reason for applying data mining methods to text document collections is to structure them. A structure can significantly simplify the access to a document collection for a user. Well known access structures are library catalogues or book indexes. However, the problem of manually designed indexes is the time required to maintain them. Therefore, they are very often not up-to-date and thus not usable for recent publications or frequently changing information sources like the World Wide Web. The existing methods for structuring collections either try to assign keywords to documents based on a given keyword set (classification or categorization methods) or automatically structure document collections to find groups of similar documents (clustering methods). In the following we first describe both of these approaches. Furthermore, we discuss in Sect. 3.3 methods to automatically extract useful information patterns from text document collections. In Sect. 3.4 we review methods for visual text mining. These methods allow, in combination with structuring methods, the development of powerful tools for the interactive exploration of document collections. We conclude this section with a brief discussion of further application areas for text mining.

Classification
Text classification aims at assigning pre-defined classes to text documents (Mitchell 1997). An example would be to automatically label each incoming news story with a topic like "sports", "politics", or "art". Whatever the specific method employed, a data mining classification task starts with a training set $D = (d_1, \ldots, d_n)$ of documents that are already labelled with a class $L \in \mathcal{L}$ (e.g. sport, politics). The task is then to determine a classification model which is able to assign the correct class to a new document $d$ of the domain.
To measure the performance of a classification model, a random fraction of the labelled documents is set aside and not used for training. We may classify the documents of this test set with the classification model and compare the estimated labels with the true labels. The fraction of correctly classified documents in relation to the total number of documents is called accuracy and is a first performance measure.
Often, however, the target class covers only a small percentage of the documents. Then we get a high accuracy if we assign each document to the alternative class. To avoid this effect, different measures of classification success are often used. Precision quantifies the fraction of retrieved documents that are in fact relevant, i.e. belong to the target class. Recall indicates which fraction of the relevant documents is retrieved.
Obviously there is a trade-off between precision and recall. Most classifiers internally determine some "degree of membership" in the target class. If only documents of high degree are assigned to the target class, the precision is high. However, many relevant documents might have been overlooked, which corresponds to a low recall. When on the other hand the search is more exhaustive, recall increases and precision goes down. The F-score is a compromise of both for measuring the overall performance of classifiers.
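These measures are easy to compute from a list of true and predicted labels. The sketch below is illustrative: the function name and the toy labels are made up, and the F-score is taken to be the balanced harmonic mean of precision and recall (often called F1), which is one common choice and an assumption here.

```python
def evaluate(true_labels, predicted_labels, target):
    """Return accuracy, precision, recall and F1 with respect to a target class."""
    pairs = list(zip(true_labels, predicted_labels))
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    hits = sum(1 for t, p in pairs if t == target and p == target)
    retrieved = sum(1 for _, p in pairs if p == target)  # assigned to target
    relevant = sum(1 for t, _ in pairs if t == target)   # truly in target
    precision = hits / retrieved if retrieved else 0.0
    recall = hits / relevant if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return accuracy, precision, recall, f1

true_labels = ["sports", "politics", "sports", "art", "sports"]
predicted = ["sports", "sports", "politics", "art", "sports"]
accuracy, precision, recall, f1 = evaluate(true_labels, predicted, "sports")
```

Here three of five documents are labelled correctly, while two of the three documents assigned to "sports" and two of the three true "sports" documents are hits.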

Index Term Selection
As document collections often contain more than 100,000 different words, we may select the most informative ones for a specific classification task to reduce the number of words and thus the complexity of the classification problem at hand. One commonly used ranking score is the information gain, which for a term $t_j$ is defined as
\[ IG(t_j) = \sum_{c=1}^{2} p(L_c) \log_2 \frac{1}{p(L_c)} - \sum_{m=0}^{1} p(t_j{=}m) \sum_{c=1}^{2} p(L_c | t_j{=}m) \log_2 \frac{1}{p(L_c | t_j{=}m)} \qquad (8) \]
Here $p(L_c)$ is the fraction of training documents with classes $L_1$ and $L_2$, $p(t_j{=}1)$ and $p(t_j{=}0)$ are the fractions of documents with / without term $t_j$, and $p(L_c | t_j{=}m)$ is the conditional probability of classes $L_1$ and $L_2$ if term $t_j$ is contained in the document or is missing. It measures how useful $t_j$ is for predicting $L_1$ from an information-theoretic point of view. We may determine $IG(t_j)$ for all terms and remove those with very low information gain from the dictionary.
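The information gain of a term can be computed as the entropy of the class distribution minus the expected entropy after splitting the documents by presence or absence of the term. The following sketch does this for documents given as sets of terms; the function names and the four toy documents are assumptions of the example.

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution (zero probabilities are skipped)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def information_gain(docs, labels, term):
    """Information gain of `term` for predicting the class labels.

    docs: list of sets of terms; labels: one class label per document.
    """
    n = len(docs)
    classes = sorted(set(labels))

    def class_dist(indices):
        return [sum(1 for i in indices if labels[i] == c) / len(indices)
                for c in classes]

    gain = entropy(class_dist(range(n)))
    for present in (True, False):
        idx = [i for i in range(n) if (term in docs[i]) == present]
        if idx:
            gain -= len(idx) / n * entropy(class_dist(idx))
    return gain

docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "team"}, {"vote", "law"}]
labels = ["sports", "sports", "politics", "politics"]
ig_goal = information_gain(docs, labels, "goal")  # separates the classes perfectly
ig_team = information_gain(docs, labels, "team")  # carries no class information
```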
In the following sections we describe the most frequently used data mining methods for text categorization.

Naïve Bayes Classifier
Probabilistic classifiers start with the assumption that the words of a document $d_i$ have been generated by a probabilistic mechanism. It is supposed that the class $L(d_i)$ of document $d_i$ has some relation to the words which appear in the document. This may be described by the conditional distribution $p(t_1, \ldots, t_{n_i} | L(d_i))$ of the $n_i$ words given the class. Then the Bayesian formula yields the probability of a class given the words of a document (Mitchell 1997)
\[ p(L_c | t_1, \ldots, t_{n_i}) = \frac{p(t_1, \ldots, t_{n_i} | L_c)\, p(L_c)}{\sum_{L \in \mathcal{L}} p(t_1, \ldots, t_{n_i} | L)\, p(L)} \]
Note that each document is assumed to belong to exactly one of the $k$ classes in $\mathcal{L}$. The prior probability $p(L)$ denotes the probability that an arbitrary document belongs to class $L$ before its words are known. Often the prior probabilities of all classes may be taken to be equal. The conditional probability on the left is the desired posterior probability that the document with words $t_1, \ldots, t_{n_i}$ belongs to class $L_c$. We may assign the class with highest posterior probability to our document.
For document classification it turned out that the specific order of the words in a document is not very important. Moreover, we may assume that for documents of a given class a word appears in the document irrespective of the presence of other words. This leads to a simple formula for the conditional probability of words given a class $L_c$
\[ p(t_1, \ldots, t_{n_i} | L_c) = \prod_{j=1}^{n_i} p(t_j | L_c) \]
Combining this "naïve" independence assumption with the Bayes formula defines the Naïve Bayes classifier (Good 1965). Simplifications of this sort are required as many thousand different words occur in a corpus.
The naïve Bayes classifier involves a learning step which simply requires the estimation of the probabilities of words $p(t_j | L_c)$ in each class by their relative frequencies in the documents of a training set which are labelled with $L_c$. In the classification step the estimated probabilities are used to classify a new instance according to the Bayes rule. In order to reduce the number of probabilities $p(t_j | L_m)$ to be estimated, we can use index term selection methods as discussed above in Sect. 3.1.1.
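The learning and classification steps can be sketched as follows. Two details go beyond the description above and are assumptions of this sketch: add-one (Laplace) smoothing is applied so that unseen word and class combinations do not produce zero probabilities, and words of a new document that never occurred in training are ignored. Probabilities are handled as logarithms to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Estimate log priors and smoothed log word probabilities per class."""
    vocab = {t for d in docs for t in d}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    model = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        model[c] = {
            "log_prior": math.log(class_counts[c] / len(docs)),
            # add-one smoothing (an assumption of this sketch)
            "log_prob": {t: math.log((word_counts[c][t] + 1) / (total + len(vocab)))
                         for t in vocab},
        }
    return model

def classify(model, tokens):
    """Return the class with the highest (log) posterior score."""
    def score(c):
        probs = model[c]["log_prob"]
        return model[c]["log_prior"] + sum(probs[t] for t in tokens if t in probs)
    return max(model, key=score)

docs = [["goal", "match", "goal"], ["team", "goal"],
        ["vote", "law"], ["law", "vote", "team"]]
labels = ["sports", "sports", "politics", "politics"]
model = train_naive_bayes(docs, labels)
first = classify(model, ["goal", "team"])
second = classify(model, ["law", "vote", "election"])
```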
Although this model is unrealistic due to its restrictive independence assumption, it yields surprisingly good classifications (Dumais et al. 1998; Joachims 1998). It may be extended in several directions (Sebastiani 2002).
As the effort for manually labeling the documents of the training set is high, some authors use unlabeled documents for training. Assume that from a small training set it has been established that word $t_i$ is highly correlated with class $L_c$. If from unlabeled documents it may be determined that word $t_j$ is highly correlated with $t_i$, then $t_j$ is also a good predictor for class $L_c$. In this way unlabeled documents may improve classification performance. In Nigam et al. (2000) the authors used a combination of Expectation-Maximization (EM, Dempster et al. (1977)) and a naïve Bayes classifier and were able to reduce the classification error by up to 30%.

Nearest Neighbor Classifier
Instead of building explicit models for the different classes we may select documents from the training set which are "similar" to the target document. The class of the target document subsequently may be inferred from the class labels of these similar documents. If $k$ similar documents are considered, the approach is also known as k-nearest neighbor classification.
There is a large number of similarity measures used in text mining. One possibility is simply to count the number of common words in two documents. Obviously this has to be normalized to account for documents of different lengths. On the other hand words have greatly varying information content. A standard way to measure the latter is the cosine similarity as defined in (3). Note that only a small fraction of all possible terms appear in these sums as $w(d, t) = 0$ if the term $t$ is not present in the document $d$. Other similarity measures are discussed in Baeza-Yates & Ribeiro-Neto (1999).
For deciding whether document $d_i$ belongs to class $L_m$, the similarity $S(d_i, d_j)$ to all documents $d_j$ in the training set is determined. The $k$ most similar training documents (neighbors) are selected. The proportion of neighbors having the same class may be taken as an estimator for the probability of that class, and the class with the largest proportion is assigned to document $d_i$. The optimal number $k$ of neighbors may be estimated from additional training data by cross-validation.
Nearest neighbor classification is a nonparametric method and it can be shown that for large data sets the error rate of the 1-nearest neighbor classifier is never larger than twice the optimal error rate (Hastie et al. 2001).Several studies have shown that k-nearest neighbor methods have very good performance in practice (Joachims 1998).Their drawback is the computational effort during classification, where basically the similarity of a document with respect to all other documents of a training set has to be determined.Some extensions are discussed in Sebastiani (2002).
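The procedure can be illustrated with a short Python sketch. It assumes that documents are given as sparse dictionaries mapping terms to weights (raw term counts in the toy data); the function names and the example documents are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity of two sparse term-weight dictionaries.

    Only terms present in both documents contribute to the dot product.
    """
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_classify(target, training, k=3):
    """Classify target by majority vote among its k most similar documents.

    training is a list of (term_weights, label) pairs. Returns the winning
    label and the proportion of neighbors per class.
    """
    ranked = sorted(training,
                    key=lambda item: cosine_similarity(target, item[0]),
                    reverse=True)
    neighbors = ranked[:k]
    votes = Counter(label for _, label in neighbors)
    proportions = {label: n / len(neighbors) for label, n in votes.items()}
    return votes.most_common(1)[0][0], proportions

# Hypothetical toy training set with raw term counts as weights.
training = [
    ({"stock": 2, "market": 1}, "business"),
    ({"market": 1, "profit": 1}, "business"),
    ({"goal": 1, "team": 2}, "sport"),
    ({"team": 1, "coach": 1}, "sport"),
    ({"goal": 1, "match": 1}, "sport"),
]
label, proportions = knn_classify({"market": 1, "profit": 2}, training, k=3)
```

The sketch compares the target with every training document, which makes the computational drawback mentioned above directly visible.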

Decision Trees
Decision trees are classifiers which consist of a set of rules which are applied in a sequential way and finally yield a decision. They can be best explained by observing the training process, which starts with a comprehensive training set. It uses a divide and conquer strategy: For a training set $M$ with labelled documents the word $t_i$ is selected which best predicts the class of the documents, e.g. as measured by the information gain (8). Then $M$ is partitioned into two subsets, the subset $M_i^+$ with the documents containing $t_i$, and the subset $M_i^-$ with the documents without $t_i$. This procedure is recursively applied to $M_i^+$ and $M_i^-$. It stops if all documents in a subset belong to the same class $L_c$. It generates a tree of rules with an assignment to actual classes in the leaves.
Decision trees are a standard tool in data mining (Quinlan 1986; Mitchell 1997). They are fast and scalable both in the number of variables and the size of the training set. For text mining, however, they have the drawback that the final decision depends only on relatively few terms. A decisive improvement may be achieved by boosting decision trees (Schapire & Singer 1999), i.e. determining a set of complementary decision trees constructed in such a way that the overall error is reduced. Schapire & Singer (2000) use even simpler one-step decision trees containing only one rule and get impressive results for text classification.
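The recursive partitioning described above may be sketched as follows. The sketch assumes that each document is a set of terms and uses the usual entropy-based information gain as selection criterion; the function names, the majority-vote fallback for subsets that cannot be split further, and the toy data are assumptions of this illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """Reduction of class entropy when splitting on presence of term."""
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (with_t, without_t) if part)
    return entropy(labels) - remainder

def build_tree(docs, labels):
    """Return a class label (leaf) or a tuple (term, subtree+, subtree-).

    Expects a non-empty training set.
    """
    majority = Counter(labels).most_common(1)[0][0]
    terms = sorted({t for d in docs for t in d})
    if len(set(labels)) == 1 or not terms:
        return majority
    best = max(terms, key=lambda t: information_gain(docs, labels, t))
    if information_gain(docs, labels, best) <= 1e-12:
        return majority  # no term separates the classes any further
    plus = [(d, l) for d, l in zip(docs, labels) if best in d]
    minus = [(d, l) for d, l in zip(docs, labels) if best not in d]
    return (best,
            build_tree([d for d, _ in plus], [l for _, l in plus]),
            build_tree([d for d, _ in minus], [l for _, l in minus]))

def classify(tree, doc):
    """Follow the rules from the root down to a leaf."""
    while isinstance(tree, tuple):
        term, plus, minus = tree
        tree = plus if term in doc else minus
    return tree

# Hypothetical toy training set.
docs = [{"stock", "market"}, {"market", "profit"},
        {"goal", "team"}, {"team", "coach"}]
labels = ["business", "business", "sport", "sport"]
tree = build_tree(docs, labels)
```

On this toy data a single term already separates the two classes, which illustrates the drawback noted above: the final decision depends on very few terms.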

Support Vector Machines and Kernel Methods
A Support Vector Machine (SVM) is a supervised classification algorithm that recently has been applied successfully to text classification tasks (Joachims 1998; Dumais et al. 1998; Leopold & Kindermann 2002). As usual a document $d$ is represented by a (possibly weighted) vector $(t_{d1}, \ldots, t_{dN})$ of the counts of its words. A single SVM can only separate two classes: a positive class $L_1$ (indicated by $y = +1$) and a negative class $L_2$ (indicated by $y = -1$). In the space of input vectors a hyperplane may be defined by setting $y = 0$ in the following linear equation:

$$y = f(t_d) = b_0 + \sum_{j=1}^{N} b_j t_{dj}$$

The SVM algorithm determines a hyperplane which is located between the positive and negative examples of the training set. The parameters $b_j$ are adapted in such a way that the distance $\xi$, called the margin, between the hyperplane and the closest positive and negative example documents is maximized, as shown in Fig. 3.1.5. This amounts to a constrained quadratic optimization problem which can be solved efficiently for a large number of input vectors. The documents having distance $\xi$ from the hyperplane are called support vectors and determine the actual location of the hyperplane. Usually only a small fraction of documents are support vectors. A new document with term vector $t_d$ is classified into $L_1$ if the value $f(t_d) > 0$ and into $L_2$ otherwise. If the document vectors of the two classes are not linearly separable, a hyperplane is selected such that as few document vectors as possible are located on the "wrong" side.
SVMs can be used with non-linear predictors by transforming the usual input features in a non-linear way, e.g. by defining a feature map $\phi$ from the input vectors into an expanded feature space. Subsequently a hyperplane may be defined in the expanded input space. Obviously such non-linear transformations may be defined in a large number of ways.
The most important property of SVMs is that learning is nearly independent of the dimensionality of the feature space. It rarely requires feature selection as it inherently selects data points (the support vectors) required for a good classification. This allows good generalization even in the presence of a large number of features and makes SVMs especially suitable for the classification of texts (Joachims 1998). In the case of textual data the choice of the kernel function has a minimal effect on the accuracy of classification: Kernels that imply a high-dimensional feature space show slightly better results in terms of precision and recall, but they are subject to overfitting (Leopold & Kindermann 2002).

Classifier Evaluations
In recent years text classifiers have been evaluated on a number of benchmark document collections. It turns out that the level of performance depends, of course, on the document collection. Table 1 gives some representative results achieved for the Reuters 20 newsgroups collection (Sebastiani 2002, p. 38). Concerning the relative quality of classifiers, boosted trees, SVMs, and k-nearest neighbors usually deliver top-notch performance, while naïve Bayes and decision trees are less reliable.

Clustering
Clustering methods can be used to find groups of documents with similar content. The result of clustering is typically a partition (also called clustering) $\mathbb{P}$, a set of clusters $P$. Each cluster consists of a number of documents $d$. Objects (in our case documents) of a cluster should be similar to each other and dissimilar to documents of other clusters. Usually the quality of a clustering is considered better if the contents of the documents within one cluster are more similar and those of different clusters are more dissimilar. Clustering methods group the documents only by considering their distribution in document space (for example, an n-dimensional space if we use the vector space model for text documents).
Clustering algorithms compute the clusters based on the attributes of the data and measures of (dis)similarity. However, the idea of what an ideal clustering result should look like varies between applications and might even differ between users. One can exert influence on the results of a clustering algorithm by using only subsets of attributes or by adapting the similarity measures used, and thus control the clustering process. The extent to which the result of the cluster algorithm coincides with the ideas of the user can be assessed by evaluation measures. A survey of different kinds of clustering algorithms and the resulting cluster types can be found in Steinbach et al. (2003).
In the following, we first introduce standard evaluation methods and then present details for hierarchical clustering approaches, k-means, bi-section-k-means, self-organizing maps and the EM-algorithm. We will finish the clustering section with a short overview of other clustering approaches used for text clustering.

Evaluation of Clustering Results
In general, there are two ways to evaluate clustering results. On the one hand, statistical measures can be used to describe the properties of a clustering result. On the other hand, some given classification can be seen as a kind of gold standard against which the clustering results are compared. We discuss both aspects in the following.

Statistical Measures
In the following, we first discuss measures which cannot make use of a given classification $\mathbb{L}$ of the documents. They are called indices in the statistical literature and evaluate the quality of a clustering on the basis of statistical relationships. One finds a large number of indices in the literature (see Fickel (1997); Duda & Hart (1973)). One of the most well-known measures is the mean square error. It permits statements on the quality of the found clusters dependent on the number of clusters. Unfortunately, the computed quality is always better if the number of clusters is higher. In Kaufman & Rousseeuw (1990) an alternative measure, the silhouette coefficient, is presented which is independent of the number of clusters. We introduce both measures in the following.
Mean square error If one keeps the number of dimensions and the number of clusters constant, the mean square error (MSE) can likewise be used for the evaluation of the quality of a clustering. The mean square error is a measure for the compactness of the clustering and is defined as follows:

Definition 1 (MSE) The mean square error (MSE) for a given clustering $\mathbb{P}$ is defined as

$$\mathrm{MSE}(\mathbb{P}) = \sum_{P \in \mathbb{P}} \mathrm{MSE}(P),$$

whereas the mean square error for a cluster $P$ is given by

$$\mathrm{MSE}(P) = \sum_{d \in P} \mathrm{dist}(t_d, \mu_P)^2,$$

and $\mu_P = \frac{1}{|P|} \sum_{d \in P} t_d$ is the centroid of the cluster $P$ and $\mathrm{dist}$ is a distance measure.
Silhouette Coefficient One clustering measure that is independent of the number of clusters is the silhouette coefficient $SC(\mathbb{P})$ (cf. Kaufman & Rousseeuw (1990)). The main idea of the coefficient is to determine the location of a document in the space with respect to its own cluster and the next most similar cluster. For a good clustering the considered document is close to its own cluster, whereas for a bad clustering the document is closer to the next cluster. With the help of the silhouette coefficient one is able to judge the quality of a cluster or of the entire clustering (details can be found in Kaufman & Rousseeuw (1990)).
Kaufman & Rousseeuw (1990) give characteristic values of the silhouette coefficient for the evaluation of the cluster quality. A value of $SC(\mathbb{P})$ between 0.7 and 1.0 signals excellent separation between the found clusters, i.e. the objects within a cluster are very close to each other and are far away from other clusters. The structure was very well identified by the cluster algorithm. For the range from 0.5 to 0.7 the objects are clearly assigned to the appropriate clusters. A larger level of noise exists in the data set if the silhouette coefficient is within the range of 0.25 to 0.5, although clusters are still identifiable here. In this case many objects could not be assigned clearly to one cluster by the cluster algorithm. At values under 0.25 it is practically impossible to identify a cluster structure and to calculate cluster centers that are meaningful from the application point of view. The cluster algorithm more or less "guessed" the clustering.
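For illustration, the following Python sketch computes the silhouette coefficient in its commonly used form: for each document, $a$ is the mean distance to the other members of its own cluster, $b$ is the smallest mean distance to the members of any other cluster, and the silhouette value is $(b - a) / \max(a, b)$. The formula is not spelled out in the text above and is taken here as the standard definition; the function names, the Euclidean distance and the toy vectors are assumptions of this sketch.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def silhouette_coefficient(clusters, dist=euclidean):
    """Mean silhouette value over all documents.

    clusters is a list of at least two non-empty lists of vectors.
    """
    values = []
    for i, cluster in enumerate(clusters):
        for d in cluster:
            if len(cluster) == 1:
                values.append(0.0)  # usual convention for singleton clusters
                continue
            # Mean distance to the other members of the own cluster
            # (dist(d, d) = 0, so d itself does not change the sum).
            a = sum(dist(d, o) for o in cluster) / (len(cluster) - 1)
            # Mean distance to the nearest other cluster.
            b = min(sum(dist(d, o) for o in other) / len(other)
                    for j, other in enumerate(clusters) if j != i)
            values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(values) / len(values)

# Two compact, well separated groups versus a clustering that mixes them.
good = silhouette_coefficient([[(0, 0), (0, 1)], [(10, 0), (10, 1)]])
bad = silhouette_coefficient([[(0, 0), (10, 0)], [(0, 1), (10, 1)]])
```

In this toy example the first clustering falls into the top range of the scale given above, while the second one falls below 0.25.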

Comparative Measures
The purity measure is based on the well-known precision measure for information retrieval (cf. Pantel & Lin (2002)). Each resulting cluster $P$ from a partitioning $\mathbb{P}$ of the overall document set $D$ is treated as if it were the result of a query. Each set $L$ of documents of a partitioning $\mathbb{L}$, which is obtained by manual labelling, is treated as if it were the desired set of documents for a query, which leads to the same definitions for precision, recall and F-score as defined in Equations 6 and 7. The two partitions $\mathbb{P}$ and $\mathbb{L}$ are then compared as follows.
The precision of a cluster $P \in \mathbb{P}$ for a given category $L \in \mathbb{L}$ is given by

$$\mathrm{Precision}(P, L) := \frac{|P \cap L|}{|P|}.$$

The overall value for purity is computed by taking the weighted average of maximal precision values:

$$\mathrm{Purity}(\mathbb{P}, \mathbb{L}) := \sum_{P \in \mathbb{P}} \frac{|P|}{|D|} \max_{L \in \mathbb{L}} \mathrm{Precision}(P, L).$$

The counterpart of purity is

$$\mathrm{InversePurity}(\mathbb{P}, \mathbb{L}) := \sum_{L \in \mathbb{L}} \frac{|L|}{|D|} \max_{P \in \mathbb{P}} \mathrm{Recall}(P, L),$$

where $\mathrm{Recall}(P, L) := \mathrm{Precision}(L, P)$, and the well-known

$$\mathrm{F\text{-}Measure}(\mathbb{P}, \mathbb{L}) := \sum_{L \in \mathbb{L}} \frac{|L|}{|D|} \max_{P \in \mathbb{P}} \frac{2 \cdot \mathrm{Recall}(P, L) \cdot \mathrm{Precision}(P, L)}{\mathrm{Recall}(P, L) + \mathrm{Precision}(P, L)},$$

which is based on the F-score as defined in Eq. 7.
The three measures return values in the interval [0, 1], with 1 indicating optimal agreement. Purity measures the homogeneity of the resulting clusters when evaluated against a pre-categorization, while inverse purity measures how stable the pre-defined categories are when split up into clusters. Thus, purity achieves an "optimal" value of 1 when the number of clusters $k$ equals $|D|$, whereas inverse purity achieves an "optimal" value of 1 when $k$ equals 1. Another name in the literature for inverse purity is microaveraged precision. The reader may note that, in the evaluation of clustering results, microaveraged precision is identical to microaveraged recall (cf. e.g. Sebastiani (2002)). The F-measure works similarly to inverse purity, but it penalizes overly large clusters, as it includes the individual precision of these clusters in the evaluation.
While (inverse) purity and F-measure only consider "best" matches between "queries" and manually defined categories, the entropy indicates how large the uncertainty of a clustering result is with respect to the given classification:

$$E(\mathbb{P}, \mathbb{L}) = \sum_{P \in \mathbb{P}} \mathrm{prob}(P) \cdot E(P),$$

where

$$E(P) = -\sum_{L \in \mathbb{L}} \mathrm{prob}(L \mid P) \log(\mathrm{prob}(L \mid P)) \qquad (15)$$

with $\mathrm{prob}(L \mid P) = \mathrm{Precision}(P, L)$ and $\mathrm{prob}(P) = \frac{|P|}{|D|}$. The entropy has the range $[0, \log(|\mathbb{L}|)]$, with 0 indicating optimality.
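The comparative measures can be made concrete with a small Python sketch. It assumes that both the clustering and the manual classification are given as lists of sets of document identifiers that partition the same document set, and that precision is the fraction of a cluster's documents that belong to a category; function names and the toy partitions are hypothetical.

```python
import math

def precision(P, L):
    """Fraction of the documents in P that also belong to L."""
    return len(P & L) / len(P)

def purity(clustering, classes):
    n = sum(len(P) for P in clustering)
    return sum(len(P) / n * max(precision(P, L) for L in classes)
               for P in clustering)

def inverse_purity(clustering, classes):
    # Recall(P, L) = Precision(L, P), so the roles are simply swapped.
    return purity(classes, clustering)

def f_measure(clustering, classes):
    n = sum(len(L) for L in classes)
    total = 0.0
    for L in classes:
        best = 0.0
        for P in clustering:
            prec, rec = precision(P, L), precision(L, P)
            if prec + rec > 0:
                best = max(best, 2 * prec * rec / (prec + rec))
        total += len(L) / n * best
    return total

def entropy(clustering, classes):
    n = sum(len(P) for P in clustering)
    total = 0.0
    for P in clustering:
        probs = (precision(P, L) for L in classes)
        e_P = -sum(p * math.log(p) for p in probs if p > 0)
        total += len(P) / n * e_P
    return total

# Hypothetical example: six documents, two clusters, two manual categories.
clustering = [{1, 2, 3}, {4, 5, 6}]
classes = [{1, 2}, {3, 4, 5, 6}]
```

For this example both `purity(clustering, classes)` and `inverse_purity(clustering, classes)` evaluate to 5/6, and comparing a partition with itself gives purity 1 and entropy 0.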

Partitional Clustering
Hierarchical Clustering Algorithms (Manning & Schütze 2001; Steinbach et al. 2000) got their name since they form a sequence of groupings or clusters that can be represented in a hierarchy of clusters. This hierarchy can be obtained either in a top-down or bottom-up fashion. Top-down means that we start with one cluster that contains all documents. This cluster is refined stepwise by splitting it iteratively into sub-clusters. In this case one also speaks of the so-called "divisive" algorithm. The bottom-up or "agglomerative" procedures start by considering every document as an individual cluster. Then the most similar clusters are iteratively merged, until all documents are contained in one single cluster. In practice the divisive procedure is of almost no importance due to its generally bad results. Therefore, only the agglomerative algorithm is outlined in the following.
The agglomerative procedure initially considers each document $d$ of the whole document set $D$ as an individual cluster. This is the first cluster solution. It is assumed that each document is a member of exactly one cluster. One determines the similarity between the clusters on the basis of this first clustering and selects the two clusters $p$, $q$ of the clustering $\mathbb{P}$ with the minimum distance $\mathrm{dist}(p, q)$. Both clusters are merged and one obtains a new clustering. One continues this procedure and re-calculates the distances between the new clusters in order to join again the two clusters with the minimum distance $\mathrm{dist}(p, q)$. The algorithm stops when only one cluster remains.
The distance can be computed according to Eq. 4. It is also possible to derive the clusters directly on the basis of the similarity relationship given by a matrix. For the computation of the similarity between clusters that contain more than one element, different distance measures for clusters can be used, e.g. based on the outer cluster shape or the cluster center. Common linkage procedures that make use of different cluster distance measures are single linkage, average linkage or Ward's procedure. The obtained clustering depends on the measure used. Details can be found, for example, in Duda & Hart (1973).
By means of so-called dendrograms one can represent the hierarchy of the clusters obtained as a result of the repeated merging of clusters as described above. The dendrogram allows one to estimate the number of clusters based on the distances of the merged clusters. Unfortunately, the selection of the appropriate linkage method depends on the desired cluster structure, which is usually unknown in advance. For example, single linkage tends to follow chain-like clusters in the data, while complete linkage tends to create ellipsoid clusters. Thus prior knowledge about the expected distribution and cluster form is usually necessary for the selection of the appropriate method (see also Duda & Hart (1973)). However, substantially more problematic for the use of the algorithm on large data sets is the memory required to store the similarity matrix, which consists of $n(n-1)/2$ elements where $n$ is the number of documents. Also the runtime behavior of $O(n^2)$ is worse compared to the linear behavior of KMeans as discussed in the following.
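The agglomerative procedure outlined above may be sketched in Python as follows. The sketch is deliberately naive: it recomputes all cluster distances in every step instead of maintaining a distance matrix, so it is slower than the runtime quoted above. The function names, the choice of Euclidean distance and the toy vectors are assumptions of this illustration.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def agglomerative(vectors, linkage="single", dist=euclidean):
    """Merge clusters bottom-up until a single cluster remains.

    Returns the merge history as a list of (cluster_p, cluster_q, distance),
    where clusters are tuples of indices into vectors. The history contains
    the information a dendrogram would display.
    """
    def cluster_dist(p, q):
        pairwise = [dist(vectors[i], vectors[j]) for i in p for j in q]
        if linkage == "single":
            return min(pairwise)
        if linkage == "complete":
            return max(pairwise)
        if linkage == "average":
            return sum(pairwise) / len(pairwise)
        raise ValueError("unknown linkage: " + linkage)

    clusters = [(i,) for i in range(len(vectors))]  # one cluster per document
    merges = []
    while len(clusters) > 1:
        # Select the two clusters with minimum distance.
        d, p, q = min(((cluster_dist(p, q), p, q)
                       for idx, p in enumerate(clusters)
                       for q in clusters[idx + 1:]),
                      key=lambda item: item[0])
        clusters = [c for c in clusters if c not in (p, q)] + [p + q]
        merges.append((p, q, d))
    return merges

# Hypothetical toy data: two tight pairs and one outlier.
vectors = [(0, 0), (0, 1), (5, 0), (5, 1), (20, 0)]
merges = agglomerative(vectors, linkage="single")
```

A large jump in the merge distances, here from 5 to 15 in the last step, is the kind of hint a dendrogram gives for choosing the number of clusters.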
k-means is one of the most frequently used clustering algorithms in practice in the field of data mining and statistics (see Duda & Hart (1973); Hartigan (1975)). The procedure, which originally comes from statistics, is simple to implement and can also be applied to large data sets. It turned out that especially in the field of text clustering k-means obtains good results. Proceeding from a starting solution in which all documents are distributed over a given number of clusters, one tries to improve the solution by a specific change of the allocation of documents to the clusters. By now a number of variants exist, while the basic principle goes back to Forgy (1965) or MacQueen (1967). In the literature on vector quantization KMeans is also known under the name Lloyd-Max algorithm (Gersho & Gray 1992). The basic principle is shown in the following algorithm:

Algorithm 1 The KMeans algorithm
Input: set $D$, distance measure $\mathrm{dist}$, number $k$ of clusters
Output: A partitioning $\mathbb{P}$ of the set $D$ of documents (i.e., a set $\mathbb{P}$ of $k$ disjoint subsets of $D$ with $\bigcup_{P \in \mathbb{P}} P = D$).
1: Choose randomly $k$ data points from $D$ as starting centroids $t_{P_1}, \ldots, t_{P_k}$.
2: repeat
3: Assign each point of $D$ to the closest centroid with respect to $\mathrm{dist}$.
4: (Re-)calculate the cluster centroids $t_{P_1}, \ldots, t_{P_k}$ of clusters $P_1, \ldots, P_k$.
5: until cluster centroids $t_{P_1}, \ldots, t_{P_k}$ are stable
6: return set $\mathbb{P} := \{P_1, \ldots, P_k\}$ of clusters.

k-means essentially consists of steps three and four of the algorithm, whereby the number of clusters $k$ must be given. In step three the documents are assigned to the nearest of the $k$ centroids (also called cluster prototypes). Step four calculates new centroids on the basis of the new allocations. We repeat the two steps in a loop (step five) until the cluster centroids do not change any more. Algorithm 1 corresponds to a simple hill climbing procedure which typically gets stuck in a local optimum (finding the global optimum is an NP-complete problem). Apart from a suitable method to determine the starting solution (step one), we require a measure for calculating the distance or similarity in step three (cf. section 2.1). Furthermore, the abort criterion of the loop in step five can be chosen differently, e.g. by stopping after a fixed number of iterations.
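As an illustration, Algorithm 1 can be written down in a few lines of Python. This is a minimal sketch under stated assumptions: documents are dense numeric tuples, the distance is Euclidean, an empty cluster keeps its previous centroid, and the loop additionally stops after `max_iter` iterations. The function names, the `seed` parameter and the toy data are not part of the original algorithm description.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(D, k, dist=euclidean, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(D, k)                      # step 1
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):                         # steps 2 and 5
        clusters = [[] for _ in range(k)]
        for d in D:                                   # step 3
            nearest = min(range(k), key=lambda c: dist(d, centroids[c]))
            clusters[nearest].append(d)
        new_centroids = [centroid(P) if P else centroids[c]   # step 4
                         for c, P in enumerate(clusters)]
        if new_centroids == centroids:                # centroids are stable
            break
        centroids = new_centroids
    return clusters, centroids                        # step 6

# Hypothetical toy data with two clearly separated groups.
D = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, centroids = kmeans(D, k=2)
```

Because the result depends on the random starting centroids, the procedure is usually run several times with different seeds in practice and the solution with the smallest mean square error is kept.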

Bi-Section-k-means
One fast text clustering algorithm, which is also able to deal with the large size of textual data, is the Bi-Section-KMeans algorithm.
In Steinbach et al. (2000) it was shown that Bi-Section-KMeans is a fast and high-quality clustering algorithm for text documents which frequently outperforms standard KMeans as well as agglomerative clustering techniques.
Bi-Section-KMeans is based on the KMeans algorithm. It repeatedly splits the largest cluster (using KMeans) until the desired number of clusters is obtained. Another way of choosing the next cluster to be split is picking the one with the largest variance. Steinbach et al. (2000) showed that neither of these two has a significant advantage.
Self-organizing maps (SOM, cf. Kohonen (1982)) are a special architecture of neural networks that cluster high-dimensional data vectors according to a similarity measure. The clusters are arranged in a low-dimensional topology that preserves the neighborhood relations in the high-dimensional data. Thus, not only objects that are assigned to one cluster are similar to each other (as in every cluster analysis), but also objects of nearby clusters are expected to be more similar than objects in more distant clusters. Usually, two-dimensional grids of squares or hexagons are used (cf. Fig. 3).
The network structure of a self-organizing map has two layers (see Fig. 3). The neurons in the input layer correspond to the input dimensions, here the words of the document vector. The output layer (map) contains as many neurons as clusters needed. All neurons in the input layer are connected with all neurons in the output layer. The weights of the connections between input and output layer of the neural network encode positions in the high-dimensional data space (similar to the cluster prototypes in k-means). Thus, every unit in the output layer represents a cluster center. Before the learning phase of the network, the two-dimensional structure of the output units is fixed and the weights are initialized randomly. During learning, the sample vectors (defining the documents) are repeatedly propagated through the network. The weights of the most similar prototype $w_s$ (winner neuron) are modified such that the prototype moves toward the input vector $w_i$, which is defined by the currently considered document $d$, i.e. $w_i := t_d$ (competitive learning). As similarity measure usually the Euclidean distance is used. However, for text documents the scalar product (see Eq. 3) can be applied. The weights $w_s$ of the winner neuron are modified according to the following equation:

$$w_s' = w_s + \sigma \cdot (w_i - w_s),$$

where $\sigma$ is a learning rate.
To preserve the neighborhood relations, prototypes that are close to the winner neuron in the two-dimensional structure are also moved in the same direction. The weight change decreases with the distance from the winner neuron. Therefore, the adaption method is extended by a neighborhood function $v$ (see also Fig. 3), so that for every output unit $c$

$$w_c' = w_c + v(c, s) \cdot \sigma \cdot (w_i - w_c),$$

where $\sigma$ is a learning rate. By this learning procedure, the structure in the high-dimensional sample data is non-linearly projected to the lower-dimensional topology. After learning, arbitrary vectors (i.e. vectors from the sample set or previously "unknown" vectors) can be propagated through the network and are mapped to the output units. For further details on self-organizing maps see Kohonen (1984). Examples for the application of SOMs for text mining can be found in Lin et al. (1991); Honkela et al. (1996); Kohonen et al. (2000); Nürnberger
Clustering using the EM-algorithm In model-based clustering a document $d_i$ may be assigned with probability $q_{ic}$ to cluster $P_c$ (soft clustering), where $q_i = (q_{i1}, \ldots, q_{ik})$ is a probability vector with $\sum_{c=1}^{k} q_{ic} = 1$. The underlying statistical assumption is that a document was created in two stages: First we pick a cluster $P_c$ from $\{1, \ldots, k\}$ with fixed probability $q_c$; then we generate the words $t$ of the document according to a cluster-specific probability distribution $p(t \mid P_c)$. This corresponds to a mixture model where the probability of an observed document $(t_1, \ldots, t_{n_i})$ is

$$p(t_1, \ldots, t_{n_i}) = \sum_{c=1}^{k} q_c \, p(t_1, \ldots, t_{n_i} \mid P_c).$$

Each cluster $P_c$ is a mixture component. The mixture probabilities $q_c$ describe an unobservable "cluster variable" $z$ which may take the values from $\{1, \ldots, k\}$. A well established method for estimating models involving unobserved variables is the EM-algorithm (Hastie et al. 2001), which basically replaces the unknown value with its current probability estimate and then proceeds as if it had been observed. Clustering methods for documents based on mixture models have been proposed by Cheeseman & Stutz (1996) and yield excellent results. Hofmann (2001) formulates a variant that is able to cluster terms occurring together instead of documents.

Alternative Clustering Approaches
Co-clustering algorithms designate the simultaneous clustering of documents and terms (Dhillon et al. 2003). They thereby follow a different paradigm than "classical" clustering algorithms such as KMeans, which cluster only the elements of one dimension on the basis of their similarity with respect to the second one, e.g. documents based on terms.
Fuzzy Clustering While most classical clustering algorithms assign each datum to exactly one cluster, thus forming a crisp partition of the given data, fuzzy clustering allows for degrees of membership with which a datum belongs to different clusters (Bezdek 1981). These approaches are frequently more stable. Applications to text are described in, e.g., Mendes & Sacks (2001); Borgelt & Nürnberger (2004).
The Utility of Clustering We have described the most important types of clustering approaches, but we had to leave out many others. Obviously there are many ways to define clusters and because of this we cannot expect to obtain something like the "true" clustering. Still, clustering can be insightful. In contrast to classification, which relies on a prespecified grouping, cluster procedures label documents in a new way. By studying the words and phrases that characterize a cluster, for example, a company could gain new insights about its customers and their typical properties. A comparison of some clustering methods is given in Steinbach et al. (2000).

Information Extraction
Natural language text contains much information that is not directly suitable for automatic analysis by a computer. However, computers can be used to sift through large amounts of text and extract useful information from single words, phrases or passages. Therefore information extraction can be regarded as a restricted form of full natural language understanding, where we know in advance what kind of semantic information we are looking for. The main task is to extract parts of text and assign specific attributes to them.
As an example consider the task of extracting executive position changes from news stories: "Robert L. James, chairman and chief executive officer of McCann-Erickson, is going to retire on July 1st. He will be replaced by John J. Donner, Jr., the agency's chief operating officer." In this case we have to identify the following information: organization (McCann-Erickson), position (chief executive officer), date (July 1), outgoing person name (Robert L. James), and incoming person name (John J. Donner, Jr.).
The task of information extraction naturally decomposes into a series of processing steps, typically including tokenization, sentence segmentation, part-of-speech assignment, and the identification of named entities, i.e. person names, location names and names of organizations. At a higher level phrases and sentences have to be parsed, semantically interpreted and integrated. Finally the required pieces of information like "position" and "incoming person name" are entered into the database. Although the most accurate information extraction systems often involve handcrafted language-processing modules, substantial progress has been made in applying data mining techniques to a number of these steps.

Classification for Information Extraction
Entity extraction was originally formulated in the Message Understanding Conference (Chinchor 1997). One can regard it as a word-based tagging problem: The word where the entity starts gets tag "B", continuation words get tag "I" and words outside the entity get tag "O". This is done for each type of entity of interest. For the example above we have for instance the person-words "by (O) John (B) J. (I) Donner (I) Jr. (I) the (O)".
Hence we have a sequential classification problem for the labels of each word, with the surrounding words as input feature vector. A frequent way of forming the feature vector is a binary encoding scheme. Each feature component can be considered as a test that asserts whether a certain pattern occurs at a specific position or not. For example, a feature component takes the value 1 if the previous word is the word "John" and 0 otherwise. Of course we may not only test the presence of specific words but also whether a word starts with a capital letter, has a specific suffix or is a specific part-of-speech. In this way results of previous analyses may be used. Now we may employ any efficient classification method to classify the word labels using the input feature vector. A good candidate is the Support Vector Machine because of its ability to handle large sparse feature vectors efficiently. Takeuchi & Collier (2002) used it to extract entities in the molecular biology domain.
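The tagging scheme and the binary feature encoding can be illustrated with a short Python sketch. The function names, the feature names and the representation of entities as token index spans are assumptions made for this example; a real system would add many more tests, e.g. on part-of-speech tags.

```python
def bio_tags(tokens, entity_spans):
    """Assign B/I/O tags given (start, end) token spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

def word_features(tokens, j):
    """Sparse binary features for the word at position j.

    Each key names a test; components that are not listed are implicitly 0.
    """
    word = tokens[j]
    prev = tokens[j - 1] if j > 0 else "<START>"
    return {
        "word=" + word.lower(): 1,
        "prev_word=" + prev.lower(): 1,
        "suffix=" + word[-2:].lower(): 1,
        "is_capitalized": int(word[:1].isupper()),
    }

# The person-name example from the text.
tokens = ["by", "John", "J.", "Donner", "Jr.", "the"]
tags = bio_tags(tokens, [(1, 5)])
features = [word_features(tokens, j) for j in range(len(tokens))]
```

The resulting pairs of feature dictionaries and tags form the training data for any classifier that can handle sparse inputs.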

Hidden Markov Models
One problem of standard classification approaches is that they do not take into account the predicted labels of the surrounding words. This can be done using probabilistic models of sequences of labels and features. Frequently used is the hidden Markov model (HMM), which is based on the conditional distribution of the current label $L^{(j)}$ given the previous label $L^{(j-1)}$ and the distribution of the current word $t^{(j)}$ given the current and the previous labels $L^{(j)}$, $L^{(j-1)}$:

$$L^{(j)} \sim p(L^{(j)} \mid L^{(j-1)}) \qquad t^{(j)} \sim p(t^{(j)} \mid L^{(j)}, L^{(j-1)}) \qquad (18)$$

A training set of words and their correct labels is required. For the observed words the algorithm takes into account all possible sequences of labels and computes their probabilities. An efficient method that exploits the sequential structure to find the most probable label sequence is the Viterbi algorithm (Rabiner 1989). Hidden Markov models were successfully used for named entity extraction, e.g. in the Identifinder system (Bikel et al. 1999).
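The following Python sketch shows the Viterbi recursion for finding the most probable label sequence. It makes a simplifying assumption that differs from the model above: the word distribution depends only on the current label, $p(t^{(j)} \mid L^{(j)})$, as in a standard first-order HMM. The function names and all probability values are invented for illustration and are not estimated from data.

```python
import math

def to_log(table):
    """Convert a (possibly nested) table of probabilities to log space."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}

def viterbi(words, labels, log_start, log_trans, log_emit,
            log_unseen=math.log(1e-6)):
    """Return the most probable label sequence for words.

    log_unseen is a hypothetical floor for words a label never emitted.
    """
    score = [{l: log_start[l] + log_emit[l].get(words[0], log_unseen)
              for l in labels}]
    back = []
    for w in words[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for l in labels:
            # Best predecessor label for ending in label l at this position.
            best = max(labels, key=lambda p: prev[p] + log_trans[p][l])
            cur[l] = (prev[best] + log_trans[best][l]
                      + log_emit[l].get(w, log_unseen))
            ptr[l] = best
        score.append(cur)
        back.append(ptr)
    # Trace the best path backwards from the best final label.
    path = [max(labels, key=lambda l: score[-1][l])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

labels = ["O", "B", "I"]
log_start = to_log({"O": 0.8, "B": 0.19, "I": 0.01})
log_trans = to_log({"O": {"O": 0.7, "B": 0.29, "I": 0.01},
                    "B": {"O": 0.3, "B": 0.1, "I": 0.6},
                    "I": {"O": 0.5, "B": 0.1, "I": 0.4}})
log_emit = to_log({"O": {"by": 0.5, "the": 0.5},
                   "B": {"John": 0.9, "Donner": 0.1},
                   "I": {"Donner": 0.8, "John": 0.2}})
best_path = viterbi(["by", "John", "Donner", "the"], labels,
                    log_start, log_trans, log_emit)
```

The recursion keeps only the best score per label and position, so its cost grows linearly with the sentence length instead of exponentially with the number of possible label sequences.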

Conditional Random Fields
Hidden Markov models require the conditional independence of features of different words given the labels. This is quite restrictive as we would like to include features which correspond to several words simultaneously. A recent approach for modelling this type of data is called conditional random field (CRF, cf. Lafferty et al. (2001)). Again we consider the observed vector of words $t$ and the corresponding vector of labels $L$. The labels have a graph structure. For a label $L_c$ let $N(c)$ be the indices of neighboring labels. Then $(t, L)$ is a conditional random field when, conditioned on the vector $t$ of all terms, the random variables obey the Markov property

$$p(L_c \mid t, L_d; d \neq c) = p(L_c \mid t, L_d; d \in N(c)),$$

i.e. the whole vector $t$ of observed terms and the labels of neighbors may influence the distribution of the label $L_c$. Note that we do not model the distribution $p(t)$ of the observed words, which may exhibit arbitrary dependencies.
We consider the simple case that the words $t = (t_1, t_2, \ldots, t_n)$ and the corresponding labels $L_1, L_2, \ldots, L_n$ have a chain structure and that $L_c$ depends only on the preceding and succeeding labels $L_{c-1}$ and $L_{c+1}$. Then the conditional distribution $p(L \mid t)$ has the form

$$p(L \mid t) = \frac{1}{Z(t)} \exp\left( \sum_{j=1}^{n} \sum_{r} \lambda_{jr} f_{jr}(L_j, t) + \sum_{j=2}^{n} \sum_{r} \mu_{jr} g_{jr}(L_j, L_{j-1}, t) \right),$$

where $Z(t)$ is a normalizing constant and $f_{jr}(L_j, t)$ and $g_{jr}(L_j, L_{j-1}, t)$ are different feature functions related to $L_j$ and the pair $L_j, L_{j-1}$ respectively. CRF models encompass hidden Markov models, but they are much more expressive because they allow arbitrary dependencies in the observation sequence and more complex neighborhood structures of labels. As for most machine learning algorithms a training sample of words and the correct labels is required. In addition to the identity of words, arbitrary properties of the words, like part-of-speech tags, capitalization, prefixes and suffixes, etc. may be used, sometimes leading to more than a million features. The unknown parameter values $\lambda_{jr}$ and $\mu_{jr}$ are usually estimated using conjugate gradient optimization routines (McCallum 2003). McCallum (2003) applies CRFs with feature selection to named entity recognition and reports the following F1-measures for the CoNLL corpus: person names 93%, location names 92%, organization names 84%, miscellaneous names 80%. CRFs also have been successfully applied to noun phrase identification (McCallum 2003), part-of-speech tagging (Lafferty et al. 2001), shallow parsing (Sha & Pereira 2003), and biological entity recognition (Kim et al. 2004).

Explorative Text Mining: Visualization Methods
Graphical visualization frequently provides more comprehensive information that can be understood better and faster than purely text-based descriptions, and thus helps to mine large document collections. Many of the approaches developed for text mining purposes are motivated by methods that had been proposed in the areas of explorative data analysis, information visualization and visual data mining. For an overview of these areas of research see, e.g., U. Fayyad (2001); Keim (2002). In the following we will focus on methods that have been specifically designed for text mining or, as a subgroup of text mining methods and a typical application of visualization methods, information retrieval.
In text mining or information retrieval systems visualization methods can improve and simplify the discovery or extraction of relevant patterns or information. Information that allows a visual representation comprises aspects of the document collection or result sets, keyword relations, ontologies or, if retrieval systems are considered, aspects of the search process itself, e.g. the search or navigation path in hyperlinked collections.
However, especially for text collections we have the problem of finding an appropriate visualization for abstract textual information. Furthermore, an interactive visual data exploration interface is usually desirable, e.g. to zoom into local areas or to select or mark parts for further processing. This results in great demands on the user interface and the hardware. In the following we give a brief overview of visualization methods that have been realized for text mining and information retrieval systems.

Visualizing Relations and Result Sets
Interesting approaches to visualize keyword-document relations are, e.g., the Cat-a-Cone model (Hearst & Karadi 1997), which visualizes hierarchies of categories in a three-dimensional representation that can be used interactively to refine a search. The InfoCrystal (Spoerri 1995) visualizes a (weighted) boolean query and the corresponding result set in a crystal structure. The Lyberworld model (Hemmje et al. 1994) and the visualization components of the SENTINEL Model (Fox et al. 1999) represent documents in an abstract keyword space.
An approach to visualize the results of a set of queries was presented in Havre et al. (2001). Here, retrieved documents are arranged on straight lines according to their similarity to a query. These lines are arranged in a circle around a common center, i.e. every query is represented by a single line. If several documents are placed on the same (discrete) position, they are arranged at the same distance to the circle, but with a slight offset. Thus, clusters occur that represent the distribution of documents for the corresponding query.
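The geometric idea of such a radial arrangement can be sketched in a few lines of Python. This is not the method of Havre et al. (2001) itself but a rough illustration under stated assumptions: similarities lie in [0, 1], they are discretized into a fixed number of bins, more similar documents are drawn closer to the center, and documents that fall into the same bin are shifted perpendicular to the query line. The function name and all parameters (`bins`, `inner`, `outer`, `offset`) are hypothetical.

```python
import math
from collections import defaultdict


def radial_layout(results, bins=10, inner=1.0, outer=5.0, offset=0.15):
    """Place documents on one straight line per query around a common center.

    results: list with one dict per query, mapping doc id -> similarity in [0, 1].
    Returns a dict mapping (query index, doc id) -> (x, y).
    """
    positions = {}
    for q, sims in enumerate(results):
        angle = 2 * math.pi * q / len(results)
        ux, uy = math.cos(angle), math.sin(angle)  # direction of the query line
        px, py = -uy, ux                           # perpendicular, used for offsets
        slots = defaultdict(list)
        for doc, sim in sims.items():
            slots[min(int(sim * bins), bins - 1)].append(doc)  # discrete position
        for b, docs in slots.items():
            # Assumption: higher similarity is drawn closer to the center.
            radius = outer - (outer - inner) * (b + 0.5) / bins
            for k, doc in enumerate(sorted(docs)):
                shift = (k - (len(docs) - 1) / 2) * offset  # spread colliding docs
                positions[(q, doc)] = (radius * ux + shift * px,
                                       radius * uy + shift * py)
    return positions


layout = radial_layout([{"a": 0.91, "b": 0.93, "c": 0.2}, {"a": 0.5}])
```

In this example documents "a" and "b" fall into the same similarity bin of the first query, so they share a radius and differ only by the small perpendicular shift, which is what makes dense positions appear as visible clusters.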

Visualizing Document Collections
For the visualization of document collections usually two-dimensional projections are used, i.e. the high dimensional document space is mapped onto a two-dimensional surface. In order to depict individual documents or groups of documents usually text flags are used, which represent either a keyword or the document category. Colors are frequently used to visualize the density, e.g. the number of documents in an area, or the difference from neighboring documents, e.g. in order to emphasize borders between different categories. If three-dimensional projections are used, the number of documents assigned to a specific area can, for example, be represented by the z-coordinate.

An Example: Visualization Using Self-Organizing Maps
Visualization of document collections requires methods that are able to group documents based on their similarity and, furthermore, that visualize the similarity between discovered groups of documents. Clustering approaches that are frequently used to find groups of documents with similar content (Steinbach et al. 2000; see also section 3.2) usually do not consider the neighborhood relations between the obtained cluster centers. Self-organizing maps, as discussed above, are an alternative approach which is frequently used in data analysis to cluster high dimensional data. The resulting clusters are arranged in a low-dimensional topology that preserves the neighborhood relations of the corresponding high dimensional data vectors. Thus, not only are objects that are assigned to one cluster similar to each other, but objects of nearby clusters are also expected to be more similar than objects in more distant clusters.
Usually, two-dimensional arrangements of squares or hexagons are used for the definition of the neighborhood relations. Although other topologies are possible for self-organizing maps, two-dimensional maps have the advantage of intuitive visualization and thus good exploration possibilities. In document retrieval, self-organizing maps can be used to arrange documents based on their similarity. This approach opens up several appealing navigation possibilities. Most importantly, the grid cells surrounding documents known to be interesting can be scanned for further similar documents. Furthermore, the distribution of keyword search results can be visualized by coloring the grid cells of the map with respect to the number of hits. This allows a user to judge, e.g., whether the search results are assigned to a small number of (neighboring) grid cells of the map, or whether the search hits are spread widely over the map and thus the search was most likely too unspecific.
A first application of self-organizing maps in information retrieval was presented in Lin et al. (1991). It provided a simple two-dimensional cluster representation (categorization) of a small document collection. A refined model, the WEBSOM approach, extended this idea to a web based interface applied to newsgroup data that provides simple zooming techniques and coloring methods (Honkela et al. 1996; Honkela 1997; Kohonen et al. 2000). Further extensions introduced hierarchies (Merkl 1998), supported the visualization of search results (Roussinov & Chen 2001) and combined search, navigation and visualization techniques in an integrated tool (Nürnberger 2001). A screenshot of the prototype discussed in Nürnberger (2001) is depicted in Fig. 4.
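The coloring of grid cells by the number of search hits can be illustrated with a small Python sketch. It assumes that a trained map already exists and that each document has been assigned to one grid cell; the training of the map is not shown. The function names, the text-based "shading" and the example data are hypothetical and serve only to show how a concentrated hit distribution differs from a widely spread one.

```python
from collections import Counter


def hit_map(doc_cells, hits, rows, cols):
    """Count search hits per grid cell; doc_cells maps doc id -> (row, col)."""
    counts = Counter(doc_cells[d] for d in hits if d in doc_cells)
    return [[counts.get((r, c), 0) for c in range(cols)] for r in range(rows)]


def occupied_fraction(grid):
    """Share of grid cells containing at least one hit (low = focused search)."""
    cells = [n for row in grid for n in row]
    return sum(1 for n in cells if n > 0) / len(cells)


def render(grid, shades=" .:#"):
    """Crude text rendering of the map: a denser character means more hits."""
    top = max(max(row) for row in grid) or 1
    return "\n".join(
        "".join(shades[(n * (len(shades) - 1) + top - 1) // top] for n in row)
        for row in grid
    )


# Hypothetical assignment of five documents to cells of a 3 x 3 map.
doc_cells = {"d1": (0, 0), "d2": (0, 0), "d3": (0, 1), "d4": (2, 2), "d5": (1, 1)}
grid = hit_map(doc_cells, hits=["d1", "d2", "d3"], rows=3, cols=3)
spread = occupied_fraction(grid)
picture = render(grid)
```

Here all hits fall into two neighboring cells, so the occupied fraction is small. A value close to 1 would indicate hits scattered over the whole map, i.e. a query that was probably too unspecific.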

Other Techniques
Besides methods based on self-organizing maps, several other techniques have been successfully applied to visualize document collections. For example, the tool VxInsight (Boyack et al. 2002) realizes a partially interactive mapping by an energy minimization approach similar to simulated annealing to construct a three-dimensional landscape of the document collection. As input either a vector space description of the documents or a list of directional edges, e.g. defined based on citations of links, can be used. The tool SPIRE (Wise et al. 1995) applies a three-step approach: it first clusters documents in document space, then projects the discovered cluster centers onto a two-dimensional surface and finally maps the documents relative to the projected cluster centers. SPIRE offers a scatter plot like projection as well as a three-dimensional visualization. The visualization tool SCI-Map (Small 1999) applies an iterative clustering approach to create a network using, e.g., references of scientific publications. The tool visualizes the structure by a map hierarchy with an increasing number of details.
One major problem of most existing visualization approaches is that they create their output only by use of data inherent information, i.e. the distribution of the documents in document space. User specific information cannot be integrated in order to obtain, e.g., an improved separation of the documents with respect to user defined criteria like keywords or phrases. Furthermore, the possibilities for a user to interact with the system in order to navigate or search are usually very limited, e.g., to boolean keyword searches and simple result lists.

Further Application Areas
Further major applications of text mining methods consider the detection of topics in text streams and text summarization.
Topic detection studies the problem of detecting new and upcoming topics in time-ordered document collections. The methods are frequently used in order to detect and monitor (topic tracking) news tickers or news broadcasts. An introduction and overview of current approaches can be found in Allan (2002).
Text summarization aims at the creation of a condensed version of a document or a document collection (multidocument summarization) that should contain its most important topics. Most approaches still focus on the idea of extracting individual informative sentences from a text. The summary then simply consists of a collection of these sentences. However, recently refined approaches try to extract semantic information from documents and create summaries based on this information (cf. Leskovec et al. (2004)). For an overview see Mani & Maybury (1999) and Radev et al. (2002).
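The sentence extraction idea can be sketched as follows. The scoring used here, the average corpus frequency of a sentence's non-stopword terms, is only one simple choice made for illustration and is not taken from the cited works. The tiny stopword list, the regular expressions for sentence and word splitting and the function names are likewise assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "it", "that"}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercased alphabetic tokens without stopwords."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, k=2):
    """Return the k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


text = ("Text mining extracts patterns from text. "
        "The weather was pleasant. "
        "Mining text collections requires preprocessing of text.")
summary = summarize(text, k=2)
```

The off-topic sentence about the weather receives the lowest score because its words occur only once in the text, so it is left out of the two-sentence summary.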

Applications
In this section we briefly discuss successful applications of text mining methods in areas as diverse as patent analysis, text classification in news agencies, bioinformatics and spam filtering. Each of the applications has specific characteristics that had to be considered while selecting appropriate text mining methods.

Patent Analysis
In recent years the analysis of patents has developed into a large application area. The reasons for this are, on the one hand, the increased number of patent applications and, on the other hand, the progress that has been made in text classification, which allows these techniques to be used in this area, which is quite sensitive due to its commercial impact. Meanwhile, supervised and unsupervised techniques are applied to analyze patent documents and to support companies and also the European patent office in their work. The challenges in patent analysis consist of the length of the documents, which are longer than the documents usually used in text classification, and the large number of available documents in a corpus (Koster et al. 2001). A document usually consists of 5,000 words on average. More than 140,000 documents have to be handled by the European patent office (EPO) per year. They are processed by 2,500 patent examiners in three locations.
In several studies the classification quality of state-of-the-art methods was analyzed. Koster et al. (2001) reported very good results with a 3% error rate for 16,000 full text documents to be classified into 16 classes (mono-classification) and a 6% error rate in the same setting for abstracts only, using the Winnow (Littlestone 1988) and the Rocchio algorithm (Rocchio 1971). These results are possible due to the large amount of available training documents. Good results are also reported in Krier & Zacca (2002) for an internal EPO text classification application with a precision of 81% and a recall of 78%.
Text clustering techniques for patent analysis are often applied to support the analysis of patents in large companies by structuring and visualizing the investigated corpus. Thus, these methods have found their way into many commercial products but are still of interest for research, since there is still a need for improved performance. Companies like IBM offer products to support the analysis of patent text documents. Dörre et al. (1999) describe the IBM Intelligent Miner for Text in a scenario applied to patent text and also compare data mining and text mining. Coupet & Hehenberger (1998) not only apply clustering but also provide visualizations of the results. A similar scenario on the basis of SOM is given in Lamirel et al. (2003).

Text Classification for News Agencies
In publishing houses a large number of news stories arrive each day. The users like to have these stories tagged with categories and the names of important persons, organizations and places. To automate this process the Deutsche Presse-Agentur (dpa) and a group of leading German broadcasters (PAN) wanted to select a commercial text classification system to support the annotation of news articles. Seven systems were tested with two given test corpora of about half a million news stories and different categorical hierarchies of about 800 and 2,300 categories (Paaß & deVries 2005). Due to confidentiality the results can be published only in anonymized form.
For the corpus with 2,300 categories the best system achieved an F1-value of 39%, while for the corpus with 800 categories an F1-value of 79% was reached.
In the latter case a partially automatic assignment based on the reliability score was possible for about half the documents, while otherwise the systems could only deliver proposals for human categorizers. The results for recovering persons and geographic locations are especially good, with about 80% F1-value. In general there were great variations between the performances of the systems.
In a usability experiment with human annotators the formal evaluation results were confirmed, leading to faster and more consistent annotation. It turned out that, with respect to categories, the human annotators exhibit a relatively large disagreement and a lower consistency than text mining systems. Hence the support of human annotators by text mining systems offers more consistent annotations in addition to faster annotation. The Deutsche Presse-Agentur now routinely uses a text mining system in its news production workflow.

Bioinformatics
Bio-entity recognition aims to identify and classify technical terms in the domain of molecular biology that correspond to instances of concepts that are of interest to biologists. Examples of such entities include the names of proteins, genes and their locations of activity such as cells or organism names. Entity recognition is becoming increasingly important with the massive increase in reported results due to high throughput experimental methods. It can be used in several higher level information access tasks such as relation extraction, summarization and question answering.
In the 2004 evaluation four types of extraction models were used: Support Vector Machines (SVMs), Hidden Markov Models (HMMs), Conditional Random Fields (CRFs) and the related Maximum Entropy Markov Models (MEMMs). Varying types of input features were employed: lexical features (words), n-grams, orthographic information, word lists, part-of-speech tags, noun phrase tags, etc. The evaluation shows that the best five systems yield an F1-value of about 70% (Kim et al. 2004). They use SVMs in combination with Markov models (72.6%), MEMMs (70.1%), CRFs (69.8%), CRFs together with SVMs (66.3%), and HMMs (64.8%). For practical applications the current accuracy levels are not yet satisfactory and research currently aims at including a sophisticated mix of external resources such as keyword lists and ontologies which provide terminological resources.

Anti-Spam Filtering of Emails
The explosive growth of unsolicited e-mail, more commonly known as spam, has been constantly undermining the usability of e-mail over the last years. One solution is offered by anti-spam filters. Most commercially available filters use black-lists and hand-crafted rules. On the other hand, the success of machine learning methods in text classification offers the possibility to arrive at anti-spam filters that may be adapted quickly to new types of spam.
There is a growing number of learning spam filters, mostly using naive Bayes classifiers. A prominent example is Mozilla's e-mail client. Michelakis et al. (2004) compare different classifier methods and investigate different costs of classifying a proper mail as spam. They find that for their benchmark corpora the SVM nearly always yields the best results.
To explore how well a learning-based filter performs in real life, they used an SVM-based procedure for seven months without retraining. They achieved a precision of 96.5% and a recall of 89.3%. They conclude that these good results may be improved by careful preprocessing and the extension of filtering to different languages.
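As an illustration of how a learning spam filter of the naive Bayes type works in principle, the following Python sketch trains a multinomial naive Bayes model with add-one smoothing on a handful of made-up messages. It is a minimal sketch, not the filter of any particular product or of the cited study: the class name, the whitespace tokenization, the example messages and the `threshold` parameter are assumptions, and the code expects at least one training message per class. The threshold is one simple way to reflect that misclassifying a proper mail as spam is usually the more costly error.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def log_odds(self, text):
        """log p(spam | text) - log p(ham | text) under the naive Bayes model."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            n_words = sum(self.word_counts[label].values())
            logp = math.log(self.doc_counts[label] / total_docs)  # class prior
            for w in text.lower().split():
                logp += math.log((self.word_counts[label][w] + 1)
                                 / (n_words + len(vocab)))
            scores[label] = logp
        return scores["spam"] - scores["ham"]

    def is_spam(self, text, threshold=0.0):
        # A higher threshold makes the filter more reluctant to flag mail.
        return self.log_odds(text) > threshold


nb = NaiveBayesFilter()
nb.train("cheap pills buy now", "spam")
nb.train("buy cheap watches now", "spam")
nb.train("meeting agenda for monday", "ham")
nb.train("please review the agenda", "ham")

flagged = nb.is_spam("buy cheap pills")
kept = not nb.is_spam("monday meeting agenda")
cautious = nb.is_spam("buy cheap pills", threshold=3.0)
```

With the default threshold the first test message is flagged and the second is kept. With the stricter threshold of 3.0 the first message is no longer flagged, because its log odds of about 2.9 fall just below the threshold.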
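Since most other results in this survey are given as F1-values, it can be helpful to express this precision and recall pair on the same scale. The F1-value is the harmonic mean of precision and recall. The short sketch below computes it for the reported figures; the resulting number is derived here and is not reported in the cited study.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 96.5% and recall 89.3% as reported for the SVM-based filter.
spam_f1 = f1_score(0.965, 0.893)
```

For these inputs the F1-value is roughly 93%.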

Conclusion
In this article, we tried to give a brief introduction to the broad field of text mining. To this end, we motivated this field of research, gave a more formal definition of the terms used herein and presented a brief overview of currently available text mining methods, their properties and their application to specific problems. Even though it was impossible to describe all algorithms and applications in detail within the (size) limits of an article, we think that the ideas discussed and the provided references should give the interested reader a rough overview of this field and several starting points for further studies.

Figure 2: Hyperplane with maximal distance (margin) to examples of positive and negative classes constructed by the support vector machine.

Figure 3: Network architecture of self-organizing maps (left) and possible neighborhood function v for increasing distances from s (right)

Figure 4: A Prototypical Retrieval System Based on Self-Organizing Maps

Table 1: Performance of Different Classifiers for the Reuters collection