A New Centroid-based Approach for Genre Categorization of Web Pages

In this paper we propose a new centroid-based approach for genre categorization of web pages. Our approach constructs genre centroids from a set of genre-labeled web pages, called training web pages. The obtained centroids are then used to classify new web pages. The aim of our approach is to provide a flexible, incremental, refined and combined categorization, which is more suitable for automatic web genre identification. Our approach is flexible because it assigns a web page to all predefined genres with a confidence score; it is incremental because it classifies web pages one by one; it is refined because each web page either refines the centroids or is discarded as a noisy page; finally, our approach combines three different feature sets, i.e. URL addresses, logical structure and hypertext structure. The experiments conducted on two well-known corpora show that our approach is very fast and outperforms other approaches.


Introduction
As the World Wide Web continues to grow exponentially, web page categorization becomes increasingly important in web searching. Web page categorization, also called web page classification, assigns a web page to one or more predefined categories. According to the type of category, categorization can be divided into sub-problems: topic categorization, sentiment categorization, genre categorization, and so on.
Recently, more attention has been given to automatic genre identification of web pages because it can be used to improve the quality of web search results (see, for example, the articles in this journal and Mehler et al. ()).
However, although potentially useful, the concept of "genre" is difficult to define and genre definitions abound. Generally speaking, a genre is a category of artistic, musical, or literary composition characterized by a particular style, form, or content (see, for example, the Merriam-Webster Online Dictionary, http://www.m-w.com), but more specialized characterizations have been proposed. For instance, Kessler et al. () defined a genre as a bundle of facets, focusing on different textual properties such as brow, narrative and genre. According to Shepherd and Watters (), while non-digital genre is defined by the tuple <content, form>, the genre of web pages (or "cybergenre") is characterized by the triple <content, form, functionality>, where the functionality attribute accounts for the interaction between the user and the web page. Rauber and Müller-Kögler () defined a genre as a group of documents that share the same stylistic properties; in their experiment with digital libraries, documents of the same genre are rendered with the same color. According to Finn (), genre is orthogonal to topic, and relates to polarities such as subjectivity/objectivity and positivity/negativity. For Boese (), a genre is characterized by the same style, form and content.
In this article, the word "genre" is loosely defined as a textual category that can be more or less related to the topic or content of a web page. For this reason, we use two different collections: one created by genre researchers for whom the concept of genre is independent of topic (the KI-04 corpus, see Section ); the other including a number of academic categories (the WebKB collection, see Section ).
In order to comply with our view of genre, our approach is flexible, incremental and refined, and combines different feature sets. We devised it to be fast, so that in the future it can be applied in web search engines.
Currently, search engines use keywords to classify web pages. Returned web pages are ranked and displayed to the user, who is often not satisfied with the result. For example, searching for the keywords "machine learning" will return a list of web pages containing the words "machine" and "learning", but these pages belong to different genres. Therefore, web page genre categorization could be used to improve the retrieval quality of search engines (e.g. see Meyer Zu Eissen ()). For instance, a classifier could be trained on existing web directories and then applied to new web pages. At query time, the user could be asked to specify one or more desired genres, so that the search engine would return a list of the genres under which the retrieved pages fall.
However, a web page is a complex object that includes heterogeneous elements with different communicative purposes. Generally, a web page is composed of different sections organized in the form of headings and links. These sections belong to different genres. Graphical elements (search buttons, images, menus, forms, etc.) and text types, sizes and colors are used to mark sections in web pages. Our approach assigns a web page to all predefined genres with different confidence scores, which represent the similarity between the web page and the centroid of each genre.
It is worth noting that web genres evolve over time because of the continuous modification of the content and purpose of web pages. Simply put, web genre evolution consists in updating old genres and creating new ones. In our approach we focus on the adjustment of old genres. Since automatic genre identification of web pages requires continuous learning, because web pages are often updated, we propose an incremental approach (see Section ).
Additionally, the World Wide Web is an open environment, where a user can add a new page, modify the content of an existing web page, delete a web page and so on. For this reason, the web is unstable and contains many noisy web pages. Taking such web pages into account decreases the accuracy of genre classification (e.g. see Shepherd et al. ()). In our approach we propose a refined genre classification of web pages that discards noisy web pages. A web page is considered noisy when its similarities to all genre centroids are below a predetermined threshold.
As mentioned above, a web page is not only text but also contains many HTML tags. The information delimited by these tags is very useful for genre categorization. These information sources are heterogeneous because they have different representation structures, which should be combined to increase the performance of genre classification of web pages.
In summary, the aim of our approach is to provide a flexible, incremental, refined and combined categorization, which is more suitable for automatic web genre identification. Our approach is flexible because it assigns a web page to all predefined genres with a confidence score; it is incremental because it classifies web pages one by one; it is refined because each web page either refines the centroids or is discarded as a noisy page; finally, our approach combines three different feature sets, i.e. URL addresses, logical structure and hypertext structure.
This article is organized as follows: in Section  we summarize previous work on genre categorization of web pages; in Section  we explain our approach; in Section  we briefly describe the corpora used in our experiments; Section  presents our experimental results; Section  presents a comparative study; finally, in Section  we present some conclusions as well as our future work.

Related Work
Previous work on automatic genre identification is reviewed by focusing on features, classification algorithms and genre corpora.
Features Many types of features have been proposed for automatic genre categorization. In the following paragraphs, the most important ones are listed. Kessler et al. () used four types of features to classify part of the Brown corpus by multiple facets (i.e. brow, narrative and genre). The first type is represented by structural features, which include counts of functional words, sentences, etc. The second type relies on lexical features, which include the presence of specific words or symbols. The third kind consists of character-level features, such as punctuation marks. The fourth kind is based on derivative features, which are derived from character-level and lexical features. These four feature sets can be divided into two coarser types: structural features and surface features.
Karlgren () used twenty features including frequencies of functional words and Parts-of-Speech (POSs).He also used text statistics, e.g.counts of characters, words, number of words per sentence, etc.
Stamatatos et al. () identified genre using the most common English words. They used the fifty most frequent words in the BNC corpus and the eight most frequent punctuation marks (period, comma, colon, semicolon, quotes, parentheses, question mark and hyphen).
Dewdney et al. () adopted two features sets: BOW (Bag Of Words) and presenta tion features.Presentation features amounted to  features including layout features, Jebari linguistic features, verb tenses, etc. Finn and Kushmerick () used a total of  features to differentiate between subjective vs objective news articles and positive vs negative movie reviews.Most of these features were represented by the frequencies of genre-specific words.Meyer Zu Eissen and Stein () used different kinds of features including presentation features (i.e.HTML tag frequencies), classes of words (names, dates, etc.), frequencies of punctuation marks and POS tags.Kennedy and Shepherd () used a feature set including features about content (e.g. common words, met tags), about the form (e.g.number of images) and about the functionality (e.g.number of links, JavaScripts).Boese and Howe () used different kind of features, which can be grouped into three classes, namely stylistic features, form features and content features.More recently, Santini () and Lim et al. () tried to exploit all previ ously used features.Additionally, Lim et al. () used the URL as new feature and Kanaris and Stamatatos () used character n-grams extracted from both text and structure.Mehler et al. () studied the usefulness of logical document structure in text type classification.They adopted two approaches, which are the Quantitative Structure Analysis (QSA) and the Document Object Model Tree Kernel (DomTK).They conducted experiments to stress the usefulness of structure in document type recognition and compared the QSA approach against the DomTK approach.
Machine Learning Techniques Once a set of features has been obtained, it is necessary to choose a categorization algorithm. Most genre categorization algorithms are based on machine learning techniques (cf. Mitchell ()). Among these techniques, we briefly explain Naïve Bayes, k-Nearest Neighbor, Decision trees and Support Vector Machines because they have been widely used in automatic genre identification.
Naïve Bayes is a simple probabilistic algorithm that determines the probability that a document belongs to a particular genre. Naïve Bayes is a very fast learning algorithm, which is robust to irrelevant features. It needs little storage space and can handle missing values. However, since the weights are the same for all features, performance can be degraded by having many irrelevant features. This technique has been used by Argamon et al. (), Dewdney et al. () and Santini ().

Corpora and Evaluation
To date, web genre benchmarks built with principled and shared criteria are still missing (cf. Santini and Sharoff in this issue). This means that the performance of a genre categorization system depends on the specific corpora being classified. For instance, Kessler et al. () used a corpus of  texts from the Brown Corpus belonging to six diverse genres (reportage, scientific and technical, fiction, etc.). They report . and . accuracies for logistic regression and neural network classifiers, respectively. Dewdney et al. () used a corpus of  texts belonging to seven diverse genres (advertisements, bulletin boards, radio news, etc.). They achieved ., . and . accuracies for Naïve Bayes, C4.5 and SVM, respectively. Meyer Zu Eissen and Stein () compiled the KI-04 corpus. In their first experiment, they used  web pages ( web pages for each of the eight genres included in the corpus) and applied discriminant analysis. They achieved an accuracy of .. Boese and Howe () used the WebKB corpus to study the effect of web genre evolution. Based on a logistic regression classifier, they reported an accuracy of .. Kanaris and Stamatatos () used the KI-04 corpus and the SVM classifier. They obtained accuracies between . and .. Santini () used SVM to classify the KI-04 corpus. She reported an accuracy of about .. Mehler et al. () used SVM classifiers and a German newspaper corpus that contains , texts distributed over  genres or types. Their experiments yielded F=. for QSA and F=. for DomTK.

Proposed Approach
The aim of our approach is to classify web pages by genre based on three different feature sets, namely URL addresses, logical structure and hypertext structure. The proposed approach is based on the construction of genre centroids using a set of genre-labeled web pages. Each new web page is assigned to all genres with different confidence scores, which represent the similarity between the web page and the centroid of each genre. In Subsection . we explain our feature extraction process. The representation of features, the construction of centroids, the categorization of new web pages and the combination of classifiers are described in Subsections ., ., . and ., respectively.

Feature Extraction
In our approach, we used three different types of features, namely the URL addresses, the logical structure and the hypertext structure.
The URL is encoded as a text line, which contains genre-specific words. For example, the presence of "FAQ" or "CV" in the file name is a reliable hint that a web page belongs to the FAQ or CV genre, respectively.
The logical and hypertext structures of a web page are encoded in its HTML tags. The logical structure is represented by the text between <title> and </title> tags and the text between <Hn> and </Hn> tags (n = 1, ..., 6), while the hypertext structure is represented by the text included in the anchors (between <A ...> and </A> tags). To quantify this contextual and structural information, we used the bag-of-words approach, already employed by Dewdney et al. () for automatic genre identification, which relies on all words without ordering.
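As an illustrative sketch (not the authors' implementation), the three feature texts described above can be pulled out of a page with Python's standard html.parser module; the class and function names here are our own:

```python
import re
from html.parser import HTMLParser

class GenreFeatureExtractor(HTMLParser):
    """Collects the text inside <title> and <h1>..<h6> (logical structure)
    and inside <a>...</a> anchors (hypertext structure)."""
    LOGICAL_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.logical, self.hypertext = [], []
        self._stack = []  # which bucket currently collects character data

    def handle_starttag(self, tag, attrs):
        if tag in self.LOGICAL_TAGS:
            self._stack.append(self.logical)
        elif tag == "a":
            self._stack.append(self.hypertext)

    def handle_endtag(self, tag):
        if (tag in self.LOGICAL_TAGS or tag == "a") and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack and data.strip():
            self._stack[-1].append(data.strip())

def extract_feature_sets(html, url):
    """Returns the three token lists used by the three classifiers.
    The URL is treated as a plain text line of genre-specific words."""
    parser = GenreFeatureExtractor()
    parser.feed(html)
    url_tokens = [w.lower() for w in re.split(r"[^A-Za-z]+", url) if w]
    return (url_tokens,
            " ".join(parser.logical).split(),
            " ".join(parser.hypertext).split())
```

Body text outside titles, headings and anchors is deliberately ignored, mirroring the feature definition above.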

Representation
Web page representation is performed through three main steps, which are pre-processing, term weighting, and normalization.
Pre-processing Pre-processing is a basic step in document categorization. In our approach, the aim of this step is summarized in the following points:
• Tokenize text into words.
• Remove numbers, non-letter characters and special characters.
• Remove stop words, which are automatically identified using Luhn's law (Luhn, ).
• Use the information gain to reduce the number of obtained terms (Yang and Pedersen, ).
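The steps above can be sketched as follows. The stop-word list is an illustrative stand-in for the Luhn-based selection, and the information-gain reduction is omitted; the function name is ours:

```python
import re

# Illustrative stop-word list only; the paper derives stop words
# automatically from the collection using Luhn's law, and then reduces
# the vocabulary further with information gain (both omitted here).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def preprocess(text, stop_words=STOP_WORDS):
    # Tokenize into lower-cased words; numbers, non-letter and special
    # characters are dropped by the regular expression itself.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words.
    return [t for t in tokens if t not in stop_words]
```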
Term weighting In our work, web pages are represented using the vector space model. We use three different vectors representing the URLs, the logical structure and the hypertext structure. For each feature set, a web page is represented by a vector pj of terms. Each term ti is weighted using the tf-idf weighting technique (Salton and Buckley, ).
With this technique, the weight wij of a term ti in a web page pj increases with the number of times that the term ti occurs in the page pj and decreases with the number of pages of the collection in which the term ti occurs. This means that the importance of a term in a page is proportional to the number of times that the term appears in the page, while the importance of the term is inversely proportional to the number of pages of the collection in which the term appears. Formally, this reasoning is defined as follows:

$w_{ij} = tf_{ij} \times \log\left(\frac{|D|}{n_{t_i}}\right)$

where tfij is the number of times that term ti appears in web page pj, |D| is the total number of pages in the collection, and nti is the number of pages in which term ti appears.
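A minimal sketch of this weighting scheme over token lists (function name ours):

```python
import math
from collections import Counter

def tfidf_vectors(pages):
    """pages: one token list per web page. Returns one {term: weight}
    dict per page, with w_ij = tf_ij * log(|D| / n_ti) as above."""
    D = len(pages)
    df = Counter()                      # n_ti: document frequency per term
    for page in pages:
        df.update(set(page))
    vectors = []
    for page in pages:
        tf = Counter(page)              # tf_ij: raw term counts in the page
        vectors.append({t: tf[t] * math.log(D / df[t]) for t in tf})
    return vectors
```

A term occurring in every page gets weight zero, which is exactly the "inversely proportional to collection frequency" behaviour described above.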

Normalization
The tf-idf technique favors long documents and penalizes short documents. To deal with this problem, Lertnattee and Theeramunkong () proposed a normalization technique, called TD, which is based on term distribution within a particular class and within the collection of documents. The term distribution is based on three different factors. These factors depend on the average frequency of the term ti in all pages of genre gk. This average, denoted by $\overline{tf}_{ik}$, is defined as follows:

$\overline{tf}_{ik} = \frac{1}{|D_{g_k}|} \sum_{p_j \in D_{g_k}} tf_{ijk}$

where Dgk represents the set of web pages that belong to genre gk, and tfijk is the frequency of term ti in page pj of genre gk.
The normalization factors are the inter-class standard deviation (icsd), the class standard deviation (csd) and the standard deviation (sd). The inter-class standard deviation promotes a term that exists in almost all genres but whose frequencies in those genres are quite different. For a term ti, this factor is defined as follows:

$icsd_i = \sqrt{\frac{1}{|G|} \sum_{k=1}^{|G|} \left(\overline{tf}_{ik} - \overline{tf}_i\right)^2}, \qquad \overline{tf}_i = \frac{1}{|G|} \sum_{k=1}^{|G|} \overline{tf}_{ik}$

The class standard deviation of a term ti in a genre gk depends on the different frequencies of the term in the pages of that genre, and varies from genre to genre. This factor is defined as follows:

$csd_{ik} = \sqrt{\frac{1}{|D_{g_k}|} \sum_{p_j \in D_{g_k}} \left(tf_{ijk} - \overline{tf}_{ik}\right)^2}$

The standard deviation of a term ti depends on the frequency of that term in all the pages of the collection and is independent of genres. It is defined as follows:

$sd_i = \sqrt{\frac{1}{|D|} \sum_{p_j \in D} \left(tf_{ij} - \frac{1}{|D|} \sum_{p_l \in D} tf_{il}\right)^2}$

Using the tf-idf weighting technique and term distributions for normalization, the weight of term ti for page pj in genre gk is defined as follows:

$w_{ijk} = w_{ij} \times icsd_i^{\alpha} \times csd_{ik}^{\beta} \times sd_i^{\gamma}$

where α, β and γ are the normalization parameters, which are used to adjust the relative weight of each factor and to indicate whether it is used as a multiplier or as a divisor of the term's tf-idf weight wij. An experimental study is conducted in Section  to identify appropriate values for these parameters.
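Under the reading of the factors given above, with each parameter acting as a signed exponent (positive for multiplier, negative for divisor), the computation for one term can be sketched as:

```python
import math

def std(values, mean):
    """Population standard deviation around a given mean."""
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def td_factors(term_freqs_by_genre):
    """term_freqs_by_genre: {genre: [tf of the term in each page of that
    genre]}. Returns (icsd, csd per genre, sd) for that one term."""
    class_means = {g: sum(tfs) / len(tfs)
                   for g, tfs in term_freqs_by_genre.items()}
    grand_mean = sum(class_means.values()) / len(class_means)
    icsd = std(list(class_means.values()), grand_mean)
    csd = {g: std(tfs, class_means[g])
           for g, tfs in term_freqs_by_genre.items()}
    all_tfs = [tf for tfs in term_freqs_by_genre.values() for tf in tfs]
    sd = std(all_tfs, sum(all_tfs) / len(all_tfs))
    return icsd, csd, sd

def td_weight(tfidf_w, icsd, csd_k, sd, alpha, beta, gamma):
    # Positive exponent = factor multiplies the tf-idf weight;
    # negative exponent = factor divides it.
    return tfidf_w * (icsd ** alpha) * (csd_k ** beta) * (sd ** gamma)
```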

Construction of genre centroids
The centroid of a particular genre gj is represented by a vector Gj. This centroid is a combination of the vectors pi belonging (or not) to that genre. Several ways have been proposed to calculate this centroid. The most used one is the normalized sum, defined as follows:

$G_j = \frac{\sum_{p_i \in s_j} p_i}{\left\lVert \sum_{p_i \in s_j} p_i \right\rVert}$

We observed that web pages that are far away from their genre centroid tend to negatively affect the performance of categorization. Our hypothesis is that these web pages increase web search noise and, consequently, cannot be considered useful training pages. For this reason, they should be excluded during centroid computation. Assume that we have obtained a set of genre centroids G = {G1, ..., Gj, ..., G|G|}, where |G| is the number of genres. In our approach, we discard web pages whose similarity to the genre centroid is below a predefined threshold s0. For each genre gj, we calculate a new set of training (labeled) web pages sj as follows:

$s_j = \{ p_i \mid sim(p_i, G_j) \geq s_0 \}$

where pi is a web page and sim is the cosine similarity between the page pi and the genre centroid Gj, defined as follows:

$sim(p_i, G_j) = \frac{p_i \cdot G_j}{\lVert p_i \rVert \times \lVert G_j \rVert}$

The sets of training pages obtained after refining are used to recalculate the genre centroids using the normalized sum presented in equation () above. Finally, the refined centroids are applied to classify new web pages. Note that the complexity of centroid construction is linear in the number of labeled web pages m and in the number of predefined genres |G|. Hence, the learning time is O(m|G|).
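The construction-and-refining procedure can be sketched as follows, using dictionaries as sparse term vectors (function names ours, not the authors' code):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(u.get(t, 0.0) * w for t, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(pages):
    """Normalized sum of the page vectors of one genre."""
    total = {}
    for p in pages:
        for t, w in p.items():
            total[t] = total.get(t, 0.0) + w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()} if norm else {}

def refine(labeled, s0):
    """labeled: {genre: [page vectors]}. Drops pages whose similarity to
    their genre centroid falls below s0, then recomputes the centroids."""
    centroids = {g: centroid(ps) for g, ps in labeled.items()}
    kept = {g: [p for p in ps if cosine(p, centroids[g]) >= s0]
            for g, ps in labeled.items()}
    return {g: centroid(ps) for g, ps in kept.items()}, kept
```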
In order to choose the appropriate threshold, we carried out the experimental study described in the next subsection.

Categorization of New Web Pages
In our approach, the categorization of new web pages is performed incrementally. For each new web page p, we calculate its cosine similarity with all genre centroids. Then, we refine the centroids whose similarity with the page p is greater than or equal to s0. The refining process is performed as follows:

$G_i = \frac{NS_i}{\lVert NS_i \rVert}$

where NSi is the non-normalized centroid of the genre gi and $\lVert \cdot \rVert$ represents the norm of the vector. NSi is calculated as follows:

$NS_i = NS_i + p$

Volume 24 (1) - 2009
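A sketch of this incremental step, assuming the update described above (a running un-normalized sum per genre; names ours):

```python
import math

def classify_and_update(page, non_normalized, s0):
    """page and centroids are sparse {term: weight} dicts. non_normalized
    maps each genre to its running un-normalized sum NS_i. Returns the
    confidence scores and adds the page to NS_i for every genre whose
    similarity is >= s0 (the refining step)."""
    scores = {}
    for g, ns in non_normalized.items():
        norm = math.sqrt(sum(w * w for w in ns.values())) or 1.0
        centroid = {t: w / norm for t, w in ns.items()}   # G_i = NS_i/||NS_i||
        dot = sum(centroid.get(t, 0.0) * w for t, w in page.items())
        pnorm = math.sqrt(sum(w * w for w in page.values())) or 1.0
        scores[g] = dot / pnorm
    for g, s in scores.items():
        if s >= s0:                       # the page refines this genre
            for t, w in page.items():
                non_normalized[g][t] = non_normalized[g].get(t, 0.0) + w
    return scores
```

A page below the threshold for every genre leaves all centroids untouched, which is the "discarded as a noisy page" case.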

The complexity of web page classification is linear in the number of genres |G| and in the number of unlabeled web pages n. Therefore, the running time for classification is O(n|G|).

Combination
The basic idea behind the combination of different classifiers is to create a more accurate classifier via some combination of the outputs of the contributing classifiers. In our approach, the idea is based on the intuition that the combination of homogeneous classifiers using heterogeneous features might improve the final result.
OWA operators OWA (Ordered Weighted Averaging) operators were first introduced by Yager (). Generally speaking, a mapping F : [0, 1]^n → [0, 1] is called an OWA operator of dimension n if it is associated with a weighting vector W = [w1, ..., wi, ..., wn] such that wi ∈ [0, 1], $\sum_i w_i = 1$ and

$F(a_1, \ldots, a_n) = \sum_i w_i b_i$

where bi is the i-th largest element in the collection a1, ..., an. Yager () suggested two methods for identifying the weights. The first approach uses learning techniques. The second one first attaches some semantics to the weights and then, based on this semantics, provides the values of the weights.
In the experiments described in this article, we used the second method, based on fuzzy linguistic quantifiers. According to Zadeh (), there are two types of quantifiers: absolute and relative. Here, we used relative quantifiers, typified by terms such as "most", "at least half", etc. A relative quantifier Q is defined as a mapping Q : [0, 1] → [0, 1] such that Q(0) = 0, there exists r ∈ [0, 1] such that Q(r) = 1, and Q is a non-decreasing function. Herrera and Verdegay () defined a quantifier function as follows:

$Q(r) = \begin{cases} 0 & \text{if } r < a \\ \dfrac{r - a}{b - a} & \text{if } a \leq r \leq b \\ 1 & \text{if } r > b \end{cases}$

where a, b ∈ [0, 1] are two parameters. Yager () computed the weights wi (i = 1, ..., n) as follows:

$w_i = Q\left(\frac{i}{n}\right) - Q\left(\frac{i-1}{n}\right)$

where n is set to 3 because we have three classifiers, namely the URL, logical and hypertext classifiers. Depending on the values of the parameters a and b, we used the following operators:
• Minimum: represented by the quantifier "for all".
• Maximum: represented by the quantifier "there exists".
• Median: represented by the quantifier "at least one".
• Vote: represented by the quantifier "at least half".
• Vote: represented by the quantifier "as many as possible" (r > 0.5).
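The quantifier-based weighting can be sketched as follows; the (a, b) parameter pairs in the test are illustrative choices, not the paper's exact settings for each named operator:

```python
def quantifier(r, a, b):
    """Relative quantifier Q(r) of Herrera and Verdegay: 0 below a,
    1 above b, linear in between."""
    if r < a:
        return 0.0
    if r > b:
        return 1.0
    return (r - a) / (b - a)

def owa(values, a, b):
    """OWA aggregation: weights from the quantifier, applied to the
    values sorted in decreasing order (b_i is the i-th largest)."""
    n = len(values)
    weights = [quantifier(i / n, a, b) - quantifier((i - 1) / n, a, b)
               for i in range(1, n + 1)]
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))
```

With a = 0 and b = 1 the quantifier is the identity and OWA reduces to the plain average; concentrating the quantifier's rise near 0 pushes the operator toward the maximum.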

()
Decision templates Decision templates were proposed by Kuncheva et al. (). Let E1, E2 and E3 be the URL, logical and hypertext classifiers. Each of these classifiers produces the output Ei(p) = [di1(p), ..., di|G|(p)], where dij(p) is the membership degree given by the classifier Ei to the hypothesis that a web page p belongs to genre j. The outputs of all classifiers can be represented by a decision profile matrix DP(p) whose rows are the classifier outputs:

$DP(p) = \begin{bmatrix} d_{11}(p) & \cdots & d_{1|G|}(p) \\ d_{21}(p) & \cdots & d_{2|G|}(p) \\ d_{31}(p) & \cdots & d_{3|G|}(p) \end{bmatrix}$

Using the training set Z = {Z1, ..., ZN}, we compute the fuzzy template Fi of each genre i, which is represented by a 3 × |G| matrix Fi = [fi(k, s)]. The element fi(k, s) is calculated as follows:

$f_i(k, s) = \frac{\sum_{j=1}^{N} Ind(Z_j, i)\, d_{ks}(Z_j)}{\sum_{j=1}^{N} Ind(Z_j, i)}$

where Ind(Zj, i) is an indicator function with value 1 if Zj comes from genre i and 0 otherwise. At this stage, the ranking of genres can be achieved by aggregating the columns of DP using fixed rules (minimum, maximum, product, average, etc.). Another method calculates a soft class label vector with components expressing the similarity S between the decision profile DP and the fuzzy templates F. The final classification vector CLV is defined as follows:

$CLV(p) = [\mu_1(p), \ldots, \mu_{|G|}(p)]$

where µi(p) is the similarity S(Fi, DP(p)) between the fuzzy template Fi of genre i and the decision profile DP(p) of the web page p. This similarity is calculated using the Euclidean measure.
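A sketch of template construction and matching under these definitions, ranking genres by Euclidean closeness between DP(p) and each template (function names ours):

```python
def fuzzy_templates(profiles, labels, n_genres):
    """profiles: one decision profile per training page, i.e. a list of
    per-classifier score rows (3 rows here, |G| columns); labels: the
    genre index of each page. The template of genre i is the average
    decision profile of the pages of that genre."""
    templates = {}
    for i in range(n_genres):
        members = [dp for dp, y in zip(profiles, labels) if y == i]
        rows = len(members[0])
        templates[i] = [[sum(dp[k][s] for dp in members) / len(members)
                         for s in range(n_genres)] for k in range(rows)]
    return templates

def dt_classify(dp, templates):
    """Returns the genre whose fuzzy template is closest (squared
    Euclidean distance) to the decision profile of the new page."""
    def sqdist(f):
        return sum((f[k][s] - dp[k][s]) ** 2
                   for k in range(len(dp)) for s in range(len(dp[0])))
    return min(templates, key=lambda i: sqdist(templates[i]))
```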

Corpora
In our experiments, we used the KI-04 corpus and the WebKB collection. These corpora are composed of English web pages. Each web page is associated with a specific source URL address and belongs to a single genre class.
• KI- corpus was compiled by Meyer Zu Eissen and Stein (Meyer Zu Eissen and Stein, ).It is composed of  HTML web pages, which are divided into eight genres (see Table ).
• The WebKB corpus was created at Carnegie Mellon University during the WebKB project (Craven et al., ). This corpus contains  HTML web pages from four different universities and comprises six genres (see Table ).

Evaluation
In this section, we describe the evaluation of our FRICC framework. FRICC stands for Flexible, Refined and Incremental Centroid-based Classifier. The aims of the evaluation process can be summarized as follows:
• Identify the best proportions of labeled and unlabeled web pages to achieve the best performance.
• Identify the appropriate number of terms to obtain the best performance.
• Identify the appropriate values of normalization parameters.
• Identify the best thresholds.
• Identify the best combination techniques.
For multiclass corpora, it is suitable to use the break-even point (BEP), which is defined in terms of the standard measures of precision and recall (Joachims, ). Precision P is the proportion of true document-category assignments among all assignments predicted by the classifier. Recall R is the proportion of true document-category assignments that were also predicted by the classifier. Formally, the BEP statistic finds the point where precision and recall are equal. Since this is hard to achieve in practice, a common approach is to use the arithmetic mean of precision and recall as an approximation, i.e. BEP = (P + R)/2. Since our corpora are unbalanced, we used the micro-averaged BEP, computed by first summing the elements of all binary contingency tables (one for each genre); the micro-averaged BEP is then computed from these accumulated statistics. Note that noisy web pages are not considered in the evaluation process.
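The pooled computation can be sketched as follows (function name ours):

```python
def micro_bep(contingency):
    """contingency: one (tp, fp, fn) triple per genre, i.e. the binary
    contingency tables. Pools the counts across genres, then returns
    (P + R) / 2 as the break-even approximation described above."""
    tp = sum(t for t, _, _ in contingency)
    fp = sum(f for _, f, _ in contingency)
    fn = sum(f for _, _, f in contingency)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (precision + recall) / 2
```

Pooling before dividing is what makes the measure "micro"-averaged: large genres dominate, which is the intended behaviour for unbalanced corpora.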
To measure the performance, we used 10 × k cross-validation. This means that we randomly split each corpus into k equal parts, used one part for testing and the remaining parts for training, and repeated this process  times; the final performance is the average of the  individual performances. The number k is identified experimentally according to the features and corpora used.
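Assuming the usual reading of repeats × k-fold splitting, the split generation can be sketched as (names and the seeding convention are ours):

```python
import random

def repeated_cv_splits(pages, k, repeats, seed=0):
    """Yields (train, test) index splits for repeats x k-fold
    cross-validation: each repeat reshuffles the indices, splits them
    into k folds, and uses each fold once as the test part."""
    rng = random.Random(seed)
    idx = list(range(len(pages)))
    for _ in range(repeats):
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]   # k roughly equal parts
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, test
```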

Results
In the following paragraphs, we describe a number of experiments and show the results.

Effect of Incremental Aspect
In this experiment, we varied the proportion of unlabeled web pages between 10% and 90% in steps of %. For each proportion, we measured the micro-averaged BEP for each feature set and corpus. The results are illustrated in Figure . The curves presented in Figure  show that the micro-averaged BEP depends on the proportion of labeled and unlabeled web pages. They also show that the logical structure classifier achieves the best performance for both the KI-04 and WebKB corpora. The proportions of unlabeled and labeled web pages that achieve the best performance are presented in Table . These proportions are used in the next experiments.

Effect of Vocabulary Size
The aim of this experiment is to identify the ideal number of terms to achieve the best performance. For this purpose, we calculated the micro-averaged BEP while varying the number of terms between 5 and 3000. The terms are selected according to the information gain measure. Note that in this experiment, we used the tf-idf weighting technique without normalization. The results are illustrated in Figure . The ideal numbers of terms are summarized in Table . These values are used in the next experiment.

Effect of Term Weighting
In order to evaluate the effect of each normalization factor alone (icsd, csd and sd), we conducted an experiment whose results are shown in Table . We observed that the icsd factor is very suitable for the KI-04 corpus because it contains heterogeneous genres. On the other hand, the sd factor achieves the best performance for the WebKB corpus because it contains homogeneous genres.
Table .The best performance is reported by setting the normalization parameters α, β and γ to ., - and -. respectively.These values will be used in the next experiment to choose the appropriate threshold.We noticed that in the case of noisy web pages like those contained in KI- corpus, the refining is very useful.On the other hand, for noiseless corpus like WebKB corpus, the refining is useless.The best refining thresholds will be used in the next experiments.The number of noisy web pages and the thresholds to achieve the best micro-averaged BEP are shown in Table .These experiments are evaluated using the accuracy measure.The micro-averaged accuracy for both KI- and WebKB corpora for each author is presented in Table .According to the results shown in this table, our approach outperforms other methods.• ≈ Indicates no significant differences.

Effect of Refining Aspects
• < indicates that the machine learning method achieves a significantly lower measurement than FRICC at the . significance level.
• << indicates that the machine learning method achieves a significantly lower measurement than FRICC at the . significance level.
• <<< indicates that the machine learning method achieves a significantly lower measurement than FRICC at the . significance level.
Table  shows that the F RICC approach outperforms all other machine learning methods in  cases.Only SVM has similar performance to F RICC.

Training and Test Times
Here we consider another important aspect, namely execution speed. Time is particularly important when genre classification has to be integrated into a search engine. Figures ,  and  show a comparison of the execution speeds of each classification method, in both the training and test phases, for the KI-04 and WebKB corpora.
Results show that our approach is the fastest, although Rocchio and SVM also perform well. These results indicate that the required time is proportional to the number of categories rather than the number of web pages. Decision tree is, indeed, the slowest machine learning technique for all feature sets and for both corpora.

Conclusions
In this article, we proposed a new approach for genre categorization of web pages. Our approach implements four aspects that were not explored in previous studies on genre categorization: flexibility, refining, incrementality and combination. Additionally, we conducted many experiments to measure the effectiveness, efficiency and speed of these aspects. Comparisons with previous approaches show that our method is very fast and outperforms results documented in previous work.
In the future we hope to investigate the following points:
• As PDF is a widely used format on the web, we propose to classify PDF documents.
• In this work, we used only English web pages; in the future we wish to focus on Arabic web documents.
• As our approach is very fast and outperforms many other machine learning techniques, we hope to include it in a search engine or browser (e.g. Google, Firefox), in a similar way as the WEGA add-on (Stein et al.).

Remark
The work described in this article summarizes the PhD thesis "Catégorisation Flexible et Incrémentale avec Raffinage de Pages web par Genre" ("Flexible and Incremental Categorization of Web Pages by Genre with Refining"), completed by the author, Chaker Jebari, in October , at Tunis El Manar University, College of Science, Computer Science Department, Tunisia.
al. (); Dewdney et al. (); Santini (). The k-Nearest Neighbor (k-NN) algorithm groups documents within a vector space.The Term Frequency Inverse Document Frequency (tfidf ) is usually employed to rep resent documents.The similarity between documents is computed with Euclidean or cosine measures.New documents are classified with the same genre as the nearest neigh bor.The K represents how many neighbors should be analyzed.K-Nearest Neighbor is used only by Lim et al. ().Decision trees are a popular technique used by Argamon et al. (), Dewdney et al. () and Finn ().Interestingly, Karlgren () applied a combination of decision trees and Nearest Neighbor.He calculated textual features for each document and categorized them into a hierarchy of clusters based on C. if-then rules.The  Santini () tried out also Naïve Bayes with different weights.labelsfor genres were then decided using Nearest Neighbor assignments and cluster centroids.Support Vector Machine is a powerful learning method introduced by Vapnik () and successfully applied to text categorization byJoachims ().SVM is based on Structural Risk Maximization theory, which aims to minimize the generalization error instead of relying on the empirical error on training data alone.The Support Vector Machine technique has been used in genre categorization by many authors (e.g.Kanaris and Stamatatos ; Dewdney et al. ; Meyer Zu Eissen and Stein ; Santini ).

Figure 1: Micro-averaged BEP for each feature and for both KI-04 (Left) and WebKB (Right) corpora when the proportion of test pages is varied between 10% and 90%

Figure 2: Micro-averaged BEP for each feature and for both KI-04 (Left) and WebKB (Right) corpora when the number of terms is varied between 5 and 3000

To measure the effect of refining on genre categorization, we varied the refining threshold between 0 and 1 in steps of .. A zero value means no refining. As illustrated in Figure , the value of the threshold affects the micro-averaged BEP of genre categorization.

Figure 3: Micro-averaged BEP for each feature and for both KI-04 (Left) and WebKB (Right) corpora when the refining threshold is varied between 0 and 1

Figure 4: Training and test times for URL and for both KI-04 (left) and WebKB (right) corpora

Figure 5:

Figure 6: Training and test times for hypertext structure and for both KI-04 (left) and WebKB (right) corpora

Table 1: Composition of the KI-04 corpus

Table 2: Composition of the WebKB corpus

Table 3: Best proportions of training and test web pages (Test%-Train%)

Table 4: Best values of the number of terms

Table 5: The effect of each normalization factor on genre categorization performance

Table 7: Best number of noisy web pages and refining thresholds for each feature and corpus

Effect of Combination
Here we conducted many experiments to choose the appropriate operator for combination. The results are shown in Table . They show that the decision template technique provides the best micro-averaged BEP (. for the KI-04 corpus and . for the WebKB corpus).

Table 8: Micro-averaged BEP for each combination technique and for both KI-04 and WebKB corpora

Comparative Study
The majority of previous studies do not provide a reliable comparison with other approaches. The main reason for this is that, until recently, there were no publicly available standard corpora for this task. Another reason is that there is no commonly perceived sense of specific web page genres; for example, in two recent studies, user agreement was only . In this article, we propose a comparison with other experiments, namely Meyer Zu Eissen and Stein (), Kanaris and Stamatatos () and Santini (), where the KI-04 corpus is employed. The WebKB corpus is used only by Boese and Howe (), so we will compare our results with that experiment.

Table 9: Micro-averaged accuracy for both KI-04 and WebKB corpora

Table 10: Micro-averaged BEP for the KI-04 corpus

Table 11: Micro-averaged BEP for the WebKB corpus

Statistical Significance
To determine the statistical significance of the results, we used the 5 × 2 cross-validation t-test (Dietterich, ). The results are presented in Table . The symbols used in this table are defined as follows:

Table 12: Statistical significance of our approach FRICC against other machine learning techniques