Keyword extraction for blogs based on content richness

Journal of Information Science (http://jis.sagepub.com/)
Published online 24 October 2013. DOI: 10.1177/0165551513508877
The online version of this article can be found at: http://jis.sagepub.com/content/early/2013/10/22/0165551513508877
Published by: http://www.sagepublications.com
On behalf of: Chartered Institute of Library and Information Professionals
OnlineFirst Version of Record: Oct 24, 2013
Downloaded from jis.sagepub.com at SungKyunKwan University on October 30, 2013

Article
Journal of Information Science, 1–12. © The Author(s) 2013
Reprints and permissions: sagepub.co.uk/journalsPermissions.nav

Keyword extraction for blogs based on content richness

Jinhee Park
College of Information and Communication Engineering, Sungkyunkwan University, Republic of Korea

Jaekwang Kim
College of Information and Communication Engineering, Sungkyunkwan University, Republic of Korea

Jee-Hyong Lee
College of Information and Communication Engineering, Sungkyunkwan University, Republic of Korea

Abstract

In this paper, a method is proposed to extract topic keywords of blogs, based on the richness of content. If a blog includes rich content related to a topic word, the word can be considered as a keyword of the blog. For this purpose, a new measure, richness, is proposed, which indicates how much a blog covers the trendy subtopics of a keyword. In order to obtain trendy subtopics of keywords, we use outside topical context data – the web. Since the web includes various and trendy information, we can find popular and trendy content related to a topic. For each candidate keyword, a set of web documents is retrieved by Google, and the subtopics found in the web documents are modelled by a probabilistic approach. Based on the subtopic models, the proposed method evaluates the richness of blogs for candidate keywords, in terms of how much a blog covers the trendy subtopics of keywords. If a blog includes various contents on a word, the word needs to be chosen as one of the keywords of the blog. In the experiments, the proposed method is compared with various methods, and shows better results, in terms of hit count, trendiness and consistency.

Keywords

Blogs; information retrieval; keyword extraction; LDA; subtopic model; text mining

1. Introduction

Nowadays, blogs are used as an important tool for the delivery of information and news via the Internet. Since there are usually hundreds of posts in a blog, it is important for readers to easily catch the topics of blogs.
For this purpose, many approaches have been proposed, such as providing topic keywords (for example, a tag cloud generated from tags in the posts, or keywords manually selected by human experts), or showing the categories to which the blogs belong [1–4].

However, categories usually just give information on the domains of blogs, which is not enough to help readers to catch the content of blogs. Human experts may provide detailed and appropriate information, by selecting topic keywords from blogs that describe them well, but this needs much time and deep inspection [5]. On the other hand, a tag cloud is automatically generated based on tags assigned by authors; therefore it is widely used for blogs to provide topic keywords [6–8]. However, tags cannot always be trusted, because they are manually assigned by authors, and it is not guaranteed that every post has tags. Therefore, tags cannot be considered to reflect all of the topics in a blog [9]. Moreover, tags can be misused as a postscript of posts, or even be abused for search engine optimization, by selecting only attractive tags that are highly rated by search engines.

Corresponding author: Jee-Hyong Lee, College of Information and Communication Engineering, Sungkyunkwan University, Seobu-ro, Jangan-gu, Suwon, Gyeonggi-do, Republic of Korea. Email: john@skku.edu

Since topic keywords help readers to understand the overall content of posts in blogs, words that can represent the content of a blog are to be chosen to provide information about the blog. For example, let us assume that a word A has a set of subtopics {a1, a2, a3, a4, a5}. If a blog covers content about {a1, a2, a3, a4}, it is appropriate to choose A as one of the topic keywords, because the blog covers most of the related topics about A. On the other hand, in the case of a blog including only {a1}, A may be inappropriate as a topic keyword, because the blog does not cover subtopics of A well.
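The subtopic example can be made concrete with a toy calculation. The sketch below is only an illustration under a strong simplification: it treats subtopics as plain labels and measures coverage as a set-overlap ratio, whereas the method proposed in this paper models subtopics probabilistically. The function name `coverage` and the blog contents are hypothetical.

```python
def coverage(keyword_subtopics, blog_subtopics):
    """Fraction of a keyword's subtopics that also appear in the blog."""
    if not keyword_subtopics:
        return 0.0
    covered = keyword_subtopics & blog_subtopics  # subtopics shared by both
    return len(covered) / len(keyword_subtopics)


# Word A and its subtopics, as in the example in the text.
subtopics_of_a = {"a1", "a2", "a3", "a4", "a5"}

blog_1 = {"a1", "a2", "a3", "a4"}  # covers most subtopics of A
blog_2 = {"a1"}                    # covers only one subtopic of A

print(coverage(subtopics_of_a, blog_1))  # 0.8
print(coverage(subtopics_of_a, blog_2))  # 0.2
```

Under this simplified view, A would be a reasonable topic keyword for the first blog but not for the second.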
When we choose topic keywords, we need to consider the coverage of keywords. The coverage is how well a keyword covers the content of a blog. However, the coverage is not a sufficient condition for topic keywords. Another is the trendiness. If readers look for blogs related to a topic keyword A, they may expect content that is not only strongly related to A, but that also fits with the current public interest in A. Readers will be more satisfied with such blogs.

There are many approaches for the extraction of topic keywords from documents, such as word-graph-based, link-based and probability-based approaches [1, 10–19]. Word-graph-based approaches, such as TextRank [12] and HITS [13], mainly consider the frequencies of words in documents, without considering the coverage or trendiness. These methods choose a word as one of the topic keywords if the word frequently appears together with important words. These methods are appropriate for choosing topic words in single documents.

There are also approaches based on the link structure between blogs [14–16]. Since those approaches consider blogs together with other neighbouring blogs, the subtopic coverage and trendiness can be implicitly reflected into chosen topic keywords. However, blogs have a tendency to be tightly connected to blogs with similar contents, so the topics may be too narrow or specific. Additionally, there are no common link architectures between blogs to be easily crawled or identified. This can be one problem in generalizing these methods.

Other research selected topics based on probabilistic analysis [17–19]. Latent Dirichlet allocation (LDA) is widely applied, because it is appropriate for topical analysis. These approaches are based on their own probabilistic models, extended from the original LDA model.
However, these models have a higher complexity than the original LDA model. Furthermore, these approaches are dependent on specific information, such as citations or comments, which is difficult to generalize for every blog.

In this paper, a method is proposed to extract topic keywords of blogs, considering the coverage of trendy subtopics of keywords, by a probabilistic analysis based on LDA. In order to obtain trendy subtopics of keywords, we use outside topical context data, the web. If we collect the documents on a keyword, t, from the web, the documents may include various and trendy subtopics of t. From the web documents on t, we build a probabilistic subtopic model of t. Since subtopic models are built from the web, the models are called the web context model of t. Based on web context models, a new measure is proposed: namely Richness, which indicates how much a blog reflects the web context, by measuring the coverage of the blog on the trendy subtopics that can be found in the current web. The proposed method selects keywords based on the richness measure, so that words that represent the trendy and diverse contents in a blog are chosen.

Most existing approaches choose a word if it covers the content of documents. However, we choose a word if the document covers most aspects of the word. This ensures that the meaning of keywords overlaps well with the content of documents.

The rest of this paper is organized as follows. Section 2 reviews the related work in the topic keyword selection. Section 3 describes the proposed topic keyword selection approach. Section 4 presents the experimental results. The paper is summarized and concluded in Section 5.

2. Related work

For the keyword extraction from a set of documents, several approaches have been proposed, such as word-graph-based, link-based and probability-based approaches. Word-graph-based methods approach the problem by building graphs of words.
The connection weights between them can be determined by the distance between the documents, or the count of co-occurrences. In order to determine important nodes in a graph, graph analysis methods are applied to word-graphs, such as PageRank and HITS.

Mihalcea and Tarau proposed TextRank [12], which is a representative keyword extraction approach based on PageRank. In TextRank, nodes are terms in a document, and links between nodes are bi-directional, because there are no directions between co-occurring terms, while the original PageRank algorithm uses directed hyperlinks between web documents. Hyperlink-Induced Topic Search (HITS) [13] is a graph analysis method, which was originally proposed for web document analysis. It divided documents on the web into hubs and authorities. An authority was semantically defined as a document that provided useful information for readers, and a hub was defined as a document that provided links to authorities. It calculated hub scores and authority scores for each document, by an iterative process. If HITS is applied to a word-graph, keywords can be chosen, based on either hub scores or authority scores.

The word-graph-based methods are usually used for a single document, not for a set of documents. These methods analyse the structure of a word-graph of a document, rather than considering the topical context in a set of documents. For this reason, keywords may be chosen from a local viewpoint.

Chen et al. suggested an analysis method for topic trends from the network of several blogs connected with common interests [14]. They also predicted the topics to be discussed in the future in a blog or a community, using their blogging-behaviour model in a supervised manner. The model is based on the graph representation of bloggers with temporal information. Qamra et al. also applied a community-based approach with temporal clustering, to find shared interests to identify topics and keywords [15]. Sekiguchi also used a similar approach. They extracted shared interests from communities of bloggers, and identified the topics from each blog, based on the common interests. They calculated a similarity distribution on words to evaluate topic scores [16].

Since those approaches analysed blogs together with other neighbouring blogs, trendy subtopics in a set of blogs could be implicitly considered for selecting topic keywords. However, blogs are usually tightly connected only to blogs with similar topics, and rarely have connections to blogs with different topics. The group of connected blogs based on the link structure usually contains only quite similar blogs, so their shared interests may be biased or specific. In other words, some words that are unimportant but frequently used by bloggers may be overestimated. Additionally, these approaches need additional information, such as the temporal metadata, and the link structure between blogs, which are not always available. Blogs usually have few links to other blogs, and may not provide machine-readable temporal metadata. It is not easy to identify the link structures between blogs, because there are no common or standard link architectures.

Some other studies approach the problem with probabilistic models. Nallapati and Cohen suggested a method to find the topic-specific influences of blog posts, by analysing citations between blog posts using machine learning techniques [17]. They suggested a model named Link-PLSA-LDA. They grouped blog posts into two groups, ‘cited’ and ‘citing’, and built a bipartite graph by citations, because citations were a good indicator of influences. Their model also considered the content of blog posts. Ahmed and Xing analysed blog posts from a perspective of ideology, using topical analysis by multi-view LDA [18].
They assumed that the contents of blog posts were affected by the writers’ ideological beliefs and the background topics of each ideology, so they added some more steps of the generation of each word in a document to the original LDA generative model. The study of Yano et al. introduced a comment prediction method from political blog posts, by applying LDA on blog posts [19].

In these studies, the LDA model was extended to their own generative models of blogs, which had a higher level of complexity than the original LDA model. They were also dependent on specific information, such as citations or comments, which not all blogs provide. The subtopic coverage could be reflected, because the LDA decomposes blogs into several subtopics. However, they analysed only the given blog, without considering outside topical context. The subtopics may be found in a local viewpoint, and thus the subtopic coverage may not be complete.

In addition to these approaches, there are some studies based on word frequencies. They usually measure the importance of each word using co-occurrences or statistical models [1, 10, 11]. However, these methods do not consider various aspects of words, so the chosen keywords may cover the content of documents in limited aspects. There are also some other types of research that use external data such as a thesaurus or ontology [20–22]. A thesaurus and ontology can be a good source as a secondary knowledge for extracting keywords, but they may carry a high cost of construction.

3. Proposed method

3.1. Overview

In this paper, a new measure, Richness, is proposed. The richness of a blog for a given keyword is a score that indicates how much trendy content the blog includes, from the viewpoint of the given keyword. In order to obtain the trendy subtopics of a keyword, the web context related to the keyword is extracted.
The content of blogs is evaluated, based on the web context.

The proposed method is based on an assumption that a topic includes several subtopics. For example, the subtopics of ‘Smart Phones’ can be ‘iOS’, ‘Android’, ‘Apps’, etc. If a topic keyword is given, its subtopics are identified. Next, how much the content of a blog is related to each subtopic is evaluated. If a blog poorly covers the subtopics of a keyword, then it is regarded that the blog has poor content on the keyword. On the other hand, if a blog covers most of the subtopics well, it is regarded as rich. For example, let us assume that a topic keyword A has a set of subtopics {a1, a2, a3, a4}, and another keyword B has a set of subtopics {b1, b2, b3, b4}. If a blog covers {a1, a3, a4, b2}, then keyword A may be preferred over B. The blog covers more subtopics of A than B.

When evaluating the subtopic coverage, we extract the current trendy subtopics of a given keyword. For this, the web documents related to a keyword are collected, using web search engines such as Google, and a method that can discover hidden topics in documents is applied, such as LDA. The web contains various documents that reflect the current interests of users. The subtopics found in documents collected from the web can be considered as trendy ones. In order to model trendy subtopics, we build a topical model, by applying a probabilistic method to the documents from the web. Since the model is built from the web document, it is called a web context model.

In order to evaluate the subtopic coverage of a blog for a keyword based on the web context, the probabilities that each post of the blog is generated from each subtopic of the web context are calculated.
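As a rough illustration of what scoring a post against each subtopic can look like, the sketch below treats every subtopic as a unigram word distribution and normalises the per-subtopic likelihoods of a post into probabilities. This is an assumption made for illustration only: the exact formula used by the proposed method is not given in this part of the paper, and the names `post_log_likelihood`, `subtopic_posteriors`, the `floor` smoothing value, the uniform prior over subtopics and the toy word probabilities are all hypothetical.

```python
import math


def post_log_likelihood(post_words, word_probs, floor=1e-6):
    """Log-probability of a post under one subtopic's unigram distribution.

    Words unseen in the subtopic get a small floor probability (an
    assumption made here to avoid log(0)).
    """
    return sum(math.log(word_probs.get(w, floor)) for w in post_words)


def subtopic_posteriors(post_words, subtopics):
    """Normalise per-subtopic likelihoods, assuming a uniform prior."""
    scores = {name: post_log_likelihood(post_words, probs)
              for name, probs in subtopics.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {name: math.exp(s - top) for name, s in scores.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Toy web context for the keyword 'Smart Phones' with two subtopics.
web_context = {
    "ios": {"iphone": 0.5, "apps": 0.3, "update": 0.2},
    "android": {"galaxy": 0.5, "apps": 0.3, "update": 0.2},
}

post = ["iphone", "apps", "update"]
posteriors = subtopic_posteriors(post, web_context)
print(posteriors)
```

In this toy setting the post is assigned almost entirely to the `ios` subtopic, because the word `iphone` is very unlikely under the `android` distribution.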
These probabilities are considered the similarity between subtopics and blog posts, because a high probability means that the content of a post and the context of a subtopic are highly related. The higher the probability is, the more similar the content of the posts is to the web context. The richness of a blog on a keyword is calculated, based on the similarities of blog posts, and the words on which a blog has a high richness are selected as keywords. In summary, if the content of a blog is highly related to the web documents on a topic word, we regard the blog as having rich content on the topic, and the word is chosen as a keyword.

Figure 1 depicts the three steps to extract the topic keywords of a blog based on the content richness. First, the candidate keywords are chosen from all the words appearing in a blog. The candidate keywords are the words that contribute much to the content of a blog, and are thus important to the blog. Then, the web documents for each candidate keyword are collected. Using the collected web documents, the web context model of each candidate keyword is created. The richness of a blog for a candidate keyword is calculated based on the web context models, and the candidate words are ordered, according to the richness. The top-ranked words are chosen as the keywords of the blog.

3.2. Latent Dirichlet Allocation

In this subsection, we briefly describe LDA, which the proposed method adopts to extract subtopics from the web context. LDA is a probabilistic generative model for a set of documents based on the bag-of-words assumption [23]. It is a statistical method to identify latent topics from observable data in documents. It is often used in the information retrieval area, for topical analysis of a document collection.

Figure 2 depicts a generative model of documents in LDA.
In the figure, w indicates a word; z is the topic of word w in document d; θ follows the Dirichlet distribution with parameter α, and determines the topic distribution for document d; β is the parameter of the Dirichlet prior, which determines the probabilities that those words belong to topics. For the generation of document d based on the model, a topic distribution, θ, is determined by the Dirichlet distribution with

Figure 1. Overall processes of the proposed method.
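The generative view of LDA introduced in Section 3.2 can be illustrated with a small sampling sketch. It follows the standard LDA generative process (a per-document topic distribution θ drawn from a Dirichlet with parameter α, a topic z drawn for each word position, and a word w drawn from the word distribution of topic z, which itself has a Dirichlet prior with parameter β). The vocabulary, the number of topics, the hyperparameter values and the function names are assumptions chosen for illustration, not values taken from this paper.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, topic_word, vocab, rng):
    """Generate one bag-of-words document with the standard LDA process."""
    theta = sample_dirichlet(alpha, rng)  # topic distribution of the document
    topics, words = [], []
    for _ in range(n_words):
        # Choose the topic z of this word position according to theta.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Choose the word w according to the word distribution of topic z.
        w = rng.choices(vocab, weights=topic_word[z])[0]
        topics.append(z)
        words.append(w)
    return theta, topics, words


rng = random.Random(0)                       # fixed seed for repeatability
vocab = ["ios", "android", "apps", "battery"]
n_topics = 2
alpha = [0.5] * n_topics                     # assumed document-topic prior
beta = [0.5] * len(vocab)                    # assumed topic-word prior

# One word distribution per topic, each drawn from Dirichlet(beta).
topic_word = [sample_dirichlet(beta, rng) for _ in range(n_topics)]

theta, topics, words = generate_document(8, alpha, topic_word, vocab, rng)
print(theta)
print(list(zip(topics, words)))
```

Inference in LDA runs this story in reverse: given only the observed words, it estimates the hidden topic distributions that are most likely to have produced them.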