The Web has been around and maturing for 25 years. The popular websites of today have undergone vast changes during this period, with a few being there almost since the beginning and many new ones becoming popular over the years. This makes it worthwhile to take a look at how these sites have evolved and what they might tell us about the future of the Web. We therefore embarked on a longitudinal study spanning almost the whole period of the Web, based on data collected by the Internet Archive starting in 1996, to retrospectively analyze how the popular Web as of now has evolved over the past 18 years. For our study we focused on the German Web, specifically on the top 100 most popular websites in 17 categories. This paper presents a selection of the most interesting findings in terms of volume, size as well as age of the Web. While related work in the field of Web Dynamics has mainly focused on change rates and analyzed datasets spanning less than a year, we looked at the evolution of websites over 18 years. We found that around 70% of the pages we investigated are younger than a year, with an observed exponential growth in age as well as in size up to now. If this growth rate continues, the number of pages from the popular domains will almost double in the next two years. In addition, we give insights into our data set, provided by the Internet Archive, which hosts the largest and most complete Web archive as of today.
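As a rough illustration of the doubling claim above: if the number of pages grows exponentially and doubles in about two years, the implied yearly factor is the square root of two, or roughly 41% growth per year. The sketch below assumes perfectly constant exponential growth and an exact two-year doubling time (the abstract only says "almost double"). The function names and the starting count of 1,000 pages are hypothetical and not taken from the paper.

```python
import math


def continuous_growth_rate(doubling_time_years: float) -> float:
    """Rate r in N(t) = N0 * exp(r * t) that doubles N after the given time."""
    return math.log(2.0) / doubling_time_years


def annual_growth_factor(doubling_time_years: float) -> float:
    """Factor by which the count is multiplied each year."""
    return math.exp(continuous_growth_rate(doubling_time_years))


def projected_count(current: float, years: float, doubling_time_years: float) -> float:
    """Project a count forward, assuming the growth rate stays constant."""
    return current * math.exp(continuous_growth_rate(doubling_time_years) * years)


if __name__ == "__main__":
    # Hypothetical starting point: 1,000 pages, doubling every two years.
    print(annual_growth_factor(2.0))
    print(projected_count(1000.0, years=2.0, doubling_time_years=2.0))
```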
{"title":"The Dawn of today's popular domains: A study of the archived German Web over 18 years","authors":"Helge Holzmann, W. Nejdl, Avishek Anand","doi":"10.1145/2910896.2910901","DOIUrl":"https://doi.org/10.1145/2910896.2910901","url":null,"abstract":"The Web has been around and maturing for 25 years. The popular websites of today have undergone vast changes during this period, with a few being there almost since the beginning and many new ones becoming popular over the years. This makes it worthwhile to take a look at how these sites have evolved and what they might tell us about the future of the Web. We therefore embarked on a longitudinal study spanning almost the whole period of the Web, based on data collected by the Internet Archive starting in 1996, to retrospectively analyze how the popular Web as of now has evolved over the past 18 years. For our study we focused on the German Web, specifically on the top 100 most popular websites in 17 categories. This paper presents a selection of the most interesting findings in terms of volume, size as well as age of the Web. While related work in the field of Web Dynamics has mainly focused on change rates and analyzed datasets spanning less than a year, we looked at the evolution of websites over 18 years. We found that around 70% of the pages we investigated are younger than a year, with an observed exponential growth in age as well as in size up to now. If this growth rate continues, the number of pages from the popular domains will almost double in the next two years. In addition, we give insights into our data set, provided by the Internet Archive, which hosts the largest and most complete Web archive as of today.","PeriodicalId":109613,"journal":{"name":"2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126938816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we report preliminary findings based on a quantitative analysis of data from a music social Q&A site, Music StackExchange, focusing on real-life music information needs, uses, and seeking. Eight major topic categories and a two-level taxonomy for question type/intent and the characteristics of questions in each category are presented. Our findings suggest that Q&A sites are a fruitful resource for identifying users' music information needs, how these needs are expressed, and intended uses for the information. On Music StackExchange, users' questioning behaviors were motivated by the recognition of knowledge gaps, lack of resources, need for others' opinions, or interest in research issues, spanning different topics. This study is explorative in nature and the results could improve the understanding of everyday life music information seeking. The findings can inform music librarians and general-purpose music information systems designers of the needs, requirements, and approaches to enhance music related controlled vocabularies, and improve search engines and online knowledge sharing communities to categorize and provide users with more relevant music information.
{"title":"Music information seeking via social Q&A: An analysis of questions in Music StackExchange community","authors":"Hengyi Fu, Yun Fan","doi":"10.1145/2910896.2910914","DOIUrl":"https://doi.org/10.1145/2910896.2910914","url":null,"abstract":"In this paper we report preliminary findings based on a quantitative analysis of data from a music social Q&A site, Music StackExchange, focusing on real-life music information needs, uses, and seeking. Eight major topic categories and a two-level taxonomy for question type/intent and the characteristics of questions in each category are presented. Our findings suggest that Q&A sites are a fruitful resource for identifying users' music information needs, how these needs are expressed, and intended uses for the information. On Music StackExchange, users' questioning behaviors were motivated by the recognition of knowledge gaps, lack of resources, need for others' opinions, or interest in research issues, spanning different topics. This study is explorative in nature and the results could improve the understanding of everyday life music information seeking. The findings can inform music librarians and general-purpose music information systems designers of the needs, requirements, and approaches to enhance music related controlled vocabularies, and improve search engines and online knowledge sharing communities to categorize and provide users with more relevant music information.","PeriodicalId":109613,"journal":{"name":"2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130770031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This study explored the effect of time constraints on searchers' interactions during two kinds of tasks through a user experiment. The results demonstrated that users did not tend to accelerate their reading or decision speed under a time constraint, but instead selected fewer pages to read, i.e., visited fewer content pages and search result pages (SERPs); they also made more mouse clicks but fewer keystrokes per page when searching under a time constraint. The results also showed that the time constraint affected search interactions on pages differently for the two types of tasks. The results have implications for the design of digital library systems that take into account users' time constraints or time pressure.
{"title":"Preliminary exploration of the effect of time constraint on search interactions on webpages","authors":"Chang Liu, Tao Xu","doi":"10.1145/2910896.2925463","DOIUrl":"https://doi.org/10.1145/2910896.2925463","url":null,"abstract":"This study explored the effect of time constraint on searchers' interactions during two kinds of tasks through conducting a user experiment. The results demonstrated users' did not tend to accelerate their reading or decision speed given time constraint, but to select fewer pages to read, i.e. visit fewer content pages and search result pages (SERPs); and they had more mouse clicks but fewer keystrokes per page when searching with time constraint. The results also showed the different effects of time constraint on search interactions on pages for two types of tasks. The results have implications for the design of digital library systems that take account users' time constraint or time pressure.","PeriodicalId":109613,"journal":{"name":"2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132428572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Soo-yeon Hwang, M. Cragin, M. Lesk, Yu-Hung Lin, Daniel O'Connor
This panel discusses the issues of dealing with fluid data and curating new data in digital libraries.
{"title":"Issues of dealing with fluid data in digital libraries","authors":"Soo-yeon Hwang, M. Cragin, M. Lesk, Yu-Hung Lin, Daniel O'Connor","doi":"10.1145/2910896.2926738","DOIUrl":"https://doi.org/10.1145/2910896.2926738","url":null,"abstract":"This panel discusses the issues of dealing with fluid data and curating new data in digital libraries.","PeriodicalId":109613,"journal":{"name":"2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122809419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present a recursive pseudo relevance feedback strategy for improving retrieval performance in similarity search. The strategy recursively searches on search results returned for a given query and produces a tree that is used for ranking. Experiments on the Reuters 21578 and WebKB datasets show how the strategy leads to a significant improvement in similarity search performance.
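The recursive idea in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the result tree is turned into a ranking, so the depth-decayed sum of similarities used here is an assumption, as are the bag-of-words cosine similarity and every name and parameter (`search`, `recursive_prf`, `k`, `depth`, `decay`).

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term counts for a whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec, corpus_vecs, k, exclude):
    """Top-k documents with non-zero similarity, skipping excluded ids."""
    scored = [(cosine(query_vec, vec), doc_id)
              for doc_id, vec in corpus_vecs.items() if doc_id not in exclude]
    scored = [(s, d) for s, d in scored if s > 0.0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


def recursive_prf(query_vec, corpus_vecs, k=2, depth=2, decay=0.5):
    """Search, then search again on each result, down to `depth` levels.

    Each recursive call is one node of the result tree. A document's score is
    the sum of its similarities over all nodes where it appears, weighted by
    decay ** (level - 1). This aggregation is an assumption for illustration.
    """
    scores = defaultdict(float)

    def expand(vec, level, weight, path):
        if level > depth:
            return
        for sim, doc_id in search(vec, corpus_vecs, k, exclude=path):
            scores[doc_id] += weight * sim
            expand(corpus_vecs[doc_id], level + 1, weight * decay, path | {doc_id})

    expand(query_vec, 1, 1.0, frozenset())
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical toy corpus; the paper used Reuters 21578 and WebKB.
    docs = {"d1": "apple banana", "d2": "banana cherry",
            "d3": "cherry date", "d4": "zebra"}
    corpus = {doc_id: vectorize(text) for doc_id, text in docs.items()}
    print(recursive_prf(vectorize("apple"), corpus, k=2, depth=2))
```

In this toy corpus the second level of recursion surfaces `d2`, which shares no term with the query but is similar to the first-level result `d1`.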
{"title":"Improving similar document retrieval using a recursive pseudo relevance feedback strategy","authors":"Kyle Williams, C. Lee Giles","doi":"10.1145/2910896.2925468","DOIUrl":"https://doi.org/10.1145/2910896.2925468","url":null,"abstract":"We present a recursive pseudo relevance feedback strategy for improving retrieval performance in similarity search. The strategy recursively searches on search results returned for a given query and produces a tree that is used for ranking. Experiments on the Reuters 21578 and WebKB datasets show how the strategy leads to a significant improvement in similarity search performance.","PeriodicalId":109613,"journal":{"name":"2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)","volume":"61 11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133648971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Research in disciplines such as the earth and biological sciences depends on the availability of representative physical samples that have been collected at substantial cost and effort, some of which are irreplaceable. The EarthCube iSamples (Internet of Samples in the Earth Sciences) Research Coordination Network (RCN), funded by the National Science Foundation, aims to connect physical samples and sample collections across the Earth Sciences with digital data infrastructures to revolutionize their utility in the support of science. The goal of this workshop is to attract a broad audience comprising earth scientists and other scientists working with physical samples, data curators, and computer and information scientists to learn from each other about the requirements of physical as well as digital sample and collection management.
{"title":"Physical samples and digital libraries","authors":"Unmil Karadkar, K. Lehnert, C. Lenhardt","doi":"10.1145/2910896.2926736","DOIUrl":"https://doi.org/10.1145/2910896.2926736","url":null,"abstract":"Research in disciplines such as the earth and biological sciences depends on the availability of representative physical samples that have been collected at substantial cost and effort and some are irreplaceable. The EarthCube iSamples (Internet of Samples in the Earth Sciences) Research Coordination Network (RCN), funded by the National Science Foundation, aims to connect physical samples and sample collections across the Earth Sciences with digital data infrastructures to revolutionize their utility in the support of science. The goal of this workshop is to attract a broad audience comprising of earth scientists and other scientists working with physical samples, data curators, and computer and information scientists to learn from each other about the requirements of physical as well as digital sample and collection management.","PeriodicalId":109613,"journal":{"name":"2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134610004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Community Question-Answering (CQA), where questions and answers are generated by peers, has become a popular method of information seeking in online environments. While the content repositories created through CQA sites have been used widely to support general purpose tasks, using them as online digital libraries that support educational needs is an emerging practice. Horizontal CQA services, such as Yahoo! Answers, and vertical CQA services, such as Brainly, are aiming to help students improve their learning process by answering their educational questions. In these services, receiving high-quality answers to a question is a critical factor not only for user satisfaction, but also for supporting learning. However, the questions are not necessarily answered by experts, and the askers may not have enough knowledge and skill to evaluate the quality of the answers they receive. This could be problematic when students build their own knowledge base by applying inaccurate information or knowledge acquired from online sources. Using moderators could alleviate this problem. However, a moderator's evaluation of answer quality may be inconsistent because it is based on their subjective assessments. Employing human assessors may also be insufficient due to the large amount of content available on a CQA site. To address these issues, we propose a framework for automatically assessing the quality of answers. This is achieved by integrating different groups of features - personal, community-based, textual, and contextual - to build a classification model and determine what constitutes answer quality. To test this evaluation framework, we collected more than 10 million educational answers posted by more than 3 million users on Brainly's United States and Poland sites. The experiments conducted on these datasets show that the model using Random Forest (RF) achieves more than 83% accuracy in identifying high-quality answers.
In addition, the findings indicate that personal and community-based features have more predictive power in assessing answer quality. Our approach also achieves high values on other key metrics such as F1-score and Area under ROC curve. The work reported here can be useful in many other contexts where providing automatic quality assessment in a digital repository of textual information is paramount.
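For readers unfamiliar with the metrics named above, the sketch below computes accuracy, F1-score and area under the ROC curve for a binary "high quality" label using their standard definitions. It is hand-written for illustration and is neither the paper's evaluation code nor any library's API. The labels and scores in the usage section are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form equals the area under
    the ROC curve but is quadratic in the number of examples.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up example: 1 = high-quality answer, 0 = not.
    y_true = [1, 1, 0, 0]
    y_pred = [1, 0, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.1]
    print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), roc_auc(y_true, scores))
```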
{"title":"Evaluating the quality of educational answers in community question-answering","authors":"Long T. Le, C. Shah, Erik Choi","doi":"10.1145/2910896.2910900","DOIUrl":"https://doi.org/10.1145/2910896.2910900","url":null,"abstract":"Community Question-Answering (CQA), where questions and answers are generated by peers, has become a popular method of information seeking in online environments. While the content repositories created through CQA sites have been used widely to support general purpose tasks, using them as online digital libraries that support educational needs is an emerging practice. Horizontal CQA services, such as Yahoo! Answers, and vertical CQA services, such as Brainly, are aiming to help students improve their learning process by answering their educational questions. In these services, receiving high quality answer(s) to a question is a critical factor not only for user satisfaction, but also for supporting learning. However, the questions are not necessarily answered by experts, and the askers may not have enough knowledge and skill to evaluate the quality of the answers they receive. This could be problematic when students build their own knowledge base by applying inaccurate information or knowledge acquired from online sources. Using moderators could alleviate this problem. However, a moderator's evaluation of answer quality may be inconsistent because it is based on their subjective assessments. Employing human assessors may also be insufficient due to the large amount of content available on a CQA site. To address these issues, we propose a framework for automatically assessing the quality of answers. This is achieved by integrating different groups of features - personal, community-based, textual, and contextual - to build a classification model and determine what constitutes answer quality. To test this evaluation framework, we collected more than 10 million educational answers posted by more than 3 million users on Brainly's United States and Poland sites. The experiments conducted on these datasets show that the model using Random Forest (RF) achieves more than 83% accuracy in identifying high quality of answers. In addition, the findings indicate that personal and community-based features have more prediction power in assessing answer quality. Our approach also achieves high values on other key metrics such as F1-score and Area under ROC curve. The work reported here can be useful in many other contexts where providing automatic quality assessment in a digital repository of textual information is paramount.","PeriodicalId":109613,"journal":{"name":"2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116173889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The accuracy of the contents of a knowledge base determines the effectiveness of knowledge service applications; it is therefore necessary to evaluate the confidence of triples when a knowledge base is built. This study introduces a generic computational methodology to compute the confidence values of triples in knowledge bases and detect potentially incorrect ones for further verification. The major contributions of the proposed methodology are as follows: (1) A process to compute the confidence values of triples is designed; (2) New algorithms are proposed to adjust the term frequency and inverse document frequency values of each triple; (3) A method to build a support vector machine (SVM) classifier based on the selected triples used for incorrect triple detection is presented.
{"title":"A methodology to evaluate triple confidence and detect incorrect triples in knowledge bases","authors":"Haihua Xie, Xiaoqing Lu, Zhi Tang, Mao Ye","doi":"10.1145/2910896.2925456","DOIUrl":"https://doi.org/10.1145/2910896.2925456","url":null,"abstract":"The accuracy of the contents of a knowledge base determines the effectiveness of knowledge service applications, thus, it is necessary to evaluate the confidence of triples when a knowledge base is built. This study introduces a generic computational methodology to compute the confidence values of triples in knowledge bases and detect potentially incorrect ones for further verification. The major contributions of the proposed methodology are as follows: (1) A process to compute the confidence values of triples is designed; (2) New algorithms are proposed to adjust the term frequency and inverse document frequency values of each triple; (3) A method to build a support vector machine (SVM) classifier based on the selected triples used for incorrect triple detection is presented.","PeriodicalId":109613,"journal":{"name":"2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123043789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
When users search on Twitter, they are overloaded each time with a mass of microblog posts that are not particularly informative and lack meaningful organization. Therefore, it is helpful to produce a summarized tweet timeline about the topic. Tweet timeline generation (TTG) is a task that aims to select a small set of representative tweets to generate a meaningful timeline. In this paper, we introduce an optimization framework to jointly model the relevance, novelty and coverage of the tweet timeline, including an effective tweet ranking algorithm. Extensive experiments on the public TREC 2014 dataset demonstrate that our method can achieve very competitive results against state-of-the-art TTG systems.
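To make the trade-off between relevance and novelty concrete, here is a small greedy selection sketch in the style of maximal marginal relevance. It is a swapped-in, well-known heuristic and not the optimization framework of the paper, and it does not model coverage at all. The relevance scores are assumed to come from some upstream ranker, and the Jaccard word overlap, the `trade_off` weight and all function names are hypothetical choices for illustration.

```python
def jaccard(a, b):
    """Word-level Jaccard overlap between two texts."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def select_timeline(tweets, relevance, size, trade_off=0.7):
    """Greedily pick tweets that are relevant but not redundant.

    At each step the gain of a candidate is
        trade_off * relevance - (1 - trade_off) * max overlap with picked tweets,
    so a near-duplicate of an already selected tweet is penalized.
    """
    selected = []
    candidates = list(range(len(tweets)))
    while candidates and len(selected) < size:
        def gain(i):
            redundancy = max((jaccard(tweets[i], tweets[j]) for j in selected),
                             default=0.0)
            return trade_off * relevance[i] - (1 - trade_off) * redundancy

        best = max(candidates, key=gain)
        selected.append(best)
        candidates.remove(best)
    return [tweets[i] for i in selected]


if __name__ == "__main__":
    # Made-up tweets and relevance scores.
    tweets = ["storm hits coast", "storm hits coast today", "power restored downtown"]
    relevance = [0.9, 0.85, 0.6]
    print(select_timeline(tweets, relevance, size=2))
```

In the made-up example the near-duplicate second tweet is skipped in favor of the less relevant but novel third one.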
{"title":"Leveraging tweet ranking in an optimization framework for tweet timeline generation","authors":"Lili Yao, Feifan Fan, Yansong Feng, Dongyan Zhao","doi":"10.1145/2910896.2925453","DOIUrl":"https://doi.org/10.1145/2910896.2925453","url":null,"abstract":"When users search in Twitter, they are overloaded with a mass of microblog posts every time, which are not particularly informative and lack of meaningful organization. Therefore, it is helpful to produce a summarized tweet timeline about the topic. The tweet timeline generation is such a task aiming at selecting a small set of representative tweets to generate meaningful timeline. In this paper, we introduce an optimization framework to jointly model the relevance, novelty and coverage of the tweet timeline, including effective tweet ranking algorithm. Extensive experiments on the public TREC 2014 dataset demonstrate our method can achieve very competitive results against the state-of-art TTG systems.","PeriodicalId":109613,"journal":{"name":"2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)","volume":"154 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123273135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Erekhinskaya, Mithun Balakrishna, M. Tatu, Steven D. Werner, D. Moldovan
Researchers in all domains need to keep abreast of recent scientific advances. Finding relevant publications and reviewing them is a labor-intensive task that lacks efficient automatic tools to support it. Current tools are limited to standard keyword-based search systems that return potentially relevant documents and then leave the user with the monumental task of sifting through them. In this paper, we present a semantic-driven system that automatically extracts the most important knowledge from a publication and reduces the effort required for the literature review. The system extracts key findings from biomedical papers in PubMed, populates a predefined template and displays it. This allows the user to get the key ideas of the content even before opening or downloading the publication.
{"title":"Knowledge extraction for literature review","authors":"T. Erekhinskaya, Mithun Balakrishna, M. Tatu, Steven D. Werner, D. Moldovan","doi":"10.1145/2910896.2925441","DOIUrl":"https://doi.org/10.1145/2910896.2925441","url":null,"abstract":"Researchers in all domains need to keep abreast with recent scientific advances. Finding relevant publications and reviewing them is a labor-intensive task that lacks efficient automatic tools to support it. Current tools are limited to standard keyword-based search systems that return potentially relevant documents and then leave the user with a monumental task of sifting through them. In this paper, we present a semantic-driven system to automatically extract the most important knowledge from a publication and reduces the effort required for the literature review. The system extracts key findings from biomedical papers in PubMed, populates a predefined template and displays it. This allows the user to get the key ideas of the content even before opening or downloading the publication.","PeriodicalId":109613,"journal":{"name":"2016 IEEE/ACM Joint Conference on Digital Libraries (JCDL)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129485373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}