Proceedings of XSEDE16: Diversity, Big Data, and Science at Scale: July 17-21, 2016, Intercontinental Miami Hotel, Miami, Florida, USA. Conference on Extreme Science and Engineering Discovery Environment (5th: 2016: Miami, Fla.): Latest Publications

Photoionization of Ne8+
M. Pindzola, S. Abdel-Naby, C. Ballance
DOI: 10.1145/2616498.2616500 (pages 23:1-23:2)

The time-dependent close-coupling method based on the Dirac equation is used to calculate single and double photoionization cross sections for Ne8+ in support of planned FLASH/DESY measurements. The fully correlated ground-state radial wavefunction is obtained by solving a time-independent inhomogeneous set of close-coupled equations. The repulsive interaction between electrons includes both Coulomb and Gaunt interactions. A Bessel function expression is used to include both dipole and quadrupole effects in the radiation-field interaction. Propagation of the time-dependent close-coupled equations yields single and double photoionization cross sections for Ne8+ in reasonably good agreement with distorted-wave and R-matrix results.

The Oklahoma PetaStore: A Business Model for Big Data on a Small Budget
Patrick Calhoun, David Akin, Joshua Alexander, Brett Zimmerman, Fred Keller, Brandon George, Henry Neeman
DOI: 10.1145/2616498.2616548 (pages 48:1-48:8)

In the era of Big Data, research productivity can be highly sensitive to the availability of large-scale, long-term archival storage. Unfortunately, many mass storage systems are prohibitively expensive at scales appropriate for individual institutions rather than for national centers. Furthermore, a key issue is the set of circumstances under which researchers can, and are willing to, adopt a centralized technology that, in a pure cost-recovery model, might be, or might appear to be, more expensive than what the research teams could build on their own. This paper examines a business model that addresses these concerns in a comprehensive manner, distributing the costs among a funding agency, the institution, and the research teams, thereby reducing the challenges faced by each.

Workload Aware Utilization Optimization for a Petaflop Supercomputer: Evidence Based Assessment Using Statistical Methods
Fei Xing, Haihang You
DOI: 10.1145/2616498.2616536 (pages 50:1-50:8)

Nowadays, computing resources such as supercomputers are shared by many users, and most systems rely on batch systems as their resource managers. From a user's perspective, the overall turnaround of each submitted job is measured by time-to-solution, the sum of batch queuing time and execution time. On a busy machine, most jobs spend more time waiting in the batch queue than executing, yet this is rarely treated as a topic of performance tuning and optimization in parallel computing. We propose a workload-aware method to systematically predict jobs' batch-queue waiting-time patterns, helping users optimize utilization and improve productivity. Using workload data gathered from a supercomputer, we apply a Bayesian framework to predict the temporal trend of the probability of a long queue wait. Not only can the machine's workload be predicted, but we can also provide users with a monthly updated reference chart that suggests better choices of CPU count and run-time request at submission, helping jobs avoid long waits in the batch queue. Our experiments show that the model makes over 89% correct predictions for all cases we tested.

Large-scale Sequencing and Assembly of Cereal Genomes Using Blacklight
Philip D. Blood, Shoshana Marcus, M. Schatz
DOI: 10.1145/2616498.2616502 (pages 20:1-20:6)

Wheat, corn, and rice provide 60 percent of the world's daily food intake, and just 15 plant species make up 90 percent of it. There is therefore tremendous agricultural and scientific interest in sequencing and studying plant genomes, especially in developing reference sequences to direct plant breeding and to identify functional elements. DNA sequencing technologies can now generate sequence data for large genomes at low cost; however, assembling the short sequencing reads into complete genome sequences remains a substantial computational challenge. Even one of the simpler ancestral species of wheat, Aegilops tauschii, has a genome size of 4.36 gigabasepairs (Gbp), nearly fifty percent larger than the human genome. Assembling a genome of this size requires computational resources, especially RAM to store the large assembly graph, that are out of reach for most institutions. In this paper, we describe a collaborative effort between Cold Spring Harbor Laboratory and the Pittsburgh Supercomputing Center to assemble large, complex cereal genomes, starting with Ae. tauschii, using the XSEDE shared-memory supercomputer Blacklight. We expect these experiences with Blacklight to provide a case study and computational protocol for other genomics communities seeking to leverage this or similar resources for the assembly of other significant genomes of interest.

HPC Simulation Workflows for Engineering Innovation
M. Shephard, Cameron W. Smith
DOI: 10.1145/2616498.2616556 (pages 56:1-56:2)

Efforts to develop component-based simulation workflows for industrial applications using XSEDE parallel computing systems are presented.

Detailed computational modeling of laminar and turbulent sooting flames
A. Dasgupta, Somesh P. Roy, D. Haworth
DOI: 10.1145/2616498.2616509 (pages 12:1-12:7)

This study reports the development and validation of two parallel flame solvers with soot models, based on the open-source computational fluid dynamics (CFD) toolbox OpenFOAM. First, a laminar flame solver is developed and validated against experimental data. A semi-empirical two-equation soot model and a detailed soot model using the method of moments with interpolative closure (MOMIC) are implemented in the laminar flame solver, along with an optically thin radiation model that includes gray soot radiation. Preliminary results using these models show good agreement with experimental data for the laminar axisymmetric diffusion flame studied. Second, a turbulent flame solver is developed using Reynolds-averaged equations and a transported probability density function (tPDF) method. The MOMIC soot model is implemented in this turbulent solver, together with a photon Monte Carlo (PMC) radiation model backed by a line-by-line spectral database. Validation of the turbulent solver is in progress. Both solvers show good scalability for a moderate-sized chemical mechanism and can be expected to scale even better when larger chemical mechanisms are used.

Runtime Pipeline Scheduling System for Heterogeneous Architectures
Julio C. Olaya, R. Romero
DOI: 10.1145/2616498.2616547 (pages 45:1-45:7)

Heterogeneous architectures can improve the performance of applications with computationally intensive, data-parallel operations. Even when these architectures reduce the execution time of applications, there are opportunities for additional performance improvement because the memory hierarchies of the central processor cores and the graphics processor cores are separate. Applications executing on heterogeneous architectures must allocate space in GPU global memory, copy input data, invoke kernels, and copy results back to CPU memory. This scheme does not overlap inter-memory data transfers with GPU computation, which increases application execution time. This research presents a software architecture with a runtime pipeline system for GPU input/output scheduling that acts as a bidirectional interface between the GPU computing application and the physical device. The main aim of the system is to reduce the impact of the processor-memory performance gap by overlapping device I/O with computation. Evaluation using application benchmarks shows speedups above 2x with respect to baseline, non-streamed GPU execution.

The hybrid Quantum Trajectory/Electronic Structure DFTB-based approach to Molecular Dynamics
Lei Wang, James W. Mazzuca, Sophya Garashchuk, J. Jakowski
DOI: 10.1145/2616498.2616503 (pages 24:1-24:8)

This paper describes a quantum trajectory (QT) approach to molecular dynamics with quantum corrections to the behavior of the nuclei, interfaced with on-the-fly evaluation of the electronic structure (ES). The nuclear wavefunction is represented by an ensemble of trajectories, concurrently propagated in time under the influence of the quantum and classical forces. For scalability to high-dimensional systems (hundreds of degrees of freedom), the quantum force is computed within the Linearized Quantum Force (LQF) approximation. The classical force is determined from the ES calculations, performed at the Density Functional Tight Binding (DFTB) level. A high-throughput DFTB version is implemented in a massively parallel environment using OpenMP/MPI. The dynamics has also been extended to describe Boltzmann (imaginary-time) evolution, which defines the temperature of a molecular system. The combined QTES-DFTB code has been used to study the reaction dynamics of systems consisting of up to 111 atoms.

Academic Torrents: A Community-Maintained Distributed Repository
Joseph Paul Cohen, Henry Z. Lo
DOI: 10.1145/2616498.2616528 (pages 2:1-2:2)

Fostering the free and open sharing of scientific knowledge between the scientific community and the general public is the goal of Academic Torrents. At its core it is a distributed network for efficient content dissemination, connecting scientists, academic journals, readers, research groups, and many others. Leveraging the power of its peer-to-peer architecture, Academic Torrents makes science more accessible through two initiatives. The open data initiative allows researchers to share their datasets at high speeds and low bandwidth cost through the peer-to-peer network. The cooperative nature of scientific research demands access to data, yet researchers face significant hurdles in making their data available. The technical benefits of the Academic Torrents network allow researchers to distribute content scalably and globally, which has led to its adoption by labs around the world for disseminating and sharing scientific data. Academic Torrents' open access initiative uses the same technology to share open access papers between institutions and individuals. We design a connector to our network that acts as an onsite digital stack, complementing the already existing physical stack curated in the same manner. By utilizing the collective resources of the academic community, we eliminate the biases of the closed subscription model and the pay-to-publish model.

FeatureSelector: an XSEDE-Enabled Tool for Massive Game Log Analysis
Y. D. Cai, B. Riedl, R. Ratan, Cuihua Shen, A. Picot
DOI: 10.1145/2616498.2616511 (pages 17:1-17:7)

Due to the huge volume and extreme complexity of online game data collections, selecting essential features for the analysis of massive game logs is not only necessary but also challenging. This study develops and implements a new XSEDE-enabled tool, FeatureSelector, which uses parallel processing techniques on high-performance computers to perform feature selection. By calculating probability distance measures based on K-L divergence, the tool quantifies the distance between variables in data sets and provides guidance for feature selection in massive game-log analysis. It has helped researchers choose high-quality, discriminative features from over 300 variables and select the top pairs of countries with the greatest differences from 231 country pairs in a 500 GB game-log data set. Our study shows that (1) K-L divergence is a good measure for correctly and efficiently selecting important features, and (2) the high-performance computing platform supported by XSEDE accelerated the feature selection process by more than a factor of 30. Besides demonstrating the effectiveness of FeatureSelector in a cross-country analysis using high-performance computing, this study also highlights lessons learned about feature selection in social science research and experience in applying parallel processing techniques to intensive data analysis.
