These periodograms executed the Plavchan algorithm [13], the most computationally intensive algorithm implemented by the periodogram code. The code is written in C for performance, and supports three algorithms that find periodicities according to their shape and according to their underlying data sampling rates. Processing costs do not vary widely with machine, so there is no reason to choose anything other than the most powerful machines. The Amazon Elastic Compute Cloud (EC2; hereafter, AmEC2) is perhaps the best-known commercial cloud provider, but academic clouds such as Magellan and FutureGrid are under development for use by the science community and will be free of charge to end users. We ran experiments on AmEC2 (http://aws.amazon.com/ec2/) and the National Center for Supercomputing Applications (NCSA) Abe high-performance cluster (http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/Intel64Cluster/). We executed two sets of relatively small processing runs on the Amazon cloud, and a larger run on the TeraGrid, a large-scale US cyberinfrastructure. In addition, there were 3.18 million I/O operations, for a total variable cost of US$0.30. The authors of [11] have shown that these data storage costs are, in the long term, much higher than would be incurred if the data were hosted locally. — A comparative study of the cost and performance of other commercial cloud providers will be valuable in selecting cloud providers for science applications. Table 3. Summary of processing resources on Amazon EC2. Table 5. Monthly storage cost for three workflows.
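Periodogram algorithms of this family share a common core: fold the time series at each trial period and score how orderly the folded curve is. The sketch below is not the Plavchan algorithm itself, nor the C code described in the text; it is a simplified Python illustration using the Lafler–Kinman phase-dispersion statistic, with all function names invented for the example.

```python
import math

def phase_dispersion(times, values, period):
    # Fold the time series at the trial period and sort samples by phase.
    phased = sorted(((t % period) / period, v) for t, v in zip(times, values))
    vals = [v for _, v in phased]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals)
    # Lafler-Kinman statistic: squared differences between phase-adjacent
    # samples, normalized by the variance. Smooth folded curves score low.
    jumps = sum((vals[i + 1] - vals[i]) ** 2 for i in range(len(vals) - 1))
    return jumps / var if var > 0 else float("inf")

def periodogram(times, values, trial_periods):
    # One (period, score) pair per trial period; the minimum score marks
    # the best candidate period.
    return [(p, phase_dispersion(times, values, p)) for p in trial_periods]
```

The fold-and-score loop over thousands of trial periods is what dominates the compute cost, which is consistent with the text's observation that the periodogram runs are CPU-intensive; a production code adds smoothing, error weighting and significance estimates on top of this core.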
Wrangler, as mentioned above, allows the user to specify the number and type of resources to provision from a cloud provider, and to specify what services (file systems, job schedulers, etc.) should be automatically deployed on these resources. DAGMan relies on the resources (compute, storage and network) defined in the executable workflow to perform the necessary actions. Figure 3. NFS performed surprisingly well in cases where there were either few clients or the I/O requirements of the application were low. Montage generated an 8° square mosaic of the Galactic nebula M16 composed of images from the Two Micron All Sky Survey (2MASS) (http://www.ipac.caltech.edu/2mass/); the workflow is considered I/O-bound because it spends more than 95 per cent of its time waiting for I/O operations. AmEC2 generally charges higher rates as the processor speed, number of cores and size of memory increase, as shown by the last column in table 3. As might be expected, the best performance for Epigenome was obtained with those machines having the most cores. As with Broadband, the parallel file system in Abe provides little processing advantage: processing times on abe.lustre were only 2 per cent faster than on abe.local. The case of Montage, an I/O-bound application, shows why: the most expensive resources are not necessarily the most cost effective, and data transfer costs can exceed the processing costs. — The resources offered by AmEC2 are generally less powerful than those available in HPC clusters and generally do not offer the same performance. Reasonably good performance was achieved on all instances except m1.small, which is much less powerful than the other AmEC2 resource types.
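The dependency-driven release of tasks that DAGMan performs can be pictured in a few lines of Python. This is a hedged sketch, not DAGMan or Pegasus code: the task names are invented, and a real engine would dispatch each ready task to a Condor worker rather than append it to a list.

```python
from collections import deque

def topological_order(deps):
    # deps maps each task to the set of parent tasks it must wait for,
    # mirroring the parent/child edges of an executable workflow.
    indegree = {task: len(parents) for task, parents in deps.items()}
    children = {task: [] for task in deps}
    for task, parents in deps.items():
        for parent in parents:
            children[parent].append(task)
    ready = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)          # a real engine would submit the job here
        for child in children[task]:
            indegree[child] -= 1    # one parent of this child has finished
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(deps):
        raise ValueError("cycle detected in workflow DAG")
    return order
```

A task becomes "ready" only when the count of its unfinished parents reaches zero, which is exactly why the resources defined in the executable workflow must be available before DAGMan can make progress.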
The legend identifies the processor instances listed in tables 3 and 4. Wrangler users describe their deployments using a simple extensible markup language (XML) format, which specifies the type and quantity of virtual machines (VMs) to provision, the dependencies between the VMs and the configuration settings to apply to each VM. — Do academic cloud platforms offer any performance advantages over commercial clouds? Broadband generates a large number of small files, which is most likely why PVFS performs poorly. Table 10 shows the characteristics of the various cloud deployments and the results of the computations. Tables 2 and 6 show the transfer sizes and costs for the three workflows. For abe.lustre, all intermediate and output data were written to the Lustre file system. Pegasus offers two major benefits in performing the studies itemized in the introduction. Evaluations of how new technologies such as cloud computing would support such a new distributed computing model are urgently needed. We have investigated the cost and performance of the three workflows running with the storage systems listed in table 7. Affiliations: Infrared Processing and Analysis Center, Caltech, Pasadena, CA 91125, USA; University of Southern California Information Sciences Institute, Marina del Rey, CA 90292, USA.
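The flavour of such a deployment description can be sketched briefly. The element and attribute names below are illustrative only, not Wrangler's actual schema; the snippet simply shows how a VM count, an instance type and per-node services might be captured in XML.

```python
import xml.etree.ElementTree as ET

def deployment_xml(node_count, instance_type, services):
    # Hypothetical deployment description: one node group with a count,
    # an instance type, and a list of services to configure on each VM.
    root = ET.Element("deployment")
    node = ET.SubElement(root, "node",
                         count=str(node_count), instance=instance_type)
    for name in services:
        ET.SubElement(node, "service", name=name)
    return ET.tostring(root, encoding="unicode")
```

For example, deployment_xml(16, "c1.xlarge", ["nfs", "condor-worker"]) yields a small <deployment> document describing a 16-node Condor pool sharing an NFS volume, similar in spirit to the configurations used in these experiments.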
Table 8. Performance and costs associated with the execution of periodograms of the Kepler datasets on Amazon and the NSF TeraGrid. The experiments summarized here indicate how cloud computing may play an important role in data-intensive astronomy, and presumably in other fields as well. Cloud platforms are built with the same types of off-the-shelf commodity hardware used in data centres. See Deelman et al. Cloud computing has gained the attention of scientists as a competitive resource for running HPC applications at a potentially lower cost. Our goal was to understand which types of workflow applications run most efficiently and economically on a commercial cloud. What demands do they place on applications? Where are the trade-offs between efficiency and cost? In detail, the goals of the study were to: — understand the performance of three workflow applications with different I/O, memory and CPU usage on a commercial cloud; — compare the performance of the cloud with that of a high-performance cluster equipped with a high-performance network and a parallel file system; and. Table 7. File systems investigated on Amazon EC2. Resource cost. Variation with the number of cores of the runtime and data-sharing costs for the Montage workflow for the data storage options identified in table 7. The cloud resources were configured as a Condor pool using the Wrangler provisioning and configuration tool [14].
The variable charges are US$0.01 per 1000 PUT operations and US$0.01 per 10 000 GET operations for S3, and US$0.10 per million I/O operations for EBS. Similarly, S3 is at a disadvantage, especially for workflows with many files, because Amazon charges a fee per S3 transaction. S3 performs relatively well because the workflow reuses many files, and this improves the effectiveness of the S3 client cache. In general, the storage systems that produced the best workflow runtimes also resulted in the lowest cost. Runtime in this context refers to the total wall clock time in seconds from the moment the first workflow task is submitted until the last task completes. Table 8 shows the results of processing 210 000 Kepler time-series datasets on AmEC2 using 128 cores (16 nodes) of the c1.xlarge instance type (Runs 1 and 2), and of processing the same datasets on the NSF TeraGrid using 128 cores (8 nodes) of the Ranger cluster (Run 3). By contrast, Epigenome shows much less variation than Montage because it is strongly CPU bound. We provisioned 48 cores each on Amazon EC2, FutureGrid and Magellan, and used the resources to compute periodograms for 33 000 Kepler datasets. Montage (I/O bound). — Are commercial cloud platforms user friendly?
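The arithmetic behind these charges is straightforward; the sketch below simply encodes the rates quoted in the text (long since superseded by Amazon's current pricing) so that the variable-cost figures can be reproduced, e.g. 3.18 million EBS I/O operations come to roughly US$0.30.

```python
# Rates as quoted in the text (US$); Amazon's current prices differ.
S3_PUT_PER_1000 = 0.01       # per 1000 PUT operations
S3_GET_PER_10000 = 0.01      # per 10 000 GET operations
EBS_PER_MILLION_IO = 0.10    # per million I/O operations
GB_MONTH = {"S3": 0.15, "EBS": 0.10}   # fixed storage charges per GB month

def s3_request_cost(n_put, n_get):
    # Variable S3 charges scale with the number of requests made.
    return n_put / 1000 * S3_PUT_PER_1000 + n_get / 10_000 * S3_GET_PER_10000

def ebs_io_cost(n_ops):
    # Variable EBS charges scale with the number of I/O operations.
    return n_ops / 1_000_000 * EBS_PER_MILLION_IO

def monthly_storage_cost(gigabytes, system):
    # Fixed charges scale with the volume of data stored per month.
    return gigabytes * GB_MONTH[system]
```

Per-request fees are why workflows that create many small files fare worse on S3 than workflows that reuse a few large files, independent of the raw bytes moved.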
Cloud computing offers a more flexible alternative to traditional HPC installations, particularly for scientists and researchers who have varied workloads or who require computing resources that scale with their workloads. We report here the results of investigations of the applicability of commercial cloud computing to scientific computing, with an emphasis on astronomy, including investigations of what types of applications can be run cheaply and efficiently on the cloud, and an example of an application well suited to the cloud: processing a large dataset to create a new science product. Workflow applications are already common in astronomy, and will assume greater importance as research in the field becomes yet more data driven. The costs of transferring data into and out of the Amazon EC2 cloud. Project participants integrate existing open-source software packages to create an easy-to-use software environment that supports the instantiation, execution and recording of grid and cloud computing experiments. The challenge in the cloud is how to reproduce the performance of these file systems or replace them with storage systems of equivalent performance. The scientific goal for our experiments was to calculate an atlas of periodograms for the time-series datasets released by the Kepler mission (http://kepler.nasa.gov/), which uses high-precision photometry to search for exoplanets transiting stars in a 105 square degree area of Cygnus. PVFS likely performs poorly because the small-file optimization that is part of the current release had not been incorporated at the time of the experiment. In table 2, input is the amount of input data to the workflow, output is the amount of output data, and logs refers to the amount of logging data that is recorded for workflow tasks and transferred back to the submit host.
The rates for fixed charges are US$0.15 per GB month for S3, and US$0.10 per GB month for EBS. A thorough cost–benefit analysis, of the kind described here, should always be carried out in deciding whether to use a commercial cloud for running workflow applications, and end users should repeat this analysis every time price changes are announced. Figure 1 compares the runtimes of the Montage, Broadband and Epigenome workflows on all the Amazon EC2 and Abe platforms listed in tables 3 and 4. The cost of running this workflow on Amazon is approximately US$31, with US$2 in data transfer costs. Broadband performs the worst on m1.small and c1.medium, the machines with the smallest memories (1.7 GB). This is because m1.small has only a 50 per cent share of one core, and only one of the cores on c1.medium can be used because of memory limitations. The figure clearly shows the trade-off between performance and cost for Montage. The architecture of the cloud is well suited to this type of application, whereas tightly coupled applications, in which tasks communicate directly via an internal high-performance network, are most likely better suited to processing on computational grids [6]. The cost of the protocol used by Condor to communicate between the submit host and the workers is not included, but it is estimated to be much less than US$0.01 per workflow. What are the overheads and hidden costs in using these technologies?
— What are the costs of running workflows on commercial clouds? Is special knowledge needed on the part of end users and systems engineers to exploit them to the fullest? — AmEC2 offers no cost benefits over locally hosted storage, and is generally more expensive, but it eliminates local maintenance and energy costs, and offers high-quality storage products. The use of Amazon EC2 resources was supported by an AWS in Education research grant. Table 1 summarizes the resource usage of each application, rated as high, medium or low. Broadband (memory bound). The Epigenome workflow is CPU bound because it spends 99 per cent of its runtime in the CPU and only 1 per cent on I/O and other activities. Montage is maintained by the NASA/IPAC Infrared Science Archive. Figure 1. Table 2 includes the input and output data sizes. Wrangler is a service that automates the deployment of complex, distributed applications on infrastructure clouds. Periodograms reveal periodic signals in time-series data, such as those arising from transiting planets and from stellar variability.
Workflow runtime. The overhead on AmEC2 is generally small, but it is most evident for CPU-bound applications. The performance advantage of a high-performance parallel file system essentially disappears for CPU- and memory-bound applications. For CPU-bound applications, c1.medium offers performance only 20 per cent lower than that of the most powerful machines. Good performance was also achieved on the m1.xlarge resource. On the machines with the smallest memories, cores must sit idle to prevent the system from running out of memory or swapping. We refer to these instances by their AmEC2 names throughout the study, and a single workflow was chosen for each application and used throughout the study. S3 produced good performance for one application, possibly owing to the effectiveness of the S3 client cache. For a single workflow, there were 4616 GET operations and 2560 PUT operations, for a total variable cost of approximately US$0.03. The monthly storage cost for the three applications is shown in table 5. Under AmEC2's current cost structure, long-term storage of data is prohibitively expensive. End users should understand the resource usage of their applications and undertake a cost–benefit study of cloud resources to establish a usage strategy. A comparative study of the performance of other cloud providers would be a major undertaking and is outside the scope of this paper. Academic clouds may provide an alternative to commercial clouds for science applications. Glide-ins are a scheduling technique whereby Condor workers are submitted as user jobs via grid protocols to a remote cluster. Pegasus includes a Mapper, which generates an executable workflow based on an abstract workflow provided by the user, and DAGMan, which executes the tasks defined by the executable workflow. EBS is a storage area network-like, replicated, block-based storage service. Cloud computing provides a new way of provisioning and purchasing computing and storage resources on demand through virtualization. Epigenome (CPU bound). The Epigenome workflow (http://epigenome.usc.edu/) maps short DNA segments collected using high-throughput gene sequencing machines to a reference genome. FutureGrid provides access to virtual machines as well as native operating systems for experiments; table 9 lists the available resources of five clusters at four FutureGrid sites across the US. Table 9. FutureGrid available Nimbus and Eucalyptus cores in November 2010. (UofC, University of Chicago; UCSD, University of California San Diego.) This work was supported in part by the National Science Foundation under grant no. 0910812 (FutureGrid). © 2012 The Author(s). Published by the Royal Society. This paper is a contribution to the Theme Issue 'e-Science–towards the cloud: infrastructures, applications and research'.