Big Data and Clouds: Computing, Analytics and Curriculum


Big Data and Clouds: Computing, Analytics and Curriculum
Persistent Systems, December 20 2012
Geoffrey Fox
gcf@indiana.edu
http://www.infomall.org http://www.futuregrid.org
School of Informatics and Computing, Digital Science Center, Indiana University Bloomington

Abstract
Big data analytics is growing in importance in many fields. We need data science curricula, quality scalable robust data mining libraries and system architectures that support data intensive applications. The ability to use Cloud computing allows us to tap cheap commercial resources and several important data and programming advances. Nevertheless we also need to exploit traditional HPC environments. We discuss an approach to the technical challenges which involves Iterative MapReduce as an interoperable Cloud-HPC runtime. We stress that the communication structure of data analytics is very different from classic parallel algorithms, as one uses large collective operations (reductions or broadcasts) rather than the many small messages familiar from parallel particle dynamics and partial differential equation solvers. We discuss new robust algorithms for clustering and visualization by dimension reduction.
Both cloud computing and data science are expected to have many millions of new jobs for our students. We discuss new data science curricula.
We mention FutureGrid and a software defined Computing Testbed as a Service.

Broad Overview: Data Deluge to Clouds


Some Trends
The Data Deluge is a clear trend from Commercial (Amazon, e-commerce), Community (Facebook, Search) and Scientific applications
Lightweight clients from smartphones and tablets to sensors
Multicore is reawakening parallel computing
Exascale initiatives will continue the drive to the high end with a simulation orientation
Clouds with cheaper, greener, easier to use IT for (some) applications
New jobs associated with new curricula
  Clouds as a distributed system (classic CS courses)
  Data Analytics (important theme in academia and industry)
  Network/Web Science

Some Data Sizes
~40 x 10^9 Web pages at ~300 kilobytes each = 10 petabytes
YouTube: 48 hours of video uploaded per minute; in 2 months in 2010, uploaded more than the total of NBC, ABC and CBS
  ~2.5 petabytes per year uploaded
LHC: 15 petabytes per year
Radiology: 69 petabytes per year
Square Kilometer Array Telescope will be 100 terabits/second
Earth Observation becoming ~4 petabytes per year
Earthquake Science: a few terabytes total today
PolarGrid: 100's of terabytes/year
Exascale simulation data dumps: terabytes/second

Why need cost effective Computing!
Full Personal Genomics: 3 petabytes per day
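The size estimates above are back-of-envelope arithmetic, and two of them can be checked directly. The sketch below assumes decimal (SI) units, which the slide does not specify; the function names are illustrative only. Under that assumption the Web corpus comes to about 12 petabytes, the same order as the ~10 petabytes quoted, and a sustained 100 terabits/second is roughly one exabyte per day.

```python
# Back-of-envelope check of two figures from the "Some Data Sizes" slide.
# Assumption (not stated in the slides): decimal SI units,
# i.e. 1 kB = 10**3 bytes and 1 PB = 10**15 bytes.
KB = 10**3
PB = 10**15
SECONDS_PER_DAY = 86_400


def web_corpus_petabytes(pages: float = 40e9,
                         bytes_per_page: float = 300 * KB) -> float:
    """Total size in petabytes of `pages` Web pages of `bytes_per_page` each."""
    return pages * bytes_per_page / PB


def petabytes_per_day(terabits_per_second: float = 100.0) -> float:
    """Convert a sustained rate in terabits/second into petabytes/day."""
    bytes_per_second = terabits_per_second * 1e12 / 8  # 8 bits per byte
    return bytes_per_second * SECONDS_PER_DAY / PB


if __name__ == "__main__":
    print(f"Web pages: {web_corpus_petabytes():.1f} PB")
    print(f"SKA stream: {petabytes_per_day():.0f} PB/day")
```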

Clouds Offer, From Different Points of View
Features from NIST: On-demand service (elastic); Broad network access; Resource pooling; Flexible resource allocation; Measured service
Economies of scale in performance and electrical power (Green IT)
Powerful new software models
  Platform as a Service is not an alternative to Infrastructure as a Service; it is instead an incredible value added
  Amazon is as much PaaS as Azure

Some Sizes in 2010
http://www.mediafire.com/file/zzqna34282frr2f/koomeydatacenterelectuse2011finalversion.pdf
30 million servers worldwide
Google had 900,000 servers (3% of total worldwide)
Google total power ~200 Megawatts
  < 1% of total power used in data centers (Google more efficient than average: Clouds are Green!)
  ~0.01% of total power used on anything worldwide
Maybe total clouds are 20% of total world server count (a growing fraction)

Some Sizes: Cloud v HPC
Top Supercomputer: Sequoia Blue Gene Q at LLNL
  16.32 Petaflop/s on the Linpack benchmark using 98,304 CPU compute chips with 1.6 million processor cores and 1.6 Petabyte of memory in 96 racks covering an area of about 3,000 square feet
  7.9 Megawatts power
Largest (cloud) computing data centers
  100,000 servers at ~200 watts per CPU chip
  Up to 30 Megawatts power
So the largest supercomputer is around 1-2% of the performance of total cloud computing systems, with Google ~20% of the total

Clouds in Science

2 Aspects of Cloud Computing: Infrastructure and Runtimes
Cloud infrastructure: outsourcing of servers, computing, data, file space, utility computing, etc.
Cloud runtimes or Platform: tools to do data-parallel (and other) computations. Valid on Clouds and traditional clusters
  Apache Hadoop, Google MapReduce, Microsoft Dryad, Bigtable, Chubby and others
  MapReduce designed for information retrieval but is excellent for a wide range of science data analysis applications
  Can also do much traditional parallel computing for data-mining if extended to support iterative operations
  Data Parallel File system as in HDFS and Bigtable

Infrastructure, Platforms, Software as a Service
Software Services are building blocks of applications
The middleware or computing environment
Nimbus, Eucalyptus, OpenStack, OpenNebula, CloudStack
OpenFlow

Science Computing Environments
Large Scale Supercomputers: multicore nodes linked by high performance low latency network
  Increasingly with GPU enhancement
  Suitable for highly parallel simulations
High Throughput Systems such as European Grid Initiative EGI or Open Science Grid OSG, typically aimed at pleasingly parallel jobs
  Can use “cycle stealing”
  Classic example is LHC data analysis
Grids federate resources as in EGI/OSG or enable convenient access to multiple backend systems including supercomputers
  Portals make access convenient and Workflow integrates multiple processes into a single job
Specialized visualization, shared memory parallelization etc. machines

Clouds, HPC and Grids
Synchronization/communication Performance: Grids > Clouds > Classic HPC Systems
Clouds naturally execute effectively Grid workloads but are less clear for closely coupled HPC applications
Classic HPC machines as MPI engines offer highest possible performance on closely coupled problems
  Likely to remain in spite of Amazon cluster offering
Service Oriented Architectures, portals and workflow appear to work similarly in both grids and clouds
May be for immediate future, science supported by a mixture of
  Clouds: some practical differences between private and public clouds (size and software)
  High Throughput Systems (moving to clouds as convenient)
  Grids for distributed data and access
  Supercomputers (“MPI Engines”) going to exascale

Cloud Applications

What Applications work in Clouds
Pleasingly (moving to modestly) parallel applications of all sorts with roughly independent data or spawning independent simulations
  Long tail of science and integration of distributed sensors
Commercial and Science Data analytics that can use MapReduce (some of such apps) or its iterative variants (most other data analytics apps)
Which science applications are using clouds?
  Venus-C (Azure in Europe): 27 applications not using Scheduler, Workflow or MapReduce (except roll your own)
  50% of applications on FutureGrid are from Life Science
  Locally Lilly corporation is commercial cloud user (for drug discovery) but not IU Biology
But overall very little science use of clouds

27 Venus-C Azure Applications
Chemistry (3): Lead Optimization in Drug Discovery; Molecular Docking
Civil Eng. and Arch. (4): Structural Analysis; Building Information Management; Energy Efficiency in Buildings; Soil structure simulation
Earth Sciences (1): Seismic propagation
ICT (2): Logistics and vehicle routing; Social networks analysis
Mathematics (1): Computational Algebra
Medicine (3): Intensive Care Units decision support; IM Radiotherapy planning; Brain Imaging
Mol, Cell. & Gen. Bio. (7): Genomic sequence analysis; RNA prediction and analysis; System Biology; Loci Mapping; Micro-arrays quality
Physics (1): Simulation of Galaxies configuration
Biodiversity & Biology (2): Biodiversity maps in marine species; Gait simulation
Civil Protection (1): Fire Risk estimation and fire propagation
Mech, Naval & Aero. Eng. (2): Vessels monitoring; Bevel gear manufacturing simulation
VENUS-C Final Review: The User Perspective, 11-12/7, EBC Brussels

Parallelism over Users and Usages
“Long tail of science” can be an important usage mode of clouds.
In some areas like particle physics and astronomy, i.e. “big science”, there are just a few major instruments generating now petascale data driving discovery in a coordinated fashion.
In other areas such as genomics and environmental science, there are many “individual” researchers with distributed collection and analysis of data whose total data and processing needs can match the size of big science.
Clouds can provide scaling convenient resources for this important aspect of science.
Can be map only use of MapReduce if different usages naturally linked, e.g. exploring docking of multiple chemicals or alignment of multiple DNA sequences
  Collecting together or summarizing multiple “maps” is a simple Reduction
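The map-only pattern with a simple reduction can be sketched in a few lines. This is a minimal illustration, not code from the talk: `gc_fraction` is a hypothetical stand-in for one expensive independent task (a docking run or a sequence alignment), and a local thread pool stands in for cloud workers.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def gc_fraction(sequence: str) -> float:
    """Hypothetical stand-in for one independent task: the fraction of
    G and C bases in a DNA sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def map_only(task, inputs, workers: int = 4):
    """Map-only job: each input is processed independently, with no
    communication between tasks. Results are returned in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, inputs))


def summarize(results):
    """Simple reduction: fold per-task results into (count, total, maximum)."""
    return reduce(
        lambda acc, r: (acc[0] + 1, acc[1] + r, max(acc[2], r)),
        results,
        (0, 0.0, float("-inf")),
    )


if __name__ == "__main__":
    sequences = ["GATTACA", "GGCC", "ATAT", "GCGCGA"]
    scores = map_only(gc_fraction, sequences)
    count, total, best = summarize(scores)
    print(f"{count} tasks, mean {total / count:.3f}, best {best:.3f}")
```

Because the tasks never talk to each other, a slow or failed task can be rerun on its own, which is what makes this usage mode a good fit for clouds.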

Internet of Things and the Cloud
It is projected that there will be 24 billion devices on the Internet by 2020. Most will be small sensors that send streams of information into the cloud where it will be processed and integrated with other streams and turned into knowledge that will help our lives in a multitude of small and big ways.
The cloud will become increasingly important as a controller of and resource provider for the Internet of Things.
As well as today’s use for smart phone and gaming console support, “Intelligent River”, “smart homes and grid” and “ubiquitous cities” build on this vision and we could expect a growth in cloud supported/controlled robotics.
Some of these “things” will be supporting science
Natural parallelism over “things”
“Things” are distributed and so form a Grid

Classic Parallel Computing
HPC: Typically SPMD (Single Program Multiple Data) “maps” typically processing particles or mesh points interspersed with multitude of low latency messages supported by specialized networks such as Infiniband and technologies like MPI
  Often run large capability jobs with 100K (going to 1.5M) cores on same job
  National DoE/NSF/NASA facilities run 100% utilization
  Fault fragile and cannot tolerate “outlier maps” taking longer than others
Clouds: MapReduce has asynchronous maps typically processing data points with results saved to disk. Final reduce phase integrates results from different maps
  Fault tolerant and does not require map synchronization
  Map only useful special case
HPC + Clouds: Iterative MapReduce caches results between “MapReduce” steps and supports SPMD parallel computing with large messages as seen in parallel kernels (linear algebra) in clustering and other data mining

4 Forms of MapReduce
MPI is Map followed by Point to Point Communication, as in style d)
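The classic MapReduce form described above (independent maps over data points, then a final reduce that integrates their results) can be sketched as a single-process toy. The names `map_reduce`, `bin_mapper` and `count_reducer` are illustrative assumptions, not Hadoop's API; a real runtime would spread the map tasks over many machines and write intermediate results to disk.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy single-process MapReduce.

    Map phase: each record is processed independently and emits
    (key, value) pairs. Shuffle: values are grouped by key.
    Reduce phase: one reducer call per key integrates the grouped values.
    """
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


def bin_mapper(x):
    """Emit (lower edge of the width-10 bin containing x, 1)."""
    yield int(x // 10) * 10, 1


def count_reducer(key, values):
    """Sum the per-record counts that were emitted for one bin."""
    return sum(values)


if __name__ == "__main__":
    data_points = [1, 5, 12, 27, 29]
    print(map_reduce(data_points, bin_mapper, count_reducer))
```

No map call depends on any other, which is why this form needs no map synchronization and can tolerate a failed or slow task by rerunning it.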

Data Intensive Applications
Applications tend to be new and so can consider emerging technologies such as clouds
Do not have lots of small messages but rather large reduction (aka Collective) operations
  New optimizations e.g. for huge messages
EM (expectation maximization) tends to be good for clouds and Iterative MapReduce
  Quite complicated computations (so compute largish compared to communicate)
  Communication is Reduction operations (global sums or linear algebra in our case)
We looked at Clustering and Multidimensional Scaling using deterministic annealing, which are both EM
  See also Latent Dirichlet Allocation and related Information Retrieval algorithms with similar EM structure

Map Collective Model (Judy Qiu)
Combine MPI and MapReduce ideas
Implement collectives optimally on Infiniband, Azure, Amazon

Twister for Data Intensive Iterative Applications
(Iterative) MapReduce structure with Map-Collective is framework
Twister runs on Linux or Azure
Twister4Azure is built on top of Azure tables, queues, storage
Larger Loop-Invariant Data
Generalize to arbitrary Collective
Broadcast
Smaller Loop-Variant Data
Qiu, Gunarathne
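The iterative structure above (large loop-invariant data kept in place, small loop-variant data broadcast each iteration, communication as a global sum) can be illustrated with a one-dimensional k-means loop. This is a single-process sketch under stated assumptions, not Twister's API: all function names are hypothetical, the data partitions play the role of the cached loop-invariant data, and the list of centers is the loop-variant data.

```python
def kmeans_map(partition, centers):
    """Map task: for one cached data partition, accumulate a partial
    [sum, count] per center using the centers broadcast this iteration."""
    partial = [[0.0, 0] for _ in centers]
    for x in partition:
        nearest = min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
        partial[nearest][0] += x
        partial[nearest][1] += 1
    return partial


def kmeans_reduce(partials, old_centers):
    """Collective step: global sum of the partial results, then new centers.
    A center that attracted no points keeps its old position."""
    new_centers = []
    for k, old in enumerate(old_centers):
        total = sum(p[k][0] for p in partials)
        count = sum(p[k][1] for p in partials)
        new_centers.append(total / count if count else old)
    return new_centers


def iterative_kmeans(partitions, centers, max_iter=20, tol=1e-9):
    """Iterate map + reduce until the centers stop moving.
    `partitions` is loop-invariant; only `centers` changes per iteration."""
    for _ in range(max_iter):
        partials = [kmeans_map(p, centers) for p in partitions]
        new_centers = kmeans_reduce(partials, centers)
        shift = max(abs(a - b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift < tol:
            break
    return centers


if __name__ == "__main__":
    partitions = [[1.0, 2.0, 9.0], [3.0, 10.0, 11.0]]
    print(iterative_kmeans(partitions, [0.0, 5.0]))
```

Only the centers and the per-center partial sums cross the network each iteration, so communication stays small relative to the per-partition computation, which is the property the slides attribute to EM-style algorithms.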


Pleasingly Parallel Performance Comparisons
Smith Waterman Sequence Alignment

Performance: Kmeans Clustering
Figure panels: Number of Executing Map Task Histogram; Strong Scaling with 128M Data Points; Weak Scaling; Task Execution Time Histogram
First iteration performs the initial data fetch
Overhead between iterations
Hadoop on bare metal scales worst
Legend: Hadoop; Twister; Twister4Azure (adjusted for C#/Java); Twister4Azure
Qiu, Gunarathne

Recent results on 512 cores Azure
20 Dimensions
500 Centers
Data sizes 128 million
Qiu, Gunarathne

Data Intensive Kmeans Clustering
Image Classification: 1.5 TB; 500 features per image; 10k clusters
1000 Map tasks; 1 GB data transfer per Map task
Work of Qiu and Zhang

Twister Communication Steps
Broadcasting
  Data could be large
  Chain & MST
Map Collectives
  Local merge
Reduce Collectives
  Collect but no merge
Combine
  Direct download or Gather
Work of Qiu and Zhang

Polymorphic Scatter-Allgather in Twister
i.e. have collective primitives and find optimal implementation on each system
Work of Qiu and Zhang
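One common way to broadcast a large message is scatter followed by allgather: the root sends each node a different slice, and the nodes then circulate the slices among themselves. The sketch below simulates that pattern in one process with a ring allgather. It is an assumption-laden illustration of the general technique, not Twister's implementation, and the function names are hypothetical.

```python
from typing import List


def scatter(data: bytes, n_nodes: int) -> List[bytes]:
    """Root splits `data` into n_nodes contiguous, nearly equal chunks;
    chunk i is the piece sent to node i."""
    base, extra = divmod(len(data), n_nodes)
    chunks, start = [], 0
    for i in range(n_nodes):
        size = base + (1 if i < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def ring_allgather(chunks: List[bytes]) -> List[bytes]:
    """Simulated ring allgather. In each of n-1 steps, node i forwards to
    node (i+1) % n the chunk it obtained in the previous step (its own
    chunk in step 0). Returns the reassembled data held by every node."""
    n = len(chunks)
    held = [{i: chunks[i]} for i in range(n)]
    for step in range(n - 1):
        # Collect all sends first so the step behaves as simultaneous.
        sends = [((i + 1) % n, (i - step) % n) for i in range(n)]
        payloads = [(dest, idx, held[(dest - 1) % n][idx]) for dest, idx in sends]
        for dest, idx, chunk in payloads:
            held[dest][idx] = chunk
    return [b"".join(node[i] for i in range(n)) for node in held]


def broadcast(data: bytes, n_nodes: int) -> List[bytes]:
    """Broadcast as scatter + allgather: every node ends with all of `data`."""
    return ring_allgather(scatter(data, n_nodes))


if __name__ == "__main__":
    copies = broadcast(b"centroids-for-this-iteration", 4)
    print(all(copy == b"centroids-for-this-iteration" for copy in copies))
```

In this pattern the root transmits the message only once in total, and the remaining traffic is spread across the links between nodes; whether that beats a chain or tree broadcast depends on the network, which is the motivation for choosing the implementation per system.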

Conclusions II
CTaaS (Computing Testbed as a Service) and software defined computing
More employment opportunities in clouds than HPC and Grids and in data than simulation; so cloud and data related activities popular with students
International activity to discuss data science education
  Agree on curricula; is such a degree attractive?
Role of MOOC’s as either
  Disseminating new curricula
  Managing course fragments that can be assembled into custom courses for particular interdisciplinary students
