Data Analytics: Curricula and Clouds
Clemson University, September 21 2012


Geoffrey Fox, gcf@indiana.edu
Informatics, Computing and Physics
Indiana University Bloomington

Abstract
We posit that big data implies robust data-mining algorithms that must run in parallel to achieve the needed performance. Further, we need appropriate data science training to support the different X-Informatics fields that are emerging and expanding. The ability to use Cloud computing allows us to tap cheap commercial resources and several important data and programming advances. Nevertheless, we also need to exploit traditional HPC environments. Both cloud computing and data science are expected to offer many millions of new jobs for our students. We discuss an approach to the technical challenges that involves Iterative MapReduce as an interoperable Cloud-HPC runtime. We stress that the communication structure of data analytics is very different from that of classic parallel algorithms, as one uses large collective operations (reductions or broadcasts) rather than the many small messages familiar from parallel particle dynamics and partial differential equation solvers. Data science needs different runtime optimizations from those familiar from simulations. We suggest that a coordinated effort is needed to enable big data analytics across many fields. We need data science curricula, quality scalable robust data mining libraries and system architectures that support data intensive applications. We mention FutureGrid and Computing Testbed as a Service.

Topics Covered
Broad Overview: Data Deluge to Clouds
Clouds Grids and HPC
Cloud applications
Analytics and Parallel Computing on Clouds and HPC
Data (Analytics) Architectures
What is Data Analytics
Data Analytics (& Informatics) Fields and their Education and Training
FutureGrid
Computing Testbed as a Service
Conclusions


Broad Overview: Data Deluge to Clouds

Some Trends
The Data Deluge is a clear trend from Commercial (Amazon, e-commerce), Community (Facebook, Search) and Scientific applications
Lightweight clients from smartphones and tablets to sensors
Multicore reawakening parallel computing
Exascale initiatives will continue the drive to the high end with a simulation orientation
Clouds with cheaper, greener, easier to use IT for (some) applications
New jobs associated with new curricula
  Clouds as a distributed system (classic CS courses)
  Data Analytics (important theme in academia and industry)
  Network/Web Science

Some Data sizes
~40 × 10^9 Web pages at ~300 kilobytes each ≈ 10 Petabytes
YouTube: 48 hours of video uploaded per minute; in 2 months in 2010, uploaded more than total NBC ABC CBS
  ~2.5 petabytes per year uploaded
LHC: 15 petabytes per year
Radiology: 69 petabytes per year
Square Kilometer Array Telescope will be 100 terabits/second
Earth Observation becoming ~4 petabytes per year
Earthquake Science: few terabytes total today
PolarGrid: 100’s terabytes/year
Exascale simulation data dumps: terabytes/second
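The web-size figure above is a back-of-envelope product, and the rates on the slide use mixed units. The short sketch below redoes the arithmetic under the assumption of decimal units (1 kilobyte = 10^3 bytes, 1 petabyte = 10^15 bytes) and a 365-day year; the variable names are illustrative and not from the talk.

```python
# Back-of-envelope check of the slide's data sizes (decimal units assumed).
KILOBYTE = 10**3
PETABYTE = 10**15
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year

# Web estimate: ~40 x 10^9 pages at ~300 kilobytes each.
web_pages = 40 * 10**9
bytes_per_page = 300 * KILOBYTE
web_petabytes = web_pages * bytes_per_page / PETABYTE
print(f"Web estimate: {web_petabytes:.0f} PB")  # 12 PB, i.e. of order 10 PB

# Square Kilometer Array: 100 terabits/second, converted to petabytes/year
# so it can be compared with the per-year figures on the same slide.
ska_bits_per_second = 100 * 10**12
ska_pb_per_year = ska_bits_per_second / 8 * SECONDS_PER_YEAR / PETABYTE
print(f"SKA raw rate: {ska_pb_per_year:,.0f} PB/year")
```

The exact product is 12 petabytes, which the slide rounds to an order-of-magnitude 10. Converting the telescope rate to the same unit shows why it stands apart from the 4 to 69 petabytes per year quoted for the other sources.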

Why need cost effective Computing!
Full Personal Genomics: 3 petabytes per day

Clouds Offer From different points of view
Features from NIST: On-demand service (elastic); Broad network access; Resource pooling; Flexible resource allocation; Measured service
Economies of scale in performance and electrical power (Green IT)
Powerful new software models
  Platform as a Service is not an alternative to Infrastructure as a Service; it is instead an incredible value added
  Amazon is as much PaaS as Azure

Jobs v. Countries

McKinsey Institute on Big Data Jobs
There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

Some Sizes in 2010
http://www.mediafire.com/file/zzqna34282frr2f/koomeydatacenterelectuse2011finalversion.pdf
30 million servers worldwide
Google had 900,000 servers (3% total world wide)
Google total power ~200 Megawatts
  < 1% of total power used in data centers (Google more efficient than average; Clouds are Green!)
  ~0.01% of total power used on anything world wide
Maybe total clouds are 20% total world server count (a growing fraction)

Some Sizes Cloud v HPC
Top Supercomputer Sequoia Blue Gene Q at LLNL
  16.32 Petaflop/s on the Linpack benchmark using 98,304 CPU compute chips with 1.6 million processor cores and 1.6 Petabyte of memory in 96 racks covering an area of about 3,000 square feet
  7.9 Megawatts power
Largest (cloud) computing data centers
  100,000 servers at ~200 watts per CPU chip
  Up to 30 Megawatts power
So largest supercomputer is around 1-2% performance of total cloud computing systems with Google ~20% total

Clouds Grids and HPC

2 Aspects of Cloud Computing: Infrastructure and Runtimes
Cloud infrastructure: outsourcing of servers, computing, data, file space, utility computing, etc.
Cloud runtimes or Platform: tools to do data-parallel (and other) computations. Valid on Clouds and traditional clusters
  Apache Hadoop, Google MapReduce, Microsoft Dryad, Bigtable, Chubby and others
  MapReduce designed for information retrieval but is excellent for a wide range of science data analysis applications
  Can also do much traditional parallel computing for data-mining if extended to support iterative operations
  Data Parallel File system as in HDFS and Bigtable

Infrastructure, Platforms, Software as a Service
Software Services are building blocks of applications
The middleware or computing environment
Nimbus, Eucalyptus, OpenStack, OpenNebula, CloudStack

Science Computing Environments
Large Scale Supercomputers: Multicore nodes linked by high performance low latency network
  Increasingly with GPU enhancement
  Suitable for highly parallel simulations
High Throughput Systems such as European Grid Initiative EGI or Open Science Grid OSG typically aimed at pleasingly parallel jobs
  Can use “cycle stealing”
  Classic example is LHC data analysis
Grids federate resources as in EGI/OSG or enable convenient access to multiple backend systems including supercomputers
Portals make access convenient and Workflow integrates multiple processes into a single job
Specialized visualization, shared memory parallelization etc. machines

Clouds HPC and Grids
Synchronization/communication Performance: Grids > Clouds > Classic HPC Systems
Clouds naturally execute effectively Grid workloads but are less clear for closely coupled HPC applications
Classic HPC machines as MPI engines offer highest possible performance on closely coupled problems
  Likely to remain in spite of Amazon cluster offering
Service Oriented Architectures, portals and workflow appear to work similarly in both grids and clouds
May be for immediate future, science supported by a mixture of
  Clouds: some practical differences between private and public clouds (size and software)
  High Throughput Systems (moving to clouds as convenient)
  Grids for distributed data and access
  Supercomputers (“MPI Engines”) going to exascale

Cloud Applications

What Applications work in Clouds
Pleasingly (moving to modestly) parallel applications of all sorts with roughly independent data or spawning independent simulations
  Long tail of science and integration of distributed sensors
Commercial and Science Data analytics that can use MapReduce (some of such apps) or its iterative variants (most other data analytics apps)
Which science applications are using clouds?
  Venus-C (Azure in Europe): 27 applications not using Scheduler, Workflow or MapReduce (except roll your own)
  50% of applications on FutureGrid are from Life Science
  Locally Lilly corporation is commercial cloud user (for drug discovery)
  Nimbus applications in bioinformatics, high energy physics, nuclear physics, astronomy and ocean sciences

27 Venus-C Azure Applications
Chemistry (3): Lead Optimization in Drug Discovery Molecular Docking
Civil Eng. and Arch. (4): Structural Analysis Building information Management Energy Efficiency in Buildings Soil structure simulation
Earth Sciences (1): Seismic propagation
ICT (2): Logistics and vehicle routing Social networks analysis
Mathematics (1): Computational Algebra
Medicine (3): Intensive Care Units decision support. IM Radiotherapy planning. Brain Imaging
Mol, Cell. & Gen. Bio. (7): Genomic sequence analysis RNA prediction and analysis System Biology Loci Mapping Micro-arrays quality
Physics (1): Simulation of Galaxies configuration
Biodiversity & Biology (2): Biodiversity maps in marine species Gait simulation
Civil Protection (1): Fire Risk estimation and fire propagation
Mech, Naval & Aero. Eng. (2): Vessels monitoring Bevel gear manufacturing simulation
VENUS-C Final Review: The User Perspective 11-12/7 EBC Brussels

Parallelism over Users and Usages
“Long tail of science” can be an important usage mode of clouds.
In some areas like particle physics and astronomy, i.e. “big science”, there are just a few major instruments generating now petascale data driving discovery in a coordinated fashion.
In other areas such as genomics and environmental science, there are many “individual” researchers with distributed collection and analysis of data whose total data and processing needs can match the size of big science.
Clouds can provide scaling convenient resources for this important aspect of science.
Can be map only use of MapReduce if different usages naturally linked, e.g. exploring docking of multiple chemicals or alignment of multiple DNA sequences
  Collecting together or summarizing multiple “maps” is a simple Reduction
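The map-only pattern with a final simple reduction can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any of the applications listed: `score_candidate` is a hypothetical stand-in for one independent task (for example one docking run), the candidate names are made up, and a local thread pool stands in for cloud workers.

```python
from concurrent.futures import ThreadPoolExecutor


def score_candidate(candidate):
    """Hypothetical independent task; returns (name, score).

    The score is an arbitrary deterministic number, standing in for the
    result of a real docking or alignment computation.
    """
    return candidate, sum(ord(ch) for ch in candidate) % 100


def best_candidate(results):
    """Simple reduction: summarize all map outputs into one answer."""
    return max(results, key=lambda pair: pair[1])


candidates = ["ligand-a", "ligand-b", "ligand-c", "ligand-d"]  # made-up inputs

# Map-only phase: the tasks share no state and exchange no messages,
# so they can run on any worker in any order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(score_candidate, candidates))

print(best_candidate(results))
```

Because the maps never communicate, a slow or failed task can simply be rerun, which is what makes this usage mode a natural fit for clouds.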

Internet of Things and the Cloud
It is projected that there will be 24 billion devices on the Internet by 2020. Most will be small sensors that send streams of information into the cloud where it will be processed and integrated with other streams and turned into knowledge that will help our lives in a multitude of small and big ways.
The cloud will become increasingly important as a controller of and resource provider for the Internet of Things.
As well as today’s use for smart phone and gaming console support, “Intelligent River”, “smart homes” and “ubiquitous cities” build on this vision and we could expect a growth in cloud supported/controlled robotics.
Some of these “things” will be supporting science
Natural parallelism over “things”
“Things” are distributed and so form a Grid

Cloud based robotics from Google

Sensors (Things) as a Service
Sensors as a Service
Sensor Processing as a Service (could use MapReduce)
A larger sensor
Output Sensor
https://sites.google.com/site/opensourceiotcloud/ Open Source Sensor (IoT) Cloud


Pub/Sub Messaging
At the core of the Sensor Cloud is a pub/sub system
Publishers send data to topics with no information about potential subscribers
Subscribers subscribe to topics of interest and similarly have no knowledge of the publishers
URL: https://sites.google.com/site/opensourceiotcloud/

GPS Sensor: Multiple Brokers in Cloud

Analytics and Parallel Computing on Clouds and HPC
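The decoupling described on the Pub/Sub Messaging slide can be illustrated with a toy in-memory broker. This is a sketch only: the `Broker` class, the topic names and the coordinates are assumptions made for illustration and do not reflect the actual Sensor Cloud code, where brokers are networked services.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory broker (hypothetical; real brokers are networked)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        """Register interest in a topic; the subscriber never sees publishers."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver a message to all subscribers of a topic.

        The publisher learns only how many deliveries were made,
        not who received them.
        """
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = Broker()
received = []
broker.subscribe("gps", received.append)                       # e.g. a logger
broker.subscribe("gps", lambda msg: print("map view got", msg))  # e.g. a display

delivered = broker.publish("gps", {"lat": 34.68, "lon": -82.84})  # made-up fix
ignored = broker.publish("temperature", {"celsius": 21.5})        # no subscribers
```

Adding a second broker or a new subscriber requires no change to the sensor that publishes, which is the property that lets multiple brokers be spread across the cloud.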

Classic Parallel Computing
HPC: Typically SPMD (Single Program Multiple Data) “maps” typically processing particles or mesh points interspersed with multitude of low latency messages supported by specialized networks such as Infiniband and technologies like MPI
  Often run large capability jobs with 100K (going to 1.5M) cores on same job
  National DoE/NSF/NASA facilities run 100% utilization
  Fault fragile and cannot tolerate “outlier maps” taking longer than others
Clouds: MapReduce has asynchronous maps typically processing data points with results saved to disk. Final reduce phase integrates results from different maps
  Fault tolerant and does not require map synchronization
  Map only useful special case
HPC + Clouds: Iterative MapReduce caches results between “MapReduce” steps and supports SPMD parallel computing with large messages as seen in parallel kernels (linear algebra) in clustering and other data mining

4 Forms of MapReduce

Commercial “Web 2.0” Cloud Applications
Internet search, Social networking, e-commerce, cloud storage
These are larger systems than used in HPC with huge levels of parallelism coming from
  Processing of lots of users or
  An intrinsically parallel Tweet or Web search
Classic MapReduce is suitable (although Page Rank component of search is parallel linear algebra)
Data Intensive
Do not need microsecond messaging latency
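The Iterative MapReduce idea from the Classic Parallel Computing slide, where data stays cached between steps and each step ends in one large reduction, can be sketched with one-dimensional k-means clustering. This is an illustrative sketch and not the API of any particular runtime: the function names and sample data are assumptions, and the maps run sequentially here where a real runtime would run them in parallel.

```python
from collections import defaultdict


def map_phase(partition, centers):
    """Map: for one cached data partition, accumulate (sum, count) per center."""
    partial = defaultdict(lambda: [0.0, 0])
    for x in partition:
        nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        partial[nearest][0] += x
        partial[nearest][1] += 1
    return partial


def reduce_phase(partials, centers):
    """Reduce: one collective combination of all partial sums into new centers."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for idx, (total, count) in partial.items():
            totals[idx][0] += total
            totals[idx][1] += count
    # A center with no assigned points keeps its previous value.
    return [totals[i][0] / totals[i][1] if totals[i][1] else c
            for i, c in enumerate(centers)]


def iterative_kmeans(partitions, centers, max_iters=20, tol=1e-9):
    """Repeat map + reduce until the centers stop moving."""
    # The partitions are reused unchanged every iteration (the "cache");
    # only the small list of centers is sent to the maps each round.
    for _ in range(max_iters):
        partials = [map_phase(p, centers) for p in partitions]
        new_centers = reduce_phase(partials, centers)
        if max(abs(a - b) for a, b in zip(new_centers, centers)) < tol:
            return new_centers
        centers = new_centers
    return centers


partitions = [[1.0, 2.0, 9.0], [1.5, 10.0, 11.0]]  # made-up data, two "workers"
print(iterative_kmeans(partitions, centers=[0.0, 5.0]))  # [1.5, 10.0]
```

Communication happens only in the reduction and in the redistribution of centers, which matches the large collective operations the talk contrasts with the many small messages of particle dynamics or PDE solvers.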

Cosmic Comments I
Does Cloud + MPI Engine for computing + grids for data cover all?
Will current high throughput computing and cloud concepts merge?
Need interoperable data analytics libraries for HPC and Clouds that address new robustness and scaling challenges of big data
Business and Academia should collaborate on SPIDAL
Can we characterize data analytics applications?
I said modest size and kernels need reduction operations and are often full matrix linear algebra (true)
Does a “modest-size private science cloud” make sense?
Too small to be elastic
Should governments fund use of commercial clouds (or build their own)?
Are privacy issues motivating private clouds really valid?

Cosmic Comments II
Recent private cloud infrastructure (Eucalyptus 3, OpenStack Essex in USA) much improved
Nimbus, OpenNebula still good
But are they really competitive with commercial cloud fabric runtime?
Should we integrate Cloud Platforms with other Platforms?
Is Research Computing as a Service interesting?
Many related commercial offerings e.g. MapReduce value added vendors
Federated resources for CTaaS (Computing Testbed as a Service)
More employment opportunities in clouds than HPC and Grids and in data than simulation; so cloud and data related activities popular with students
Need international activity to discuss data science education
