Spark Summit West is always well attended, and this year was no exception. Data engineers, data scientists, programmers, architects and technology enthusiasts descended on San Francisco’s Moscone Center earlier this month to learn all about the latest developments with Apache Spark™ and its massive ecosystem.
The complexity of analytics and data science use cases was a dominant theme throughout this year’s event. The keynote by Ali Ghodsi, CEO and co-founder of Databricks, highlighted some of the challenges of implementing large-scale analytics projects. Ghodsi discussed how the continued growth of Apache Spark has produced myriad innovative use cases, from churn analytics to genome sequencing. These applications are difficult to develop: they often involve siloed teams of different domain experts; their complex workflows take too long to get from data access to insight; and the infrastructure is costly and difficult to manage.
AI, ML, DL
Data scientists like to explore data by transforming massive datasets and building large-scale machine learning models. If you’re looking to experiment with machine learning and deep learning, Spark is as good a platform as any to start with, and it continues to attract strong interest from academia and open-source developers.
Andy Feng and Lee Yang from Yahoo presented “TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters.” TensorFlowOnSpark is a new framework that enables easy experimentation with algorithm designs and supports scalable training and inference on Spark clusters. It supports all TensorFlow functionality, including synchronous and asynchronous learning, model and data parallelism, and TensorBoard. It provides architectural flexibility in how data is ingested into TensorFlow (pushing vs. pulling) and in the network protocol used for server-to-server communication (gRPC or RDMA). Its Python API makes integration with existing Spark libraries such as MLlib easy.
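The synchronous learning mode mentioned above can be illustrated with a toy, stdlib-only sketch: each "worker" computes a gradient on its own data partition, and the gradients are averaged before every parameter update. This is purely illustrative of the data-parallel pattern, not the TensorFlowOnSpark API; all function names here are hypothetical.

```python
# Toy sketch of synchronous data-parallel training (NOT the
# TensorFlowOnSpark API): workers compute gradients on their data
# partitions, and the averaged gradient drives each update step.

def partition(data, n_workers):
    """Split data into n_workers roughly equal chunks."""
    k, m = divmod(len(data), n_workers)
    return [data[i*k + min(i, m):(i+1)*k + min(i+1, m)] for i in range(n_workers)]

def local_gradient(w, chunk):
    """Gradient of mean squared error for the model y = w * x on one partition."""
    return sum(2 * (w * x - y) * x for x, y in chunk) / len(chunk)

def train_sync(data, n_workers=4, lr=0.01, steps=100):
    w = 0.0
    chunks = partition(data, n_workers)
    for _ in range(steps):
        grads = [local_gradient(w, c) for c in chunks]  # workers run in parallel
        w -= lr * sum(grads) / len(grads)               # synchronous average
    return w

data = [(x, 3.0 * x) for x in range(1, 9)]  # true slope is 3
print(round(train_sync(data), 3))  # 3.0
```

In asynchronous mode, by contrast, each worker would apply its gradient to the shared parameters as soon as it finishes, without waiting on the averaging barrier.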
Jason Dai and Radhika Rangarajan discussed BigDL, a distributed deep learning framework for Apache Spark recently open sourced by Intel. BigDL helps make deep learning more accessible to the big data community by letting users continue to work with familiar tools and infrastructure when building deep learning applications. With BigDL, users write their deep learning applications as standard Spark programs, which then run directly on top of existing Spark or Hadoop clusters.
HPC for Spark
Apache Spark workloads typically keep persistent data in memory, where it is frequently accessed over the network, making network I/O performance a critical component of Spark systems. The performance characteristics of HPC systems, such as high bandwidth, low latency and low CPU overhead, offer an excellent opportunity to accelerate Spark by increasing network throughput.
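A quick back-of-envelope calculation shows why interconnect bandwidth matters for shuffle-heavy Spark jobs. The bandwidth figures below are illustrative round numbers, not benchmark results from any of the talks:

```python
# Back-of-envelope: wall-clock time to move a 100 GB shuffle over
# interconnects of different bandwidths (illustrative figures only).

def transfer_seconds(data_gb, bandwidth_gbps):
    """Seconds to move data_gb gigabytes at bandwidth_gbps gigabits/second."""
    return (data_gb * 8) / bandwidth_gbps  # 8 bits per byte

shuffle_gb = 100
for name, gbps in [("10 GbE", 10), ("40 GbE", 40), ("100 Gb/s HPC fabric", 100)]:
    print(f"{name}: {transfer_seconds(shuffle_gb, gbps):.0f} s")
```

This ignores latency and CPU overhead, which HPC fabrics also improve, so the real gap between commodity Ethernet and an HPC interconnect can be larger than raw bandwidth suggests.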
Costin Iancu (Lawrence Berkeley National Laboratory) and Nicholas Chaimov (University of Oregon) presented findings from their research porting Apache Spark to the Cray® XC™ line of supercomputers. Their talk focused on addressing the scalability bottleneck posed by the global file system present in all large-scale HPC installations. Using two techniques (file open pooling and mounting the Spark file hierarchy in a specific manner), they improved scalability from O(100) cores to O(10,000) cores. This is the first result at such a large scale on HPC systems, and it had a transformative impact on their research, enabling their colleagues to run on 50,000 cores.
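File open pooling works by caching and reusing open file handles instead of re-opening files on every access, which reduces the load on the metadata servers that bottleneck a shared global file system. A minimal stdlib sketch of the pattern follows; the `FilePool` class and its methods are hypothetical illustrations, not code from the talk:

```python
import os
import tempfile

class FilePool:
    """Cache open file handles so repeated reads of the same path hit the
    cache instead of issuing a new open(), the metadata operation that
    bottlenecks shared parallel file systems. (Illustrative sketch only.)"""

    def __init__(self):
        self._handles = {}
        self.opens = 0  # count of actual open() calls

    def read(self, path):
        if path not in self._handles:
            self._handles[path] = open(path, "rb")
            self.opens += 1
        f = self._handles[path]
        f.seek(0)
        return f.read()

    def close(self):
        for f in self._handles.values():
            f.close()
        self._handles.clear()

# 100 reads of one file cost a single open() instead of 100.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"block data")
pool = FilePool()
for _ in range(100):
    data = pool.read(tmp.name)
print(pool.opens)  # 1
pool.close()
os.unlink(tmp.name)
```

On a laptop the saved `open()` calls are negligible, but on a global file system shared by tens of thousands of cores, every avoided metadata operation counts.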
Srivatsan Krishnan and Zhongyue Nah presented Intel’s design and implementation of FPGAs as a supplement to vcores in Spark’s YARN mode, accelerating SparkML applications on the Intel Xeon+FPGA platform. In particular, they added new options to Spark core that provide an interface for users to describe an application’s accelerator dependencies. The FPGA information in the Spark context is used by the new APIs and the DRF (dominant resource fairness) policy implemented on YARN to schedule Spark executors onto hosts with Xeon+FPGA installed. Experimental results using ALS scoring applications, which accelerate general matrix-matrix multiplication (GEMM) operations, demonstrate that Xeon+FPGA improves FLOPS throughput by 1.5× compared to a CPU-only cluster.
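The GEMM kernel being offloaded here performs 2·m·n·k floating-point operations for an m×k by k×n multiply (one multiply and one add per inner-product term), which is what makes it a natural target for an accelerator. A stdlib sketch of the kernel and its FLOP count, purely for illustration and unrelated to Intel's FPGA implementation:

```python
def gemm(A, B):
    """Naive general matrix-matrix multiply: C = A @ B (lists of rows)."""
    m, k = len(A), len(A[0])
    k2, n = len(B), len(B[0])
    assert k == k2, "inner dimensions must match"
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def gemm_flops(m, n, k):
    """One multiply plus one add per inner-product term: 2*m*n*k FLOPs."""
    return 2 * m * n * k

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(gemm(A, B))           # [[19.0, 22.0], [43.0, 50.0]]
print(gemm_flops(2, 2, 2))  # 16
```

Because the FLOP count grows cubically while the data moved grows only quadratically, GEMM has high arithmetic intensity, exactly the profile where a fixed-function accelerator can beat a general-purpose CPU.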
My recommended talks
Videos and, in some cases, slides are available for these and other sessions from Spark Summit West 2017:
“Databricks,” Ali Ghodsi and Greg Owen, Databricks (video and slides)
“BigDL: Bringing Ease of Use of Deep Learning For Apache Spark,” Jason Dai and Radhika Rangarajan, Intel (video)
“Apache Spark on Supercomputers: A Tale of The Storage Hierarchy,” Costin Iancu (LBNL) and Nicholas Chaimov (University of Oregon) (video)
“Accelerating SparkML Workloads on The Intel Xeon+FPGA Platform,” Srivatsan Krishnan and Zhongyue Nah, Intel (video)
“Speeding Up Spark with Data Compression on Xeon+FPGA,” David Ojika (University of Florida) (slides and video)
“Scaling Genetic Data Analysis with Apache Spark,” Jonathan Bloom and Timothy Poterba (Broad Institute of MIT and Harvard) (slides and video)
“Needle in the Haystack—User Behavior Anomaly Detection for Information Security,” Ping Yan and Wei Deng, Salesforce.com (video)
The post Data Analytics Rule at Spark Summit West 2017 appeared first on Cray Blog.