Scale-Out Data Science with R and Python (ICSE 2019 - Tutorials) - International Conference on Software Engineering 2019 in Montreal, Canada

Sat 25 - Fri 31 May 2019 Montreal, QC, Canada

Who

Tomas Singliar, Mario Inchiosa, John Mark Agosta, Hang Zhang

Track

ICSE 2019 Tutorials

When

Tue 28 May 2019 09:00 - 17:30 at Place du Canada - Scale-Out Data Science with R and Python

Abstract

Course Description

Hands-on tutorial duration: 6 hours (2x 3-hour sessions)

Target audience: Intermediate level in knowledge and practice of machine learning, R, and Python

Abstract

Python and R dominate data science software. However, barriers remain when it comes to scalable analysis or deploying trained models into production. Many data scientists are hindered by a limited suite of functions for handling large datasets efficiently, and by limited knowledge of the computing environments needed to scale R and Python scripts from desktop analysis to elastic, distributed cloud services. Another productivity limitation is the tedium of the experimentation loop in which the right preprocessing, model, and hyperparameters are found.

In this tutorial, we will demonstrate how to create scalable machine learning pipelines in R and Python, with emphasis on scaling on Spark clusters. We will model the data science journey by first prototyping locally and then moving the data science process to the cloud, to exploit the larger compute resources and data colocation that various Spark implementations offer. In particular, attendees will see how to build, persist, and consume machine learning models using distributed machine learning functions in Python and R. With a distributed computing platform in place, we will show how Microsoft’s AutoML library can automate the search for the best model.

We will provide hands-on exercises drawing on recent examples from time series forecasting, Active Learning, and Reinforcement Learning. Code samples will be available in a public GitHub repository. Spark and AzureML Compute clusters will be the target distributed platforms; participants will do exercises on Data Science Virtual Machines using RStudio and Jupyter notebooks.

Outline

Introduction:

Scaling up your data science process - issues and solutions
What limits the scalability of your code in the face of large data? What techniques can overcome those limits? Which libraries can you use in Python? In R?
What limits your modeling productivity? How do you navigate the space of modeling choices: preprocessing sequences, models, and hyperparameters?
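
On the first question above, one widely used technique is to read data in chunks that fit in memory and combine per-chunk summaries. The following is a minimal sketch of that idea using only the Python standard library; it is not taken from the tutorial materials, and the helper names `iter_chunks` and `chunked_mean`, the column name `x`, and the in-memory stand-in for a large file are all hypothetical.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(rows, column, chunk_size=2):
    """Compute the mean of a numeric column, holding one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in iter_chunks(rows, chunk_size):
        # Only running aggregates survive between chunks.
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no rows to aggregate")
    return total / count


# Hypothetical stand-in for a CSV file too large to load at once.
data = io.StringIO("x\n1\n2\n3\n4\n5\n")
mean_x = chunked_mean(csv.DictReader(data), "x")
```

The same pattern of per-chunk partial results followed by a cheap combine step is what lets an aggregation be spread across the partitions of a cluster.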

Hands-on exercises and demonstrations:

End-to-end scalable data process
Data exploration, wrangling, visualization, modeling, and deployment on single-node Data Science Virtual Machines and Spark clusters
Scalable analysis on single nodes: analysis with data on disk, in-database, and in Spark
Distributed model search and parameter optimization in Python with AutoML
Deployment of ML models as web service APIs with the Azure ML Python SDK, with parallel scoring on an elastic cluster
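
The model search item above can be pictured as evaluating many candidate configurations in parallel and keeping the best one. The sketch below illustrates only that general idea with a toy problem and a local thread pool; it does not use or represent the AutoML API, and the search space, the toy data, and the function names are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical search space: one preprocessing choice and one hyperparameter.
SEARCH_SPACE = {
    "scale": [False, True],
    "degree": [1, 2, 3],
}

# Toy data generated from y = x^2.
XS = [0.0, 1.0, 2.0, 3.0]
YS = [x ** 2 for x in XS]


def candidate_configs(space):
    """Enumerate every combination of choices in the search space."""
    keys = sorted(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def evaluate(config):
    """Return (sum of squared errors, config) for the toy model y = x^degree."""
    peak = max(XS) if config["scale"] else 1.0
    error = sum(
        ((x / peak) ** config["degree"] - y) ** 2 for x, y in zip(XS, YS)
    )
    return error, config


def search(space, workers=4):
    """Evaluate all candidates concurrently and return the lowest-error one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(evaluate, candidate_configs(space)))
    return min(results, key=lambda result: result[0])


best_error, best_config = search(SEARCH_SPACE)
```

Because each candidate is scored independently, the thread pool here could in principle be swapped for workers on a cluster without changing the search logic.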

Tomas Singliar is a Principal Data Scientist in the Azure ML group, on Microsoft AI Platform's AutoML team. He works on automated search for the best forecasting models, where his experience architecting the Azure ML Python Package for Forecasting comes in handy. Tomas's favorite hammer is probabilistic and Bayesian modeling, which he applies analytically and predictively to business data. His favorite anvil is cloud data stores, especially MPP SQL databases and data lakes. He studied machine learning at the University of Pittsburgh. Tomas has published a dozen papers in, and serves as a reviewer for, several top-tier AI conferences (AAAI, UAI, etc.). He holds four patents in intent recognition through inverse reinforcement learning. Contact information: Tomas.Singliar@microsoft.com

Dr. Mario Inchiosa’s passion for data science and high-performance computing drives his work as Principal Software Engineer in Microsoft Cloud + AI, where he focuses on delivering advances in scalable advanced analytics, machine learning, and AI. Previously, Mario served as Revolution Analytics’ Chief Scientist and as Analytics Architect in IBM’s Big Data organization, where he worked on advanced analytics in Hadoop, Teradata, and R. Prior to that, Mario was US Chief Scientist in Netezza Labs, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances. He also served as US Chief Science Officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining, and Senior Scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds Bachelor’s, Master’s, and PhD degrees in Physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning Publication of the Year and Open Literature Publication Excellence awards. Contact information: marinch@microsoft.com.

John Mark Agosta leads a team that is expanding the machine learning and artificial intelligence capabilities of Microsoft Azure. He recently joined Microsoft, which, if he were smarter, he would have done earlier in his career, a career that involved working with startups and labs in the Bay Area in such areas as “The Connected Car 2025” at Toyota ITC, sales opportunity scoring at Inside Sales, malware detection at Intel, and automated planning at SRI. At Intel Labs, he was awarded a Santa Fe Institute Business Fellowship in 2007. He has over 30 peer-reviewed publications and 6 accepted patents. His dedication to probability and its applications is shown by his participation in the annual Uncertainty in AI conference since its inception in 1985. When feeling low, he recharges his spirits by singing Russian music with Slavyanka, the Bay Area’s Slavic music chorus. Contact information: joagosta@microsoft.com.

Dr. Hang Zhang is a Principal Data & Applied Scientist in the Commercial Software Engineering team at Microsoft. He is also an affiliated professor at the University of Washington. His technical interests include big data, IoT, scalable data science and machine learning frameworks, and computer vision. Before joining Microsoft in 2014, Hang had stints at Walmart Labs and Opera Solutions, where he led a team building internal tools for search analytics and business intelligence and focused on machine learning. Hang has a Ph.D. in Industrial and Systems Engineering and an M.S. in Statistics from Rutgers, The State University of New Jersey. He is a Senior Member of IEEE. Contact information: hangzh@microsoft.com.

Prerequisites

Participants should come to the sessions with access to an Azure subscription. You can use Azure’s free tier.

Tomas Singliar

Microsoft

Mario Inchiosa

Microsoft

John Mark Agosta

Microsoft

Hang Zhang

Microsoft

Session Program

Tue 28 May
Displayed time zone: Eastern Time (US & Canada)

09:00 - 17:30	Scale-Out Data Science with R and Python (Tutorials) at Place du Canada

09:00 8h30m Tutorial		Scale-Out Data Science with R and Python (Industry Program, Tutorials): Tomas Singliar (Microsoft), Mario Inchiosa (Microsoft), John Mark Agosta (Microsoft), Hang Zhang (Microsoft)