Python and R dominate the domain of data science software. However, when it comes to scalable analysis, or deployment of trained models into production, barriers still exist. Many data scientists are hindered by the limited suite of functions available for handling large datasets efficiently, and by limited knowledge of the computing environments needed to scale R and Python scripts from desktop analysis to elastic and distributed cloud services. Another productivity limitation is the tedium of the experimentation loop in which the right preprocessing, model, and hyperparameters are found.
In this tutorial, we will demonstrate how to create scalable machine learning pipelines in R and Python, with emphasis on scaling on Spark clusters. We will model the data science journey by first prototyping locally and then moving the data science process to the cloud, to exploit the larger compute resources and data colocation that various Spark implementations offer. In particular, attendees will see how to build, persist, and consume machine learning models using distributed machine learning functions in Python and R. With a distributed computing platform in place, we will show how Microsoft’s AutoML library can automate the search for the best model.
We will provide hands-on exercises drawing on recent examples from time series forecasting, Active Learning, and Reinforcement Learning. Code samples will be available in a public GitHub repository. Spark and AzureML Compute clusters will be the target distributed platforms; participants will do exercises on Data Science Virtual Machines using RStudio and Jupyter notebooks.
- Scaling up your data science process: issues and solutions
  - What limits the scalability of your code in the face of large data? What techniques can be used to overcome those limits? What libraries can I use in Python? In R?
  - What limits your modeling productivity? How do I navigate the space of modeling choices: preprocessing sequences, models, hyperparameters?
- End-to-end scalable data process
  - Data exploration, wrangling, visualization, modeling, and deployment on single-node Data Science Virtual Machines and Spark clusters
  - Scalable analysis on single nodes: analysis with data on disk, in-database, and in Spark
  - Distributed model search and parameter optimization in Python with AutoML
  - Deployment of ML models as web service APIs with the Azure ML Python SDK, with parallel scoring on an elastic cluster
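The model-search item in the outline above can be illustrated with a minimal sketch. It is not the AutoML library's API and not the tutorial's own code: it is a plain grid search over one preprocessing choice (a hypothetical input `scale`) and one hyperparameter (a ridge penalty `alpha`), on synthetic data, using only the Python standard library. A local thread pool stands in for cluster workers, on the assumption that each candidate configuration can be evaluated independently.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
import random

# Synthetic data for illustration only: y = 3x + 2 plus Gaussian noise.
random.seed(0)
xs = [i / 10 for i in range(100)]
ys = [3 * x + 2 + random.gauss(0, 0.5) for x in xs]
train = list(zip(xs[::2], ys[::2]))    # even-indexed points
valid = list(zip(xs[1::2], ys[1::2]))  # odd-indexed points


def fit_ridge(data, scale, alpha):
    """Closed-form 1-D ridge regression on inputs multiplied by `scale`."""
    pts = [(x * scale, y) for x, y in data]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    slope = sxy / (sxx + alpha)  # alpha shrinks the slope toward zero
    return slope, mean_y - slope * mean_x


def evaluate(config):
    """Fit on the training split and return (validation MSE, config)."""
    scale, alpha = config
    slope, intercept = fit_ridge(train, scale, alpha)
    mse = sum((slope * x * scale + intercept - y) ** 2 for x, y in valid) / len(valid)
    return mse, config


# Candidate (scale, alpha) pairs; each one is an independent unit of work.
grid = list(product([0.1, 1.0], [0.0, 1.0, 100.0]))

# Threads stand in for distributed workers in this sketch.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, grid))

best_mse, best_config = min(results)
print(f"best (scale, alpha) = {best_config}, validation MSE = {best_mse:.3f}")
```

Because every configuration is scored independently, the same pattern extends to a cluster by swapping the thread pool for a distributed executor; an automated search library additionally decides which configurations to try next rather than enumerating a fixed grid.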
Tomas Singliar is a Principal Data Scientist in the Azure ML group, on Microsoft AI Platform's AutoML team. He works on automated search for the best forecasting models, where his experience architecting the Azure ML Python Package for Forecasting comes in handy. Tomas's favorite hammer is probabilistic and Bayesian modeling, which he applies analytically and predictively to business data. His favorite anvil is cloud data stores, especially MPP SQL databases and data lakes. He studied machine learning at the University of Pittsburgh. Tomas has published a dozen papers in, and serves as a reviewer for, several top-tier AI conferences (AAAI, UAI, etc.). He holds four patents in intent recognition through inverse reinforcement learning. Contact information: Tomas.Singliar@microsoft.com
Dr. Mario Inchiosa’s passion for data science and high-performance computing drives his work as Principal Software Engineer in Microsoft Cloud + AI, where he focuses on delivering advances in scalable advanced analytics, machine learning, and AI. Previously, Mario served as Revolution Analytics’ Chief Scientist and as Analytics Architect in IBM’s Big Data organization, where he worked on advanced analytics in Hadoop, Teradata, and R. Prior to that, Mario was US Chief Scientist in Netezza Labs, bringing advanced analytics and R integration to Netezza’s SQL-based data warehouse appliances. He also served as US Chief Science Officer at NuTech Solutions, a computer science consultancy specializing in simulation, optimization, and data mining, and Senior Scientist at BiosGroup, a complexity science spin-off of the Santa Fe Institute. Mario holds Bachelor’s, Master’s, and PhD degrees in Physics from Harvard University. He has been awarded four patents and has published over 30 research papers, earning Publication of the Year and Open Literature Publication Excellence awards. Contact information: email@example.com.
John Mark Agosta leads a team that is expanding the machine learning and artificial intelligence capabilities of Microsoft Azure. He recently joined Microsoft, which, if he were smarter, he would have done earlier in his career. That career has involved working with startups and labs in the Bay Area, in such areas as “The Connected Car 2025” at Toyota ITC, sales opportunity scoring at Inside Sales, malware detection at Intel, and automated planning at SRI. At Intel Labs, he was awarded a Santa Fe Institute Business Fellowship in 2007. He has over 30 peer-reviewed publications and 6 accepted patents. His dedication to probability and its applications is shown by his participation in the annual Uncertainty in AI conference since its inception in 1985. When feeling low, he recharges his spirits by singing Russian music with Slavyanka, the Bay Area’s Slavic music chorus. Contact information: firstname.lastname@example.org.
Dr. Hang Zhang is a Principal Data & Applied Scientist in the Commercial Software Engineering team at Microsoft. He is also an affiliated professor at the University of Washington. His technical interests include big data IoT, scalable data science and machine learning frameworks, and computer vision, among others. Before joining Microsoft in 2014, Hang had stints at Walmart Labs and Opera Solutions, where he led a team building internal tools for search analytics and business intelligence and focused on machine learning. Hang has a Ph.D. in Industrial and Systems Engineering and an M.S. in Statistics from Rutgers, The State University of New Jersey. He is a Senior Member of IEEE. Contact information: email@example.com.
Participants should come to the sessions with access to an Azure subscription. Azure’s free tier can be used.