Rosetta Academy Lessons in Practice: Design Patterns and the Anatomy of the Analytics Workload
(Credits to Steely Dan for the title reference)
Go back and try a second model, with a broader view and expanded skills, and you might just find a fresh and forward-looking method for analytics in the modern cloud enterprise. The anatomy of analytics workloads and the application of foundational SAS and SQL skills in the land of open-source languages reveal a wider perspective on what’s possible with legacy SAS assets. Here’s a story about a Corios client that gained both training and a new point of view on modeling with a dual approach and an eye toward innovation.
Building a model pipeline twice in both conventional and open-source contexts
Corios was hired by a rapidly growing bank to build the newest release of their prospect acquisition scorecard model, and to build it twice: once in SAS 9.4 (their production environment) and a second time with an open-source approach that leveraged Python, Dask and Spark.
The reason for the second modeling effort was to explore what an innovative, modern cloud-focused analytic environment could and should look like to support predictive model lifecycle management: authoring, champion/challenger experimentation, validation, version management, cloud deployment, drift analysis and ongoing refresh.
Two Pipelines: Breaking in a New Model
Building the first, traditional model pipeline was familiar territory for us because we had built several models for this client that had been put into production over the past few years. The greatest challenge was the very high bar set for the model’s mathematical performance: we had to beat the current version, which was well constructed and already performed strongly.
The second model broke a lot of new ground for the client. Major elements included:
- Amazon Web Services clustered compute, storage, code development and management
- Python, Dask and Spark as open-source frameworks for analytic pipeline development
- Side-by-side comparisons for analytics assets built in familiar territory (SAS) and unfamiliar territory (open-source frameworks on cloud services)
- Novel analytics techniques (with new performance contributions) made available by open-source frameworks for the first time in a native, business-critical context
But what was the basis for the new framework to arrive at this breakthrough model? At its core is the Analytics Design Patterns course that is part of the Corios Rosetta Academy curriculum.
Anatomy of an Analytic Workload
To help the client think about the new tools and capabilities, we built a simple framework called the “Anatomy of an Analytic Workload”. It has 11 stages, such as “Extract”, “Filter”, “Explore” and “Deploy”, and each stage consists of analytic design patterns that are familiar to seasoned analytics professionals. All eleven stages are covered in more depth in the first course of the Rosetta Academy curriculum. We then overlaid both the traditional SAS model and the hybrid open-source plus SAS Viya model on this anatomy framework, and compared the conventional code in each part of the model pipeline with its open-source counterpart.
This went a long way towards demystifying what our team built in both cases. Such a framework won’t turn a new analyst fresh out of school into a seasoned expert, and it won’t teach you how to write the proper code for each part of the project, but it will give you a roadmap for linking the major building blocks together.
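As a rough illustration of linking building blocks together, here is a minimal Python sketch that chains the four stages named above. It is not the Rosetta Academy framework or the client's pipeline: the sample rows, the filter rule, the scoring rule and every function name are hypothetical, and only the four stage names come from this article.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Stage = Callable[[List[Row]], List[Row]]


def extract(_: List[Row]) -> List[Row]:
    # Hypothetical in-memory rows standing in for a database or file read.
    return [
        {"id": 1, "income": 52000.0, "age": 34},
        {"id": 2, "income": None, "age": 51},
        {"id": 3, "income": 87000.0, "age": 45},
        {"id": 4, "income": 31000.0, "age": 17},
    ]


def filter_rows(rows: List[Row]) -> List[Row]:
    # Illustrative rule only: keep adults with a known income.
    return [r for r in rows if r["income"] is not None and r["age"] >= 18]


def explore(rows: List[Row]) -> List[Row]:
    # Profile the data without changing it.
    print("  mean income:", mean(r["income"] for r in rows))
    return rows


def deploy(rows: List[Row]) -> List[Row]:
    # Stand-in for deployment: attach a score from a made-up rule.
    return [{**r, "score": round(r["income"] / 100000.0, 2)} for r in rows]


def run_workload(stages: List[Tuple[str, Stage]]) -> List[Row]:
    # Each stage receives the previous stage's output, in order.
    rows: List[Row] = []
    for name, stage in stages:
        rows = stage(rows)
        print(f"{name}: {len(rows)} rows")
    return rows


WORKLOAD = [
    ("Extract", extract),
    ("Filter", filter_rows),
    ("Explore", explore),
    ("Deploy", deploy),
]

scored = run_workload(WORKLOAD)
```

Because every stage shares the same rows-in, rows-out signature, a stage written against one toolset could in principle be swapped for an equivalent written against another without touching the rest of the list, which is the kind of side-by-side comparison the anatomy framework is meant to support.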
We also learned that when working with open-source frameworks, you cannot separate model pipeline development from architecture. Instead, the Data Scientist role needs to work very closely with the DevOps Engineer and the Data Engineer roles to make changes to the architecture environment along the way, not because you can, but often because you must. The Data Scientist will need to understand design patterns, architecture, scale, APIs, and code and object interactions. They will also need to ask the right questions of their peers to adjust their working environment as they encounter challenges with the open-source libraries they’re using, the scale of the data they’re working with, and the storage and network implications of working with large data.
In contrast, when working with conventional systems like SAS 9.4, many of the architecture choices are made once (potentially years ago) and are not easily changed, but at least the software platform abstracts many of the architecture, storage and security concerns away from the Data Engineer and the Data Scientist roles.
Advantages of the cloud route
Migrating SAS to the cloud has distinct advantages regardless of which cloud service platform you choose. Different workloads may also favor different platforms, so you may end up planning SAS code migration to more than one destination. Knowing the destination of your assets is therefore an important factor in assessing your SAS analytics assets.
In this specific case, the choice of Amazon Web Services as the cloud infrastructure and services layer gave our team a lot of power and flexibility to make these adaptations along the way. We added new open-source libraries to our toolset, changed our data formats and storage patterns, and expanded from single compute hosts to large clusters of machines running Spark and Dask workloads. We also moved our data back and forth between the open-source frameworks and the SAS Viya environment with little friction.
An important challenge to the data science industry, as we continue to embrace and use open-source frameworks, is how to address the needs of model validation and transparency in a regulated business context like banking and financial services.
Not nearly enough development has been invested in open-source frameworks like SparkML, DaskML and XGBoost to expose the internals of the machine learning model’s inference pipeline.
- Nearly every fitted model in a pipeline is constructed as a binary object, and only in some cases can the data scientist inspect the properties of those objects at the level to which seasoned professionals are accustomed. I personally opt for the “trust and verify” approach when validating a model over the “trust blindly” approach that is required when working with these newer frameworks.
- There are some useful developments going on in the area of Shapley model profiling and model drift, but it’s our view that some very valuable model performance and validation criteria, established for decades, are now being re-invented in the open-source world without sufficient attention to the ground that has already been broken.
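The article does not name the established criteria it has in mind. As an assumption for illustration only, the sketch below computes two measures that have long been used in credit scorecard work: the Kolmogorov-Smirnov (KS) statistic for class separation and the Population Stability Index (PSI) for score drift. The scores, labels and bin edges are made up, and the function names are hypothetical. It uses only the Python standard library, so the same check can be run against the output of any framework’s fitted model.

```python
from math import log
from typing import List, Sequence


def ks_statistic(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Largest gap between the cumulative score distributions of the two classes."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(scores, labels))
    cum_pos = cum_neg = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        # Consume all rows tied on the same score before measuring the gap.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_pos += 1
            else:
                cum_neg += 1
            j += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
        i = j
    return best


def psi(expected: Sequence[float], actual: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two score samples over fixed bins."""

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            # The bin index is the number of edges the value exceeds.
            counts[sum(v > edge for edge in edges)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    exp_share, act_share = shares(expected), shares(actual)
    return sum((a - e) * log(a / e) for e, a in zip(exp_share, act_share))


# Made-up development sample: model scores and observed outcomes (1 = event).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
EDGES = [0.25, 0.5, 0.75]  # hypothetical score bands

ks = ks_statistic(scores, labels)

# Made-up recent sample in which every score has shifted upward.
recent = [s + 0.2 for s in scores]
drift = psi(scores, recent, EDGES)

print("KS:", ks)
print("PSI:", drift)
```

Recomputing measures like these outside the modeling framework is one way to practice “trust and verify” when the fitted model itself is only available as a binary object.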
Want to learn more about effective, enterprise-scale machine learning and deployment in financial services, and how your team can build Python, Spark and cloud skills on top of their foundational SAS expertise? Visit our services pages and the Rosetta Academy. Then email me directly at president@coriosgroup.com.