
Delivering Results Under Tight Deadlines


"Quality, Budget, or Time - pick any two!"

That is the rule of thumb when it comes to delivering projects in the Software Engineering world.

In the real world, no one likes that. No sane project manager wants to compromise on quality, go over budget, or miss a deadline!

As part of Engineering at STARZPLAY, we are often required to deliver results under tight deadlines. Recently we had to pull through a project on an extremely tight deadline, for something that would ideally have taken at least 3x the given time. So how did we manage to maintain quality, stay within budget, and still meet our deadline?



Here are some of my takeaways from being a part of this project:

- Parallelism & Sequence:
Having clear expectations is the basic necessity for planning well. Once the plan is on the table, the most important factor for rapid delivery is identifying the tasks that can be executed in parallel and zeroing in on the order (sequence) of execution. That is where time is saved or earned (a small sketch after this list illustrates the idea).

- Automation:
Replace manual effort around deployment, and the provisioning, cloning, and sharing of environments, with automation scripts and tools as far as possible (a minimal sketch of one such script also follows this list).

- Flat Hierarchy:
People are important for any successful execution. However, it is not just the number of people but their skill and productivity that make a difference, because we are looking at meeting a tight deadline, not grooming a team for the future (that is another paradigm and a subject for another day). Each member of the team is considered a leader who owns their tasks. Instead of one large team, we kept the team small and fully autonomous, with members carefully picked for a variety of skillsets.

- Tracking Time & Direction:
One member of the team tracks actions and decisions, ensures follow-ups on a daily basis, and documents the outcomes. This keeps members focused in the desired direction and within budget, and helps in reconfiguring priorities when needed.

- Scope & Acceptance Criteria:
So we keep our quality, we stick to our budget, and we meet our deadline, but what exactly gets delivered on that deadline is continually up for discussion and consideration. That is where scope comes in. One should be clear on the acceptance criteria for every deliverable, because optimization is a process that can go on indefinitely. In order to close a project on time, a scope for every task needs to be agreed upon.

- Tweak your process, not the outcome:
For high-velocity projects, it helps to give more weight to `people and interactions` than to following a process just because it is "the process". Quoting Steve Jobs: "Customers don't measure you on how you do it or how hard you try, they measure you on what you deliver."
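As a small illustration of the parallelism point, here is a sketch that runs a (hypothetical) project task graph with Python's standard library, executing everything whose dependencies are met in parallel while still respecting the agreed sequence. The task names are made up for the example:

```python
"""Minimal sketch: run a task graph with maximum parallelism."""
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Assumed task graph for illustration: each task maps to its prerequisites.
DEPENDENCIES = {
    "provision infra": set(),
    "build backend": {"provision infra"},
    "build frontend": {"provision infra"},
    "integration tests": {"build backend", "build frontend"},
}

def run(task: str) -> None:
    print(f"running: {task}")  # placeholder for the real work

sorter = TopologicalSorter(DEPENDENCIES)
sorter.prepare()
with ThreadPoolExecutor() as pool:
    while sorter.is_active():
        ready = sorter.get_ready()  # all tasks whose prerequisites are done
        # Submit the whole ready batch first so it runs in parallel...
        futures = [(pool.submit(run, task), task) for task in ready]
        for future, task in futures:
            future.result()       # ...then wait and mark each one complete
            sorter.done(task)
```

Here "build backend" and "build frontend" run side by side, which is exactly where the time is earned.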
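And for the automation point, here is a minimal sketch of an environment-cloning script. The `envs/` layout, the environment names, and the Terraform workflow are assumptions for illustration, not our actual setup:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: clone an environment definition and provision it,
replacing a manual copy-and-configure routine. Assumes environments are
described by per-environment Terraform variable files under envs/."""
import shutil
import subprocess
import sys
from pathlib import Path

def clone_environment(source: str, target: str) -> None:
    src = Path("envs") / f"{source}.tfvars"
    dst = Path("envs") / f"{target}.tfvars"
    if not src.exists():
        sys.exit(f"unknown source environment: {source}")
    shutil.copy(src, dst)  # start the new environment from a known-good definition
    # Provision the cloned environment non-interactively.
    subprocess.run(["terraform", "workspace", "new", target], check=False)  # may already exist
    subprocess.run(["terraform", "workspace", "select", target], check=True)
    subprocess.run(["terraform", "apply", "-auto-approve", f"-var-file={dst}"], check=True)

if __name__ == "__main__":
    clone_environment(sys.argv[1], sys.argv[2])  # e.g. staging -> loadtest
```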

How DevOps practices reinforce AI/ML

The DevOps Success Story

It's mid-2020, and the software development life cycle has reached an appreciable level of maturity. Two things stand out now:

  • DevOps culture & practices have evolved immensely. From version control to build phases, CI/CD, automated tests, deployment orchestration, and cloud infrastructure-as-code, all these processes have created a synergy for successful software delivery.

  • The tool ecosystem for developing software applications is remarkably rich, and DevOps processes have revolved around these tools to 'Automate everything humanly possible!'

For the uninitiated, DevOps is defined as "a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality".


DevOps has helped software businesses succeed mainly in these three ways:

  • Collaboration: DevOps has helped break down silos between software developers and operations engineers. It is a culture that promotes "We build it, we run it" over the "throw the code over the wall" paradigm.

  • Speed: DevOps practices heavily advocate the "Automate Everything" ideology and this leads to faster time to market.

  • Reliability: Use of standardised CI/CD pipelines leads to near-zero errors and reproducibility. Things either fail fast (which is good, and can be fixed) or do not fail at all!

Enter Artificial Intelligence

In parallel to the rise of Cloud-native software services & DevOps, one more area that is making waves in the technology circles is Artificial Intelligence (AI).

The fathers of AI, Minsky and McCarthy, described artificial intelligence as any task performed by a program or a machine that, if a human carried out the same activity, we would say the human had to apply intelligence to accomplish the task.

AI has many sub-disciplines and helps solve problems in various fields: planning, learning, reasoning, problem solving, knowledge representation, perception, motion and manipulation, and, to a lesser extent, social intelligence and creativity.


On one hand, DevOps is the de facto standard for application development. On the other, modern ML (Machine Learning) and AI do not have a standard tooling or process ecosystem. This makes sense for a number of reasons:

  • AI research was long confined to universities and labs, which had their own development methodologies, including CRISP-DM and the Microsoft Team Data Science Process (TDSP).

  • Best practices have not yet emerged because the tools are changing rapidly, and there is a need for a single body of knowledge here.

The excerpt below, from the Microsoft Azure Blog, throws more light on the topic:

"AI/ML projects Like DevOps, these methodologies are grounded in principles and practices learned from real-world projects. AI/ML teams use an approach unique to data science projects where there are frequent, small iterations to refine the data features, the model, and the analytics question. It's a process intended to align a business problem with AI/ML model development. The release process is not a focus for CRISP-DM or TDSP and there is little interaction with an operations team. DevOps teams (today) are yet not familiar with the tools, languages, and artifacts of data science projects. 

DevOps and AI/ML development are two independent methodologies with a common goal: to put an AI application into production. Today it takes effort to bridge the gaps between the two approaches. AI/ML projects need to incorporate some of the operational and deployment practices that make DevOps effective, and DevOps projects need to accommodate the AI/ML development process to automate the deployment and release process for AI/ML models.

DevOps for AI/ML

DevOps for AI/ML has the potential to stabilize and streamline the model release process. It is often paired with the practice and toolset to support Continuous Integration/Continuous Deployment (CI/CD). Here are some ways to consider CI/CD for AI/ML workstreams:

  • The AI/ML process relies on experimentation and iteration of models, and it can take hours or days for a model to train and test. Carve out a separate workflow to accommodate the timelines and artifacts of a model build and test cycle, and avoid gating time-sensitive application builds on AI/ML model builds (see the sketch after this list).

  • For AI/ML teams, think about models as having an expectation to deliver value over time rather than a one-time construction of the model. Adopt practices and processes that plan for and allow a model lifecycle and evolution.

  • DevOps is often characterized as bringing together business, development, release, and operational expertise to deliver a solution. Ensure that AI/ML is represented on feature teams and is included throughout the design, development, and operational sessions.
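To illustrate the first bullet above, here is a minimal sketch of one way to decouple model builds from application builds. The file-based registry layout (models/&lt;name&gt;/&lt;version&gt;/), the model name, and the artifact names are assumptions for illustration, not part of the excerpt:

```python
"""Hypothetical sketch: decouple model builds from application builds.

Assumes a file-based model registry; swap in your registry or artifact
store of choice.
"""
import json
import time
from pathlib import Path

REGISTRY = Path("models/churn")  # hypothetical model name

def publish_model(model_bytes: bytes, metrics: dict) -> str:
    """Model workflow: after a long train/test cycle, publish a versioned artifact."""
    version = time.strftime("%Y%m%d-%H%M%S")
    out = REGISTRY / version
    out.mkdir(parents=True, exist_ok=True)
    (out / "model.bin").write_bytes(model_bytes)
    (out / "metrics.json").write_text(json.dumps(metrics))
    return version

def latest_model() -> Path:
    """App workflow: consume the newest published model; never block on training."""
    versions = sorted(p for p in REGISTRY.iterdir() if p.is_dir())
    if not versions:
        raise RuntimeError("no published model yet; keep shipping the previous release")
    return versions[-1] / "model.bin"
```

The application build simply calls latest_model() and ships, while training runs on its own, much slower, cadence.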

Establish performance metrics and operational telemetry for AI/ML

Use metrics and telemetry to inform what models will be deployed and updated. Metrics can be standard performance measures like precision, recall, or F1 scores. Or they can be scenario specific measures like the industry-standard fraud metrics developed to inform a fraud manager about a fraud model's performance. Here are some ways to integrate AI/ML metrics into an application solution: 

  • Define model accuracy metrics and track them through model training, validation, testing, and deployment (a minimal sketch follows this list).

  • Define business metrics to capture the business impact of the model in operations. For an example see R notebook for fraud metrics.

  • Capture data metrics, like dataset sizes, volumes, update frequencies, distributions, categories, and data types. Model performance can change unexpectedly for many reasons and it's expedient to know if changes are due to data.

  • Track operational telemetry about the model: how often is it called? By which applications or gateways? Are there problems? What are the accuracy and usage trends? How much compute or memory does the model consume?

  • Create a model performance dashboard that tracks model versions, performance metrics, and data sets.
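As a minimal sketch of the first bullet, the following assumes scikit-learn for the standard metrics and uses a plain JSON-lines log file as a stand-in for a real metrics store or dashboard backend:

```python
"""Hypothetical sketch: track accuracy metrics through the model lifecycle."""
import json
import time
from sklearn.metrics import f1_score, precision_score, recall_score

def record_metrics(model_version: str, stage: str, y_true, y_pred) -> dict:
    """Compute precision/recall/F1 and tag them with the lifecycle stage."""
    record = {
        "model_version": model_version,
        "stage": stage,  # e.g. "training", "validation", "testing", "deployment"
        "timestamp": time.time(),
        "precision": float(precision_score(y_true, y_pred)),
        "recall": float(recall_score(y_true, y_pred)),
        "f1": float(f1_score(y_true, y_pred)),
    }
    # Stand-in for a real metrics store feeding the performance dashboard.
    with open("model_metrics.jsonl", "a") as log:
        log.write(json.dumps(record) + "\n")
    return record

# Example: record validation metrics for a made-up model version.
record_metrics("20200601-01", "validation", [1, 0, 1, 1], [1, 0, 0, 1])
```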

AI/ML models need to be updated periodically. Over time, and as new and different data becomes available — or customers or seasons or trends change — a model will need to be re-trained to continue to be effective. Use metrics and telemetry to help refine the update strategy and determine when a model needs to be re-trained.
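One way to act on such telemetry is a simple drift check that flags a model for retraining when live performance falls below its validation baseline. The threshold here is an assumed tolerance, not a recommendation from the excerpt:

```python
"""Hypothetical sketch: trigger retraining when live performance drifts."""

F1_DROP_THRESHOLD = 0.05  # assumed tolerance; tune per model and business impact

def needs_retraining(validation_f1: float, live_f1: float) -> bool:
    """Compare the deployed model's live F1 against its validation baseline."""
    return (validation_f1 - live_f1) > F1_DROP_THRESHOLD

def check_and_alert(validation_f1: float, live_f1: float) -> None:
    if needs_retraining(validation_f1, live_f1):
        # Placeholder for a real notification: pager, chat webhook, or pipeline trigger.
        print(f"retraining recommended: live F1 {live_f1:.3f} "
              f"vs validation {validation_f1:.3f}")

check_and_alert(0.91, 0.84)  # example: a 7-point drop exceeds the assumed tolerance
```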

Automate the end-to-end data and model pipeline

The AI/ML pipeline is an important concept because it connects the necessary tools, processes, and data elements to produce and operationalize an AI/ML model. It also introduces another dimension of complexity for a DevOps process. One of the foundational pillars of DevOps is automation, but automating an end-to-end data and model pipeline is a byzantine integration challenge.

Workstreams in an AI/ML pipeline are typically divided between different teams of experts where each step in the process can be very detailed and intricate. It may not be practical to automate across the entire pipeline because of the difference in requirements, tools, and languages. Identify the steps in the process that can be easily automated like the data transformation scripts, or data and model quality checks. Consider the following workstreams:  

  • Data Analysis
    Description: Includes data acquisition, focusing on exploring, profiling, cleaning, and transforming. Also includes enriching and staging data for modeling.
    Automation: Develop scripts and tests to move and validate the data. Also create scripts to report on data quality, changes, volume, and consistency.

  • Experimentation
    Description: Includes feature engineering, model fitting, and model evaluation.
    Automation: Develop scripts, tests, and documentation to reproduce the steps and capture model outputs and performance.

  • Release Process
    Description: Includes the process for deploying a model and data pipeline into production.
    Automation: Integrate the AI/ML pipeline into the release process.

  • Operationalization
    Description: Includes capturing operational and performance metrics.
    Automation: Create operational instrumentation for the AI/ML pipeline. For subsequent model retraining cycles, capture and store model inputs and outputs.

  • Model Re-training and Refinement
    Description: Determine a cadence for model re-training.
    Automation: Instrument the AI/ML pipeline with alerts and notifications to trigger retraining.

  • Visualization
    Description: Develop an AI/ML dashboard to centralize information and metrics related to the model and data. Include accuracy, operational characteristics, business impact, history, and versions.
    Automation: n/a

An automated end-to-end process for the AI/ML pipeline can accelerate development and drive reproducibility, consistency, and efficiency across AI/ML projects."
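Picking one of the "easily automated" steps the excerpt recommends, here is a minimal sketch of a data quality gate a pipeline could run before training. The expected schema, row floor, null tolerance, and file name are all hypothetical:

```python
"""Hypothetical sketch: automated data quality checks for an AI/ML pipeline."""
import pandas as pd

EXPECTED_COLUMNS = {"user_id", "plan", "watch_minutes", "churned"}  # assumed schema
MIN_ROWS = 10_000          # assumed volume floor
MAX_NULL_FRACTION = 0.02   # assumed tolerance for missing values

def validate(df: pd.DataFrame) -> list:
    """Return a list of failed checks; an empty list means the data may proceed."""
    failures = []
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        failures.append(f"missing columns: {sorted(missing)}")
    if len(df) < MIN_ROWS:
        failures.append(f"too few rows: {len(df)} < {MIN_ROWS}")
    null_fraction = df.isna().mean().max() if len(df) else 1.0
    if null_fraction > MAX_NULL_FRACTION:
        failures.append(f"null fraction {null_fraction:.3f} exceeds {MAX_NULL_FRACTION}")
    return failures

if __name__ == "__main__":
    problems = validate(pd.read_csv("viewing_events.csv"))  # path is illustrative
    if problems:
        raise SystemExit("data quality gate failed: " + "; ".join(problems))
```

A gate like this fails fast on bad inputs, which is exactly the kind of failure DevOps practices consider good: cheap to detect and cheap to fix.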


The Challenges


The problems plaguing AI/ML engineers and data scientists are the need for toolchains, automation pipelines, knowledge of standard model-training frameworks, and ease of hardware access: different teams need different numbers of GPUs, FPGAs, CPUs, TPUs, or even IPUs.


Here are some of the challenges, put out as questions:


  • Who manages and maintains these resources for AI teams?

  • Who administers hardware resources?

  • Who prioritizes the jobs?

  • How is the sanity of resource allocations maintained? 

  • Who supports automation scripting and defining pipelines?

  • Who handles security issues, authentication & authorization?

  • Who ensures all the accelerators and nodes are optimized?

  • How do we profile slow applications and help the Data Scientists?

  • Who maintains the toolchains and cloud servers for AI teams?

  • Who handles any other infrastructure- or system-specific issues?

  • So who is the one with the cape? 


DevOps the Superhero!


The answer to all this is again DevOps. But it's not the same DevOps from the application development era that fits in here! This is another beast, and it needs some more superpowers in addition to its core strengths: knowledge of newer tools and practices like Kubeflow, TensorFlow, Google MLOps, Azure AI pipelines, and AWS SageMaker Studio will be required. And it's high time all this knowledge was aggregated and standardised. I will follow up with more soon; until then, enjoy this insightful white paper from google.ai with some research findings along these lines - https://storage.googleapis.com/pub-tools-public-publication-data/pdf/aad9f93b86b7addfea4c419b9100c6cdd26cacea.pdf