FinOps and AI – Do we need our own AI Discipline in FinOps or is it just a part of the game?

FinOps Consultant | Microsoft Advisor

With the growing reliance on Artificial Intelligence (AI) in enterprises, the impact on cloud resource consumption is enormous. As AI workloads require substantial computational power, it raises the question: Is FinOps (Financial Operations) for AI a distinct discipline, or is it just a natural extension of existing practices?

At the start, organizations most likely don't have much data on how to implement AI, roughly the same situation as when the first customers moved from the on-premises world into the cloud. The focus is therefore the domain Quantify Business Value, and here mainly, as a starting point, getting the capability "Planning and Estimating" right. To get the first business case ready, FinOps can support with the cost per resource, which requires the organization to understand its current cloud usage and cost. The "project" will need to feed the FinOps team information about which type of AI usage the corporation is most likely looking for.

There is a huge difference between simply running a SaaS product such as Microsoft Copilot or Genesis and finding a use case where users optimize their own workloads, and implementing and "training" your own GenAI model, or finding use cases where machine learning solves some unique issues.

In addressing this, we can consider two main perspectives:

  1. FinOps for AI (FinOps4AI): As AI systems consume vast amounts of resources, integrating FinOps best practices is essential for managing costs and energy consumption. This requires continuous reassessment of AI applications to optimize their financial and operational performance.
  2. AI for FinOps (AI4FinOps): AI can enhance FinOps processes by automating repetitive tasks, predicting cloud resource needs, and identifying cost-saving opportunities. By leveraging AI, businesses can shift their focus from manual cost management to strategic initiatives.

For this discussion, we will focus on the FinOps4AI approach—understanding how AI-driven workloads influence cloud consumption and costs—while leaving the optimization of cloud consumption through AI tools to cloud service providers (CSPs) and third-party solutions.

Key FinOps Domains for AI Workloads

The core issue when dealing with AI workloads in the cloud revolves around understanding and managing costs. Three primary challenges arise:

  1. Avoiding Costs in the Early Stages of AI Setup: When setting up AI in an organization, it is critical to ensure that costs are minimized from the start. Without proper planning, businesses may face runaway expenses as they experiment with different models and tools.
  2. Maintaining Control Over Cloud Consumption: AI adoption can skyrocket rapidly, even when historical data about consumption is limited. Organizations need to develop strategies to monitor and manage cloud resources effectively, even in the early stages of AI implementation.
  3. Forecasting AI Costs Despite Uncertainty: AI use cases often come with ambiguity and are built on assumptions. This makes it difficult to accurately predict future costs, as the maturity of AI initiatives may not match the organization's overall FinOps practice.

Understanding AI Cloud Resource Consumption

In many organizations, AI is not just a single technology initiative—it cuts across multiple departments and stakeholders. Therefore, the success of AI depends largely on the value it delivers to the business. However, estimating AI-related costs is a challenge, especially when historical data is scarce. This situation is similar to when organizations first transitioned from on-premises data centers to cloud environments.

At the outset of AI adoption, the focus within FinOps should be on Quantifying Business Value, specifically honing in on Planning and Estimating capabilities. A solid first business case is crucial. The FinOps team can assist by calculating costs for cloud resources and helping the organization comprehend its current cloud usage and cost structure.

Here are some key considerations for FinOps teams when managing AI workloads:

  1. Defining the Type of AI Usage: The type of AI use case the company plans to implement greatly affects cloud costs. For instance, using a pre-built SaaS tool like Microsoft Copilot or Google's AI-powered services only requires calculating the cost per user and comparing it against potential productivity gains.
  2. Custom AI Implementations: In contrast, more customized AI projects, such as building and training a proprietary Generative AI model, involve much greater complexity. These projects require larger amounts of data, often necessitating extensive storage and significant processing power. The choice between using pre-packaged, serverless AI services from cloud providers or running a dedicated GPU cluster is crucial in determining costs.
  3. AI Training vs. Inference Costs: One of the major cost drivers in AI projects is the distinction between training and inference. Training AI models requires large datasets, considerable storage, and intense computational power, typically through GPUs or TPUs (Tensor Processing Units). These resources can be costly to maintain over time. Inference, on the other hand, involves using trained models to make predictions or generate outputs, which can also consume significant resources depending on the scale of the operation.
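The training-versus-inference split above can be sketched as a simple cost model. All rates and usage figures below are invented assumptions for illustration, not real CSP list prices:

```python
# Rough monthly cost model for an AI workload, split into training,
# inference, and storage. Every rate here is an illustrative assumption.

GPU_HOURLY_RATE = 3.00   # assumed $/hour for one training GPU
INFERENCE_RATE = 0.0004  # assumed $ per 1,000 inference tokens
STORAGE_RATE = 0.023     # assumed $ per GB-month of object storage

def training_cost(gpus: int, hours: float) -> float:
    """Cost of a training run: GPU count x duration x hourly rate."""
    return gpus * hours * GPU_HOURLY_RATE

def inference_cost(tokens_per_month: float) -> float:
    """Cost of serving a trained model, billed per 1,000 tokens."""
    return tokens_per_month / 1_000 * INFERENCE_RATE

def storage_cost(dataset_gb: float) -> float:
    """Cost of keeping the training dataset in object storage."""
    return dataset_gb * STORAGE_RATE

# Hypothetical scenario: 8 GPUs for a 72-hour fine-tuning run,
# 50M inference tokens per month, and a 500 GB training dataset.
total = (training_cost(8, 72)
         + inference_cost(50_000_000)
         + storage_cost(500))
print(f"Estimated monthly cost: ${total:,.2f}")
```

Even a toy model like this makes the cost structure visible: in this scenario the training run dominates, which is typical early in a project, while inference grows with adoption and can overtake training later.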

Managing Cloud Usage for AI Workloads

As organizations adopt AI at a rapid pace, controlling cloud usage becomes increasingly important. Without historical usage data, companies may struggle to gauge future consumption, leading to unpredictable costs. To manage cloud usage effectively, organizations can implement the following strategies:

  1. Resource Tagging and Allocation: FinOps practitioners should ensure that AI workloads are tagged and categorized separately. This allows for precise tracking of AI-related cloud resources and enables cost allocation across departments or projects.
  2. Right-Sizing and Auto-Scaling: AI workloads can fluctuate dramatically depending on the model’s complexity and the stage of the project (training vs. inference). Implementing right-sizing practices ensures that only necessary resources are provisioned, while auto-scaling allows for flexibility in adjusting resources as demand changes.
  3. Cloud Cost Monitoring Tools: CSPs offer monitoring tools to track real-time cloud usage, such as AWS Cost Explorer, Azure Cost Management, and Google Cloud’s Billing Reports. By monitoring consumption, organizations can identify cost spikes or inefficiencies early and take action to reduce waste.
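The tagging-and-allocation strategy above can be sketched in a few lines: group billing-export line items by an AI project tag so AI spend is tracked separately from the rest. The record layout and the `ai-project` tag key are simplified stand-ins for a real CSP billing export:

```python
# Minimal sketch of tag-based cost allocation for AI workloads.
# Records mimic a (heavily simplified) cloud billing export.
from collections import defaultdict

billing_records = [
    {"service": "GPU VM",         "cost": 412.50, "tags": {"ai-project": "genai-poc"}},
    {"service": "Object Storage", "cost": 38.20,  "tags": {"ai-project": "genai-poc"}},
    {"service": "Web App",        "cost": 95.00,  "tags": {}},  # untagged, non-AI
    {"service": "Inference API",  "cost": 120.75, "tags": {"ai-project": "ml-fraud"}},
]

def allocate_by_tag(records, tag_key="ai-project"):
    """Sum cost per tag value; untagged spend lands in 'unallocated'."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["tags"].get(tag_key, "unallocated")] += rec["cost"]
    return dict(totals)

print(allocate_by_tag(billing_records))
```

The "unallocated" bucket is the useful part in practice: a growing unallocated share is an early signal that AI resources are being provisioned without the tagging discipline the FinOps team depends on.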

Forecasting Cloud Costs for AI

The uncertainty surrounding AI projects makes accurate cost forecasting difficult. This is especially true in the early stages, where AI models are constantly being refined and adjusted. However, several forecasting techniques can help FinOps teams anticipate costs:

  1. Scenario-Based Forecasting: In situations where AI use cases are undefined, scenario-based forecasting can be employed. This involves creating multiple forecasts based on different possible outcomes, such as a best-case, worst-case, and likely scenario. Each scenario can help estimate the range of costs an organization may face.
  2. Historical Data from Similar Projects: When possible, organizations can use historical data from other AI or cloud projects to inform forecasts. While the exact details of AI projects may differ, similar cloud consumption patterns can provide a baseline for estimating costs.
  3. Collaboration with Data Science Teams: AI cost forecasting should involve close collaboration between FinOps practitioners and data science teams. Data scientists can provide insights into the expected complexity of AI models, data storage needs, and computational requirements. These inputs are crucial for building more accurate financial models.
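Scenario-based forecasting from point 1 can be reduced to a small calculation: weight each scenario's cost by an assumed probability to get an expected cost, and report the best-to-worst range alongside it. The figures and weights below are purely illustrative:

```python
# Scenario-based forecast sketch: best-, likely- and worst-case monthly
# cost estimates combined into a probability-weighted expected cost.
# All numbers are assumptions an organization would replace with its own.

scenarios = {
    "best":   {"monthly_cost": 5_000,  "probability": 0.2},
    "likely": {"monthly_cost": 12_000, "probability": 0.6},
    "worst":  {"monthly_cost": 30_000, "probability": 0.2},
}

# Expected cost: sum of cost x probability across scenarios.
expected = sum(s["monthly_cost"] * s["probability"] for s in scenarios.values())
low = min(s["monthly_cost"] for s in scenarios.values())
high = max(s["monthly_cost"] for s in scenarios.values())

print(f"Expected monthly cost: ${expected:,.0f} (range ${low:,}-${high:,})")
```

Reporting the range together with the expected value matters here: with immature AI use cases, the spread between best and worst case is often the more honest number to put in front of stakeholders.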

Cloud Resources Commonly Used in AI Workloads

FinOps teams need to understand the specific cloud resources AI workloads typically consume. The following are some of the most commonly used cloud resources in AI projects:

  • Compute Resources:
    • GPUs (Graphics Processing Units): Necessary for training deep learning models.
    • TPUs (Tensor Processing Units): Specialized hardware for AI model training, available in Google Cloud.
    • vCPUs (Virtual Central Processing Units): Used for less resource-intensive tasks like inference or general computing.
  • Storage Resources:
    • Object Storage (e.g., Amazon S3, Google Cloud Storage): For storing large datasets required for AI model training.
    • Block Storage (e.g., Azure Managed Disks): For faster access to frequently used data during training or inference.
    • Database Services (e.g., AWS DynamoDB, Google Bigtable): For structured data storage and retrieval during model development.
  • Networking Resources:
    • Content Delivery Networks (CDNs): To efficiently deliver AI-powered services to end users.
    • Load Balancers: To distribute workloads evenly across compute resources during inference.
  • AI-Specific Services:
    • Amazon SageMaker, Google AI Platform, and Azure Machine Learning: Managed services that provide pre-built tools for developing, training, and deploying AI models.
    • AutoML Tools: For businesses that prefer using automated machine learning solutions without needing deep technical expertise.

As AI continues to proliferate in business environments, understanding the impact of AI workloads on cloud resource consumption is critical. By applying FinOps principles to AI initiatives, organizations can avoid runaway costs, control cloud consumption, and develop accurate forecasts despite uncertainties. FinOps for AI may not require an entirely separate discipline, but it certainly demands a specialized approach, given the unique challenges AI workloads present.

Human Factor in AI Impacting FinOps from an ACM perspective

As long as we empower humans, the technology will follow. This sounds obvious; however, in reality, we often rationalize bringing technology into corporations using a top-down approach. We have seen many failures in implementing technology, and these failures will persist as long as we ignore the fact that any technology introduced into a human-driven environment must account for the non-logical factors of how we react to change.

Adopting cloud technology and FinOps practices can help organizations drive innovation, increase efficiency, and reduce costs. However, making a successful transition to the cloud requires an effective adoption change management process.

The same process that applies to implementing a FinOps practice can and should be used when implementing Artificial Intelligence. The impact on FinOps will therefore mostly lie in avoiding additional cost caused by bringing in AI without proper use cases that are deeply rooted in the DNA of the company. The greatest benefits AI can bring to an organization are a) changing and automating administrative processes, and b) fulfilling tasks where AI strengths such as pattern recognition in big data play the main role, and where humans would fail because we are not built for them. While there are plenty of tools for b), the approach for a) must be a human-first approach.

Conclusion

There is no way around a proper Adoption Change Management process when implementing AI. But even before that, you have to identify and create the AI use cases, preferably together with the people most impacted by the change: the employees in the field of change.
