January 6, 2025
Kubernetes Development Environment for ML in Amazon EKS

Background

The client uses AI to generate photos for the fashion industry. Its team combines extensive experience in computer vision, AI, media, and usable enterprise products, and is committed to transforming fashion visuals through synthetic media.

The Challenge

The client faced two challenges in building a more efficient and cost-effective development environment for ML. First, their previous solution, which used GPU instances in GCP, was too expensive and too slow. Second, they needed a way for their developers to carry out development work and training experiments in a more streamlined manner.

Our Solution

To address these challenges, OpsGuru created a development environment for ML using EKS and GPU instances on Amazon Web Services (AWS). This provided the client with a faster and more efficient development environment than they previously had. The developers were able to SSH into the pods to carry out the development process and could use Kubernetes jobs to carry out training experiments.

OpsGuru provisioned the environment with Terraform, which automated environment creation and kept the configuration consistent across the entire environment. This allowed OpsGuru to deploy quickly while minimizing the risk of errors. EKS and GPU instances on AWS provided the infrastructure needed for training ML models.

To implement the solution, OpsGuru took the following steps:

Step 1: Use Terraform to automate environment creation

OpsGuru used Terraform to automate the process of creating the ML development environment on AWS. This allowed the client to create and manage the environment in a more efficient and cost-effective way.

Step 2: Set up the EKS cluster

OpsGuru set up an EKS cluster using a combination of managed AWS services and open-source tools. This provided a scalable and reliable platform for running ML workloads.

Step 3: Configure GPU instances

To accelerate the training process and reduce costs, OpsGuru configured GPU instances for use in the development environment, allowing the client’s development team to train models more quickly and efficiently.
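
As an illustration of Steps 1 to 3, the sketch below uses Python to emit a Terraform JSON configuration for an EKS managed node group backed by GPU instances. The case study does not publish OpsGuru's actual Terraform code, so everything specific here is an assumption made for the example: the instance type `g4dn.xlarge`, the `AL2_x86_64_GPU` AMI type, the scaling limits, the `workload` label, and the cluster name, role ARN, and subnet IDs passed in.

```python
import json


def gpu_node_group(cluster_name, node_role_arn, subnet_ids,
                   instance_type="g4dn.xlarge", max_nodes=2):
    """Build a Terraform JSON document declaring a GPU-backed EKS managed node group.

    All default values are hypothetical; the source does not state which
    instance types or scaling limits were used.
    """
    return {
        "resource": {
            "aws_eks_node_group": {
                "gpu": {
                    "cluster_name": cluster_name,
                    "node_group_name": f"{cluster_name}-gpu",
                    "node_role_arn": node_role_arn,
                    "subnet_ids": subnet_ids,
                    # GPU instance type and a GPU-enabled AMI (assumed values).
                    "instance_types": [instance_type],
                    "ami_type": "AL2_x86_64_GPU",
                    # Keep the group small by default to limit GPU spend.
                    "scaling_config": {
                        "desired_size": 1,
                        "min_size": 0,
                        "max_size": max_nodes,
                    },
                    # Label that workloads could use to target GPU nodes.
                    "labels": {"workload": "ml-gpu"},
                }
            }
        }
    }


if __name__ == "__main__":
    # Placeholder identifiers; replace with real values before use.
    config = gpu_node_group(
        cluster_name="ml-dev",
        node_role_arn="arn:aws:iam::123456789012:role/example-node-role",
        subnet_ids=["subnet-aaaa", "subnet-bbbb"],
    )
    # Saved as e.g. gpu_nodes.tf.json, Terraform can read this alongside .tf files.
    print(json.dumps(config, indent=2))
```

Generating the configuration from one function keeps the node group definition identical across environments, which is the consistency benefit the case study attributes to Terraform.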

Step 4: Set up a development environment

OpsGuru set up the development environment, including pods and Kubernetes jobs, to enable the development team to carry out ML experiments. The development environment was designed to be flexible and scalable, allowing the team to run multiple experiments in parallel.
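
To make the idea of parallel experiments concrete, here is a minimal sketch that builds one Kubernetes Job manifest per experiment, each requesting a GPU. It is not the client's actual setup: the image name, the `train.py` script, the `--lr` flag, and the job naming scheme are hypothetical, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is running on the GPU nodes.

```python
import json


def training_job(name, image, command, gpus=1):
    """Return a batch/v1 Job manifest that runs one training experiment on GPU."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "labels": {"app": "ml-experiment"}},
        "spec": {
            # Do not retry failed experiments automatically.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        "command": command,
                        # GPU request; requires the NVIDIA device plugin (assumed).
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                }
            },
        },
    }


def experiment_jobs(image, learning_rates):
    """Create one Job per learning rate so the experiments can run in parallel."""
    return [
        training_job(
            name=f"train-exp-{i}",
            image=image,
            # Hypothetical training entry point and flag.
            command=["python", "train.py", f"--lr={lr}"],
        )
        for i, lr in enumerate(learning_rates)
    ]


if __name__ == "__main__":
    jobs = experiment_jobs("example.com/ml/trainer:latest", [0.1, 0.01, 0.001])
    # Wrap the Jobs in a List object so they can be applied together.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": jobs}, indent=2))
```

Because each experiment is an independent Job, the scheduler can place them on separate GPU nodes as capacity allows, for example by piping the output to `kubectl apply -f -`.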

The Result

The project was a success. The company’s CEO expressed satisfaction with the results, stating, “I really enjoyed working together and very happy with the results.” By leveraging OpsGuru’s expertise in cloud-native technologies and extensive experience with the AWS platform, the company achieved its goal of an efficient and cost-effective development environment for ML, including:

  • A cost-effective and scalable environment by leveraging Terraform automation and utilizing GPU instances on the AWS platform.
  • Faster model training times, thus allowing them to iterate more quickly on their AI models.
  • A scalable and flexible development environment that could accommodate multiple experiments in parallel.
  • A more streamlined and efficient workflow for their developers, without the high costs and slow performance they had experienced in the past.

As a result of this project, the client was able to train its AI models more quickly and cost-effectively, thus gaining a competitive advantage in the fashion industry.