
MLOps Infra Engineer @ Bentonville, Arkansas OR Remote, USA

Location: Bentonville, Arkansas OR Remote 

Number of days onsite: 2

Must Have Skills

Kubernetes (On Prem/Cloud) 5+ Yrs of Exp

Docker 5+ Yrs of Exp

Programming Lang (Python, Node, Golang, or bash) 5+ Yrs of Exp (At least 2)

Seldon Core, MLflow, Istio, Jaeger, Ambassador, Triton, PyTorch, TensorFlow/TF Serving, or similar tools (4 of these): 4+ Yrs of Exp

Good to Have Skills

At least 2 of the following: distributed computing and deep learning technologies such as Apache MXNet, CUDA, cuDNN, TensorRT

Mandatory if Applicable

Domain Experience (if any): Retail

If yes, provide dates and details of the account/project.

We are looking for someone who progressed from a developer role to an admin or infrastructure engineer role. We are not looking for candidates from a data science background.

UST Global is looking for a highly energetic and collaborative MLOps Engineer with experience building enterprise solutions on web/cloud platforms. The ideal candidate should have experience with Seldon Core, MLflow, Istio, Jaeger, Ambassador, Triton, PyTorch, and TensorFlow/TF Serving, as well as experience with distributed computing and deep learning technologies such as Apache MXNet, CUDA, cuDNN, and TensorRT. The candidate should be a proven self-starter with a demonstrated ability to make decisions and accept responsibility and risk. Excellent written and verbal communication skills, with the ability to collaborate effectively with domain experts and the IT leadership team, are key to success in this role. The team uses Kubernetes (K8s) for its MLOps pipeline orchestration. Kubernetes is a powerful and intricate system with many moving parts, and it requires knowledge of related technologies such as Docker, container networking, load balancing, and more. Hands-on experience is essential, as the role involves deploying and managing containerized applications, creating Kubernetes objects, configuring networking and storage, and troubleshooting issues that arise in the system.

Key Responsibilities:

Work with Walmart's AI/ML Platform Enablement team within the eCommerce Analytics team. The broader team is currently on a transformation path, and this role will be instrumental in enabling the broader team's vision.

Work closely with data scientists to help take models to production and maintain them there.

Deploy and configure Kubernetes components for the production cluster, including API Gateway, Ingress, Model Serving, Logging, Monitoring, Cron Jobs, etc. Improve the model deployment process for ML engineers (MLEs) for faster builds and simplified workflows.

Be a technical leader on various projects across platforms and a hands-on contributor to the entire platform's architecture.

System administration, security compliance, and internal tech audits

Responsible for leading operational excellence initiatives in the AI/ML space, including efficient use of resources, identifying optimization opportunities, forecasting capacity, etc.

Design and implement different flavors of architecture to deliver better system performance and resiliency.

Develop capability requirements and a transition plan for the next generation of AI/ML enablement technology, tools, and processes to enable Walmart to efficiently improve performance at scale.

Tools/Skills (hands-on experience is a must):

Kubernetes administration: ability to create, maintain, scale, and debug production Kubernetes clusters as a Kubernetes administrator, and in-depth knowledge of Docker.

Ability to transform designs from the ground up and lead innovation in system design

Deep understanding of data center architectures, networking, storage solutions, and system performance at scale

Have worked on at least one Kubernetes cloud offering (EKS/GKE/AKS) or on-prem Kubernetes (native Kubernetes, Gravity, MetalK8s)

Programming experience in Python, Node, Golang, or bash

Ability to use observability tools (Splunk, Prometheus, and Grafana) to look at logs and metrics to diagnose issues within the system.

Experience with Seldon Core, MLflow, Istio, Jaeger, Ambassador, Triton, PyTorch, and TensorFlow/TF Serving is a plus.

Experience with distributed computing and deep learning technologies such as Apache MXNet, CUDA, cuDNN, TensorRT

Experience hardening a production-level Kubernetes environment (memory/CPU/GPU limits, node taints, annotations/labels, etc.)

Experience with Kubernetes cluster networking and Linux host networking

Experience scaling infrastructure to support high-throughput data-intensive applications

Background with automation and monitoring platforms, MLOps, and configuration management platforms

Education & Experience:

5+ years of relevant experience in roles with responsibility over data platforms and data operations dealing with large volumes of data in cloud-based distributed computing environments.

Graduate degree preferred in a quantitative discipline (e.g., computer engineering, computer science, economics, math, operations research).

Proven ability to solve enterprise level data operations problems at scale which require cross-functional collaboration for solution development, implementation, and adoption.

The ideal candidate should have experience with Seldon Core, MLflow, Istio, Jaeger, Ambassador, Triton, PyTorch, and TensorFlow/TF Serving (a plus), as well as experience with distributed computing and deep learning technologies such as Apache MXNet, CUDA, cuDNN, and TensorRT. The team uses Kubernetes (K8s) for its MLOps pipeline orchestration. Kubernetes is a powerful and intricate system with many moving parts, and it requires knowledge of related technologies such as Docker, container networking, load balancing, and more. Hands-on experience is essential, as the role involves deploying and managing containerized applications, creating Kubernetes objects, configuring networking and storage, and troubleshooting issues that arise in the system.

Comments for Suppliers: TensorFlow, TF Serving, CUDA

Additional Skills: Istio, Ambassador, Seldon Core, Triton

Warm Regards,

Bhaskar Kumar | Senior Recruiter

3S Business Corporation

16700 HOUSE HAHL RD BLDG 6B, Cypress, TX-77433

An E-Verified Company 

Tue Jan 16 20:57:00 UTC 2024