AI Infrastructure Engineer, San Diego, CA (On-site) |
Email: [email protected] |
From: Jaya Krishna, 3K Technologies [email protected]
Reply to: [email protected]

AI Infrastructure Engineer
Location: On-site (San Diego, CA)
Duration: Long term

Job Summary:
We are seeking an experienced AI Infrastructure Engineer to set up and manage GPU hardware in our data center and optimize it for AI workloads and high-performance computing. The role involves designing, implementing, and maintaining the infrastructure that supports large-scale machine learning models and AI applications. The ideal candidate has a strong background in GPU architecture, data center operations, and cloud infrastructure, with hands-on experience configuring and managing high-performance GPUs for AI.

Key Responsibilities:
Design and implement scalable GPU-based infrastructure for AI/ML workloads in a data center environment.
Configure, install, and maintain GPU clusters and nodes, ensuring optimal performance and resource allocation.
Set up GPU hardware, firmware, and software layers (drivers, plus libraries and frameworks such as CUDA, cuDNN, and TensorRT).
Collaborate with AI/ML teams to understand workload requirements and tailor infrastructure for performance and efficiency.
Monitor and manage GPU performance, resource usage, and scalability to support AI operations.
Implement solutions for GPU orchestration and job scheduling (e.g., Kubernetes, Slurm).
Ensure network connectivity, security, and redundancy for seamless GPU operations in the data center.
Troubleshoot hardware and software issues related to GPUs, and support system upgrades and maintenance.
Optimize power consumption, cooling, and resource utilization in the data center.
Document infrastructure setup, configurations, and standard operating procedures.

Required Skills and Qualifications:
Bachelor's degree in Computer Science, Electrical Engineering, or a related field.
3+ years of experience in GPU infrastructure design, implementation, and management in data centers.
Expertise with GPU hardware (e.g., NVIDIA, AMD), parallel computing, and AI frameworks (e.g., TensorFlow, PyTorch).
Strong knowledge of GPU programming models such as CUDA, and experience with GPU performance tuning.
Familiarity with cloud-based infrastructure (AWS, GCP, Azure) and hybrid cloud architectures.
Experience with containerization and orchestration tools (Docker, Kubernetes) and distributed computing.
Knowledge of networking protocols, data center infrastructure, power management, and cooling systems.
Excellent troubleshooting, problem-solving, and communication skills.

Preferred Qualifications:
Experience with AI/ML workload optimization and deep learning pipeline support.
Knowledge of storage solutions and distributed file systems for AI datasets.
Familiarity with automation tools (Ansible, Terraform) for infrastructure provisioning and management.
Certifications in data center management or cloud platforms.

Thanks & Regards
Jaya Krishna
1114 Cadillac Ct, Milpitas, CA 95035
www.3ktechnologies.com | [email protected]
Gmail: Jayakrishnatalasila | Yahoo: jaya3kt
Analytics | BI | Big Data | Cloud | Software Engg.
t: +1 (408) 713-6640
Tue Oct 29 22:21:00 UTC 2024 |