Job Details

ML/AI Infrastructure Architect at Remote, USA

Email: [email protected]

From:

Akanksha Yadav,

Tek Inspirations LLC

[email protected]

Reply to: [email protected]

Job Description -

ML/AI Infrastructure Architect

Remote

Position Summary:

As an ML Infrastructure Architect, you will play a pivotal role in designing, implementing, and troubleshooting scalable and secure AI infrastructures for our clients. These clients primarily build either single-tenant or multi-tenant ML/AI clusters for applications such as inference or large language models. This role demands not only technical excellence in NVIDIA/Mellanox/Cumulus networks, InfiniBand architecture, and GPU cluster management but also exceptional consultative knowledge and soft skills to understand and meet the unique needs of each client.

Responsibilities:
Architect and deploy robust, scalable GPU clusters for AI and ML workloads across multi-tenant and single-tenant infrastructures.
Design and optimize InfiniBand networks and NVIDIA/Mellanox/Cumulus solutions to meet the high-performance requirements of diverse AI applications.
Engage directly with clients to gather requirements, provide expert advice, and tailor infrastructure solutions that align with their specific AI and ML projects.
Lead collaborative discussions with AI researchers, developers, and client stakeholders to ensure infrastructure capabilities are fully leveraged.
Troubleshoot and resolve complex infrastructure issues, ensuring high availability and performance for all clients.
Continuously assess and integrate new technologies and methodologies to enhance the infrastructure's capabilities and efficiency.
Develop comprehensive documentation and training materials for clients, enhancing their understanding and effective use of the infrastructure.
Conduct workshops and training sessions for clients, focusing on best practices for AI and ML infrastructure utilization.
Uphold the highest standards of data protection and comply with all regulatory requirements, ensuring a secure environment for client data.

Required Qualifications:
Experience with or knowledge of terabyte/petabyte SAN and data pipelines.
Deep knowledge of DGX platform details (e.g., SuperPOD/BasePOD differences).
Proven experience in managing multi-tenant infrastructures, particularly for AI/ML workloads.
Expert knowledge of NVIDIA/Mellanox/Cumulus networking technologies, InfiniBand architecture, and GPU cluster management.
Exceptional problem-solving skills and the ability to adapt solutions to meet individual client needs.
Strong consultative skills with a focus on client engagement and stakeholder management.
Excellent communication and interpersonal skills, with the ability to convey technical concepts to non-technical audiences.
Experience in training and mentorship, with the ability to co-develop and deliver educational content for clients.

Preferred Qualifications:
Certifications relevant to network engineering, system administration, or cybersecurity.
Knowledge of TensorFlow, PyTorch, and other frameworks.
Knowledge of vector, graph, and traditional OLAP/OLTP databases.
Knowledge of Snowflake, Databricks, and other data warehousing products.
Knowledge of applied AI data analysis and traditional data analytics.
Knowledge of non-NVIDIA networking vendors (e.g., Juniper, Cisco, Nokia).
Experience with cloud services and technologies relevant to AI and ML deployment.
Knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes).

Interview Process:
One-phase interview with the EVP of Engineering and Operations, as well as additional engineers, to validate skill set.

Keywords: artificial intelligence machine learning Colorado

Fri Apr 12 03:46:00 UTC 2024
