Job Description
We are seeking an experienced MLOps/LLMOps Architect with deep expertise in building scalable, production-grade machine learning and generative AI pipelines. The ideal candidate will have strong experience in observability and model performance monitoring, using tools such as Arize to ensure the reliability and scalability of AI/ML models in production. This role requires a strategic mindset and hands-on technical skills to design and implement robust MLOps/LLMOps frameworks that support seamless model deployment, monitoring, and optimization.
Key Responsibilities:
- Architect and Implement MLOps/LLMOps Frameworks:
  - Design and build scalable MLOps/LLMOps pipelines for model training, deployment, monitoring, and retraining.
  - Establish automated CI/CD pipelines to streamline model development and deployment.
- Model Observability and Monitoring:
  - Develop and implement model observability strategies using tools like Arize to track model performance, drift, and bias.
  - Create real-time dashboards and alerts for proactive issue identification and resolution.
- Performance and Scalability:
  - Ensure high availability, low latency, and scalability of deployed models.
  - Optimize model inference and serving using best practices in distributed computing and cloud infrastructure.
  - Manage and optimize compute costs for large-scale Gen AI models by implementing intelligent load balancing, autoscaling, and infrastructure tuning.
- Model Governance and Compliance:
  - Establish frameworks for model versioning, auditing, and explainability to meet regulatory and business requirements.
  - Ensure alignment with Responsible AI and ethical AI guidelines.
- Cross-Functional Collaboration:
  - Partner with data scientists, ML engineers, platform teams, and business stakeholders to align MLOps strategies with business objectives.
  - Provide technical leadership and mentorship to junior team members.
Required Skills and Qualifications:
- Experience: 10 to 15 years of experience in machine learning, MLOps, and AI model deployment in enterprise environments.
- MLOps/LLMOps Expertise: Strong background in MLOps and LLMOps, including model lifecycle management, monitoring, and automation.
- Observability Tools: Proficient in using observability platforms such as Arize, Weights & Biases, TensorBoard, MLflow, or similar tools.
- Cloud Platforms: Experience with cloud-based ML solutions (e.g., AWS, Azure, GCP).
- Programming: Strong programming skills in Python and experience with ML frameworks such as TensorFlow, PyTorch, and Hugging Face.
- Containerization and Orchestration: Hands-on experience with Docker, Kubernetes, and distributed computing frameworks.
- Model Monitoring: Experience in detecting and mitigating model drift, bias, and data quality issues.
- Performance Tuning: Expertise in model optimization, inference acceleration, and efficient resource utilization.