Performance engineer
Company | Writer |
---|---|
Location | New York, NY, USA |
Salary | Not provided |
Type | Full-Time |
Degrees | Bachelor’s, Master’s |
Experience Level | Expert or higher |
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related field (Master’s preferred)
- 10+ years of experience in performance engineering, with a focus on large-scale distributed systems
- 2+ years of experience working with AI/ML technologies
- Proven experience in performance testing, profiling, and analysis of complex software systems
- Deep understanding of NLP architectures, training, and inference
- Experience with vector databases and search technologies
- Experience with cloud computing platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes)
- Strong programming skills in Python
- Experience with performance analysis tools (e.g., profilers, debuggers, monitoring tools)
Responsibilities
- Define and implement performance engineering strategies for our Generative AI full stack, including services, applications, LLMs, RAG pipelines, and related infrastructure
- Lead performance testing, profiling, and analysis efforts to identify and resolve performance bottlenecks
- Establish and maintain performance benchmarks and SLAs for critical AI services
- Provide technical leadership and mentorship to performance engineering team members
- Analyze and improve LLM inference performance, including latency, throughput, and resource utilization
- Develop and implement strategies for LLM capacity planning and scaling
- Collaborate with AI researchers to evaluate and improve LLM model architectures and training techniques for performance
- Optimize LLM inference through techniques such as quantization, distillation, and optimized kernel implementations
- Design and implement performance tests for RAG pipelines, including retrieval, ranking, and generation components
- Identify and optimize performance bottlenecks in RAG systems, such as database queries, vector search, and document processing
- Evaluate and optimize RAG system architectures for scalability and efficiency
- Tune vector databases for optimal recall and latency
- Collaborate with infrastructure teams to optimize hardware and software configurations for AI workloads
- Evaluate and recommend new technologies and tools for performance monitoring and analysis
- Develop and maintain performance dashboards and reports to track key metrics
- Optimize GPU utilization and memory management for LLM inference
- Work closely with AI researchers, software engineers, and product managers to ensure performance requirements are met
- Communicate performance findings and recommendations to stakeholders at all levels
- Stay up-to-date with the latest developments in Generative AI and performance engineering
Preferred Qualifications
No preferred qualifications provided.