
AI Infrastructure Engineer

Kooya Inc

35K–65K PHP
Full-time

About the Role:

We are looking for a highly technical AI Engineer / MLOps Specialist to lead a core infrastructure migration. We are transitioning from managed AI services to a fully self-hosted, open-source AI architecture on AWS to reduce operating expenses and increase data privacy and control.

You will be responsible for the end-to-end pipeline: building automated data ingestion systems, managing a vector database, provisioning AWS GPU servers, and ensuring an open-source Large Language Model (LLM) runs securely 24/7.

Note: If your AI experience is limited to calling managed APIs (like OpenAI or Anthropic), this role is not for you. We need someone who knows how to allocate GPU VRAM, optimize inference speeds, and manage bare-metal Linux servers.
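As a sense check for the VRAM allocation mentioned above, a back-of-the-envelope estimate of weights plus KV cache can be sketched as follows (the formula is standard; the example model shape is an assumption, roughly Llama-3-8B-like):

```python
def vram_estimate_gib(params_b: float, bytes_per_param: float,
                      n_layers: int, n_kv_heads: int, head_dim: int,
                      seq_len: int, batch: int, kv_bytes: float = 2.0) -> float:
    """Rough lower bound on serving VRAM: model weights + KV cache.

    KV cache = 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes.
    Ignores activations, CUDA context, and allocator fragmentation, so real
    usage is higher -- budget headroom or quantize.
    """
    weights = params_b * 1e9 * bytes_per_param
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * kv_bytes
    return (weights + kv_cache) / 1024**3

# Example: an 8B model in FP16 with 32 layers, 8 KV heads (GQA),
# head_dim 128, at 8k context, batch 1:
# weights ~= 14.9 GiB, KV cache ~= 1.0 GiB -> ~15.9 GiB before overhead,
# i.e. too tight for a 16 GiB GPU without quantization.
```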

What You Will Build (Core Responsibilities):

  • Infrastructure Migration: Transition existing generative AI workflows off managed cloud APIs and onto self-hosted open-source models (e.g., Llama 3, Qwen, Gemma) hosted on AWS EC2 GPU instances.
  • Data Ingestion & Scraping: Build and maintain robust, automated Python pipelines and scrapers that run daily to extract unstructured data from external web sources.
  • Vector Database Management: Clean, chunk, and embed the extracted text into a Vector Database (preferably PostgreSQL + pgvector). Implement strict "upsert" logic to ensure daily updates do not create duplicate vectors.
  • Local LLM Serving: Deploy the open-source model using optimized inference engines (e.g., vLLM, Ollama, llama.cpp). Apply quantization techniques (GGUF, AWQ) where necessary to maximize hardware efficiency and prevent Out of Memory (OOM) crashes.
  • Backend Integration: Wrap the RAG (Retrieval-Augmented Generation) pipeline in a secure, high-concurrency REST API (FastAPI) to serve the frontend application.
  • Cloud Security: Secure the AWS EC2 environment using proper VPC routing, IAM roles, and Security Groups.
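The duplicate-prevention requirement in the vector-database bullet above can be sketched with a deterministic chunk hash as the conflict key; the table and column names here are assumptions, and the SQL targets PostgreSQL + pgvector via `ON CONFLICT ... DO UPDATE`:

```python
import hashlib

# Hypothetical schema: doc_chunks has a UNIQUE index on chunk_hash, so the
# daily scrape updates existing rows in place instead of inserting duplicates.
UPSERT_SQL = """
INSERT INTO doc_chunks (chunk_hash, source_url, content, embedding)
VALUES (%s, %s, %s, %s)
ON CONFLICT (chunk_hash) DO UPDATE
  SET content = EXCLUDED.content,
      embedding = EXCLUDED.embedding;
"""

def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Naive paragraph-based chunker; a real pipeline would chunk semantically."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current.strip())
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current.strip():
        chunks.append(current.strip())
    return chunks

def chunk_key(source_url: str, chunk: str) -> str:
    """Deterministic key: same source + same text -> same hash, so a
    re-scraped, unchanged chunk hits the ON CONFLICT path."""
    return hashlib.sha256(f"{source_url}\n{chunk}".encode()).hexdigest()
```

The key design choice is hashing the chunk content rather than using an auto-increment ID: identical re-scraped text maps to the same row, while changed text produces a new hash and a fresh row (stale rows can then be swept by source URL).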

What You Must Have (Requirements):

  • AI / MLOps: Proven experience downloading raw model weights (Hugging Face) and serving them locally on GPUs. Deep understanding of LLM memory requirements (KV cache, VRAM allocation).
  • Cloud Infrastructure: Hands-on experience with AWS, specifically spinning up, configuring, and securing persistent Linux EC2 VMs.
  • Data Engineering / Python: Strong Python skills. Experience with modern web scraping libraries (Playwright, BeautifulSoup, Scrapy) and handling messy HTML data.
  • RAG Architecture: Strong understanding of embedding models, semantic chunking, and vector databases.
  • DevOps: Highly proficient in Docker, specifically containerizing GPU-accelerated applications (NVIDIA Container Toolkit).
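For the GPU containerization requirement above, a minimal sketch of serving a quantized model with vLLM's OpenAI-compatible server, assuming NVIDIA drivers and the NVIDIA Container Toolkit are already installed on the EC2 host (the image tag and model name are illustrative):

```shell
# --gpus all requires the NVIDIA Container Toolkit; the HF cache mount
# avoids re-downloading model weights on every container restart.
docker run --gpus all --rm -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.90
```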

Bonus Points If You Have:

  • Experience actively migrating off managed APIs to self-hosted Small Language Models (SLMs).
  • Advanced RAG experience (Cross-encoder re-ranking, Hybrid Search).
  • Experience with Caddy, Nginx, or similar tools for reverse-proxying secure API endpoints and managing SSL certificates.
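For the reverse-proxy bonus item, Caddy makes the TLS side nearly configuration-free: a minimal Caddyfile (domain and upstream port are assumptions) that automatically provisions and renews certificates while proxying to the FastAPI service might look like:

```
api.example.com {
    reverse_proxy 127.0.0.1:8000
}
```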


Perks

  • Competitive Salary
  • WFH every Friday
  • HMO after 3 to 6 months
  • Mid-Shift Schedule
  • With Leave Credits
  • Leaves convertible to Cash
  • OT & Holiday Pay
  • Hybrid Work Setup: BGC, Taguig City