Note: You must have direct, hands-on DevOps/MLOps experience; everything else is optional.
BuzzTrail is an AI sales companion: realistic video avatars conduct live product demos with real-time voice conversations. We're early, profitable, and building the foundation to scale to hundreds of concurrent meetings. We run two Nx monorepos: a main platform (React, Express, Python/FastAPI) and a RAG knowledge base. We have startup credits on AWS, Azure, and GCP. Azure is the primary cloud candidate for infrastructure, but we route AI services across all three to burn credits instead of cash.
You're not inheriting someone else's infrastructure — you're designing it from scratch. The decisions you make now become the platform a unicorn runs on.
How This Works
- ~10 hours/week, async-first. You own the plan and the pace. No standups, no ticket grooming, no process overhead eating your hours. One 30-minute weekly sync with the CTO, everything else via async comms. You decide what to work on each week within the agreed quarterly priorities.
- Equity conversation welcome. We want you invested in the outcome, not just billing hours.
- Path to grow. Fractional now, with expanded scope as we scale. This can become whatever makes sense for both sides.
The Job
Phase 1: SOC 2 Foundation (Immediate Priority)
We're pursuing SOC 2 Type II certification using Vanta's Workstreet Sprint — the compliance platform and audit framework are already chosen. Your job is implementing the technical controls, not picking the tool. The infrastructure question remains: migrate to Azure, harden what we have on Railway/Supabase/Cloudflare, or some combination.
- Infrastructure strategy: 7 production services on Railway today. Azure is the primary candidate if we consolidate (AKS/Container Apps, Azure DB for PostgreSQL, Functions, Front Door, Key Vault), but staying on current PaaS is on the table if it meets compliance. You own the recommendation.
- IaC: whatever runs on Azure gets Terraform/Bicep. PaaS stays managed through its own tooling.
- Audit controls: centralized logging, immutable audit trails, change management, evidence collection automation.
- Access & encryption: least-privilege policies, MFA enforcement, key rotation for 19+ vendor API keys, secrets management (Key Vault), encryption at rest and in transit.
- Cost optimization: maximizing credits across Azure (infra + AI), AWS (AI services), and GCP (AI services).
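To make the "immutable audit trails" control above concrete, here is a minimal Python sketch of one common approach: a hash-chained, append-only log in which each entry commits to the one before it, so any later edit is detectable. This is an illustration, not BuzzTrail's implementation. The names `AuditEntry`, `append_entry`, `verify_chain`, and `GENESIS_HASH` are hypothetical, and a real deployment would also need write-once storage and access controls.

```python
import hashlib
import json
from dataclasses import dataclass

GENESIS_HASH = "0" * 64  # hypothetical anchor value for the first entry


@dataclass(frozen=True)
class AuditEntry:
    actor: str
    action: str
    timestamp: str   # ISO 8601 string supplied by the caller
    prev_hash: str   # hash of the previous entry (or GENESIS_HASH)
    entry_hash: str  # SHA-256 over this entry's fields plus prev_hash


def _digest(actor: str, action: str, timestamp: str, prev_hash: str) -> str:
    # Serialize with sorted keys so the same fields always hash the same way.
    payload = json.dumps(
        {"actor": actor, "action": action,
         "timestamp": timestamp, "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, actor: str, action: str, timestamp: str) -> AuditEntry:
    # Link the new entry to the hash of the latest one.
    prev_hash = log[-1].entry_hash if log else GENESIS_HASH
    entry = AuditEntry(actor, action, timestamp, prev_hash,
                       _digest(actor, action, timestamp, prev_hash))
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    # Recompute every hash and check every link; any edit breaks the chain.
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry.prev_hash != prev_hash:
            return False
        expected = _digest(entry.actor, entry.action,
                           entry.timestamp, entry.prev_hash)
        if entry.entry_hash != expected:
            return False
        prev_hash = entry.entry_hash
    return True
```

Running `verify_chain` on a schedule and storing the result is one way such a check could feed automated evidence collection.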
Phase 2: ML Data Infrastructure (Next Quarter)
Building the datasets and pipelines a future data scientist needs to fine-tune models and improve conversation quality.
- Data collection: every voice conversation generates transcripts, RAG queries, LLM inputs/outputs, embeddings, STT/TTS events, and tool calls. Capture, structure, and store them for ML training and evaluation.
- LLM/AI service management: cost tracking, failover, and routing across providers. Credits on Azure (Azure OpenAI, Azure AI Speech, Azure AI Search), AWS (Bedrock, Transcribe, Polly), and GCP (Vertex AI, Cloud Speech/TTS), with automatic failover between clouds.
- RAG pipeline: web scraping → chunking → embeddings → vector search. Multiple ingestion sources, namespace partitioning, hybrid search, dedup.
- Dataset management: currently Langfuse. Whether we stay, move to Hugging Face datasets or Azure ML, or use a combination is your call.
- PII handling: anonymization, redaction, and access controls baked into the pipeline.
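As an illustration of the PII handling item above, here is a minimal Python sketch of a redaction step that could run before transcripts are stored for training. It assumes a simple regex approach covering only emails and phone-like numbers. The patterns and the `redact` function are hypothetical, and the phone pattern will also match other long digit runs, so a production pipeline would likely use a dedicated PII detection service or NER model.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> tuple:
    """Replace emails and phone-like numbers with placeholder tokens.

    Returns the redacted text and a per-type count of replacements,
    which can be logged for pipeline monitoring without storing the PII.
    """
    counts = {}
    text, counts["email"] = EMAIL_RE.subn("[EMAIL]", text)
    text, counts["phone"] = PHONE_RE.subn("[PHONE]", text)
    return text, counts
```

For example, `redact("Reach me at jane.doe@example.com or +1 (415) 555-0100.")` returns the text with both values replaced by `[EMAIL]` and `[PHONE]`, plus a count of one for each type.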
Phase 3: Observability & CI/CD Hardening (Ongoing)
- CI/CD: GitHub Actions or Azure DevOps across two Nx monorepos (~40 projects), covering lint, typecheck, test, build, and deploy. Nx affected builds, caching, deployment gates, and rollback.
- Real-time: voice agent autoscaling, video avatar lifecycle, Supabase Realtime for slide control, meeting bot recording archival.
- Monitoring: Langfuse (LLM tracing), Grafana (operational dashboards), Sentry (error tracking). We may consolidate or move to Azure Monitor depending on your infra recommendations.
You Have
- Strong Azure experience — you know what to use and what to skip. Multi-cloud AI service routing (Bedrock, Vertex AI, Azure OpenAI) is a plus.
- IaC (Terraform or Bicep).
- SOC 2 Type II — at least one audit cycle, ideally built controls from scratch.
- ML dataset infrastructure — data collection pipelines for model training, evaluation, or fine-tuning.
- LLM/AI service management — multiple providers, cost tracking, failover, model routing.
- CI/CD pipelines for monorepos or complex build systems.
Bonus
- Cloudflare Workers, KV, Durable Objects.
- Real-time voice/video infrastructure (LiveKit, WebRTC).
- Multi-cloud AI service management (routing across Azure, AWS, GCP).
- Azure OpenAI Service (managed endpoints, PTU provisioning, content filtering).
Roadmap (What You're Building Toward)
- Staging environment with separate database and vendor staging keys.
- LLM abstraction layer with failover and cost-optimized routing.
- DSPy per-client compiled models with composite evaluation metrics.
- Multi-framework compliance (SOC 2, ISO 27001, ISO 42001, GDPR, CCPA, EU AI Act).
- Load testing for 100+ concurrent meetings.
- Public REST API.
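The "LLM abstraction layer with failover and cost-optimized routing" item in the roadmap above could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not a description of the existing codebase. `Provider`, `Router`, and the flat `cost_per_call` figure are hypothetical, and each `call` function stands in for a real provider SDK call (for example Azure OpenAI, Bedrock, or Vertex AI).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Provider:
    name: str                    # label only; no specific SDK is implied
    cost_per_call: float         # hypothetical flat cost used for ordering
    call: Callable[[str], str]   # wraps the real provider call in practice


@dataclass
class Router:
    providers: list
    spend: dict = field(default_factory=dict)  # running cost per provider

    def complete(self, prompt: str) -> tuple:
        """Try providers from cheapest to most expensive; fail over on error."""
        errors = []
        for provider in sorted(self.providers, key=lambda p: p.cost_per_call):
            try:
                result = provider.call(prompt)
            except Exception as exc:  # any provider error triggers failover
                errors.append(f"{provider.name}: {exc}")
                continue
            # Record spend only for the provider that actually served the call.
            self.spend[provider.name] = (
                self.spend.get(provider.name, 0.0) + provider.cost_per_call
            )
            return provider.name, result
        raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Ordering by cost is only one possible policy. With credits spread across three clouds, the sort key could just as well be remaining credit balance or observed latency.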
Current Stack
- Frontend: React 19, Vite 8, TailwindCSS, HeroUI
- Backend: Express 5, FastAPI, Hono (Cloudflare Workers)
- Voice/Video: LiveKit (evaluating alternatives), Deepgram, ElevenLabs
- LLM: Multiple providers (model-agnostic), DSPy, LangChain
- RAG: Pinecone, Firecrawl, OpenAI embeddings
- Database: Supabase (PostgreSQL + Realtime + Auth + Storage)
- Hosting: Railway, Cloudflare, GitHub Pages
- Monitoring: Sentry, Langfuse, Grafana
- Build: Nx, pnpm, Vitest, Playwright, Husky