Note: You must have hardcore, hands-on DevOps/MLOps experience; everything else is optional. BuzzTrail is an AI sales companion: realistic video avatars conduct live product demos with real-time voice conversations. We're early, profitable, and building the foundation to scale to hundreds of concurrent meetings. We run two Nx monorepos: a main platform (React, Express, Python/FastAPI) and a RAG knowledge base. We have startup credits on AWS, Azure, and GCP. Azure is the primary candidate for infrastructure, but we route AI services across all three clouds to burn credits instead of cash.

You're not inheriting someone else's infrastructure — you're designing it from scratch. The decisions you make now become the platform a unicorn runs on.

How This Works

~10 hours/week, async-first. You own the plan and the pace. No standups, no ticket grooming, no process overhead eating your hours. One 30-minute weekly sync with the CTO, everything else via async comms. You decide what to work on each week within the agreed quarterly priorities.
Equity conversation welcome. We want you invested in the outcome, not just billing hours.
Path to grow. Fractional now, with expanded scope as we scale. This can become whatever makes sense for both sides.

The Job

Phase 1: SOC 2 Foundation (Immediate Priority)

We're pursuing SOC 2 Type II certification using Vanta's Workstreet Sprint — the compliance platform and audit framework are already chosen. Your job is implementing the technical controls, not picking the tool. The infrastructure question remains: migrate to Azure, harden what we have on Railway/Supabase/Cloudflare, or some combination.

Infrastructure strategy: 7 production services run on Railway today. Azure is the primary candidate if we consolidate (AKS/Container Apps, Azure Database for PostgreSQL, Functions, Front Door, Key Vault), but staying on the current PaaS is on the table if it meets compliance requirements. You own the recommendation.
IaC: whatever runs on Azure is defined in Terraform or Bicep. PaaS stays managed through its own tooling.
Audit controls: centralized logging, immutable audit trails, change management, evidence collection automation.
Access & encryption: least-privilege policies, MFA enforcement, key rotation for 19+ vendor API keys, secrets management (Key Vault), encryption at rest and in transit.
Cost optimization: maximizing credits across Azure (infra + AI), AWS (AI services), and GCP (AI services).
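
To make the "immutable audit trails" control concrete, here is a minimal sketch of a hash-chained, append-only log in Python. It is an illustration, not our implementation: the AuditLog class, the verify helper, and the actor/action/resource fields are all hypothetical names, and timestamps are left out to keep the example deterministic. A hash chain only makes tampering detectable; real immutability would still depend on write-once storage and access controls.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, payload):
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log in which every entry commits to the one before it."""

    def __init__(self):
        self._entries = []

    def append(self, actor, action, resource):
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS_HASH
        payload = {
            "seq": len(self._entries),
            "actor": actor,        # hypothetical field names
            "action": action,
            "resource": resource,
        }
        entry = dict(payload, prev_hash=prev_hash, hash=_digest(prev_hash, payload))
        self._entries.append(entry)
        return entry

    def entries(self):
        # Return copies so callers cannot mutate the stored records.
        return [dict(e) for e in self._entries]


def verify(entries):
    """Recompute the chain; any edited, removed, or reordered entry breaks it."""
    prev_hash = GENESIS_HASH
    for entry in entries:
        payload = {k: entry[k] for k in ("seq", "actor", "action", "resource")}
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, payload):
            return False
        prev_hash = entry["hash"]
    return True
```

An auditor-facing job could run verify over an exported copy of the log on a schedule and attach the result as evidence.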

Phase 2: ML Data Infrastructure (Next Quarter)

Building the datasets and pipelines a future data scientist needs to fine-tune models and improve conversation quality.

Data collection: every voice conversation generates transcripts, RAG queries, LLM inputs/outputs, embeddings, STT/TTS events, and tool calls. Capture, structure, and store them for ML training and evaluation.
LLM/AI service management: cost tracking, failover, and routing across providers. We hold credits on Azure (Azure OpenAI, Azure AI Speech, Azure AI Search), AWS (Bedrock, Transcribe, Polly), and GCP (Vertex AI, Cloud Speech/TTS), and want automatic failover between clouds.
RAG pipeline: web scraping → chunking → embeddings → vector search. Multiple ingestion sources, namespace partitioning, hybrid search, dedup.
Dataset management: currently Langfuse. Whether we stay, move to Hugging Face datasets, Azure ML, or a combination is your call.
PII handling: anonymization, redaction, and access controls baked into the pipeline.
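
As a rough illustration of redaction "baked into the pipeline", the sketch below masks email addresses and phone numbers in a transcript before it is stored. The redact function and both patterns are assumptions made for this example. Regexes only catch well-formed identifiers and will produce some false positives (long numeric strings can look like phone numbers); names, addresses, and free-form PII would need an NER-based or managed detection service on top.

```python
import re

# Illustrative patterns only; production redaction needs broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

PATTERNS = (("EMAIL", EMAIL), ("PHONE", PHONE))


def redact(text):
    """Replace matches with a [LABEL] placeholder and count what was masked.

    Returns the redacted text plus per-label counts, so the pipeline can
    log how much was removed without logging the PII itself.
    """
    counts = {}
    for label, pattern in PATTERNS:
        text, found = pattern.subn(f"[{label}]", text)
        counts[label] = found
    return text, counts
```

Running this at ingestion, before transcripts reach the training dataset, keeps raw identifiers out of everything downstream; the counts give a cheap signal for monitoring redaction volume.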

Phase 3: Observability & CI/CD Hardening (Ongoing)

CI/CD: GitHub Actions or Azure DevOps across two Nx monorepos (~40 projects): lint, typecheck, test, build, deploy. Nx affected builds, caching, deployment gates and rollback.
Real-time: voice agent autoscaling, video avatar lifecycle, Supabase Realtime for slide control, meeting bot recording archival.
Monitoring: Langfuse (LLM tracing), Grafana (operational dashboards), Sentry (error tracking). May consolidate or move to Azure Monitor depending on your infra recommendations.

You Have

Strong Azure experience — you know what to use and what to skip. Multi-cloud AI service routing (Bedrock, Vertex AI, Azure OpenAI) is a plus.
IaC (Terraform or Bicep).
SOC 2 Type II — at least one audit cycle, ideally built controls from scratch.
ML dataset infrastructure — data collection pipelines for model training, evaluation, or fine-tuning.
LLM/AI service management — multiple providers, cost tracking, failover, model routing.
CI/CD pipelines for monorepos or complex build systems.

Bonus

Cloudflare Workers, KV, Durable Objects.
Real-time voice/video infrastructure (LiveKit, WebRTC).
Multi-cloud AI service management (routing across Azure, AWS, GCP).
Azure OpenAI Service (managed endpoints, PTU provisioning, content filtering).

Roadmap (What You're Building Toward)

Staging environment with separate database and vendor staging keys.
LLM abstraction layer with failover and cost-optimized routing.
DSPy per-client compiled models with composite evaluation metrics.
Multi-framework compliance (SOC 2, ISO 27001, ISO 42001, GDPR, CCPA, EU AI Act).
Load testing for 100+ concurrent meetings.
Public REST API.
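
For a sense of what the LLM abstraction layer with failover and cost-optimized routing might start from, here is a minimal sketch. Everything in it is hypothetical: the Provider dataclass, the cost_per_1k_tokens field, and the route function are illustration names, and each provider's call would wrap a real SDK client in practice. It tries providers from cheapest to most expensive and falls through to the next one on any error.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Provider:
    name: str
    cost_per_1k_tokens: float      # hypothetical pricing field
    call: Callable[[str], str]     # wraps a real SDK client in practice


class AllProvidersFailed(Exception):
    """Raised when every provider errored; args[0] maps name -> error."""


def route(prompt: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try providers cheapest-first and return (provider_name, response)."""
    errors: Dict[str, str] = {}
    for provider in sorted(providers, key=lambda p: p.cost_per_1k_tokens):
        try:
            return provider.name, provider.call(prompt)
        except Exception as exc:   # broad on purpose: any failure triggers failover
            errors[provider.name] = repr(exc)
    raise AllProvidersFailed(errors)
```

A real version would add timeouts, retry budgets, per-provider credit balances, and tracing, but the core ordering and fall-through logic stays about this small.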

Current Stack