I’m building a production-grade public-records ETL pipeline. The system needs to use a headless browser to interact with a public government website, extract public information, normalize it, and load it into Postgres on Supabase. The final deliverable should run as a Dockerized Cloud Run service with logging, monitoring, and a simple runbook.
Core Responsibilities
- Implement browser-automation workflow using Playwright
- Handle rate limiting and request pacing politely (randomized delays between requests, backoff on errors; see the sketch after this list)
- Design & build ETL components (extract, clean, transform, load)
- Implement Postgres schema + ingestion routines (see the upsert sketch after this list)
- Dockerize the solution
- Create CI/CD deployment pipeline (Cloud Run preferred)
- Set up structured logs + error tracking
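To make Milestone 1 concrete, here is a rough sketch of the kind of browser-automation module I have in mind, using Playwright's sync Python API with polite pacing between requests. The URL, selector, and delay values are placeholders I made up for illustration, not the real target site.

```python
# Hypothetical sketch: fetch listing pages with Playwright and polite pacing.
# BASE_URL, the selector, and the delay window are placeholders, not the real site.
import random
import time

from playwright.sync_api import sync_playwright

BASE_URL = "https://example.gov/records"   # placeholder, not the actual site
MIN_DELAY_S, MAX_DELAY_S = 2.0, 5.0        # assumed polite pacing window


def fetch_record_rows(page_number: int) -> list[str]:
    """Load one results page and return the raw text of each record row."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(f"{BASE_URL}?page={page_number}", wait_until="networkidle")
        rows = page.locator("table.records tr").all_text_contents()
        browser.close()
    return rows


if __name__ == "__main__":
    for n in range(1, 4):
        print(f"page {n}: {len(fetch_record_rows(n))} rows")
        # Randomized delay between requests so we never hammer the server.
        time.sleep(random.uniform(MIN_DELAY_S, MAX_DELAY_S))
```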
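Similarly, a sketch of the ingestion side I expect: one table keyed on the source record ID with idempotent upserts, so re-runs never duplicate rows. The table name, columns, and the SUPABASE_DB_URL environment variable are assumptions; the real schema is part of Milestone 3.

```python
# Hypothetical ingestion sketch: idempotent upsert into a Supabase Postgres table.
# Table name, columns, and SUPABASE_DB_URL are assumptions, not a final schema.
import os

import psycopg2
from psycopg2.extras import execute_values

SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS public_records (
    source_id   text PRIMARY KEY,      -- the site's own record identifier
    name        text NOT NULL,
    filed_at    date,
    raw         jsonb,                 -- original payload kept for auditing
    loaded_at   timestamptz NOT NULL DEFAULT now()
);
"""

UPSERT_SQL = """
INSERT INTO public_records (source_id, name, filed_at, raw)
VALUES %s
ON CONFLICT (source_id) DO UPDATE
SET name = EXCLUDED.name,
    filed_at = EXCLUDED.filed_at,
    raw = EXCLUDED.raw,
    loaded_at = now();
"""


def load_records(records: list[tuple]) -> None:
    """Create the table if needed, then upsert a batch of
    (source_id, name, filed_at, raw) rows.
    Wrap dicts for the raw column in psycopg2.extras.Json()."""
    with psycopg2.connect(os.environ["SUPABASE_DB_URL"]) as conn:
        with conn.cursor() as cur:
            cur.execute(SCHEMA_SQL)
            execute_values(cur, UPSERT_SQL, records)
```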
Tech Stack
- Python or Node
- Playwright
- Postgres (Supabase)
- Docker
- Cloud Run
- GCP logging/monitoring (see the structured-logging sketch after this list)
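For the logging piece: Cloud Run forwards anything the service writes to stdout as single-line JSON into Cloud Logging as structured entries, so a small formatter is usually enough. A minimal sketch; field names other than severity and message are my own choices.

```python
# Minimal structured-logging sketch: one JSON object per line on stdout,
# which Cloud Run forwards to Cloud Logging as structured entries.
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "severity": record.levelname,   # Cloud Logging maps this to the entry's level
            "message": record.getMessage(),
            "logger": record.name,
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def get_logger(name: str = "etl") -> logging.Logger:
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = get_logger()
    log.info("scrape started")
    try:
        raise ValueError("example failure")
    except ValueError:
        log.exception("record parse failed")
```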
Engagement Model
- PAY BY MILESTONES (preferred)
- Each module has a fixed scope + acceptance criteria
- If the first module goes well, long-term modular projects will follow
Example Milestones
- Milestone 1: Browser automation module + proof-of-concept retrieval
- Milestone 2: ETL transform pipeline
- Milestone 3: Postgres ingestion + schema
- Milestone 4: Dockerized Cloud Run deployment
- Milestone 5: Documentation + reliability checklist