I’m building a production-grade public-records ETL pipeline. The system needs to use a headless browser to interact with a public government website, extract public information, normalize it, and load it into Postgres on Supabase. The final deliverable should run as a Dockerized Cloud Run service with logging, monitoring, and a simple runbook.
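To make the scope concrete, here is a minimal sketch of the end-to-end flow described above, assuming Python; the function names are illustrative stubs, not an agreed design.

```python
# Minimal end-to-end skeleton of the pipeline described above.
# Function bodies are stubs; names are illustrative, not an agreed design.

def extract() -> list[str]:
    """Drive the headless browser and return raw HTML for each result page."""
    raise NotImplementedError

def transform(html_pages: list[str]) -> list[dict]:
    """Parse the raw HTML and normalize it into clean record dicts."""
    raise NotImplementedError

def load(records: list[dict]) -> None:
    """Upsert the normalized records into Postgres (Supabase)."""
    raise NotImplementedError

def run() -> None:
    """Single ETL pass: extract -> transform -> load."""
    load(transform(extract()))

if __name__ == "__main__":
    run()
```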
Core Responsibilities
- Implement the browser-automation workflow using Playwright
- Handle rate limiting and request pacing intelligently (see the pacing sketch after this list)
- Design & build ETL components (extract, clean, transform, load)
- Implement Postgres schema + ingestion routines
- Dockerize the solution
- Create CI/CD deployment pipeline (Cloud Run preferred)
- Set up structured logs + error tracking
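For the rate-limiting point above, a minimal sketch of paced, headless retrieval using Playwright's Python sync API; the URL, page count, and delay bounds are placeholders, not the real target site or agreed values.

```python
# Minimal sketch of paced, headless retrieval with Playwright (Python sync API).
# RECORDS_URL and the delay bounds are placeholders, not the real target.
import random
import time

from playwright.sync_api import sync_playwright

RECORDS_URL = "https://example.gov/records"  # placeholder URL
MIN_DELAY_S, MAX_DELAY_S = 2.0, 5.0          # polite pacing between page loads


def fetch_pages(max_pages: int = 10) -> list[str]:
    """Load paginated result pages with a headless browser, pacing each request."""
    html_pages: list[str] = []
    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        page = browser.new_page()
        for page_no in range(1, max_pages + 1):
            page.goto(f"{RECORDS_URL}?page={page_no}", wait_until="networkidle")
            html_pages.append(page.content())
            # Randomized delay so the crawl does not hit the site at a fixed rate.
            time.sleep(random.uniform(MIN_DELAY_S, MAX_DELAY_S))
        browser.close()
    return html_pages
```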
Tech Stack
- Python or Node
- Playwright
- Postgres (Supabase)
- Docker
- Cloud Run
- GCP logging/monitoring (see the structured-logging sketch below)
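On logging/monitoring: Cloud Run forwards stdout to Cloud Logging, which parses one-JSON-object-per-line output into structured entries, with "severity" and "message" treated as special fields. A minimal sketch, assuming Python; the extra field names are illustrative.

```python
# Minimal structured-logging sketch. Cloud Run forwards stdout to Cloud Logging,
# which parses one-JSON-object-per-line entries; "severity" and "message" are
# recognized special fields, everything else lands in jsonPayload.
import json
import sys


def log(severity: str, message: str, **fields) -> None:
    """Emit a single JSON log line that Cloud Logging can index and filter on."""
    entry = {"severity": severity, "message": message, **fields}
    print(json.dumps(entry), file=sys.stdout, flush=True)


# Example usage inside the ETL loop (values are illustrative):
log("INFO", "page extracted", url="https://example.gov/records?page=3", rows=250)
log("ERROR", "load failed", table="public_records", error="connection timeout")
```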
Engagement Model
- Each module has a fixed scope + acceptance criteria
- If the first module goes well, long-term modular projects follow
Example Milestones
- Milestone 1: Browser automation module + proof-of-concept retrieval
- Milestone 2: ETL transform pipeline
- Milestone 3: Postgres ingestion + schema (see the ingestion sketch after this list)
- Milestone 4: Dockerized Cloud Run deployment
- Milestone 5: Documentation + reliability checklist
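For Milestone 3, a minimal schema-plus-idempotent-upsert sketch, assuming psycopg 3 and a Supabase connection string in a SUPABASE_DB_URL environment variable; the table and column names are illustrative, not the final schema.

```python
# Minimal schema + idempotent-ingestion sketch, assuming psycopg 3 and a Supabase
# connection string in the SUPABASE_DB_URL environment variable. The table and
# column names are illustrative, not the final schema.
import os

import psycopg

DDL = """
CREATE TABLE IF NOT EXISTS public_records (
    record_id  text PRIMARY KEY,
    agency     text,
    title      text,
    filed_at   date,
    source_url text,
    loaded_at  timestamptz NOT NULL DEFAULT now()
);
"""

UPSERT = """
INSERT INTO public_records (record_id, agency, title, filed_at, source_url)
VALUES (%(record_id)s, %(agency)s, %(title)s, %(filed_at)s, %(source_url)s)
ON CONFLICT (record_id) DO UPDATE SET
    agency     = EXCLUDED.agency,
    title      = EXCLUDED.title,
    filed_at   = EXCLUDED.filed_at,
    source_url = EXCLUDED.source_url,
    loaded_at  = now();
"""


def load_records(rows: list[dict]) -> None:
    """Create the table if needed, then upsert rows so re-runs stay idempotent.

    Each row is a dict keyed by column name; filed_at should be a datetime.date.
    """
    # The connection context manager commits on clean exit, rolls back on error.
    with psycopg.connect(os.environ["SUPABASE_DB_URL"]) as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)
            cur.executemany(UPSERT, rows)
```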
Please answer the following screening questions (required for your application to be considered):
- Have you used a tool called "multi-login"?
- What is the most difficult web crawling, data extraction, or web scraping job you have done, and how did you solve the challenges?
- What is the biggest data extraction pipeline you have built in terms of size (gigabytes, terabytes, etc.), and how did you organize all that data?