Full Job Description
Cirrus Lake Solutions is seeking a Senior Data Scientist/Engineer for a client data analytics team supporting a new prospective cohort study that will recruit 200,000 adults in the United States. The study is designed to further investigate the etiology of cancer and its outcomes, which may inform new approaches to precision prevention and early detection. This new cohort study will capitalize on research innovations to advance the field of cancer epidemiology and prevention.
System description
The research system is built primarily on the Google Cloud Platform (GCP), with the goal of maximizing use of GCP's managed services. The primary user interfaces and data collection systems are built on GCP Firestore, which exposes an extensive API, and a set of client-side JavaScript applications that use this API. Collaborating clinical sites and study management contractors also access this API directly from their own internal systems. Data, mostly in the form of participant-specific JSON documents, are created in GCP Firestore and regularly moved into GCP BigQuery tables.
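For orientation, here is a minimal sketch, in R (one of the languages named under the qualifications below), of how an analyst might query those participant documents once they land in BigQuery. Every project, dataset, table, and column name in it is a hypothetical placeholder, not the study's actual schema.

# A minimal sketch, assuming participant documents land in a BigQuery table
# with one JSON document per row. All names below (project, dataset, table,
# column) are hypothetical placeholders.
library(bigrquery)

sql <- "
  SELECT
    JSON_VALUE(doc, '$.participant_id') AS participant_id,
    JSON_VALUE(doc, '$.survey.status')  AS survey_status
  FROM `my-project.my_dataset.participants`
  LIMIT 100
"

tb  <- bq_project_query("my-project", sql)  # run the query in BigQuery
dat <- bq_table_download(tb)                # pull results into an R tibble
head(dat)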
Work description
The successful candidate for this position will assume data engineering, management, and analytical responsibilities for research study data, beginning with their deposition into GCP BigQuery. Your role will be within the client data analytics team. This group will be developing tools to support a large cohort, and the project's expected lifetime exceeds 25 years. This duration imposes deep requirements for software flexibility and modularity, along with avoiding lock-in to public or private standards that may prove transient. Further, since this is a human subjects study, all developers, and the systems they create, must remain cognizant of regulatory compliance with respect to patient protections and pay strict attention to where data are and where data go.
Minimum Qualifications:
· Master’s degree in epidemiology, biostatistics, statistics, data science, a related field, or equivalent practical experience (approximately 5 additional years of experience).
· 5 years of experience with large-scale, multi-source data collection and analysis. Analytical engagements outside of class work while at school may be included.
· Proficiency with Google Cloud Platform (BigQuery, Cloud Scheduler, Cloud Functions, Cloud Build, Cloud Run, Cloud Storage, Cloud Composer, gcloud CLI)
· Experience with R, Python, or JavaScript
· Familiarity with containerization (e.g., Docker)
· Experience using visualization and reporting tools (such as R Shiny, plumber APIs, Posit Connect, RStudio, Quarto, Visual Studio Code)
· Working knowledge of Continuous Integration/Continuous Delivery (CI/CD) with GitHub (repositories, Workflows, Pages, Issues, Projects)
Responsibilities
· Design and maintain ETL pipelines that flatten Firestore (NoSQL) data for analysis in BigQuery (SQL); a flattening sketch follows this list
· Develop and maintain reporting pipelines for survey, recruitment, and biospecimen datasets
· Develop R-backed APIs using plumber; a minimal plumber sketch follows this list
· Maintain Docker containers in which reporting scripts are executed in GCP
· Leverage Google Cloud Platform (GCP) resources to streamline data processing and automation
· Develop and maintain secure GCP to Box.com integrations
· Enhance and optimize QAQC frameworks for the Connect database
· Lead development of researcher-facing data warehouse, ensuring that data are appropriately cleaned, curated, de-identified, and secure
· Guide development of stakeholder-facing data products, e.g., Shiny dashboards; a bare-bones Shiny sketch follows this list
· Partner with DevOps to create and implement agreed-upon data structures
· Contribute to mapping of EHR data to the OMOP Common Data Model and related QAQC
· Drive the implementation of CI/CD processes via GitHub for updating, testing, and delivering reports
· Thoroughly document all ETL processes, pipelines, and decision-making
· Mentor junior analysts, conduct code reviews, and uphold coding standards
· Respond promptly to ad hoc requests for back-end testing and database troubleshooting
· Assist partner sites with access to data and code resources
· Write well-documented, modular, and reusable code according to FAIR principles
· Remain current with cloud computing and data science advancements, particularly in GCP, R, and Python
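To make the first responsibility above concrete, here is a minimal sketch of flattening a nested Firestore-style JSON document into the single-level record a BigQuery row expects. All field names are invented for illustration and do not reflect the study's actual documents.

# A minimal sketch of flattening a nested Firestore-style JSON document into a
# single-level record suitable for a BigQuery row. Field names are invented.
library(jsonlite)

doc <- fromJSON('{
  "participant_id": "p-0001",
  "survey": {"module_a": {"completed": true}, "module_b": {"completed": false}}
}', simplifyVector = FALSE)

# Recursively flatten nested lists, joining keys with "_" (BigQuery-friendly).
flatten_doc <- function(x, prefix = NULL) {
  out <- list()
  for (nm in names(x)) {
    key <- if (is.null(prefix)) nm else paste(prefix, nm, sep = "_")
    if (is.list(x[[nm]])) {
      out <- c(out, flatten_doc(x[[nm]], key))
    } else {
      out[[key]] <- x[[nm]]
    }
  }
  out
}

as.data.frame(flatten_doc(doc))
#   participant_id survey_module_a_completed survey_module_b_completed
# 1         p-0001                      TRUE                     FALSE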
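Likewise, a minimal plumber sketch of the kind of R-backed API endpoint this role would develop. The routes, parameters, and stubbed data are hypothetical, not the study's actual API.

# plumber.R -- a minimal, hypothetical sketch of an R-backed reporting API.
# Run with: plumber::pr("plumber.R") |> plumber::pr_run(port = 8000)
library(plumber)

#* Health check used by the hosting platform (e.g., Cloud Run)
#* @get /health
function() {
  list(status = "ok", time = format(Sys.time(), tz = "UTC"))
}

#* Return recruitment counts by site (stubbed data for illustration)
#* @param site Optional site identifier to filter on
#* @get /recruitment
function(site = NULL) {
  counts <- data.frame(site = c("site_a", "site_b"), enrolled = c(120, 98))
  if (!is.null(site)) counts <- counts[counts$site == site, ]
  counts
}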
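Finally, a bare-bones Shiny sketch of a stakeholder-facing dashboard. The enrollment figures are stubbed for illustration only.

# app.R -- a bare-bones, hypothetical Shiny dashboard sketch with stubbed data.
library(shiny)

enrollment <- data.frame(site = c("site_a", "site_b"), enrolled = c(120, 98))

ui <- fluidPage(
  titlePanel("Recruitment overview (illustrative)"),
  selectInput("site", "Site", choices = enrollment$site),
  plotOutput("bar")
)

server <- function(input, output, session) {
  output$bar <- renderPlot({
    barplot(enrollment$enrolled, names.arg = enrollment$site,
            col = ifelse(enrollment$site == input$site, "steelblue", "grey80"),
            ylab = "Participants enrolled")
  })
}

shinyApp(ui, server)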
Additional details:
Pay is commensurate with qualifications. This is a contract position, arranged through a government contractor, located in Minnesota.