Web Data Engineer
Summary
Our client is seeking a Web Data Engineer who is passionate about extracting structured insights from unstructured Web data. This engineer will design, build, and maintain scalable Web scraping pipelines to gather healthcare provider information from multiple online sources, using agent-based automation frameworks including Firecrawl and advanced Python scraping libraries.
Responsibilities
- Develop, deploy, and maintain robust Web scraping pipelines for collecting healthcare provider data.
- Work with agentic frameworks (e.g., Firecrawl) to automate dynamic data extraction workflows.
- Use tools such as Selenium to extract and parse structured and unstructured Web data.
- Ensure data accuracy, completeness, and freshness through validation, deduplication, and error-handling processes.
- Collaborate with data engineers to integrate scraped data into our existing data pipelines and storage systems.
- Monitor scraping performance and troubleshoot issues caused by site structure changes, blocking mechanisms, or throttling.
- Follow best practices for ethical and compliant data collection.
Qualifications
- 3+ years of professional experience in Python-based Web scraping or data engineering, preferably in a SaaS-based environment.
- Strong proficiency with Python and libraries such as Selenium, Beautiful Soup, or Playwright.
- Familiarity with agentic scraping frameworks (e.g., Firecrawl) or autonomous browser-based extraction systems.
- Experience handling large-scale scraping, asynchronous requests, and data normalization.
- Working knowledge of data storage formats and systems (e.g., JSON, Parquet, SQL, or Cloud databases).
- Strong problem-solving skills and ability to debug complex scraping workflows.
- Solid understanding of Web protocols, HTML structures, and REST APIs.
- Bachelor's degree in Data Science, Computer Science, Statistics, Mathematics, or a related quantitative field.
Preferred Qualifications
- Experience with Cloud-based data pipelines (Databricks).
- Knowledge of healthcare provider data or healthcare data standards.
- Familiarity with AI-driven or LLM-powered data collection frameworks.