Senior Data Engineer
About SCALA.AI & The Role
SCALA.AI is a leading AI-native company building the next generation of intelligent, customer-centric solutions. We are a small but high-performing team of engineers and innovators dedicated to delivering impactful products to our customers and members.
We operate at the bleeding edge of technology, leveraging the latest advancements in AI, machine learning, and modern cloud infrastructure to solve complex, real-world problems. Our culture thrives on ownership, continuous learning, and pushing the boundaries of what’s possible. If you’re excited by massive technical challenges, a fast-paced environment, and the opportunity to make an outsized impact, you’ll fit right in.
Responsibilities: What You'll Build & Own
As a Senior Data Engineer, you will be the core architect of our data infrastructure, responsible for building and optimizing the robust, high-throughput data pipelines that feed our advanced AI and machine learning models. You will ensure data quality, reliability, and security across the entire data lifecycle, enabling our researchers and engineers to innovate at startup speed. This role requires deep technical expertise, independence, and a passion for data excellence at scale.
- Design and Build Data Pipelines: Architect, construct, and manage scalable ETL/ELT pipelines for data ingestion, processing, and transformation, ensuring high availability and fault tolerance.
- Vector Infrastructure for RAG: Build and maintain the specialized data pipelines required for Retrieval-Augmented Generation (RAG). This includes automated document parsing, metadata extraction, and high-performance ingestion into Vector Databases (e.g., Pinecone, Weaviate, Milvus, or pgvector).
- Live Data for MCP: Design and optimize "live" data access layers that support the Model Context Protocol (MCP). You will ensure that AI agents have low-latency, secure access to structured enterprise data and real-time APIs for "agentic" tool-calling and decision-making.
- Real-Time Streaming: Implement and manage real-time data streaming architectures (e.g., Kafka, Kinesis, or Flink) to ensure our AI models are grounded in the most current data available, moving beyond static knowledge bases.
- Optimize Data Architecture: Drive the technical vision for our data warehousing, data lakes, and data streaming platforms, optimizing infrastructure for performance and cost-efficiency on AWS.
- Data Quality & Governance: Implement rigorous data validation, monitoring, and testing frameworks to ensure the accuracy, completeness, and consistency of data used by AI models and business applications.
- Collaboration: Work closely with AI/ML Engineers to bridge the gap between raw data sources and model-ready context.
Required Qualifications
We seek a seasoned Data Engineer with a deep command of modern cloud-native data architectures.
- Bachelor's degree in Computer Science, Engineering, or a related quantitative field.
- 7+ years of experience as a Data Engineer, focused on building and scaling production data systems.
- Expert proficiency in at least one backend language widely used in data engineering, such as Python or Scala.
- Direct experience with Vector Databases and the data engineering challenges unique to RAG (e.g., managing embeddings, indexing strategies, and hybrid search).
- Experience building/maintaining APIs or data services that interface with LLM "tools" or agentic frameworks via protocols like MCP or JSON-RPC.
- Proven, hands-on experience building large-scale data solutions on AWS, utilizing services like S3, Redshift, Kinesis/MSK, Glue, and Lambda.
- Extensive experience with modern data orchestration tools (e.g., Airflow, Prefect, or Dagster).
- Deep expertise in SQL and working with large-scale relational and NoSQL databases (e.g., PostgreSQL, DynamoDB).
Desired Attributes
- You thrive in an early-stage startup environment and can move fast within a small development team, demonstrating a strong bias for action and execution.
- Experience in MLOps data pipelines, including feature store management and providing data infrastructure tailored for training and inference of Large Language Models (LLMs).
- Familiarity with containerization technologies (Docker, Kubernetes) for deploying data services.
- A track record of high independence and excellent communication skills, capable of driving projects and clearly articulating data architecture decisions.
Salary & Benefits
- Competitive base salary, depending on experience and location
- Annual equity awards
Why SCALA.AI
We’re redefining how businesses use AI — with a team that’s fast, fearless, and focused. You’ll play a key role in driving growth across industries and shaping how customers adopt intelligent, agentic technology.
Join us and help write the story of how AI transforms work.
Pay: $100,000.00 - $190,000.00 per year
Work Location: Remote