Building Data Pipelines for Large Model Training
Large models, whether Large Language Models, Large Vision Models, or Large Multimodal Models, will continue to be the biggest product differentiator for the next 10 years. While LLMs are reaching late maturity, we are still early in diffusion and multimodal models.
Data is the biggest differentiator in AI, and data pipelines have become a critical part of building this moat. They are used mainly in training and sometimes in inference.
In the training phase, they have three main jobs - ingestion, improving quality, and feeding the training pipeline - all at scale. When many model versions are trained at the same time, the data pipelines also need to manage resource sharing while providing predictable outcomes for each training pipeline.
During inference, data pipelines can be used to sanitize prompts, classify prompts and augment them into a DSL, perform RAG, and more.
In this post, I will cover the data pipelines we built at Salesforce, at CompanyA (name under NDA), and at an AI startup. We chose each stack based on the needs of the team and product. This is an evolving space with lots of options.
Why Data Pipelines Matter
Data is the most crucial differentiator in AI. Well-designed data pipelines not only enhance model performance but also establish a competitive edge. Here’s a quick overview of the three critical tasks data pipelines perform during the training phase:
Ingestion: Collecting data from various sources.
Preprocessing: Improving quality, transforming, enriching, converting to DSL, and tokenizing.
Feeding Training Models: Delivering data efficiently at scale.
Managing these tasks at exabyte levels, especially with multiple model versions, presents unique challenges.
Ingestion
The large volume of data usually comes from various sources - databases, data lakes, and streaming content - varying in schema, quality, modality, availability, etc. All this data needs to be converted to a shared schema, checked against policies, catalogued, and saved in a data lake.
There are non-functional requirements for ingestion that make the simple logic more complicated. The three main areas are scale, distributed data, and cost.
Scale: With exabytes of data, ingestion needs to be efficient. Two optimizations make things easier - data schema and parallelization. I cannot overstate the importance of a good data schema design - it is far too costly to reprocess all the data again. For parallelization, sharding optimizes storage, enables duplicate detection, and improves reliability during failures. Spark and Flink are good open source frameworks.
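To make the sharding point concrete, here is a minimal PySpark sketch, assuming each record carries a stable identifier; the paths, the doc_id column, and the 1024-shard count are illustrative, not the actual schema.

```python
# Minimal sketch: derive a stable shard id from a document id so the same
# record always lands in the same shard, which keeps duplicate detection and
# retries local to a shard. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ingest-shard").getOrCreate()

raw = spark.read.json("s3://my-datalake/raw/")  # assumed raw landing zone

sharded = raw.withColumn("shard", F.abs(F.xxhash64("doc_id")) % 1024)

# Partitioned Parquet lets a failed shard be re-written in isolation.
(
    sharded.repartition("shard")
    .write.mode("overwrite")
    .partitionBy("shard")
    .parquet("s3://my-datalake/ingested/")
)
```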
Cost: Your data freshness requirement drives your ingestion rate for offline data. If your training pipeline relies on daily snapshots, consider ingesting data during off-peak hours to cut costs by using idle compute resources. Good pipeline design helps lower costs by striking the right balance between resilience and flexibility. Furthermore, the ingestion rate should be tuned to training needs - there is no point in spending millions to ingest data that training runs never use. Cost can also be brought down by maintaining a control plane catalogue that tracks what has already been ingested and which datasets training runs actually consume.
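As a concrete example of off-peak scheduling, here is a minimal Airflow sketch, assuming a daily snapshot job; the ingest_snapshot callable, DAG id, and 2 AM schedule are placeholders, not the actual pipeline.

```python
# Minimal sketch: run snapshot ingestion during off-peak hours so it rides on
# idle compute. ingest_snapshot is a placeholder for the real entry point.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_snapshot(**context):
    # Placeholder: pull the previous day's snapshot and land it in the data lake.
    print(f"Ingesting snapshot for {context['ds']}")

with DAG(
    dag_id="offpeak_snapshot_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",  # 2 AM daily, during off-peak hours
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest_snapshot", python_callable=ingest_snapshot)
```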
Distributed data: Training typically happens in a single data center (but let me know if that's changing!). Because of this, your training data needs to be co-located in that center. However, since data sources are often spread out, your data pipelines should be distributed and located close to the data sources. This improves performance and removes bad data before it is transferred across regions.
At Salesforce, there were thousands of different data sources. For this, we employed an Exporter/Collector pattern. Data is then passed through the preprocessing pipeline that we see next, put into Kafka for time-sensitive processing, and eventually into S3 for the next stage.
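A minimal sketch of the exporter side of this pattern, assuming the kafka-python client; the read_source generator, topic name, and broker address are hypothetical.

```python
# Minimal Exporter sketch: read records near the source, drop obviously bad
# rows, and publish to Kafka for time-sensitive processing downstream.
import json
from kafka import KafkaProducer  # assumes the kafka-python package

def read_source():
    # Placeholder for reading from a local database, log, or API near the source.
    yield {"doc_id": "123", "text": "example record", "source": "crm"}

producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for record in read_source():
    if not record.get("text"):  # drop bad data before it crosses regions
        continue
    producer.send("ingest.raw", value=record)

producer.flush()
```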
On the other hand, CompanyA had a few massive data sources. We used Spark and Airflow to standardize offline sources, and built custom services on serverless for streaming.
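A minimal PySpark sketch of standardizing one offline source into the shared schema; the source path and the column mapping are hypothetical.

```python
# Minimal sketch: map one massive offline source onto the shared schema
# before it lands in the data lake. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("standardize-source").getOrCreate()

crm = spark.read.parquet("s3://sources/crm/")  # one of the offline sources

standardized = crm.select(
    F.col("record_id").cast("string").alias("doc_id"),
    F.col("body").alias("text"),
    F.lit("crm").alias("source"),
    F.to_timestamp("created_at").alias("event_time"),
)

standardized.write.mode("append").parquet("s3://my-datalake/standardized/")
```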
Preprocessing
The preprocessing stage improves data quality and enriches it. Any transform that is needed prior to training is performed here, and this is where the majority of the resources in the data pipeline are consumed. At a high level, we take the following steps:
Cleaning: Remove any badly formatted or incomplete data, and apply ML heuristics to detect and remove low-quality data (e.g. AI-generated content)
Deduplication: Sources and ingestion often produce duplicates. LLMs are sensitive to duplicates because repetition in the data translates to statistical importance. If the data contains Alex as a girl's name 1,000 times and as a boy's name only once, the LLM will encode Alex as a girl's name (a minimal sketch follows this list).
Anonymization: Remove any Personally Identifiable Information
Content Characterization: Apply labels to categorize data
Alignment: Using principles from Meta's LIMA (Less is More for Alignment) paper, curate data based on human preference. This can be folded into the labelling step, but using alignment models
DSL Conversion: In some domains, converting input into a DSL or custom vocabulary improves model performance and speeds up training/inference
Tokenization: This is usually the last step, using some encoding. The choice of encoding depends on the model and the domain
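Here is the deduplication sketch referenced above: a minimal exact-match version that hashes normalized text and keeps the first occurrence. Real pipelines usually add near-duplicate detection (e.g. MinHash), which is not shown.

```python
# Minimal exact-deduplication sketch: hash normalized text, keep first occurrence.
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants hash the same.
    return " ".join(text.lower().split())

def deduplicate(records):
    seen = set()
    for record in records:
        digest = hashlib.sha256(normalize(record["text"]).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield record

docs = [
    {"text": "Alex is a girl's name."},
    {"text": "alex is a  girl's name."},  # trivial variant, dropped
    {"text": "Alex is a boy's name."},
]
print(list(deduplicate(docs)))
```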
Depending on the quality of the ingested data, some of the steps above can be performed during ingestion. This makes sense if data sources are distributed and bad data needs to be dropped before being moved centrally.
Feeding Training Models
The last sub-stage of preprocessing is tokenization, sharding/batching, and preparing data for training. This involves converting data into tensors that are directly available to GPUs. Since training can be parallelized by training separate layers, this stage shards and batches data to train different layers. The training system orchestration pulls data from this stage directly via the data plane. The preprocessing control plane manages backpressure from training and coordinates the ingestion rate. The control plane also manages checkpointing to lower the cost of recovering from training failures.
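A minimal sketch of this stage, assuming a toy whitespace tokenizer as a stand-in for the real one; it packs tokens into fixed-length sequences and writes sharded tensor files (SEQ_LEN, the shard size, and the paths are illustrative).

```python
# Minimal sketch: tokenize documents, pack into fixed-length sequences, and
# save sharded tensor files that training orchestration can pull directly.
import os
import torch

SEQ_LEN = 128
DOCS_PER_SHARD = 10_000

def tokenize(text, vocab):
    # Placeholder whitespace tokenizer; a real pipeline would use the model's
    # tokenizer (BPE, SentencePiece, etc.).
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

def flush(token_buffer, path):
    # Trim to a multiple of SEQ_LEN and save a (num_sequences, SEQ_LEN) tensor.
    n = len(token_buffer) // SEQ_LEN * SEQ_LEN
    if n:
        torch.save(torch.tensor(token_buffer[:n], dtype=torch.long).view(-1, SEQ_LEN), path)

def write_shards(docs, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    vocab, buffer, shard_id = {}, [], 0
    for i, doc in enumerate(docs, start=1):
        buffer.extend(tokenize(doc, vocab))
        if i % DOCS_PER_SHARD == 0:
            flush(buffer, os.path.join(out_dir, f"tokens-{shard_id:05d}.pt"))
            buffer, shard_id = [], shard_id + 1
    flush(buffer, os.path.join(out_dir, f"tokens-{shard_id:05d}.pt"))

write_shards(["a tiny example document"] * 20_000, "shards")
```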
At CompanyA, the preprocessing stage was a combination of custom logic over Ray and Spark. With RayDP, we could seamlessly merge preprocessing and the first stage of training in a single control plane. While this is not necessary, it was useful at the time, although it may not be in the future.
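A rough sketch of the RayDP pattern; the resource sizes, paths, and column names are illustrative, and the RayDP/Ray Data APIs have evolved across versions, so treat this as a shape rather than a drop-in implementation.

```python
# Rough sketch: run Spark preprocessing inside a Ray cluster so preprocessing
# and the first training stage share one control plane. Sizes and paths are
# illustrative.
import ray
import raydp

ray.init()

spark = raydp.init_spark(
    app_name="preprocess-on-ray",
    num_executors=4,
    executor_cores=8,
    executor_memory="16GB",
)

df = spark.read.parquet("s3://my-datalake/standardized/")
cleaned = df.dropna(subset=["text"]).dropDuplicates(["doc_id"])

# Hand the Spark result to Ray Data, which can feed Ray Train workers directly.
dataset = ray.data.from_spark(cleaned)

raydp.stop_spark()
ray.shutdown()
```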
Common Challenges in AI Data Pipelines
While building a data pipeline for large model training, several challenges can arise (some covered in detail above):
Distributed Data: Managing data across different regions makes operations hard
Unified Data Schema: Keeping consistency across heterogeneous data types
Resource Sharing: Research teams are constantly training and testing new models. Resources used by data pipelines are shared - for example, GPUs are bulk-reserved by the company on specific accounts. Running these pipelines needs a control plane to manage the pipelines and resources, and to manage state across partial runs during failures.
Checkpointing: Failures are common in long-running jobs like these pipelines. Since the cost of each run is so high, a single failure can mean millions of dollars of wasted resources and a lot of lost time. A good checkpointing system is critical to resume after failures (a minimal sketch follows this list)
Responsible AI (RAI) alignment: RAI is an evolving field. Extensibility and mutability in pipeline configurations are important.
Compliance: Compliance needs often change depending on growth and customer needs. My contrarian advice is to NOT build compliance early. Wait until you need it. These overheads add up and slow you down.
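Here is the checkpointing sketch referenced above: a minimal manifest-based version that records finished shards so a restarted run skips completed work. The local JSON manifest is a stand-in for a proper metadata store.

```python
# Minimal checkpointing sketch: record finished shards in a manifest so a
# restarted pipeline run skips completed work.
import json
import os

MANIFEST = "pipeline_manifest.json"

def load_done() -> set:
    if os.path.exists(MANIFEST):
        with open(MANIFEST) as f:
            return set(json.load(f))
    return set()

def mark_done(shard: str, done: set) -> None:
    done.add(shard)
    with open(MANIFEST, "w") as f:
        json.dump(sorted(done), f)

def process(shard: str) -> None:
    print(f"processing {shard}")  # placeholder for the real work

done = load_done()
for shard in [f"shard-{i:05d}" for i in range(8)]:
    if shard in done:
        continue  # already completed before the failure
    process(shard)
    mark_done(shard, done)
```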
Looking ahead
This is a rapidly changing field with lots of new solutions being developed. The right answer for each company is almost always based on your data and infrastructure. If you are starting from scratch, there are lots of great choices.