Discover five practical, architect-level tips for building robust, scalable multi-stage recommendation systems. Learn how to balance latency and freshness, optimize offline computations, choose meaningful evaluation metrics, and strategically evaluate your system at every stage, ensuring reliable, high-quality recommendations closely aligned with your business goals.
Introduction
Recommendation systems today shape the information consumed by millions of users every day. As an ML engineer or architect tasked with designing a robust recommendation pipeline, you must navigate numerous trade-offs, architectural choices, and algorithms to deliver relevant, timely, and engaging user experiences at scale. Whether you’re recommending videos, products, events, travel destinations, or social content, your choices directly impact user experience and business outcomes.
I’ve been leading teams building recommendation systems at massive scale for many years now, primarily in the e-commerce and ads space, and have spoken extensively with industry experts across various verticals who’ve built similar systems. One thing that’s clear is that modern recommendation systems are incredibly complex (almost like assembling LEGO blocks). There are countless possible combinations of feature engineering, data pipelines, models, compute/storage methodologies, UX considerations, and more. Every company does this differently, and methodologies and naming conventions often vary widely even between organizations within a single company!
However, for clarity and consistency in this article, as I walk you through various architectural insights, I’ll adopt a widely recognized four-stage pipeline structure (see Figure 1 below): retrieval, filtering, scoring, and ordering. Each stage performs a specific role, helping you balance speed, accuracy, and content freshness. This structure will hopefully serve as a consistent reference point, helping you intuitively understand and apply practical strategies when designing and building your own recommendation systems.

Selecting appropriate algorithms is critical, but building recommendation systems involves much more. You must carefully align technical decisions with practical constraints like latency, scalability, and handling cold-start scenarios.
In this article (Part 1 of this series), I will share five practical, architect-level tips for designing and implementing robust multi-stage recommendation systems.
- Balance freshness and latency: Decide clearly what to compute offline versus online to maintain speed without sacrificing recommendation quality.
- Map algorithms strategically: Select appropriate models for each pipeline stage, balancing complexity, compute requirements, and personalization.
- Blend candidate sources effectively: Combine multiple candidate generation approaches to boost recommendation diversity and recall.
- Refresh candidate pools and features offline wisely: Precompute heavy tasks like embeddings and metadata at optimal cadence to improve efficiency.
- Evaluate rigorously and continuously: Use layered evaluation strategies (offline tests, online experiments, ongoing monitoring) aligned closely with your business objectives.
Let’s dive in!
Tip 1: Balance Latency and Freshness
When designing scalable recommendation systems, you must balance two critical yet conflicting goals: latency (how quickly users receive recommendations) and freshness (how current the recommended content is). Different domains have different needs. For social media or news platforms, freshness is essential. Users expect to immediately see recent posts or trending topics. For e-commerce or video streaming services, a moderate level of freshness may suffice if recommendations appear promptly. Ideally, aim to optimize for both speed and freshness.
Practically, this balance means deciding clearly what components of your recommendation pipeline you can process entirely offline, and what components require real-time computation. Let’s walk through what this means for the standard four-stage pipeline we discussed above.

Retrieval (Candidate Generation)
In the retrieval stage, your primary task is identifying potential candidates for recommendation. Several key computations always occur offline due to their complexity:
- Model training (including embedding models, collaborative filtering models, or deep learning recommenders)
- Creation and periodic updating of indexes (such as approximate nearest neighbor indexes)
- Generation and updating of embeddings for items and users
These heavy tasks typically happen offline, often on a fixed schedule—daily, hourly, or weekly—depending on how rapidly your content or user behavior evolves.
The main decision for you at this stage becomes: should the candidate retrieval itself be done offline or online?
- Offline candidate retrieval:
You generate candidate lists ahead of time for each user or user segment. Because you’ve prepared these lists beforehand, retrieval latency during actual user requests is minimal. However, these lists won’t reflect real-time user interactions or very recent signals, potentially sacrificing freshness and immediate relevance.
- Online candidate retrieval:
You dynamically retrieve candidates at request time, leveraging precomputed embeddings along with fresh, real-time user interactions (recent clicks, searches, or session behavior). This significantly improves freshness and relevance but comes at a higher computational cost, requiring robust infrastructure for retrieval and potentially introducing increased latency.
When making this decision, carefully weigh your application’s priorities:
- Does the substantial freshness boost gained through online retrieval justify higher latency and infrastructure costs?
- Or can you tolerate slightly less freshness in exchange for the efficiency and speed of offline candidate retrieval?
Filtering
The next important decision in your recommendation pipeline involves filtering your retrieved candidates. Compared to retrieval, filtering typically involves less debate about offline versus online processing. Instead, the main question is clearly identifying which filtering steps you can comfortably perform offline, and which require real-time computation.
Filtering usually includes checks for compliance, content categories, eligibility rules, availability of products, and user-specific constraints. Certain filtering conditions change infrequently or remain static over long periods (e.g. compliance rules, content moderation categories, or basic eligibility criteria). These conditions make ideal candidates for offline filtering. By applying these filters ahead of time (during embedding creation or candidate set generation), you minimize latency significantly without sacrificing accuracy or freshness.
On the other hand, some filtering conditions change frequently and unpredictably, such as inventory availability, newly flagged inappropriate content, or recent account changes. In such cases, applying filters at request time—online—is essential.
A good practical strategy is adopting a hybrid filtering approach:
- Pre-filter offline whenever possible (static compliance rules, stable content classifications, persistent eligibility criteria).
- Reserve online filtering strictly for rapidly evolving conditions (availability, real-time moderation status, or dynamic user restrictions).
Scoring (Ranking)
When you get to the scoring stage, your main goal is ranking filtered candidates by relevance. At this stage, you need to carefully consider whether scoring should happen online or if offline scoring is feasible.
In most practical scenarios, scoring needs to be online. This is especially true if your recommendations depend heavily on recent user interactions or real-time session data. For example, think about a social media app. What videos or posts has the user just liked? What items did they recently click or skip? Capturing these real-time signals and rapidly incorporating them into your scoring algorithm ensures personalized and relevant recommendations.
You might wonder: Can scoring ever reasonably happen offline? The answer is yes, but only in specific scenarios. Offline scoring makes sense primarily when recommendations don’t heavily depend on recent user activity or personalization. For example, consider a scenario like recommending similar products on an e-commerce site or similar listings on Airbnb. Here, you’re looking purely at static item-to-item similarities. Because there’s no immediate user context needed, you can score these similarities offline and serve recommendations instantly.
However, for most recommendation tasks, personalization and freshness matter greatly. Here, online scoring is the clear choice. You may worry that online scoring could add latency. But remember: by this point, retrieval and filtering have already narrowed down your candidate set significantly (likely to a few hundred items). So even if you use relatively sophisticated real-time scoring models, the latency remains manageable.
If freshness and latency are both critical, your best approach is using lightweight models for scoring. Keep the online scoring computations fast by using simpler linear models or distilled neural networks. Move more complex processing upstream into retrieval and filtering, so you significantly reduce the number of candidates by the scoring stage.
Ultimately, unless you’re dealing strictly with non-personalized recommendations (e.g., item-to-item recommendations or “Trending movies today”), performing scoring online is your safest bet!
Ordering (Post-Processing)
Ordering is your pipeline’s final step. This is your last chance to make real-time adjustments before recommendations reach the user. Even if your retrieval, filtering, and scoring happen mostly offline, ordering typically must be done online.
At this stage, your actions can range widely. You might run sophisticated online experiments, like multi-armed bandits, to dynamically balance exploring new content against showing proven items. On the simpler end, you may just apply quick checks and adjustments to ensure recommendations remain relevant. For example, you could filter out items the user has already seen, purchased, or that just went out of stock.
Why is this stage so critical for online processing? Think about it: no matter how carefully you prepare candidates offline, certain conditions always change in real time. An item might suddenly become unavailable. A user might complete a purchase seconds earlier. The ordering stage is your opportunity to catch these scenarios and quickly correct your recommendations.
The good news is that ordering usually involves simple, lightweight logic. It rarely introduces significant latency. Even if you perform heavier offline computation at earlier stages, real-time ordering can comfortably accommodate last-second checks and tweaks without slowing down your system.
Real-World Examples
Now that we’ve discussed the trade-offs and architectural considerations for each pipeline stage, let’s illustrate these ideas through concrete use cases.
| Use Case | Retrieval | Filtering | Scoring | Ordering (Post-Processing) |
|---|---|---|---|---|
| User viewing similar listings (e.g., real-estate rental platform) | Fully offline | Fully offline (static conditions rarely change) | Fully offline (scores based purely on stable item-to-item similarity) | Fully online (simple real-time filters, e.g., removing listings already viewed or unavailable) |
| Quick-scrolling video feed (short-video platform) | Mostly online (retrieval blends offline embeddings with real-time user interaction signals) | Mostly online (filters based on dynamic moderation, recent user behavior) | Fully online (ranking heavily influenced by recent interactions, session behavior, and live user feedback) | Fully online (real-time experiments such as explore-exploit via bandits, removing recently viewed videos) |
| Daily personalized news digest (email or app notification) | Fully offline (generate candidates once daily) | Fully offline (apply all static filters once daily) | Fully offline (personalized scoring performed once daily based on user preferences) | Fully offline (ordering determined ahead of time, delivered in batch) |
| Personalized e-commerce homepage recommendations | Mostly offline (candidate sets generated periodically); selective online retrieval if user is actively interacting | Mixed (offline compliance and eligibility, online availability and real-time inventory) | Mostly online (ranking incorporates recent browsing history, recent cart interactions) | Fully online (final tweaks to promotions, filtering out unavailable or recently purchased products) |
| Trending news or event feed (high freshness news platform) | Mixed (offline candidates with frequent real-time refresh to capture breaking events) | Mostly online (real-time filtering for content moderation, event freshness, editorial decisions) | Fully online (ranking prioritizes breaking events, recent user engagement, topic relevance) | Fully online (quick adjustments based on rapidly evolving news context, user preferences) |
Tip 2: Map the Right Algorithms to the Right Stage
When selecting algorithms for each stage of your recommendation pipeline, always keep two key strategic principles in mind:
- The complexity of your chosen model significantly impacts compute resources and latency. More complex models require more resources and time to run, limiting their practicality for online scenarios.
- The number of candidate items you must process directly affects how computationally intensive your models can be. Larger candidate sets restrict you to simpler or approximate algorithms to maintain performance.
Here are practical guidelines and questions to consider at each stage, clearly aligning algorithm choices with these strategic principles:
Retrieval (Candidate Generation)
At this initial stage, your primary challenge is reducing an enormous candidate set (potentially millions or billions) to a manageable number of items (hundreds or thousands). Ask yourself clearly:
- What size is my candidate set, and what latency am I aiming for?
- Am I running retrieval online (real-time), or can it be computed periodically offline?
If you’re running retrieval online (real-time), your priority is speed and scalability. In these scenarios, Approximate Nearest Neighbor (ANN) search techniques (e.g., FAISS or HNSW with lightweight embeddings) or simple collaborative filtering heuristics make sense. These methods quickly identify relevant items without expensive computations.
But what if you can run retrieval offline? If latency requirements permit, you might periodically employ more powerful, deeper two-tower neural networks, large language model embeddings (like GPT-based models), or other sophisticated neural embedding models offline. While these models are computationally expensive, running them offline mitigates latency concerns, allowing higher-quality retrieval sets. Then, at serving time, you simply query these precomputed embeddings or candidates rapidly.
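To make this concrete, here’s a minimal sketch of online ANN retrieval over precomputed embeddings, using FAISS with an HNSW index. The dimensions, index parameters, and random vectors are illustrative placeholders for embeddings your offline pipeline would actually produce:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 64               # embedding dimension (assumption)
num_items = 100_000  # catalog size (assumption)

# Item embeddings, normally produced offline by your embedding model.
item_embeddings = np.random.rand(num_items, d).astype("float32")
faiss.normalize_L2(item_embeddings)  # normalized vectors: L2 ranking ~ cosine

# Build the HNSW index offline; 32 is the number of graph links per node.
index = faiss.IndexHNSWFlat(d, 32)
index.add(item_embeddings)

# At request time: embed the user/session and fetch top-k candidates fast.
user_embedding = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(user_embedding)
distances, candidate_ids = index.search(user_embedding, 500)
print(candidate_ids[0][:10])  # candidate item ids for the next stages
```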
Filtering
In filtering, your main question is straightforward:
- Which filtering conditions rarely change (offline filters), and which frequently change (online filters)?
Static or rarely-changing conditions like compliance rules, geographical eligibility, or content categories can be efficiently handled offline. For dynamic, frequently-changing conditions (such as inventory availability, moderation flags, or user-specific constraints), you’ll naturally choose fast online filtering strategies. Consider rapid database or cache lookups, or lightweight prediction models if necessary. Always balance simplicity and latency to ensure scalability.
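As a rough illustration of that split, here’s a minimal sketch of an online filtering pass. It assumes the offline filters (compliance, content categories, eligibility) were already applied when the candidate pool was built, leaving only fast-changing checks at request time; the Redis keys and flag names are assumptions, not a prescribed schema:

```python
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

def online_filter(candidate_ids):
    """Keep only candidates passing fast-changing, request-time checks."""
    passed = []
    for item_id in candidate_ids:
        # Real-time availability: a cache lookup refreshed by inventory events.
        if r.get(f"stock:{item_id}") == b"0":
            continue
        # Real-time moderation: items flagged since the pool was built.
        if r.sismember("flagged_items", item_id):
            continue
        passed.append(item_id)
    return passed
```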
Scoring (Ranking)
By the time candidates reach your scoring stage, you’ve narrowed them significantly, often down to just dozens or hundreds of items. At this point, ask yourself carefully:
- Given this smaller candidate set, what complexity level can your model practically afford?
- Will adding real-time signals, such as recent clicks, searches, or session behaviors, substantially improve recommendation relevance?
The reduced candidate set at scoring usually enables more sophisticated and computationally intensive models. Unlike retrieval, where you’re constrained by the large candidate pool, scoring is typically where richer personalization happens.
Common practical algorithm choices include:
- Deep Two-Tower Neural Networks:
Deep Two-Tower models generate separate embeddings for users and items. Typically, you compute item embeddings fully offline, as item attributes rarely change rapidly. User embeddings often involve a combination: you precompute historical interaction embeddings offline, while computing additional user embeddings online, capturing recent interactions, session signals, or current context. The main advantage here is that these real-time embeddings dramatically increase freshness and personalization.
- Deep Factorization Machines (DeepFM):
If your data includes many categorical features, like product categories, demographics, or sparse user interaction data, DeepFM can be ideal. It combines the best aspects of traditional Factorization Machines (great at handling sparse, categorical data efficiently) with deep neural networks (capturing complex interactions beyond simple linear relationships). This means DeepFM can model subtle and nonlinear relationships between users and items effectively. It’s especially useful in scenarios where recommendations must reflect intricate user-item dynamics but must still remain fast and efficient at inference time.
- Gradient-Boosted Decision Trees (GBDT) like XGBoost or LightGBM:
GBDTs work very well with structured tabular data and are excellent when you need interpretability or quick inference speeds. These models iteratively build decision trees, each capturing subtle feature interactions to boost prediction accuracy. They’re particularly effective when your features are clearly structured and your system needs to balance accuracy with fast inference.
- Deep Learning Recommendation Models (DLRMs):
DLRMs are specialized deep neural networks tailored specifically for recommendations. They integrate categorical embeddings and numerical features into deep layers that capture deeper, nonlinear interactions between users and items. They’re particularly powerful when you need highly personalized recommendations using detailed user profiles, extensive historical interaction data, or rich product metadata. Because your candidate set at scoring is small, you can typically run these models efficiently without major latency issues.
Remember, at this stage, embedding computations are typically not your primary latency bottleneck, since item embeddings and historical user embeddings are generally precomputed offline. Instead, your main runtime latency comes from inference complexity, particularly when real-time user embeddings are computed on-the-fly. However, because your candidate set is already significantly reduced, this incremental cost remains practical.
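For instance, here’s a minimal sketch of the GBDT option using LightGBM’s learning-to-rank interface. The feature matrix, labels, and group sizes are synthetic stand-ins for the precomputed and real-time features you would actually assemble:

```python
import numpy as np
import lightgbm as lgb  # pip install lightgbm

# Offline training: each row is a (user, item) pair with features such as
# embedding similarity, popularity, and recency; labels are observed clicks.
X_train = np.random.rand(1000, 20)
y_train = np.random.randint(0, 2, size=1000)
group = [10] * 100  # 100 users, 10 scored candidates each

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=200)
ranker.fit(X_train, y_train, group=group)

# Online inference: score only the few hundred candidates that survived
# retrieval and filtering, so latency stays manageable.
X_candidates = np.random.rand(200, 20)
scores = ranker.predict(X_candidates)
ranked_indices = np.argsort(-scores)  # highest-scoring candidates first
```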
Ordering (Post-Processing)
Ordering (or, more precisely, reordering) is the final stage in your recommendation pipeline and your last chance to determine exactly what recommendations users see, and in which sequence. Even though previous stages have already scored and ranked items, ordering is your opportunity to apply final, last-minute adjustments to align recommendations precisely with real-time conditions and business goals.
When deciding how to approach this stage, consider carefully:
- Do you have immediate business priorities, promotions, or constraints to apply?
- Is it beneficial to make real-time adjustments based on users’ ongoing session behavior?
- Do you want to experiment and introduce some exploration into your recommendation results?
Here are practical ways you can handle these final adjustments clearly and effectively:
- Simple Lightweight Rules
At the simplest level, ordering might involve minimal intervention. You might directly use the scores from the scoring stage, applying basic rules just to filter out irrelevant or redundant items. For example, removing items the user has already viewed, items currently unavailable, or products they’ve just purchased. You might also apply basic diversity constraints, ensuring, for example, no two items from the same seller or category appear consecutively. These approaches require minimal computational overhead and are very quick.
- Last-Minute Business Constraints and Adjustments
Often, you need to reorder recommendations slightly due to immediate business needs or promotional priorities. Perhaps you have a product promotion running for the day, or your company wants to prioritize certain content types or product categories. At this stage, you can quickly move these items higher in the rankings without heavy recomputation or complex modeling. Similarly, if your users actively reject certain recommendations during their current session, you can quickly demote or remove related items to improve immediate relevance.
- Multi-Armed Bandit for Exploration and Exploitation
Sometimes you might want to balance showing familiar, reliably relevant content (“exploitation”) with trying out new, less-tested recommendations (“exploration”). This is especially common in short-video feeds or social media platforms, where continuously showing similar content may lead to boredom or limited user engagement. A practical method for achieving this balance is through multi-armed bandit algorithms. Imagine each “arm” of the bandit represents a different recommendation strategy or category. For example, one arm might focus on recommendations very similar to items the user has recently engaged with. Another might feature content specifically from people or creators the user follows or frequently interacts with. A third arm might explore currently trending content, even if the user hasn’t previously shown direct interest. Finally, an additional arm might explicitly surface completely new or experimental items, content the user hasn’t seen before but might find engaging.
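A minimal epsilon-greedy sketch of this idea follows. The arm names mirror the examples above, and the fixed exploration rate is a simplifying assumption (production systems often use Thompson sampling or UCB instead):

```python
import random
from collections import defaultdict

ARMS = ["similar_to_recent", "followed_creators", "trending", "exploratory"]
EPSILON = 0.1  # fraction of requests spent exploring (assumption)

clicks = defaultdict(int)
impressions = defaultdict(int)

def choose_arm():
    # Explore: occasionally try a random arm to keep learning.
    if random.random() < EPSILON or not any(impressions.values()):
        return random.choice(ARMS)
    # Exploit: favor the arm with the best observed click-through rate.
    return max(ARMS, key=lambda arm: clicks[arm] / max(impressions[arm], 1))

def record_feedback(arm, clicked):
    impressions[arm] += 1
    clicks[arm] += int(clicked)
```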
Tying It All Together: Choosing the Right Models
The table below summarizes your practical algorithm choices for each pipeline stage, categorized clearly into Heavy Models and Lightweight Models. Use it as a quick reference to understand your options clearly at each stage.
- Heavy models generally deliver richer, deeper personalization but demand more computational resources and increase latency. They are ideal when you have a generous latency budget or prioritize accuracy and personalization over speed.
- Lightweight models provide faster, more responsive recommendations. Choose these when speed, scalability, or infrastructure costs are critical constraints.
Remember, your ideal algorithmic choice at each stage depends primarily on your application’s specific latency requirements, budget, and personalization goals.
| Stage | Heavy Models (Higher Latency & Compute) | Lightweight Models (Lower Latency & Compute) |
|---|---|---|
| Retrieval | Deep two-tower neural networks, LLM-based embedding models (run offline) | ANN search (e.g., FAISS, HNSW), collaborative filtering heuristics |
| Filtering | Uncommon | Rule-based filters, database/cache lookups, lightweight prediction models |
| Scoring | Deep two-tower networks, DeepFM, DLRMs | GBDTs (XGBoost, LightGBM), linear models, distilled neural networks |
| Ordering | Multi-armed bandits, adaptive experimentation frameworks | Simple rule-based re-ranking, business-constraint adjustments |
Tip 3: Blend Multiple Candidate Generators to Improve Recall and Diversity
Why should you blend multiple candidate generators instead of relying on a single method? Think practically about common scenarios you face:
If you’re running a social media platform, your users expect more than just similar content. They also want trending stories, posts from people they follow, and entirely new topics to explore. Similarly, on an e-commerce website, customers appreciate not only recommendations based on previous purchases but also trending items, complementary products, and curated picks. Even streaming users prefer a balanced mix: familiar shows, trending series, and editorially chosen content.
Relying on just one candidate generator rarely satisfies all these user needs. Blending multiple sources improves your recommendations in three key ways:
- Better recall: Combining different candidate generators helps you find more potentially relevant items that a single source might miss.
- Improved diversity: Different methods naturally emphasize different types of content or user needs, resulting in a broader set of recommendations.
- Greater flexibility: A multi-generator strategy lets different teams within your organization independently build and experiment with their candidate methods. This federated approach is more manageable and agile compared to maintaining one monolithic candidate generator.
When it comes to blending candidate generators, you have a few practical options:
Blend During Retrieval

You can blend candidate sources right at the retrieval stage. In this approach, you explicitly define how much of each candidate source, such as trending content, friends’ interactions, or editorially selected items, you include upfront. For example, you might decide 40% of candidates should come from the user’s past interactions, 30% from globally trending content, and another 30% from curated editorial lists.
Blending during retrieval has a clear advantage. It simplifies your downstream logic because subsequent filtering, scoring, and ordering stages deal with just one unified candidate list.
However, blending too early also comes with drawbacks. If your subsequent filtering or scoring steps disproportionately remove candidates from one particular source (for instance, due to availability, compliance, or quality filters), you may lose the intended diversity. Imagine blending ads into your initial candidate pool, but later filtering rules remove many ads due to user eligibility or geographic restrictions. Suddenly, your carefully planned proportions no longer hold, potentially hurting your recommendation strategy.
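Here’s a minimal sketch of quota-based blending at retrieval, using the 40/30/30 split from the example above; the source names and the shape of the inputs are illustrative assumptions:

```python
QUOTAS = {"past_interactions": 0.4, "trending": 0.3, "editorial": 0.3}

def blend_at_retrieval(sources, total=500):
    """sources: maps source name -> ranked candidate list from that generator."""
    blended, seen = [], set()
    for name, fraction in QUOTAS.items():
        quota = int(total * fraction)
        for item in sources[name][:quota]:
            if item not in seen:  # dedupe items surfaced by multiple sources
                seen.add(item)
                blended.append(item)
    return blended
```

Note that the final proportions can still drift once downstream filtering removes candidates unevenly, which is exactly the caveat described above.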
Blend During Scoring

Another practical way to blend candidate sources is by delaying the combination until the scoring stage. Here, you retrieve each candidate source separately, without explicitly merging them initially. Then, your scoring model evaluates candidates from all these sources together, assigning scores based on personalized signals such as recent user interactions, preferences, and session context.
Blending at scoring provides a distinct advantage: your scoring model typically has richer, more detailed information available about user behavior. For instance, suppose you have one candidate source updated daily, reflecting longer-term trends and stable user interests, while another source updates in near-real-time, capturing immediate trends or breaking content. By blending at scoring, you naturally combine these long-term stable candidates with fresh, real-time candidates seamlessly.
This flexibility is particularly helpful when your candidate retrieval runs at different cadences or computational scales. You don’t need to synchronize offline, periodic retrieval jobs with faster, real-time retrieval processes. Instead, your scoring model simply considers candidates from all these sources together at scoring time, selecting the best matches based on the user’s current needs and interests.
However, blending at scoring also introduces certain challenges. Managing explicit business constraints, such as ensuring minimum exposure for certain categories of content or ads, can become slightly more complex. Make sure your scoring model ranks based on relevance and personalization signals, and add downstream logic if you have fixed quotas to fulfill (e.g., sponsored ads).
Blend During Ordering

A third option is blending candidate sources dynamically at the ordering stage. Instead of explicitly merging earlier, you keep candidate lists separate through retrieval and scoring, and blend them only when final recommendations are selected. This approach often uses real-time techniques like multi-armed bandits or other adaptive experimentation strategies to decide how many candidates from each source to present.
For instance, consider a short-video feed. You could have one candidate source focused on content from friends, another on trending videos, and another on exploratory content unfamiliar to the user. At the ordering stage, a multi-armed bandit algorithm dynamically adjusts how many items come from each source based on immediate user reactions (clicks, views, skips) in real-time.
Blending during ordering gives you the maximum flexibility and real-time responsiveness, continuously optimizing the candidate mix. If users respond positively to trending items today, your system immediately boosts their visibility. However, the downside is increased complexity. You will need robust infrastructure to handle real-time decisions and additional logic at this stage. You will also need sophisticated experimentation frameworks to manage dynamic candidate selections effectively.
Tip 4: Refresh Candidate Pools and Features Offline at the Right Cadence
When you build recommendation systems, you quickly realize one critical challenge: it’s impossible to perform all complex computations in real-time. Tasks like indexing large catalogs, computing embeddings for millions of items and users, and extracting detailed content features from text or visual data simply require too much time and computational resources. Instead, you precompute these features offline, typically in batches, and refresh them periodically. Your goal is straightforward: keep recommendations fresh without overwhelming your live systems or incurring unnecessary cost.
What Features to Precompute?
So, what exactly should you precompute offline? Broadly speaking, there are two main categories:
- Candidate pools: These are sets of items from which your system selects recommendations. Examples include new or trending products, recently published articles, or fresh video uploads. Whenever you add items to your catalog or observe significant user engagement with certain items, you’ll want to update these pools offline.
- Precomputed features: These go beyond candidate pools and include detailed, valuable data you compute once and reuse repeatedly. Specifically:
- Embeddings: Numerical vector representations of users or items, capturing their characteristics or preferences. You typically index these embeddings in specialized databases, enabling rapid similarity searches.
- Rich Metadata: Detailed attributes derived from item content or user data, such as product categories, visual or textual attributes, or structured tags. Storing this metadata in dedicated feature stores allows you quick access later, during scoring and ordering, without expensive recomputations.
How Often to Refresh
The right refresh frequency for your system depends directly on how rapidly your underlying data changes and how critical fresh recommendations are to your user experience. To choose your optimal refresh frequency, clearly understand your users’ expectations for freshness. Experiment with different schedules and measure their impact directly through engagement metrics and feedback.
- Hourly or multiple times per day refreshes make sense when your recommendations must rapidly reflect changing user interests or trending events. For example, if you run a news or social media platform, your content relevance shifts quickly, and frequent updates significantly enhance user experience.
- Daily refreshes are effective for e-commerce or streaming platforms. Here, your inventory and new content might update frequently but at predictable intervals. A daily refresh balances freshness and computational cost efficiently.
- Weekly or less frequent refreshes fit scenarios where content changes slowly, such as long-term rentals, durable products, or travel destination recommendations. In these cases, frequent refreshes add little value.
Efficiently Precomputing Candidate Pools and Features
Heavy computations, such as generating embeddings or extracting complex visual/textual features, are best done offline through scheduled batch processes. Consider these proven strategies for efficient offline computations:
- Run batch processes during low-traffic periods using scalable data processing frameworks like Apache Spark or AWS SageMaker. Scheduling intensive computations overnight ensures your system remains responsive during peak user hours.
- Apply incremental updates whenever possible. Instead of recomputing everything from scratch, refresh only items or user embeddings that have significantly changed. For instance, newly popular items or updated metadata might warrant incremental processing throughout the day.
- Use hybrid approaches when freshness is crucial for some items but less for others. You might refresh embeddings for highly active products multiple times daily, while stable items are refreshed weekly. This lets you strategically allocate computational resources where they matter most.
Optimizing Storage for Online Retrieval
Once you’ve refreshed candidate pools and precomputed features offline, you must store them for efficient online retrieval. Choose optimized storage solutions that support rapid access during live requests:
- Fast Key-Value Stores and Caches: Use Redis or Memcached for rapid access to precomputed recommendations, ensuring low-latency responses during user requests.
- Approximate Nearest Neighbor (ANN) Databases: Efficiently index embeddings in ANN systems like FAISS, Pinecone, Redis, or Elasticsearch for near-instantaneous vector-based searches, enabling swift retrieval even from large candidate pools.
- Dedicated Feature Stores: Use these for storing detailed metadata not only for retrieval but crucially for downstream scoring and ordering stages. These stores ensure quick, repeated access to complex attributes without recalculating.
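As a small illustration of the key-value pattern, here’s a sketch of publishing a batch job’s output to Redis with a TTL slightly longer than the refresh cadence, so entries never expire before the next batch lands; the key format and TTL are assumptions:

```python
import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

def publish_recommendations(user_id, item_ids, ttl_seconds=26 * 3600):
    # TTL of 26h for a daily refresh: a small buffer over the batch cadence.
    r.set(f"recs:{user_id}", json.dumps(item_ids), ex=ttl_seconds)

def fetch_recommendations(user_id):
    # Serving becomes a single low-latency key lookup.
    cached = r.get(f"recs:{user_id}")
    return json.loads(cached) if cached else []
```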
Tying It All Together

Industry leaders apply these offline refresh strategies effectively. Instagram Explore refreshes item embeddings daily offline, providing fresh recommendations without latency penalties. Airbnb updates listing embeddings offline daily, balancing freshness with operational efficiency.
Tip 5: Evaluate Early, Often, and Continuously
Evaluating recommendation systems effectively can be challenging because there’s no straightforward definition of what’s “correct.” Unlike simpler tasks like classification, object detection, or even image segmentation, where success criteria are well defined, recommendations depend on subjective user preferences and evolving interests.
Your goal should be simple: validate your solution as early as possible. The earlier you catch an issue or validate your recommendation system, the faster you can address it, saving you significant time and resources.
For example:
- If you discover early on that your new candidate retrieval algorithm is underperforming, you can quickly fix or replace it.
- But if you wait until a later stage, like after running extensive A/B tests in production to discover problems, you might lose weeks or months of effort.
- If you discover major issues only after your recommendation system goes live to all users, fixing them can become extremely costly and complicated, potentially requiring months of rework.
To prevent these costly setbacks, carefully choose your evaluation metrics and strategies. Begin by clearly defining success criteria that match your core business goals.
Focus First on Business Metrics
It’s crucial to align evaluation metrics directly with your primary business objectives. Ask yourself clearly:
- Are you primarily looking to increase total revenue or sales?
- Is your goal to maximize user engagement, such as time spent per session or daily active users?
- Do you aim to attract and retain new customers or user segments?
- Or perhaps your business priority is to boost interactions with new or niche products?
Clearly defining these objectives upfront lets you select the right metrics to track your recommendation system’s effectiveness. Some practical examples of business metrics you might focus on include:
- Conversion rate: Percentage of users who take a desired action (e.g., purchases, subscriptions).
- Average order value (AOV): The average revenue per transaction influenced by your recommendations and sponsored ads.
- User retention rate: The percentage of users returning regularly due to effective recommendations.
- Revenue per user/session: How much revenue each user generates through recommendations.
- Engagement metrics: Session length, number of items clicked, or average number of user interactions per session.
Choosing these business-driven metrics helps you clearly see if your recommendations are delivering genuine, measurable value.
Offline Evaluation
You should absolutely start your evaluations offline, using historical user data and a “golden dataset” of known interactions to quickly assess if your recommendations are reasonable. Common offline metrics include:
- Recall@K: Measures how many relevant items your system successfully retrieved in the top K recommendations.
- Precision@K: Evaluates how many of your top recommendations were actually relevant.
- Mean Average Precision (MAP): Assesses overall recommendation quality, considering both relevance and ranking order.
- Normalized Discounted Cumulative Gain (NDCG): Measures if highly relevant recommendations appear prominently near the top.
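To ground two of these, here’s a minimal sketch of Recall@K and a binary-relevance NDCG@K computed against a golden-dataset entry; the toy lists are placeholders for your logged interactions:

```python
import numpy as np

def recall_at_k(recommended, relevant, k):
    """Fraction of relevant items that appear in the top-k recommendations."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(recommended, relevant, k):
    """Rewards placing relevant items near the top (binary relevance)."""
    dcg = sum(1.0 / np.log2(rank + 2)  # rank + 2 since ranks are 0-indexed
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1.0 / np.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

recommended = ["a", "b", "c", "d", "e"]  # model output, best first
relevant = {"b", "e", "f"}               # golden-set interactions
print(recall_at_k(recommended, relevant, 5))  # 2 of 3 relevant found ~= 0.67
print(ndcg_at_k(recommended, relevant, 5))
```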
But these offline metrics alone are rarely sufficient. Why? Because they assume past user behavior perfectly predicts future preferences, ignoring changing user tastes or reactions to new types of content.
To bridge this gap, you need more sophisticated offline techniques:
- Off-policy Evaluation (Counterfactual Analysis): Uses historical interaction data to estimate how new recommendation strategies would perform without running live experiments. Techniques like Inverse Propensity Scoring (IPS) or Doubly Robust estimators help accurately estimate future performance.
- Shadow Testing: Silently deploy your new recommendation model alongside the current one, tracking performance, latency, and unexpected behaviors without exposing users to risks.
These methods can significantly increase your confidence before live deployment, minimizing the gap between offline and online performance.
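To make off-policy evaluation slightly more concrete, here’s a minimal IPS sketch. It assumes your logs record, per impression, the action shown, the propensity with which the logged policy chose it, and the observed reward; the field names are assumptions:

```python
import numpy as np

def ips_estimate(logs, new_policy_prob):
    """Estimate the new policy's expected reward from logged interactions.

    logs: list of dicts with keys "context", "action", "propensity", "reward".
    new_policy_prob(context, action): probability the new policy would pick
    `action` in `context`.
    """
    weights = np.array([new_policy_prob(e["context"], e["action"]) / e["propensity"]
                        for e in logs])
    rewards = np.array([e["reward"] for e in logs])
    return float(np.mean(weights * rewards))
```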
Online Evaluation
Ultimately, the gold standard for evaluating recommendation systems is online testing, most commonly through A/B tests. The power of online evaluation lies in its ability to directly measure real user responses. But remember, effective A/B testing requires careful statistical analysis:
- Ensure your tests have sufficient statistical power, meaning your test groups (cohorts) are large enough to confidently detect meaningful differences.
- Analyze different user segments separately (such as new vs. returning users or mobile vs. desktop), ensuring you uncover any hidden issues or unintended effects.
- Monitor a comprehensive set of metrics: not just engagement metrics like clicks and session duration, but also business outcomes like conversion rate, retention, and overall revenue.
Continuous Production Monitoring
Your evaluation journey doesn’t end when your recommendation system goes live. User behaviors change constantly, new items appear regularly, and unexpected trends emerge. Continuous production monitoring helps you catch and address these issues rapidly:
- Set up real-time dashboards and automatic alerts to quickly identify significant deviations in key metrics (like sudden drops in conversion rates or user engagement).
- Log detailed recommendation outputs and features, enabling quick debugging, issue diagnosis, and rapid model retraining if necessary.
- Implement automated drift detection mechanisms to quickly identify when user behavior or data patterns shift away from your model’s expectations.
A strong production monitoring strategy helps ensure your recommendation system remains effective and relevant, quickly adapting as your business and user preferences evolve.
Bringing it All Together

Closing Thoughts
In this first part of our Architect’s Playbook, we’ve explored five practical tips to help you design robust, scalable multi-stage recommendation systems. These tips cover foundational architectural choices, strategies for balancing freshness and latency, how to efficiently refresh your candidate pools and features, the critical role of evaluation metrics, and the importance of thorough, layered testing strategies.
In the next part of this series, we’ll dive deeper into additional practical strategies, exploring more advanced algorithmic choices and techniques to further enhance recommendation quality, reduce latency, and optimize the overall user experience.
Before we close, I want to highlight an important point—call it “Tip 0”: You don’t always need exactly four stages in your recommendation system.
Tip 0: You Don’t Always Need a Four-Stage RecSys
Throughout this article, I have consistently used a four-stage structure (retrieval, filtering, scoring, ordering) as a clear and convenient template to anchor our discussion. However, real-world recommendation systems are diverse and come in many flavors. You might find yourself building a three-stage, a two-stage, or even a one-stage recommendation system because it better suits your needs.
Recently, there’s also been growing research and interest in unified, single-model approaches. Rather than deploying multiple specialized components, many researchers and practitioners now prefer one large, versatile model capable of directly generating personalized recommendations, such as bespoke GPT-based recommenders or large multitask models. Such approaches integrate multiple recommendation functions into one unified framework, potentially simplifying the overall system architecture.
I’m emphasizing this explicitly because the goal of these tips is not to enforce a rigid architectural pattern. Rather, the intention is to share practical, adaptable insights you can tailor to your specific needs. Feel free to use these guidelines flexibly, adapting them to your own pipeline structure, whether it’s a traditional multi-stage system, a simplified two-stage process, or even a single-model recommender.
Let me know what you think and if there are any other tips you’d like to share. Feel free to leave a note below or contact me directly via my website.
You can also subscribe to my blogs for more insights and updates.
Disclaimer
The views and opinions expressed in my articles are my own and do not represent those of my current or past employers, or any other affiliations.