DeskRAG: Create an Offline AI Assistant in One Afternoon

May 19, 2025
Written By Rahul Suresh

Senior AI/ML and Software Leader | Startup Advisor | Inventor | Author | ex-Amazon, ex-Qualcomm

Introduction

In my previous article, I discussed how Vibe coding is rapidly transforming development workflows, enabling experienced engineers to become significantly more productive. While not yet mature enough for deploying production-ready code directly, Vibe coding has proven invaluable for quickly prototyping and iterating complex applications, provided you approach it with the right mindset and methodology.

I recently challenged myself to build an entirely local, offline GPT-style assistant powered by Retrieval-Augmented Generation (RAG), which I named DeskRAG, within a single afternoon (roughly four hours). My goal was twofold:

  1. Create a personalized offline AI assistant for quickly searching, querying, and summarizing documents on my own laptop.
  2. Carefully document the entire end-to-end vibe-coding workflow I used in building it, using Cursor powered by Claude 3.5 and 3.7.

In this article, I’ll walk you through how I built DeskRAG, highlighting key architectural choices and technical considerations. I’ll also share insights into how I shifted into a guiding role, like a program manager or software architect, mentoring my AI assistant through iterative coding cycles, reviewing, refining, and improving the application. I’ll discuss what worked well, what didn’t, and some best practices.

Motivation: Why DeskRAG?

The goal of this project is simple: I wanted to demonstrate how quickly and easily you can build your own GPT-like assistant that runs entirely locally on your laptop. While I regularly use, and love, various cloud-based language models, there are many situations where uploading sensitive or personal documents to a third-party service just isn’t ideal. I often wished I had a completely local solution that would let me retain full control over my data and privacy.

For example, you might want to:

  1. Quickly find specific information or details buried inside hundreds of personal documents without manually scanning each one.
  2. Instantly summarize long reports, research papers, or notes saved on your laptop, without uploading them anywhere.
  3. Search through private documents such as financial statements, medical records, or contracts safely and privately.
  4. Answer specific questions about local files, notes, or meeting minutes instantly, without leaving your desktop.
  5. Create a secure personal knowledge-base or local reference assistant customized exactly to your preferences.

These are exactly the types of privacy-sensitive tasks I wanted DeskRAG to handle. Sure, there are many existing applications and open-source code bases that can run AI assistants locally, but building my own meant complete control. I could select the exact models, architecture, and features I wanted, all customized to run perfectly offline on my laptop. Plus, it seemed like a great way to spend an afternoon and have fun along the way!

While I’ll use DeskRAG as the example throughout this article, the concepts and architecture I discuss here are broadly applicable to any privacy-driven, local AI assistant use case.

Let’s dive in!

How DeskRAG Works

High-Level Architecture

This diagram visually illustrates the high-level architecture of the DeskRAG application, clearly divided into two primary flows: **Real-Time Query Flow** and **Document Processing**.

In the **Real-Time Query Flow** (top portion), a user submits a query, which is received by the RAG Engine. The RAG Engine then communicates with the Vector Database (FAISS) to perform a search for relevant document vectors. FAISS returns matching document vectors to the RAG Engine, which combines these retrieved documents with the user's query and sends this combined context to the Language Model. The Language Model then processes this input and generates a textual response, returning it back to the RAG Engine. Finally, the RAG Engine provides the response directly back to the user.

In the **Document Processing** flow (bottom portion), the user initially uploads documents into the system, labeled "Source Documents." These raw documents are passed into the Document Parser, which processes and breaks them down into manageable chunks. These document chunks are then sent to the CLIP Embedder, which generates numerical embeddings (vectors) from these chunks. The resulting embeddings are stored directly in the Vector Database (FAISS), ready for retrieval during real-time queries.

Overall, the diagram emphasizes clear separation between query handling (real-time user interactions) and document preprocessing (initial document setup), highlighting the distinct roles of key components, including the RAG Engine, Vector Database, Language Model, Document Parser, and CLIP Embedder.
Figure 1: High-level Architecture Diagram

The main workflow for DeskRAG is straightforward and consists of five core steps:

  1. Select Documents:
    Choose individual files or entire folders on your laptop to create your personal knowledge base (PDF, DOCX, TXT, Markdown, images).
  2. Process and Generate Embeddings:
    Parse documents into manageable chunks of text or image data, then convert them into numeric multimodal embeddings using CLIP.
  3. Store Embeddings:
    Store embeddings locally in a high-performance FAISS vector database for fast and efficient retrieval.
  4. User Query via Selected Language Models:
    Select from various supported quantized language models (Falcon, Mistral, TinyLlama, Phi-2, Llama-7B) and enter your queries in the DeskRAG interface.
  5. Retrieve and Respond:
    Retrieve relevant document chunks from FAISS, create a contextual prompt, and generate a response locally using your chosen Language Model. Responses include citations linking directly back to source documents.

Below are detailed sequence diagrams explaining the document ingestion and query flows.

Figure 2: Sequence diagram for document ingestion and parsing
Figure 3: Sequence diagram for query processing

Tools and Frameworks

The table below summarizes each key architectural component, along with the selected tools, reasoning behind these choices, and alternatives considered during design.

For DeskRAG’s user interface, I chose Tkinter because it’s lightweight, integrates seamlessly with Python, and easily supports fully offline use. Alternatives like Streamlit were appealing but limited in customization, while Electron and PyQt introduced unnecessary complexity and heavier dependencies.

For document parsing, I relied on mature libraries like PyPDF2, python-docx, markdown, and Pillow because they're reliable, widely-used, and straightforward. While alternatives such as Apache Tika and Unstructured offered richer parsing features, they required complex setups and heavier dependencies, making them unsuitable for this quick, local implementation.

The embedding model selected was primarily CLIP (8-bit quantized), offering unified multimodal embeddings for both text and images. This future-proofs DeskRAG for image-related extensions. I also included optional TextTransformer embeddings for flexibility. Alternatives like Sentence Transformers offered text-only support, insufficient for anticipated multimodal features.

For vector storage, FAISS stood out for its simplicity, performance, and minimal overhead. While alternatives such as Chroma or Qdrant provided richer features, their added complexity was unnecessary for DeskRAG’s focused local application. Cloud-based solutions like Pinecone were unsuitable due to the offline requirement.

Finally, I leveraged Hugging Face for model storage because of its extensive collection, excellent quantization support, and easy integration. Alternative manual downloads or custom hosting were more cumbersome. Enabled optimizations included 4-bit quantization and embedding caching, significantly reducing inference latency. FlashAttention, though promising, remains temporarily disabled due to compatibility issues.
Figure 4: Tools and Frameworks used in DeskRAG

Key Trade-Offs and Considerations

1. Embedding Model Selection

When selecting an embedding model, the key factors to consider are:

  • Embedding quality (high retrieval accuracy)
  • Multimodal capability (text and image support)
  • Efficiency through quantization (for local GPU execution)
  • Open-source availability and widespread adoption

Considering these factors, I chose CLIP (ViT-B/32, 8-bit quantized) because it produces unified embeddings for both text and images, supporting potential future image-based queries. Its quantization capability ensures efficient local execution on my RTX 4070 GPU.
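To make this concrete, here is a minimal sketch of generating unified text and image embeddings with CLIP ViT-B/32 through the Hugging Face transformers API. The image path is a placeholder, and the 8-bit quantization and caching that DeskRAG layers on top are omitted here.

```python
# Minimal sketch of multimodal embeddings with CLIP ViT-B/32 via Hugging Face
# transformers. Illustration only; DeskRAG's actual embedder also applies
# 8-bit quantization and caching.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Text chunks and images go through the same model, landing in one vector space.
texts = ["quarterly revenue summary", "attention is all you need"]
inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    text_emb = model.get_text_features(**inputs)          # shape: (2, 512)

image = Image.open("figure.png")                          # placeholder path
image_inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_emb = model.get_image_features(**image_inputs)  # shape: (1, 512)

# Normalize so inner-product search in FAISS behaves like cosine similarity.
text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
image_emb = torch.nn.functional.normalize(image_emb, dim=-1)
```

Normalizing the embeddings up front lets an inner-product FAISS index act as cosine similarity, which is the retrieval setup sketched in the next section.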

Another great option is Sentence Transformers based embeddings, which excel specifically at text-only tasks and can sometimes offer better performance for purely textual content.

You can also consider other alternatives, including specialized encoders for text (e.g., BERT-based models) and images (e.g., ResNet embeddings). However, these approaches would have increased complexity and maintenance overhead without sufficient benefit given DeskRAG’s scope.

2. Vector Storage

Here are some factors I considered when choosing a vector store:

  • Open-source, mature, and widely adopted
  • Fast local retrieval performance
  • Easy integration with Python and Hugging Face ecosystem
  • Efficient local resource consumption (memory and disk usage)

Given these requirements, I selected FAISS, a widely recognized industry standard vector database developed by Meta, primarily because it offers exceptional retrieval performance, easy Python integration, and minimal complexity for local storage scenarios. FAISS is specifically optimized for high-speed vector searches on CPUs and GPUs, making it ideal for DeskRAG’s offline, lightweight use case.

There are many other popular vector databases, and they all have their pros and cons. Chroma, Qdrant, Pinecone, OpenSearch, ElasticSearch, and Redis all offer a wide range of features to create, store, and retrieve embeddings, some of which also allow you to run locally. If you prefer any of these other vector databases, you can use them instead for your use case.

FAISS was ultimately the most practical, efficient, and optimal solution for this problem, given DeskRAG’s local execution and simplicity requirements.
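As an illustration of how little code this requires, here is a minimal FAISS sketch using random vectors as stand-ins for the CLIP embeddings; the index type and top-k value are reasonable defaults, not DeskRAG’s exact settings.

```python
# Minimal sketch of local vector storage and retrieval with FAISS.
# Random vectors stand in for the CLIP embeddings generated earlier.
import faiss
import numpy as np

dim = 512                               # CLIP ViT-B/32 embedding dimension
index = faiss.IndexFlatIP(dim)          # exact inner-product search (cosine on normalized vectors)

# Pretend these are embeddings of document chunks.
chunk_embeddings = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(chunk_embeddings)    # in-place L2 normalization
index.add(chunk_embeddings)

# Query: embed the user's question the same way, then fetch the top-k chunks.
query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, chunk_ids = index.search(query, 5)
print(chunk_ids[0], scores[0])          # indices map back to stored chunk metadata

# Persist the index locally so it survives restarts, fully offline.
faiss.write_index(index, "deskrag.index")
```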

3. Language Model Choices

Language Models supporting DeskRAG should fulfill the following criteria:

  • Compatibility with local GPU (RTX 4070, 12 GB VRAM)
  • Model size and parameter count (directly affecting GPU memory and speed)
  • Robust quantization support (4-bit quantization for local efficiency)
  • Open-source availability, popularity, and ease of access via Hugging Face
  • Well-documented performance benchmarks

Keeping flexibility in mind, I chose to explicitly support and test several quantized language models: Falcon-7B-Instruct, Mistral-7B-Instruct, TinyLlama-1.1B, Phi-2, and Llama-7B. All these models comfortably fit within my RTX 4070 GPU memory, achieve inference speeds of a few seconds or less, and have strong community adoption, performance benchmarks, and Hugging Face availability. While I chose these models, DeskRAG’s architecture easily supports any Hugging Face transformer model meeting similar criteria.
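For reference, loading one of these models with 4-bit quantization via transformers and bitsandbytes looks roughly like the sketch below; the repo ID is a placeholder for whichever Hugging Face checkpoint you choose.

```python
# Sketch of loading a supported LLM with 4-bit quantization via
# transformers + bitsandbytes. The repo ID below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder repo ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                # place layers on the local GPU automatically
)

prompt = "Summarize the key idea of positional encoding in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```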

4. Optimizations

To achieve optimal local performance, I implemented three primary optimizations:

Quantization: I applied 4-bit quantization for all language models (LLMs) and 8-bit quantization for CLIP embeddings. Quantization significantly reduces memory usage and inference latency, enabling all chosen models to run comfortably on a consumer-grade GPU, such as my RTX 4070.

Embedding Caching: Embeddings are computed once and cached locally, eliminating redundant computations for frequently accessed documents. This dramatically reduces processing time during retrieval, especially after initial indexing is complete.
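A simple way to implement this is sketched below, assuming a content-hash keyed cache on disk; the cache path and the embed_fn hook are illustrative, not DeskRAG’s actual code.

```python
# Sketch of an on-disk embedding cache keyed by chunk content, so re-indexing
# unchanged documents skips the expensive CLIP forward pass.
import hashlib
from pathlib import Path

import numpy as np

CACHE_DIR = Path.home() / ".deskrag" / "embedding_cache"
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def cached_embedding(chunk_text: str, embed_fn) -> np.ndarray:
    """Return the embedding for chunk_text, computing it only on a cache miss."""
    key = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.npy"
    if cache_file.exists():
        return np.load(cache_file)        # cache hit: no model call
    embedding = embed_fn(chunk_text)      # cache miss: run the embedder once
    np.save(cache_file, embedding)
    return embedding
```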

Built-in GPU and CPU Optimizations: Leveraged standard performance optimizations provided by PyTorch and Hugging Face Transformers, ensuring full utilization of available GPU resources for model inference and embedding generation. The FAISS retrieval process was also tuned to effectively utilize available CPU resources, balancing overall system load.

Additionally, I intended to use FlashAttention, a high-performance transformer optimization known to further speed up inference. However, due to compatibility challenges on my Windows setup, I disabled this optimization, but it remains a priority for future exploration.

Implementation: Building DeskRAG with Vibe Coding

This project was one of those rare instances where I was able to move from idea to a fully functioning prototype in just a single afternoon session. The key enabler was Vibe coding, which allowed me to step away from directly writing every line of code and instead focus on clearly communicating my intent. Throughout the process, I wore two hats simultaneously: a software product manager, clearly defining the detailed functional requirements, and a software architect, specifying technical choices, key optimizations, and overall workflow.

My Workflow Using Vibe Coding

I began by preparing a detailed initial prompt for Cursor, describing the functional requirements, the overall architectural vision, and the technical specifics like model selection, embedding strategy, and optimizations. After submitting this prompt, Cursor generated the first iteration of the codebase.

From there, I went through around 25 iterations, each short and interactive, initially to fix environmental setup and dependency issues, and later to debug, optimize, and refine the application. During these interactions, I transitioned from debugging initial errors to collaboratively refining the logic and ensuring system efficiency.

After debugging the dependency and environment issues, DeskRAG’s responses were still incorrect or irrelevant. A deeper look revealed that the embedding generation logic wasn’t correctly tied to the document parser, resulting in random embeddings. After highlighting this critical issue explicitly, Cursor corrected it quickly, significantly improving the application’s accuracy and response quality.

Ultimately, this iterative, conversational process allowed me to complete DeskRAG in just a few hours, far less than the multiple days such an effort typically demands.

Here are some key learnings using Vibe coding for this project:

  1. Be Explicit in Your Architectural Intent: It was crucial to clearly outline requirements and architectural decisions upfront. The more precise and explicit I was, especially around key technologies (like CLIP embeddings, FAISS database, and quantized models), the more accurate and relevant the generated code became. Conversely, vagueness led to unnecessary iterations and corrections.
  2. Expect Iteration and Debugging, Especially on Integration Points: Integration points between modules (e.g., embedding generation and document parsing, embeddings and the vector database) required the most debugging effort. Cursor’s generated code initially struggled with these connections, and significant interaction was needed before the code properly linked these critical modules. Paying close attention to these points early on would have saved considerable time.
  3. Prioritize Quick Environment and Dependency Checks: Initial interactions frequently revolved around environment setup and resolving dependency conflicts. Quickly establishing a stable environment (e.g., through Conda or Docker) and testing basic imports upfront significantly reduced friction later in the coding process; a quick import check along these lines is sketched after this list.
  4. Collaborate and Iterate Rapidly for Optimizations: The iterative nature of Vibe Coding shines in performance tuning and optimization phases. Rather than manually tweaking code, conversationally exploring various optimizations (quantization, caching, GPU utilization) allowed me to quickly assess trade-offs and achieve substantial performance improvements in minimal time.
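For example, a quick sanity check like the following (a generic snippet, not part of DeskRAG itself) catches missing wheels or a CUDA mismatch before you spend vibe-coding iterations on application logic:

```python
import importlib

# Quick sanity check of the heavy dependencies before iterating on app logic.
for pkg in ("torch", "transformers", "faiss", "PIL", "tkinter"):
    try:
        importlib.import_module(pkg)
        print(f"[ok]   {pkg}")
    except Exception as exc:          # missing wheel, CUDA mismatch, etc.
        print(f"[fail] {pkg}: {exc}")

try:
    import torch
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    pass
```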

The Initial Prompt: Blueprint for the DeskRAG Implementation

Below is the exact initial prompt I used to instruct Cursor (powered by Claude) to generate the DeskRAG system. I designed this prompt to clearly combine the product-level requirements (like a product manager) and architectural guidance (like a software architect), including specific model choices and optimizations, while leaving flexibility where appropriate.

You can directly reuse or adapt this prompt for similar projects:

PRODUCT-LEVEL REQUIREMENTS  
    - Build a desktop application named DeskRAG that runs locally and lets users query, summarise, and retrieve information from personal documents stored on their machine, entirely offline.  

DOCUMENT SELECTION AND INGESTION  
    - User can point the app at a single file or an entire folder.  
    - Supported formats: PDF, DOCX, TXT, Markdown, JPG, PNG.  
    - Each document is split into manageable chunks for downstream embedding.  

FAST, ACCURATE LOCAL RETRIEVAL  
    - For every query, retrieve the most relevant chunks from the local vector store.  
    - Responses must cite the original document and exact page or image.  

PRIVACY GUARANTEE  
    - No user data ever leaves the device.  
    - Internet connectivity is required only to download open-source models from Hugging Face; runtime queries and documents stay local.  

CHAT-STYLE USER INTERFACE  
    - Lightweight desktop GUI shows ingestion progress, supports drag-and-drop, and streams answers back to the user.  

ARCHITECTURAL AND TECHNICAL REQUIREMENTS  

EMBEDDINGS  
    - Primary: CLIP ViT-B/32 (8-bit) for unified text + image embeddings.  
    - Optional: Sentence-Transformer model for text-only retrieval, switchable via config.  

LANGUAGE MODELS  
    - Support 4-bit Falcon-7B-Instruct, Mistral-7B-Instruct, TinyLlama-1.1B, Phi-2, and Llama-7B.  
    - Architecture must allow other Hugging Face models with minimal changes.  

OPTIMISATIONS  
    - Quantise LLMs to 4-bit and embeddings to 8-bit.  
    - Cache embeddings to avoid redundant work.  
    - Enable PyTorch and Transformers GPU optimisations.  
    - Keep code ready for FlashAttention (disabled for now on Windows).  

VECTOR STORE  
    - Use FAISS for local similarity search because it is fast, lightweight, and well-supported for desktop environments.  

DOCUMENT PARSING AND UI  
    - Select mature Python libraries with minimal setup for parsing and GUI.  

HIGH-LEVEL WORKFLOW  
    - Document ingestion → Embedding generation → FAISS storage → User query → Retrieval → LLM generation with context → Streamed response to GUI with citations.  


A Deep Dive into the Generated Code

Let us examine the final structure of the code package.

deskrag/
├── src/
│   ├── core/
│   │   ├── config.py
│   │   ├── logging.py
│   │   └── rag_engine.py
│   ├── processors/
│   │   └── document.py
│   ├── models/
│   │   ├── embedding.py
│   │   └── llm.py
│   ├── db/
│   │   └── vector_store.py
│   └── ui/
│       ├── app.py
│       └── components.py
├── scripts/
│   └── download_models.py
├── tests/
│   └── test_rag.py
├── requirements.txt
└── README.md

The core components include:

  • Configuration (src/core/config.py): Manages directories under ~/.deskrag, loads/saves JSON configuration, and tracks available models.
  • Document Processing (src/processors/document.py): Reads PDFs, DOCX, text files, markdown, and images, splitting them into overlapping chunks for embedding. The process_file method dispatches to format-specific handlers (a rough chunking sketch follows this list).
  • Vector Store (src/db/vector_store.py): Uses FAISS to store CLIP embeddings, including caching, batch processing, and optional index optimization.
  • RAG Engine (src/core/rag_engine.py): Coordinates processing, embedding, storing, and querying. The query method retrieves relevant documents (via text or image search), assembles context, and asks the LLM to generate a response.
  • Models (src/models/embedding.py and src/models/llm.py): Provide CLIP-based embeddings and an LLM interface with optional 4‑bit quantization and response caching.
  • User Interface (src/ui/app.py and src/ui/components.py): Tkinter-based GUI with file selection, progress tracking, and a chat interface. DeskRAGApp initializes configuration and the RAG engine, and displays the main window.
  • Scripts (scripts/download_models.py and scripts/cleanup_models.py): Utilities for downloading or cleaning up model files from Hugging Face.
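As referenced above, here is a rough sketch of the overlapping-chunk idea behind the document processor; the word-level splitting and sizes are illustrative, since the generated document.py dispatches to format-specific handlers first.

```python
# Rough sketch of overlapping chunking for document processing.
# Sizes and word-level splitting are illustrative, not DeskRAG's exact logic.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into ~chunk_size-word chunks that share `overlap` words of context."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks

# Example: a 1000-word report becomes overlapping chunks ready for embedding.
sample = "word " * 1000
print(len(chunk_text(sample)))   # prints 7
```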

The entry point src/main.py simply launches the DeskRAGApp GUI.
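To show how these pieces connect at query time, here is a hedged sketch of the retrieve-then-generate path; embed_query, generate, and the prompt template are stand-ins for what rag_engine.py actually wires together.

```python
# Illustrative sketch of the query path: retrieve top chunks from FAISS,
# assemble a context prompt, and generate a local answer with citations.
# embed_query, index, chunks, and generate are stand-ins for DeskRAG's modules.
import faiss

def answer(question: str, index: faiss.Index, chunks: list[dict],
           embed_query, generate, top_k: int = 5) -> str:
    # 1. Embed the question with the same CLIP text encoder used for documents.
    q = embed_query(question).reshape(1, -1).astype("float32")
    faiss.normalize_L2(q)

    # 2. Retrieve the most relevant chunks from the local FAISS index.
    _, ids = index.search(q, top_k)
    retrieved = [chunks[i] for i in ids[0] if i != -1]

    # 3. Assemble a contextual prompt with source citations.
    context = "\n\n".join(
        f"[{c['source']}, p.{c['page']}]\n{c['text']}" for c in retrieved
    )
    prompt = (
        "Answer the question using only the context below and cite sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    # 4. Generate locally with the selected quantized LLM.
    return generate(prompt)
```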

Results

I primarily built DeskRAG to organize my personal documents and easily find answers to queries based on my local files. However, for the sake of demonstrating the capabilities here, I created a test folder with a selection of well-known research papers—including the original Transformer paper (“Attention is All You Need”), the ResNet paper (“Deep Residual Learning for Image Recognition”), and a few other classic research articles. I then ingested these papers into the system, asked specific questions related to their content, and captured some of the resulting responses, as shown below.

Clearly, the answers are not as sophisticated or nuanced as you’d expect from today’s frontier language models. But remember—DeskRAG runs entirely on small, quantized models with only around 7 billion parameters (or lower), suitable for a lightweight desktop environment. These models aren’t designed to handle extremely complex queries or intricate reasoning tasks. Instead, they’re optimized for straightforward questions and organizing information from your personal documents.

From that perspective, as your “personal GPT” focused on your own files and documents, the results are quite satisfactory. DeskRAG is intended to quickly and reliably handle routine document-search and organization tasks, and it performs those functions well.

Moving forward, I’ll continue experimenting with different models, tweaking the embedding strategies, and refining the UX to make the app more functionally useful.

Figure 5: A snapshot of DeskRAG in action

In my very non-scientific latency measurements, I observed notable differences between models when asking similar queries about the ingested research documents—questions like “What is positional encoding?” or “Explain the attention mechanism,” or simple factual inquiries about paper results. On the lower end, Phi-2 and TinyLlama produced short, concise responses within roughly 1 to 2 seconds, but these answers were terse and lacked depth. At the other end, Falcon and Llama took significantly longer, approximately 8 to 10 seconds, but provided richer, more detailed responses; of all the models tested, these two delivered the highest-quality results.

Conclusion

I was able to move from an idea to a working demo and a fully functional prototype in just a single afternoon, something that traditionally would have taken days or weeks. That itself was an incredible productivity boost. While the initial results look promising, DeskRAG remains primarily a fun, exploratory project for me right now. I plan to continue refining it, adding new features, and enhancing existing functionalities over time.

Have I thoroughly reviewed all the generated code line-by-line yet? Not completely. I’ve tested the basic functionality, verified key architectural components, and ensured that the prototype meets my initial requirements. However, if you ask whether I’d feel comfortable deploying this code in a production environment today, my answer is clearly “no.” To reach that point, the code would require careful unit testing, comprehensive end-to-end functional testing, thorough code reviews, and many other software development best practices. I plan to undertake all these steps before making this tool publicly available.

Yet, despite the remaining work ahead, the overall productivity gain remains substantial. Even considering the additional tasks required to make the code production-ready, I’ve already saved significant development time compared to building from scratch. And importantly, I now have a useful tool I’m actively enjoying and benefiting from each day—exactly the kind of outcome I’d hoped for.

Disclaimer

The views and opinions expressed in my articles are my own and do not represent those of my current or past employers, or any other affiliations.

