In my journey to set up an efficient local AI environment, I've experimented extensively with Ollama's native Windows installation and OpenWebUI as my preferred web interface. My hardware setup currently includes two Nvidia RTX 3090 Founders Edition GPUs, but I'm planning an expansion soon: I have two more RTX 3090 cards waiting in the wings that I intend to install, connect via NVLink, water-cool, and power with dual 1600-watt PSUs. To accommodate this significant power draw, I've set up a dedicated 30-amp circuit in my apartment. Eventually, I'll upgrade further to GPUs with higher VRAM capacities, but my current setup already handles very demanding workloads impressively well.
In this blog post, I'll walk through my comprehensive, hands-on experience setting up Ollama and OpenWebUI directly (without Docker) and showcase how to configure advanced model settings, enable powerful Retrieval-Augmented Generation (RAG), and leverage hybrid API integrations like OpenAI.
Initially, I experimented with Docker Desktop on Windows, but for performance and simplicity reasons I shifted to a native Ollama Windows binary installation. Avoiding the overhead and complexity of Docker and Nvidia's Container Toolkit on Windows simplified things considerably: the native binary installs like any other Windows application, runs quietly in the background, exposes its API on localhost:11434, and talks to the GPUs directly.
Installation on Windows couldn't be simpler:
ollama run llama3
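Once the installer from ollama.com has finished, the CLI is available in any terminal. A quick sanity check looks like this (the model name is just the example from above; substitute whatever you actually pulled):

ollama --version     # confirm the CLI and background service are installed
ollama pull llama3   # fetch a model without opening a chat session
ollama list          # show the models available locally
ollama ps            # show loaded models and whether they are running on GPU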
Use nvidia-smi to confirm your GPUs are engaged as expected.
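If you want to watch utilization live while a model is generating, looping the query is one easy option:

nvidia-smi -l 2      # refresh GPU utilization and memory usage every two seconds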
Next, I configured OpenWebUI as my chat interface. Since OpenWebUI also runs well natively or via Docker, I opted for the Docker route here simply because it offers a cleaner sandboxed environment for the web interface itself—this won't negatively affect Ollama's native GPU performance:
docker run -d `
  -v open-webui:/app/backend/data `
  -p 3000:8080 `
  --name open-webui `
  ghcr.io/open-webui/open-webui:main
After launching OpenWebUI, integrating it with my native Ollama installation was straightforward:
http://host.docker.internal:11434
On Windows, host.docker.internal ensures that Docker communicates properly with the native Windows Ollama installation. Once connected, all models downloaded or built via Ollama immediately become available in OpenWebUI.
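Before relying on the UI, it's worth confirming the native Ollama API actually answers on its default port. The endpoint below is Ollama's standard model-listing route; in PowerShell, use curl.exe explicitly so you get real curl rather than the Invoke-WebRequest alias:

curl.exe http://localhost:11434/api/tags   # should return JSON listing your local models

Inside OpenWebUI, that same URL, with host.docker.internal in place of localhost, is what goes into the Ollama connection field in the Admin Panel's connection settings.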
I use two primary models frequently in my workflow, both configured with large context windows.
Configuring advanced settings in OpenWebUI is critical for handling these large contexts effectively. My typical adjustments center on context length, output limits, and sampling behavior.
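As a rough illustration (the model name and numbers below are placeholders rather than my exact values), the same knobs can also be baked into an Ollama Modelfile so they travel with the model:

# Modelfile: illustrative long-context configuration (placeholder values)
FROM llama3.1
PARAMETER num_ctx 32768      # context window in tokens
PARAMETER num_predict 4096   # cap on tokens generated per response
PARAMETER temperature 0.6    # lower values give more deterministic output

Building it with ollama create long-context -f Modelfile registers the variant, and it shows up in OpenWebUI's model list automatically.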
To fine-tune these settings, I navigate to:
Admin Panel → Models → Edit Model → Advanced Parameters
By adjusting these settings, I can precisely control how detailed or expansive the model outputs become, significantly improving productivity and adaptability in various tasks.
Perhaps the most valuable aspect of my setup is my extensive use of Retrieval-Augmented Generation (RAG). RAG lets the models draw on external knowledge bases customized for my specific needs, and OpenWebUI makes setting it up intuitive.
I've experimented with several embedding models to power the retrieval pipeline.
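Whichever embedding model you settle on, it's easy to confirm it produces vectors locally before pointing OpenWebUI's document settings at it. The model name below (nomic-embed-text) is just a common example, not necessarily my final choice:

ollama pull nomic-embed-text
# PowerShell-friendly test call against Ollama's embeddings endpoint; returns a JSON object with an embedding array
curl.exe http://localhost:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "test sentence"}'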
I also maintain active API integrations through OpenWebUI's connections with OpenAI's suite (o1, o3, GPT-4.5 Preview, and the Embeddings API). This hybrid approach offers unique flexibility.
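The OpenAI side can be configured from OpenWebUI's admin settings, or supplied when the container starts. The environment variables below are the ones OpenWebUI documents for OpenAI-compatible endpoints; the key is of course your own:

docker run -d `
  -v open-webui:/app/backend/data `
  -p 3000:8080 `
  -e OPENAI_API_BASE_URL=https://api.openai.com/v1 `
  -e OPENAI_API_KEY=$env:OPENAI_API_KEY `
  --name open-webui `
  ghcr.io/open-webui/open-webui:main

If the container from earlier is already running, it needs to be removed and recreated with these extra flags; the named volume keeps your data intact.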
Deploying such a setup in a private business setting provides significant benefits.
A hybrid model (local + API services) strikes an optimal balance for many practical business situations, offering speed, privacy, cost-effectiveness, and access to state-of-the-art external models exactly when needed.
I've been using my local Ollama + OpenWebUI setup extensively for developing new coding projects. In particular, for VS Code integration I use tools like Bolt.diy and Cline, which streamline my coding workflow and development speed considerably. From quickly bootstrapping projects to debugging complex scenarios, these AI-driven tools, combined with the local GPU setup, dramatically improve my productivity. I'll share more detailed use cases from my coding experiences in upcoming dedicated posts.
Currently, my two RTX 3090s provide excellent performance. However, soon I'll expand to all four GPUs, coupled with custom water-cooling loops, NVLINK connections, and dual power supply setups. These upgrades will let me run even larger models, higher contexts, and concurrent instances—plus serve multiple AI applications simultaneously without performance sacrifice. Eventually, upgrading to GPUs that offer higher VRAM capacities, such as Nvidia A100 or H100, might be necessary.
The native Ollama Windows setup paired with OpenWebUI has proven extremely capable—offering both performance and flexibility. The addition of RAG and a hybrid API approach makes this setup an exceptional solution for coding, research, personal projects, and even enterprise applications. If you're aiming for a powerful, fully customizable, and private AI stack, directly installing Ollama and pairing it with OpenWebUI is truly a robust choice.
Stay tuned for my upcoming posts covering practical coding projects and deeper dives into specific models and optimization techniques!