Local AI Inference Server

A local AI application development server and network in my lab, built on a custom-configured hardware and software stack.

Key Impact

Fast local AI inference and development environment

Technologies Used

Threadripper PRO, 4x RTX 3090 FE, 10G Networking, NAS, CUDA, Ollama

Project Brief

A high-performance, custom-built local computing environment housing powerful GPUs and optimized software stacks for rapid AI inference. Created to explore and master the end-to-end AI application development process and hardware optimizations.

Challenges

Hardware sourcing: finding water blocks and loop components to cool the four NVIDIA RTX 3090 Founders Edition GPUs, plus 128 GB of fast 6400 MT/s RAM, a 5 GHz 32-core/64-thread Threadripper PRO processor, Gen 5 NVMe storage, dual 1600 W PSUs, and NVLink bridges for the GPUs.
Installing a custom 30 A circuit to handle the full load of the system and future expansion with additional accessories.
Configuring and storing LLMs is data- and bandwidth-intensive, demanding very fast caching, storage, and networking.
The rapidly evolving AI application development landscape requires constant tweaking, optimization, and experimentation with the development environment configuration.

Solution

When building this machine, I envisioned running local AI models, agents, and workflows alongside cloud-hosted solutions, enabling me to develop applications fully from the ground up. The deeper purpose, though, was to use this opportunity to build the hands-on experience needed to lead future AI-enabled projects, including:

1. Ground-up hardware system configuration, and understanding the bottlenecks, nuances, and trends that enable these applications to operate effectively.

2. Development environment configuration through deep research and experimentation with various tools such as Ollama, Open WebUI, RAG pipelines, N8N, Bolt.diy, and more (see the first sketch after this list).

3. Experimentation with various AI model frameworks: custom distilled Llama 3.3 CoT reasoning models, vision models, embedding models, and image models like Flux, and integrating them into application development pipelines, tools, apps, open-source projects, and more (see the embedding sketch after this list).
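
To make point 2 concrete, here is a minimal sketch of exercising the local Ollama server over its REST API from Python. It assumes Ollama is listening on its default port (11434) and that a model tagged "llama3.3" has already been pulled; the model tag, prompt, and timeout are illustrative, not a record of this machine's exact configuration.

    # Minimal sketch: one non-streaming completion against a local Ollama server.
    # Assumes Ollama is running on the default port and "llama3.3" is pulled.
    import requests

    OLLAMA_URL = "http://localhost:11434"

    def generate(prompt: str, model: str = "llama3.3") -> str:
        """Send a prompt to the local server and return the full response text."""
        resp = requests.post(
            f"{OLLAMA_URL}/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=120,  # large local models can take a while on first load
        )
        resp.raise_for_status()
        return resp.json()["response"]

    if __name__ == "__main__":
        print(generate("Summarize why NVLink helps multi-GPU inference."))

Keeping inference behind a plain HTTP endpoint like this is what lets the same application code target either this box or a cloud-hosted model by swapping the base URL.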
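For point 3, a similar sketch of the embedding side of a RAG pipeline, again against the local Ollama API. The "nomic-embed-text" model tag is an assumption chosen for illustration, not necessarily one of the models deployed on this server.

    # Minimal sketch: generating embeddings locally for a RAG pipeline.
    # Assumes an embedding model such as "nomic-embed-text" has been pulled.
    import requests

    OLLAMA_URL = "http://localhost:11434"

    def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
        """Return the embedding vector for a chunk of text."""
        resp = requests.post(
            f"{OLLAMA_URL}/api/embeddings",
            json={"model": model, "prompt": text},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["embedding"]

    if __name__ == "__main__":
        vec = embed("Local inference keeps sensitive data on-premises.")
        print(len(vec), "dimensions")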

Future Development

Upgrade and expand the GPU array with L40s or a GDDR7 equivalent to support larger models, longer context, and faster inference.
