Self-hosted Machine Learning Model
Project Overview
This project focuses on deploying and running a modern large language model (LLM) entirely on local hardware. The goal was to gain hands-on experience with GPU-accelerated machine learning, model inference, and system-level configuration while avoiding reliance on cloud-based AI services.
The system was configured to support experimentation with open-source language models and serves as the foundation for future work involving low-level machine learning development in C.
Technologies Used
- Ubuntu 24.04 LTS (Desktop)
- NVIDIA GeForce RTX 3060 (12GB VRAM)
- AMD Ryzen 12-core / 24-thread CPU
- 32GB DDR4 RAM
- 500GB NVMe (OS)
- 1TB NVMe (ML data & models)
- PyTorch
- Hugging Face Transformers
- Miniconda
- CUDA Toolkit
Implementation Details
This environment was built on Ubuntu with a dual-NVMe storage layout to separate the operating system from machine learning data and model artifacts. GPU acceleration was enabled using an NVIDIA RTX 3060 with CUDA-supported PyTorch to ensure efficient local inference. A dedicated Conda environment was created to isolate machine learning dependencies and maintain reproducibility. Modern transformer-based language models were deployed using quantization techniques to operate within consumer-grade GPU memory constraints while maintaining stable inference performance.
System and Storage Configuration
Ubuntu Desktop was selected to simplify GPU driver installation, CUDA support, and debugging during development. A dual-NVMe setup was used to separate the operating system from machine learning data.
- 500GB NVMe dedicated to the operating system and development tools
- 1TB NVMe wiped, formatted, and mounted at /mnt/ml-data as dedicated storage for models, datasets, and inference outputs
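A small sanity check can confirm the data volume actually mounted at boot before any models are written to it. This is a minimal sketch; the `check_ml_storage` helper and its return value are illustrative additions, with only the `/mnt/ml-data` mount point taken from the setup above.

```python
import os
import shutil

# Mount point from the storage layout described above.
ML_DATA = "/mnt/ml-data"

def check_ml_storage(path: str = ML_DATA) -> float:
    """Return the fraction of free space on the ML data volume.

    Raises if the path is not a real mount point, which catches the
    case where the 1TB NVMe silently failed to mount at boot and
    downloads would land on the OS drive instead.
    """
    if not os.path.ismount(path):
        raise RuntimeError(f"{path} is not a mounted filesystem")
    usage = shutil.disk_usage(path)
    return usage.free / usage.total
```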
GPU Acceleration
The NVIDIA RTX 3060 GPU was configured with proprietary drivers and CUDA support to enable hardware-accelerated inference. PyTorch was installed with CUDA bindings and verified to correctly detect and utilize the GPU.
- Installed NVIDIA drivers compatible with CUDA
- Installed PyTorch with CUDA support
- Verified GPU availability using PyTorch device checks
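The device check mentioned above can be sketched as follows; the `gpu_status` helper is an illustrative name, not part of the original setup, but the PyTorch calls (`torch.cuda.is_available`, `torch.cuda.get_device_name`) are the standard way to verify CUDA bindings.

```python
import torch

def gpu_status() -> str:
    """Return the active CUDA device name, or 'cpu' if CUDA is unavailable."""
    if torch.cuda.is_available():
        # On this machine this should report the RTX 3060.
        return torch.cuda.get_device_name(0)
    return "cpu"

print(gpu_status())
```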
Environment Management
A dedicated Conda environment was created to isolate machine learning dependencies from the base system. This approach allows for controlled experimentation and version management.
- Miniconda installed on the base system
- Dedicated llm Conda environment created using Python 3.10
- Installed PyTorch, Transformers, and related ML libraries
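Run inside the llm environment, a short script can confirm the interpreter version and GPU-enabled PyTorch are the ones intended; `env_report` is an illustrative helper, not something the original setup defines.

```python
import sys

def env_report() -> dict:
    """Collect the version facts that matter for reproducibility here."""
    info = {"python": f"{sys.version_info.major}.{sys.version_info.minor}"}
    try:
        import torch
        info["torch"] = torch.__version__
        info["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        info["torch"] = None  # environment not yet provisioned
    return info

print(env_report())
```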
Model Inference
The Mistral 7B language model was selected due to its strong performance and compatibility with consumer-grade GPUs when using quantization.
- Loaded the model using 4-bit quantization to reduce VRAM usage
- Configured inference parameters to allow long-form text generation
- Verified stable generation performance on local hardware
This setup enables interactive experimentation with language models without external dependencies.
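The loading and generation steps above can be sketched with Hugging Face Transformers and bitsandbytes. This is a hedged sketch, not the exact configuration used: the checkpoint name `mistralai/Mistral-7B-v0.1`, the NF4 quantization type, and the sampling parameters are assumptions, since the write-up only specifies "Mistral 7B" and "4-bit quantization".

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumption: this exact checkpoint; the write-up only says "Mistral 7B".
MODEL_ID = "mistralai/Mistral-7B-v0.1"

def load_quantized(model_id: str = MODEL_ID):
    """Load the model with 4-bit quantization to fit in 12GB of VRAM."""
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",           # assumed quantization type
        bnb_4bit_compute_dtype=torch.float16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # place layers on the GPU automatically
    )
    return tokenizer, model

def generate(tokenizer, model, prompt: str, max_new_tokens: int = 512) -> str:
    """Long-form generation; sampling parameters are illustrative."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example usage (downloads several GB of weights on first run):
# tok, mdl = load_quantized()
# print(generate(tok, mdl, "Explain quantization in one paragraph."))
```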
Results
- Successfully deployed a modern LLM entirely on local hardware
- Achieved GPU-accelerated inference using consumer-grade components
- Gained practical experience with CUDA, PyTorch, and model deployment
- Established a scalable platform for future ML experimentation
Future Improvements
- Develop a custom machine learning library written in C
- Implement tensor operations, automatic differentiation, and basic optimizers
- Explore low-level GPU programming using CUDA
- Reimplement model inference pipelines without Python dependencies
- Integrate models with custom C-based runtime environments