Self-hosted Machine Learning Model

Project Overview

This project focuses on deploying and running a modern large language model (LLM) entirely on local hardware. The goal was to gain hands-on experience with GPU-accelerated machine learning, model inference, and system-level configuration while avoiding reliance on cloud-based AI services.

The system was configured to support experimentation with open-source language models and serves as the foundation for future work involving low-level machine learning development in C.


Technologies Used

  • Ubuntu 24.04 LTS (Desktop)
  • NVIDIA GeForce RTX 3060 (12GB VRAM)
  • AMD Ryzen 12-core / 24-thread CPU
  • 32GB DDR4 RAM
  • 500GB NVMe (OS)
  • 1TB NVMe (ML data & models)
  • PyTorch
  • Hugging Face Transformers
  • Miniconda
  • CUDA Toolkit

Implementation Details

This environment was built on Ubuntu with a dual-NVMe storage layout to separate the operating system from machine learning data and model artifacts. GPU acceleration was enabled using an NVIDIA RTX 3060 with CUDA-supported PyTorch to ensure efficient local inference. A dedicated Conda environment was created to isolate machine learning dependencies and maintain reproducibility. Modern transformer-based language models were deployed using quantization techniques to operate within consumer-grade GPU memory constraints while maintaining stable inference performance.

System and Storage Configuration

Ubuntu Desktop was selected to simplify GPU driver installation, CUDA support, and debugging during development. A dual-NVMe setup was used to separate the operating system from machine learning data.

  • 500GB NVMe dedicated to the operating system and development tools
  • 1TB NVMe wiped, formatted, and mounted at /mnt/ml-data
  • Dedicated storage used for models, datasets, and inference outputs
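The mount is worth verifying programmatically before large model downloads land on the wrong drive. A minimal sketch in Python — the /mnt/ml-data path comes from the setup above, while the helper name is illustrative:

```python
import os
import shutil

ML_DATA = "/mnt/ml-data"  # dedicated 1TB NVMe for models and datasets

def mount_ok(path: str = ML_DATA) -> bool:
    """True if the path exists and is an actual mount point,
    i.e. not just an empty directory sitting on the OS drive."""
    return os.path.isdir(path) and os.path.ismount(path)

if mount_ok():
    total, used, free = shutil.disk_usage(ML_DATA)
    print(f"{free / 1e9:.0f} GB free on {ML_DATA}")
else:
    print(f"{ML_DATA} is not mounted; downloads would fall back to the OS drive")
```

Pointing the HF_HOME environment variable at a directory under this mount (e.g. /mnt/ml-data/hf-cache) also keeps Hugging Face model caches off the OS drive.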

GPU Acceleration

The NVIDIA RTX 3060 GPU was configured with proprietary drivers and CUDA support to enable hardware-accelerated inference. PyTorch was installed with CUDA bindings and verified to correctly detect and utilize the GPU.

  • Installed NVIDIA drivers compatible with CUDA
  • Installed PyTorch with CUDA support
  • Verified GPU availability using PyTorch device checks
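The device check in the last step can be as small as the following sketch. It assumes the CUDA-enabled PyTorch build from the steps above and falls back to the CPU when no GPU is visible:

```python
import torch

def pick_device() -> str:
    """Return 'cuda' when PyTorch can see a CUDA-capable GPU, else 'cpu'."""
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Selected device: {device}")
if device == "cuda":
    # On this machine, should report the RTX 3060 and its ~12GB of VRAM
    print(torch.cuda.get_device_name(0))
    print(f"{torch.cuda.get_device_properties(0).total_memory / 2**30:.1f} GiB VRAM")
```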

Environment Management

A dedicated Conda environment was created to isolate machine learning dependencies from the base system. This approach allows for controlled experimentation and version management.

  • Miniconda installed on the base system
  • Dedicated llm Conda environment created using Python 3.10
  • Installed PyTorch, Transformers, and related ML libraries
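One quick way to confirm that the environment resolves the intended interpreter and library versions is a small report script; the helper name and package list here are illustrative:

```python
import sys
from importlib import metadata

def env_report(packages=("torch", "transformers")) -> list[str]:
    """Report the interpreter version and the installed version of each package."""
    lines = [f"Python {sys.version.split()[0]}"]
    for pkg in packages:
        try:
            lines.append(f"{pkg} {metadata.version(pkg)}")
        except metadata.PackageNotFoundError:
            lines.append(f"{pkg} not installed")
    return lines

print("\n".join(env_report()))
```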

Model Inference

The Mistral 7B language model was selected for its strong performance relative to its size and its ability to fit within consumer-grade GPU memory when quantized.

  • Loaded the model using 4-bit quantization to reduce VRAM usage
  • Configured inference parameters to allow long-form text generation
  • Verified stable generation performance on local hardware

This setup enables interactive experimentation with language models without external dependencies.
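The loading step above can be sketched roughly as follows. The checkpoint name and sampling values are assumptions (any Mistral 7B variant on the Hugging Face Hub would work), and the heavy imports are deferred into the loader so the configuration helper stays cheap to call:

```python
def generation_params(max_new_tokens: int = 512) -> dict:
    """Sampling settings for long-form generation (values are illustrative)."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "temperature": 0.7,
        "top_p": 0.9,
    }

def load_quantized(model_id: str = "mistralai/Mistral-7B-Instruct-v0.2"):
    """Load a causal LM in 4-bit NF4 quantization (~4-5 GB VRAM for a 7B model)."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=quant, device_map="auto"
    )
    return tok, model

# Usage (requires a CUDA GPU, bitsandbytes, and a multi-GB model download):
#   tok, model = load_quantized()
#   inputs = tok("Explain NVMe in one paragraph.", return_tensors="pt").to(model.device)
#   output = model.generate(**inputs, **generation_params())
#   print(tok.decode(output[0], skip_special_tokens=True))
```

Deferring the model load behind a function also makes it easy to swap checkpoints or quantization settings during experimentation without touching the generation code.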

Results

  • Successfully deployed a modern LLM entirely on local hardware
  • Achieved GPU-accelerated inference using consumer-grade components
  • Gained practical experience with CUDA, PyTorch, and model deployment
  • Established a reusable local platform for future ML experimentation

Future Improvements

  • Develop a custom machine learning library written in C
  • Implement tensor operations, automatic differentiation, and basic optimizers
  • Explore low-level GPU programming using CUDA
  • Reimplement model inference pipelines without Python dependencies
  • Integrate models with custom C-based runtime environments