Self-hosted Machine Learning Model

Project Overview

This project focuses on deploying and running a modern large language model (LLM) entirely on local hardware. The goal was to gain hands-on experience with GPU-accelerated machine learning, model inference, and system-level configuration while avoiding reliance on cloud-based AI services.

The system was configured to support experimentation with open-source language models and serves as the foundation for future work involving low-level machine learning development in C.


Technologies Used

  • Ubuntu 24.04 LTS (Desktop)
  • NVIDIA GeForce RTX 3060 (12GB VRAM)
  • AMD Ryzen 12-core / 24-thread CPU
  • 32GB DDR4 RAM
  • 500GB NVMe (OS)
  • 1TB NVMe (ML data & models)
  • PyTorch
  • Hugging Face Transformers
  • Miniconda
  • CUDA Toolkit

Implementation Details

This environment was built on Ubuntu with a dual-NVMe storage layout to separate the operating system from machine learning data and model artifacts. GPU acceleration was enabled using an NVIDIA RTX 3060 with CUDA-supported PyTorch to ensure efficient local inference. A dedicated Conda environment was created to isolate machine learning dependencies and maintain reproducibility. Modern transformer-based language models were deployed using quantization techniques to operate within consumer-grade GPU memory constraints while maintaining stable inference performance.

System and Storage Configuration

Ubuntu Desktop was selected to simplify GPU driver installation, CUDA support, and debugging during development. A dual-NVMe setup was used to separate the operating system from machine learning data.

  • 500GB NVMe dedicated to the operating system and development tools
  • 1TB NVMe wiped, formatted, and mounted at /mnt/ml-data
  • Dedicated storage used for models, datasets, and inference outputs
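The mount is worth verifying programmatically before large model downloads land on the wrong drive. A minimal sketch in Python — the /mnt/ml-data path comes from the setup above, while the helper name is illustrative:

```python
import os
import shutil

ML_DATA = "/mnt/ml-data"  # dedicated 1TB NVMe for models and datasets

def mount_ok(path: str = ML_DATA) -> bool:
    """True if the path exists and is an actual mount point,
    i.e. not just an empty directory sitting on the OS drive."""
    return os.path.isdir(path) and os.path.ismount(path)

if mount_ok():
    total, used, free = shutil.disk_usage(ML_DATA)
    print(f"{free / 1e9:.0f} GB free on {ML_DATA}")
else:
    print(f"{ML_DATA} is not mounted; downloads would fall back to the OS drive")
```

Pointing the HF_HOME environment variable at a directory under this mount (e.g. /mnt/ml-data/hf-cache) also keeps Hugging Face model caches off the OS drive.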

GPU Acceleration

The NVIDIA RTX 3060 GPU was configured with proprietary drivers and CUDA support to enable hardware-accelerated inference. PyTorch was installed with CUDA bindings and verified to correctly detect and utilize the GPU.

  • Installed NVIDIA drivers compatible with CUDA
  • Installed PyTorch with CUDA support
  • Verified GPU availability using PyTorch device checks
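The device check in the last step can be as small as the following sketch. It assumes the CUDA-enabled PyTorch build from the steps above and falls back to the CPU when no GPU is visible:

```python
import torch

def pick_device() -> str:
    """Return 'cuda' when PyTorch can see a CUDA-capable GPU, else 'cpu'."""
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"Selected device: {device}")
if device == "cuda":
    # On this machine, should report the RTX 3060 and its ~12GB of VRAM
    print(torch.cuda.get_device_name(0))
    print(f"{torch.cuda.get_device_properties(0).total_memory / 2**30:.1f} GiB VRAM")
```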

Environment Management

A dedicated Conda environment was created to isolate machine learning dependencies from the base system. This approach allows for controlled experimentation and version management.

  • Miniconda installed on the base system
  • Dedicated llm Conda environment created using Python 3.10
  • Installed PyTorch, Transformers, and related ML libraries
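One quick way to confirm that the environment resolves the intended interpreter and library versions is a small report script; the helper name and package list here are illustrative:

```python
import sys
from importlib import metadata

def env_report(packages=("torch", "transformers")) -> list[str]:
    """Report the interpreter version and the installed version of each package."""
    lines = [f"Python {sys.version.split()[0]}"]
    for pkg in packages:
        try:
            lines.append(f"{pkg} {metadata.version(pkg)}")
        except metadata.PackageNotFoundError:
            lines.append(f"{pkg} not installed")
    return lines

print("\n".join(env_report()))
```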

Model Inference

The Mistral 7B language model was selected for its strong performance relative to its size and its ability to fit within consumer-grade GPU memory when quantized.

  • Loaded the model using 4-bit quantization to reduce VRAM usage
  • Configured inference parameters to allow long-form text generation
  • Verified stable generation performance on local hardware

This setup enables interactive experimentation with language models without external dependencies.
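The loading step above can be sketched roughly as follows. The checkpoint name and sampling values are assumptions (any Mistral 7B variant on the Hugging Face Hub would work), and the heavy imports are deferred into the loader so the configuration helper stays cheap to call:

```python
def generation_params(max_new_tokens: int = 512) -> dict:
    """Sampling settings for long-form generation (values are illustrative)."""
    return {
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "temperature": 0.7,
        "top_p": 0.9,
    }

def load_quantized(model_id: str = "mistralai/Mistral-7B-Instruct-v0.2"):
    """Load a causal LM in 4-bit NF4 quantization (~4-5 GB VRAM for a 7B model)."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=quant, device_map="auto"
    )
    return tok, model

# Usage (requires a CUDA GPU, bitsandbytes, and a multi-GB model download):
#   tok, model = load_quantized()
#   inputs = tok("Explain NVMe in one paragraph.", return_tensors="pt").to(model.device)
#   output = model.generate(**inputs, **generation_params())
#   print(tok.decode(output[0], skip_special_tokens=True))
```

Deferring the model load behind a function also makes it easy to swap checkpoints or quantization settings during experimentation without touching the generation code.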

Results

  • Successfully deployed a modern LLM entirely on local hardware
  • Achieved GPU-accelerated inference using consumer-grade components
  • Gained practical experience with CUDA, PyTorch, and model deployment
  • Established a reusable local platform for future ML experimentation

Future Improvements

  • Develop a custom machine learning library written in C
  • Implement tensor operations, automatic differentiation, and basic optimizers
  • Explore low-level GPU programming using CUDA
  • Reimplement model inference pipelines without Python dependencies
  • Integrate models with custom C-based runtime environments