
March 10, 2026 · 10 min read

Run AI on Your Own Machine: A Practical Guide to Local LLMs with LM Studio

Set up local AI development with LM Studio in minutes. Covers model selection, quantization, the OpenAI-compatible local API, and when local beats cloud.


By the end of this post, you'll have a large language model running on your own hardware, serving an OpenAI-compatible API that your existing code can hit without changing a single line. No API keys, no usage fees, no data leaving your machine.

Install LM Studio

LM Studio is a desktop app that downloads, manages, and serves local models through a clean GUI. It runs on macOS (Apple Silicon M1+), Windows (x64 and ARM64), and Linux (x64 and aarch64). Download it from lmstudio.ai and install like any other app.

LM Studio dropped Intel Mac support. If you're on an Intel Mac, skip ahead to the Ollama section below, or check out llamafile which runs on basically everything.

Once installed, you'll see a model search bar. Here's where it gets fun.

Download Your First Model

For your first model, I'd recommend Qwen3 8B or Llama 3.1 8B. Both run well on 16GB of RAM and produce genuinely useful output. On an M2 or newer Mac with 16GB unified memory, I typically see 20-30 tokens per second with either model.

In LM Studio's search bar, type Qwen3 8B and you'll see several quantization options. Pick the Q4_K_M variant. That's the sweet spot between quality and size for most hardware.

terminal
# If you prefer the CLI, LM Studio ships with `lms`:
lms get qwen3-8b-q4_k_m
lms load qwen3-8b-q4_k_m

The download is roughly 5GB, and once it finishes you can click the model to load it. You'll see memory usage in the bottom bar, and if the model fits entirely in your GPU or unified memory, inference will be noticeably faster than if layers spill over to the CPU.

Quick Sizing Guide

Here's what actually runs well on consumer hardware:

These assume Q4_K_M quantization. Apple Silicon gets an edge here because unified memory means the GPU can access all your system RAM directly, so a 36GB M3 Pro can comfortably run a 30B model that would need a dedicated GPU elsewhere. If you want to check exactly which models your hardware can handle, Can I Run AI? lets you plug in your specs and see what fits.
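If you'd rather do the arithmetic yourself, the sizing math is simple enough to sketch. This is a rough rule of thumb, not LM Studio's own logic, and the 50% usable-memory fraction is my own conservative assumption to leave room for the OS, the KV cache, and other apps:

```python
# Rough sizing: at Q4_K_M (~4.5 bits per weight), a model's weights take
# roughly params * 4.5 / 8 bytes. Only count part of your RAM as usable --
# the 0.5 fraction here is an assumption, not a measured figure.
def max_params_billions(ram_gb: float, bits_per_weight: float = 4.5,
                        usable_fraction: float = 0.5) -> float:
    usable_bytes = ram_gb * 1e9 * usable_fraction
    return usable_bytes * 8 / bits_per_weight / 1e9

print(f"{max_params_billions(16):.0f}B")  # ~14B-class models on a 16GB machine
```

That lines up with experience: an 8B model is comfortable on 16GB, a 14B is the practical ceiling, and 30B-class models want 32GB or more.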

Understanding Quantization

Quantization compresses model weights from 16-bit floats down to 4-bit or lower integers. You lose some quality, but the model fits in far less memory and runs faster. Here's what the labels mean when you're browsing models:

| Quantization | Bits/Weight | Quality | When to Use |
|---|---|---|---|
| Q2_K | ~2.6 | Poor | Only if nothing else fits |
| Q4_K_M | ~4.5 | Good | Default choice; best quality-to-size ratio |
| Q5_K_M | ~5.5 | Better | When you've got headroom |
| Q6_K | ~6.6 | Near-lossless | Plenty of VRAM |
| Q8_0 | ~8.0 | ~Full | Diminishing returns above Q6 |

My recommendation: start with Q4_K_M every time. If the output feels off, try Q5_K_M. I've rarely needed to go higher for development work.
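The bits-per-weight column translates directly into disk and memory footprint. A quick sketch of the arithmetic, counting weights only (the KV cache and runtime overhead add a couple more GB on top):

```python
# Memory footprint of the weights for a given parameter count and
# quantization level (bits per weight from the table above).
def footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for quant, bits in [("Q4_K_M", 4.5), ("Q5_K_M", 5.5), ("Q8_0", 8.0)]:
    print(f"{quant}: {footprint_gb(8, bits):.1f} GB")
# Q4_K_M: 4.5 GB
# Q5_K_M: 5.5 GB
# Q8_0: 8.0 GB
```

That 4.5GB figure for an 8B model at Q4_K_M matches the ~5GB download you saw earlier.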

Start the Local API Server

This is the part that makes local AI development practical. LM Studio runs an OpenAI-compatible API server on localhost:1234. Click the "Developer" tab in LM Studio and toggle the server on.

Your existing OpenAI SDK code now works with zero changes:

local_chat.py
from openai import OpenAI
 
client = OpenAI(
    base_url="http://localhost:1234/v1",  # highlight-line
    api_key="not-needed"  # highlight-line
)
 
response = client.chat.completions.create(
    model="qwen3-8b",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function that validates an email address using regex."}
    ],
    temperature=0.7
)
 
print(response.choices[0].message.content)

Same thing in TypeScript:

local_chat.ts
import OpenAI from "openai";
 
const client = new OpenAI({
  baseURL: "http://localhost:1234/v1", // highlight-line
  apiKey: "not-needed", // highlight-line
});
 
const response = await client.chat.completions.create({
  model: "qwen3-8b",
  messages: [
    { role: "system", content: "You are a helpful coding assistant." },
    { role: "user", content: "Explain the difference between Promise.all and Promise.allSettled." },
  ],
});
 
console.log(response.choices[0].message.content);

That's it. Two lines changed (base_url and api_key), and your code talks to a model running on your machine instead of OpenAI's servers. When you're ready to ship, swap those two lines back.
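One pattern I find handy: drive those two lines from an environment variable, so the same code runs against either backend. `LLM_BASE_URL` here is my own naming convention, not something the SDK reads:

```python
import os

def resolve_llm_config(env: dict) -> dict:
    """Return client kwargs: cloud when LLM_BASE_URL is set in the
    environment, the local LM Studio server otherwise. LLM_BASE_URL
    is my own convention, not an SDK-recognized variable."""
    return {
        "base_url": env.get("LLM_BASE_URL", "http://localhost:1234/v1"),
        "api_key": env.get("OPENAI_API_KEY", "not-needed"),  # local server ignores it
    }

# client = OpenAI(**resolve_llm_config(os.environ))
print(resolve_llm_config({})["base_url"])  # http://localhost:1234/v1
```

Unset the variable for local development, set it in production, and nothing else changes.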

LM Studio added Anthropic API compatibility in v0.4.1. If your codebase uses the Anthropic SDK, you can point it at localhost:1234/v1/messages the same way.
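Under that assumption, an Anthropic-style request is just a different payload shape against the same port. A minimal sketch with the standard library, assuming the `/v1/messages` endpoint mirrors Anthropic's public Messages schema; whether your build accepts it depends on the LM Studio version:

```python
import json
import urllib.request

# Anthropic Messages-style request. The payload shape follows Anthropic's
# public API (model, max_tokens, messages); the local endpoint is assumed
# from the compatibility note above.
payload = {
    "model": "qwen3-8b",
    "max_tokens": 512,
    "messages": [{"role": "user", "content": "Summarize GGUF in one sentence."}],
}

def call_local_messages(url: str = "http://localhost:1234/v1/messages") -> dict:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```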

Streaming Responses

For anything interactive, you'll want streaming. The local API supports it the same way OpenAI's does:

local_stream.py
from openai import OpenAI
 
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
 
stream = client.chat.completions.create(
    model="qwen3-8b",
    messages=[{"role": "user", "content": "Explain how garbage collection works in Go."}],
    stream=True  # highlight-line
)
 
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

Streaming from a local model often has lower first-token latency than a cloud API, because there's no network round trip and no inference queue to wait in. The main caveat is prompt processing: with a long context, a local machine can take longer to work through the prompt before the first token appears.
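If you want to put a number on that latency, time the first chunk. This small wrapper is my own helper, and it works with any iterator, including the stream object the OpenAI SDK returns:

```python
import time
from typing import Any, Iterable, Tuple

def time_to_first_token(stream: Iterable[Any]) -> Tuple[float, Any]:
    """Return (seconds until the first item arrived, that item).
    Pass it the stream from client.chat.completions.create(stream=True)."""
    start = time.perf_counter()
    first = next(iter(stream))
    return time.perf_counter() - start, first

# Works with any iterator, so you can sanity-check it without a model:
latency, chunk = time_to_first_token(iter(["hello"]))
```

Run it against the same prompt on your local server and on a cloud API, and you can compare first-token latency directly.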

Where to Find Models

Hugging Face is the main source. Filter by the gguf library tag to find models in the right format for local inference. The GGUF format was created by Georgi Gerganov (the author of llama.cpp) and has become the standard for quantized local models.

The best GGUF publishers to look for:

  • Unsloth - Consistently well-done quantizations, often among the first to publish new models in GGUF format.
  • Bartowski - Another reliable publisher with a wide model catalogue.
  • LM Studio's official repos - Curated and tested specifically for LM Studio compatibility.

Ollama's model library is the other major source. Ollama packages GGUF weights with its own Modelfile format and pulls models with a single command:

terminal
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
 
# Pull and run a model
ollama pull qwen3:8b
ollama run qwen3:8b

Ollama is CLI-first and developer-oriented. If you're building a service that needs to run headless (no GUI), Ollama and LM Studio's llmster daemon (added in v0.4.0) are both solid choices.

Models Worth Trying

As of March 2026, here's what I'd actually recommend:

  • Qwen3 8B - Best general-purpose model at this size. Good at code, reasoning, and conversation.
  • Qwen3.5 35B-A3B - A Mixture of Experts model with 35B total parameters but only 3B active per token. Runs like a 3B model, thinks like a 35B. Wild.
  • DeepSeek-R1 8B - Strong reasoning and chain-of-thought capabilities.
  • Gemma 3 12B - Google's open model. Solid all-rounder.
  • Llama 3.1 8B - Meta's workhorse. Massive community, tons of fine-tunes available.

The MoE models (like Qwen3.5 35B-A3B) are the most interesting trend right now. They pack more capability into less active compute, which makes them perfect for local inference where every gigabyte of RAM counts.

When Local Beats Cloud

Running models locally isn't always the right call. Here's when it genuinely makes sense:

Privacy and compliance. If you're working with medical records, legal documents, financial data, or anything subject to GDPR, HIPAA, or government data residency rules, local inference means the data never leaves your network. No API terms of service to audit, no third-party data processing agreements.

Prototyping and iteration. When I'm testing prompts, I don't want to think about cost. Local models let you run hundreds of variations without watching a billing dashboard. The feedback loop is faster too, since there's no rate limiting.

Offline development. Whether you're on a flight, dealing with spotty hotel wifi, or working in an air-gapped environment, a local model doesn't care about your internet connection.

High-volume inference. If you're processing thousands of documents or running batch evaluations, the per-token cost of cloud APIs adds up fast. Local inference costs electricity and hardware depreciation, which is dramatically cheaper at scale.
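A back-of-envelope comparison makes the point. Every number below is a placeholder assumption, so swap in your actual token volume, API pricing, machine wattage, and electricity rate:

```python
# Back-of-envelope: cloud per-token cost vs. local electricity for a batch
# job. All figures are illustrative assumptions, not quoted prices.
tokens = 50_000_000                      # 50M tokens of batch processing
cloud_price_per_mtok = 2.50              # $ per million tokens (assumed)
watts, hours, kwh_price = 150, 40, 0.20  # machine draw, runtime, $/kWh (assumed)

cloud_cost = tokens / 1e6 * cloud_price_per_mtok
local_cost = watts / 1000 * hours * kwh_price

print(f"cloud: ${cloud_cost:.2f}, local electricity: ${local_cost:.2f}")
# cloud: $125.00, local electricity: $1.20
```

Hardware depreciation narrows the gap, but at sustained volume the local side still wins by a wide margin.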

Cloud APIs still win when you need frontier-quality output (GPT-5, Claude Opus), very long context windows, or tasks where the absolute best model matters more than cost. In practice, I use both: local for iteration and privacy-sensitive work, cloud for the tasks where quality is non-negotiable.

What's Next

A year ago, running models locally still felt like a hobbyist pursuit: fiddly setup, mediocre output, and compatibility issues with every other model. That's genuinely changed. The tooling works, the models are good enough for real development tasks, and the ecosystem around GGUF and llama.cpp continues to mature.

I run a local model daily for code review, draft writing, and prompt prototyping before committing to cloud API calls. The whole setup took about ten minutes, and the savings in both money and privacy compound from day one.

If you're looking for help integrating local or cloud AI into your development workflow, that's exactly what we do at Corvus Tech.

Sources

  • LM Studio — Desktop app for downloading, managing, and serving local LLMs with an OpenAI-compatible API.
  • LM Studio Documentation — Official docs covering the local API server and supported features.
  • Ollama — CLI-first tool for running local models, with its own model library and Modelfile format.
  • Hugging Face GGUF Models — Primary source for quantized models in the GGUF format used by llama.cpp and LM Studio.
  • llama.cpp — The inference engine behind most local LLM tools, created by Georgi Gerganov.
  • llamafile — Mozilla's single-file executable for running LLMs on any platform, including Intel Macs.
  • Can I Run AI? — Hardware compatibility checker for local model sizing.
  • Unsloth — Reliable publisher of well-optimised GGUF quantizations on Hugging Face.
