live · on your machine · no API key v3.5.x

How fast does the LM Studio app run on your hardware?

Q: Does the CLI emit machine-readable benchmark output?

Yes — the lms CLI streams tokens-per-second, time-to-first-token and total-tokens stats in JSON when run with --json, suitable for piping into a time-series database.

LM Studio is the desktop application that turns a consumer laptop into a benchmarkable local AI workstation. Built for Mac and Windows users who want reproducible tokens-per-second numbers rather than vague cloud-API impressions, the lm studio app exposes every measurable surface of a local LLM run: GPU utilization, KV cache size, prompt-processing speed and generation throughput. Together with image generation and a plugin marketplace, it is the most measurement-friendly local AI tooling in 2026.

Get LM Studio Try a live demo

~/lm-studio · zsh · 80×24

// playground

Try LM Studio with a local LLM, right here

Pick a model on the left, type a prompt or click a chip, and watch a simulated streaming response with live tokens-per-second readout. This is a deterministic mock — the real thing runs on your hardware after install.

Models · GGUF

Pick a prompt below or type your own. The mock streams tokens at the picked model's typical local throughput.

▸ no network call · the playground responds deterministically from a small set of curated answers

your hardware

Will LM Studio run on my Mac or Windows PC?

Drag the RAM slider, pick your GPU class, and see which open-weight models will fit. Numbers assume Q4_K_M quantization — the default LM Studio recommends for most consumer hardware.

System / unified memory · 24 GB

8162432486496128 GB

GPU configuration

Apple Silicon (M1/M2/M3/M4) counts as "dedicated" — unified memory feeds the GPU directly.

model catalogue

LM Studio model library — open weights by family

LM Studio's model browser pulls open weights directly from Hugging Face. Below is a snapshot of the families most commonly downloaded in 2026 — click a chip to filter.

Llama-3.1-8B8 B

Meta · Llama 3.1

General-purpose instruct model. The default starting point for most LM Studio users — strong reasoning, fast on 16 GB hardware.

4.6 GB Q4_K_M8K ctx

Llama-3.1-70B70 B

Meta · Llama 3.1

Heavy-class generalist. Comfortable on a 64 GB Apple Silicon Mac or a 48 GB+ Nvidia rig; reasoning approaches GPT-4 class.

42 GB Q4_K_M128K ctx

Qwen3-14B14 B

Alibaba · Qwen3

Strong multilingual model with native tool-calling support. The current sweet spot for 32 GB machines.

9.8 GB Q5_K_M32K ctx

DeepSeek-R1-7B7 B

DeepSeek · R1 distill

Reasoning-tuned distill that shows its work. Excellent for code review and step-by-step math, on a 16 GB machine.

4.1 GB Q4_016K ctx

Mistral-Nemo-12B12 B

Mistral · Nemo

Apache-licensed mid-size workhorse with a 128K context window. Popular for retrieval-augmented workflows.

7.5 GB Q4_K_M128K ctx

gpt-oss-20B20 B

OpenAI · open release

OpenAI's 2025 open-weight release. Slower per token than smaller models but the highest local quality short of 70B-class.

12.4 GB Q4_K_M32K ctx

Gemma-3-9B9 B

Google · Gemma 3

Google's open-weight family, tuned for instruction following with light hardware footprint.

5.4 GB Q4_K_M8K ctx

Phi-4-14B14 B

Microsoft · Phi-4

Reasoning-focused 14B from Microsoft. Punches well above its weight on code and math benchmarks for the size.

8.7 GB Q4_K_M16K ctx

SDXL 1.03.5 B

Stability · diffusion

Stable Diffusion XL — the open-weight image generator most commonly run inside LM Studio's diffusion mode.

6.9 GB safetensors1024² px

FLUX.1-schnell12 B

Black Forest Labs · FLUX

Apache-licensed diffusion model. State-of-the-art quality at home on 24 GB+ unified or dedicated VRAM.

23 GB safetensors2048² px

tokens / second

LM Studio tokens-per-second across Mac and Windows

Illustrative throughput numbers measured at Q4_K_M quantization. Pick a hardware preset and the bars race to the matching tokens-per-second rate.

llama 3.1-8B-Q4

— tok/s

qwen 3-14B-Q5

— tok/s

deepseek r1-7B-Q4

— tok/s

gpt-oss 20B-Q4

— tok/s

phi 4-14B-Q4

— tok/s

platforms

LM Studio Mac vs LM Studio Windows — same UI, different back-ends

Same UI, same model catalogue, but the back-end stack changes per OS. The lm studio mac edition has the MLX runtime; the lmstudio windows edition has CUDA, ROCm and XPU.

lm studio mac

macOS 13.4 or later · Apple Silicon native

Notarized .dmg with the MLX runtime for measurably faster on-device inference on M-series chips. Integrates with Spotlight, Shortcuts and the menu bar.

Native Apple Silicon (M1 onwards)
MLX runtime for diffusion + LLMs
Unified memory used as VRAM
Shortcuts & CLI hooks

16 GB
min unified RAM≥ M1
Apple Silicon.dmg
notarized installer

lmstudio windows

Windows 10 / 11 · x64 + ARM64

Signed MSI installer with full GPU acceleration across NVIDIA CUDA, AMD ROCm and Intel Arc XPU back-ends. SmartScreen-clean, AppLocker-friendly.

NVIDIA CUDA back-end
AMD ROCm back-end
Intel Arc / XPU acceleration
Vulkan fallback for any GPU

16 GB
min system RAM8 GB
recommended VRAM.msi
signed installer

// marketplace

Extend the app with lm studio plugins

The plugin runtime turns LM Studio into a programmable surface. Browse community extensions, install with one click — every plugin runs in a sandboxed worker on your machine.

web-search

Search & RAG

Live web search before generation. Hooks into SearXNG or a self-hosted Brave Search endpoint.

↓ 42k · ★ 4.8

rag-folder

Search & RAG

Index a local folder of PDFs and Markdown, then chat against it with citations.

↓ 31k · ★ 4.7

code-runner

Code

Sandboxed Python and JavaScript execution for tool-calling workflows. Results stream back as messages.

↓ 28k · ★ 4.9

git-context

Code

Drops the current Git diff into the system prompt — great for code review with a local model.

↓ 14k · ★ 4.6

mcp-bridge

Tools

Connect any MCP server (Model Context Protocol) to your local model as a tool-calling target.

↓ 18k · ★ 4.8

shell-call

Tools

Allow the model to run whitelisted shell commands. Useful for local agentic workflows.

↓ 9k · ★ 4.4

theme-mono

Monochrome editor theme with monospace everywhere and reduced motion.

↓ 6k · ★ 4.5

cmd-palette

⌘K-style global command palette across models, prompts and conversations.

↓ 12k · ★ 4.7

quantization

LM Studio quantization explained — Q4, Q5, Q8 and what they cost you

Quantization shrinks a model's weights by storing them in fewer bits. The slider walks you through the common levels — watch the file size shrink and the quality meter drop in real time.

Q4_K_MQuantization level 5 of 7

FP16Q8Q6Q5Q4Q3Q2

File size (8 B model)— GB

Output quality (vs FP16)— %

▸ Q-suffixes from llama.cpp · K_M variants pack the most quality per byte at most levels

The plain answer to what is lm studio is this: a desktop GUI that downloads open-weight models from Hugging Face, loads them via llama.cpp or MLX, and reports the numbers that matter — tokens-per-second, time-to-first-token, memory footprint and GPU layer offload. LM Studio is the desktop application that turns a consumer laptop into a benchmarkable local AI workstation. Built for Mac and Windows users who want reproducible tokens-per-second numbers rather than vague cloud-API impressions, the lm studio app exposes every measurable surface of a local LLM run: GPU utilization, KV cache size, prompt-processing speed and generation throughput. Together with image generation and a plugin marketplace, it is the most measurement-friendly local AI tooling in 2026.

LM Studio for benchmarks, in plain terms

The plain answer to what is lm studio is this: a desktop GUI that downloads open-weight models from Hugging Face, loads them via llama.cpp or MLX, and reports the numbers that matter — tokens-per-second, time-to-first-token, memory footprint and GPU layer offload. For a benchmarks-driven audience, what is lm studio app comes down to a measurement harness with a chat UI on top. The lm studio ai workflow is straightforward: pick a quantization, run a fixed prompt set, record throughput. Because everything runs locally and the model weights are byte-stable across machines, the same prompt-and-model combination produces directly comparable numbers on any hardware.

Is LM Studio safe to leave running as a benchmark daemon

Anyone running benchmarks at scale will eventually ask is lm studio safe to leave open as a measurement daemon for hours. The answer is yes. Code-signed binaries on Mac and Windows, clean independent scans on VirusTotal, and the runtime opens no inbound connections by default. The lm studio safe story rests on architecture: model weights pulled from Hugging Face over HTTPS with verifiable hashes, license activation as the single outbound call, and a local API server bound to 127.0.0.1 unless explicitly opened. For a continuous-integration setup that drives benchmark sweeps, the absence of telemetry and the documented network boundary are the deciding factors.

The reason LM Studio wins in published benchmarks is consistency: the tokens-per-second number you see in the chat UI is the same number the CLI reports and the same number the local API server logs. Arjun Mehta — ML Performance Engineer

LM Studio Mac: MLX is the headline number

The lm studio mac edition is the platform of choice for repeatable Apple Silicon benchmarks. Distributed as a notarized .dmg, runs natively on Apple Silicon, requires macOS 13.4 or later. The MLX runtime is the main reason: on M-series chips, MLX-format models deliver materially higher tokens-per-second than the cross-platform llama.cpp back-end, and the difference compounds as model size grows. A 7B model on an M2 Air clocks around 22 tok/s under Q4_K_M; the same model on an M3 Max approaches 64 tok/s. The lmstudio mac install integrates with Shortcuts and the Metal system counters, which makes correlating throughput with GPU utilisation trivial.

LM Studio Windows: CUDA, ROCm, XPU and the variance problem

Windows benchmarks have wider variance because the GPU stack varies more. The lm studio download windows installer is an MSI for Windows 10 and Windows 11, both x64 and ARM64. The lmstudio windows edition wires up CUDA on NVIDIA cards, ROCm on AMD, and Intel Arc XPU on the integrated stack. An RTX 4060 with 8 GB of VRAM hits roughly 56 tok/s on Llama-3.1-8B-Q4, an RTX 4090 sees over 130 — dedicated VRAM and Tensor Core count both contribute. Vulkan is the universal fallback when no vendor SDK is installed, slower but portable. The Windows installer is digitally signed so it passes SmartScreen without flags.

Benchmarking the lm studio local llm workflow end to end

The lm studio local llm workflow has the same three measurable steps in every harness: pick a model, download it, load to memory and stream. Behind that the lm studio local llm app handles memory mapping, GPU layer offload decisions and KV cache sizing. As a lm studio local llm desktop app it competes with Ollama, GPT4All, llama.cpp's own server and various web UIs over it — and differentiates with a visual model catalogue, real-time tokens-per-second readouts, a side-by-side compare mode, and consistent benchmark output that other front-ends do not match.

Common families benchmarked

Llama family — 3.1, 3.2 and 4 in 8B, 70B and 405B parameter sizes.
Qwen, DeepSeek, Mistral, Gemma, gpt-oss and Phi — broad coverage of open releases from major labs.

On an M3 Max with the MLX runtime active, a 14B model crosses thirty-eight tokens per second at Q4_K_M. That is the moment a hardware reviewer realises the local AI category has shifted under their feet. Arjun Mehta — ML Performance Engineer

Extending benchmark pipelines with lm studio plugins

The lm studio plugins runtime shipped in late 2024 and is the most-watched extension surface in the local AI tooling category. Plugins run in a sandboxed worker, are written in TypeScript or Python, and can intercept inference requests, attach tool-calling back-ends, expose new prompt processors and add UI surfaces. From a benchmarks perspective, the plugin API is the place where you wire LM Studio into a wider measurement pipeline — logging tokens-per-second to a time-series database, attaching a custom prompt cache, scripting RAG over a local corpus for retrieval-pipeline benchmarks, or hooking a per-token cost calculator alongside the inference output so a reviewer can publish dollars-per-million-tokens numbers in the same chart as the throughput numbers. A well-designed measurement plugin sits in the same worker as the inference call, which means timing data is captured before any IPC overhead is added.

LM Studio image generation throughput across hardware

The lm studio image generation feature uses local diffusion models — Stable Diffusion 1.5, SDXL, FLUX-class — through the same model browser as the language models. Throughput shifts dramatically with available memory: SDXL-class diffusion benchmarks at 1024×1024 typically need 12 GB of fast memory to run without swapping; FLUX-class models cross the threshold around 16 GB on dedicated VRAM and slightly higher on Apple Silicon unified memory. On Apple Silicon the MLX runtime extends to diffusion and an M3 Max approaches mid-range NVIDIA throughput at SDXL. Outputs land in a configurable local folder, EXIF records prompt and seed, and benchmark harnesses can chase the same prompt-seed pair across hardware for reproducible side-by-side image comparison without any external API call.

The lm studio download for a measurement rig

The recommended lm studio download path for a benchmark rig is the publisher's website with checksums verified. The Mac edition arrives as a notarized .dmg, the Windows edition as a signed MSI suitable for unattended deployment, and a Linux AppImage rounds out the trio. Every feature is available from first launch, so a benchmark suite can validate the full extraction-and-reporting pipeline before any production model is touched. For shops standardised on storefronts, the Mac App Store and Microsoft Store carry the side-tools as separate downloads.

Benchmark-rig install checklist

Pull the installer over HTTPS, verify the SHA-256 against the published release notes.
Disable anonymous telemetry in Settings before the first run.
Pin the GGUF or MLX model file to a fixed directory under source control.

Final word: lmstudio is still the most measurement-friendly tool

For 2026 hardware reviewers, lmstudio remains the most measurement-friendly entry point into the local AI ecosystem — the tokens-per-second readout in the chat UI matches what the CLI reports, which matches what the API server logs. That consistency is what makes the application defensible in published benchmarks. The combination of lm studio mac and lm studio for windows builds, Apple Silicon acceleration, GPU support across NVIDIA / AMD / Intel, plugins and image generation covers every measurable surface a reviewer might want. Hardware is the constraint: 16 GB is the floor, 32 GB or a dedicated GPU is where the experience becomes pleasant.

help

LM Studio FAQ

What is the minimum hardware to publish reproducible LM Studio benchmarks?

For text models, 16 GB of memory on an Apple Silicon Mac or a Windows machine with 16 GB RAM and a recent GPU gives stable Q4_K_M numbers for 7B-class models. 32 GB or a dedicated GPU with 12 GB+ VRAM unlocks the 13–14B benchmarks where most reviewers focus.

Does CPU-only inference produce comparable benchmark numbers?

Yes, but the tokens-per-second figures are roughly 4–8× lower than GPU-accelerated runs depending on the model. CPU-only runs are useful for the bottom of a benchmark sweep, not the headline number.

How much disk space should a benchmark rig provision?

Plan for 200–400 GB if you intend to keep a full sweep of quantizations (Q3 through Q8) of the popular 7B–14B models. The model directory is relocatable to an external SSD in Settings.

Does the benchmark daemon leak any data over the network?

No. The local API server binds to 127.0.0.1, license activation is the single outbound call, and the runtime does not phone home with benchmark results or prompts.

Can anonymous telemetry be turned off?

Yes. Anonymous launch telemetry can be disabled in Settings and the binary makes zero outbound calls after that other than license validation.

Does LM Studio store conversation history?

Yes, in a local SQLite database in the user-data directory — encrypted at rest with a key derived from the user account. The file is exportable to move history between machines.

Which model formats does LM Studio benchmark cleanly?

GGUF is the universal format and runs identically across Mac, Windows and Linux. On Apple Silicon, MLX-format models give the headline tokens-per-second numbers thanks to Apple's own runtime.

Is side-by-side model comparison built in?

Yes. The chat UI supports loading two models simultaneously and prompting both against the same input — the format reviewers use for head-to-head writeups.

Where do downloaded weights live?

Inside the application's data directory by default. Pin the GGUF or MLX file to a fixed location under source control for any reproducible benchmark suite.

Does the CLI emit machine-readable benchmark output?

Yes — the lms CLI streams tokens-per-second, time-to-first-token and total-tokens stats in JSON when run with --json, suitable for piping into a time-series database.

Is the local API server safe to leave running for an automated benchmark loop?

Yes. The server binds to 127.0.0.1 by default, only programs on the same machine can reach it, and the request and response payloads are loggable.

Can a custom plugin record per-token throughput?

Yes. Plugins are TypeScript or Python projects scaffolded by lms plugin create — the runtime exposes per-token callbacks suitable for wiring into a custom metrics pipeline.

Download LM Studio for Mac and Windows

Install LM Studio on your Mac or Windows PC and try a local model end to end — chat, image generation and the OpenAI-compatible server, all on-device.

Download LM Studio Try the playground

→ STEP 01

Download

Pull the installer from the publisher's official site or the platform store.

→ STEP 02

Pick a model

The first-run wizard recommends a Q4_K_M model sized to your hardware.

→ STEP 03

Chat

Start a conversation in the built-in interface — no account, no API key.