Loading article...

Gemma 4: A Powerful New Weapon for SoloTagger

Qwen3.5 9B + SoloTagger is currently the only captioning setup I run locally on my laptop, and I use it almost every day. After Google released Gemma 4, I gave it a try as well, and the results were honestly pretty impressive.

Gemma 4 currently comes in four versions: E2B, E4B, 26B A4B, and 31B. The 31B model is just too large for my laptop. E2B and E4B both run easily on a laptop and are very fast, but like Qwen3.5 2B and 4B, models that small do not follow prompts especially well. So the one I mainly went with was 26B A4B.

Here are my quick impressions after trying it out.

I. Basic Info

### 1. Hardware
Same as usual: I am still using my old low-spec laptop. The basic setup is:
- CPU: Intel Core Ultra 5 225H
- RAM: 32GB
- GPU: NVIDIA GeForce RTX 5060 Laptop
- VRAM: 8GB
### 2. Models Compared
I used SoloTagger to run a simple speed test on three models:

  • gemma-4-26B-A4B-it-UD-IQ2_M.gguf, 9.28GB, referred to below as Gemma 4 26B
  • Qwen3.5-9B-UD-Q5KXL.gguf, 6.28GB, referred to below as Qwen3.5 9B
  • llama-joycaption-beta-one-hf-llava.i1-Q6_K.gguf, 6.14GB, referred to below as JoyCaption Beta One

These three models differ in both size and quantization level, so strictly speaking, they are not really meant to be compared side by side. This is not intended as a proper benchmark of the models themselves. I am only doing a simple comparison of how they perform on the very specific task of generating image captions on a laptop.

### 3. LLM Runtime
LM Studio0.4.10 (Build 1),CUDA 12 llama.cpp (Windows) v2.12.0
The newer CUDA 12 version of llama.cpp includes optimized support for Gemma 4, so I would recommend updating to the latest version.

### 4. Notes
All the numbers below are based on the same 512 × 512 PNG image, using the same prompt in SoloTagger, with captions generated separately by the three models. The data comes from the LM Studio log output.

II. Speed and Runtime

Execution Time chart

Token Processing Speed chart

The two charts above make a few things pretty clear:

  1. Qwen3.5 9B is the fastest, while Gemma 4 26B is the slowest.
  2. All three models finish a single image within 4 seconds, and even the slowest one, Gemma 4 26B, only takes about 1 second longer than the fastest.

So in terms of speed, all three are totally acceptable. Even the slowest one, Gemma 4 26B, would only need about 6 minutes to process 100 images of this kind.

Gemma 4 26B spends more time in the preprocessing stage, and I think there is still room for that to improve with future optimization.
## III. Prompt Following and Caption Quality

The reason I gave up on the faster Qwen3.5 2B and 4B earlier was that both their prompt following and caption quality were pretty weak.
Out of these three models, my subjective impression right now is:
Gemma4 26B > Qwen3.5 9B > JoyCaption Beta One

The caption quality of Qwen3.5 and JoyCaption Beta One is fairly similar, but Qwen3.5 does a somewhat better job of following the prompt.

Thanks to its larger size, Gemma 4 26B is clearly better than Qwen3.5 9B in both prompt following and caption quality. For me, that makes the slightly longer runtime absolutely worth it.

## IV. Recommendations
Overall, for a relatively simple task like image captioning, all three models can get the job done, so the best choice really depends on your actual needs.

  1. If your hardware is strong enough, just go with Gemma 4 26B or even a larger model without overthinking it.
  2. If your prompt is simple and you only need a basic description of the image, or even just tag-style output, then Qwen3.5 9B and JoyCaption Beta One are both solid options, and they are faster too.
  3. If you have more complex requirements and are using a longer prompt, then Gemma 4 26B is clearly the best choice out of these three.

Finally, if you are interested in SoloTagger, you can now download it from GitHub and give it a try:
https://github.com/sololo-xyz/SoloTagger

As long as I am still using SoloTagger, I will keep improving and refining it while keeping it simple. Suggestions for new features or further optimization are always welcome.

You Might Also Like View All
Cover: SoloCropper: Efficient Human Image Cropping for Dataset Preparation
SoloCropper: Efficient Human Image Cropping for Dataset Preparation
Cover: Z Image LoRA Commissions Now Open
Z Image LoRA Commissions Now Open
Cover: Price Update for Model Orders
Price Update for Model Orders
Cover: SoloTagger v0.20: Improved Caption Output Quality
SoloTagger v0.20: Improved Caption Output Quality
Cover: SoloTagger v0.12: Easier Prompt Editing
SoloTagger v0.12: Easier Prompt Editing
Cover: SoloTagger: Local JoyCaption Beta One GGUF Setup on Windows via LM Studio
SoloTagger: Local JoyCaption Beta One GGUF Setup on Windows via LM Studio
Cover: Basic Workflow: FLUX.2-klein I2I v2.2  |  4-in-1 image editing
Basic Workflow: FLUX.2-klein I2I v2.2 | 4-in-1 image editing
Cover: Training Notes 2: Z Image LoRA - Phase Recap
Training Notes 2: Z Image LoRA - Phase Recap
Cover: The new version is coming. What surprises will it bring to the world this time?
The new version is coming. What surprises will it bring to the world this time?
Cover: Training Notes: My Take on Klein & Z Image
Training Notes: My Take on Klein & Z Image
Cover: Basic Workflow: Z Image T2I v1.3
Basic Workflow: Z Image T2I v1.3
Cover: Basic Workflow: FLUX.2-klein I2I v2.0  |  4-in-1 image editing
Basic Workflow: FLUX.2-klein I2I v2.0 | 4-in-1 image editing