Domesticating a local language model: Configuring and customising homebrew personal assistant LLMs

You've captured an AI model on a consumer-level computer. So what do you do with it next?

ChatGPT's depiction of an army of helpful agents (and one rather sad-looking one we couldn't get it to stop generating)

In the first of Machine's two-part series on building local language models on a home computer published last week, we walked you through the process of installing LM Studio and running your first LLM.

Now it's time to take the next step.

While the default settings provide a good starting point, unlocking the full potential of your AI assistant requires configuring and personalising it to suit specific tasks. Maximising performance and customising behaviour are key to getting the most out of running an LLM locally.

From temperature to frequency penalties: Understanding model parameters

Let’s examine in more detail the settings we briefly covered in the first article: Temperature, Top P, and the Frequency and Presence Penalties.

When thinking about these variables, it’s important to remember that an AI doesn’t think (in the way we do); it uses its vast trained model to take a prompt and predict which words (or code, or whatever) are most likely to follow the preceding ones – hence the references to probabilities. As such, these parameters control how the model generates text, influencing creativity, coherence and repetition, and are rooted in some fundamental mathematical principles:

  • Temperature: This parameter controls the randomness of the AI's output. Mathematically, it scales the probabilities associated with each possible next token (word or part of a word). A higher temperature flattens this probability distribution, making less likely tokens more probable. A lower temperature sharpens the distribution, favouring the most likely tokens. Think of it like adjusting the "focus" of the model's choices: a low temperature keeps the model tightly focused and predictable, while a high temperature loosens the focus and encourages exploration (the short sketch after this list shows the effect on a toy probability distribution).
  • Top P (Nucleus Sampling): Instead of considering all possible tokens, Top P focuses on a cumulative probability threshold. It selects the smallest set of tokens whose probabilities add up to at least P. For example, if P is 0.9, it considers only the tokens that account for 90% of the total probability mass. This prevents the model from generating extremely unlikely or nonsensical text while still allowing for creativity. A lower value makes the model more focused and deterministic, while a higher value allows for more diverse and creative outputs. 
  • Frequency Penalty & Presence Penalty: These parameters are designed to combat repetition and unwanted biases. They adjust the probabilities of tokens based on how frequently they've already appeared in the generated text (frequency) or simply whether they’re present at all (presence). They essentially ‘penalise’ the model for repeating itself or using certain words/phrases too often.
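To make the maths a little more concrete, here's a minimal, self-contained Python sketch (not tied to LM Studio or any particular model) that applies temperature scaling and Top P filtering to an invented set of next-token scores. The tokens and their scores are made up purely for illustration.

```python
import math

# Invented example: raw scores ("logits") a model might assign to candidate next tokens
logits = {"cat": 2.0, "dog": 1.5, "banana": 0.2, "quasar": -1.0}

def softmax_with_temperature(scores, temperature):
    """Convert raw scores into probabilities, scaled by temperature.
    Lower temperature sharpens the distribution; higher temperature flattens it."""
    scaled = {tok: s / temperature for tok, s in scores.items()}
    total = sum(math.exp(v) for v in scaled.values())
    return {tok: math.exp(v) / total for tok, v in scaled.items()}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p (nucleus sampling)."""
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalise so the surviving probabilities sum to 1
    total = sum(kept.values())
    return {tok: prob / total for tok, prob in kept.items()}

for t in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    formatted = ", ".join(f"{tok}: {p:.2f}" for tok, p in probs.items())
    print(f"temperature={t}: {formatted}")

print("top_p=0.9 keeps:", top_p_filter(softmax_with_temperature(logits, 1.0), 0.9))
```

Run it and you'll see the low-temperature distribution concentrating most of the probability on the top token, the high-temperature one spreading it out, and Top P discarding the least likely candidates before anything is sampled.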
Here's what happened when we asked ChatGPT to cheer up the sad robot in the title image. Now they're all too happy!

Prompt engineering techniques

As well as adjusting the model’s parameters to suit your requirements, the quality of your prompts directly impacts the AI’s responses. Here are some prompt techniques that can help ensure you get the output you’re looking for:

  • Zero-shot Prompting: Simply asking a question without providing any examples (e.g., "Translate 'hello' to Spanish").
  • Few-shot Prompting: Providing a few example input-output pairs to guide the AI (e.g., “English: Hello, Spanish: Hola. English: Goodbye, Spanish: Adios. English: Thank you, Spanish: …”). This helps the model understand the desired format and style of response.

READ MORE: OpenAI boss Sam Altman vows to fix ChatGPT's em-dash addiction (and finally end LinkedIn's "is this AI writing" debate)

  • Chain-of-Thought Prompting: Encouraging the AI to explain its reasoning step-by-step, leading to more accurate and insightful answers. (e.g., "Solve this math problem by first outlining your steps."). This is particularly useful for complex tasks like coding or logical reasoning.
  • System Prompts: Use system prompts to define the AI's persona, role, or constraints. For example: “You are a helpful chatbot assistant specialising in astrophysics.” System prompts set the stage for the entire conversation and significantly influence the model’s behaviour. The sketch after this list combines a system prompt with a few-shot example.
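To tie the parameters and prompting techniques together, here's a minimal sketch that sends a system prompt and a few-shot example to a model running in LM Studio. It assumes you've enabled LM Studio's local server on its default port (1234) and installed the `openai` Python package; the model name, prompts and parameter values are just illustrative placeholders.

```python
from openai import OpenAI

# LM Studio exposes an OpenAI-compatible server; point the client at it
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # placeholder; LM Studio serves whichever model you have loaded
    messages=[
        # System prompt: sets the persona for the whole conversation
        {"role": "system", "content": "You are a helpful chatbot assistant specialising in astrophysics."},
        # Few-shot example: shows the format and style we want
        {"role": "user", "content": "Define in one sentence: black hole"},
        {"role": "assistant", "content": "A black hole is a region of spacetime whose gravity is so strong that nothing, not even light, can escape it."},
        # The real question, with a chain-of-thought nudge
        {"role": "user", "content": "Define in one sentence: neutron star. Briefly outline your reasoning first."},
    ],
    temperature=0.7,        # moderate creativity
    top_p=0.9,              # nucleus sampling threshold
    frequency_penalty=0.3,  # discourage word-for-word repetition
    presence_penalty=0.3,   # discourage returning to the same topics
)

print(response.choices[0].message.content)
```

Because the local server speaks the same API as OpenAI's hosted models, most example code written for those services can usually be pointed at your own machine simply by changing the base URL.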

Maximising performance and optimising your local LLM

Without a data centre full of servers to rely on, getting the most out of your local LLM requires careful optimisation. To that end, here are some considerations for maximising performance, regardless of your hardware:

  • Model Quantisation: Reducing a model's size and memory footprint is crucial for efficient operation. Lower quantisations are faster but may sacrifice some accuracy. Trying out different quantisation levels (Q4_K_M, Q5_K_M) will help you find the best balance between performance and quality (a rough sizing sketch follows this list).
  • Threading/Core Utilisation: LM Studio allows you to specify the number of threads used for inference. Experimenting with different thread counts (typically equal to or slightly less than your CPU core count) will help you find the best performance. Too many threads can actually decrease performance due to overhead.

READ MORE: ChatGPT Agent excels at finding ways to "cause most harm with least effort", OpenAI reveals

  • Background Processes: LLMs are resource hungry, so closing any unnecessary applications running in the background can free up system resources. Monitor your resource usage with your operating system's task manager. Similarly, keeping your operating system up to date with the latest drivers and patches, or disabling visual effects, can free up further resources.
  • Model Selection: Some models are inherently more efficient than others, even at the same size. Research model benchmarks and choose models known for their performance on common architectures. If your model is really slow, try looking for ones specifically optimised for inference speed.
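As a rough guide to how quantisation level, thread count and hardware interact, the sketch below reports your core count and free memory and estimates how big a 7-billion-parameter model would be at a few common quantisation levels. It assumes the third-party `psutil` package is installed, and the bits-per-weight figures are approximate, so treat the numbers as ballpark only.

```python
import os
import psutil  # third-party package: pip install psutil

logical_cores = os.cpu_count() or 1
free_ram_gb = psutil.virtual_memory().available / (1024 ** 3)

print(f"Logical CPU cores: {logical_cores}")
print(f"Available RAM: {free_ram_gb:.1f} GB")
print(f"Suggested starting thread count: {max(1, logical_cores - 1)} (then experiment up or down)")

# Approximate footprint of a 7-billion-parameter model at different quantisation levels.
# Bits-per-weight values are rough averages and vary between implementations.
params_billions = 7
approx_bits_per_weight = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

for level, bits in approx_bits_per_weight.items():
    size_gb = params_billions * 1e9 * bits / 8 / (1024 ** 3)
    verdict = "fits" if size_gb < free_ram_gb else "probably too big"
    print(f"{level}: ~{size_gb:.1f} GB -> {verdict} in available RAM (leave headroom for context)")
```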
A third attempt at generating an image for this article with ChatGPT, which responded to our order to cheer up the sad, solitary, frowning robot by giving it a partner in misery. Will you have more luck in getting a homegrown model to obey orders?

Advanced customisation: Low-Rank Adaptation (LoRA)

LoRA is a relatively new technique that allows you to fine-tune LLMs for specific tasks without the need to retrain the entire model. This significantly reduces computational requirements and storage space, making it accessible even on systems with limited resources. Instead of modifying the billions of parameters that make up the original model, LoRA introduces a small set of trainable parameters that adapt the model’s behaviour to the new task.

  • How LoRA Works: Imagine the original LLM as a vast, complex network. LoRA adds "shortcut" connections within this network. During fine-tuning, only these shortcut connections are adjusted, while the original weights remain frozen.
  • Benefits of LoRA:
    • Reduced Training Time & Cost: Training is much faster and requires less computational power.
    • Smaller Model Size: LoRA adapters are tiny compared to the full model (often just a few hundred megabytes).
    • Task-Specific Customisation: You can create multiple LoRA adapters for different tasks without needing separate copies of the base LLM.
  • Using LoRA with LM Studio: LM Studio supports loading and applying LoRA adapters alongside the base model. You’ll find LoRA adapters on Hugging Face Hub, often shared by the community. When browsing models, look for those tagged with "LoRA." The adapter will specify which base model it's designed to work with.

For example, let’s say you want to create a chatbot that specialises in writing marketing copy. You could download a LoRA adapter trained on a dataset of successful ad campaigns and apply it to Mistral-7B. This would effectively "teach" the model how to write marketing text, without needing to retrain the entire Mistral-7B model from scratch.
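For readers who want to peek under the bonnet, here's a minimal sketch of how such an adapter is wired up using the Hugging Face Transformers and PEFT libraries (separate tools from LM Studio). The model ID and target module names are illustrative assumptions, and real fine-tuning would also need a dataset and a training loop before the adapter could be saved and shared.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumes you have enough memory to load the base model; the ID is just an example
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank "shortcut" matrices
    lora_alpha=16,                        # scaling factor for the adapter's contribution
    target_modules=["q_proj", "v_proj"],  # which layers get adapters (architecture-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)

# The base model's billions of weights stay frozen; only the tiny adapter trains
model.print_trainable_parameters()

# After fine-tuning on your own data (e.g. marketing copy), save just the adapter:
# model.save_pretrained("my-marketing-lora")
```

Printing the trainable parameter count makes the appeal obvious: the adapter is a tiny fraction of the full model, which is why LoRA files are so small and cheap to train.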

It’s been a fun experience to play around with creating our own local AI, and we hope you’ve found this guide useful.

Do you have a story or insights to share? Get in touch and let us know. 

Follow Machine on X, BlueSky and LinkedIn