PrivateGPT and CPUs with no AVX2

In my quest to explore generative AIs and LLM models, I have been trying to set up a local / offline LLM model. This is where PrivateGPT made its presence felt, along with gpt4all.

I wanted to try both and realised gpt4all needs a GUI to run in most cases, and it has a long way to go before getting proper headless support. PrivateGPT, however, has its own ingestion logic and supports both GPT4All and LlamaCPP model types, so I started exploring it in more detail. There are a lot of prerequisites if you want to work with these models, the most important being the ability to spare a lot of RAM and CPU for processing power (GPUs are better, but I was stuck with non-GPU machines, which let me focus specifically on a CPU-optimised setup).

This post is more of a reminder for myself for when I encounter these errors again, and hopefully it will help others going through the same process.


You need Python 3.10 to run these systems; it seems to be the most commonly referenced Python version for them. Ubuntu 20.04 and similar systems don’t have it by default, so you will have to use a PPA to get Python 3.10 there.

Commands to set this up

sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.10 python3.10-dev python3.10-distutils 
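After installation, a quick sanity check from Python confirms the interpreter meets the requirement. This is a generic sketch, nothing specific to these projects:

```python
import sys

def meets_requirement(required=(3, 10)):
    """Return True when the running interpreter is at least `required`."""
    return sys.version_info[:2] >= required

# Print the active version and whether it is new enough
print("Python", sys.version.split()[0], "-", "OK" if meets_requirement() else "needs 3.10+")
```

Run it with `python3.10` to make sure the PPA version, and not the system default, is the one you are about to build the venv from.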

Installing pip and other packages

Expert Tip: Use venv to avoid corrupting your machine’s base Python.

Create a new venv environment in the folder containing privateGPT. This is a one-time step.

python3.10 -m venv venv 

Subsequent sessions require the following two commands:

  • activate: source venv/bin/activate
  • deactivate: deactivate

Finding the models

The problem with GPTs is that most of the experience depends on the input model you are using, and there are a lot of interesting models listed out there. However, these models didn’t work for me.

  • ggml-gpt4all-j-v1.3-groovy.bin
Using embedded DuckDB with persistence: data will be stored in: db
Found model file.
gptj_model_load: loading model from 'models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 5401.45 MB
Illegal instruction (core dumped)
  • ggml-stable-vicuna-13B.q4_2.bin
$ python3
Using embedded DuckDB with persistence: data will be stored in: db
Found model file.
gptj_model_load: loading model from 'models/ggml-stable-vicuna-13B.q4_2.bin' - please wait ...
gptj_model_load: invalid model file 'models/ggml-stable-vicuna-13B.q4_2.bin' (bad magic)
GPT-J ERROR: failed to load model from models/ggml-stable-vicuna-13B.q4_2.bin

This brought my experimentation to a halt, and I needed to start looking at alternatives. The first stop, as always, is the issue logs on the project repositories themselves. Things got a bit complicated, as we are looking at three projects: llamacpp, gpt4all and privategpt.

  • This issue confirmed my suspicion that I was using an older CPU and that could be the problem in this case; specifically, these builds needed AVX2 support.
  • For getting gpt4all models working, the suggestion seems to be recompiling gpt4all. However, this gets complex, as I am not using gpt4all directly but via a Python binding, so that would be a mess of its own.
  • This is where I turned my attention towards llama.cpp, because I had played around with it earlier and it did work at some level.
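Since the “Illegal instruction (core dumped)” crash above traces back to missing AVX2, it is worth checking the CPU flags before downloading gigabytes of models. On Linux the flags are listed in /proc/cpuinfo; this small helper is my own sketch, not part of any of the three projects:

```python
def has_cpu_flag(flag: str, cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    """Check whether a CPU feature flag (e.g. 'avx2') appears in /proc/cpuinfo."""
    try:
        with open(cpuinfo_path) as f:
            for line in f:
                # The 'flags' line holds a space-separated list of CPU features
                if line.startswith("flags"):
                    return flag in line.split()
    except FileNotFoundError:  # non-Linux systems have no /proc/cpuinfo
        pass
    return False

if __name__ == "__main__":
    print("AVX2 supported:", has_cpu_flag("avx2"))
```

On the shell, `grep avx2 /proc/cpuinfo` gives the same answer; if nothing comes back, the stock prebuilt binaries will crash exactly as shown above.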

I would like to spend a few minutes talking about llama.cpp, as it can help us convert models into a usable format, especially if they were created with different types of software. llama.cpp was one of the first few things to come out in this space, and it has a lot of tooling around model customisation.

I would again recommend using venv for this project too. After compiling llama.cpp, you will be able to play with the model files. I will not dig into how to obtain those models; there are n number of ways, so this article simply assumes you possess them, and we will work from that point onwards.

Now, I remembered I had done some work on llama.cpp setup and the conversion of a 7B model a few days back, and I did get that model working in chat mode. But its output was way too small, and the setup was not to my taste, as my ultimate aim was to get more input from my own notes. So I thought: let’s give that model a try. For privateGPT, all I needed to do was change the model path in .env.
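For reference, the change amounts to pointing the model variables in .env at the converted file. The variable names below follow the sample .env shipped with privateGPT at the time I tried it; treat the model path as a placeholder for wherever your converted file lives:

```
MODEL_TYPE=LlamaCpp
MODEL_PATH=models/ggml-model-q4_0.bin
MODEL_N_CTX=1000
```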


Running python3 now started opening another can of worms.

Error: module format no longer supported

error loading model: this format is no longer supported (see ggerganov/llama.cpp#1305)

This was funny, but it’s okay: I built the model a few weeks back, and in the AI/GPT world that’s like a decade. The error text above confirmed that the quantisation format had changed in llama.cpp. That is easy peasy: do a git pull, recompile the project, and redo the conversion, and that should sort the issue. After a few minutes, I was ready to run with the updated model.

Next error: unknown magic, version

error loading model: unknown (magic, version) combination: 67676a74, 00000003; is this really a GGML file?

This was interesting. I was damn sure it was a ggml file, because I had just created it, but the error reminded me of other languages that crap out when file magic values don’t match. I went back to the issue logs and found a lot of similar entries pointing to upgrading the llama-cpp-python binding. I realised requirements.txt was stuck on 0.1.50, so a pip install of the fixed version and a matching change in requirements.txt got that sorted.

Command for reference:

pip install llama-cpp-python==0.1.53

If these two are not the errors you are facing, I would suggest keeping a watch on the projects’ issue trackers for any updates. For me, the llama-cpp-python binding did the trick and finally got my privateGPT instance working. I am still experimenting with data inputs.

Ingesting files in the vector database

  1. It works in subdirectories too.
  2. There is an issue if a file has very long lines or characters that cannot be decoded as UTF-8.

I made a minor change in the code to get the name of the file causing ingestion errors due to Unicode issues: I added a try/except block to print the name of the file causing the error.

def load_single_document(file_path: str) -> Document:
    ext = "." + file_path.rsplit(".", 1)[-1]
    if ext in LOADER_MAPPING:
        loader_class, loader_args = LOADER_MAPPING[ext]
        try:
            loader = loader_class(file_path, **loader_args)
            return loader.load()[0]
        except UnicodeDecodeError:
            print(f"Failed to decode file: {file_path}")
            raise

    raise ValueError(f"Unsupported file extension '{ext}'")

This has worked so far for identifying which files need to be removed as sources. I am yet to find a better way to deal with those files; right now I simply remove the offending file and rely on files that are directly digestible.
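Rather than discovering these files one by one during ingestion, a pre-flight scan can list every file in the source directory that is not valid UTF-8. This is my own sketch, independent of privateGPT’s code, and the extension filter is purely illustrative:

```python
import os

def find_non_utf8_files(root: str, extensions=(".txt", ".md")) -> list:
    """Walk `root` and return paths of files that fail strict UTF-8 decoding."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            try:
                # Strict decode: raises UnicodeDecodeError on any invalid byte
                with open(path, encoding="utf-8") as f:
                    f.read()
            except UnicodeDecodeError:
                bad.append(path)
    return bad
```

Running this over the source documents folder before ingestion lets you move or re-encode the offenders in one go instead of restarting the ingest after each crash.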

Speed up the responses

Once the ingestion process has worked its wonders, you will be able to run python3 and receive a prompt that can hopefully answer your questions, listing all the sources it used to develop each answer. However, you will immediately realise it is pathetically slow. I fired up htop to check how much load the process added to the server and, to my amusement and half as expected, it was using just one thread, and RAM usage was also totally under control.

So it seemed the number of threads was something I needed to update, and right on the money, there was a discussion in the privateGPT repo covering this same aspect. A quick edit of the code to the version below did the job.

n_cpus = len(os.sched_getaffinity(0))
match model_type:
    case "LlamaCpp":
        llm = LlamaCpp(model_path=model_path, n_threads=n_cpus, n_ctx=model_n_ctx, callbacks=callbacks, verbose=False)

Now, running the code, I can see all my 32 threads in use while it tries to find the “meaning of life”.

Bonus Tip

If you are simply looking for a crazy-fast search engine across your notes of all kinds, the vector DB makes life super simple. Load a fake model name so that no model is loaded, then enter the search string at the prompt; it will point to all the files and sources containing the relevant text.

What next?

It seems like I’m just starting on a wild ride, and there is a whole lot more to learn and play around with. I might write more about this topic if I do more work in this area.
