How to run Google Gemma 2B- and 7B-parameter instruct models locally on the CPU and the GPU on Apple Silicon Macs.
In this hands-on video, we use the Hugging Face CLI, PyTorch, and the Transformers and Accelerate Python packages.
00:00
· Introduction
01:05
· Find Models in Hugging Face
01:28
· Terms
01:57
· Install the Hugging Face CLI
02:21
· Login
02:55
· Download Models
03:51
· Download a Single File
04:50
· Download a Single File as a Symlink
05:25
· Download All Files
06:32
· Hugging Face Cache
07:00
· Recap
07:29
· Using Gemma
08:02
· Python Environment
08:47
· Run Gemma 2B on the CPU
12:13
· Run Gemma 7B on the CPU
13:07
· CPU Usage and Generating Code
17:24
· List Apple Silicon GPU Devices with PyTorch
18:59
· Run Gemma on Apple Silicon GPUs
23:52
· Recap
Hi everyone. It's Nono here.
And this is a hands-on overview of Google's Gemma. This is a model that Google has just released, the equivalent of Llama at Meta (Facebook AI). Google is releasing open models, large language models or LLMs, which we can download directly from Hugging Face, either with the Hugging Face CLI or from the Hugging Face web interface.
And they've released two- and seven-billion-parameter networks that we can use directly on our machine. Google Gemma being open models means they have trained these networks, published benchmarks, and you can see how well they perform on certain tasks compared to other models like Llama or Mistral.
Google describes Gemma as a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. I have to say I hadn't used them before, but I have downloaded them to my machine. It's pretty easy and you can do the same. We're going to see how to do that right now.
The first thing that I did is I went to Hugging Face and searched for google/gemma, and you get a series of models. The two models are 2 and 7 billion parameters, but there are two varieties of each, because each of them has the base model itself and the -it variant, which is the instruction-tuned model.
I had to accept the terms on Kaggle, and that gave me access. It's what's called a gated model, so accepting the terms gave me access on Hugging Face. By logging in from the terminal and requesting the download, I could download the model. All right, let's now code and take a hands-on look at how to actually download those models on a Mac.
Note that this should work on Linux and Windows as well.
Let's install the huggingface-cli with Homebrew.
There's a Homebrew formula for it, so all you have to do in the end is brew install huggingface-cli. That's what we need to do to get the CLI installed on our machine. That works.
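In case it helps, here's the install command I'm running (this assumes you already have Homebrew on your Mac):

```bash
# Install the Hugging Face CLI with Homebrew.
brew install huggingface-cli

# Verify the installation.
huggingface-cli --help
```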
And now we can do things like log in, log out and other things.
Let's now see how to log in to the Hugging Face CLI with a token. You have to go to huggingface.co/settings/tokens, which is this page here. This is the token I created today. So you go expressly to that URL, and then you go back to your terminal and run huggingface-cli login.
It will prompt you for a token, which I've just pasted, and then it says login successful.
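A quick sketch of that login flow, assuming you've already created a token at huggingface.co/settings/tokens:

```bash
# Log in to the Hugging Face CLI; it prompts for your access token.
huggingface-cli login

# Check which account you're logged in as.
huggingface-cli whoami
```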
Okay, so now we're going to see how to download a model from Hugging Face. We were talking about Google's Gemma. You get the results here, and we can go, for example, for gemma-2b-it, and read how to run it.
I'm going to go to Files and versions so you can see the files they've put here for download. You could download one file specifically, or you could download the whole repository. Let's see how to do both. I'm going to go to the desktop, where we have a folder for today's stream that already has two models downloaded, and I'm going to create a test folder, which we'll enter and keep open on screen.
And now I can run huggingface-cli download with --local-dir set to the current directory, pointing at google/gemma-2b-it, and choose a single file, for example config.json.
So we just say config.json, and then we get the file here. We can preview the file in our terminal: in case you don't know, I'm cat-ing the file, which prints its contents, and then I pipe it through jq, which gives us formatted JSON. I could also ask for a single property, architectures, for example.
We can download one file, and as you can see, the file is there. It's not a symlink.
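Roughly, the commands look like this (the repo id and filename are the ones we're using in the video):

```bash
# Download a single file from the gemma-2b-it repo into the current directory.
huggingface-cli download google/gemma-2b-it config.json --local-dir .

# Print the file and format it with jq.
cat config.json | jq

# Read a single property, for example the model architectures.
cat config.json | jq '.architectures'
```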
And to get a symlink instead, I can add --local-dir-use-symlinks. Let's delete that file, set the flag to True, and download again. Now this is a symlink, so if I show the original, it brings me back to the Hugging Face cache folder I showed before.
That file is symlinked. If I don't want a symlink, I simply download the file without the flag, remove the symlink, and get the actual file. Okay, so that's for downloading one file.
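For reference, this is the symlink variant (note that --local-dir-use-symlinks is the flag available at the time of recording; newer CLI versions may handle this differently):

```bash
# Download the file as a symlink pointing into the Hugging Face cache.
huggingface-cli download google/gemma-2b-it config.json \
  --local-dir . --local-dir-use-symlinks True
```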
If we want to download the entire repo, we just leave the filename out, and it will start downloading all the files to this folder. The files that are really big, the ones with the LFS icon here, because they're stored with Git LFS, Large File Storage, are going to be symlinked, because you don't want to re-download them every time.
So the ones that are symlinked are the really large ones, and we could have symlinked everything by specifying that use-symlinks option and leaving the filename out. So we're going to remove these files and run the download again.
And this is really nice, because now it's just symlinking from the Hugging Face cache on my machine, since the files are already downloaded.
It's really clever because it doesn't have to wait. If you did have to wait, you would see something similar to this: it's downloading the big files from the remote repository, and as each one completes, it lands in the cache and is then symlinked here.
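A sketch of the whole-repo download, with the same folder setup as before:

```bash
# Download every file in the repository into the current directory.
# Large LFS files end up as symlinks into the Hugging Face cache.
huggingface-cli download google/gemma-2b-it --local-dir .
```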
We're going to go to the cache of Hugging Face and we're going to see what files we have here.
Some of these are files that didn't get downloaded completely from when I tried before. I have 17 gigs for gemma-7b and 51 for the other, so that one probably isn't finished, and this is the 2b repo I started downloading.
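If you want to inspect the cache yourself, the CLI ships a scan command (the cache normally lives under ~/.cache/huggingface/hub):

```bash
# List cached repos and how much disk space each one takes.
huggingface-cli scan-cache
```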
So we've seen a brief, really high-level intro to what Gemma is. We've seen that we can install the Hugging Face CLI really easily, how to log in with a token, and how to go to a repo and put the download command together so we can download one file or multiple files, either as local copies or as symlinks. And remember that the really big files will always be symlinked.
So let's take a look at how to use Gemma locally on our machine. Here we have the 2B and 7B Instruct models.
Gemma is a family of lightweight, state-of-the-art open models well suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources, such as a laptop, a desktop, or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.
Let's create a new environment so we can run Gemma.
conda create, the name is going to be gemma, and it's going to be Python 3. I don't really know which version has the best support right now. I say yes, we install that, and this should be done already.
All right. So conda activate gemma, and we're going to do what the docs tell us to do, which is pip install --upgrade transformers.
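Something like this (I'm picking Python 3.11 here as an assumption; the video doesn't pin an exact version):

```bash
# Create and activate a fresh conda environment for Gemma.
conda create --name gemma python=3.11 -y
conda activate gemma

# Install (or upgrade) the Transformers library from Hugging Face.
pip install --upgrade transformers
```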
Transformers is from Hugging Face, so it already knows how to fetch these models from Hugging Face. And because we already have the model in the cache, it should be really fast: it will see the model is already there and use it directly.
Let's see how to run Google Gemma on the CPU right now. From transformers we import AutoTokenizer and AutoModelForCausalLM. The tokenizer is going to be AutoTokenizer.from_pretrained with google/gemma-2b-it, which is the one we downloaded.
The model is going to be AutoModelForCausalLM.from_pretrained with google/gemma-2b-it as well, so that should match. The input text is going to be "List 10 plans for tourists in Malaga, Spain."
For the input IDs, we call the tokenizer on the input text with return_tensors="pt". Then outputs is model.generate on the input IDs, and we print tokenizer.decode of the outputs. Does this work? Let's take a look. We just have to run the CPU script.
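Here's roughly what that script looks like, reconstructed from what I'm typing on screen (the prompt and model id are the ones used in the video):

```python
# run_cpu.py — run Gemma 2B Instruct on the CPU with Transformers.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

input_text = "List 10 plans for tourists in Malaga, Spain."
input_ids = tokenizer(input_text, return_tensors="pt")

# Generate a completion and decode it back into text.
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```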
Oh, okay. Conda deactivate. It says we might have to restart. Let's take a look at what the issue is here: none of PyTorch or TensorFlow have been found. So conda activate gemma again, and then we'll pip install PyTorch, or just torch, I guess. What's going on now is we're installing PyTorch, and maybe that makes it work. Let's try again. Now something else is happening: a typo in from_pretrained. All right, human error. There you go, one more time. And it's downloading shards and loading checkpoints.
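In other words (the package name on PyPI is simply torch):

```bash
# Transformers needs a backend; install PyTorch into the gemma environment.
pip install torch
```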
So it seems... "Visit the Picasso Museum." All right, it got that right. But the output is super short. Why is it generating so few tokens?
Right, let's just try bumping the max length to 112.
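That's just a matter of passing a length cap to generate; something like this (112 is the value I type in the video):

```python
# Allow the model to generate a longer answer.
outputs = model.generate(**input_ids, max_length=112)
print(tokenizer.decode(outputs[0]))
```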
Oh, all right. List 10 plans for tourists in Malaga. Visit the Picasso Museum, take a walk along the Paseo Marítimo, visit the Cathedral of Malaga, explore the Alcazaba, go on a day trip to Ronda, visit the Museum of Malaga, take a cooking class, visit the Botanical Garden, go on a hike in the Sierra Nevada mountains, visit the Aquarium of Malaga.
These are just a few ideas for things to do in Malaga. There are many other things to see and do in this beautiful city. This is pretty awesome. It's working pretty well, and that was with the 2-billion-parameter model.
Let's actually change this to the 7-billion-parameter model and run exactly the same code.
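The only change is the model id; assuming the instruct variant again:

```python
# Swap the 2B model for the 7B Instruct variant; everything else stays the same.
model_id = "google/gemma-7b-it"
```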
Conda activate gemma
So this model is a lot slower, but we'll get the output. This is all running on the CPU, and it's taking a really, really long time.
Okay, nice. All right. So these are actually plans for tourists in Malaga, Spain: history and culture, explore the Moorish architecture of the Alcazaba of Malaga and the Cathedral of Santa María, the Roman theater, and a museum.
Immerse yourself in the vibrant flamenco culture with a show. Beach and sun, relax. A trip to Ronda. Explore the orange blossoms. A foodie adventure. Shopping, nightlife. Relax and enjoy. Yeah, I don't really know why this took so long, but there's definitely a lot more detail.
Can we solve coding questions with this?
Can we ask a different question? Write a TypeScript React component with an OK button. Let's try that with the 2 billion model, which is pretty fast; I'm not sure whether the 7 billion would be really slow. That was super fast. What you see here is the CPU, the cores. It was generating, but I allowed so few tokens that it couldn't really write anything, so let's try again with more tokens. It's loading checkpoints. There's no rendering lag, so this computer can handle it pretty easily. Now let's see what happens.
It also takes some time, maybe 10 or 20 seconds to answer, let's say. It seems like this peak you can see is the model's usage. So let's see, when the output comes, whether these spikes stay or disappear, if that's the model using part of our CPU.
This is taking longer because it's probably generating a lot more tokens. I don't know what the default token limit is, but it's definitely not 200 or 500. All right, so here's our component, our application: handleClick, onClick, an OK button, handleOkClick.
It imports a useState hook, and it explains how to do all these things. Okay, that sounds good. What I'm going to do now is ask again with maybe 64 tokens, so it's not too much. You can see these mounds here; they're the CPU being consumed by running that.
So it seems like this model makes use of the CPU, but in a good way. Now we're going to try the 7B, and my camera will probably lag a bit, so I'm going to leave it; maybe we'll miss some frames. Let's see those mounds, how they look, and whether there is some GPU use as well.
So I'm going to leave this here. All right, let's run that.
There you go. At first, when the model is being loaded, we lose some frames: we lost 50 frames due to rendering lag and 19 due to encoding lag. There is no stress on the GPU, because this code runs on the CPU, and as you can see there's activity on the CPU performance cores.
Then we generate, and the answer is being generated, but it seems like we need to allow more tokens. I'm going to remove my camera from the scene, because there's going to be some encoding lag and I don't want you to see me frozen.
And that's it. We seem to be okay. So this generated... let's close this stuff. Oh, it's still generating. Okay. Yeah, it seems like the CPU also suffers here, but it seems more heavily used with the two-billion-parameter model. I don't fully understand what's going on, or why one model puts more work on the CPU than the other.
All right, here it is: import... So it's working, slowly; you can see it here in these columns. Great. This works pretty well for being 7 billion parameters.
Next: running the model on a single GPU or multiple GPUs. We need to pip install accelerate, which lets you run your raw PyTorch script on any kind of device.
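That's one more dependency in the same environment:

```bash
# Accelerate lets Transformers place the model on the available device(s).
pip install accelerate
```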
Alright, so we're trying to see if we can run Gemma directly on the MacBook Pro's Apple Silicon GPU.
Let's create a file, torch_devices.py. We import torch, check torch.backends.mps.is_available(), and if so set mps_device to the MPS device; I'm basically just following the PyTorch example. We create a tensor of ones on that device and print it, else we print that the MPS device was not found. Alright, let's see what this does for us: python torch_devices.py.
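This follows the MPS snippet from the PyTorch documentation; the filename torch_devices.py is just what I'm using here:

```python
# torch_devices.py — check whether PyTorch can see the Apple Silicon GPU (MPS).
import torch

if torch.backends.mps.is_available():
    mps_device = torch.device("mps")
    x = torch.ones(1, device=mps_device)  # allocate a tensor on the GPU
    print(x)
else:
    print("MPS device not found.")
```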
It prints tensor([1.], device='mps:0'). Great, we do have it. This code sample tells us whether PyTorch can see the MacBook Pro's GPU. Machine learning tooling has advanced a lot, and now we have native support without doing anything special.
So what I want to do here is adapt our generation script. Let's take a look at where the code example was.
Okay, that's great, and let's close things. The model card's GPU example moves the tokenized inputs to CUDA; in our case we pass device_map="auto" to the model and move the tensors to MPS instead. Let's try that, and let's not be too ambitious. I'll copy our CPU script as a starting point. Now the only thing we need is to not get any errors; I don't really know what's going to happen.
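Here's a minimal sketch of that change, assuming the model is already in the local cache (the model card shows this pattern with "cuda"; swapping in "mps" for Apple Silicon is the experiment we're running here, and the token cap is just an illustrative value):

```python
# run_mps.py — run Gemma on the Apple Silicon GPU via PyTorch's MPS backend.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets Accelerate place the model on the available device.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

input_text = "Things to eat in Napoli, Italy."
# Move the tokenized inputs to the MPS device so they live with the model.
input_ids = tokenizer(input_text, return_tensors="pt").to("mps")

outputs = model.generate(**input_ids, max_new_tokens=150)  # illustrative cap
print(tokenizer.decode(outputs[0]))
```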
Maybe we do get errors, but it seems like it's working. I'm not losing any frames here, just a few dropped frames due to the network, some lag I can't explain. Oh, nice, take a look now. This spike here is really nice, because it means the model is actually making use of the GPU and it's not overloading my CPU at all.
Which is pretty cool, I would say. Now there are some spikes on the CPU, but a lot fewer than before. So yeah, that's super nice. We're going to keep watching this graph, and I'm going to leave it here on screen. All right, so we got our answer here.
Things to eat in Napoli, Italy. Great. Maybe fewer tokens is faster. I have a timer here, so we're going to count.
All right, I'm going to run it now: 1, 2, 3... 16. Okay, 16 seconds, and we've got an output in Markdown. So I can open a writer with an empty file, paste this in, and it's ready for print. It's Markdown text, and you can view it in different formats, but the important thing is that it's valid Markdown and it was generated by the 2-billion-parameter model.
So let's go for the 7-billion-parameter model, see how it performs, and whether this is something we could use.
Okay. So the prompt is things to eat in Napoli, Italy; that's the answer for 2 billion, and this is the answer for 7 billion. Oh, it didn't have space to write more. All right, let's let it talk and see how long this one takes. We're using the GPU a lot more than before and the CPU a lot less, right?
There is still some load on the CPU, and the GPU shows spikes. What's actually going on is that we're losing a lot of frames because the computer is trying to generate tokens; it's running inference on my machine while I stream to three screens, so it's not super performant and a lot of frames are lost. But the good news is that the 2-billion-parameter model is something you can use locally without any problem.
We're at 2 minutes, 50 seconds
All right, and we're at three minutes fifty. This just finished generating. So let's grab this and put it in the same document, under Gemma 7 billion.
This is what we got: Naples is a vibrant city, steeped in history, and so on. Pizza, spaghetti, pizza Napoletana, gelato, tartarella. Tips to experience the best of Napoli.
Refrigerate your dishes, enjoy the relaxed atmosphere. Wow, okay, there is a difference compared to what we got before.
Pretty different suggestions.
Anyhow, this was a great experiment. We've been able to run Gemma on the GPU and also on the CPU and see how those differ. We lose frames even on this expensive, powerful M3 Max MacBook Pro, because inference loads up our CPU and GPU and things freeze for a moment, but it's been pretty cool to be able to do this so quickly and so easily.
A reminder that you can join the Discord community at nono.ma/discord, and don't forget to subscribe and click the bell if you want to get notified when I go live next or upload new videos. I want to thank you for watching, and I will see you next time. Thanks a lot. I'm Nono Martínez Alonso, host of this stream and also of the Getting Simple podcast.
Bye.