Setting up

Clone the GitHub repo. We assume you cloned this into /scratch/gpfs/YourNetID.

git clone git@github.com:Metric-Void/FastChat.git

First, load Anaconda and install the environment. This script assumes that you did not symlink ~/.conda to /scratch/gpfs/YourNetID/.conda, which some people do to avoid adding prefixes to conda commands.
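
If you are not sure whether ~/.conda is a symlink on your account, you can check before running the commands below:

ls -ld ~/.conda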

flash-attn must be installed last because it is distributed as source code and requires the other packages to be installed before it can build. Building it takes quite a while.

module load anaconda3/2023.3
module load cudatoolkit/11.7

conda env update --prefix /scratch/gpfs/$USER/.conda/fastchat -f environment.yml
conda activate /scratch/gpfs/$USER/.conda/fastchat
pip install flash-attn==1.0.3.post0
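
Once everything is installed, a quick sanity check (run inside the activated environment; these commands are just illustrative) confirms that PyTorch and flash-attn import cleanly:

python -c "import torch; print(torch.cuda.is_available())"  # prints False on login nodes without a GPU
python -c "import flash_attn"  # no error means the build succeeded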

Running the model

Model weights are stored under my scratch folder. I’ve given read access to them.
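
Before launching anything, you can confirm that your account can read the weights (this uses the vicuna-13b path from the commands below):

ls /scratch/gpfs/zl1111/vicuna-13b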

All scripts below assume you are running them from the FastChat directory cloned from GitHub.

Interactively

Use srun to run an interactive session with the model! Note that this is not suitable for profiling.

srun --pty --nodes=1 --ntasks=1 --cpus-per-task=8 \
  --gres=gpu:1 --mem=60G --time=00:20:00 \
  python -m fastchat.serve.cli \
  --model-path /scratch/gpfs/zl1111/vicuna-13b --style rich

Batching

Questions are provided as a JSONL file, with one JSON object per line, as shown below.

{"question_id": "1", "text": "How to put an elephant into a fridge?"}
{"question_id": "2", "text": "Which browser is better, Edge or Chrome?"}
{"question_id": "3", "text": "Should watermelon be salty, sour, ginger, or garlic-flavored?"}
{"question_id": "4", "text": "How to put pineapples on piazzas?"}
{"question_id": "5", "text": "Do you pour coffee into black tea, or black tea into coffee?"}
{"question_id": "6", "text": "How to make strawberry juice with percolator?"}
{"question_id": "7", "text": "Which gear should be selected when grinding hot water?"}

This script runs inference locally on a machine with a GPU (not Della); it is included here for explanation purposes.

python -m fastchat.eval.get_model_answer \
  --model-path /scratch/gpfs/zl1111/llama-7b \
  --model-id llama-7b \
  --question-file /scratch/gpfs/$USER/FastChat/playground/test-data/chats.jsonl

A SLURM-wrapped version of this script is shown below. It is designed to run on Della; you can find it at inference.slurm.

#!/bin/bash
#SBATCH --job-name=vicuna-inference
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1
#SBATCH --mem=40G
#SBATCH --time=00:20:00
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --mail-type=fail
#SBATCH --mail-user=YourNetID@princeton.edu # Change this

# export OMP_NUM_THREADS=4

module purge
module load anaconda3/2023.3
module load cudatoolkit/11.7

conda activate /scratch/gpfs/$USER/.conda/fastchat
cd /scratch/gpfs/$USER/FastChat

export WANDB_MODE=offline

python -m fastchat.eval.get_model_answer \
  --model-path /scratch/gpfs/zl1111/vicuna-13b \
  --model-id vicuna-13b \
  --question-file /scratch/gpfs/$USER/FastChat/playground/test-data/chats.jsonl
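
Submit the script with sbatch and monitor the job with squeue:

sbatch inference.slurm
squeue -u $USER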

The script shown above uses a vicuna-13b model stored at /scratch/gpfs/zl1111. By default, the answers will appear in /scratch/gpfs/$USER/FastChat/answer.jsonl. You can also add --answer-file to the command to specify where the answers should be stored.
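
For example, to write answers somewhere other than the default (the output path here is arbitrary):

python -m fastchat.eval.get_model_answer \
  --model-path /scratch/gpfs/zl1111/vicuna-13b \
  --model-id vicuna-13b \
  --question-file /scratch/gpfs/$USER/FastChat/playground/test-data/chats.jsonl \
  --answer-file /scratch/gpfs/$USER/FastChat/my-answers.jsonl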