Thanks for sharing your work, it helps me a lot.
However, when I tried to reproduce the results from the article Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization, I could not match the reported GSM8K score for W4A4 RTN-quantized Llama-3.1-8B-Instruct: I got less than 1.00 instead of the 78.39 reported in the article. When I inspected how the model answered the questions, I found that it repeated certain words or emitted random, meaningless tokens instead of coherent sentences.
At the same time, I reproduced the article's results on Winogrande and HellaSwag with the same quantized model.
I evaluated GSM8K using transformers for model inference and the lm_eval Python API for scoring. This is the script I used:
#!/bin/bash
export OMP_NUM_THREADS=8
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:128"
MODEL=${MODEL:-"../models/Llama-3.1-8B-Instruct"}
MODEL_ID=$( echo $MODEL | awk -F/ '{print $NF}' )
# Data params
NUM_SEQUENCES=${NUM_SEQUENCES:-128}
# Quantization params
FORMAT=${FORMAT:-"nvfp"}
W_BITS=${W_BITS:-4}
A_BITS=${A_BITS:-4}
W_GROUP_SIZE=${W_GROUP_SIZE:-16}
A_GROUP_SIZE=${A_GROUP_SIZE:-16}
GPTQ=${GPTQ:-0}
W_OBSERVER=${W_OBSERVER:-"minmax"}
QUANTIZATION_ORDER=${QUANTIZATION_ORDER:-"default"}
# Save params
EXPORT_QUANTIZATION=${EXPORT_QUANTIZATION:-"pseudoquant"}
# Transform params
TRANSFORM_CLASS=${TRANSFORM_CLASS:-"identity"}
HADAMARD_GROUP_SIZE=${HADAMARD_GROUP_SIZE:-128}
# Evaluation params
EVAL_PERPLEXITY=${EVAL_PERPLEXITY:-0}
EVAL_OPENLLM=${EVAL_OPENLLM:-1}
EVAL_BENCH=${EVAL_BENCH:-3}
LM_EVAL_BATCH_SIZE=${LM_EVAL_BATCH_SIZE:-"auto"}
# Misc params
LOG_WANDB=${LOG_WANDB:-0}
DTYPE=${DTYPE:-"auto"}
CPU_OFFLOAD_ACTIVATIONS=${CPU_OFFLOAD_ACTIVATIONS:-0}
CPU_OFFLOAD_MODULES=${CPU_OFFLOAD_MODULES:-0}
SCRIPT_ARGS=""
if [[ $GPTQ == 1 ]]; then
    SCRIPT_ARGS="${SCRIPT_ARGS} --gptq"
fi
if [[ $EVAL_PERPLEXITY == 1 ]]; then
    SCRIPT_ARGS="${SCRIPT_ARGS} --eval_perplexity"
fi
if [[ $EVAL_OPENLLM == 1 ]]; then
    SCRIPT_ARGS="${SCRIPT_ARGS} --eval_openllm --lm_eval_tasks"
    if [[ $EVAL_BENCH == 1 ]]; then
        SCRIPT_ARGS="${SCRIPT_ARGS} winogrande"     # fast
    fi
    if [[ $EVAL_BENCH == 2 ]]; then
        SCRIPT_ARGS="${SCRIPT_ARGS} hellaswag"      # ~0.5h
    fi
    if [[ $EVAL_BENCH == 3 ]]; then
        SCRIPT_ARGS="${SCRIPT_ARGS} gsm8k_llama"    # ~1.5h
    fi
    if [[ $EVAL_BENCH == 4 ]]; then
        SCRIPT_ARGS="${SCRIPT_ARGS} mmlu_cot_llama" # ~1 day
    fi
fi
if [[ $LOG_WANDB == 1 ]]; then
    SCRIPT_ARGS="${SCRIPT_ARGS} --log_wandb"
fi
METHOD_NAME=""
if [[ $GPTQ == 1 ]]; then
    METHOD_NAME="GPTQ"
else
    METHOD_NAME="RTN"
fi
if [[ $CPU_OFFLOAD_MODULES == 1 ]]; then
    SCRIPT_ARGS="${SCRIPT_ARGS} --cpu_offload_modules"
fi
if [[ $CPU_OFFLOAD_ACTIVATIONS == 1 ]]; then
    SCRIPT_ARGS="${SCRIPT_ARGS} --cpu_offload_activations"
fi
export WANDB_PROJECT="FP-Quantization-Harness"
export WANDB_NAME=${MODEL}/${FORMAT}-w${W_BITS}-a${A_BITS}-${METHOD_NAME}-${TRANSFORM_CLASS}-transform
if [[ $EXPORT_QUANTIZATION == "realquant" || $EXPORT_QUANTIZATION == "pseudoquant" ]]; then
    SCRIPT_ARGS="${SCRIPT_ARGS} --export_quantized_model ${EXPORT_QUANTIZATION}"
    if [[ $EXPORT_QUANTIZATION == "realquant" ]]; then
        SAVE_DIR=quantized_models
    else
        SAVE_DIR=pseudoquantized_models
    fi
fi
python3 model_quant.py \
--model_name_or_path=${MODEL} \
--format=${FORMAT} \
--w_bits=${W_BITS} \
--a_bits=${A_BITS} \
--w_group_size=${W_GROUP_SIZE} \
--a_group_size=${A_GROUP_SIZE} \
--transform_class=${TRANSFORM_CLASS} \
--w_observer=${W_OBSERVER} \
--quantization_order=${QUANTIZATION_ORDER} \
$SCRIPT_ARGS \
--hadamard_group_size=${HADAMARD_GROUP_SIZE} \
--dataset_name_or_path=fineweb-edu \
--num_sequences=${NUM_SEQUENCES} \
--sequence_length=2048 \
--dtype=${DTYPE} \
--lm_eval_batch_size=${LM_EVAL_BATCH_SIZE} \
--save_path "${SAVE_DIR}/${MODEL_ID}-${FORMAT}-w${W_BITS}-a${A_BITS}-${METHOD_NAME}-${TRANSFORM_CLASS}-transform" \
--export_quantized_model pseudoquant \
--cpu_offload_activations \
--cpu_offload_modules \
--fuse_global_scale \
--amp
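As context for the quantization settings above (`FORMAT=nvfp`, `W_BITS=4`, `W_GROUP_SIZE=16`, `W_OBSERVER=minmax`), here is a toy numpy sketch of what min-max RTN pseudo-quantization to an FP4 grid with group size 16 does. This is not the repository's implementation: real NVFP4 additionally stores FP8 (E4M3) group scales with a fused global scale, whereas this sketch keeps full-precision per-group scales purely for illustration.

```python
import numpy as np

# FP4 (E2M1) representable magnitudes used by microscaling FP4 formats
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def rtn_fp4_pseudoquant(x, group_size=16):
    """Round-to-nearest FP4 pseudo-quantization with one scale per group."""
    g = x.reshape(-1, group_size)
    # min-max observer: scale each group so its max magnitude maps to 6.0
    scale = np.abs(g).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero groups
    normed = g / scale
    # snap each magnitude to the nearest FP4 grid point, keeping the sign
    idx = np.abs(np.abs(normed)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(normed) * FP4_GRID[idx] * scale).reshape(x.shape)

w = np.random.default_rng(0).normal(size=(4, 32)).astype(np.float32)
wq = rtn_fp4_pseudoquant(w)
print("max abs error:", np.abs(w - wq).max())
```

Because the min-max scale maps each group's largest magnitude exactly onto the grid endpoint 6.0, that element is reconstructed exactly; all other elements land within half a grid step of their original value.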
import lm_eval
from lm_eval.models.huggingface import HFLM
from lm_eval.utils import make_table

if args.eval_openllm:
    results = {}
    lm = HFLM(
        pretrained=model,
        tokenizer=tokenizer,
        batch_size=args.lm_eval_batch_size,
        max_length=4096,  # from the Open LLM Leaderboard setup
    )
    task_manager = lm_eval.tasks.TaskManager()
    # GSM8K Llama-3.1
    if "gsm8k_llama" in args.lm_eval_tasks:
        task_results = lm_eval.simple_evaluate(
            model=lm,
            tasks="gsm8k_llama",
            batch_size=args.lm_eval_batch_size,
            apply_chat_template=True,
            fewshot_as_multiturn=True,
            task_manager=task_manager,
        )["results"]
        results.update(task_results)
        print(make_table({"results": task_results, "versions": {}, "n-shot": {}, "higher_better": {}}))
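For cross-checking the Python-API path above, roughly equivalent scoring can be run through the lm-evaluation-harness CLI (the model path is a placeholder for the exported pseudo-quantized checkpoint, and the flags assume a recent lm-eval version that ships the `gsm8k_llama` task):

```shell
lm_eval --model hf \
    --model_args pretrained=../models/Llama-3.1-8B-Instruct,dtype=auto \
    --tasks gsm8k_llama \
    --batch_size auto \
    --apply_chat_template \
    --fewshot_as_multiturn
```

If the CLI run also collapses to near-zero accuracy while Winogrande and HellaSwag stay fine, that would point at the generative (chat-templated) evaluation path rather than the harness wiring.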