Upload from /root/refusal_direction/pipeline/runs/Llama-3.1-8B-Instruct/orthogonalized_model by /root/refusal_direction/pipeline/model_utils/model_base.py
metadata
language: en
license: apache-2.0

Model Card

Metrics

  • position: -2
  • layer: 11
  • refusal_score: -9.444916725158691
  • refusal_score_baseline: 7.121610641479492
  • steering_score: 9.821893692016602
  • steering_score_baseline: -12.952377319335938
  • kl_div_score: 0.020955339406569296
  • no_filter: 19
  • nan_values: 0
  • late_layer: 35
  • high_kl: 49
  • low_refusal: 57
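
As a reading aid, the two scores can be compared with their baselines. The following is a minimal sketch that only does arithmetic on the values listed above; the helper name `delta` is hypothetical and not part of the upload pipeline, and the sketch makes no claim about how the scores themselves were computed.

```python
# Values copied verbatim from the Metrics list above.
metrics = {
    "refusal_score": -9.444916725158691,
    "refusal_score_baseline": 7.121610641479492,
    "steering_score": 9.821893692016602,
    "steering_score_baseline": -12.952377319335938,
    "kl_div_score": 0.020955339406569296,
}


def delta(name: str) -> float:
    """Hypothetical helper: return a score minus its '<name>_baseline' value."""
    return metrics[name] - metrics[f"{name}_baseline"]


refusal_delta = delta("refusal_score")
steering_delta = delta("steering_score")

print(f"refusal_score change:  {refusal_delta:+.3f}")
print(f"steering_score change: {steering_delta:+.3f}")
```

With the values above, the refusal score sits about 16.567 below its baseline and the steering score about 22.774 above its baseline.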