This project builds on Meta's llama3-8B model: each MLP is replicated 8 times, a randomly initialized router is added, and all other parameter weights are kept unchanged, yielding a warm-started MoE model. This approach dramatically reduces the cost of training an MoE model from scratch and makes it easy to fine-tune quickly on downstream tasks.
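
Below is a minimal sketch of this warm-start ("upcycling") construction. The state-dict key names and checkpoint path are illustrative assumptions; the exact names used by this repo may differ.

```python
# Sketch: replicate the dense MLP into 8 experts and add a random router.
# Key names and the checkpoint path are illustrative, not the repo's exact ones.
import torch

NUM_EXPERTS = 8
NUM_LAYERS = 32      # llama3-8B
HIDDEN_SIZE = 4096   # llama3-8B

dense_sd = torch.load("llama3-8b/pytorch_model.bin")  # placeholder path
moe_sd = {}

for name, tensor in dense_sd.items():
    if ".mlp." in name:
        # replicate the dense MLP weights into all 8 experts
        for e in range(NUM_EXPERTS):
            moe_sd[name.replace(".mlp.", f".mlp.experts.{e}.")] = tensor.clone()
    else:
        # attention, embedding, and norm weights are kept unchanged
        moe_sd[name] = tensor

# add one randomly initialized router (gate) per layer
for layer in range(NUM_LAYERS):
    moe_sd[f"model.layers.{layer}.mlp.gate.weight"] = (
        0.02 * torch.randn(NUM_EXPERTS, HIDDEN_SIZE)
    )
```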

Here, router_warmboot means the router of llama3-MoE-base is initialized with the router parameters from [Chinese-Mixtral](https://github.com/ymcui/Chinese-Mixtral), while router_random is the version whose router is randomly initialized.
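
For illustration, the two router initializations differ only in where the gate weights come from. In Hugging Face transformers, Mixtral's gate lives under `block_sparse_moe.gate.weight`; the other names below are assumptions.

```python
# Sketch of the two router initialization variants (names partly illustrative).
import torch

NUM_EXPERTS, HIDDEN_SIZE = 8, 4096

# router_random: a fresh random gate for each layer
gate_random = 0.02 * torch.randn(NUM_EXPERTS, HIDDEN_SIZE)

# router_warmboot: reuse the trained gate from a Chinese-Mixtral checkpoint
mixtral_sd = torch.load("chinese-mixtral/pytorch_model.bin")  # placeholder path
layer = 0
gate_warm = mixtral_sd[f"model.layers.{layer}.block_sparse_moe.gate.weight"]
```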

For details, see the GitHub repository: https://github.com/cooper12121/llama3-8x8b-MoE

Generation example:

```python
import sys
sys.path.append("/apdcephfs_qy3/share_301372554/share_info/qianggao/")

from modeling_file.llama3_moe.modeling_llama_moe import LlamaMoEForCausalLM
from modeling_file.llama3_moe.tokenization_llama_fast import LlamaTokenizerFast

model_ckpt = "/apdcephfs_qy3/share_301372554/share_info/qianggao/ckpt/llama3-8x8b-MoE-base"
tokenizer = LlamaTokenizerFast.from_pretrained(model_ckpt)
model = LlamaMoEForCausalLM.from_pretrained(model_ckpt, device_map="auto", use_cache=False)

text_list = ["hello, what is your name?", "你好,你叫什么名字"]

# llama3 has no pad token; reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
# decoder-only models should be left-padded for batched generation
tokenizer.padding_side = "left"

inputs = tokenizer(text_list, return_tensors="pt", padding=True).to("cuda")

output = model.generate(**inputs, pad_token_id=tokenizer.eos_token_id, max_new_tokens=100)
print(tokenizer.batch_decode(output))
```
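
Note that left padding matters here: with right padding, a decoder-only model would generate continuations from the pad tokens appended to the shorter sequences in the batch.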

The modeling_file package can be obtained from the GitHub repository above.
