---
language: zh
tags:
  - bert-base-chinese
  - Chinese dialogue
widget:
  - text: '[CLS] 福州宠物医院哪家好呢 [eos] 我的喵在善化坊那边的精灵仁爱医院包括绝育驱虫咳嗽之类的东西 [SEP]'
---

This model was post-trained on the following multi-turn Chinese dialogue corpora (using only the training-set portions defined in the literature):

- Douban
- E-commerce
- Restore-200k

Training minimizes two objectives: masked language modeling, and next sentence prediction with three category labels: 0 (a random response sampled from the corpora), 1 (a random response sampled from within the same dialogue context), and 2 (the correct next response).
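The following is a minimal sketch of that joint objective, not the authors' released training code: a masked-LM head plus a 3-way response-classification head on the [CLS] token in place of BERT's binary NSP head. All class and parameter names here are illustrative assumptions.

```python
import torch.nn as nn
from transformers import BertModel

class DialoguePostTrainer(nn.Module):
    """Sketch of the joint masked-LM + 3-way NSP post-training objective."""

    def __init__(self, model_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        self.mlm_head = nn.Linear(hidden, self.bert.config.vocab_size)
        self.nsp_head = nn.Linear(hidden, 3)  # labels 0 / 1 / 2 as above

    def forward(self, input_ids, token_type_ids, attention_mask,
                mlm_labels, nsp_labels):
        out = self.bert(input_ids=input_ids,
                        token_type_ids=token_type_ids,
                        attention_mask=attention_mask)
        mlm_logits = self.mlm_head(out.last_hidden_state)         # (B, T, vocab)
        nsp_logits = self.nsp_head(out.last_hidden_state[:, 0])   # (B, 3) from [CLS]
        loss_fct = nn.CrossEntropyLoss()  # unmasked positions carry label -100
        mlm_loss = loss_fct(mlm_logits.reshape(-1, mlm_logits.size(-1)),
                            mlm_labels.reshape(-1))
        nsp_loss = loss_fct(nsp_logits, nsp_labels)
        return mlm_loss + nsp_loss  # the criterion that is minimized
```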

To encode a multi-turn dialogue with this model, use the format "[CLS] turn t-2 [eos] turn t-1 [SEP] response [SEP]". Tokens up to and including the first [SEP] are assigned to segment 0; all tokens after it are assigned to segment 1. This mirrors the input format used for NSP training in BERT. In addition, a newly introduced token, [eos], separates the different turns within the context. If you have only one context turn as segment 0 and one response turn as segment 1, you can drop [eos] and use the format "[CLS] turn t-1 [SEP] response [SEP]".
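As a concrete illustration, here is a minimal encoding sketch using the transformers library. The model id is a placeholder, the context/response split of the widget example is arbitrary, and the sketch assumes the released tokenizer already includes [eos] as an added token (a stock bert-base-chinese tokenizer would split it into pieces).

```python
import torch
from transformers import BertModel, BertTokenizer

model_id = "bert-base-chinese"  # placeholder: point this at this repository's id
tokenizer = BertTokenizer.from_pretrained(model_id)
model = BertModel.from_pretrained(model_id)

# Earlier turns joined by the model's [eos] separator form segment 0;
# the candidate response forms segment 1.
context = "福州宠物医院哪家好呢 [eos] 我的喵在善化坊那边的精灵仁爱医院"
response = "包括绝育驱虫咳嗽之类的东西"

# Passing two texts makes the tokenizer emit
# "[CLS] context [SEP] response [SEP]", with token_type_ids of 0 for the
# context span (up to and including the first [SEP]) and 1 for the response.
inputs = tokenizer(context, response, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
dialogue_encoding = outputs.last_hidden_state[:, 0]  # [CLS] vector
```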