Base model: facebook/opt-2.7b

Fine-tuned for causal language modeling of transcribed spoken dialogue from the TalkBank CABank collection. Training corpora include:

(Corpus descriptions are from TalkBank)

Data input format: Each input models a sequence of spoken dialogue between two or more participants:

  • The sequence is prefixed with information about the participants, including name (a proper noun, a title/role, or unknown), age (a number or unknown), and sex (male, female, other, or unknown).
  • It then lists all utterances in the conversation in order, each prefixed with its speaker's participant code (S1, S2, S3, etc.).
  • Utterances support a limited set of transcription notations in the CHAT & CHAT-CA formats:
    • Pauses: (.) for a generic short pause, or (N.N) for a timed pause. For example (3.4) is a pause for 3.4 seconds.
    • Non-verbal sounds: &=laughs, &=cough, &=breathes, &=click, etc. Anything describing a speaker-produced non-verbal sound can follow the &= prefix.
    • Comments about speaker or setting: [% baby crying in background], [% smiling], [% phone clicking noise], [% imitating him], etc. Anything describing the state of the speaker or environment can be in this block. Also, a comment block can be used to describe speaker-produced sounds, but it is more common to use the &= prefix for that.
    • Unknown or unintelligible utterances: xxx
    • Breathing: hhh
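
The notations listed above can be located in an utterance with a few regular expressions. The following is a minimal sketch, not the parser used to prepare the training data: the pattern definitions and the names NOTATION_PATTERNS and find_notations are assumptions made for illustration, and the patterns cover only the forms named in the list.

```python
import re

# Illustrative patterns for the notations listed above (assumed, not official).
NOTATION_PATTERNS = [
    ("timed_pause", re.compile(r"\(\d+\.\d+\)")),   # e.g. (3.4)
    ("short_pause", re.compile(r"\(\.\)")),         # (.)
    ("nonverbal", re.compile(r"&=\w+")),            # e.g. &=laughs
    ("comment", re.compile(r"\[%[^\]]*\]")),        # e.g. [% smiling]
    ("unintelligible", re.compile(r"\bxxx\b")),     # xxx
    ("breathing", re.compile(r"\bhhh\b")),          # hhh
]


def find_notations(utterance):
    """Return (kind, text) pairs for each notation, in order of appearance."""
    found = []
    for kind, pattern in NOTATION_PATTERNS:
        for match in pattern.finditer(utterance):
            found.append((match.start(), kind, match.group(0)))
    # Sort by start offset so the result follows the order in the utterance.
    return [(kind, text) for _, kind, text in sorted(found)]


if __name__ == "__main__":
    sample = "hhh [% background noise] uh yeah (0.8) &=cough can you hear me?"
    for kind, text in find_notations(sample):
        print(kind, text)
```

A helper like this may be useful for stripping or rendering the notations in generated text before showing it to a user or passing it to a speech synthesizer.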

Example:

<participant> S1 (name: Dave, age: 33, sex: male) <participant> S2 (name: unknown, age: unknown, sex: unknown) <dialog> S1: Hi! (2.3) are you there? S2: hhh hhh [% background noise] uh yeah (0.8) I can hear you. (1.2) &=cough can you hear me? S1: ...
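
A prompt in this format can be assembled programmatically. The sketch below is one possible way to do it, assuming that fields are separated by single spaces exactly as the example above is displayed; format_participant and build_prompt are hypothetical helper names, not part of any published API.

```python
def format_participant(code, name="unknown", age="unknown", sex="unknown"):
    """Render one participant header; missing fields default to 'unknown'."""
    return f"<participant> {code} (name: {name}, age: {age}, sex: {sex})"


def build_prompt(participants, utterances, next_speaker=None):
    """Build a prompt from participant dicts and (code, text) utterance pairs.

    If next_speaker is given, the prompt ends with that speaker's code and a
    colon so the model continues with that participant's utterance.
    """
    parts = [format_participant(**p) for p in participants]
    parts.append("<dialog>")
    parts.extend(f"{code}: {text}" for code, text in utterances)
    if next_speaker is not None:
        parts.append(f"{next_speaker}:")
    return " ".join(parts)


if __name__ == "__main__":
    prompt = build_prompt(
        participants=[
            {"code": "S1", "name": "Dave", "age": 33, "sex": "male"},
            {"code": "S2"},
        ],
        utterances=[("S1", "Hi! (2.3) are you there?")],
        next_speaker="S2",
    )
    print(prompt)
```

Ending the prompt with a participant code such as S2: is a simple way to steer which speaker the model generates for next.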

Usage Info:

Per the OPT documentation, the model was trained with the tokenizer setting use_fast=False; use the same setting when loading the tokenizer for inference.
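
As a rough sketch of basic (non-real-time) usage with the Hugging Face transformers library, the tokenizer can be loaded with use_fast=False and the model used to continue a formatted dialogue. MODEL_ID is a placeholder, since this card does not state the repository id; the helper names and the sampling settings (max_new_tokens, top_p) are assumptions for illustration and are not recommended values from the author.

```python
# Placeholder: replace with the actual Hub id or local path of this model.
MODEL_ID = "path/or/hub-id-of-this-model"


def load_model_and_tokenizer(model_id=MODEL_ID):
    """Load the fine-tuned model and its slow (use_fast=False) tokenizer."""
    # Imported here so the module can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return model, tokenizer


def continue_dialogue(model, tokenizer, prompt, max_new_tokens=40):
    """Sample a continuation and return only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,  # assumed value, tune as needed
        do_sample=True,
        top_p=0.9,                      # assumed value, tune as needed
    )
    # Drop the prompt tokens so only the continuation is decoded.
    prompt_length = inputs["input_ids"].shape[1]
    new_tokens = output_ids[0][prompt_length:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    model, tokenizer = load_model_and_tokenizer()
    prompt = (
        "<participant> S1 (name: Dave, age: 33, sex: male) "
        "<participant> S2 (name: unknown, age: unknown, sex: unknown) "
        "<dialog> S1: Hi! (2.3) are you there? S2:"
    )
    print(continue_dialogue(model, tokenizer, prompt))
```

Because the model continues the whole conversation, the output may run past the end of one turn; a caller would typically truncate it at the next participant code.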

To use this model for real-time inference in a continuous duplex dialogue system, see: https://github.com/AbrahamSanders/realtime-chatbot.
