BoJack committed on
Commit 18a317b · verified · 1 Parent(s): 2a72be2

Upload 5 files

Files changed (6)
  1. .gitattributes +1 -0
  2. README.md +107 -3
  3. config.yaml +107 -0
  4. configuration.json +12 -0
  5. example/test.wav +0 -0
  6. logo.png +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ logo.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,107 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: other
+ license_name: model-license
+ license_link: https://github.com/alibaba-damo-academy/FunASR
+ frameworks:
+ - Pytorch
+ tasks:
+ - emotion-recognition
+ ---
+
+ <div align="center">
+ <h1>
+ EMOTION2VEC
+ </h1>
+ <p>
+ emotion2vec: universal speech emotion representation model <br>
+ <b><em>emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation</em></b>
+ </p>
+ <p>
+ <img src="logo.png" style="width: 200px; height: 200px;">
+ </p>
+ <p>
+ </p>
+ </div>
+
+ # Guides
+ emotion2vec is the first universal speech emotion representation model. Through self-supervised pre-training, emotion2vec can extract emotion representations across different tasks, languages, and scenarios.
+
+ This version is a pre-trained representation model without fine-tuning; it can be used for feature extraction.
+
+ # Model Card
+ GitHub Repo: [emotion2vec](https://github.com/ddlBoJack/emotion2vec)
+ |Model|⭐Model Scope|🤗Hugging Face|Fine-tuning Data (Hours)|
+ |:---:|:-------------:|:-----------:|:-------------:|
+ |emotion2vec|[Link](https://www.modelscope.cn/models/iic/emotion2vec_base/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_base)|/|
+ |emotion2vec+ seed|[Link](https://modelscope.cn/models/iic/emotion2vec_plus_seed/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_plus_seed)|201|
+ |emotion2vec+ base|[Link](https://modelscope.cn/models/iic/emotion2vec_plus_base/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_plus_base)|4788|
+ |emotion2vec+ large|[Link](https://modelscope.cn/models/iic/emotion2vec_plus_large/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_plus_large)|42526|
+
+ # Installation
+
+ `pip install -U funasr modelscope`
+
+ # Usage
+ input: 16 kHz speech recording
+
+ granularity:
+ - "utterance": Extract features from the entire utterance
+ - "frame": Extract frame-level features (50 Hz)
+
+ extract_embedding: Whether to extract features
+
+ ## Inference based on ModelScope
+
+ ```python
+ from modelscope.pipelines import pipeline
+ from modelscope.utils.constant import Tasks
+
+ inference_pipeline = pipeline(
+     task=Tasks.emotion_recognition,
+     model="iic/emotion2vec_base")
+
+ rec_result = inference_pipeline('https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav', output_dir="./outputs", granularity="utterance", extract_embedding=True)
+ print(rec_result)
+ ```
+
+
+ ## Inference based on FunASR
+
+ ```python
+ from funasr import AutoModel
+
+ model = AutoModel(model="iic/emotion2vec_base")
+
+ res = model.generate(input='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav', output_dir="./outputs", granularity="utterance", extract_embedding=True)
+ print(res)
+ ```
+ Note: The model is downloaded automatically.
+
+ A Kaldi-style input file list, wav.scp, is also supported:
+ ```text
+ wav_name1 wav_path1.wav
+ wav_name2 wav_path2.wav
+ ...
+ ```
+
+ Outputs are emotion representations, saved in output_dir in NumPy format (they can be loaded with np.load()).
+
+ # Note
+
+ This repository is the Hugging Face version of emotion2vec, with model parameters identical to those of the original model and the ModelScope version.
+
+ Original repository: [https://github.com/ddlBoJack/emotion2vec](https://github.com/ddlBoJack/emotion2vec)
+
+ ModelScope repository: [https://github.com/alibaba-damo-academy/FunASR](https://github.com/alibaba-damo-academy/FunASR/tree/funasr1.0/examples/industrial_data_pretraining/emotion2vec)
+
+ Hugging Face repository: [https://huggingface.co/emotion2vec](https://huggingface.co/emotion2vec)
+
+ # Citation
+ ```BibTeX
+ @article{ma2023emotion2vec,
+   title={emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation},
+   author={Ma, Ziyang and Zheng, Zhisheng and Ye, Jiaxin and Li, Jinchao and Gao, Zhifu and Zhang, Shiliang and Chen, Xie},
+   journal={arXiv preprint arXiv:2312.15185},
+   year={2023}
+ }
+ ```
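The README in this diff asks for 16 kHz input and accepts a Kaldi-style `wav.scp` list. The sketch below shows one way to build that list from a directory while checking each file's sample rate with the standard-library `wave` module. The helper name `build_wav_scp`, the flat directory layout, and the decision to skip files with a different rate are all assumptions for illustration. None of them come from FunASR or ModelScope, and `wave` only reads uncompressed PCM WAV files.

```python
import wave
from pathlib import Path

EXPECTED_RATE = 16000  # the README specifies 16 kHz input


def build_wav_scp(wav_dir, scp_path, expected_rate=EXPECTED_RATE):
    """Write "name path" lines for each .wav in wav_dir that has the expected rate.

    Hypothetical helper: returns the names of files skipped for a rate mismatch.
    """
    skipped = []
    lines = []
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        # Read only the header information needed for the sample-rate check.
        with wave.open(str(wav_path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != expected_rate:
            skipped.append(wav_path.name)
            continue
        # Kaldi style: an utterance id, a space, then the path to the audio.
        lines.append(f"{wav_path.stem} {wav_path}")
    text = "\n".join(lines) + ("\n" if lines else "")
    Path(scp_path).write_text(text, encoding="utf-8")
    return skipped
```

Files reported as skipped would need resampling to 16 kHz before being added to the list.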
config.yaml ADDED
@@ -0,0 +1,107 @@
+
+ # network architecture
+ model: Emotion2vec
+ model_conf:
+     loss_beta: 0.0
+     loss_scale: null
+     depth: 8
+     start_drop_path_rate: 0.0
+     end_drop_path_rate: 0.0
+     num_heads: 12
+     norm_eps: 1e-05
+     norm_affine: true
+     encoder_dropout: 0.1
+     post_mlp_drop: 0.1
+     attention_dropout: 0.1
+     activation_dropout: 0.0
+     dropout_input: 0.0
+     layerdrop: 0.05
+     embed_dim: 768
+     mlp_ratio: 4.0
+     layer_norm_first: false
+     average_top_k_layers: 8
+     end_of_block_targets: false
+     clone_batch: 8
+     layer_norm_target_layer: false
+     batch_norm_target_layer: false
+     instance_norm_target_layer: true
+     instance_norm_targets: false
+     layer_norm_targets: false
+     ema_decay: 0.999
+     ema_same_dtype: true
+     log_norms: true
+     ema_end_decay: 0.99999
+     ema_anneal_end_step: 20000
+     ema_encoder_only: false
+     max_update: 100000
+     extractor_mode: layer_norm
+     shared_decoder: null
+     min_target_var: 0.1
+     min_pred_var: 0.01
+     supported_modality: AUDIO
+     mae_init: false
+     seed: 1
+     skip_ema: false
+     cls_loss: 1.0
+     recon_loss: 0.0
+     d2v_loss: 1.0
+     decoder_group: false
+     adversarial_training: false
+     adversarial_hidden_dim: 128
+     adversarial_weight: 0.1
+     cls_type: chunk
+     normalize: true
+
+     modalities:
+         audio:
+             type: AUDIO
+             prenet_depth: 4
+             prenet_layerdrop: 0.05
+             prenet_dropout: 0.1
+             start_drop_path_rate: 0.0
+             end_drop_path_rate: 0.0
+             num_extra_tokens: 10
+             init_extra_token_zero: true
+             mask_noise_std: 0.01
+             mask_prob_min: null
+             mask_prob: 0.5
+             inverse_mask: false
+             mask_prob_adjust: 0.05
+             keep_masked_pct: 0.0
+             mask_length: 5
+             add_masks: false
+             remove_masks: false
+             mask_dropout: 0.0
+             encoder_zero_mask: true
+             mask_channel_prob: 0.0
+             mask_channel_length: 64
+             ema_local_encoder: false
+             local_grad_mult: 1.0
+             use_alibi_encoder: true
+             alibi_scale: 1.0
+             learned_alibi: false
+             alibi_max_pos: null
+             learned_alibi_scale: true
+             learned_alibi_scale_per_head: true
+             learned_alibi_scale_per_layer: false
+             num_alibi_heads: 12
+             model_depth: 8
+             decoder:
+                 decoder_dim: 384
+                 decoder_groups: 16
+                 decoder_kernel: 7
+                 decoder_layers: 4
+                 input_dropout: 0.1
+                 add_positions_masked: false
+                 add_positions_all: false
+                 decoder_residual: true
+                 projection_layers: 1
+                 projection_ratio: 2.0
+             extractor_mode: layer_norm
+             feature_encoder_spec: '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] + [(512,2,2)]'
+             conv_pos_width: 95
+             conv_pos_groups: 16
+             conv_pos_depth: 5
+             conv_pos_pre_ln: false
+
+
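The `feature_encoder_spec` string in config.yaml can be related to the 50 Hz frame rate quoted in the README. The sketch below assumes each tuple means (channels, kernel size, stride), the convention used by wav2vec 2.0-style convolutional front ends; the config itself does not state this. It also evaluates the string with `eval` and builtins disabled, which is only reasonable for a trusted config file. The function names are hypothetical.

```python
import math

# Copied from config.yaml; each tuple is ASSUMED to be (channels, kernel, stride).
FEATURE_ENCODER_SPEC = "[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] + [(512,2,2)]"
SAMPLE_RATE = 16000  # 16 kHz input, per the README


def parse_spec(spec):
    """Expand the list expression into concrete layers (trusted input only)."""
    return eval(spec, {"__builtins__": {}}, {})


def total_stride(layers):
    """Input samples consumed per output frame: the product of all strides."""
    return math.prod(stride for _, _, stride in layers)


def receptive_field(layers):
    """Input samples seen by one output frame of the stacked convolutions."""
    field, jump = 1, 1
    for _, kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


layers = parse_spec(FEATURE_ENCODER_SPEC)
print(len(layers))                         # 7
print(total_stride(layers))                # 320
print(SAMPLE_RATE / total_stride(layers))  # 50.0
print(receptive_field(layers))             # 400
```

Under these assumptions, one frame is produced every 320 samples (20 ms) and covers 400 samples (25 ms), which is consistent with the README's "frame" granularity of 50 Hz.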
configuration.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "framework": "pytorch",
+   "task" : "emotion-recognition",
+   "pipeline": {"type":"funasr-pipeline"},
+   "model": {"type" : "funasr"},
+   "file_path_metas": {
+     "init_param":"emotion2vec_base.pt",
+     "config":"config.yaml"},
+   "model_name_in_hub": {
+     "ms":"iic/emotion2vec_base",
+     "hf":""}
+ }
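configuration.json names two files under `file_path_metas`. A minimal sketch of a sanity check for a local checkout follows, assuming those values are paths relative to the repository root. That reading, and the helper name `missing_model_files`, are assumptions; how FunASR actually resolves these entries is not shown in this diff.

```python
import json
from pathlib import Path


def missing_model_files(repo_dir):
    """Return files named in configuration.json that are absent from repo_dir.

    Hypothetical helper: treats each "file_path_metas" value as a path
    relative to the repository root.
    """
    repo = Path(repo_dir)
    config = json.loads((repo / "configuration.json").read_text(encoding="utf-8"))
    referenced = config.get("file_path_metas", {})
    return sorted(name for name in referenced.values() if not (repo / name).is_file())
```

Run against a directory holding only the six files changed in this commit, the check would report `emotion2vec_base.pt`, because that checkpoint is not among them.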
example/test.wav ADDED
Binary file (131 kB)
logo.png ADDED

Git LFS Details

  • SHA256: 8a1aa31431bfb2bf126d7cf383c8b681b2372c333f1328b342bab5969dc0a569
  • Pointer size: 132 Bytes
  • Size of remote file: 1.85 MB
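The Git LFS details give a SHA256 for logo.png. An LFS object id is the SHA-256 of the real file content, not of the 132-byte pointer file, so a fully downloaded copy can be checked locally. This is a minimal sketch; the helper names are hypothetical and the expected digest is copied from the details above.

```python
import hashlib

# Digest shown in the Git LFS details for logo.png in this commit.
EXPECTED_SHA256 = "8a1aa31431bfb2bf126d7cf383c8b681b2372c333f1328b342bab5969dc0a569"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_lfs_oid(path, expected=EXPECTED_SHA256):
    """True when the file content hashes to the expected LFS object id."""
    return sha256_of_file(path) == expected
```

A mismatch on a file of roughly 132 bytes usually means the checkout still holds the LFS pointer rather than the 1.85 MB image.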