Uploading images and scripts from git
- README.md +245 -18
- beam_search_utils.py +5 -6
- hyper_optim.py +149 -0
- images/bsd_equation.png +0 -0
- images/bsd_example_pic.png +0 -0
- images/overall_dataflow.png +0 -0
- images/two_realms.png +0 -0
- images/word_level_spk_prob.png +0 -0
- requirements.txt +1 -0
- run_optuna_hyper_optim.sh +68 -0
- run_speaker_tagging_beam_search.sh +4 -2
- speaker_tagging_beamsearch.py +53 -25
README.md
CHANGED
@@ -1,25 +1,129 @@
- SLT 2024 Challenge: Post-ASR-Speaker-Tagging
- #
- SLT 2024 Challenge GenSEC Track 2: Post-ASR-Speaker-Tagging
@@ -58,26 +162,62 @@ Clone the dataset from Hugging Face server.
- find
- find
- Now you are ready to launch the script.
- ```
- DIAR_OUT_DOWNLOAD=$WORKSPACE
@@ -94,11 +234,8 @@ echo "UNIQ MEMO:" $UNIQ_MEMO
- rm $WORKSPACE/$ASRDIAR_FILE_NAME.src.seglst.json
- rm $WORKSPACE/$ASRDIAR_FILE_NAME.ref.seglst.json
- rm $WORKSPACE/$ASRDIAR_FILE_NAME.hyp.seglst.json
@@ -111,7 +248,6 @@ python $BASEPATH/speaker_tagging_beamsearch.py \
- out_dir=$DIAR_OUT_DOWNLOAD
@@ -120,7 +256,7 @@ We use [MeetEval](https://github.com/fgnt/meeteval) software to evaluate `cpWER`
- ```

Updated file content:
# llm_speaker_tagging

SLT 2024 Challenge: Track-2 Post-ASR-Speaker-Tagging
Baseline and Instructions for Track-2

# GenSEC Challenge Track-2 Introduction

SLT 2024 Challenge GenSEC Track 2: Post-ASR-Speaker-Tagging

- Track-2 is a challenge track that aims to correct the speaker tagging of ASR-generated transcripts that were tagged by a speaker diarization system.
- Traditional speaker diarization systems cannot take lexical cues into account, which leads to errors that disrupt the context of human conversations.
- In the provided dataset, we refer to these erroneous transcripts as `err_source_text` (error source text). Here is an example.

- Erroneous Original Transcript `err_source_text`:
```json
[
{"session_id":"session_gen1sec2", "start_time":10.02, "end_time":11.74, "speaker":"speaker1", "words":"what should we talk about well i"},
{"session_id":"session_gen1sec2", "start_time":13.32, "end_time":17.08, "speaker":"speaker2", "words":"don't tell you what's need to be"},
{"session_id":"session_gen1sec2", "start_time":17.11, "end_time":17.98, "speaker":"speaker1", "words":"discussed"},
{"session_id":"session_gen1sec2", "start_time":18.10, "end_time":19.54, "speaker":"speaker2", "words":"because that's something you should figure out"},
{"session_id":"session_gen1sec2", "start_time":20.10, "end_time":21.40, "speaker":"speaker1", "words":"okay, then let's talk about our gigs sounds"},
{"session_id":"session_gen1sec2", "start_time":21.65, "end_time":23.92, "speaker":"speaker2", "words":"good do you have any specific ideas"},
]
```
Note that the words `well i`, `discussed`, and `sounds` are tagged with the wrong speakers.

- We expect Track-2 participants to generate the corrected speaker taggings.
- Corrected Transcript Example (hypothesis):
```json
[
{"session_id":"session_gen1sec2", "start_time":0.0, "end_time":0.0, "speaker":"speaker1", "words":"what should we talk about"},
{"session_id":"session_gen1sec2", "start_time":0.0, "end_time":0.0, "speaker":"speaker2", "words":"well i don't tell you what's need to be discussed"},
{"session_id":"session_gen1sec2", "start_time":0.0, "end_time":0.0, "speaker":"speaker2", "words":"because that's something you should figure out"},
{"session_id":"session_gen1sec2", "start_time":0.0, "end_time":0.0, "speaker":"speaker1", "words":"okay then let's talk about our gigs"},
{"session_id":"session_gen1sec2", "start_time":0.0, "end_time":0.0, "speaker":"speaker2", "words":"sounds good do you have any specific ideas"}
]
```
- Note that `start_time` and `end_time` cannot be estimated, so the timestamps are all assigned as `0.0`.
- Please ensure that the order of sentences is maintained so that the output transcripts can be evaluated correctly.
- **Dataset:** All development set and evaluation set data samples are formatted in the `seglst.json` format, which is a list of dictionaries with the keys shown above:
```python
{
    "session_id": str,
    "start_time": float,
    "end_time": float,
    "speaker": str,
    "words": str,
}
```
|
52 |
+
|
53 |
+
1. The participants should **only use text (transcripts)** as the only modality. We do not provide any speech (audio) signal for the transcripts.
|
54 |
+
2. The participants are allowed to correct the words (e.g. `spk1:hi are wow` to `spk1:how are you`) without changing the speaker labels. That is, this involves Track-1 in a way.
|
55 |
+
3. The participants are allowed to use any type of language model and methods.
|
56 |
+
- It does not need to be instruct (chat-based) large language models such as GPTs, LLaMa.
|
57 |
+
- No restrictions on the parameter size of the LLM.
|
58 |
+
- The participants can use prompt tuning, model alignment or any type of fine-tuning methods.
|
59 |
+
- The participants are also allowed to use beam search decoding techniques with LLMs.
|
60 |
+
4. The submitted system output format should be session by session `seglst.json` format and evaluated by `cpwer` metric.
|
61 |
+
5. The participants will submit two json files:
|
62 |
+
|
63 |
+
(1) `err_dev.hyp.seglst.json`
|
64 |
+
(2) `err_eval.hyp.seglst.json`
|
65 |
+
|
66 |
+
for both dev and eval set, respectively.
|
67 |
+
6. In each `err_dev.hyp.seglst.json` `err_eval.hyp.seglst.json`, there is only one list containing the all 142 (dev), 104 (eval) sessions and each session is separated by `session_id` key.
|
68 |
+
|
69 |
+
- Example of the final submission form `err_dev.hyp.seglst.json` and `err_eval.hyp.seglst.json`:
|
70 |
+
```json
|
71 |
+
[
|
72 |
+
{"session_id":"session_abc123ab", "start_time":0.0, "end_time":0.0, "speaker":"speaker1", "words":"well it is what it is"},
|
73 |
+
{"session_id":"session_abc123ab", "start_time":0.0, "end_time":0.0, "speaker":"speaker2", "words":"yeah so be it"},
|
74 |
+
{"session_id":"session_xyz456cd", "start_time":0.0, "end_time":0.0, "speaker":"speaker1", "words":"wow you are late again"},
|
75 |
+
{"session_id":"session_xyz456cd", "start_time":0.0, "end_time":0.0, "speaker":"speaker2", "words":"sorry traffic jam"},
|
76 |
+
{"session_id":"session_xyz456cd", "start_time":0.0, "end_time":0.0, "speaker":"speaker3", "words":"hey how was last night"}
|
77 |
+
]
|
78 |
+
```
|
79 |
+
|
80 |
+
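Here is a minimal merging sketch (not part of the provided scripts) for producing such a single-list submission file, assuming your system writes one `*.seglst.json` file per session into a directory of your choice:

```python
import glob
import json
import os

def merge_seglst(session_dir, output_path):
    """Concatenate per-session seglst.json files into one submission list."""
    merged = []
    # Sorting by file name keeps the merge deterministic; within each session
    # the original segment order is preserved.
    for path in sorted(glob.glob(os.path.join(session_dir, "*.seglst.json"))):
        with open(path, "r") as f:
            merged.extend(json.load(f))  # each file holds a list of segments
    with open(output_path, "w") as f:
        json.dump(merged, f, indent=4)

# Hypothetical usage:
# merge_seglst("my_system_outputs/dev", "err_dev.hyp.seglst.json")
```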
## Baseline System Introduction: Contextual Beam Search Decoding

The baseline system is based on the system proposed in [Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach](https://arxiv.org/pdf/2309.05248) (we refer to this method as Contextual Beam Search, CBS). Note that the Track-2 GenSEC challenge only allows the text modality, so this method injects placeholder probabilities represented by `peak_prob`.

The proposed CBS method brings the beam search technique used for ASR language models to speaker diarization.

<img src="images/two_realms.png" width="720" alt="Two Realms"/>

In the CBS method, the following three probability values are needed:

**P(E|S)**: the speaker diarization posterior probability (the probability of the acoustic observation E given speaker S)
**P(W)**: the probability of the next word W
**P(S|W)**: the conditional probability of the speaker S given the next word W

<img src="images/bsd_equation.png" width="360" alt="BSD Equation"/>

Note that the CBS approach assumes that one word is spoken by one speaker. In this baseline system, a placeholder speaker probability `peak_prob` is added since we do not have access to an acoustic-only speaker diarization system.

<img src="images/word_level_spk_prob.png" width="720" alt="Word Level Speaker Probability"/>

The following diagram explains how beam search decoding works with speaker diarization and ASR.

<img src="images/bsd_example_pic.png" width="880" alt="Example of beam search decoding with scores"/>

The overall data flow is shown below. Note that we use a fixed value for the speaker probabilities.

<img src="images/overall_dataflow.png" width="720" alt="Overall Dataflow"/>
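To make the combination of these probabilities concrete, here is a simplified, schematic sketch of one beam-search step. The exact scoring formula is the one in the equation above (implemented in pydiardecode), so the weighting below is only an illustration: `alpha` and `beta` play the same role as the `ALPHA`/`BETA` parameters in the run scripts, and `spk_probs` stands in for the word-level (placeholder) speaker probabilities.

```python
import math

def step_beam(beams, word, spk_probs, lm_score_fn, alpha=0.4, beta=0.04, beam_width=8):
    """Extend each hypothesis by one word, trying every candidate speaker.

    beams:       list of (speaker_sequence, cumulative_log_score) tuples
    word:        the next word of the transcript
    spk_probs:   dict mapping speaker label -> P(speaker | word), e.g. peak_prob-style placeholders
    lm_score_fn: callable(speaker_sequence, word) -> log-probability of the word given the
                 speaker-conditioned history (e.g. backed by a KenLM n-gram model)
    """
    candidates = []
    for spk_seq, score in beams:
        for spk, p_spk in spk_probs.items():
            lm_logp = lm_score_fn(spk_seq + [spk], word)
            # Illustrative weighted combination of speaker and language-model evidence;
            # beta acts as a small per-word bonus.
            new_score = score + math.log(p_spk) + alpha * lm_logp + beta
            candidates.append((spk_seq + [spk], new_score))
    # Keep only the best `beam_width` hypotheses.
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
```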
## Baseline System Installation

Run the following commands at the main level of this repository.

### Conda Environment

The baseline system works in a `conda` environment with Python 3.10.

```
conda create --name llmspk python=3.10
```

### Install requirements

You need to install the following packages

Clone the dataset from the Hugging Face server.

```
git clone https://huggingface.co/datasets/GenSEC-LLM/SLT-Task2-Post-ASR-Speaker-Tagging
```

Inside the cloned folder, you will see the following folder structure.

```bash
.
├── err_source_text
│   ├── dev
│   │   ├── session_014b5cda.seglst.json
│   │   ├── session_02d73d95.seglst.json
│   │   ├── ...
│   │   ├── session_fcd0a550.seglst.json
│   │   └── session_ff16b903.seglst.json
│   └── eval
│       ├── session_0259446c.seglst.json
│       ├── session_0bea34fa.seglst.json
│       ├── ...
│       ├── session_f84edf1f.seglst.json
│       └── session_febfa7aa.seglst.json
├── ref_annotated_text
│   └── dev
│       ├── session_014b5cda.seglst.json
│       ├── session_02d73d95.seglst.json
│       ├── ...
│       ├── session_fcd0a550.seglst.json
│       └── session_ff16b903.seglst.json
```

The file counts are as follows:
- `err_source_text`: dev 142 files, eval 104 files
- `ref_annotated_text`: dev 142 files

Run the following commands to construct the input list files `err_dev.src.list` and `err_dev.ref.list`.
```
find $PWD/SLT-Task2-Post-ASR-Speaker-Tagging/err_source_text/dev -maxdepth 1 -type f -name "*.seglst.json" > err_dev.src.list
find $PWD/SLT-Task2-Post-ASR-Speaker-Tagging/ref_annotated_text/dev -maxdepth 1 -type f -name "*.seglst.json" > err_dev.ref.list
```
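If you prefer Python, a short sketch can produce the same list files (an alternative to the `find` commands above, assuming the dataset was cloned into the current directory); each list file simply contains one absolute `.seglst.json` path per line:

```python
import glob
import os

def write_list_file(pattern, output_path):
    """Write one absolute .seglst.json path per line, as in the .src/.ref list files."""
    paths = sorted(glob.glob(pattern))
    with open(output_path, "w") as f:
        f.write("\n".join(os.path.abspath(p) for p in paths) + "\n")
    print(f"Wrote {len(paths)} paths to {output_path}")

write_list_file("SLT-Task2-Post-ASR-Speaker-Tagging/err_source_text/dev/*.seglst.json", "err_dev.src.list")
write_list_file("SLT-Task2-Post-ASR-Speaker-Tagging/ref_annotated_text/dev/*.seglst.json", "err_dev.ref.list")
```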
### Launch the baseline script

Now you are ready to launch the baseline script.
Launch the baseline script `run_speaker_tagging_beam_search.sh`

```bash
BASEPATH=${PWD}
DIAR_LM_PATH=$BASEPATH/arpa_model/4gram_small.arpa
ASRDIAR_FILE_NAME=err_dev
OPTUNA_STUDY_NAME=speaker_beam_search_${ASRDIAR_FILE_NAME}
WORKSPACE=$BASEPATH/SLT-Task2-Post-ASR-Speaker-Tagging
INPUT_ERROR_SRC_LIST_PATH=$BASEPATH/$ASRDIAR_FILE_NAME.src.list
GROUNDTRUTH_REF_LIST_PATH=$BASEPATH/$ASRDIAR_FILE_NAME.ref.list
DIAR_OUT_DOWNLOAD=$WORKSPACE/$ASRDIAR_FILE_NAME
mkdir -p $DIAR_OUT_DOWNLOAD

### SLT 2024 Speaker Tagging Setting v1.0.2
ALPHA=0.4
BETA=0.04

TRIAL=telephonic
BATCH_SIZE=11

python $BASEPATH/speaker_tagging_beamsearch.py \
    hyper_params_optim=false \
    port=[5501,5502,5511,5512,5521,5522,5531,5532] \
    arpa_language_model=$DIAR_LM_PATH \
    batch_size=$BATCH_SIZE \
    beam_width=$BEAM_WIDTH \
    word_window=$WORD_WINDOW \
    peak_prob=$PEAK_PROB
```
### Evaluate

We use the [MeetEval](https://github.com/fgnt/meeteval) software to evaluate `cpWER`.
cpWER measures both speaker tagging and word error rate (WER) by testing all the permutations of transcripts and choosing the permutation that gives the lowest error.

```bash
echo "Evaluating the original source transcript."
meeteval-wer cpwer -h $WORKSPACE/$ASRDIAR_FILE_NAME.src.seglst.json -r $WORKSPACE/$ASRDIAR_FILE_NAME.ref.seglst.json
echo "Source cpWER: " $(jq '.error_rate' $WORKSPACE/$ASRDIAR_FILE_NAME.src.seglst_cpwer.json)

echo "Hypothesis cpWER: " $(jq '.error_rate' $WORKSPACE/$ASRDIAR_FILE_NAME.hyp.seglst_cpwer.json)
```

The `cpwer` result will be stored in the `./SLT-Task2-Post-ASR-Speaker-Tagging/err_dev.hyp.seglst_cpwer.json` file.

```bash
cat ./SLT-Task2-Post-ASR-Speaker-Tagging/err_dev.hyp.seglst_cpwer.json
```
The result file contains a JSON dictionary; `"error_rate"` is the `cpwer` value we want to minimize.
```json
{
  "error_rate": 0.18784847090516965,
  "errors": 73077,
  "length": 389021,
  "insertions": 13739,
  "deletions": 42173,
  "substitutions": 17165,
  "reference_self_overlap": null,
  "hypothesis_self_overlap": null,
  "missed_speaker": 0,
  "falarm_speaker": 6,
  "scored_speaker": 330,
  "assignment": null
}
```
## Appendix A: LLM example of speaker tagging correction

This is an example of GPT-based speaker tagging correction. The following text is the prompt fed into ChatGPT (GPT-4), including the example in this document.

```markdown
- Track-2 is a challenge track that aims to correct the speaker tagging of the ASR-generated transcripts tagged with a speaker diarization system.
- Since the traditional speaker diarization systems cannot take lexical cues into account, leading to errors that disrupt the context of human conversations.
- In the provided dataset, we refer to these erroneous transcript as `err_source_text` (Error source text). Here is an example.

- Erroneous Original Transcript `err_source_text`:

[
{"session_id":"session_gen1sec2", "start_time":10.02, "end_time":11.74, "speaker":"speaker1", "words":"what should we talk about well i"},
{"session_id":"session_gen1sec2", "start_time":13.32, "end_time":17.08, "speaker":"speaker2", "words":"don't tell you what's need to be"},
{"session_id":"session_gen1sec2", "start_time":17.11, "end_time":17.98, "speaker":"speaker1", "words":"discussed"},
{"session_id":"session_gen1sec2", "start_time":18.10, "end_time":19.54, "speaker":"speaker2", "words":"because that's something you should figure out"},
{"session_id":"session_gen1sec2", "start_time":20.10, "end_time":21.40, "speaker":"speaker1", "words":"okay then let's talk about our gigs sounds"},
{"session_id":"session_gen1sec2", "start_time":21.65, "end_time":23.92, "speaker":"speaker2", "words":"good do you have any specific ideas"},
]

Relocate the words to make the conversation looks and sounds natural.
- Do not add or remove words.
- Do not switch the speaker labels.
- Do not change speaker1 to speaker2 or speaker2 to speaker 1.
- Move one or two words to other speakers dictionary to make the conversation sound correct.
```
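The answers below were obtained through the chat interface, but the same prompt can also be sent programmatically. Here is a minimal sketch assuming the `openai` Python package, an `OPENAI_API_KEY` environment variable, and a hypothetical `speaker_tagging_prompt.md` file holding the prompt shown above:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The markdown prompt shown above, saved to a file (hypothetical name).
with open("speaker_tagging_prompt.md", "r") as f:
    prompt_text = f.read()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt_text}],
    temperature=0.0,  # keep the output as deterministic as possible for comparison
)
print(response.choices[0].message.content)
```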
- The answer from GPT-4 (Tested on 05/01/2024).

```markdown
To correct the conversation's flow in the transcript, I'll rearrange the words between the speakers while keeping the speakers consistent with the rules you provided. Here is the corrected version:

json [code]
[
{"session_id":"session_gen1sec2", "start_time":10.02, "end_time":11.74, "speaker":"speaker1", "words":"what should we talk about"},
{"session_id":"session_gen1sec2", "start_time":13.32, "end_time":17.08, "speaker":"speaker2", "words":"well i don't tell you what's need to be"},
{"session_id":"session_gen1sec2", "start_time":17.11, "end_time":17.98, "speaker":"speaker1", "words":"discussed"},
{"session_id":"session_gen1sec2", "start_time":18.10, "end_time":19.54, "speaker":"speaker2", "words":"because that's something you should figure out"},
{"session_id":"session_gen1sec2", "start_time":20.10, "end_time":21.40, "speaker":"speaker1", "words":"okay then let's talk about our gigs"},
{"session_id":"session_gen1sec2", "start_time":21.65, "end_time":23.92, "speaker":"speaker2", "words":"sounds good do you have any specific ideas"}
]

Here are the changes I made:

I moved "well i" from speaker1's first utterance to the beginning of speaker2's first utterance to better connect the dialogue.
"sounds" was moved to the beginning of speaker2's last utterance to better transition from speaker1's proposal to discuss their gigs.
```
In this example, note that the words `well i` and `sounds` are corrected properly, while GPT-4 missed `discussed` and left it assigned to `speaker1`.

- The answer from Llama3 (Tested on 05/01/2024).

```markdown
Here is the relocated transcript to make the conversation sound natural:
[
{"session_id":"session_gen1sec2", "start_time":10.02, "end_time":11.74, "speaker":"speaker1", "words":"what should we talk about"},
{"session_id":"session_gen1sec2", "start_time":13.32, "end_time":15.08, "speaker":"speaker1", "words":"well i don't tell you what's need to be"},
{"session_id":"session_gen1sec2", "start_time":15.11, "end_time":17.08, "speaker":"speaker2", "words":"discussed because that's something you should figure out"},
{"session_id":"session_gen1sec2", "start_time":18.10, "end_time":19.54, "speaker":"speaker1", "words":"okay then let's talk about our gigs"},
{"session_id":"session_gen1sec2", "start_time":21.65, "end_time":22.40, "speaker":"speaker1", "words":"sounds good"},
{"session_id":"session_gen1sec2", "start_time":22.45, "end_time":23.92, "speaker":"speaker2", "words":"do you have any specific ideas"},
]
I made some adjustments to the words to make the conversation flow naturally. Let me know if this meets your requirements!
```
In this Llama3 example, note that the words `well`, `i`, and `sounds` are corrected, but nearly all sentences got swapped speaker labels.
### Reference

@inproceedings{park2024enhancing,
beam_search_utils.py
CHANGED
@@ -8,7 +8,7 @@ import json
 import concurrent.futures
 import kenlm

-__INFO_TAG__ = "[INFO]"
+__INFO_TAG__ = "[BeamSearchUtil INFO]"

 class SpeakerTaggingBeamSearchDecoder:
     def __init__(self, loaded_kenlm_model: kenlm, cfg: dict):
@@ -127,11 +127,10 @@ class SpeakerTaggingBeamSearchDecoder:
         div_trans_info_dict[seq_id]['words'] = w_seq
         return div_trans_info_dict

 def run_mp_beam_search_decoding(
     speaker_beam_search_decoder,
     loaded_kenlm_model,
+    div_trans_info_dict,
     org_trans_info_dict,
     div_mp,
     win_len,
@@ -147,7 +146,7 @@ def run_mp_beam_search_decoding(
     else:
         num_workers = len(port)

-    uniq_id_list = sorted(list(
+    uniq_id_list = sorted(list(div_trans_info_dict.keys() ))
     tp = concurrent.futures.ProcessPoolExecutor(max_workers=num_workers)
     futures = []
@@ -159,7 +158,7 @@ def run_mp_beam_search_decoding(
         else:
             port_num = None
         count += 1
-        uniq_trans_info_dict = {uniq_id:
+        uniq_trans_info_dict = {uniq_id: div_trans_info_dict[uniq_id]}
         futures.append(tp.submit(speaker_beam_search_decoder.beam_search_diarization, uniq_trans_info_dict, port_num=port_num))

     pbar = tqdm(total=len(uniq_id_list), desc="Running beam search decoding", unit="files")
@@ -321,5 +320,5 @@ def write_seglst_jsons(

         print(f"{__INFO_TAG__} Writing {diar_out_path}/{session_id}.seglst.json")
         total_output_filename = total_output_filename.replace("src", ext_str).replace("ref", ext_str)
-        with open(f'{diar_out_path}
+        with open(f'{diar_out_path}/{total_output_filename}.seglst.json', 'w') as file:
             json.dump(total_infer_list, file, indent=4) # indent=4 for pretty printing
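The added `div_trans_info_dict` argument lets each worker receive exactly one session. As a simplified illustration of that fan-out pattern (with a placeholder `decode_one_session` standing in for `beam_search_diarization`, so this is not the repository code itself):

```python
import concurrent.futures

def decode_one_session(session_dict):
    # Placeholder for speaker_beam_search_decoder.beam_search_diarization(...).
    return session_dict

def run_parallel(div_trans_info_dict, num_workers=4):
    """Submit one future per session id and merge the per-session results."""
    uniq_id_list = sorted(div_trans_info_dict.keys())
    results = {}
    with concurrent.futures.ProcessPoolExecutor(max_workers=num_workers) as tp:
        futures = [
            tp.submit(decode_one_session, {uniq_id: div_trans_info_dict[uniq_id]})
            for uniq_id in uniq_id_list
        ]
        for future in concurrent.futures.as_completed(futures):
            results.update(future.result())
    return results
```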
hyper_optim.py
ADDED
@@ -0,0 +1,149 @@
import optuna
import os
import tempfile
import time
import json
import subprocess
import logging
from beam_search_utils import (
    write_seglst_jsons,
    run_mp_beam_search_decoding,
    convert_nemo_json_to_seglst,
)
from hydra.core.config_store import ConfigStore


def evaluate(cfg, temp_out_dir, workspace_dir, asrdiar_file_name, source_info_dict, hypothesis_sessions_dict, reference_info_dict):
    write_seglst_jsons(hypothesis_sessions_dict, input_error_src_list_path=cfg.input_error_src_list_path, diar_out_path=temp_out_dir, ext_str='hyp')
    write_seglst_jsons(reference_info_dict, input_error_src_list_path=cfg.groundtruth_ref_list_path, diar_out_path=temp_out_dir, ext_str='ref')
    write_seglst_jsons(source_info_dict, input_error_src_list_path=cfg.groundtruth_ref_list_path, diar_out_path=temp_out_dir, ext_str='src')

    # Construct the file paths
    src_seglst_json = os.path.join(temp_out_dir, f"{asrdiar_file_name}.src.seglst.json")
    hyp_seglst_json = os.path.join(temp_out_dir, f"{asrdiar_file_name}.hyp.seglst.json")
    ref_seglst_json = os.path.join(temp_out_dir, f"{asrdiar_file_name}.ref.seglst.json")

    # Construct the output JSON file path
    output_cpwer_hyp_json_file = os.path.join(temp_out_dir, f"{asrdiar_file_name}.hyp.seglst_cpwer.json")
    output_cpwer_src_json_file = os.path.join(temp_out_dir, f"{asrdiar_file_name}.src.seglst_cpwer.json")

    # Run meeteval-wer command
    cmd_hyp = [
        "meeteval-wer",
        "cpwer",
        "-h", hyp_seglst_json,
        "-r", ref_seglst_json,
    ]
    subprocess.run(cmd_hyp)

    cmd_src = [
        "meeteval-wer",
        "cpwer",
        "-h", src_seglst_json,
        "-r", ref_seglst_json,
    ]
    subprocess.run(cmd_src)

    # Read the JSON file and print the cpWER
    try:
        with open(output_cpwer_hyp_json_file, "r") as file:
            data_h = json.load(file)
            print("Hypothesis cpWER:", data_h["error_rate"])
        cpwer = data_h["error_rate"]
        logging.info(f"-> HYPOTHESIS cpWER={cpwer:.4f}")
    except FileNotFoundError:
        raise FileNotFoundError(f"Output JSON: {output_cpwer_hyp_json_file}\nfile not found.")

    try:
        with open(output_cpwer_src_json_file, "r") as file:
            data_s = json.load(file)
            print("Source cpWER:", data_s["error_rate"])
        source_cpwer = data_s["error_rate"]
        logging.info(f"-> SOURCE cpWER={source_cpwer:.4f}")
    except FileNotFoundError:
        raise FileNotFoundError(f"Output JSON: {output_cpwer_src_json_file}\nfile not found.")
    return cpwer


def optuna_suggest_params(cfg, trial):
    cfg.alpha = trial.suggest_float("alpha", 0.01, 5.0)
    cfg.beta = trial.suggest_float("beta", 0.001, 2.0)
    cfg.beam_width = trial.suggest_int("beam_width", 4, 64)
    cfg.word_window = trial.suggest_int("word_window", 16, 64)
    cfg.use_ngram = True
    cfg.parallel_chunk_word_len = trial.suggest_int("parallel_chunk_word_len", 50, 300)
    cfg.peak_prob = trial.suggest_float("peak_prob", 0.9, 1.0)
    return cfg


def beamsearch_objective(
    trial,
    cfg,
    speaker_beam_search_decoder,
    loaded_kenlm_model,
    div_trans_info_dict,
    org_trans_info_dict,
    source_info_dict,
    reference_info_dict,
):
    with tempfile.TemporaryDirectory(dir=cfg.temp_out_dir, prefix="GenSEC_") as loca_temp_out_dir:
        start_time2 = time.time()
        cfg = optuna_suggest_params(cfg, trial)
        trans_info_dict = run_mp_beam_search_decoding(speaker_beam_search_decoder,
                                                      loaded_kenlm_model=loaded_kenlm_model,
                                                      div_trans_info_dict=div_trans_info_dict,
                                                      org_trans_info_dict=org_trans_info_dict,
                                                      div_mp=True,
                                                      win_len=cfg.parallel_chunk_word_len,
                                                      word_window=cfg.word_window,
                                                      port=cfg.port,
                                                      use_ngram=cfg.use_ngram,
                                                      )
        hypothesis_sessions_dict = convert_nemo_json_to_seglst(trans_info_dict)
        cpwer = evaluate(cfg, loca_temp_out_dir, cfg.workspace_dir, cfg.asrdiar_file_name, source_info_dict, hypothesis_sessions_dict, reference_info_dict)
        logging.info(f"Beam Search time taken for trial {trial}: {(time.time() - start_time2)/60:.2f} mins")
        logging.info(f"Trial: {trial.number}")
        logging.info(f"[ cpWER={cpwer:.4f} ]")
        logging.info("-----------------------------------------------")

    return cpwer


def optuna_hyper_optim(
    cfg,
    speaker_beam_search_decoder,
    loaded_kenlm_model,
    div_trans_info_dict,
    org_trans_info_dict,
    source_info_dict,
    reference_info_dict,
):
    """
    Optuna hyper-parameter optimization function.

    Parameters:
        cfg (dict): A dictionary containing the configuration parameters.

    """
    worker_function = lambda trial: beamsearch_objective(  # noqa: E731
        trial=trial,
        cfg=cfg,
        speaker_beam_search_decoder=speaker_beam_search_decoder,
        loaded_kenlm_model=loaded_kenlm_model,
        div_trans_info_dict=div_trans_info_dict,
        org_trans_info_dict=org_trans_info_dict,
        source_info_dict=source_info_dict,
        reference_info_dict=reference_info_dict,
    )
    study = optuna.create_study(
        direction="minimize",
        study_name=cfg.optuna_study_name,
        storage=cfg.storage,
        load_if_exists=True,
    )
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)  # Setup the root logger.
    if cfg.output_log_file is not None:
        logger.addHandler(logging.FileHandler(cfg.output_log_file, mode="a"))
    logger.addHandler(logging.StreamHandler())
    optuna.logging.enable_propagation()  # Propagate logs to the root logger.
    study.optimize(worker_function, n_trials=cfg.optuna_n_trials)
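After a study has accumulated trials (see `run_optuna_hyper_optim.sh` below for the study name and SQLite storage settings), the best hyper-parameters can be read back from the storage. A minimal sketch, assuming the default names from that script and that it is run from the repository root:

```python
import optuna

# Names mirror OPTUNA_STUDY_NAME and STORAGE_PATH in run_optuna_hyper_optim.sh.
study = optuna.load_study(
    study_name="speaker_beam_search_err_dev",
    storage="sqlite:///SLT-Task2-Post-ASR-Speaker-Tagging/log_outputs/speaker_beam_search_err_dev.db",
)
print("Best cpWER:", study.best_value)
print("Best parameters:", study.best_params)
```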
images/bsd_equation.png
ADDED
images/bsd_example_pic.png
ADDED
images/overall_dataflow.png
ADDED
images/two_realms.png
ADDED
images/word_level_spk_prob.png
ADDED
requirements.txt
CHANGED
@@ -6,4 +6,5 @@ meeteval
 tqdm
 requests
 simplejson
+optuna
 pydiardecode @ git+https://github.com/tango4j/pydiardecode@main
run_optuna_hyper_optim.sh
ADDED
@@ -0,0 +1,68 @@
### Speaker Tagging Task-2 Parameters
BASEPATH=${PWD}

# OPTUNA TRIALS
OPTUNA_N_TRIALS=999999999

DIAR_LM_PATH=$BASEPATH/arpa_model/4gram_small.arpa
ASRDIAR_FILE_NAME=err_dev
OPTUNA_STUDY_NAME=speaker_beam_search_${ASRDIAR_FILE_NAME}
WORKSPACE=$BASEPATH/SLT-Task2-Post-ASR-Speaker-Tagging
INPUT_ERROR_SRC_LIST_PATH=$BASEPATH/$ASRDIAR_FILE_NAME.src.list
GROUNDTRUTH_REF_LIST_PATH=$BASEPATH/$ASRDIAR_FILE_NAME.ref.list
DIAR_OUT_DOWNLOAD=$WORKSPACE/$ASRDIAR_FILE_NAME
TEMP_OUT_DIR=$WORKSPACE/temp_out_dir
OPTUNA_OUTPUT_LOG_FOLDER=$WORKSPACE/log_outputs
OPTUNA_OUTPUT_LOG_FILE=$OPTUNA_OUTPUT_LOG_FOLDER/${OPTUNA_STUDY_NAME}.log
STORAGE_PATH="sqlite:///$WORKSPACE/log_outputs/${OPTUNA_STUDY_NAME}.db"

mkdir -p $DIAR_OUT_DOWNLOAD
mkdir -p $TEMP_OUT_DIR
mkdir -p $OPTUNA_OUTPUT_LOG_FOLDER


### SLT 2024 Speaker Tagging Setting v1.0.2
ALPHA=0.4
BETA=0.04
PARALLEL_CHUNK_WORD_LEN=100
BEAM_WIDTH=8
WORD_WINDOW=32
PEAK_PROB=0.95
USE_NGRAM=True
LM_METHOD=ngram

# Get the base name of the test_manifest and remove extension
UNIQ_MEMO=$(basename "${INPUT_ERROR_SRC_LIST_PATH}" .json | sed 's/\./_/g')
echo "UNIQ MEMO:" $UNIQ_MEMO
TRIAL=telephonic
BATCH_SIZE=11


rm $WORKSPACE/$ASRDIAR_FILE_NAME.src.seglst.json
rm $WORKSPACE/$ASRDIAR_FILE_NAME.ref.seglst.json
rm $WORKSPACE/$ASRDIAR_FILE_NAME.hyp.seglst.json


python $BASEPATH/speaker_tagging_beamsearch.py \
    port=[5501,5502,5511,5512,5521,5522,5531,5532] \
    arpa_language_model=$DIAR_LM_PATH \
    batch_size=$BATCH_SIZE \
    groundtruth_ref_list_path=$GROUNDTRUTH_REF_LIST_PATH \
    input_error_src_list_path=$INPUT_ERROR_SRC_LIST_PATH \
    parallel_chunk_word_len=$PARALLEL_CHUNK_WORD_LEN \
    use_ngram=$USE_NGRAM \
    alpha=$ALPHA \
    beta=$BETA \
    beam_width=$BEAM_WIDTH \
    word_window=$WORD_WINDOW \
    peak_prob=$PEAK_PROB \
    out_dir=$DIAR_OUT_DOWNLOAD \
    hyper_params_optim=true \
    optuna_n_trials=$OPTUNA_N_TRIALS \
    workspace_dir=$WORKSPACE \
    asrdiar_file_name=$ASRDIAR_FILE_NAME \
    storage=$STORAGE_PATH \
    optuna_study_name=$OPTUNA_STUDY_NAME \
    temp_out_dir=$TEMP_OUT_DIR \
    output_log_file=$OPTUNA_OUTPUT_LOG_FILE || exit 1
run_speaker_tagging_beam_search.sh
CHANGED
@@ -5,10 +5,11 @@
 BASEPATH=${PWD}
 DIAR_LM_PATH=$BASEPATH/arpa_model/4gram_small.arpa
 ASRDIAR_FILE_NAME=err_dev
+OPTUNA_STUDY_NAME=speaker_beam_search_${ASRDIAR_FILE_NAME}
 WORKSPACE=$BASEPATH/SLT-Task2-Post-ASR-Speaker-Tagging
 INPUT_ERROR_SRC_LIST_PATH=$BASEPATH/$ASRDIAR_FILE_NAME.src.list
 GROUNDTRUTH_REF_LIST_PATH=$BASEPATH/$ASRDIAR_FILE_NAME.ref.list
-DIAR_OUT_DOWNLOAD=$WORKSPACE
+DIAR_OUT_DOWNLOAD=$WORKSPACE/$ASRDIAR_FILE_NAME
 mkdir -p $DIAR_OUT_DOWNLOAD

@@ -35,6 +36,7 @@ rm $WORKSPACE/$ASRDIAR_FILE_NAME.hyp.seglst.json

 python $BASEPATH/speaker_tagging_beamsearch.py \
+    hyper_params_optim=false \
     port=[5501,5502,5511,5512,5521,5522,5531,5532] \
     arpa_language_model=$DIAR_LM_PATH \
     batch_size=$BATCH_SIZE \
@@ -47,7 +49,7 @@ python $BASEPATH/speaker_tagging_beamsearch.py \
     beam_width=$BEAM_WIDTH \
     word_window=$WORD_WINDOW \
     peak_prob=$PEAK_PROB \
-
+

 echo "Evaluating the original source transcript."
speaker_tagging_beamsearch.py
CHANGED
@@ -11,11 +11,13 @@ from beam_search_utils import (
     convert_nemo_json_to_seglst,
 )
 from hydra.core.config_store import ConfigStore
+from hyper_optim import optuna_hyper_optim
+

-__INFO_TAG__ = "[INFO]"

 @dataclass
 class RealigningLanguageModelParameters:
+    # Beam search parameters
     batch_size: int = 32
     use_mp: bool = True
     input_error_src_list_path: Optional[str] = None
@@ -31,46 +33,72 @@ class RealigningLanguageModelParameters:
     beam_width: int = 16
     out_dir: Optional[str] = None

+    # Optuna parameters
+    hyper_params_optim: bool = False
+    optuna_n_trials: int = 200
+    workspace_dir: Optional[str] = None
+    asrdiar_file_name: Optional[str] = None
+    storage: Optional[str] = "sqlite:///optuna-speaker-beam-search.db"
+    optuna_study_name: Optional[str] = "speaker_beam_search"
+    output_log_file: Optional[str] = None
+    temp_out_dir: Optional[str] = None
+
 cs = ConfigStore.instance()
 cs.store(name="config", node=RealigningLanguageModelParameters)

 @hydra.main(config_name="config", version_base="1.1")
 def main(cfg: RealigningLanguageModelParameters) -> None:
+    __INFO_TAG__ = "[INFO]"
     trans_info_dict = load_input_jsons(input_error_src_list_path=cfg.input_error_src_list_path, peak_prob=float(cfg.peak_prob))
     reference_info_dict = load_reference_jsons(reference_seglst_list_path=cfg.groundtruth_ref_list_path)
     source_info_dict = load_reference_jsons(reference_seglst_list_path=cfg.input_error_src_list_path)
+
+    # Load ARPA language model in advance
     loaded_kenlm_model = kenlm.Model(cfg.arpa_language_model)
-
     speaker_beam_search_decoder = SpeakerTaggingBeamSearchDecoder(loaded_kenlm_model=loaded_kenlm_model, cfg=cfg)

     div_trans_info_dict = speaker_beam_search_decoder.divide_chunks(trans_info_dict=trans_info_dict,
                                                                     win_len=cfg.parallel_chunk_word_len,
                                                                     word_window=cfg.word_window,
                                                                     port=cfg.port,)
-            use_ngram=cfg.use_ngram,
-            )
-    hypothesis_sessions_dict = convert_nemo_json_to_seglst(trans_info_dict)
-
-    write_seglst_jsons(hypothesis_sessions_dict, input_error_src_list_path=cfg.input_error_src_list_path, diar_out_path=cfg.out_dir, ext_str='hyp')
-    write_seglst_jsons(reference_info_dict, input_error_src_list_path=cfg.groundtruth_ref_list_path, diar_out_path=cfg.out_dir, ext_str='ref')
-    write_seglst_jsons(source_info_dict, input_error_src_list_path=cfg.groundtruth_ref_list_path, diar_out_path=cfg.out_dir, ext_str='src')
-    print(f"{__INFO_TAG__} Parameters used: \
-        \n ALPHA: {cfg.alpha} \
-        \n BETA: {cfg.beta} \
-        \n BEAM WIDTH: {cfg.beam_width} \
-        \n Word Window: {cfg.word_window} \
-        \n Use Ngram: {cfg.use_ngram} \
-        \n Chunk Word Len: {cfg.parallel_chunk_word_len} \
-        \n SpeakerLM Model: {cfg.arpa_language_model}") \
+
+    if cfg.hyper_params_optim:
+        print(f"{__INFO_TAG__} Optimizing hyper-parameters...")
+        cfg = optuna_hyper_optim(cfg=cfg,
+                                 speaker_beam_search_decoder=speaker_beam_search_decoder,
+                                 loaded_kenlm_model=loaded_kenlm_model,
+                                 div_trans_info_dict=div_trans_info_dict,
+                                 org_trans_info_dict=trans_info_dict,
+                                 source_info_dict=source_info_dict,
+                                 reference_info_dict=reference_info_dict,
+                                 )
+
+        __INFO_TAG__ = f"{__INFO_TAG__} Optimized hyper-parameters - "
+    else:
+        trans_info_dict = run_mp_beam_search_decoding(speaker_beam_search_decoder,
+                                                      loaded_kenlm_model=loaded_kenlm_model,
+                                                      div_trans_info_dict=div_trans_info_dict,
+                                                      org_trans_info_dict=trans_info_dict,
+                                                      div_mp=True,
+                                                      win_len=cfg.parallel_chunk_word_len,
+                                                      word_window=cfg.word_window,
+                                                      port=cfg.port,
+                                                      use_ngram=cfg.use_ngram,
+                                                      )
+    hypothesis_sessions_dict = convert_nemo_json_to_seglst(trans_info_dict)
+
+    write_seglst_jsons(hypothesis_sessions_dict, input_error_src_list_path=cfg.input_error_src_list_path, diar_out_path=cfg.out_dir, ext_str='hyp')
+    write_seglst_jsons(reference_info_dict, input_error_src_list_path=cfg.groundtruth_ref_list_path, diar_out_path=cfg.out_dir, ext_str='ref')
+    write_seglst_jsons(source_info_dict, input_error_src_list_path=cfg.groundtruth_ref_list_path, diar_out_path=cfg.out_dir, ext_str='src')

+    print(f"{__INFO_TAG__} Parameters used: \
+        \n ALPHA: {cfg.alpha} \
+        \n BETA: {cfg.beta} \
+        \n BEAM WIDTH: {cfg.beam_width} \
+        \n Word Window: {cfg.word_window} \
+        \n Use Ngram: {cfg.use_ngram} \
+        \n Chunk Word Len: {cfg.parallel_chunk_word_len} \
+        \n SpeakerLM Model: {cfg.arpa_language_model}")

 if __name__ == '__main__':
     main()