Audio Classification
Transformers
Safetensors
Slovenian
Croatian
Serbian
wav2vec2-bert
audio-frame-classification
5roop committed
Commit c9ad0c5 · verified · 1 Parent(s): cbd3ea2

Update README.md

Files changed (1)
  1. README.md +96 -121
README.md CHANGED
@@ -15,11 +15,9 @@ base_model:
- facebook/w2v-bert-2.0
---

- # Model Card for Model ID
-
- <!-- Provide a quick summary of what the model is/does. -->
-

## Model Details

@@ -38,37 +36,98 @@ This is the model card of a 🤗 transformers model that has been pushed on the

<!-- Provide the basic links for the model. -->

- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses

- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations

- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]

### Recommendations

@@ -101,13 +160,11 @@ Use the code below to get started with the model.

#### Training Hyperparameters

- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]

## Evaluation

@@ -115,94 +172,12 @@ Use the code below to get started with the model.

### Testing Data, Factors & Metrics

- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]

#### Summary


- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]

- ## Model Card Contact

- [More Information Needed]
 
- facebook/w2v-bert-2.0
---

+ # Model Card

+ This model annotates primary stress in words on 20 ms frames.

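As a quick illustration (not part of the model card itself) of what the 20 ms frame resolution means: the classifier emits one label per 20 ms slice of audio, so frame index i starts at 0.02 · i seconds.

```python
# Illustration only: hypothetical per-frame predictions for a short word.
# Frame i covers [0.02 * i, 0.02 * (i + 1)) seconds at the model's 20 ms frame rate.
frame_labels = [0, 0, 1, 1, 1, 0]
stressed_starts = [round(0.02 * i, 2) for i, label in enumerate(frame_labels) if label == 1]
print(stressed_starts)  # [0.04, 0.06, 0.08] -> the stressed region spans roughly 0.04-0.10 s
```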
## Model Details

 
<!-- Provide the basic links for the model. -->

+ - **Paper [optional]:** Coming soon

### Direct Use

+ The model is intended for data-driven analyses of primary stress position. So far, it has been shown to work on 4 datasets in 3 languages.
+
+ ## Example use
+
+ ```python
+ import numpy as np
+ import pandas as pd
+ import torch
+ from itertools import pairwise
+
+ from datasets import Audio, Dataset
+ from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification
+
+ if torch.cuda.is_available():
+     device = torch.device("cuda")
+ else:
+     device = torch.device("cpu")
+
+ model_name = "5roop/Wav2Vec2BertPrimaryStressAudioFrameClassifier"
+ feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
+ model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device)
+
+ # Path to the audio file containing the word to be annotated:
+ f = "wavs/word.wav"
+
+
+ def frames_to_intervals(frames: list[int]) -> list[tuple[float]]:
+     """Convert per-frame predictions (one label per 20 ms frame) into time intervals."""
+     results = []
+     ndf = pd.DataFrame(
+         data={
+             "time_s": [0.020 * i for i in range(len(frames))],
+             "frames": frames,
+         }
+     )
+     ndf = ndf.dropna()
+     # Indices where the predicted label changes:
+     indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values
+     for si, ei in pairwise(indices_of_change):
+         # Skip unstressed segments; keep stressed ones as (start, end) in seconds:
+         if ndf.loc[si : ei - 1, "frames"].mode()[0] == 0:
+             pass
+         else:
+             results.append(
+                 (round(ndf.loc[si, "time_s"], 3), round(ndf.loc[ei - 1, "time_s"], 3))
+             )
+     if results == []:
+         return None
+     # Post-processing: if multiple regions were returned, only the longest is kept:
+     if len(results) > 1:
+         results = sorted(results, key=lambda t: t[1] - t[0], reverse=True)
+     return results[0:1]
+
+
+ def evaluator(chunks):
+     sampling_rate = chunks["audio"][0]["sampling_rate"]
+     with torch.no_grad():
+         inputs = feature_extractor(
+             [i["array"] for i in chunks["audio"]],
+             return_tensors="pt",
+             sampling_rate=sampling_rate,
+         ).to(device)
+         logits = model(**inputs).logits
+     y_pred_raw = np.array(logits.cpu())
+     # One row of logits per 20 ms frame; argmax gives the per-frame class (1 = primary stress):
+     y_pred = y_pred_raw.argmax(axis=-1)
+     primary_stress = [frames_to_intervals(i) for i in y_pred]
+     return {
+         "y_pred": y_pred,
+         "y_pred_logits": y_pred_raw,
+         "primary_stress": primary_stress,
+     }
+
+
+ # Create a dataset with a single instance and map our evaluator function on it:
+ ds = Dataset.from_dict({"audio": [f]}).cast_column("audio", Audio(16000, mono=True))
+ ds = ds.map(evaluator, batched=True, batch_size=1)  # Adjust batch size according to your hardware specs
+ print(ds["y_pred"][0])
+ # Outputs: [0, 0, 1, 1, 1, 1, 1, ...]
+ print(ds["y_pred_logits"][0])
+ # Outputs:
+ # [[ 0.89419061, -0.77746612],
+ #  [ 0.44213724, -0.34862748],
+ #  [-0.08605709,  0.13012762],
+ #  ....
+ print(ds["primary_stress"][0])
+ # Outputs: [[0.34, 0.4]] (start and end of the primary stress, in seconds)
+
+ ```
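The `evaluator` above can just as well be mapped over several recordings at once; a minimal sketch reusing the same objects (the extra file paths are hypothetical):

```python
# Annotate several words in one pass, reusing `evaluator`, `Dataset` and `Audio` from above.
files = ["wavs/word1.wav", "wavs/word2.wav"]  # placeholder paths
ds = Dataset.from_dict({"audio": files}).cast_column("audio", Audio(16000, mono=True))
ds = ds.map(evaluator, batched=True, batch_size=2)
for path, interval in zip(files, ds["primary_stress"]):
    print(path, interval)  # e.g. wavs/word1.wav [[0.34, 0.4]]
```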

### Recommendations

 
#### Training Hyperparameters

+ - Learning rate: 1e-5
+ - Batch size: 32
+ - Number of epochs: 20
+ - Weight decay: 0.01
+ - Gradient accumulation steps: 1
 
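For readers who want to set up a comparable fine-tuning run, a minimal sketch of how these listed hyperparameters could be expressed with 🤗 `TrainingArguments`; the output directory is an assumed name, not taken from this card:

```python
from transformers import TrainingArguments

# Hypothetical mapping of the hyperparameters listed above onto TrainingArguments.
training_args = TrainingArguments(
    output_dir="wav2vec2-bert-primary-stress",  # assumed name
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    num_train_epochs=20,
    weight_decay=0.01,
    gradient_accumulation_steps=1,
)
```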
 
## Evaluation

 
### Testing Data, Factors & Metrics


#### Summary


+ ## Citation

+ Coming soon