---
base_model: FacebookAI/roberta-base
datasets:
- SynthSTEL/styledistance_training_triplets
- StyleDistance/synthstel
language:
- en
library_name: sentence-transformers
license: mit
pipeline_tag: sentence-similarity
tags:
- datadreamer
- datadreamer-0.35.0
- synthetic
- sentence-transformers
- feature-extraction
- sentence-similarity
widget:
- example_title: Example 1
  source_sentence: Did you hear about the Wales wing? He'll h8 2 withdraw due 2 injuries
    from future competitions.
  sentences:
  - We're raising funds 2 improve our school's storage facilities and add new playground
    equipment!
  - Did you hear about the Wales wing? He'll hate to withdraw due to injuries from
    future competitions.
- example_title: Example 2
  source_sentence: You planned the DesignMeets Decades of Design event; you executed
    it perfectly.
  sentences:
  - We'll find it hard to prove the thief didn't face a real threat!
  - You orchestrated the DesignMeets Decades of Design gathering; you actualized it
    flawlessly.
- example_title: Example 3
  source_sentence: Did the William Barr maintain a commitment to allow Robert Mueller
    to finish the inquiry?
  sentences:
  - Will the artist be compiling a music album, or will there be a different focus
    in the future?
  - Did William Barr maintain commitment to allow Robert Mueller to finish inquiry?
---

# Model Card

This repository contains the model introduced in [StyleDistance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples](https://huggingface.co/papers/2410.12757).

StyleDistance is a **style embedding model** that aims to embed texts with similar writing styles close together and texts with different styles far apart, regardless of content. You may find this model useful for stylistic analysis of text, clustering, authorship identification and verification tasks, and automatic style transfer evaluation.

## Training Data and Variants of StyleDistance

StyleDistance was contrastively trained on [SynthSTEL](https://huggingface.co/datasets/StyleDistance/synthstel), a synthetically generated dataset of positive and negative examples of 40 style features being used in text. By using this synthetic dataset, StyleDistance achieves stronger content-independence than other style embedding models currently available. This particular model was trained on a combination of the synthetic dataset and a [real dataset that makes use of authorship datasets from Reddit to train style embeddings](https://aclanthology.org/2022.repl4nlp-1.26/). For a version trained purely on synthetic data, see the [synthetic-only variant of StyleDistance](https://huggingface.co/StyleDistance/styledistance_synthetic_only).
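
To make "contrastively trained" concrete, the sketch below shows a generic triplet margin loss over cosine distance: an anchor text should end up closer to a same-style positive than to a different-style negative. This is an illustration only. The loss form, the margin value, and the names `cosine_distance` and `triplet_margin_loss` are assumptions, not the objective actually used to train StyleDistance; see the paper for the real setup.

```python3
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Generic triplet loss (assumed form, hypothetical margin).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`; otherwise it grows with the violation.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D vectors standing in for embeddings of three texts.
anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.0])   # same direction as the anchor (same style)
negative = np.array([0.0, 1.0])   # orthogonal to the anchor (different style)

loss_good = triplet_margin_loss(anchor, positive, negative)  # well-separated triplet
loss_bad = triplet_margin_loss(anchor, negative, positive)   # roles swapped
print(loss_good, loss_bad)
```

In this toy case the well-separated triplet incurs no loss, while swapping the positive and negative produces a large one, which is the pressure that pulls same-style texts together during training.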

## Example Usage

```python3
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('StyleDistance/styledistance') # Load model

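# Embed a source text written in a texting style ("h8", "2").
source = model.encode("Did you hear about the Wales wing? He'll h8 2 withdraw due 2 injuries from future competitions.")
# Embed two candidates: different content in a similar texting style,
# and the same content rewritten in a standard style.
others = model.encode(["We're raising funds 2 improve our school's storage facilities and add new playground equipment!", "Did you hear about the Wales wing? He'll hate to withdraw due to injuries from future competitions."])
# Cosine similarity between the source and each candidate.
print(cos_sim(source, others))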
```
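
For tasks such as authorship verification or style-based retrieval, you will usually want to rank several candidates against a source text rather than print a raw similarity matrix. The following is a minimal sketch of that step, assuming embeddings are available as NumPy arrays. The helper names `cosine_similarities` and `rank_by_style` are hypothetical and not part of `sentence-transformers`, and the toy vectors stand in for real embeddings.

```python3
import numpy as np


def cosine_similarities(source, candidates):
    """Cosine similarity between one vector and each row of a 2-D array."""
    source = np.asarray(source, dtype=float)
    candidates = np.atleast_2d(np.asarray(candidates, dtype=float))
    source = source / np.linalg.norm(source)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ source


def rank_by_style(source, candidates):
    """Return (candidate_index, similarity) pairs, most similar first."""
    sims = cosine_similarities(source, candidates)
    order = np.argsort(-sims)  # descending similarity
    return [(int(i), float(sims[i])) for i in order]


# Toy 3-D vectors used purely for illustration.
source_vec = np.array([1.0, 0.0, 0.0])
candidate_vecs = np.array([
    [0.0, 1.0, 0.0],  # orthogonal to the source
    [2.0, 0.0, 0.0],  # same direction, different length
    [1.0, 1.0, 0.0],  # 45 degrees from the source
])

ranking = rank_by_style(source_vec, candidate_vecs)
for index, similarity in ranking:
    print(index, round(similarity, 3))
```

With the real model, pass the outputs of `model.encode(...)` in place of the toy vectors; by default `encode` returns NumPy arrays, so no conversion should be needed.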

---
## Citation

```latex
@misc{patel2025styledistancestrongercontentindependentstyle,
      title={StyleDistance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples}, 
      author={Ajay Patel and Jiacheng Zhu and Justin Qiu and Zachary Horvitz and Marianna Apidianaki and Kathleen McKeown and Chris Callison-Burch},
      year={2025},
      eprint={2410.12757},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.12757}, 
}
```

---
## Trained with DataDreamer

This model was trained on a synthetic dataset using [DataDreamer 🤖💤](https://datadreamer.dev). The synthetic dataset card and model card can be found [here](datadreamer.json). The training arguments can be found [here](training_args.json).

---
#### Funding Acknowledgements

<small> This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. </small>