---
language:
- en
license: apache-2.0
tags:
- NLP
pipeline_tag: summarization
---

# Topic Change Point Detection Model

## Model Details

- **Model Name:** Falconsai/topic_change_point
- **Model Type:** Fine-tuned `google/t5-small`
- **Language:** English
- **License:** Apache 2.0

## Overview

The Topic Change Point Detection model identifies topics and tracks how they change within a block of text. It is based on the `google/t5-small` model, fine-tuned on a custom dataset that maps texts to their respective topic changes. It can be used to analyze and categorize texts according to their topics and the transitions between them.

### Model Architecture

The base architecture is T5 (Text-To-Text Transfer Transformer), which treats every NLP problem as a text-to-text problem. The specific version used here is `google/t5-small`, fine-tuned to identify topics and predict the points at which they change.
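
For intuition, the text-to-text framing means both the input and the prediction are plain strings. The exact target format used in fine-tuning is not published; the strings below are purely hypothetical:

```python
# Purely illustrative: in T5's text-to-text framing, the input and the
# output are both strings. This target format is an assumption, not the
# model's documented output format.
input_text = "We went over the quarterly budget. Later, talk shifted to hiring plans."
target_text = "budget -> hiring"
```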

### Fine-Tuning Data

The model was fine-tuned on a dataset of texts and their corresponding topic changes. The dataset should be formatted as a file with two columns: `text` and `topic_changes`.
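
As a rough sketch, such a two-column file could be loaded with the `datasets` library. The file name `topic_changes.csv` is a hypothetical placeholder; the actual training file is not published:

```python
# A minimal loading sketch, assuming a CSV with "text" and
# "topic_changes" columns; the file name is hypothetical.
from datasets import load_dataset

dataset = load_dataset("csv", data_files="topic_changes.csv")

# Each record pairs an input text with its annotated topic changes.
example = dataset["train"][0]
print(example["text"])
print(example["topic_changes"])
```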

### Intended Use

The model is intended for identifying topics and detecting changes in topic across a block of text. It can be useful in a variety of fields: psychology/psychiatry for session assessment (the initial use case), content analysis, document insights, conversational analysis, and other areas where understanding the flow of topics is important.

## How to Use

### Inference

To use this model for inference, load the fine-tuned model and tokenizer with the `transformers` library:

#### Running a Pipeline

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

text_block = 'Your block of text here.'
pipe = pipeline("summarization", model="Falconsai/topic_change_point")
result = pipe(text_block, max_length=1024, min_length=512, do_sample=False)
print(result)
```

#### Running on CPU

```python
# Load the model and tokenizer directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Falconsai/topic_change_point")
model = AutoModelForSeq2SeqLM.from_pretrained("Falconsai/topic_change_point")

input_text = 'Your block of text here.'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Raise max_new_tokens if the default generation length truncates output
outputs = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

#### Running on GPU

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Falconsai/topic_change_point")
model = AutoModelForSeq2SeqLM.from_pretrained("Falconsai/topic_change_point", device_map="auto")

input_text = 'Your block of text here.'
# Move the inputs to the same device as the model
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training

The training process involves the following steps (a minimal fine-tuning sketch follows the list):

1. **Load and Explore Data:** Load the dataset and perform initial exploration to understand the data distribution.
2. **Preprocess Data:** Tokenize the text blocks and prepare them for the T5 model.
3. **Fine-Tune Model:** Fine-tune the `google/t5-small` model using the preprocessed data.
4. **Evaluate Model:** Evaluate the model's performance on a validation set to ensure it is learning correctly.
5. **Save Model:** Save the fine-tuned model for future use.
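
The actual training script is not published; the following is a hedged sketch of steps 1-5 using `Seq2SeqTrainer`. The file name, split ratio, and hyperparameters are illustrative assumptions:

```python
# A minimal fine-tuning sketch, not the exact training recipe.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# "t5-small" is the Hub alias for google/t5-small
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Step 1: load the two-column dataset (file name is hypothetical)
dataset = load_dataset("csv", data_files="topic_changes.csv")["train"]
dataset = dataset.train_test_split(test_size=0.1)

# Step 2: tokenize inputs and targets
def preprocess(batch):
    model_inputs = tokenizer(batch["text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["topic_changes"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=["text", "topic_changes"])

# Steps 3-4: fine-tune and evaluate on the held-out split
trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="topic_change_point", num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# Step 5: save the fine-tuned model
trainer.save_model("topic_change_point")
```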

## Evaluation

The model's performance should be evaluated on a separate validation set to confirm that it accurately predicts topic change points. Metrics such as accuracy, precision, recall, and F1 score can be used to assess performance.
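
Because the model emits free-form text, one hedged way to score it is set overlap between predicted and reference topic labels. The comma-separated format below is an assumption about how outputs might be structured:

```python
# Illustrative set-overlap scoring; assumes predictions and references
# are comma-separated topic strings, which is an assumed format.
def precision_recall_f1(predicted: str, reference: str) -> tuple[float, float, float]:
    pred = {t.strip().lower() for t in predicted.split(",") if t.strip()}
    ref = {t.strip().lower() for t in reference.split(",") if t.strip()}
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1("work stress, family", "family, finances"))  # (0.5, 0.5, 0.5)
```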

## Limitations

- **Data Dependency:** The model's performance is highly dependent on the quality and representativeness of the training data.
- **Generalization:** The model may not generalize well to texts that differ significantly from the training data.

## Ethical Considerations

When deploying the model, be mindful of the ethical implications, including but not limited to:

- **Privacy:** Ensure that text data used for training and inference does not contain sensitive or personally identifiable information.
- **Bias:** Be aware of potential biases in the training data that could affect the model's predictions.

## License

This project is licensed under the Apache License 2.0. See the [LICENSE](LICENSE) file for details.

## Citation

If you use this model in your research, please cite it as follows:

```bibtex
@misc{topic_change_point,
  author    = {Michael Stattelman},
  title     = {Topic Change Point Detection},
  year      = {2024},
  publisher = {Falcons.ai},
}
```

---