---
language: en
license: apache-2.0
tags:
- text-classification
- question-classification
- LoRA
- quantization
datasets:
- squad
- glue
model_name: question-classification-lora-quant
base_model: google/gemma-2b-it
widget:
- text: "What is the capital of France?"
- text: "This is a beautiful day."
metrics:
- accuracy
- f1
- precision
- recall
---

# Model Card: Question Classification using LoRA with Quantization

## Model Overview

This model is a fine-tuned version of [google/gemma-2b-it](https://huggingface.co/google/gemma-2b-it) designed to classify text into two categories: **QUESTION** or **NOT_QUESTION**. It was fine-tuned on a custom dataset that combines the **SQuAD** dataset (containing questions) and the **GLUE SST-2** dataset (containing general non-question sentences).

### Model Architecture

- Base Model: `google/gemma-2b-it`
- Fine-tuning Method: LoRA (Low-Rank Adaptation) on top of 4-bit NF4 quantization
- Configurations:
  - Quantization: 4-bit quantization using `BitsAndBytesConfig`
  - Adapter (LoRA) settings:
    - Rank: 64
    - LoRA Alpha: 32
    - Dropout: 0.05
    - Target Modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`
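
The configuration above can be sketched with the `transformers` and `peft` libraries. This is a minimal illustration of the stated settings, not the exact training script used for this model:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization, as described in the architecture section.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter settings matching the values listed above.
lora_config = LoraConfig(
    r=64,                     # rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="SEQ_CLS",      # sequence classification
)
```

Both objects would then be passed to `from_pretrained(..., quantization_config=bnb_config)` and `get_peft_model(model, lora_config)` respectively.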

## Dataset

The model was trained using a combination of two datasets:
- **SQuAD v1.1** (Question dataset)
- **GLUE SST-2** (Non-question dataset)

Each dataset was preprocessed to contain a label:
- **QUESTION**: For SQuAD questions
- **NOT_QUESTION**: For non-question sentences from GLUE SST-2.

### Data Preprocessing
- With probability `P_remove = 0.3`, the question mark (`?`) was stripped from a randomly chosen subset of questions, so the model does not learn to rely on punctuation alone.
- Both datasets were balanced with an equal number of samples (`N=100` for training and testing).

## Model Performance

- **Metrics Evaluated**:
  - Accuracy
  - F1 Score
  - Precision
  - Recall
- These metrics were computed on a balanced test dataset containing both question and non-question examples.
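
For a binary task like this one, the four metrics reduce to simple counts over the confusion matrix. A self-contained sketch with hypothetical predictions (`1` = QUESTION, `0` = NOT_QUESTION):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical test labels and predictions, for illustration only.
m = binary_metrics([1, 1, 0, 0], [1, 0, 0, 1])
```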

## How to Use

You can use this model to classify whether a given text is a question or not. Here’s how you can use it:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("your_model_name")
model = AutoModelForSequenceClassification.from_pretrained("your_model_name")

inputs = tokenizer("What is the capital of France?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
prediction = torch.argmax(outputs.logits, dim=-1).item()

label = "QUESTION" if prediction == 1 else "NOT_QUESTION"
print(f"Predicted Label: {label}")
```

## Limitations

- The model was trained on English data only, so it may not perform well on non-English languages.
- Since it is fine-tuned on specific datasets (SQuAD and GLUE SST-2), performance may vary with out-of-domain data.
- The model assumes well-formed input sentences, so performance may degrade with informal or very short text.

## Intended Use

This model is intended for text classification tasks where distinguishing between questions and non-questions is needed. Potential use cases include:
- Improving chatbot or virtual assistant interactions.
- Enhancing query detection for search engines.

## License

This model follows the same license as [google/gemma-2b-it](https://huggingface.co/google/gemma-2b-it). Please refer to the original license for any usage restrictions.