# prodigy-ecfr-textcat

## About the Project

This project organizes rules and regulations for financial institutions so that an institution can review newly issued rules, determine which departments should receive them, and retrieve them easily when needed. Text mining and information retrieval allow a large part of this process to be automated, which reduces the time and effort required from financial institution employees and frees them to work on other projects.

## Table of Contents

- [About the Project](#about-the-project)
- [Getting Started](#getting-started)
  - [Prerequisites](#prerequisites)
  - [Installation](#installation)
- [Usage](#usage)
- [File Structure](#file-structure)
- [License](#license)
- [Acknowledgements](#acknowledgements)

## Getting Started

Instructions on setting up the project on a local machine.

### Prerequisites

Before running the project, ensure you have the following software dependencies installed:
- [Python 3.x](https://www.python.org/downloads/)
- [spaCy](https://spacy.io/usage)
- [Prodigy](https://prodi.gy/docs/) (optional)

### Installation

Follow these step-by-step instructions to install and configure the project:

1. **Clone this repository to your local machine:**
   ```bash
   git clone https://github.com/ManjinderUNCC/prodigy-ecfr-textcat.git
   cd prodigy-ecfr-textcat
   ```
2. **Install the required dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

## Usage

To use the project, follow these steps:

1. **Prepare your data:**
   - Place your dataset files in the `/data` directory.
   - Optionally, annotate your data using Prodigy and save the annotations in the `/data` directory.

2. **Train the text classification model:**
   - Run the training script located in the `/python_Code` directory.

3. **Evaluate the model:**
   - Use the evaluation script to assess the model's performance on labeled data.

4. **Make predictions:**
   - Apply the trained model to new, unlabeled data to classify it into relevant categories.
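
As a rough illustration of the prediction step, the sketch below loads a trained pipeline with spaCy and keeps the labels whose score passes a threshold. It assumes the pipeline in `output/experiment1/model-best` is the one you want to use; the helper names `labels_above` and `classify`, the 0.5 threshold, and the sample sentence are hypothetical and are not taken from the project's scripts.

```python
from pathlib import Path


def labels_above(cats, threshold=0.5):
    """Return the labels whose score meets the threshold, highest score first."""
    ranked = sorted(cats.items(), key=lambda item: item[1], reverse=True)
    return [label for label, score in ranked if score >= threshold]


def classify(texts, model_dir="output/experiment1/model-best", threshold=0.5):
    """Load a trained spaCy pipeline and yield (text, labels) pairs.

    The default model_dir and threshold are assumptions for illustration.
    """
    import spacy  # imported lazily so labels_above works without spaCy installed

    nlp = spacy.load(Path(model_dir))
    for doc in nlp.pipe(texts):
        # doc.cats maps each category label to a score between 0 and 1
        yield doc.text, labels_above(doc.cats, threshold)


if __name__ == "__main__":
    # Hypothetical input sentence, not taken from the project's data
    sample = ["Each bank shall maintain records of all wire transfers."]
    for text, labels in classify(sample):
        print(labels, text)
```

Because the model is multilabel, a single text can receive several labels or none, so the threshold is worth tuning against the evaluation data rather than leaving it at 0.5.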


## File Structure

The repository is organized as follows:

- `/corpus`
  - `/labels`
    - `ner.json`
    - `parser.json`
    - `tagger.json`
    - `textcat_multilabel.json`
- `/data`
  - `eval.jsonl`
  - `firstStep_file.jsonl`
  - `five_examples_annotated5.jsonl`
  - `goldenEval.jsonl`
  - `thirdStep_file.jsonl`
  - `train.jsonl`
  - `train200.jsonl`
  - `train4465.jsonl`
- `/my_trained_model`
  - `/textcat_multilabel`
    - `cfg`
    - `model`
  - `/vocab`
    - `key2row`
    - `lookups.bin`
    - `strings.json`
    - `vectors`
    - `vectors.cfg`
  - `config.cfg`
  - `meta.json`
  - `tokenizer`
- `/output`
  - `/experiment1`
    - `/model-best`
      - `/textcat_multilabel`
        - `cfg`
        - `model`
      - `/vocab`
        - `key2row`
        - `lookups.bin`
        - `strings.json`
        - `vectors`
        - `vectors.cfg`
      - `config.cfg`
      - `meta.json`
      - `tokenizer`
    - `/model-last`
      - `/textcat_multilabel`
        - `cfg`
        - `model`
      - `/vocab`
        - `key2row`
        - `lookups.bin`
        - `strings.json`
        - `vectors`
        - `vectors.cfg`
      - `config.cfg`
      - `meta.json`
      - `tokenizer`
  - `/experiment3`
    - `/model-best`
      - `/textcat_multilabel`
        - `cfg`
        - `model`
      - `/vocab`
        - `key2row`
        - `lookups.bin`
        - `strings.json`
        - `vectors`
        - `vectors.cfg`
      - `config.cfg`
      - `meta.json`
      - `tokenizer`
    - `/model-last`
      - `/textcat_multilabel`
        - `cfg`
        - `model`
      - `/vocab`
        - `key2row`
        - `lookups.bin`
        - `strings.json`
        - `vectors`
        - `vectors.cfg`
      - `config.cfg`
      - `meta.json`
      - `tokenizer`
- `/python_Code`
  - `finalStep-formatLabel.py`
  - `firstStep-format.py`
  - `five_examples_annotated.ipynb`
  - `secondStep-score.py`
  - `thirdStep-label.py`
  - `train_eval_split.ipynb`
- `TerminalCode.txt`
- `requirements.txt`
- `Terminal Commands vs Project.yml`
- `Project.yml`
- `README.md`
- `prodigy.json`
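
The `/data` directory listed above contains both `train.jsonl` and `eval.jsonl`. A split of that kind could be produced along the lines of the sketch below, which uses only the Python standard library. The 80/20 ratio, the fixed seed, and the function names are assumptions for illustration; this is not the code in `train_eval_split.ipynb`.

```python
import json
import random


def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def write_jsonl(path, records):
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def split_records(records, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the records and return (train, eval) lists.

    eval_fraction and seed are assumed values, not the project's settings.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]
```

For example, `train, held_out = split_records(read_jsonl("data/train4465.jsonl"))` followed by two `write_jsonl` calls would write the two subsets, with the fixed seed keeping the split reproducible across runs.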

## License

- Package A: MIT License
- Package B: Apache License 2.0

## Acknowledgements

Manjinder Sandhu, Dagim Bantikassegn, Alex Brooks, Tyler Dabbs