Joshua Lansford committed on
Commit
d348741
·
1 Parent(s): 678563e

Added readme and docstrings.

Files changed (3)
  1. README.md +136 -2
  2. app.py +4 -4
  3. transmorgrify.py +43 -7
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- title: SentanceTransmorgrifier
+ title: Sentance Transmorgrifier
  emoji: s
  colorFrom: yellow
  colorTo: yellow
@@ -10,4 +10,138 @@ pinned: false
  license: apache-2.0
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ## Sentance Transmorgrifier
+
+ # What is the Sentance Transmorgrifier?
+ - The Sentance Transmorgrifier is a generic text-to-text conversion tool which uses a categorical gradient boosting library, [catboost](https://catboost.ai/), as its back end.
+ - This library does not use neural nets or word embeddings; it does the transformation at the character level.
+ - For the Sentance Transmorgrifier to work, there have to be some characters in common between the from and to sides of the conversion.
+ - The model uses a modified form of the [longest common subsequence algorithm](https://en.wikipedia.org/wiki/Longest_common_subsequence_problem) to turn the sentence conversion into a sequence of three types of operations (see the sketch after this list):
+ 1. Match: pass the character from input to output.
+ 2. Drop: remove the incoming character from the input.
+ 3. Insert: generate a character and add it to the output.
+ - The transformation uses a sliding context window of the next n incoming characters, n transformed output characters and n untransformed output characters.
+ - Because the window is sliding, there is no fixed limit on the length of the character sequences which can be transformed.
+
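For intuition, here is a rough sketch of what that operation stream looks like. It uses Python's standard difflib rather than the project's own modified longest-common-subsequence parser, so it is illustrative only:

```python
# Illustrative only: decompose a character-level conversion into
# match/drop/insert operations with difflib (not the project's own parser).
from difflib import SequenceMatcher

def to_operations(source: str, target: str):
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, source, target).get_opcodes():
        if tag == "equal":
            ops += [("match", c) for c in source[i1:i2]]   # pass characters through
        else:
            ops += [("drop", c) for c in source[i1:i2]]    # remove from the input
            ops += [("insert", c) for c in target[j1:j2]]  # generate for the output
    return ops

print(to_operations("night", "nite"))
# [('match', 'n'), ('match', 'i'), ('drop', 'g'), ('drop', 'h'), ('match', 't'), ('insert', 'e')]
```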
+ # How can I use the Sentance Transmorgrifier?
+ - The project has been set up so that it can be used in two different ways.
+
+ ## Shell access
+ - The transmorgrify.py script can be called directly with arguments specifying an input csv file, which column labels are the from and to sides, and whether to save the resulting model or to process the input csv to an output. Here is an example:
+
+ ```sh
+ python transmorgrify.py \
+     --train --in_csv /home/lansford/Sync/projects/tf_over/sentance_transmogrifier/examples/phonetic/phonetic.csv \
+     --a_header English \
+     --b_header Phonetic \
+     --device 0:1 \
+     --model phonetics_gpu_4000.tm \
+     --verbose \
+     --iterations 4000 \
+     --train_percentage 50
+ ```
+ - `--train` This says that the system is supposed to train as opposed to doing inference.
+ - `--a_header` This indicates the header in the csv file which identifies the from column.
+ - `--b_header` This indicates the to column.
+ - `--device` This specifies the gpu to use if you have one, or `cpu` if you do not have a gpu.
+ - `--model` This indicates where to save the model.
+ - `--verbose` Self explanatory.
+ - `--iterations` This indicates how many catboost iterations should be executed on your input data.
+ - `--train_percentage` If you are going to use the same file for testing as well as training, giving a train percentage will use only the specified percentage of the data for training (see the split sketch after this list).
+
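As a rough illustration of what `--train_percentage 50` does: the first half of the csv is used for training and the rest is held out. This mirrors the split visible in transmorgrify.py's `execute` function; the file and header names below are placeholders.

```python
# Illustrative sketch of a --train_percentage 50 split (placeholder file name).
import pandas as pd

full_data = pd.read_csv( "phonetic.csv" )
split_index = int( len( full_data ) * 50 / 100 )

train_data = full_data.iloc[:split_index, :]                             # rows seen by --train
execute_data = full_data.iloc[split_index:, :].reset_index( drop=True )  # rows processed by --execute
```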
+ ```sh
+ python transmorgrify.py \
+     --execute \
+     --in_csv /home/lansford/Sync/projects/tf_over/sentance_transmogrifier/examples/phonetic/phonetic.csv \
+     --a_header English \
+     --b_header Phonetic \
+     --device cpu \
+     --model phonetics_gpu_4000.tm \
+     --verbose \
+     --include_stats \
+     --out_csv ./phonetics_out_gpu_4000.csv \
+     --train_percentage 50
+ ```
+ - `--execute` This indicates that the model is supposed to be executed as opposed to trained.
+ - `--a_header` This indicates the header in the csv file which identifies the from column.
+ - `--b_header` This indicates the to column. The to column must be specified if `--include_stats` is also specified.
+ - `--device` This specifies the gpu to use if you have one, or `cpu` if you do not have a gpu.
+ - `--model` This indicates where to load the model from.
+ - `--verbose` Self explanatory.
+ - `--include_stats` This adds edit-distance statistics to the output csv so that you can sort and graph how well the model did. It reports the Levenshtein distance from input to output before and after transformation, and the percent improvement (see the sketch after this list).
+ - `--out_csv` This indicates where the data should be saved after being processed by the model.
+ - `--train_percentage` If you used the same file for training, give the same train percentage as was given for training and execution will use only the remaining data that was not used for training.
+
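For clarity, here is a rough sketch of the kind of numbers `--include_stats` produces. The Levenshtein routine and the percent-improvement formula below are illustrative assumptions; the script's own implementation is authoritative.

```python
# Illustrative only: edit-distance statistics in the spirit of --include_stats.
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance with a rolling row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

source, target, prediction = "enough", "enuf", "enuf"    # placeholder strings
before = levenshtein(source, target)        # distance with no transformation applied
after = levenshtein(prediction, target)     # distance after the model's transformation
improvement = 100.0 * (before - after) / before if before else 0.0
print(before, after, improvement)
```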
+ ## Python object access
+ - If, instead of running this from the command line, you want to use it in a python app, you can import the Transmorgrifier class from transmorgrify.py and use its methods. The class has the following functions.
+ - `train`
+ ```
+ Train the Transmorgrifier model. This does not save it to disk but just trains in memory.
+
+ Keyword arguments:
+ from_sentances -- An array of strings for the input sentences.
+ to_sentances -- An array of strings of the same length as from_sentances which the model is to be trained to convert to.
+ iterations -- An integer specifying the number of catboost training iterations. (default 4000)
+ device -- The gpu reference which catboost wants or "cpu". (default cpu)
+ trailing_context -- The number of characters after the action point to include for context. (default 7)
+ leading_context -- The number of characters before the action point to include for context. (default 7)
+ verbose -- Increases the amount of text output during training. (default True)
+ ```
+ - `save`
+ ```
+ Saves the model previously trained with train to a specified model file.
+
+ Keyword arguments:
+ model -- The pathname to save the model to, such as "my_model.tm"
+ ```
+ - `load`
+ ```
+ Loads a previously saved model from the file system.
+
+ Keyword arguments:
+ model -- The filename of the model to load. (default my_model.tm)
+ ```
+ - `execute`
+ ```
+ Runs the data from from_sentances. The results are returned
+ using yield, so you need to wrap this in list() if you want
+ to index it. from_sentances can be an array or a generator.
+
+ Keyword arguments:
+ from_sentances -- Something iterable which returns strings.
+ ```
+ - Here is an example of using object access to train a model:
+ ```python
+ import pandas as pd
+ import transmorgrify
+
+ #load training data
+ train_data = pd.read_csv( "training.csv" )
+
+ #do the training
+ my_model = transmorgrify.Transmorgrifier()
+ my_model.train(
+     from_sentances=train_data["from_header"],
+     to_sentances=train_data["to_header"],
+     iterations=4000 )
+
+ #save the results
+ my_model.save( "my_model.tm" )
+ ```
+
+ - Here is an example of using object access to run a model:
+ ```python
+ import pandas as pd
+ import transmorgrify
+
+ #load inference data
+ inference_data = pd.read_csv( "inference.csv" )
+
+ #load the model
+ my_model = transmorgrify.Transmorgrifier()
+ my_model.load( "my_model.tm" )
+
+ #do the inference
+ #execute returns a generator, so wrap it with a list
+ results = list( my_model.execute( inference_data["from_header"] ) )
+ ```
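Because `execute` returns a generator, you do not have to collect everything into a list. Here is a small illustrative variation of the example above (file and column names are placeholders) that streams results one at a time:

```python
import pandas as pd
import transmorgrify

# Load a previously trained model (placeholder file name).
my_model = transmorgrify.Transmorgrifier()
my_model.load( "my_model.tm" )

inference_data = pd.read_csv( "inference.csv" )

# execute() yields results lazily, so they can be consumed as they are produced.
for source, result in zip( inference_data["from_header"], my_model.execute( inference_data["from_header"] ) ):
    print( source, "->", result )
```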
app.py CHANGED
@@ -1,16 +1,16 @@
  import gradio as gr
  import transmorgrify
 
- eng_to_ipa_tm = transmorgrify.Transmorgrifyer()
+ eng_to_ipa_tm = transmorgrify.Transmorgrifier()
  eng_to_ipa_tm.load( "./examples/phonetic/phonetics_gpu_4000.tm" )
 
- ipa_to_eng_tm = transmorgrify.Transmorgrifyer()
+ ipa_to_eng_tm = transmorgrify.Transmorgrifier()
  ipa_to_eng_tm.load( "./examples/phonetic/reverse_phonetics_gpu_4000.tm")
 
- eng_to_pig_tm = transmorgrify.Transmorgrifyer()
+ eng_to_pig_tm = transmorgrify.Transmorgrifier()
  eng_to_pig_tm.load( "./examples/piglattin/piglattin_gpu_4000.tm" )
 
- pig_to_eng_tm = transmorgrify.Transmorgrifyer()
+ pig_to_eng_tm = transmorgrify.Transmorgrifier()
  pig_to_eng_tm.load( "./examples/piglattin/reverse_piglattin_gpu_4000.tm" )
 
 
transmorgrify.py CHANGED
@@ -14,8 +14,20 @@ START = 3
 
  FILE_VERSION = 1
 
- class Transmorgrifyer:
-     def train( self, from_sentances, to_sentances, iterations, device, trailing_context, leading_context, verbose ):
+ class Transmorgrifier:
+     def train( self, from_sentances, to_sentances, iterations = 4000, device = 'cpu', trailing_context = 7, leading_context = 7, verbose=True ):
+         """
+         Train the Transmorgrifier model. This does not save it to disk but just trains in memory.
+
+         Keyword arguments:
+         from_sentances -- An array of strings for the input sentences.
+         to_sentances -- An array of strings of the same length as from_sentances which the model is to be trained to convert to.
+         iterations -- An integer specifying the number of catboost training iterations. (default 4000)
+         device -- The gpu reference which catboost wants or "cpu". (default cpu)
+         trailing_context -- The number of characters after the action point to include for context. (default 7)
+         leading_context -- The number of characters before the action point to include for context. (default 7)
+         verbose -- Increases the amount of text output during training. (default True)
+         """
          X,Y = _parse_for_training( from_sentances, to_sentances, num_pre_context_chars=leading_context, num_post_context_chars=trailing_context )
 
          #train and save the action_model
@@ -30,7 +42,15 @@ class Transmorgrifyer:
          self.leading_context = leading_context
          self.iterations = iterations
 
-     def save( self, model ):
+         return self
+
+     def save( self, model='my_model.tm' ):
+         """
+         Saves the model previously trained with train to a specified model file.
+
+         Keyword arguments:
+         model -- The pathname to save the model to, such as "my_model.tm" (default my_model.tm)
+         """
          self.name = model
          with zipfile.ZipFile( model, mode="w", compression=zipfile.ZIP_DEFLATED, compresslevel=9 ) as myzip:
              with myzip.open( 'params.json', mode='w' ) as out:
@@ -47,7 +67,15 @@ class Transmorgrifyer:
              myzip.write( temp_filename, "char.cb" )
              os.unlink( temp_filename )
 
-     def load( self, model ):
+         return self
+
+     def load( self, model='my_model.tm' ):
+         """
+         Loads a previously saved model from the file system.
+
+         Keyword arguments:
+         model -- The filename of the model to load. (default my_model.tm)
+         """
          self.name = model
          with zipfile.ZipFile( model, mode='r' ) as zip:
              with zip.open( 'params.json' ) as fin:
@@ -72,6 +100,14 @@ class Transmorgrifyer:
 
 
      def execute( self, from_sentances, verbose=False ):
+         """
+         Runs the data from from_sentances. The results are returned
+         using yield, so you need to wrap this in list() if you want
+         to index it. from_sentances can be an array or a generator.
+
+         Keyword arguments:
+         from_sentances -- Something iterable which returns strings.
+         """
          for i,from_sentance in enumerate(from_sentances):
 
              yield _do_reconstruct(
@@ -469,7 +505,7 @@ def train( in_csv, a_header, b_header, model, iterations, device, leading_contex
      if verbose: print( "parcing data for training" )
 
 
-     tm = Transmorgrifyer()
+     tm = Transmorgrifier()
 
      tm.train( from_sentances=train_data[a_header],
                to_sentances=train_data[b_header],
@@ -490,7 +526,7 @@ def execute( include_stats, in_csv, out_csv, a_header, b_header, model, execute_
      execute_data = full_data.iloc[split_index:,:].reset_index(drop=True)
 
 
-     tm = Transmorgrifyer()
+     tm = Transmorgrifier()
      tm.load( model )
 
      results = list(tm.execute( execute_data[a_header ], verbose=verbose ))
@@ -601,7 +637,7 @@ def main():
 
 
      if args.gradio:
-         tm = Transmorgrifyer()
+         tm = Transmorgrifier()
          tm.load( args.model )
 
          tm.demo( args.share is not None )
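As the `save` and `load` code above shows, a `.tm` model file is an ordinary zip archive holding a `params.json` plus serialized catboost model files (at least a `char.cb`). Here is a small illustrative sketch for peeking inside a saved model; the file name is a placeholder:

```python
# Illustrative only: inspect a saved .tm archive using the standard library.
import json
import zipfile

with zipfile.ZipFile( "my_model.tm", mode="r" ) as archive:
    print( archive.namelist() )                  # e.g. ['params.json', ..., 'char.cb']
    with archive.open( "params.json" ) as fin:
        params = json.load( fin )                # parameters stored by save()
    print( params )
```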