prompt_author: Will Weaver          
prompt_author_institution: UM          
prompt_description: Basic prompt used by the University of Michigan. Designed to be a starting point for more complex prompts.   
LLM: gpt
instructions: '1. Refactor the unstructured OCR text into a dictionary based on the
  JSON structure outlined below.

  2. You should map the unstructured OCR text to the appropriate JSON key and then
  populate the field based on its rules.

  3. Some JSON key fields are permitted to remain empty if the corresponding information
  is not found in the unstructured OCR text.

  4. Ignore any information in the OCR text that doesn''t fit into the defined JSON
  structure.

  5. Duplicate dictionary fields are not allowed.

  6. Ensure that all JSON keys are in lowercase.

  7. Ensure that new JSON field values follow sentence case capitalization.

  8. Ensure all key-value pairs in the JSON dictionary strictly adhere to the format
  and data types specified in the template.

  9. Ensure the output JSON string is valid JSON format. It should not have trailing
  commas or unquoted keys.

  10. Only return a JSON dictionary represented as a string. You should not explain
  your answer.'
json_formatting_instructions: "The next section of instructions outlines how to format\
  \ the JSON dictionary. The keys are the same as those of the final formatted JSON\
  \ object.\nFor each key there is a format requirement that specifies how to transcribe\
  \ the information for that key. \nThe possible formatting options are:\n1. \"verbatim\
  \ transcription\" - field is populated with verbatim text from the unformatted OCR.\n\
  2. \"spell check transcription\" - field is populated with spelling corrected text\
  \ from the unformatted OCR.\n3. \"boolean yes no\" - field is populated with only\
  \ yes or no.\n4. \"boolean 1 0\" - field is populated with only 1 or 0.\n5. \"integer\"\
  \ - field is populated with only an integer.\n6. \"[list]\" - field is populated\
  \ from one of the values in the list.\n7. \"yyyy-mm-dd\" - field is populated with\
  \ a date in the format year-month-day.\nThe desired null value is also given. Populate\
  \ the field with the null value if the information for that key is not present in\
  \ the unformatted OCR text."
mapping:
  COLLECTING:
  - collectors
  - collector_number
  - determined_by
  - multiple_names
  - verbatim_date
  - date
  - end_date
  GEOGRAPHY:
  - country
  - state
  - county
  - min_elevation
  - max_elevation
  - elevation_units
  LOCALITY:
  - locality_name
  - verbatim_coordinates
  - decimal_coordinates
  - datum
  - plant_description
  - cultivated
  - habitat
  MISCELLANEOUS: []
  TAXONOMY:
  - catalog_number
  - genus
  - species
  - subspecies
  - variety
  - forma
rules:
  Dictionary:
    catalog_number:
      description: The barcode identifier, typically a number with at least 6 digits,
        but fewer than 30 digits.
      format: verbatim transcription
      null_value: ''
    collector_number:
      description: Unique identifier or number that denotes the specific collecting
        event and is associated with the collector.
      format: verbatim transcription
      null_value: s.n.
    collectors:
      description: Full name(s) of the individual(s) responsible for collecting the
        specimen. When multiple collectors are involved, their names should be separated
        by commas.
      format: verbatim transcription
      null_value: not present
    country:
      description: Country that corresponds to the current geographic location of
        collection. Capitalize the first letter of each word. If an abbreviation is
        given, populate the field with the full spelling of the country's name.
      format: spell check transcription
      null_value: ''
    county:
      description: Administrative division 2 that corresponds to the current geographic
        location of collection; capitalize first letter of each word. Administrative
        division 2 is equivalent to a U.S. county, parish, borough.
      format: spell check transcription
      null_value: ''
    cultivated:
      description: Cultivated plants are intentionally grown by humans. In text descriptions,
        look for planting dates, garden locations, cultivar names, or words such as
        'ornamental', 'garden', or 'farm' to indicate a cultivated plant.
      format: boolean yes no
      null_value: ''
    date:
      description: 'Date the specimen was collected formatted as year-month-day. If
        specific components of the date are unknown, they should be replaced with
        zeros. Examples: ''0000-00-00'' if the entire date is unknown, ''YYYY-00-00''
        if only the year is known, and ''YYYY-MM-00'' if year and month are known
        but day is not.'
      format: yyyy-mm-dd
      null_value: ''
    datum:
      description: Datum of location coordinates. Possible values are included in the
        format list. Leave field blank if unclear. [WGS84, WGS72, WGS66, WGS60, NAD83,
        NAD27, OSGB36, ETRS89, ED50, GDA94, JGD2011, Tokyo97, KGD2002, TWD67, TWD97,
        BJS54, XAS80, GCJ-02, BD-09, PZ-90.11, GTRF, CGCS2000, ITRF88, ITRF89, ITRF90,
        ITRF91, ITRF92, ITRF93, ITRF94, ITRF96, ITRF97, ITRF2000, ITRF2005, ITRF2008,
        ITRF2014, Hong Kong Principal Datum, SAD69]
      format: '[list]'
      null_value: ''
    decimal_coordinates:
      description: Correct and convert the verbatim location coordinates to conform
        with the decimal degrees GPS coordinate format.
      format: spell check transcription
      null_value: ''
    determined_by:
      description: Full name of the individual responsible for determining the taxonomic
        name of the specimen. Sometimes the name will be near the characters 'det',
        which denote determination. This name may be isolated from other names in the
        unformatted OCR text.
      format: verbatim transcription
      null_value: ''
    elevation_units:
      description: 'Elevation units must be meters. If min_elevation field is populated,
        then elevation_units: ''m''. Otherwise elevation_units: ''''.'
      format: spell check transcription
      null_value: ''
    end_date:
      description: 'If a date range is provided, this represents the later or ending
        date of the collection period, formatted as year-month-day. If specific components
        of the date are unknown, they should be replaced with zeros. Examples: ''0000-00-00''
        if the entire end date is unknown, ''YYYY-00-00'' if only the year of the
        end date is known, and ''YYYY-MM-00'' if year and month of the end date are
        known but the day is not.'
      format: yyyy-mm-dd
      null_value: ''
    forma:
      description: Taxonomic determination to form (f.).
      format: verbatim transcription
      null_value: ''
    genus:
      description: Taxonomic determination to genus. Genus must be capitalized. If
        genus is not present, use the taxonomic family name followed by the word 'indet'.
      format: verbatim transcription
      null_value: ''
    habitat:
      description: Description of a plant's habitat or the location where the specimen
        was collected. Ignore descriptions of the plant itself.
      format: verbatim transcription
      null_value: ''
    locality_name:
      description: Description of geographic location, landscape, landmarks, regional
        features, nearby places, or any contextual information aiding in pinpointing
        the exact origin or site of the specimen.
      format: verbatim transcription
      null_value: ''
    max_elevation:
      description: Maximum elevation or altitude in meters. If only one elevation
        is present, then max_elevation should be set to the null_value. Convert from
        feet ('ft' or 'ft.' or 'feet') to meters ('m' or 'm.' or 'meters') only if
        the units are explicit. Round to integer.
      format: integer
      null_value: ''
    min_elevation:
      description: Minimum elevation or altitude in meters. Convert from feet ('ft'
        or 'ft.' or 'feet') to meters ('m' or 'm.' or 'meters') only if the units
        are explicit. Round to integer.
      format: integer
      null_value: ''
    multiple_names:
      description: Indicate whether multiple people or collector names are present
        in the unformatted OCR text. If you see more than one person's name the value
        is 'yes'; otherwise the value is 'no'.
      format: boolean yes no
      null_value: ''
    plant_description:
      description: Description of plant features such as leaf shape, size, color,
        stem texture, height, flower structure, scent, fruit or seed characteristics,
        root system type, overall growth habit and form, any notable aroma or secretions,
        presence of hairs or bristles, and any other distinguishing morphological
        or physiological characteristics.
      format: verbatim transcription
      null_value: ''
    species:
      description: Taxonomic determination to species. Do not capitalize species.
      format: verbatim transcription
      null_value: ''
    state:
      description: Administrative division 1 that corresponds to the current geographic
        location of collection. Capitalize first letter of each word. Administrative
        division 1 is equivalent to a U.S. State.
      format: spell check transcription
      null_value: ''
    subspecies:
      description: Taxonomic determination to subspecies (subsp.).
      format: verbatim transcription
      null_value: ''
    variety:
      description: Taxonomic determination to variety (var.).
      format: verbatim transcription
      null_value: ''
    verbatim_coordinates:
      description: Verbatim location coordinates as they appear on the label. Do not
        convert formats. Possible coordinate types are one of [Lat, Long, UTM, TRS].
      format: verbatim transcription
      null_value: ''
    verbatim_date:
      description: Date of collection exactly as it appears on the label. Do not change
        the format or correct typos.
      format: verbatim transcription
      null_value: s.d.
  SpeciesName:
    taxonomy:
    - Genus_species