cholmes committed
Commit 8ba3988 · verified · 1 Parent(s): cd18354

More improvements


* More hcat instructions
* More detailed instructions on various actions.
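
The "hcat instructions" part of this commit reads a text file at startup and concatenates it into the prompt template. A minimal sketch of that pattern follows, under stated assumptions: the helper names `load_instructions` and `build_template` are hypothetical, the demo file content is invented, plain `str.format` stands in for the prompt template class, and the brace escaping is an extra precaution that the commit itself does not include (it matters only if the instruction file contains literal `{` or `}`).

```python
import tempfile
from pathlib import Path


def load_instructions(path):
    """Read the instruction text and escape braces so they stay literal."""
    text = Path(path).read_text(encoding="utf-8")
    # Assumption: the template is later rendered with str.format-style
    # substitution, where "{{" and "}}" produce literal braces.
    return text.replace("{", "{{").replace("}", "}}")


def build_template(instructions):
    """Splice the instructions between the fixed parts of a template."""
    return (
        "Answer the question using the {dialect} database.\n"
        + instructions
        + "\nQuestion: {input}\n"
    )


# Demo with a throwaway file; the real file lives at fiboa/hcat-instructions.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "hcat-instructions.txt"
    demo_file.write_text("Crop names look like {winter_rye}.\n", encoding="utf-8")
    template = build_template(load_instructions(demo_file))

rendered = template.format(dialect="duckdb", input="How many fields grow rye?")
```

Keeping the long instruction text in its own file lets it change without touching `app.py`; the escaping step is one possible way to keep that text from introducing unintended template variables.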

Files changed (1)
  1. fiboa/app.py +18 -11
fiboa/app.py CHANGED
@@ -27,6 +27,12 @@ st.set_page_config(
  )
  st.title("fiboaGPT")

  new_prompt = PromptTemplate(input_variables=['dialect', 'input', 'table_info', 'top_k'],
  template=
  '''
@@ -36,8 +42,6 @@ and return the answer. Only limit for {top_k} when asked for "some" or "examples
  This duckdb database includes full support for spatial queries, so it will understand most PostGIS-type
  queries as well. Remember that you must cast blob column to a geom type using ST_GeomFromWKB(geometry) AS geometry
  before any spatial operations. Do not use ST_GeomFromWKB for non-spatial queries.
-
-
  If you are asked to "map" or "show on a map", then be select the "geometry" column in your query.
  If asked to show a "table", you must not include the "geometry" column from the query results.
 
@@ -56,22 +60,25 @@ If you need to compute the total area, do it manually, with a SUM of the area co
  The column "perimeter" is in the unit meters, you may need to convert it to other units, e.g. kilometers.
  The column "collection" contains the country codes for the Baltic states:
  "ec_lt" for Latvia, "ec_lv" for Lithuania, "ec_es" for Estonia.

- Many questions will be about specific crops. The following is the list of strings that will appear in the crop_type column:
-
- arable_crops, cereal, common_soft_wheat, winter_common_soft_wheat, spring_common_soft_wheat, unspecified_season_common_soft_wheat, durum_hard_wheat, winter_durum_hard_wheat, spring_durum_hard_wheat, unspecified_season_durum_hard_wheat, rye, winter_rye, spring_rye, unspecified_season_rye, barley, winter_barley, spring_barley, unspecified_season_barley, oats, winter_oats, spring_oats, unspecified_season_oats, grain_maize_corn_popcorn, unspecified_season_grain_maize_corn_popcorn, rice, unspecified_season_rice, triticale, winter_triticale, spring_triticale, unspecified_season_triticale, millet_sorghum, winter_millet_sorghum, spring_millet_sorghum, teff, unspecified_season_millet_sorghum, spelt, winter_spelt, spring_spelt, unspecified_season_spelt, meslin, winter_meslin, spring_meslin, unspecified_season_meslin, emmer, winter_emmer, spring_emmer, unspecified_season_emmer, einkorn, winter_einkorn, spring_einkorn, unspecified_season_einkorn, canary_seed_canaryseed, unspecified_season_canary_seed_canaryseed, unspecified_cereals, winter_unspecified_cereals, spring_unspecified_cereals, summer_unspecified_cereals, unspecified_season_unspecified_cereals, other_cereals, unspecified_season_other_cereals, legumes_dried_pulses_protein_crops, beans, chickpeas, esparsette_onobrychis, fenugreek, lentils, peas, sweet_lupins, unspecified_legumes_dried_pulses_protein_crops, other_dry_pulses, potatoes, sweet_potatoes, fodder_roots, industrial_nonfood_crops, tobacco, hops, cotton, rapeseed_rape, winter_rapeseed_rape, spring_rapeseed_rape, summer_rapeseed_rape, unspecified_season_rapeseed_rape, sunflower, poppy, winter_poppy, summer_poppy, flax_linseed, flax_linen, flax_linseed_oil, oilseed_crops, guizotia_abyssinica_nyger, hemp_cannabis, finola, fibre_crops, aromatic_medicinal_culinary_plants_spices_herbs, actaea_baneberry_christopher_herbs, alchemilla_ladys_mantle, anethum_dill, angelica, anise_aniseed, artemisia, basil, black_cumin, borage, calendula_marigold, caraway, catnip, 
chamomile, chervil, coriander, ericaceae_heather, galium_bedstraw, hibiscus, lavender_lavandula, lemon_balm_melissa, lovage_maggiplant, mints_peppermint, moldavian_dragonhead, nasturtiums, nettles, oregano, parsly, piper_pepper, polygonum, rosemary, rubia_tinctorum_common_madder, saffron_crocus_sativus, silver_comb, st_johns_wort, stachys_hedgenettle_chinese_artichoke, tarragon, thyme, valerian, yarrow, unspecified_aromatic_medicinal_culinary_plants_spices_herbs, other_aromatic_medicinal_culinary_plants_spices_herbs, marian_thistles, phacelia, camelina, onobrychis_sainfoins, other_industrial_crops, fresh_vegetables, flowers_ornamental_plants, adonis, anemones_windflowers, asters, begonias, bluebells, bulrush, burnet, carnation, chrysanthemum, cornflowers, corsican_hellebore, dahlia, daisy_daisies, dandelions, echinacea_sun_hat, edelweiss, fiddleneck_amsinckia, fuchsias, galega, gentians, gladiolus_gladioli, goldenrod, iris, isatis_tinctoria_woad, lilies, lotus, lunaria_honesty_silver, malva, milk_star, miscanthus_silvergrass, monstera_adansonii_eyelet, moonseed, narcissus_daffodil, peony_peonies, primrose, rhododendron, roses, rudbeckia_coneflowers, safflower, salsify, sanvitalia_procumbens, serradella, silene_catchfly, silphium_rosinweeds, snapdragons, stonecrop, tagetes, thimbles, tulips, viola, violets_pansies, zinnias, unspecified_flowers_ornamental_plants, other_flowers_ornamental_plants, plants_harvested_green, temporary_grass, poaceae_grasses, elymus, festuca_fescue, cocksfoot_catgrass, festulolium, lolium_ryegrass, setaria, sod_turf, switchgrass, timothy, legumes_harvested_green, alfalfa_lucerne, arachis, clover, melilot, vetches, unspecified_legumes_harvested_green, green_silo_maize, other_plants_harvested_green, arable_land_seed_seedlings, fallow_land_not_crop, kitchen_gardens, strawberries, cucurbits, cucumber_pickle, honeydew, melon, pumpkin_squash_gourd, watermelon, zucchini_courgette, pseudocereal, amaranth, buckwheat, quinoa, soy_soybeans, fennel, 
topinambur_jerusalem_artichoke, sage_chia, asparagus, brassicaceae_cruciferae, mustard, brassica_oleracea_cabbage, bok_choy_pak_choi, broccoli, brussels_sprouts, cauliflower, chinese_cabbage, collard_greens, gai_lan, kale, kohlrabi, red_cabbage, savoy_cabbage, white_cabbage, other_brassica_oleracea_cabbage, cress, horseradish, swede_rutabaga, alliums, chives, garlic, leek, onions, scallion, shallot, rhubarb, purslane, celery, celeriac, leaf_celery, aubergine_eggplant, artichoke, tomato, root_vegetables, arctium_burdock, beetroot_beets, carrots_daucus, mangelwurzel_fodder_beet, parsnips, radish, sugar_beet, turnips, unspecified_root_vegetables, capsicum, bell_pepper_paprika, chili_pepper, salads_lettuce_leaf_vegetables, chard, chicory_chicories, endive, iceberg, lambs_lettuce_rapunzel, rocket_arugula, sorrel, spinach, other_salads_lettuce_leaf_vegetables, other_arable_land_crops, pasture_meadow_grassland_grass, permanent_crops_perennial, orchards_fruits, amelanchier_serviceberry, apples, apricots, cherry_cherries, feijoa, fig, kiwi, medlar_loquat, nectarine, pawpaw, peach, pears, plums, pomegranate, quinces, unspecified_orchards_fruits, berries_berry_species, aronia_chokeberries, blackberry, blackcurrant_cassis, blueberry, cranberry, currants, gooseberry_gooseberries_cranberries, hippophae_sea_buckthorns_seaberry, jostaberry, raspberry_raspberries, redcurrant, rose_hip_rosehip, rowan_rowanberries, tayberry, unspecified_berries_berry_species, nuts, almond, hazelnuts_hazel, pecan, pistachio, sweet_chestnuts, walnuts, citrus_plantations, olive_plantations, olives_for_oil_production, table_olives, vineyards_wine_vine_rebland_grapes, nurseries_nursery, shrubberries_shrubs, azaleas, chaenomeles_cathayensis, crataegus_hawthorn, elder_elderberry, honeysuckle, ricinus_castor, wire_bush, ginko, avocado, legumes_from_trees, carob, mesquite, tamarind, unspecified_permanent_crops, other_permanent_crops_plantations, mushrooms_energy_genetically_modified_crops, energy_crops, 
genetically_modified_crops, igniscum_candy, sida_virginia_mallow, truffle, other_mushrooms_energy_crops_genetically_modified_crops, greenhouse_foil_film, tree_wood_forest, afforestation_reforestation, aspen, birch, dogwood_cornus, eucalyptus, oak, populus, willows_osiers, unspecified_tree_wood_forest, other_tree_wood_forest, peat_turf, unmaintained, not_known_and_other
-
- You should take any crop name the user mentions and convert it to one of the strings in the list above. Always query on the crop_type column, using one of the above, never query on 'crop' column.
-
- If the user asks for 'percent' of crops or fields for one of the countries you must always calculate the percentage manually, by summing up the area manually. You total number of hectares to calculate the percentage from is 1583923 for Lithuania, 1788859 for latvia and 973945 for Estonia. If they don't specify a country use 4346727.
  There is no 'percent' column, so when you calculate the percentage manually you must sum the crop area and then use the total area of the country.
- one of the countries you must always calculate the percentage manually, by summing up the area manually. You can use 1583923 175015 km²

  If the user asks for the 'top 10' (or other number) of a crop then sum by area and then sort by that sum.

  If the user asks for anything related to 'field size' then you must use the 'area' column and calculate it manually.
  If the user asks for the 'average field size' then you must calculate the average area manually, by summing up the area of all fields and dividing by the number of fields. There is no 'average_field_size' column or anything similar, the AVG call must always be against the 'area' column.
-
  Question: {input}
  '''
  )
 
  )
  st.title("fiboaGPT")

+ # Read the instructions from 'hcat-instructions.txt'
+ hcat_instructions = ""
+ with open('fiboa/hcat-instructions.txt', 'r') as file:
+     hcat_instructions = file.read()
+
+
  new_prompt = PromptTemplate(input_variables=['dialect', 'input', 'table_info', 'top_k'],
  template=
  '''
 
  This duckdb database includes full support for spatial queries, so it will understand most PostGIS-type
  queries as well. Remember that you must cast blob column to a geom type using ST_GeomFromWKB(geometry) AS geometry
  before any spatial operations. Do not use ST_GeomFromWKB for non-spatial queries.
  If you are asked to "map" or "show on a map", then be select the "geometry" column in your query.
  If asked to show a "table", you must not include the "geometry" column from the query results.
 
 
  The column "perimeter" is in the unit meters, you may need to convert it to other units, e.g. kilometers.
  The column "collection" contains the country codes for the Baltic states:
  "ec_lt" for Latvia, "ec_lv" for Lithuania, "ec_es" for Estonia.
+ Be sure to always include the collection with the right country for any query about a specific country, including it in the WHERE clause.

+ If the user asks for 'percent' of crops or fields for one of the countries you must always calculate the percentage manually, by summing up the area manually. You total number of hectares to calculate the percentage from is 1583923 for Lithuania, 1788859 for latvia and 973945 for Estonia. If they don't specify a country use 4346727. If you use one of these be sure to always include the right collection in the where clause.
  There is no 'percent' column, so when you calculate the percentage manually you must sum the crop area and then use the total area of the country.

  If the user asks for the 'top 10' (or other number) of a crop then sum by area and then sort by that sum.

  If the user asks for anything related to 'field size' then you must use the 'area' column and calculate it manually.
  If the user asks for the 'average field size' then you must calculate the average area manually, by summing up the area of all fields and dividing by the number of fields. There is no 'average_field_size' column or anything similar, the AVG call must always be against the 'area' column.
+ If the user asks for results by 'number of fields' or 'field count' then you must calculate the count manually, with a COUNT(*) AS field_count - there is no field_count column, and don't use 'area'.
+ If the user asks for deciles or quantiles do not use the NTILE functionality - it does not exist in DuckDB. Instead, always replace NTILE(N) as follows: assign ROW_NUMBER(), calculate the total number of rows, and compute the group using CEIL(row_number::FLOAT / (total_rows / N)).
+ You should use a CTE (computed_quantiles) to do this: compute the row_number and quantile using a window function in a temporary result set. ROW_NUMBER() generates row numbers based on the ORDER BY area. CEIL(row_number::FLOAT / (total_rows / 5)) calculates the quantile for each row (here / 5 gives quintiles; change the divisor for other groupings). Outer query: aggregate area by quantile using AVG(area).
+ Be sure to avoid row_number in GROUP BY: the intermediate result in the WITH clause computes quantile, so the outer query only groups by quantile. A sample query is 'WITH computed_quantiles AS (SELECT area, CEIL(ROW_NUMBER() OVER (ORDER BY area)::FLOAT / (COUNT(*) OVER () / 5)) AS quantile FROM testing) SELECT quantile, AVG(area) AS average_field_area FROM computed_quantiles GROUP BY quantile ORDER BY quantile;'
+ Adjust the 5 of 'count(*) over () / 5' to 10 for deciles, or to whatever other number the user requests.
+ Generally you should use a Common Table Expression (CTE) or subquery to compute things like ranks first and then filter the results in the main query, as DuckDB does not allow window functions (like ROW_NUMBER()) directly inside a WHERE clause.
+
+ '''
+ + hcat_instructions + # Concatenate the instructions here
+ '''
  Question: {input}
  '''
  )
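
The quantile instructions added in this commit describe a row-number-based replacement for NTILE: order the rows by area, assign each a row number, and place it in group CEIL(row_number / (total_rows / N)). The following is a minimal sketch of that same arithmetic in plain Python, meant only to make the grouping easier to follow; the function name `average_area_by_quantile` and the sample areas are hypothetical and not part of the commit.

```python
import math
from collections import defaultdict


def average_area_by_quantile(areas, n_groups):
    """Mirror CEIL(ROW_NUMBER() / (COUNT(*) / N)) followed by AVG(area) per group."""
    ordered = sorted(areas)            # ORDER BY area
    total_rows = len(ordered)          # COUNT(*) OVER ()
    groups = defaultdict(list)
    for row_number, area in enumerate(ordered, start=1):  # ROW_NUMBER()
        quantile = math.ceil(row_number / (total_rows / n_groups))
        groups[quantile].append(area)
    # Outer query: AVG(area) ... GROUP BY quantile ORDER BY quantile
    return {q: sum(vals) / len(vals) for q, vals in sorted(groups.items())}


# Hypothetical field areas; n_groups=5 corresponds to the "/ 5" in the sample query.
sample_areas = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
quintile_averages = average_area_by_quantile(sample_areas, 5)
```

With ten rows and five groups, each group receives two consecutive rows, so the averages rise from the smallest pair to the largest. Passing `n_groups=10` corresponds to the decile case mentioned in the prompt.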