import ibis
import os
import streamlit as st

from langchain.chains import create_sql_query_chain
from langchain_community.utilities import SQLDatabase
from langchain_core.prompts.prompt import PromptTemplate
from langchain_openai import ChatOpenAI

from query import execute_prompt
# from data import DATA

if os.path.exists("duck.db"):
    os.remove("duck.db")
if os.path.exists("duck.db.wal"):
    os.remove("duck.db.wal")

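geoparquet = "data.parquet"
con = ibis.connect("duckdb://duck.db", extensions=["spatial"])
# Register the GeoParquet file as the "crops" table in DuckDB.
# Note: ibis expressions are lazy and the result of cast() is not stored here.
con.read_parquet(geoparquet, "crops").cast({"geometry": "geometry"})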
# for code, url in DATA.items():
#     tbl = con.read_parquet(url, code).cast({"geometry": "geometry"})

st.set_page_config(
    page_title="fiboaGPT",
    page_icon="🦜",
)
st.title("fiboaGPT")

# Read the instructions from 'hcat-instructions.txt'
hcat_instructions = ""
with open('fiboa/hcat-instructions.txt', 'r') as file:
    hcat_instructions = file.read()


new_prompt = PromptTemplate(input_variables=['dialect', 'input', 'table_info', 'top_k'], 
                        template=
'''
Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query
and return the answer. Only limit to {top_k} results when asked for "some" or "examples".

This DuckDB database includes full support for spatial queries, so it will understand most PostGIS-type
queries as well. Remember that you must cast the blob column to a geometry type using ST_GeomFromWKB(geometry) AS geometry
before any spatial operations. Do not use ST_GeomFromWKB for non-spatial queries.
If you are asked to "map" or "show on a map", then be sure to select the "geometry", "area", and "crop" columns in your query.
If asked to show a "table", you must not include the "geometry" column from the query results.

Use the following format: return only the SQLQuery to run. DO NOT prefix it with "SQLQuery:".
Do not include an explanation. Use only SQL functions that DuckDB supports.

Use ONLY the column names that you can see in the table description.
Do NOT query for columns that do not exist. Pay attention to which column is in which table.

Tables include {table_info}. The data always comes from the table called "crops".
NEVER use the "testing" table. Pay close attention to this table schema.

The column "area" is in the unit hectares, you may need to convert it to other units, e.g. square meters.
There is no other column related to area information, especially not total_area or similar!
If you need to compute the total area, do it manually, with a SUM of the area column. You should always use the 'area' column - never use a 'total_area' column.
The column "perimeter" is in the unit meters, you may need to convert it to other units, e.g. kilometers.
The column "collection" contains the country codes for the Baltic states:
"ec_lv" for Latvia, "ec_lt" for Lithuania, "ec_es" for Estonia.
Be sure to always include the collection with the right country for any query about a specific country, including it in the WHERE clause.

If the user asks for 'percent' of crops or fields for one of the countries you must always calculate the percentage manually, by summing up the area manually. Your total number of hectares to calculate the percentage from is 1583923 for Lithuania, 1788859 for Latvia and 973945 for Estonia. If they don't specify a country use 4346727. If you use one of these be sure to always include the right collection in the WHERE clause.
There is no 'percent' column, so when you calculate the percentage manually you must sum the crop area and then use the total area of the country.

If the user asks for the 'top 10' (or other number) of a crop then sum by area and then sort by that sum.

If the user asks for anything related to 'field size' then you must use the 'area' column and calculate it manually.
If the user asks for the 'average field size' then you must calculate the average area manually, by summing up the area of all fields and dividing by the number of fields. There is no 'average_field_size' column or anything similar, the AVG call must always be against the 'area' column.
If the user asks for results by 'number of fields' or 'field count' then you must calculate the count manually, with a COUNT(*) AS field_count - there is no field_count column, and don't use 'area'.
If the user asks for deciles or quantiles, do not use the NTILE function - it does not exist in DuckDB. Instead, always replace NTILE(N) as follows: assign ROW_NUMBER(), calculate the total number of rows, and compute the group using CEIL(row_number::FLOAT / (total_rows / N)).
You should use a CTE (computed_quantiles) for this: compute the row_number and quantile using a window function in a temporary result set. ROW_NUMBER() generates row numbers based on the ORDER BY area. CEIL(row_number::FLOAT / (total_rows / 5)) calculates the quantile for each row (5 gives quintiles; change the divisor for other groupings). Outer query: aggregate area by quantile using AVG(area).
  Be sure to avoid row_number in GROUP BY: the intermediate result in the WITH clause computes quantile, so the outer query only groups by quantile. A sample query is 'WITH computed_quantiles AS (SELECT area, CEIL(ROW_NUMBER() OVER (ORDER BY area)::FLOAT / (COUNT(*) OVER () / 5)) AS quantile FROM crops) SELECT quantile, AVG(area) AS average_field_area FROM computed_quantiles GROUP BY quantile ORDER BY quantile;'
  Adjust the 5 in 'COUNT(*) OVER () / 5' to 10 for deciles, or to whatever number the user requests.
Generally you should use a Common Table Expression (CTE) or subquery to compute things like ranks first and then filter the results in the main query, as DuckDB does not allow window functions (like ROW_NUMBER()) directly inside a WHERE clause.
  
'''
+ hcat_instructions +  # Concatenate the instructions here
'''
Question: {input}
'''
)
# TODO: if the data gets updated, change "ec_es" to "ec_ee"

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, api_key=st.secrets["OPENAI_API_KEY"])

# Create the SQL query chain with the custom prompt
db = SQLDatabase.from_uri("duckdb:///duck.db", view_support=True)
chain = create_sql_query_chain(llm, db, prompt=new_prompt, k=100)

'''
Ask me about fiboa data (here: all Baltic states)!
Request "a map" to get map output, or "a table" for tabular output, e.g.

- Show a table of the top ten crops in the Baltics.
- Show a map with the 10 largest sugar beet fields.
- What is the percent of oats in each country?
- Show a map with the largest field in Estonia
- What are the quantiles of field size for Latvia?

'''

example = "How many berry fields are there in each country?"
with st.container():  
    if prompt := st.chat_input(example, key="chain"):
        st.chat_message("user").write(prompt)
        with st.chat_message("assistant"):
            execute_prompt(con, chain, prompt)

st.divider()

'''
Data sources: https://source.coop/fiboa | 
Data License: CC-BY-SA-4.0 | 
Software License: BSD
'''