Commit f8f34c2
Parent(s): 991fcdc

Added more guidance in Readme. Now wipes variables on click to create or summarise topics

Files changed:
- README.md +11 -4
- app.py +14 -14
- tools/helper_functions.py +29 -0
- tools/llm_api_call.py +8 -15
README.md
CHANGED
@@ -5,7 +5,7 @@ colorFrom: purple
 colorTo: yellow
 sdk: gradio
 app_file: app.py
-pinned:
+pinned: true
 license: cc-by-nc-4.0
 ---
 
@@ -13,8 +13,15 @@ license: cc-by-nc-4.0
 
 Extract topics and summarise outputs using Large Language Models (LLMs, Gemini Flash/Pro, or Claude 3 through AWS Bedrock if running on AWS). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and relevant text rows related to them. The prompts are designed for topic modelling public consultations, but they can be adapted to different contexts (see the LLM settings tab to modify).
 
-You can use an AWS Bedrock model (Claude 3, paid), or Gemini (a free API, but with strict limits for the Pro model). Due to the strict API limits for the best model (Pro 1.5), the use of Gemini requires an API key. To set up your own Gemini API key, go here: https://aistudio.google.com/app/u/1/plan_information.
+You can use an AWS Bedrock model (Claude 3, paid), or Gemini (a free API, but with strict limits for the Pro model). Due to the strict API limits for the best model (Pro 1.5), the use of Gemini requires an API key. To set up your own Gemini API key, go here: https://aistudio.google.com/app/u/1/plan_information.
 
-NOTE: **API calls to Gemini are not considered secure**, so please only submit redacted, non-sensitive tabular files to this source.
+NOTE: **API calls to Gemini are not considered secure**, so please only submit redacted, non-sensitive tabular files to this source. Also, large language models are not 100% accurate and may produce biased or harmful outputs. All outputs from this app **absolutely need to be checked by a human** for harmful content, hallucinations, and accuracy.
 
-
+Basic use:
+1. Upload a csv/xlsx/parquet file containing at least one open text column.
+2. Select the relevant open text column from the dropdown.
+3. If you have your own suggested (zero shot) topics, upload this file (see the examples folder for an example file).
+4. Write a one sentence description of the consultation/context of the open text.
+5. Extract topics.
+6. If topic extraction fails part way through, you can upload the latest 'reference_table' and 'unique_topics_table' csv outputs on the 'Continue previous topic extraction' tab to continue from where you left off.
+7. Summaries will be produced for each topic for each 'batch' of responses. If you want consolidated summaries, go to the 'Summarise topic outputs' tab, upload your output reference_table and unique_topics csv files, and press summarise.
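For step 3, the expected shape of the zero-shot topics file follows from the candidate_topics label added in app.py below: a single column with a header, and one topic keyword per row. A minimal sketch of producing such a file with pandas; the filename, header, and topics here are illustrative, not the repository's example file:

import pandas as pd

# Hypothetical contents: one header, one candidate topic per row.
zero_shot_topics = pd.DataFrame({"Topic": ["Road safety", "Public transport", "Air quality"]})
zero_shot_topics.to_csv("example_zero_shot_topics.csv", index=False)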
app.py
CHANGED
@@ -1,6 +1,6 @@
 import os
 import socket
-from tools.helper_functions import ensure_output_folder_exists, add_folder_to_path, put_columns_in_df, get_connection_params, output_folder, get_or_create_env_var, reveal_feedback_buttons, wipe_logs, model_full_names, view_table
+from tools.helper_functions import ensure_output_folder_exists, add_folder_to_path, put_columns_in_df, get_connection_params, output_folder, get_or_create_env_var, reveal_feedback_buttons, wipe_logs, model_full_names, view_table, empty_output_vars_extract_topics, empty_output_vars_summarise
 from tools.aws_functions import upload_file_to_s3
 from tools.llm_api_call import llm_query, load_in_data_file, load_in_previous_data_files, sample_reference_table_summaries, summarise_output_topics
 from tools.auth import authenticate_user
@@ -69,13 +69,11 @@ with app:
     gr.Markdown(
     """# Large language model topic modelling
 
-    Extract topics and summarise outputs using Large Language Models (LLMs, Gemini Flash/Pro, or Claude 3 through AWS Bedrock if running on AWS). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and relevant text rows related to them. The prompts are designed for topic modelling public consultations, but they can be adapted to different contexts (see the LLM settings tab to modify).
+    Extract topics and summarise outputs using Large Language Models (LLMs, Gemini Flash/Pro, or Claude 3 through AWS Bedrock if running on AWS). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and relevant text rows related to them. The prompts are designed for topic modelling public consultations, but they can be adapted to different contexts (see the LLM settings tab to modify). Instructions on use can be found in the README.md file.
 
-    You can use an AWS Bedrock model (Claude 3, paid), or Gemini (a free API, but with strict limits for the Pro model). Due to the strict API limits for the best model (Pro 1.5), the use of Gemini requires an API key. To set up your own Gemini API key, go here: https://aistudio.google.com/app/u/1/plan_information.
-
-    NOTE: **API calls to Gemini are not considered secure**, so please only submit redacted, non-sensitive tabular files to this source. AWS Bedrock API calls are considered to be secure.
+    You can use an AWS Bedrock model (Claude 3, paid), or Gemini (a free API, but with strict limits for the Pro model). Due to the strict API limits for the best model (Pro 1.5), the use of Gemini requires an API key. To set up your own Gemini API key, go here: https://aistudio.google.com/app/u/1/plan_information.
 
-
+    NOTE: **API calls to Gemini are not considered secure**, so please only submit redacted, non-sensitive tabular files to this source. Also, large language models are not 100% accurate and may produce biased or harmful outputs. All outputs from this app **absolutely need to be checked by a human** for harmful content, hallucinations, and accuracy.""")
 
     with gr.Tab(label="Extract topics"):
         gr.Markdown(
@@ -94,7 +92,7 @@ with app:
         in_colnames = gr.Dropdown(choices=["Choose column with responses"], multiselect = False, label="Select column that contains the responses (showing columns present across all files).", allow_custom_value=True, interactive=True)
 
         with gr.Accordion("I have my own list of topics (zero shot topic modelling).", open = False):
-            candidate_topics = gr.File(label="Input topics from file (csv). File should have
+            candidate_topics = gr.File(label="Input topics from file (csv). File should have a single column with a header, and all topic keywords below.")
 
         context_textbox = gr.Textbox(label="Write a short description (one sentence or less) giving context to the large language model about your consultation and any relevant context")
 
@@ -197,7 +195,8 @@ with app:
     # Tabular data upload
    in_data_files.upload(fn=put_columns_in_df, inputs=[in_data_files], outputs=[in_colnames, in_excel_sheets, data_file_names_textbox])
 
-    extract_topics_btn.click(load_in_data_file,
+    extract_topics_btn.click(fn=empty_output_vars_extract_topics, inputs=None, outputs=[master_topic_df_state, master_unique_topics_df_state, master_reference_df_state, text_output_file, text_output_file_list_state, latest_batch_completed, log_files_output, log_files_output_list_state, conversation_metadata_textbox, estimated_time_taken_number]).\
+    then(load_in_data_file,
     inputs = [in_data_files, in_colnames, batch_size_number], outputs = [file_data_state, data_file_names_textbox, total_number_of_batches], api_name="load_data").then(\
         fn=llm_query,
         inputs=[file_data_state, master_topic_df_state, master_reference_df_state, master_unique_topics_df_state, text_output_summary, data_file_names_textbox, total_number_of_batches, in_api_key, temperature_slide, in_colnames, model_choice, candidate_topics, latest_batch_completed, text_output_summary, text_output_file_list_state, log_files_output_list_state, first_loop_state, conversation_metadata_textbox, initial_table_prompt_textbox, prompt_2_textbox, prompt_3_textbox, system_prompt_textbox, add_to_existing_topics_system_prompt_textbox, add_to_existing_topics_prompt_textbox, number_of_prompts, batch_size_number, context_textbox, estimated_time_taken_number],
@@ -210,18 +209,19 @@ with app:
         then(fn = reveal_feedback_buttons,
             outputs=[data_feedback_radio, data_further_details_text, data_submit_feedback_btn, data_feedback_title], scroll_to_output=True)
 
-    # If uploaded partially completed consultation files do this. This should then start up the 'latest_batch_completed' change action above to continue extracting topics.
-    continue_previous_data_files_btn.click(
-        load_in_data_file, inputs = [in_data_files, in_colnames, batch_size_number], outputs = [file_data_state, data_file_names_textbox, total_number_of_batches]).\
-        then(load_in_previous_data_files, inputs=[in_previous_data_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed, in_previous_data_files_status, data_file_names_textbox])
-
     # When button pressed, summarise previous data
-    summarise_previous_data_btn.click(
+    summarise_previous_data_btn.click(empty_output_vars_summarise, inputs=None, outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox]).\
+        then(load_in_previous_data_files, inputs=[summarisation_in_previous_data_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, summarisation_in_previous_data_files_status, data_file_names_textbox]).\
         then(sample_reference_table_summaries, inputs=[master_reference_df_state, master_unique_topics_df_state, random_seed], outputs=[summary_reference_table_sample_state, summarised_references_markdown, master_reference_df_state, master_unique_topics_df_state]).\
         then(summarise_output_topics, inputs=[summary_reference_table_sample_state, master_unique_topics_df_state, master_reference_df_state, model_choice, in_api_key, summarised_references_markdown, temperature_slide, data_file_names_textbox, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox], outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox])
 
     latest_summary_completed_num.change(summarise_output_topics, inputs=[summary_reference_table_sample_state, master_unique_topics_df_state, master_reference_df_state, model_choice, in_api_key, summarised_references_markdown, temperature_slide, data_file_names_textbox, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox], outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox])
 
+    # If uploaded partially completed consultation files do this. This should then start up the 'latest_batch_completed' change action above to continue extracting topics.
+    continue_previous_data_files_btn.click(
+        load_in_data_file, inputs = [in_data_files, in_colnames, batch_size_number], outputs = [file_data_state, data_file_names_textbox, total_number_of_batches]).\
+        then(load_in_previous_data_files, inputs=[in_previous_data_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed, in_previous_data_files_status, data_file_names_textbox])
+
     ###
     # LOGGING AND ON APP LOAD FUNCTIONS
     ###
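The new wiring above leans on Gradio's event chaining: Button.click() returns an event that .then() extends, so the wipe function finishes before the long-running load and query steps begin. A minimal, self-contained sketch of that click(...).then(...) pattern; the components and functions here are illustrative stand-ins, not the app's own:

import gradio as gr

def reset_outputs():
    # Plays the role of empty_output_vars_extract_topics: one blank value
    # per output component, so everything clears in a single step.
    return "", 0

def run_job(text):
    # Stand-in for the real load/extract pipeline.
    return f"Processed: {text}", 1

with gr.Blocks() as demo:
    text_in = gr.Textbox(label="Input")
    status = gr.Textbox(label="Status")
    batches_done = gr.Number(label="Batches completed")
    run_btn = gr.Button("Run")

    # Wipe first, then run the job, mirroring extract_topics_btn above.
    run_btn.click(reset_outputs, inputs=None, outputs=[status, batches_done]).\
        then(run_job, inputs=[text_in], outputs=[status, batches_done])

demo.launch()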
tools/helper_functions.py
CHANGED
@@ -3,6 +3,35 @@ import gradio as gr
 import pandas as pd
 
 
+def empty_output_vars_extract_topics():
+    # Empty output objects before processing a new file
+
+    master_topic_df_state = pd.DataFrame()
+    master_unique_topics_df_state = pd.DataFrame()
+    master_reference_df_state = pd.DataFrame()
+    text_output_file = []
+    text_output_file_list_state = []
+    latest_batch_completed = 0
+    log_files_output = []
+    log_files_output_list_state = []
+    conversation_metadata_textbox = ""
+    estimated_time_taken_number = 0
+
+    return master_topic_df_state, master_unique_topics_df_state, master_reference_df_state, text_output_file, text_output_file_list_state, latest_batch_completed, log_files_output, log_files_output_list_state, conversation_metadata_textbox, estimated_time_taken_number
+
+def empty_output_vars_summarise():
+    # Empty output objects before summarising files
+
+    summary_reference_table_sample_state = pd.DataFrame()
+    master_unique_topics_df_revised_summaries_state = pd.DataFrame()
+    master_reference_df_revised_summaries_state = pd.DataFrame()
+    summary_output_files = []
+    summarised_outputs_list = []
+    latest_summary_completed_num = 0
+    conversation_metadata_textbox = ""
+
+    return summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox
+
 
 def get_or_create_env_var(var_name, default_value):
     # Get the environment variable if it exists
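These helpers work because Gradio assigns a handler's return values positionally to the components in its outputs list, so the tuple order here must match the outputs=[...] order wired up in app.py. A quick illustrative check of that contract, assuming the repository layout above:

import pandas as pd
from tools.helper_functions import empty_output_vars_extract_topics

values = empty_output_vars_extract_topics()
assert len(values) == 10  # one value per output component in app.py
assert isinstance(values[0], pd.DataFrame) and values[0].empty  # master_topic_df_state
assert values[5] == 0  # latest_batch_completed restarts from batch zero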
tools/llm_api_call.py
CHANGED
@@ -1242,8 +1242,6 @@ def deduplicate_categories(category_series: pd.Series, join_series:pd.Series, threshold
         if category in deduplication_map:
             continue
 
-        print("old_category:", category)
-
         # Find close matches to the current category, excluding the current category itself
         matches = process.extract(category, [cat for cat in category_series.unique() if cat != category], scorer=fuzz.token_set_ratio, score_cutoff=threshold)
 
@@ -1251,7 +1249,7 @@ def deduplicate_categories(category_series: pd.Series, join_series:pd.Series, threshold
         if matches: # Check if there are any matches
             best_match = max(matches, key=lambda x: x[1]) # Get the match with the highest score
             match, score, _ = best_match # Unpack the best match
-            print("Best match:", match, "score:", score)
+            #print("Best match:", match, "score:", score)
             deduplication_map[match] = category # Map the best match to the current category
 
     # Create the result DataFrame
@@ -1282,8 +1280,6 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
 
     reference_df_unique = reference_df.drop_duplicates("old_category")
 
-    print("reference_df_unique_old_categories:", reference_df_unique["old_category"])
-
     reference_df_unique[["old_category"]].to_csv(output_folder + "reference_df_unique_old_categories_" + str(i) + ".csv", index=None)
 
     # Deduplicate categories within each sentiment group
@@ -1293,7 +1289,6 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
 
     if deduplicated_topic_map_df['deduplicated_category'].isnull().all():
         # Check if 'deduplicated_category' contains any values
-
         print("No deduplicated categories found, skipping the following code.")
 
     else:
@@ -1301,7 +1296,7 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
         # Remove rows where 'deduplicated_category' is blank or NaN
         deduplicated_topic_map_df = deduplicated_topic_map_df.loc[(deduplicated_topic_map_df['deduplicated_category'].str.strip() != '') & ~(deduplicated_topic_map_df['deduplicated_category'].isnull()), :]
 
-        deduplicated_topic_map_df.to_csv(output_folder + "deduplicated_topic_map_df_" + str(i) + ".csv", index=None)
+        #deduplicated_topic_map_df.to_csv(output_folder + "deduplicated_topic_map_df_" + str(i) + ".csv", index=None)
 
         reference_df = reference_df.merge(deduplicated_topic_map_df, on="old_category", how="left")
 
@@ -1314,7 +1309,7 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
     reference_df["Subtopic"] = reference_df["deduplicated_category"].combine_first(reference_df["Subtopic_old"])
     reference_df["Sentiment"] = reference_df["Sentiment"].combine_first(reference_df["Sentiment_old"])
 
-    reference_df.to_csv(output_folder + "reference_table_after_dedup.csv", index=None)
+    #reference_df.to_csv(output_folder + "reference_table_after_dedup.csv", index=None)
 
     reference_df.drop(['old_category', 'deduplicated_category', "Subtopic_old", "Sentiment_old"], axis=1, inplace=True, errors="ignore")
 
@@ -1324,8 +1319,6 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
     reference_df["Subtopic"] = reference_df["Subtopic"].str.lower().str.capitalize()
     reference_df["Sentiment"] = reference_df["Sentiment"].str.lower().str.capitalize()
 
-
-
     # Remake unique_topics_df based on new reference_df
     unique_topics_df = create_unique_table_df_from_reference_table(reference_df)
 
@@ -1351,7 +1344,7 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
 
     all_summaries = pd.concat([all_summaries, filtered_reference_df_unique_sampled])
 
-    all_summaries.to_csv(output_folder + "all_summaries.csv", index=None)
+    #all_summaries.to_csv(output_folder + "all_summaries.csv", index=None)
 
     summarised_references = all_summaries.groupby(["General Topic", "Subtopic", "Sentiment"]).agg({
         'Response References': 'size', # Count the number of references
@@ -1360,7 +1353,7 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
 
     summarised_references = summarised_references.loc[(summarised_references["Sentiment"] != "Not Mentioned") & (summarised_references["Response References"] > 1)]
 
-    summarised_references.to_csv(output_folder + "summarised_references.csv", index=None)
+    #summarised_references.to_csv(output_folder + "summarised_references.csv", index=None)
 
     summarised_references_markdown = summarised_references.to_markdown(index=False)
 
@@ -1420,8 +1413,8 @@ def summarise_output_topics(summarised_references:pd.DataFrame,
 
     length_all_summaries = len(all_summaries)
 
-    print("latest_summary_completed:", latest_summary_completed)
-    print("length_all_summaries:", length_all_summaries)
+    #print("latest_summary_completed:", latest_summary_completed)
+    #print("length_all_summaries:", length_all_summaries)
 
     if latest_summary_completed >= length_all_summaries:
         print("All summaries completed. Creating outputs.")
@@ -1463,7 +1456,7 @@ def summarise_output_topics(summarised_references:pd.DataFrame,
     unique_table_df_revised_path = output_folder + batch_file_path_details + "_summarised_unique_topic_table_" + model_choice_clean + ".csv"
     unique_table_df_revised.to_csv(unique_table_df_revised_path, index = None)
 
-    reference_table_df_revised_path = output_folder + batch_file_path_details + "
+    reference_table_df_revised_path = output_folder + batch_file_path_details + "_summarised_reference_table_" + model_choice_clean + ".csv"
    reference_table_df_revised.to_csv(reference_table_df_revised_path, index = None)
 
     output_files.extend([reference_table_df_revised_path, unique_table_df_revised_path])
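For context on the deduplicate_categories hunk above: it relies on rapidfuzz's process.extract with the fuzz.token_set_ratio scorer, which returns (match, score, index) tuples above a score cutoff. A standalone sketch of the same idea, with illustrative categories and threshold:

import pandas as pd
from rapidfuzz import fuzz, process

category_series = pd.Series(["Road safety", "Road safety issues", "Cycle lanes"])
threshold = 90  # illustrative cutoff; the app passes this in as a parameter
deduplication_map = {}

for category in category_series.unique():
    if category in deduplication_map:
        continue
    # Close matches to the current category, excluding the category itself.
    others = [cat for cat in category_series.unique() if cat != category]
    matches = process.extract(category, others, scorer=fuzz.token_set_ratio, score_cutoff=threshold)
    if matches:
        match, score, _ = max(matches, key=lambda x: x[1])
        deduplication_map[match] = category  # fold the near-duplicate into this category

print(deduplication_map)  # {'Road safety issues': 'Road safety'}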