seanpedrickcase committed
Commit f8f34c2 · 1 parent: 991fcdc

Added more guidance in Readme. Now wipes variables on click to create or summarise topics

Files changed (4):
  1. README.md +11 -4
  2. app.py +14 -14
  3. tools/helper_functions.py +29 -0
  4. tools/llm_api_call.py +8 -15
README.md CHANGED
@@ -5,7 +5,7 @@ colorFrom: purple
 colorTo: yellow
 sdk: gradio
 app_file: app.py
-pinned: false
+pinned: true
 license: cc-by-nc-4.0
 ---

@@ -13,8 +13,15 @@ license: cc-by-nc-4.0
 
 Extract topics and summarise outputs using Large Language Models (LLMs, Gemini Flash/Pro, or Claude 3 through AWS Bedrock if running on AWS). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and relevant text rows related to them. The prompts are designed for topic modelling public consultations, but they can be adapted to different contexts (see the LLM settings tab to modify).
 
-You can use an AWS Bedrock model (Claude 3, paid), or Gemini (a free API, but with strict limits for the Pro model). Due to the strict API limits for the best model (Pro 1.5), the use of Gemini requires an API key. To set up your own Gemini API key, go here: https://aistudio.google.com/app/u/1/plan_information.
+You can use an AWS Bedrock model (Claude 3, paid), or Gemini (a free API, but with strict limits for the Pro model). Due to the strict API limits for the best model (Pro 1.5), the use of Gemini requires an API key. To set up your own Gemini API key, go here: https://aistudio.google.com/app/u/1/plan_information.
 
-NOTE: that **API calls to Gemini are not considered secure**, so please only submit redacted, non-sensitive tabular files to this source. AWS Bedrock API calls are considered to be secure.
+NOTE: that **API calls to Gemini are not considered secure**, so please only submit redacted, non-sensitive tabular files to this source. Also, large language models are not 100% accurate and may produce biased or harmful outputs. All outputs from this app **absolutely need to be checked by a human** to check for harmful outputs, hallucinations, and accuracy.
 
-Large language models are not 100% accurate and may produce biased or harmful outputs. All outputs from this app **absolutely need to be checked by a human** to check for harmful outputs, hallucinations, and accuracy.
+Basic use:
+1. Upload a csv/xlsx/parquet file containing at least one open text column.
+2. Select the relevant open text column from the dropdown.
+3. If you have your own suggested (zero shot) topics, upload this (see examples folder for an example file)
+4. Write a one sentence description of the consultation/context of the open text.
+5. Extract topics.
+6. If topic extraction fails part way through, you can upload the latest 'reference_table' and 'unique_topics_table' csv outputs on the 'Continue previous topic extraction' tab to continue from where you left off.
+7. Summaries will be produced for each topic for each 'batch' of responses. If you want consolidated summaries, go to the tab 'Summarise topic outputs', upload your output reference_table and unique_topics csv files, and press summarise.
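
The zero shot topics file referred to in step 3 needs only one headed column with a topic per row. A minimal sketch of producing such a file with pandas (the topic names and filename are hypothetical, not the contents of the repo's examples folder):

```python
import pandas as pd

# Hypothetical candidate topics; the app reads them from the file's single headed column.
topics = pd.DataFrame({"Topic": ["Street lighting", "Road safety", "Green spaces"]})
topics.to_csv("example_candidate_topics.csv", index=False)
```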
app.py CHANGED
@@ -1,6 +1,6 @@
 import os
 import socket
-from tools.helper_functions import ensure_output_folder_exists, add_folder_to_path, put_columns_in_df, get_connection_params, output_folder, get_or_create_env_var, reveal_feedback_buttons, wipe_logs, model_full_names, view_table
+from tools.helper_functions import ensure_output_folder_exists, add_folder_to_path, put_columns_in_df, get_connection_params, output_folder, get_or_create_env_var, reveal_feedback_buttons, wipe_logs, model_full_names, view_table, empty_output_vars_extract_topics, empty_output_vars_summarise
 from tools.aws_functions import upload_file_to_s3
 from tools.llm_api_call import llm_query, load_in_data_file, load_in_previous_data_files, sample_reference_table_summaries, summarise_output_topics
 from tools.auth import authenticate_user

@@ -69,13 +69,11 @@ with app:
     gr.Markdown(
     """# Large language model topic modelling
 
-    Extract topics and summarise outputs using Large Language Models (LLMs, Gemini Flash/Pro, or Claude 3 through AWS Bedrock if running on AWS). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and relevant text rows related to them. The prompts are designed for topic modelling public consultations, but they can be adapted to different contexts (see the LLM settings tab to modify).
+    Extract topics and summarise outputs using Large Language Models (LLMs, Gemini Flash/Pro, or Claude 3 through AWS Bedrock if running on AWS). The app will query the LLM with batches of responses to produce summary tables, which are then compared iteratively to output a table with the general topics, subtopics, topic sentiment, and relevant text rows related to them. The prompts are designed for topic modelling public consultations, but they can be adapted to different contexts (see the LLM settings tab to modify). Instructions on use can be found in the README.md file.
 
-    You can use an AWS Bedrock model (Claude 3, paid), or Gemini (a free API, but with strict limits for the Pro model). Due to the strict API limits for the best model (Pro 1.5), the use of Gemini requires an API key. To set up your own Gemini API key, go here: https://aistudio.google.com/app/u/1/plan_information.
-
-    NOTE: that **API calls to Gemini are not considered secure**, so please only submit redacted, non-sensitive tabular files to this source. AWS Bedrock API calls are considered to be secure.
+    You can use an AWS Bedrock model (Claude 3, paid), or Gemini (a free API, but with strict limits for the Pro model). Due to the strict API limits for the best model (Pro 1.5), the use of Gemini requires an API key. To set up your own Gemini API key, go here: https://aistudio.google.com/app/u/1/plan_information.
 
-    Large language models are not 100% accurate and may produce biased or harmful outputs. All outputs from this app **absolutely need to be checked by a human** to check for harmful outputs, hallucinations, and accuracy.""")
+    NOTE: that **API calls to Gemini are not considered secure**, so please only submit redacted, non-sensitive tabular files to this source. Also, large language models are not 100% accurate and may produce biased or harmful outputs. All outputs from this app **absolutely need to be checked by a human** to check for harmful outputs, hallucinations, and accuracy.""")
 
     with gr.Tab(label="Extract topics"):
         gr.Markdown(

@@ -94,7 +92,7 @@ with app:
     in_colnames = gr.Dropdown(choices=["Choose column with responses"], multiselect = False, label="Select column that contains the responses (showing columns present across all files).", allow_custom_value=True, interactive=True)
 
     with gr.Accordion("I have my own list of topics (zero shot topic modelling).", open = False):
-        candidate_topics = gr.File(label="Input topics from file (csv). File should have at least one column with a header and topic keywords in cells below. Topics will be taken from the first column of the file.")
+        candidate_topics = gr.File(label="Input topics from file (csv). File should have a single column with a header, and all topic keywords below.")
 
     context_textbox = gr.Textbox(label="Write a short description (one sentence of less) giving context to the large language model about the your consultation and any relevant context")
 
@@ -197,7 +195,8 @@ with app:
     # Tabular data upload
     in_data_files.upload(fn=put_columns_in_df, inputs=[in_data_files], outputs=[in_colnames, in_excel_sheets, data_file_names_textbox])
 
-    extract_topics_btn.click(load_in_data_file,
+    extract_topics_btn.click(fn=empty_output_vars_extract_topics, inputs=None, outputs=[master_topic_df_state, master_unique_topics_df_state, master_reference_df_state, text_output_file, text_output_file_list_state, latest_batch_completed, log_files_output, log_files_output_list_state, conversation_metadata_textbox, estimated_time_taken_number]).\
+    then(load_in_data_file,
     inputs = [in_data_files, in_colnames, batch_size_number], outputs = [file_data_state, data_file_names_textbox, total_number_of_batches], api_name="load_data").then(\
     fn=llm_query,
     inputs=[file_data_state, master_topic_df_state, master_reference_df_state, master_unique_topics_df_state, text_output_summary, data_file_names_textbox, total_number_of_batches, in_api_key, temperature_slide, in_colnames, model_choice, candidate_topics, latest_batch_completed, text_output_summary, text_output_file_list_state, log_files_output_list_state, first_loop_state, conversation_metadata_textbox, initial_table_prompt_textbox, prompt_2_textbox, prompt_3_textbox, system_prompt_textbox, add_to_existing_topics_system_prompt_textbox, add_to_existing_topics_prompt_textbox, number_of_prompts, batch_size_number, context_textbox, estimated_time_taken_number],

@@ -210,18 +209,19 @@ with app:
     then(fn = reveal_feedback_buttons,
     outputs=[data_feedback_radio, data_further_details_text, data_submit_feedback_btn, data_feedback_title], scroll_to_output=True)
 
-    # If uploaded partially completed consultation files do this. This should then start up the 'latest_batch_completed' change action above to continue extracting topics.
-    continue_previous_data_files_btn.click(
-    load_in_data_file, inputs = [in_data_files, in_colnames, batch_size_number], outputs = [file_data_state, data_file_names_textbox, total_number_of_batches]).\
-    then(load_in_previous_data_files, inputs=[in_previous_data_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed, in_previous_data_files_status, data_file_names_textbox])
-
     # When button pressed, summarise previous data
-    summarise_previous_data_btn.click(load_in_previous_data_files, inputs=[summarisation_in_previous_data_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, summarisation_in_previous_data_files_status, data_file_names_textbox]).\
+    summarise_previous_data_btn.click(empty_output_vars_summarise, inputs=None, outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox]).\
+    then(load_in_previous_data_files, inputs=[summarisation_in_previous_data_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed_no_loop, summarisation_in_previous_data_files_status, data_file_names_textbox]).\
     then(sample_reference_table_summaries, inputs=[master_reference_df_state, master_unique_topics_df_state, random_seed], outputs=[summary_reference_table_sample_state, summarised_references_markdown, master_reference_df_state, master_unique_topics_df_state]).\
    then(summarise_output_topics, inputs=[summary_reference_table_sample_state, master_unique_topics_df_state, master_reference_df_state, model_choice, in_api_key, summarised_references_markdown, temperature_slide, data_file_names_textbox, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox], outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox])
 
     latest_summary_completed_num.change(summarise_output_topics, inputs=[summary_reference_table_sample_state, master_unique_topics_df_state, master_reference_df_state, model_choice, in_api_key, summarised_references_markdown, temperature_slide, data_file_names_textbox, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox], outputs=[summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox])
 
+    # If uploaded partially completed consultation files do this. This should then start up the 'latest_batch_completed' change action above to continue extracting topics.
+    continue_previous_data_files_btn.click(
+    load_in_data_file, inputs = [in_data_files, in_colnames, batch_size_number], outputs = [file_data_state, data_file_names_textbox, total_number_of_batches]).\
+    then(load_in_previous_data_files, inputs=[in_previous_data_files], outputs=[master_reference_df_state, master_unique_topics_df_state, latest_batch_completed, in_previous_data_files_status, data_file_names_textbox])
+
     ###
     # LOGGING AND ON APP LOAD FUNCTIONS
     ###
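
The main behavioural change above is that extract_topics_btn and summarise_previous_data_btn now wipe their accumulated state before doing any work, by chaining a reset function ahead of the main job. A standalone sketch of that Gradio pattern, using simplified hypothetical state names rather than the app's real components:

```python
import gradio as gr
import pandas as pd

def empty_outputs():
    # Reset accumulated outputs before a fresh run, one return value per output component
    return pd.DataFrame(), [], 0

def run_job(df, file_list, batches_done):
    # Stand-in for the real load_in_data_file -> llm_query chain
    return f"Starting clean: {len(df)} rows, {len(file_list)} files, batch {batches_done}"

with gr.Blocks() as demo:
    df_state = gr.State(pd.DataFrame())
    file_list_state = gr.State([])
    batches_done = gr.Number(value=0)
    status = gr.Textbox()
    extract_btn = gr.Button("Extract topics")

    # Wipe first, then run: the same .click(...).then(...) chaining used in app.py
    extract_btn.click(empty_outputs, inputs=None, outputs=[df_state, file_list_state, batches_done]).\
        then(run_job, inputs=[df_state, file_list_state, batches_done], outputs=[status])

demo.launch()
```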
tools/helper_functions.py CHANGED
@@ -3,6 +3,35 @@ import gradio as gr
 import pandas as pd
 
 
+def empty_output_vars_extract_topics():
+    # Empty output objects before processing a new file
+
+    master_topic_df_state = pd.DataFrame()
+    master_unique_topics_df_state = pd.DataFrame()
+    master_reference_df_state = pd.DataFrame()
+    text_output_file = []
+    text_output_file_list_state = []
+    latest_batch_completed = 0
+    log_files_output = []
+    log_files_output_list_state = []
+    conversation_metadata_textbox = ""
+    estimated_time_taken_number = 0
+
+    return master_topic_df_state, master_unique_topics_df_state, master_reference_df_state, text_output_file, text_output_file_list_state, latest_batch_completed, log_files_output, log_files_output_list_state, conversation_metadata_textbox, estimated_time_taken_number
+
+def empty_output_vars_summarise():
+    # Empty output objects before summarising files
+
+    summary_reference_table_sample_state = pd.DataFrame()
+    master_unique_topics_df_revised_summaries_state = pd.DataFrame()
+    master_reference_df_revised_summaries_state = pd.DataFrame()
+    summary_output_files = []
+    summarised_outputs_list = []
+    latest_summary_completed_num = 0
+    conversation_metadata_textbox = ""
+
+    return summary_reference_table_sample_state, master_unique_topics_df_revised_summaries_state, master_reference_df_revised_summaries_state, summary_output_files, summarised_outputs_list, latest_summary_completed_num, conversation_metadata_textbox
+
 
 def get_or_create_env_var(var_name, default_value):
     # Get the environment variable if it exists
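
These helpers work because Gradio assigns each value a handler returns to the matching entry of its outputs list, in order; returning an empty DataFrame, list, string, or zero therefore resets that component. A miniature of the same idea, with hypothetical names:

```python
import pandas as pd

def empty_two_vars():
    # Return order must mirror the outputs=[...] order in the event handler,
    # e.g. some_btn.click(empty_two_vars, inputs=None, outputs=[a_dataframe, a_textbox])
    return pd.DataFrame(), ""

df_reset, text_reset = empty_two_vars()
assert df_reset.empty and text_reset == ""
```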
tools/llm_api_call.py CHANGED
@@ -1242,8 +1242,6 @@ def deduplicate_categories(category_series: pd.Series, join_series:pd.Series, th
         if category in deduplication_map:
             continue
 
-        print("old_category:", category)
-
         # Find close matches to the current category, excluding the current category itself
         matches = process.extract(category, [cat for cat in category_series.unique() if cat != category], scorer=fuzz.token_set_ratio, score_cutoff=threshold)
 
@@ -1251,7 +1249,7 @@ def deduplicate_categories(category_series: pd.Series, join_series:pd.Series, th
         if matches: # Check if there are any matches
             best_match = max(matches, key=lambda x: x[1]) # Get the match with the highest score
             match, score, _ = best_match # Unpack the best match
-            print("Best match:", match, "score:", score)
+            #print("Best match:", match, "score:", score)
             deduplication_map[match] = category # Map the best match to the current category
 
     # Create the result DataFrame
@@ -1282,8 +1280,6 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
 
     reference_df_unique = reference_df.drop_duplicates("old_category")
 
-    print("reference_df_unique_old_categories:", reference_df_unique["old_category"])
-
     reference_df_unique[["old_category"]].to_csv(output_folder + "reference_df_unique_old_categories_" + str(i) + ".csv", index=None)
 
     # Deduplicate categories within each sentiment group
@@ -1293,7 +1289,6 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
 
     if deduplicated_topic_map_df['deduplicated_category'].isnull().all():
         # Check if 'deduplicated_category' contains any values
-
         print("No deduplicated categories found, skipping the following code.")
 
     else:
@@ -1301,7 +1296,7 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
         # Remove rows where 'deduplicated_category' is blank or NaN
         deduplicated_topic_map_df = deduplicated_topic_map_df.loc[(deduplicated_topic_map_df['deduplicated_category'].str.strip() != '') & ~(deduplicated_topic_map_df['deduplicated_category'].isnull()), :]
 
-        deduplicated_topic_map_df.to_csv(output_folder + "deduplicated_topic_map_df_" + str(i) + ".csv", index=None)
+        #deduplicated_topic_map_df.to_csv(output_folder + "deduplicated_topic_map_df_" + str(i) + ".csv", index=None)
 
         reference_df = reference_df.merge(deduplicated_topic_map_df, on="old_category", how="left")
 
@@ -1314,7 +1309,7 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
     reference_df["Subtopic"] = reference_df["deduplicated_category"].combine_first(reference_df["Subtopic_old"])
     reference_df["Sentiment"] = reference_df["Sentiment"].combine_first(reference_df["Sentiment_old"])
 
-    reference_df.to_csv(output_folder + "reference_df_after_dedup.csv", index=None)
+    #reference_df.to_csv(output_folder + "reference_table_after_dedup.csv", index=None)
 
     reference_df.drop(['old_category', 'deduplicated_category', "Subtopic_old", "Sentiment_old"], axis=1, inplace=True, errors="ignore")
 
@@ -1324,8 +1319,6 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
     reference_df["Subtopic"] = reference_df["Subtopic"].str.lower().str.capitalize()
     reference_df["Sentiment"] = reference_df["Sentiment"].str.lower().str.capitalize()
 
-
-
     # Remake unique_topics_df based on new reference_df
     unique_topics_df = create_unique_table_df_from_reference_table(reference_df)
 
@@ -1351,7 +1344,7 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
 
     all_summaries = pd.concat([all_summaries, filtered_reference_df_unique_sampled])
 
-    all_summaries.to_csv(output_folder + "all_summaries.csv", index=None)
+    #all_summaries.to_csv(output_folder + "all_summaries.csv", index=None)
 
     summarised_references = all_summaries.groupby(["General Topic", "Subtopic", "Sentiment"]).agg({
         'Response References': 'size', # Count the number of references
@@ -1360,7 +1353,7 @@ def sample_reference_table_summaries(reference_df:pd.DataFrame,
 
     summarised_references = summarised_references.loc[(summarised_references["Sentiment"] != "Not Mentioned") & (summarised_references["Response References"] > 1)]
 
-    summarised_references.to_csv(output_folder + "summarised_references.csv", index=None)
+    #summarised_references.to_csv(output_folder + "summarised_references.csv", index=None)
 
     summarised_references_markdown = summarised_references.to_markdown(index=False)
 
@@ -1420,8 +1413,8 @@ def summarise_output_topics(summarised_references:pd.DataFrame,
 
     length_all_summaries = len(all_summaries)
 
-    print("latest_summary_completed:", latest_summary_completed)
-    print("length_all_summaries:", length_all_summaries)
+    #print("latest_summary_completed:", latest_summary_completed)
+    #print("length_all_summaries:", length_all_summaries)
 
     if latest_summary_completed >= length_all_summaries:
         print("All summaries completed. Creating outputs.")
@@ -1463,7 +1456,7 @@ def summarise_output_topics(summarised_references:pd.DataFrame,
     unique_table_df_revised_path = output_folder + batch_file_path_details + "_summarised_unique_topic_table_" + model_choice_clean + ".csv"
     unique_table_df_revised.to_csv(unique_table_df_revised_path, index = None)
 
-    reference_table_df_revised_path = output_folder + batch_file_path_details + "_summarised_reference_df_table_" + model_choice_clean + ".csv"
+    reference_table_df_revised_path = output_folder + batch_file_path_details + "_summarised_reference_table_" + model_choice_clean + ".csv"
    reference_table_df_revised.to_csv(reference_table_df_revised_path, index = None)
 
     output_files.extend([reference_table_df_revised_path, unique_table_df_revised_path])
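
For context on the function the debug prints were trimmed from: deduplicate_categories merges near-duplicate topic labels using rapidfuzz's token_set_ratio. A self-contained sketch of the core loop (the sample topics and the threshold value are illustrative, not the app's defaults):

```python
import pandas as pd
from rapidfuzz import fuzz, process

def deduplicate_similar(categories: pd.Series, threshold: float = 80) -> dict:
    # Map each close fuzzy match to the first-seen category, as deduplicate_categories does
    deduplication_map = {}
    for category in categories.unique():
        if category in deduplication_map:
            continue
        candidates = [cat for cat in categories.unique() if cat != category]
        matches = process.extract(category, candidates, scorer=fuzz.token_set_ratio, score_cutoff=threshold)
        if matches:
            match, score, _ = max(matches, key=lambda x: x[1])
            deduplication_map[match] = category
    return deduplication_map

topics = pd.Series(["Road safety", "Safety road", "Green spaces"])
print(deduplicate_similar(topics))  # -> {'Safety road': 'Road safety'}
```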