seanpedrickcase committed on
Commit 20b4aa0 · 1 Parent(s): 19f22c9

Can deal with backslashes in metadata for semantic search load

README.md CHANGED
@@ -42,7 +42,7 @@ Toggle 'Clean text during load...' to "Yes" if you want to remove html tags and
 
 'Return intermediate files', when set to "Yes", will save a tokenised text file (for keyword search), or an embedded text file (for semantic search) during data preparation. These files can then be loaded in next time alongside the data files to save preparation time for future search sessions.
 
-'Round embeddings to three dp...' will reduce the precision of the embedding outputs to 3 decimal places, and will multiply all values by 100, reducing the size of the output numpy array by about 50%. It seems to have minimal effect on the output search result according to simple search comparisons, but I cannot guarantee this!
+'Round embeddings...' will reduce the precision of the embedding outputs to 3 decimal places, and will multiply all values by 100, reducing the size of the output numpy array by about 50%. It seems to have minimal effect on the output search result according to simple search comparisons, but I cannot guarantee this!
 
 ## Keyword search options
 Here are a few options to modify the BM25 search parameters. If you want more information on what each parameter does, click the relevant info button to the right of the sliders.
output/36de65711121889ccdcb768b85e97e386d8fe4bd/keyword_search_result_20240702_school.xlsm DELETED
Binary file (9.92 kB)
 
search_funcs/semantic_ingest_functions.py CHANGED
@@ -42,7 +42,7 @@ def combine_metadata_columns(df:PandasDataFrame, cols:List[str]) -> PandasSeries
     df['blank_column'] = ''
 
     for n, col in enumerate(cols):
-        df[col] = df[col].astype(str).str.replace('"',"'").str.replace('\n', ' ').str.replace('\r', ' ').str.replace('\r\n', ' ').str.cat(df['blank_column'].astype(str), sep="")
+        df[col] = df[col].astype(str).str.replace('"',"'").str.replace('\n', ' ').str.replace('\r', ' ').str.replace('\r\n', ' ').str.replace('\\', '/').str.cat(df['blank_column'].astype(str), sep="")
 
         df['metadata'] = df['metadata'] + '"' + cols[n] + '": "' + df[col] + '", '
 
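Note on the change above: the hunk builds each metadata entry as a `"column": "value"` pair by string concatenation, so a raw backslash in a value can end up as an invalid escape sequence. The sketch below illustrates this with the standard library only. It assumes the combined metadata string is later parsed as JSON, which this diff does not show. `clean_metadata_value` and `build_metadata` are hypothetical helpers, not functions from the repository, and the sketch replaces `'\r\n'` before the single characters, unlike the chain in the diff.

```python
import json


def clean_metadata_value(value: object, replace_backslashes: bool = True) -> str:
    """Hypothetical helper: apply the diff's replacements to a single value."""
    text = str(value).replace('"', "'")
    # Replace '\r\n' first so a Windows line ending becomes one space, not two.
    text = text.replace('\r\n', ' ').replace('\n', ' ').replace('\r', ' ')
    if replace_backslashes:
        # The replacement added in this commit: backslash -> forward slash.
        text = text.replace('\\', '/')
    return text


def build_metadata(row: dict, replace_backslashes: bool = True) -> str:
    """Hypothetical helper: join cleaned values into a JSON-like object string."""
    parts = [
        f'"{col}": "{clean_metadata_value(val, replace_backslashes)}"'
        for col, val in row.items()
    ]
    return '{' + ', '.join(parts) + '}'


row = {"source": 'C:\\Users\\docs\\file.txt', "title": 'A "quoted"\r\ntitle'}

# With the backslash replacement, the string parses cleanly.
fixed = json.loads(build_metadata(row))
print(fixed["source"])

# Without it, '\U' is not a valid JSON escape, so parsing fails.
try:
    json.loads(build_metadata(row, replace_backslashes=False))
except json.JSONDecodeError as err:
    print("unescaped backslash rejected:", err.msg)
```

Replacing backslashes with forward slashes is lossy but keeps Windows-style paths readable. Doubling each backslash instead would preserve the original value, at the cost of every consumer having to unescape it.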