topic_modelling / funcs / topic_core_funcs.py

Commit History

Re-added TruncatedSVD dependency to topic_core_funcs.py
f42e3d1

seanpedrickcase committed on

Debugged reference to random_seed in vectorisation and reference to torch in representation_model.py
8216d8c

seanpedrickcase committed on

Importing space package near start of app now to avoid issue with cuda being initialised before
9e84863

seanpedrickcase committed on

Rearranged functions for embeddings creation to be compatible with zero GPU space. Updated packages.
cc495e1

seanpedrickcase committed on

Added example of how to run function from command line. Updated packages. Embedding model default now smaller and at fp16.
34f1e83

seanpedrickcase committed on

Improved initial clean options. Now has option to return embeddings only.
89c4d20

seanpedrickcase committed on

App now retains original index following cleaning to allow for referring back to original data
90553eb

seanpedrickcase committed on

Allowed for app running on AWS to use smaller embedding model and not to load representation LLM (due to size restrictions).
22ca76e

seanpedrickcase committed on

Only aggregate topics not 'other', allowed for minimum sentence length, default max_topics now will auto aggregate topics. Added Cognito Auth functionality (boto3 with AWS).
1e2bb3e

seanpedrickcase committed on

Can split passages into sentences. Improved embedding, LLM representation models, improved zero shot capabilities
55f0ce3

seanpedrickcase committed on

Updated packages. Improve hierarchy vis. Better models - mixedbread and phi3. Now option to split texts into sentences before modelling.
04a15c5

seanpedrickcase committed on

Minor cleaning, csv formatting changes
d80c8f5

Sean-Case committed on

Reduce outliers now more efficient and relabels with correct vectoriser. Default topic labels now tidier. Hierarchical topic outputs more useful for joining to df afterwards. Switched low resource reduction algorithm to UMAP as default is not good.
e1c1f68

Sonnyjim committed on

Should now parse custom regex correctly. Will now wipe previously created embeddings if 'low resource mode' option switched.
0a543a0

Sean-Case committed on

Allowed for uploading custom regex for cleaning. Fixed calculate all probabilities, reduce outliers. Added text tree for hierarchical modelling.
381f959

Sonnyjim committed on

LLM model save is failing in Huggingface - attempting instead to save to base folder
c2bf185

Sean-Case committed on

Some text changes. Fixed a couple of TF-IDF embeddings issues
87306c7

Sean-Case committed on

Added clean data options, improved re-representation options and visualisation. General format changes
4effac0

Sonnyjim committed on