---
tags:
- bertopic
library_name: bertopic
pipeline_tag: text-classification
---

# respapers_topics

This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model. 
BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets. 

This pre-trained model was built to demonstrate the use of a KeyBERT-inspired representation model within BERTopic.

The model was trained on roughly 30,000 research paper abstracts (29,961 documents) using the `KeyBERTInspired` representation model from `bertopic.representation`.
The dataset was downloaded from [Kaggle](https://www.kaggle.com/datasets/arashnic/urban-sound?resource=download&select=train_tm), and its two subsets (train and test) were merged into a single dataset.

The complete code is available in this tutorial notebook on my GitHub page:
[ResPapers](https://github.com/ccatalao/respapers/blob/main/respapers.ipynb) 


## Usage 

To use this model, please install BERTopic:

```
pip install -U bertopic
```

You can use the model as follows:

```python
from bertopic import BERTopic
topic_model = BERTopic.load("CCatalao/respapers_topics")

topic_model.get_topic_info()
```
To view the KeyBERT-inspired representation of a topic alongside the main one, pass `full=True` to `get_topic`:

```python
>>> topic_model.get_topic(0, full=True)
{'Main': [['spin', 0.01852648864225281],
  ['magnetic', 0.015019436257929909],
  ['phase', 0.013081733986038124],
  ['quantum', 0.012942253723133639],
  ['temperature', 0.012591407440537158],
  ['states', 0.011025582290837643],
  ['field', 0.010954775154251296],
  ['electron', 0.010168708734803916],
  ['transition', 0.009728560280580357],
  ['energy', 0.00937042795113575]],
 'KeyBERTInspired': [['quantum', 0.4072583317756653],
  ['phase transition', 0.35542067885398865],
  ['lattice', 0.34462833404541016],
  ['spin', 0.3268473744392395],
  ['magnetic', 0.3024371564388275],
  ['magnetization', 0.2868726849555969],
  ['phases', 0.27178525924682617],
  ['fermi', 0.26290175318717957],
  ['electron', 0.25709500908851624],
  ['phase', 0.23375216126441956]]}
```
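
The two representations rank the keywords of topic 0 differently. As a minimal sketch in plain Python (no BERTopic required), the snippet below compares the two lists from the output above. The dictionary is copied by hand from that output with scores rounded to four decimals, and the helper name `keywords` is an illustrative assumption, not part of BERTopic.

```python
# Keyword lists copied from the get_topic(0, full=True) output above,
# with scores rounded to four decimals for readability.
topic_0 = {
    "Main": [
        ["spin", 0.0185], ["magnetic", 0.0150], ["phase", 0.0131],
        ["quantum", 0.0129], ["temperature", 0.0126], ["states", 0.0110],
        ["field", 0.0110], ["electron", 0.0102], ["transition", 0.0097],
        ["energy", 0.0094],
    ],
    "KeyBERTInspired": [
        ["quantum", 0.4073], ["phase transition", 0.3554], ["lattice", 0.3446],
        ["spin", 0.3268], ["magnetic", 0.3024], ["magnetization", 0.2869],
        ["phases", 0.2718], ["fermi", 0.2629], ["electron", 0.2571],
        ["phase", 0.2338],
    ],
}


def keywords(representation):
    """Return only the keywords of a [[word, score], ...] list, in order."""
    return [word for word, _score in representation]


main_words = keywords(topic_0["Main"])
keybert_words = keywords(topic_0["KeyBERTInspired"])

# Keywords present in both representations (order of the main list).
shared = [w for w in main_words if w in keybert_words]
# Keywords that only the KeyBERT-inspired representation surfaces.
only_keybert = [w for w in keybert_words if w not in main_words]

print("shared:", shared)
print("only in KeyBERTInspired:", only_keybert)
```

For this topic, five keywords are shared, and the KeyBERT-inspired list adds terms such as `phase transition`, `lattice`, and `magnetization`.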

## Topic overview

* Number of topics: 112
* Number of training documents: 29961

<details>
  <summary>Click here for an overview of all topics.</summary>
  
  | Topic ID | Topic Keywords | Topic Frequency | Label | 
|----------|----------------|-----------------|-------| 
| -1 | data - model - paper - time - based | 20 | -1_data_model_paper_time | 
| 0 | spin - magnetic - phase - quantum - temperature | 12937 | 0_spin_magnetic_phase_quantum | 
| 1 | mass - star - stars - 10 - stellar | 3048 | 1_mass_star_stars_10 | 
| 2 | reinforcement - reinforcement learning - learning - policy - robot | 2564 | 2_reinforcement_reinforcement learning_learning_policy | 
| 3 | logic - semantics - programs - automata - languages | 556 | 3_logic_semantics_programs_automata | 
| 4 | neural - networks - neural networks - deep - training | 478 | 4_neural_networks_neural networks_deep | 
| 5 | networks - community - network - social - nodes | 405 | 5_networks_community_network_social | 
| 6 | word - translation - language - words - sentence | 340 | 6_word_translation_language_words | 
| 7 | object - 3d - camera - pose - localization | 298 | 7_object_3d_camera_pose | 
| 8 | classification - label - classifier - learning - classifiers | 294 | 8_classification_label_classifier_learning | 
| 9 | convex - gradient - stochastic - convergence - optimization | 287 | 9_convex_gradient_stochastic_convergence | 
| 10 | graphs - graph - vertices - vertex - edge | 284 | 10_graphs_graph_vertices_vertex | 
| 11 | brain - neurons - connectivity - neural - synaptic | 273 | 11_brain_neurons_connectivity_neural | 
| 12 | robots - robot - planning - control - motion | 255 | 12_robots_robot_planning_control | 
| 13 | prime - numbers - polynomials - integers - zeta | 245 | 13_prime_numbers_polynomials_integers | 
| 14 | tensor - rank - matrix - low rank - pca | 226 | 14_tensor_rank_matrix_low rank | 
| 15 | power - energy - grid - renewable - load | 222 | 15_power_energy_grid_renewable | 
| 16 | channel - power - mimo - interference - wireless | 219 | 16_channel_power_mimo_interference | 
| 17 | adversarial - attacks - adversarial examples - attack - examples | 208 | 17_adversarial_attacks_adversarial examples_attack | 
| 18 | gan - gans - generative - generative adversarial - adversarial | 200 | 18_gan_gans_generative_generative adversarial | 
| 19 | media - social - twitter - users - social media | 196 | 19_media_social_twitter_users | 
| 20 | posterior - monte - monte carlo - carlo - bayesian | 190 | 20_posterior_monte_monte carlo_carlo | 
| 21 | estimator - estimators - regression - quantile - estimation | 189 | 21_estimator_estimators_regression_quantile | 
| 22 | software - code - developers - projects - development | 178 | 22_software_code_developers_projects | 
| 23 | regret - bandit - armed - arm - multi armed | 177 | 23_regret_bandit_armed_arm | 
| 24 | omega - mathbb - solutions - boundary - equation | 177 | 24_omega_mathbb_solutions_boundary | 
| 25 | numerical - scheme - mesh - method - order | 175 | 25_numerical_scheme_mesh_method | 
| 26 | causal - treatment - outcome - effects - causal inference | 174 | 26_causal_treatment_outcome_effects | 
| 27 | curvature - mean curvature - riemannian - ricci - metric | 164 | 27_curvature_mean curvature_riemannian_ricci | 
| 28 | control - distributed - systems - consensus - agents | 156 | 28_control_distributed_systems_consensus | 
| 29 | groups - group - subgroup - subgroups - finite | 153 | 29_groups_group_subgroup_subgroups | 
| 30 | segmentation - images - image - convolutional - medical | 148 | 30_segmentation_images_image_convolutional | 
| 31 | market - portfolio - asset - price - volatility | 144 | 31_market_portfolio_asset_price | 
| 32 | recommendation - user - item - items - recommender | 138 | 32_recommendation_user_item_items | 
| 33 | algebra - algebras - lie - mathfrak - modules | 131 | 33_algebra_algebras_lie_mathfrak | 
| 34 | quantum - classical - circuits - annealing - circuit | 121 | 34_quantum_classical_circuits_annealing | 
| 35 | moduli - varieties - projective - curves - bundles | 119 | 35_moduli_varieties_projective_curves | 
| 36 | graph - embedding - node - graphs - network | 117 | 36_graph_embedding_node_graphs | 
| 37 | codes - decoding - channel - code - capacity | 113 | 37_codes_decoding_channel_code | 
| 38 | sparse - signal - recovery - sensing - measurements | 107 | 38_sparse_signal_recovery_sensing | 
| 39 | knot - knots - homology - invariants - link | 103 | 39_knot_knots_homology_invariants | 
| 40 | spaces - hardy - operators - mathbb - boundedness | 95 | 40_spaces_hardy_operators_mathbb | 
| 41 | blockchain - security - privacy - authentication - encryption | 90 | 41_blockchain_security_privacy_authentication | 
| 42 | turbulence - turbulent - flow - flows - reynolds | 89 | 42_turbulence_turbulent_flow_flows | 
| 43 | privacy - differential privacy - private - differential - data | 86 | 43_privacy_differential privacy_private_differential | 
| 44 | epidemic - disease - infection - infected - infectious | 83 | 44_epidemic_disease_infection_infected | 
| 45 | citation - scientific - research - journal - papers | 82 | 45_citation_scientific_research_journal | 
| 46 | surface - droplet - fluid - liquid - droplets | 81 | 46_surface_droplet_fluid_liquid | 
| 47 | chemical - molecules - molecular - protein - learning | 79 | 47_chemical_molecules_molecular_protein | 
| 48 | kähler - manifolds - manifold - complex - metrics | 77 | 48_kähler_manifolds_manifold_complex | 
| 49 | games - game - players - nash - player | 74 | 49_games_game_players_nash | 
| 50 | patients - patient - clinical - ehr - care | 73 | 50_patients_patient_clinical_ehr | 
| 51 | music - musical - audio - chord - note | 70 | 51_music_musical_audio_chord | 
| 52 | visual - shot - image - cnns - learning | 70 | 52_visual_shot_image_cnns | 
| 53 | speaker - speech - end - recognition - speech recognition | 70 | 53_speaker_speech_end_recognition | 
| 54 | cell - cells - tissue - active - tumor | 69 | 54_cell_cells_tissue_active | 
| 55 | eeg - brain - signals - sleep - subjects | 69 | 55_eeg_brain_signals_sleep | 
| 56 | fairness - fair - discrimination - decision - algorithmic | 67 | 56_fairness_fair_discrimination_decision | 
| 57 | clustering - clusters - data - based clustering - cluster | 66 | 57_clustering_clusters_data_based clustering | 
| 58 | relativity - black - solutions - einstein - spacetime | 65 | 58_relativity_black_solutions_einstein | 
| 59 | mathbb - curves - elliptic - conjecture - fields | 62 | 59_mathbb_curves_elliptic_conjecture | 
| 60 | stokes - navier - navier stokes - equations - stokes equations | 61 | 60_stokes_navier_navier stokes_equations | 
| 61 | species - population - dispersal - ecosystem - populations | 60 | 61_species_population_dispersal_ecosystem | 
| 62 | reconstruction - ct - artifacts - image - images | 58 | 62_reconstruction_ct_artifacts_image | 
| 63 | algebra - algebras - mathcal - alpha - crossed | 58 | 63_algebra_algebras_mathcal_alpha | 
| 64 | tiling - polytopes - set - polygon - polytope | 58 | 64_tiling_polytopes_set_polygon | 
| 65 | mobile - video - network - latency - computing | 57 | 65_mobile_video_network_latency | 
| 66 | latent - variational - vae - generative - inference | 55 | 66_latent_variational_vae_generative | 
| 67 | players - game - team - player - teams | 54 | 67_players_game_team_player | 
| 68 | genes - gene - cancer - expression - sequencing | 53 | 68_genes_gene_cancer_expression | 
| 69 | forcing - kappa - definable - cardinal - zfc | 51 | 69_forcing_kappa_definable_cardinal | 
| 70 | dna - protein - folding - proteins - molecule | 50 | 70_dna_protein_folding_proteins | 
| 71 | spaces - space - metric - metric spaces - topology | 49 | 71_spaces_space_metric_metric spaces | 
| 72 | speech - separation - source separation - enhancement - speaker | 49 | 72_speech_separation_source separation_enhancement | 
| 73 | imaging - resolution - light - diffraction - phase | 47 | 73_imaging_resolution_light_diffraction | 
| 74 | traffic - traffic flow - prediction - temporal - transportation | 46 | 74_traffic_traffic flow_prediction_temporal | 
| 75 | climate - precipitation - sea - flood - extreme | 45 | 75_climate_precipitation_sea_flood | 
| 76 | audio - sound - event detection - event - bird | 43 | 76_audio_sound_event detection_event | 
| 77 | memory - storage - cache - performance - write | 40 | 77_memory_storage_cache_performance | 
| 78 | wishart - matrices - eigenvalue - free - smallest | 39 | 78_wishart_matrices_eigenvalue_free | 
| 79 | domain - domain adaptation - adaptation - transfer - target | 39 | 79_domain_domain adaptation_adaptation_transfer | 
| 80 | glass - glasses - glassy - amorphous - liquids | 39 | 80_glass_glasses_glassy_amorphous | 
| 81 | gpu - gpus - nvidia - code - performance | 38 | 81_gpu_gpus_nvidia_code | 
| 82 | face - face recognition - facial - recognition - faces | 38 | 82_face_face recognition_facial_recognition | 
| 83 | stock - market - price - financial - stocks | 37 | 83_stock_market_price_financial | 
| 84 | reaction - flux - metabolic - growth - biochemical | 34 | 84_reaction_flux_metabolic_growth | 
| 85 | fleet - routing - vehicles - ride - traffic | 34 | 85_fleet_routing_vehicles_ride | 
| 86 | cooperation - evolutionary - game - social - payoff | 33 | 86_cooperation_evolutionary_game_social | 
| 87 | students - courses - student - course - education | 33 | 87_students_courses_student_course | 
| 88 | action - temporal - video - recognition - videos | 33 | 88_action_temporal_video_recognition | 
| 89 | irreducible - group - mathcal - representations - let | 32 | 89_irreducible_group_mathcal_representations | 
| 90 | phylogenetic - tree - trees - species - gene | 32 | 90_phylogenetic_tree_trees_species | 
| 91 | processes - drift - asymptotic - estimators - stationary | 31 | 91_processes_drift_asymptotic_estimators | 
| 92 | wave - waves - water - free surface - shallow water | 30 | 92_wave_waves_water_free surface | 
| 93 | distributed - gradient - byzantine - communication - sgd | 30 | 93_distributed_gradient_byzantine_communication | 
| 94 | voters - voting - election - voter - winner | 30 | 94_voters_voting_election_voter | 
| 95 | gaussian process - gaussian - gp - process - gaussian processes | 30 | 95_gaussian process_gaussian_gp_process | 
| 96 | mathfrak - gorenstein - ring - rings - modules | 29 | 96_mathfrak_gorenstein_ring_rings | 
| 97 | motivic - gw - cohomology - dm - category | 29 | 97_motivic_gw_cohomology_dm | 
| 98 | recurrent - lstm - rnn - recurrent neural - memory | 28 | 98_recurrent_lstm_rnn_recurrent neural | 
| 99 | semigroup - semigroups - xy - ordered - pt | 27 | 99_semigroup_semigroups_xy_ordered | 
| 100 | robot - robots - human - human robot - children | 25 | 100_robot_robots_human_human robot | 
| 101 | categories - category - homotopy - functor - grothendieck | 25 | 101_categories_category_homotopy_functor | 
| 102 | queue - queues - server - scheduling - customer | 24 | 102_queue_queues_server_scheduling | 
| 103 | topic - topics - topic modeling - lda - documents | 24 | 103_topic_topics_topic modeling_lda | 
| 104 | synchronization - oscillators - chimera - coupling - coupled | 24 | 104_synchronization_oscillators_chimera_coupling | 
| 105 | stochastic - existence - equation - solutions - uniqueness | 24 | 105_stochastic_existence_equation_solutions | 
| 106 | fractional - derivative - derivatives - integral - psi | 23 | 106_fractional_derivative_derivatives_integral | 
| 107 | lasso - regression - estimator - estimators - bootstrap | 23 | 107_lasso_regression_estimator_estimators | 
| 108 | soil - moisture - machine - resolution - seismic | 22 | 108_soil_moisture_machine_resolution | 
| 109 | bayesian optimization - optimization - acquisition - bayesian - bo | 21 | 109_bayesian optimization_optimization_acquisition_bayesian | 
| 110 | urban - city - mobility - cities - social | 21 | 110_urban_city_mobility_cities |
  
</details>
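
Judging from the table, each value in the Label column looks like the topic ID joined by underscores with the topic's first four keywords. The sketch below reproduces that pattern for a few rows. `make_label` is a hypothetical helper written for illustration; it is not a BERTopic function.

```python
def make_label(topic_id, keywords, n_words=4):
    """Join the topic ID and the first n_words keywords with underscores."""
    return "_".join([str(topic_id)] + keywords[:n_words])


# A few (topic ID, keywords) rows copied from the table above.
rows = [
    (-1, ["data", "model", "paper", "time", "based"]),
    (0, ["spin", "magnetic", "phase", "quantum", "temperature"]),
    (2, ["reinforcement", "reinforcement learning", "learning", "policy", "robot"]),
]

for topic_id, kws in rows:
    print(make_label(topic_id, kws))
# -1_data_model_paper_time
# 0_spin_magnetic_phase_quantum
# 2_reinforcement_reinforcement learning_learning_policy
```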


## Training Procedure

The model was trained as follows:

```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired

from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# Prepare sub-models
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
umap_model = UMAP(n_components=5, n_neighbors=50, random_state=42, metric="cosine", verbose=True)
hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True, prediction_data=False, min_cluster_size=20)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=5)

# Representation models
representation_models = {"KeyBERTInspired": KeyBERTInspired()}

# Fit BERTopic. `docs` is the list of abstracts (train and test subsets merged).
# Note: embedding_model, defined above, is not passed to BERTopic in this snippet.
topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_models,
    min_topic_size=10,
    n_gram_range=(1, 1),
    nr_topics=None,
    seed_topic_list=None,
    top_n_words=10,
    calculate_probabilities=False,
    language=None,
    verbose=True,
).fit(docs)
```


## Training hyperparameters

* calculate_probabilities: False
* language: None
* low_memory: False
* min_topic_size: 10
* n_gram_range: (1, 1)
* nr_topics: None
* seed_topic_list: None
* top_n_words: 10
* verbose: True

## Framework versions

* Numpy: 1.22.4
* HDBSCAN: 0.8.33
* UMAP: 0.5.3
* Pandas: 1.5.3
* Scikit-Learn: 1.2.2
* Sentence-transformers: 2.2.2
* Transformers: 4.29.2
* Numba: 0.56.4
* Plotly: 5.13.1
* Python: 3.10.11
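
To check whether a local environment matches these versions, something like the following standard-library sketch may help. The PyPI distribution names in `PINNED` (for example `umap-learn` for UMAP) and the helper name `compare_versions` are my assumptions, not part of the original model card.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions from the list above, keyed by (assumed) PyPI distribution name.
PINNED = {
    "numpy": "1.22.4",
    "hdbscan": "0.8.33",
    "umap-learn": "0.5.3",
    "pandas": "1.5.3",
    "scikit-learn": "1.2.2",
    "sentence-transformers": "2.2.2",
    "transformers": "4.29.2",
    "numba": "0.56.4",
    "plotly": "5.13.1",
}


def compare_versions(pinned):
    """Map each package to (expected, installed or None, versions match?)."""
    report = {}
    for package, expected in pinned.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (expected, installed, installed == expected)
    return report


for package, (expected, installed, ok) in compare_versions(PINNED).items():
    status = "ok" if ok else "differs"
    print(f"{package}: expected {expected}, installed {installed} ({status})")
```

A mismatch does not necessarily mean the model will fail to load; the report only shows where the environment differs from the one used for training.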