---
tags:
- bertopic
library_name: bertopic
pipeline_tag: text-classification
---
# respapers_topics
This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
This pre-trained model was built to demonstrate the use of a representation model inspired by KeyBERT within BERTopic.
It was trained on ~30,000 research paper abstracts using the KeyBERTInspired representation method (`bertopic.representation`).
The dataset was downloaded from [Kaggle](https://www.kaggle.com/datasets/arashnic/urban-sound?resource=download&select=train_tm), and its two subsets (test and train) were merged into a single dataset.
To access the complete code, you can visit this tutorial on my GitHub page:
[ResPapers](https://github.com/ccatalao/respapers/blob/main/respapers.ipynb)
## Usage
To use this model, please install BERTopic:
```
pip install -U bertopic
```
You can use the model as follows:
```python
from bertopic import BERTopic
topic_model = BERTopic.load("CCatalao/respapers_topics")
topic_model.get_topic_info()
```
To view the KeyBERT-inspired topic representation alongside the main one, use the following:
```python
>>> topic_model.get_topic(0, full=True)
{'Main': [['spin', 0.01852648864225281],
['magnetic', 0.015019436257929909],
['phase', 0.013081733986038124],
['quantum', 0.012942253723133639],
['temperature', 0.012591407440537158],
['states', 0.011025582290837643],
['field', 0.010954775154251296],
['electron', 0.010168708734803916],
['transition', 0.009728560280580357],
['energy', 0.00937042795113575]],
'KeyBERTInspired': [['quantum', 0.4072583317756653],
['phase transition', 0.35542067885398865],
['lattice', 0.34462833404541016],
['spin', 0.3268473744392395],
['magnetic', 0.3024371564388275],
['magnetization', 0.2868726849555969],
['phases', 0.27178525924682617],
['fermi', 0.26290175318717957],
['electron', 0.25709500908851624],
['phase', 0.23375216126441956]]}
```
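As a rough illustration of how the two representations differ, the sketch below compares the keyword lists from the output above. It uses only the words printed there (scores omitted) and plain Python; the variable names are illustrative and not part of BERTopic.

```python
# Keywords copied from the `get_topic(0, full=True)` output above (scores omitted).
main_words = ["spin", "magnetic", "phase", "quantum", "temperature",
              "states", "field", "electron", "transition", "energy"]
keybert_words = ["quantum", "phase transition", "lattice", "spin", "magnetic",
                 "magnetization", "phases", "fermi", "electron", "phase"]

# Words that both representations agree on, in KeyBERTInspired order.
shared = [w for w in keybert_words if w in main_words]
# Words that only the KeyBERTInspired representation surfaces.
keybert_only = [w for w in keybert_words if w not in main_words]

print("shared:", shared)
print("KeyBERTInspired only:", keybert_only)
```

For this topic, half of the KeyBERTInspired keywords also appear in the main representation, while the other half are terms such as `phase transition` and `magnetization` that the main list does not contain.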
## Topic overview
* Number of topics: 112
* Number of training documents: 29961
<details>
<summary>Click here for an overview of all topics.</summary>
| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | data - model - paper - time - based | 20 | -1_data_model_paper_time |
| 0 | spin - magnetic - phase - quantum - temperature | 12937 | 0_spin_magnetic_phase_quantum |
| 1 | mass - star - stars - 10 - stellar | 3048 | 1_mass_star_stars_10 |
| 2 | reinforcement - reinforcement learning - learning - policy - robot | 2564 | 2_reinforcement_reinforcement learning_learning_policy |
| 3 | logic - semantics - programs - automata - languages | 556 | 3_logic_semantics_programs_automata |
| 4 | neural - networks - neural networks - deep - training | 478 | 4_neural_networks_neural networks_deep |
| 5 | networks - community - network - social - nodes | 405 | 5_networks_community_network_social |
| 6 | word - translation - language - words - sentence | 340 | 6_word_translation_language_words |
| 7 | object - 3d - camera - pose - localization | 298 | 7_object_3d_camera_pose |
| 8 | classification - label - classifier - learning - classifiers | 294 | 8_classification_label_classifier_learning |
| 9 | convex - gradient - stochastic - convergence - optimization | 287 | 9_convex_gradient_stochastic_convergence |
| 10 | graphs - graph - vertices - vertex - edge | 284 | 10_graphs_graph_vertices_vertex |
| 11 | brain - neurons - connectivity - neural - synaptic | 273 | 11_brain_neurons_connectivity_neural |
| 12 | robots - robot - planning - control - motion | 255 | 12_robots_robot_planning_control |
| 13 | prime - numbers - polynomials - integers - zeta | 245 | 13_prime_numbers_polynomials_integers |
| 14 | tensor - rank - matrix - low rank - pca | 226 | 14_tensor_rank_matrix_low rank |
| 15 | power - energy - grid - renewable - load | 222 | 15_power_energy_grid_renewable |
| 16 | channel - power - mimo - interference - wireless | 219 | 16_channel_power_mimo_interference |
| 17 | adversarial - attacks - adversarial examples - attack - examples | 208 | 17_adversarial_attacks_adversarial examples_attack |
| 18 | gan - gans - generative - generative adversarial - adversarial | 200 | 18_gan_gans_generative_generative adversarial |
| 19 | media - social - twitter - users - social media | 196 | 19_media_social_twitter_users |
| 20 | posterior - monte - monte carlo - carlo - bayesian | 190 | 20_posterior_monte_monte carlo_carlo |
| 21 | estimator - estimators - regression - quantile - estimation | 189 | 21_estimator_estimators_regression_quantile |
| 22 | software - code - developers - projects - development | 178 | 22_software_code_developers_projects |
| 23 | regret - bandit - armed - arm - multi armed | 177 | 23_regret_bandit_armed_arm |
| 24 | omega - mathbb - solutions - boundary - equation | 177 | 24_omega_mathbb_solutions_boundary |
| 25 | numerical - scheme - mesh - method - order | 175 | 25_numerical_scheme_mesh_method |
| 26 | causal - treatment - outcome - effects - causal inference | 174 | 26_causal_treatment_outcome_effects |
| 27 | curvature - mean curvature - riemannian - ricci - metric | 164 | 27_curvature_mean curvature_riemannian_ricci |
| 28 | control - distributed - systems - consensus - agents | 156 | 28_control_distributed_systems_consensus |
| 29 | groups - group - subgroup - subgroups - finite | 153 | 29_groups_group_subgroup_subgroups |
| 30 | segmentation - images - image - convolutional - medical | 148 | 30_segmentation_images_image_convolutional |
| 31 | market - portfolio - asset - price - volatility | 144 | 31_market_portfolio_asset_price |
| 32 | recommendation - user - item - items - recommender | 138 | 32_recommendation_user_item_items |
| 33 | algebra - algebras - lie - mathfrak - modules | 131 | 33_algebra_algebras_lie_mathfrak |
| 34 | quantum - classical - circuits - annealing - circuit | 121 | 34_quantum_classical_circuits_annealing |
| 35 | moduli - varieties - projective - curves - bundles | 119 | 35_moduli_varieties_projective_curves |
| 36 | graph - embedding - node - graphs - network | 117 | 36_graph_embedding_node_graphs |
| 37 | codes - decoding - channel - code - capacity | 113 | 37_codes_decoding_channel_code |
| 38 | sparse - signal - recovery - sensing - measurements | 107 | 38_sparse_signal_recovery_sensing |
| 39 | knot - knots - homology - invariants - link | 103 | 39_knot_knots_homology_invariants |
| 40 | spaces - hardy - operators - mathbb - boundedness | 95 | 40_spaces_hardy_operators_mathbb |
| 41 | blockchain - security - privacy - authentication - encryption | 90 | 41_blockchain_security_privacy_authentication |
| 42 | turbulence - turbulent - flow - flows - reynolds | 89 | 42_turbulence_turbulent_flow_flows |
| 43 | privacy - differential privacy - private - differential - data | 86 | 43_privacy_differential privacy_private_differential |
| 44 | epidemic - disease - infection - infected - infectious | 83 | 44_epidemic_disease_infection_infected |
| 45 | citation - scientific - research - journal - papers | 82 | 45_citation_scientific_research_journal |
| 46 | surface - droplet - fluid - liquid - droplets | 81 | 46_surface_droplet_fluid_liquid |
| 47 | chemical - molecules - molecular - protein - learning | 79 | 47_chemical_molecules_molecular_protein |
| 48 | kähler - manifolds - manifold - complex - metrics | 77 | 48_kähler_manifolds_manifold_complex |
| 49 | games - game - players - nash - player | 74 | 49_games_game_players_nash |
| 50 | patients - patient - clinical - ehr - care | 73 | 50_patients_patient_clinical_ehr |
| 51 | music - musical - audio - chord - note | 70 | 51_music_musical_audio_chord |
| 52 | visual - shot - image - cnns - learning | 70 | 52_visual_shot_image_cnns |
| 53 | speaker - speech - end - recognition - speech recognition | 70 | 53_speaker_speech_end_recognition |
| 54 | cell - cells - tissue - active - tumor | 69 | 54_cell_cells_tissue_active |
| 55 | eeg - brain - signals - sleep - subjects | 69 | 55_eeg_brain_signals_sleep |
| 56 | fairness - fair - discrimination - decision - algorithmic | 67 | 56_fairness_fair_discrimination_decision |
| 57 | clustering - clusters - data - based clustering - cluster | 66 | 57_clustering_clusters_data_based clustering |
| 58 | relativity - black - solutions - einstein - spacetime | 65 | 58_relativity_black_solutions_einstein |
| 59 | mathbb - curves - elliptic - conjecture - fields | 62 | 59_mathbb_curves_elliptic_conjecture |
| 60 | stokes - navier - navier stokes - equations - stokes equations | 61 | 60_stokes_navier_navier stokes_equations |
| 61 | species - population - dispersal - ecosystem - populations | 60 | 61_species_population_dispersal_ecosystem |
| 62 | reconstruction - ct - artifacts - image - images | 58 | 62_reconstruction_ct_artifacts_image |
| 63 | algebra - algebras - mathcal - alpha - crossed | 58 | 63_algebra_algebras_mathcal_alpha |
| 64 | tiling - polytopes - set - polygon - polytope | 58 | 64_tiling_polytopes_set_polygon |
| 65 | mobile - video - network - latency - computing | 57 | 65_mobile_video_network_latency |
| 66 | latent - variational - vae - generative - inference | 55 | 66_latent_variational_vae_generative |
| 67 | players - game - team - player - teams | 54 | 67_players_game_team_player |
| 68 | genes - gene - cancer - expression - sequencing | 53 | 68_genes_gene_cancer_expression |
| 69 | forcing - kappa - definable - cardinal - zfc | 51 | 69_forcing_kappa_definable_cardinal |
| 70 | dna - protein - folding - proteins - molecule | 50 | 70_dna_protein_folding_proteins |
| 71 | spaces - space - metric - metric spaces - topology | 49 | 71_spaces_space_metric_metric spaces |
| 72 | speech - separation - source separation - enhancement - speaker | 49 | 72_speech_separation_source separation_enhancement |
| 73 | imaging - resolution - light - diffraction - phase | 47 | 73_imaging_resolution_light_diffraction |
| 74 | traffic - traffic flow - prediction - temporal - transportation | 46 | 74_traffic_traffic flow_prediction_temporal |
| 75 | climate - precipitation - sea - flood - extreme | 45 | 75_climate_precipitation_sea_flood |
| 76 | audio - sound - event detection - event - bird | 43 | 76_audio_sound_event detection_event |
| 77 | memory - storage - cache - performance - write | 40 | 77_memory_storage_cache_performance |
| 78 | wishart - matrices - eigenvalue - free - smallest | 39 | 78_wishart_matrices_eigenvalue_free |
| 79 | domain - domain adaptation - adaptation - transfer - target | 39 | 79_domain_domain adaptation_adaptation_transfer |
| 80 | glass - glasses - glassy - amorphous - liquids | 39 | 80_glass_glasses_glassy_amorphous |
| 81 | gpu - gpus - nvidia - code - performance | 38 | 81_gpu_gpus_nvidia_code |
| 82 | face - face recognition - facial - recognition - faces | 38 | 82_face_face recognition_facial_recognition |
| 83 | stock - market - price - financial - stocks | 37 | 83_stock_market_price_financial |
| 84 | reaction - flux - metabolic - growth - biochemical | 34 | 84_reaction_flux_metabolic_growth |
| 85 | fleet - routing - vehicles - ride - traffic | 34 | 85_fleet_routing_vehicles_ride |
| 86 | cooperation - evolutionary - game - social - payoff | 33 | 86_cooperation_evolutionary_game_social |
| 87 | students - courses - student - course - education | 33 | 87_students_courses_student_course |
| 88 | action - temporal - video - recognition - videos | 33 | 88_action_temporal_video_recognition |
| 89 | irreducible - group - mathcal - representations - let | 32 | 89_irreducible_group_mathcal_representations |
| 90 | phylogenetic - tree - trees - species - gene | 32 | 90_phylogenetic_tree_trees_species |
| 91 | processes - drift - asymptotic - estimators - stationary | 31 | 91_processes_drift_asymptotic_estimators |
| 92 | wave - waves - water - free surface - shallow water | 30 | 92_wave_waves_water_free surface |
| 93 | distributed - gradient - byzantine - communication - sgd | 30 | 93_distributed_gradient_byzantine_communication |
| 94 | voters - voting - election - voter - winner | 30 | 94_voters_voting_election_voter |
| 95 | gaussian process - gaussian - gp - process - gaussian processes | 30 | 95_gaussian process_gaussian_gp_process |
| 96 | mathfrak - gorenstein - ring - rings - modules | 29 | 96_mathfrak_gorenstein_ring_rings |
| 97 | motivic - gw - cohomology - dm - category | 29 | 97_motivic_gw_cohomology_dm |
| 98 | recurrent - lstm - rnn - recurrent neural - memory | 28 | 98_recurrent_lstm_rnn_recurrent neural |
| 99 | semigroup - semigroups - xy - ordered - pt | 27 | 99_semigroup_semigroups_xy_ordered |
| 100 | robot - robots - human - human robot - children | 25 | 100_robot_robots_human_human robot |
| 101 | categories - category - homotopy - functor - grothendieck | 25 | 101_categories_category_homotopy_functor |
| 102 | queue - queues - server - scheduling - customer | 24 | 102_queue_queues_server_scheduling |
| 103 | topic - topics - topic modeling - lda - documents | 24 | 103_topic_topics_topic modeling_lda |
| 104 | synchronization - oscillators - chimera - coupling - coupled | 24 | 104_synchronization_oscillators_chimera_coupling |
| 105 | stochastic - existence - equation - solutions - uniqueness | 24 | 105_stochastic_existence_equation_solutions |
| 106 | fractional - derivative - derivatives - integral - psi | 23 | 106_fractional_derivative_derivatives_integral |
| 107 | lasso - regression - estimator - estimators - bootstrap | 23 | 107_lasso_regression_estimator_estimators |
| 108 | soil - moisture - machine - resolution - seismic | 22 | 108_soil_moisture_machine_resolution |
| 109 | bayesian optimization - optimization - acquisition - bayesian - bo | 21 | 109_bayesian optimization_optimization_acquisition_bayesian |
| 110 | urban - city - mobility - cities - social | 21 | 110_urban_city_mobility_cities |
</details>
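The topic frequencies are heavily skewed toward the first few topics. The short calculation below makes that concrete using only numbers copied from the table and the training document count above; the helper name `share` is illustrative.

```python
# Counts copied from the topic table; total from "Number of training documents".
n_docs = 29961
topic_counts = {-1: 20, 0: 12937, 1: 3048, 2: 2564}


def share(count, total=n_docs):
    """Return the percentage of training documents, rounded to one decimal."""
    return round(100 * count / total, 1)


outlier_share = share(topic_counts[-1])   # documents left unassigned (topic -1)
largest_share = share(topic_counts[0])    # the single largest topic
top3_share = share(sum(topic_counts[t] for t in (0, 1, 2)))

print(f"outliers: {outlier_share}%")
print(f"topic 0: {largest_share}%")
print(f"topics 0-2: {top3_share}%")
```

Topic 0 alone covers a bit over two fifths of the corpus, which is worth keeping in mind when interpreting its broad keyword list.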
## Training Procedure
The model was trained as follows:
```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import KeyBERTInspired
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
# Prepare sub-models
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
umap_model = UMAP(n_components=5, n_neighbors=50, random_state=42, metric="cosine", verbose=True)
hdbscan_model = HDBSCAN(min_samples=20, gen_min_span_tree=True, prediction_data=False, min_cluster_size=20)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), min_df=5)
# Representation models
representation_models = {"KeyBERTInspired": KeyBERTInspired()}
# Fit BERTopic
topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_models,
    min_topic_size=10,
    n_gram_range=(1, 1),
    nr_topics=None,
    seed_topic_list=None,
    top_n_words=10,
    calculate_probabilities=False,
    language=None,
    verbose=True,
).fit(docs)
```
## Training hyperparameters
* calculate_probabilities: False
* language: None
* low_memory: False
* min_topic_size: 10
* n_gram_range: (1, 1)
* nr_topics: None
* seed_topic_list: None
* top_n_words: 10
* verbose: True
## Framework versions
* Numpy: 1.22.4
* HDBSCAN: 0.8.33
* UMAP: 0.5.3
* Pandas: 1.5.3
* Scikit-Learn: 1.2.2
* Sentence-transformers: 2.2.2
* Transformers: 4.29.2
* Numba: 0.56.4
* Plotly: 5.13.1
* Python: 3.10.11
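If loading the model fails or behaves unexpectedly, comparing the local environment against the versions above can help. This is a minimal sketch using only the standard library. The mapping from the names in the list to PyPI distribution names (for example `UMAP` to `umap-learn`) is an assumption, and the helper `installed_version` is illustrative.

```python
from importlib import metadata

# Versions copied from the list above, keyed by assumed PyPI distribution name.
trained_with = {
    "numpy": "1.22.4",
    "hdbscan": "0.8.33",
    "umap-learn": "0.5.3",
    "pandas": "1.5.3",
    "scikit-learn": "1.2.2",
    "sentence-transformers": "2.2.2",
    "transformers": "4.29.2",
    "numba": "0.56.4",
    "plotly": "5.13.1",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Pair each training-time version with whatever is installed locally.
report = {name: (expected, installed_version(name))
          for name, expected in trained_with.items()}

for name, (expected, found) in report.items():
    status = "ok" if found == expected else "differs"
    print(f"{name}: trained with {expected}, installed {found} ({status})")
```

A differing version is not necessarily a problem; the report is only a starting point for debugging.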