tags:
  - bertopic
library_name: bertopic
pipeline_tag: text-classification


This is a BERTopic model. BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.


To use this model, please install BERTopic:

pip install -U bertopic

You can use the model as follows:

from bertopic import BERTopic
topic_model = BERTopic.load("sanash43/dssg_topicmodel_500000")
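
Once loaded, the model can assign topics to new documents. The helper below is a minimal sketch, not part of the original card: `assign_topics` is a hypothetical name, and the optional `embedding_model` argument is only needed if the saved model does not record which embedding model it was trained with (an assumption to verify for this repository).

```python
def assign_topics(docs, model_id="sanash43/dssg_topicmodel_500000", embedding_model=None):
    """Return (document, topic_id) pairs for a list of new documents."""
    # Imported lazily so that defining this helper does not require bertopic.
    from bertopic import BERTopic

    # Downloads the model from the Hugging Face Hub on first use.
    topic_model = BERTopic.load(model_id, embedding_model=embedding_model)

    # transform() returns the predicted topic IDs and, optionally, probabilities.
    topics, _ = topic_model.transform(docs)
    return [(doc, int(topic)) for doc, topic in zip(docs, topics)]
```

Calling `assign_topics(["some new document text"])` returns one pair per input document. To see the full topic table as a DataFrame, `topic_model.get_topic_info()` can be called on the loaded model.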


Topic overview

  • Number of topics: 49
  • Number of training documents: 500000
| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| 0 | the - of - to - and - in | 444110 | 0_the_of_to_and |
| 1 | university - college - student - passed - permit | 31380 | 1_university_college_student_passed |
| 2 | 001 - 000 - xxxxxxxxxxxx - on9998 - 8703 | 10678 | 2_001_000_xxxxxxxxxxxx_on9998 |
| 3 | ergocentric - inc - or - services - 1231 | 3124 | 3_ergocentric_inc_or_services |
| 4 | regular - force - labrador - newfoundland - commercial | 1590 | 4_regular_force_labrador_newfoundland |
| 5 | seeding - hail - storm - radar - weather | 1228 | 5_seeding_hail_storm_radar |
| 6 | 000000 - rental - 42012e12 - 5000 - 2170 | 926 | 6_000000_rental_42012e12_5000 |
| 7 | hearing - loss - tinnitus - noise - ear | 796 | 7_hearing_loss_tinnitus_noise |
| 8 | the - and - of - in - you | 684 | 8_the_and_of_in |
| 9 | traduction - documents - parl - mots - tr03 | 534 | 9_traduction_documents_parl_mots |
| 10 | mci - 24 - 1943 - 23 - inst | 517 | 10_mci_24_1943_23 |
| 11 | cbsa - lasfc - dasile - demandeurs - total | 467 | 11_cbsa_lasfc_dasile_demandeurs |
| 12 | wwater - burlington - laboratory - eclabbur - testing | 424 | 12_wwater_burlington_laboratory_eclabbur |
| 13 | epoll - ou - doffres - elector - 10162 | 306 | 13_epoll_ou_doffres_elector |
| 14 | heritage - sussex - the - residence - building | 249 | 14_heritage_sussex_the_residence |
| 15 | greenough - daycare - wellington - consulting - october | 239 | 15_greenough_daycare_wellington_consulting |
| 16 | tage - floor - rue - confirmed - dorchester | 228 | 16_tage_floor_rue_confirmed |
| 17 | jeunes - youth - we - de - les | 216 | 17_jeunes_youth_we_de |
| 18 | bnp - hartals - violence - the - that | 211 | 18_bnp_hartals_violence_the |
| 19 | 10aig - i0aig - 10aic - ioaig - i0aic | 187 | 19_10aig_i0aig_10aic_ioaig |
| 20 | complaints - files - case - rdims - vs | 173 | 20_complaints_files_case_rdims |
| 21 | mckinsey - formatted - font - publishingemail - page | 165 | 21_mckinsey_formatted_font_publishingemail |
| 22 | cerb - english - french - xxxxxxxxxxxx - rdprm | 151 | 22_cerb_english_french_xxxxxxxxxxxx |
| 23 | aeroplane - pilot - complete - private - passed | 132 | 23_aeroplane_pilot_complete_private |
| 24 | blue - bridge - delay - water - edt | 130 | 24_blue_bridge_delay_water |
| 25 | dymista - nasal - fluticasone - propionate - spray | 123 | 25_dymista_nasal_fluticasone_propionate |
| 26 | individual - wh - pied - dd - tob | 113 | 26_individual_wh_pied_dd |
| 27 | holman - financial - 19971101 - services - ar | 80 | 27_holman_financial_19971101_services |
| 28 | pch - anthem - c210 - senator - bill | 77 | 28_pch_anthem_c210_senator |
| 29 | 6299 - r300 - assigned - liabilities - 21111 | 72 | 29_6299_r300_assigned_liabilities |
| 30 | cad - registered - 000 - eur - 19112015 | 71 | 30_cad_registered_000_eur |
| 31 | original - single - age - months - commercial | 70 | 31_original_single_age_months |
| 32 | biden - trump - votes - wshdc - election | 57 | 32_biden_trump_votes_wshdc |
| 33 | link - bellletstalk - mental - farmers - thefirstsixteen | 54 | 33_link_bellletstalk_mental_farmers |
| 34 | visits - average - daily - busiest - active | 44 | 34_visits_average_daily_busiest |
| 35 | de - laroport - dorval - mirabel - et | 41 | 35_de_laroport_dorval_mirabel |
| 36 | undefined - null - owning - created - status | 40 | 36_undefined_null_owning_created |
| 37 | 20190101 - treasurer - 20191231 - pastor - member | 39 | 37_20190101_treasurer_20191231_pastor |
| 38 | 1000040908 - protak - consulting - cad - cleared | 37 | 38_1000040908_protak_consulting_cad |
| 39 | parental - z5 - 75 - maternity - zq | 27 | 39_parental_z5_75_maternity |
| 40 | propane - per - cost - cents - bushel | 26 | 40_propane_per_cost_cents |
| 41 | male - haiti - female - minor - colombia | 26 | 41_male_haiti_female_minor |
| 42 | tsnrc - tsmrc - sda - standard - option | 25 | 42_tsnrc_tsmrc_sda_standard |
| 43 | stakeholders - 10072019 - 0000 - delegation - accredited | 25 | 43_stakeholders_10072019_0000_delegation |
| 44 | meop - eoms - multilateral - observation - eom | 25 | 44_meop_eoms_multilateral_observation |
| 45 | de - cuves - la - des - anodes | 22 | 45_de_cuves_la_des |
| 46 | destroyed - goods - importer - customs - rh | 22 | 46_destroyed_goods_importer_customs |
| 47 | pa - mexico - passed - female - male | 21 | 47_pa_mexico_passed_female |
| 48 | linda - cheverie - giulia - transcripts - command | 18 | 48_linda_cheverie_giulia_transcripts |
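
To put the frequencies in perspective, it can help to express them as a share of the 500000 training documents. The short calculation below is an illustrative sketch that uses only the three largest topics from the overview; the names `top_topics` and `topic_share` are hypothetical.

```python
TOTAL_DOCS = 500_000  # number of training documents reported in this card

# Frequencies copied from the topic overview (three largest topics only).
top_topics = {
    "0_the_of_to_and": 444_110,
    "1_university_college_student_passed": 31_380,
    "2_001_000_xxxxxxxxxxxx_on9998": 10_678,
}


def topic_share(frequency, total=TOTAL_DOCS):
    """Fraction of all training documents assigned to a topic."""
    return frequency / total


shares = {label: topic_share(freq) for label, freq in top_topics.items()}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

By this calculation, topic 0, whose keywords are common function words, holds roughly 88.8% of the training documents, so the remaining 48 topics together cover about 11.2%.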

Training hyperparameters

  • calculate_probabilities: False
  • language: None
  • low_memory: False
  • min_topic_size: 10
  • n_gram_range: (1, 1)
  • nr_topics: 50
  • seed_topic_list: None
  • top_n_words: 10
  • verbose: False
  • zeroshot_min_similarity: 0.7
  • zeroshot_topic_list: None
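
As a rough sketch of how these settings map onto the `BERTopic` constructor, the values above can be collected in a dictionary and passed as keyword arguments. This is not the author's training script: `HYPERPARAMS` and `build_topic_model` are hypothetical names, and the embedding model actually used is not stated in this card. `language` is left out on the assumption that BERTopic records it as `None` when an explicit embedding model is supplied.

```python
# Values copied from the "Training hyperparameters" list in this card.
HYPERPARAMS = {
    "calculate_probabilities": False,
    "low_memory": False,
    "min_topic_size": 10,
    "n_gram_range": (1, 1),
    "nr_topics": 50,
    "seed_topic_list": None,
    "top_n_words": 10,
    "verbose": False,
    "zeroshot_min_similarity": 0.7,
    "zeroshot_topic_list": None,
}


def build_topic_model(embedding_model=None):
    """Create an unfitted BERTopic model with the settings listed above."""
    # Imported lazily so that this module can be read without bertopic installed.
    from bertopic import BERTopic

    # embedding_model is an assumption: pass whichever model you intend to use.
    return BERTopic(embedding_model=embedding_model, **HYPERPARAMS)
```

A model built this way would still need to be fitted with `fit_transform(docs)` on your own documents. Results will generally differ from the published topics unless the same data, embedding model, and random seeds are used.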

Framework versions

  • Numpy: 1.26.4
  • HDBSCAN: 0.8.38.post1
  • UMAP: 0.5.6
  • Pandas: 2.2.1
  • Scikit-Learn: 1.4.0
  • Sentence-transformers: 3.0.1
  • Transformers: 4.43.4
  • Numba: 0.60.0
  • Plotly: 5.23.0
  • Python: 3.9.19
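
Version mismatches are a common cause of loading problems, so it may be worth comparing your environment with the list above. The check below is a small sketch that uses only the standard library; the PyPI distribution names (for example `umap-learn` for UMAP and `scikit-learn` for Scikit-Learn) are assumptions on my part, and `compare_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from this card; keys are the assumed PyPI distribution names.
EXPECTED = {
    "numpy": "1.26.4",
    "hdbscan": "0.8.38.post1",
    "umap-learn": "0.5.6",
    "pandas": "2.2.1",
    "scikit-learn": "1.4.0",
    "sentence-transformers": "3.0.1",
    "transformers": "4.43.4",
    "numba": "0.60.0",
    "plotly": "5.23.0",
}


def compare_versions(expected=EXPECTED):
    """Map each package to (expected, installed, matches)."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (wanted, installed, installed == wanted)
    return report


for package, (wanted, installed, ok) in compare_versions().items():
    status = "ok" if ok else "differs"
    print(f"{package}: expected {wanted}, installed {installed} ({status})")
```

A differing version does not necessarily mean the model will fail to load; the report is only a starting point for debugging.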