---
tags:
- bertopic
library_name: bertopic
pipeline_tag: text-classification
---

# bertopic_github_dataset_viewer_issues

This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
BERTopic is a flexible, modular topic modeling framework that generates easily interpretable topics from large collections of documents.

## Usage 

To use this model, please install BERTopic:

```
pip install -U bertopic
```

You can use the model as follows:

```python
from bertopic import BERTopic
topic_model = BERTopic.load("asoria/bertopic_github_dataset_viewer_issues")

topic_model.get_topic_info()
```
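
Beyond inspecting the topic table, a loaded model can assign topics to new text. The sketch below is illustrative only: `assign_topics` and `demo` are hypothetical helper names, the two issue titles are made up, and depending on how the model was saved, `BERTopic.load` may also need an `embedding_model` argument.

```python
from typing import Any, List, Sequence, Tuple


def assign_topics(topic_model: Any, docs: Sequence[str]) -> List[Tuple[str, int]]:
    """Pair each document with the topic ID predicted for it."""
    # BERTopic.transform returns (topics, probabilities); only topics are used here.
    topics, _ = topic_model.transform(list(docs))
    return [(doc, int(topic)) for doc, topic in zip(docs, topics)]


def demo() -> None:
    """Load the published model and label two made-up issue titles."""
    # Imported lazily so the helper above works without bertopic installed.
    from bertopic import BERTopic

    topic_model = BERTopic.load("asoria/bertopic_github_dataset_viewer_issues")
    docs = [
        "Upgrade duckdb and rebuild the full-text search index",  # hypothetical
        "Docker image build fails in CI",  # hypothetical
    ]
    for doc, topic in assign_topics(topic_model, docs):
        print(topic, doc)
```

Calling `demo()` downloads the model from the Hub. A returned topic ID of `-1` means the document was treated as an outlier.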

## Topic overview

* Number of topics: 78 (topic IDs -1 through 76; in BERTopic, topic -1 collects outlier documents)
* Number of training documents: 3066

<details>
  <summary>Click here for an overview of all topics.</summary>
  
| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | jobs - datasets - cache - fix - pandas | 11 | -1_jobs_datasets_cache_fix | 
| 0 | issue - viewer - dataset - for - bigsciencep3 | 534 | 0_issue_viewer_dataset_for | 
| 1 | parquet - files - metadata - parquetanddatasetinfo - configparquetandinfo | 144 | 1_parquet_files_metadata_parquetanddatasetinfo | 
| 2 | vulnerability - cryptography - dependencies - 4106 - update | 132 | 2_vulnerability_cryptography_dependencies_4106 | 
| 3 | docs - doc - page - add - md | 109 | 3_docs_doc_page_add | 
| 4 | rows - firstrows - row - truncated - response | 90 | 4_rows_firstrows_row_truncated | 
| 5 | duckdb - index - splitduckdbindex - fts - try | 78 | 5_duckdb_index_splitduckdbindex_fts | 
| 6 | hub - hubcache - timeout - datasethubcache - tags | 75 | 6_hub_hubcache_timeout_datasethubcache | 
| 7 | audio - opus - extension - torchaudio - torch | 59 | 7_audio_opus_extension_torchaudio | 
| 8 | filter - endpoint - isvalid - column - parameters | 54 | 8_filter_endpoint_isvalid_column | 
| 9 | datasets - update - upgrade - dependency - to | 54 | 9_datasets_update_upgrade_dependency | 
| 10 | docker - images - build - image - compose | 53 | 10_docker_images_build_image | 
| 11 | cache - refresh - entries - entry - warm | 51 | 11_cache_refresh_entries_entry | 
| 12 | mongo - mongodb - indexes - atlas - index | 48 | 12_mongo_mongodb_indexes_atlas | 
| 13 | image - images - modality - support - pdf2image | 47 | 13_image_images_modality_support | 
| 14 | unblock - block - blocked - blocklist - datasets | 46 | 14_unblock_block_blocked_blocklist | 
| 15 | error - expected - xerrorcode - messages - catch | 44 | 15_error_expected_xerrorcode_messages | 
| 16 | backfill - cron - job - time - move | 44 | 16_backfill_cron_job_time | 
| 17 | jobs - waiting - job - finishedat - started | 44 | 17_jobs_waiting_job_finishedat | 
| 18 | env - config - configs - vars - default | 41 | 18_env_config_configs_vars | 
| 19 | gitpython - 3137 - 3141 - github - builddepsdev | 41 | 19_gitpython_3137_3141_github | 
| 20 | assets - s3 - cachedassets - cached - fsspec | 40 | 20_assets_s3_cachedassets_cached | 
| 21 | splitnamesfromstreaming - split - streaming - rename - names | 39 | 21_splitnamesfromstreaming_split_streaming_rename | 
| 22 | statistics - stats - descriptive - splitdescriptivestatistics - class | 38 | 22_statistics_stats_descriptive_splitdescriptivestatistics | 
| 23 | private - gated - datasets - public - gatedauto | 35 | 23_private_gated_datasets_public | 
| 24 | metrics - healthcheck - port - adminmetrics - admin | 33 | 24_metrics_healthcheck_port_adminmetrics | 
| 25 | steps - processing - step - triggers - graph | 32 | 25_steps_processing_step_triggers | 
| 26 | ci - codecov - pr - fork - invalid | 31 | 26_ci_codecov_pr_fork | 
| 27 | splits - split - list - configs - returned | 31 | 27_splits_split_list_configs | 
| 28 | openapi - openapijson - spec - publish - spectral | 31 | 28_openapi_openapijson_spec_publish | 
| 29 | queue - incremental - based - field - jobs | 31 | 29_queue_incremental_based_field | 
| 30 | error - datasetwithscriptnotsupportederror - exist - no - datasetgenerationerror | 31 | 30_error_datasetwithscriptnotsupportederror_exist_no | 
| 31 | ram - 5gb - heavy - reduce - overcommitment | 31 | 31_ram_5gb_heavy_reduce | 
| 32 | workers - number - reduce - increase - heavy | 30 | 32_workers_number_reduce_increase | 
| 33 | admin - ui - app - difficulty - prefix | 30 | 33_admin_ui_app_difficulty | 
| 34 | chart - fixchart - helm - alb - featchart | 28 | 34_chart_fixchart_helm_alb | 
| 35 | aiohttp - 386 - bump - 392 - 391 | 27 | 35_aiohttp_386_bump_392 | 
| 36 | e2e - tests - test - ci - testmetrics | 27 | 36_e2e_tests_test_ci | 
| 37 | huggingfacehub - upgrade - 0151 - version - branch | 27 | 37_huggingfacehub_upgrade_0151_version | 
| 38 | test - tests - unit - pytestmemray - fixtures | 26 | 38_test_tests_unit_pytestmemray | 
| 39 | webhook - webhooks - payload - visibility - hub | 26 | 39_webhook_webhooks_payload_visibility | 
| 40 | migration - migrations - database - scripts - databases | 26 | 40_migration_migrations_database_scripts | 
| 41 | refactor - dead - code - remove - abstractions | 25 | 41_refactor_dead_code_remove | 
| 42 | retry - retryable - codes - every - createcommiterror | 25 | 42_retry_retryable_codes_every | 
| 43 | log - logs - debug - level - crashes | 25 | 43_log_logs_debug_level | 
| 44 | croissant - jsonld - fields - either - recordset | 25 | 44_croissant_jsonld_fields_either | 
| 45 | pods - pod - number - scale - reverseproxy | 24 | 45_pods_pod_number_scale | 
| 46 | scan - urls - spawning - presidio - optinouturls | 24 | 46_scan_urls_spawning_presidio | 
| 47 | resources - feat - reduce - increase - production | 22 | 47_resources_feat_reduce_increase | 
| 48 | download - manual - require - enum - extracted | 21 | 48_download_manual_require_enum | 
| 49 | comment - issues - close - fix - tag | 20 | 49_comment_issues_close_fix | 
| 50 | cache - entries - clean - hf - blocked | 19 | 50_cache_entries_clean_hf | 
| 51 | worker - generic - workerjobtypesblocked - treccartools - dependencies | 19 | 51_worker_generic_workerjobtypesblocked_treccartools | 
| 52 | datasetviewer - rename - datasetsserver - domain - server | 18 | 52_datasetviewer_rename_datasetsserver_domain | 
| 53 | across - group - pip - directories - bump | 18 | 53_across_group_pip_directories | 
| 54 | runner - runners - validation - job - parent | 18 | 54_runner_runners_validation_job | 
| 55 | upgrade - datasets - feat - 221 - 1162dev0 | 18 | 55_upgrade_datasets_feat_221 | 
| 56 | jwt - array - authorization - cookies - bypass | 18 | 56_jwt_array_authorization_cookies | 
| 57 | allow - script - scriptbased - scripts - redpajamadata1t | 17 | 57_allow_script_scriptbased_scripts | 
| 58 | unique - metrics - metric - cache - cron | 16 | 58_unique_metrics_metric_cache | 
| 59 | aiohttp - libslibcommon - libslibapi - 386 - 385 | 16 | 59_aiohttp_libslibcommon_libslibapi_386 | 
| 60 | pillow - 1001 - 1020 - bump - from | 16 | 60_pillow_1001_1020_bump | 
| 61 | storage - disk - storageclient - storageadmin - client | 15 | 61_storage_disk_storageclient_storageadmin | 
| 62 | resources - increase - 108010 - reduce - 2468 | 15 | 62_resources_increase_108010_reduce | 
| 63 | poetry - dependabot - align - version - 20 | 14 | 63_poetry_dependabot_align_version | 
| 64 | upgrade - datasets - 188 - pufanyimimicit - meaning | 14 | 64_upgrade_datasets_188_pufanyimimicit | 
| 65 | auth - authentication - asynchronous - authcheck - 307 | 14 | 65_auth_authentication_asynchronous_authcheck | 
| 66 | lock - locks - finishing - release - ttl | 14 | 66_lock_locks_finishing_release | 
| 67 | nginx - proxy - reverse - reverseproxy - 1253 | 14 | 67_nginx_proxy_reverse_reverseproxy | 
| 68 | orjson - 3915 - 390 - bump - from | 13 | 68_orjson_3915_390_bump | 
| 69 | gradio - 3340 - 4110 - frontadminui - upgrade | 13 | 69_gradio_3340_4110_frontadminui | 
| 70 | starlette - 0280 - 0362 - bump - 0231 | 13 | 70_starlette_0280_0362_bump | 
| 71 | secrets - fixs3 - correct - secret - name | 13 | 71_secrets_fixs3_correct_secret | 
| 72 | search - elastic - functionality - times - currently | 13 | 72_search_elastic_functionality_times | 
| 73 | token - hftoken - app - secret - hf | 12 | 73_token_hftoken_app_secret | 
| 74 | efs - nfs - mount - parquetmetadata - storage | 12 | 74_efs_nfs_mount_parquetmetadata | 
| 75 | ruff - vscode - 045 - settings - ruffcache | 12 | 75_ruff_vscode_045_settings | 
| 76 | kubernetes - kube - infrastructure - pdb - disruption | 12 | 76_kubernetes_kube_infrastructure_pdb |
  
</details>
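
In the table above, each label appears to be the topic ID followed by the first four keywords, joined with underscores. A minimal sketch for splitting a label back into its parts and for expressing a topic's frequency as a share of the 3066 training documents follows; `parse_label` and `share_of_corpus` are hypothetical helper names, and the parsing assumes that keywords never contain underscores themselves.

```python
from typing import List, Tuple

TRAINING_DOCUMENTS = 3066  # from the topic overview above


def parse_label(label: str) -> Tuple[int, List[str]]:
    """Split a label such as '5_duckdb_index_splitduckdbindex_fts'."""
    # Everything before the first underscore is the topic ID (possibly -1).
    topic_id, _, keywords = label.partition("_")
    return int(topic_id), keywords.split("_")


def share_of_corpus(frequency: int, total: int = TRAINING_DOCUMENTS) -> float:
    """Fraction of the training documents assigned to a topic."""
    return frequency / total


topic_id, keywords = parse_label("0_issue_viewer_dataset_for")
print(topic_id, keywords, f"{share_of_corpus(534):.1%}")
```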

## Training hyperparameters

* calculate_probabilities: False
* language: english
* low_memory: False
* min_topic_size: 10
* n_gram_range: (1, 1)
* nr_topics: None
* seed_topic_list: None
* top_n_words: 10
* verbose: False
* zeroshot_min_similarity: 0.7
* zeroshot_topic_list: None
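
The settings above correspond to keyword arguments of the `BERTopic` constructor. As a rough sketch, they could be passed back in as shown below. The card does not list the embedding model or the UMAP and HDBSCAN configuration, so this sketch falls back to library defaults for those (an assumption) and is not expected to reproduce the exact topics. `build_model` is a hypothetical helper name.

```python
HYPERPARAMETERS = {
    "calculate_probabilities": False,
    "language": "english",
    "low_memory": False,
    "min_topic_size": 10,
    "n_gram_range": (1, 1),
    "nr_topics": None,
    "seed_topic_list": None,
    "top_n_words": 10,
    "verbose": False,
    "zeroshot_min_similarity": 0.7,
    "zeroshot_topic_list": None,
}


def build_model():
    """Create an untrained BERTopic instance with the listed settings."""
    # Imported lazily so the dictionary can be inspected without bertopic installed.
    from bertopic import BERTopic

    # Sub-models (embeddings, UMAP, HDBSCAN) are left at library defaults: an assumption.
    return BERTopic(**HYPERPARAMETERS)
```

The returned model still has to be trained with `fit_transform` on a list of documents before it produces topics.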

## Framework versions

* Numpy: 1.26.4
* HDBSCAN: 0.8.38.post1
* UMAP: 0.5.6
* Pandas: 2.1.4
* Scikit-Learn: 1.5.2
* Sentence-transformers: 3.1.1
* Transformers: 4.44.2
* Numba: 0.60.0
* Plotly: 5.24.1
* Python: 3.10.12
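
When loading the model in a different environment, it can help to compare the installed packages against the versions listed above. The following is a minimal sketch using only the standard library; `PINNED` and `version_mismatches` are hypothetical names, and the keys are the usual PyPI distribution names for these libraries.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Versions copied from the list above, keyed by PyPI distribution name.
PINNED = {
    "numpy": "1.26.4",
    "hdbscan": "0.8.38.post1",
    "umap-learn": "0.5.6",
    "pandas": "2.1.4",
    "scikit-learn": "1.5.2",
    "sentence-transformers": "3.1.1",
    "transformers": "4.44.2",
    "numba": "0.60.0",
    "plotly": "5.24.1",
}


def version_mismatches(
    pinned: Dict[str, str] = PINNED,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each differing package to (expected, installed); installed is None if absent."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in version_mismatches().items():
    print(f"{name}: expected {expected}, found {installed}")
```

A mismatch does not necessarily prevent the model from loading; the output is only a starting point when debugging.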