---
tags:
- bertopic
library_name: bertopic
pipeline_tag: text-classification
---
# bertopic_github_dataset_viewer_issues
This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
BERTopic is a flexible, modular topic modeling framework that generates easily interpretable topics from large datasets.
## Usage
To use this model, please install BERTopic:
```
pip install -U bertopic
```
You can use the model as follows:
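```python
from bertopic import BERTopic

# Download the saved model from the Hugging Face Hub and load it.
topic_model = BERTopic.load("asoria/bertopic_github_dataset_viewer_issues")

# Return a pandas DataFrame with one summary row per topic.
topic_model.get_topic_info()
```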
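Since the card is tagged `text-classification`, a natural follow-up is assigning topics to new text. The sketch below is a minimal illustration, not a tested recipe for this specific model: it assumes the saved model can resolve its embedding model on load (otherwise `BERTopic.load` accepts an `embedding_model` argument). The helper name `assign_topics` and the sample issue title are hypothetical.

```python
def assign_topics(docs, model_id="asoria/bertopic_github_dataset_viewer_issues"):
    """Return a (topic_id, label) pair for each document in `docs`."""
    # Imported lazily so the helper can be defined without bertopic installed.
    from bertopic import BERTopic

    topic_model = BERTopic.load(model_id)
    # transform() returns the predicted topic IDs and the probabilities (if any).
    topics, _ = topic_model.transform(docs)
    # Look up each topic's label in the "Name" column of the topic summary.
    names = topic_model.get_topic_info().set_index("Topic")["Name"]
    return [(int(topic), names.get(int(topic))) for topic in topics]


# Example call with a hypothetical issue title (needs bertopic and network access):
# assign_topics(["Upgrade duckdb and rebuild the full-text search index"])
```

A result of `-1` means the document was treated as an outlier rather than matched to one of the numbered topics.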
## Topic overview
* Number of topics: 78 (including the outlier topic `-1`)
* Number of training documents: 3066
<details>
<summary>Click here for an overview of all topics.</summary>
| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | jobs - datasets - cache - fix - pandas | 11 | -1_jobs_datasets_cache_fix |
| 0 | issue - viewer - dataset - for - bigsciencep3 | 534 | 0_issue_viewer_dataset_for |
| 1 | parquet - files - metadata - parquetanddatasetinfo - configparquetandinfo | 144 | 1_parquet_files_metadata_parquetanddatasetinfo |
| 2 | vulnerability - cryptography - dependencies - 4106 - update | 132 | 2_vulnerability_cryptography_dependencies_4106 |
| 3 | docs - doc - page - add - md | 109 | 3_docs_doc_page_add |
| 4 | rows - firstrows - row - truncated - response | 90 | 4_rows_firstrows_row_truncated |
| 5 | duckdb - index - splitduckdbindex - fts - try | 78 | 5_duckdb_index_splitduckdbindex_fts |
| 6 | hub - hubcache - timeout - datasethubcache - tags | 75 | 6_hub_hubcache_timeout_datasethubcache |
| 7 | audio - opus - extension - torchaudio - torch | 59 | 7_audio_opus_extension_torchaudio |
| 8 | filter - endpoint - isvalid - column - parameters | 54 | 8_filter_endpoint_isvalid_column |
| 9 | datasets - update - upgrade - dependency - to | 54 | 9_datasets_update_upgrade_dependency |
| 10 | docker - images - build - image - compose | 53 | 10_docker_images_build_image |
| 11 | cache - refresh - entries - entry - warm | 51 | 11_cache_refresh_entries_entry |
| 12 | mongo - mongodb - indexes - atlas - index | 48 | 12_mongo_mongodb_indexes_atlas |
| 13 | image - images - modality - support - pdf2image | 47 | 13_image_images_modality_support |
| 14 | unblock - block - blocked - blocklist - datasets | 46 | 14_unblock_block_blocked_blocklist |
| 15 | error - expected - xerrorcode - messages - catch | 44 | 15_error_expected_xerrorcode_messages |
| 16 | backfill - cron - job - time - move | 44 | 16_backfill_cron_job_time |
| 17 | jobs - waiting - job - finishedat - started | 44 | 17_jobs_waiting_job_finishedat |
| 18 | env - config - configs - vars - default | 41 | 18_env_config_configs_vars |
| 19 | gitpython - 3137 - 3141 - github - builddepsdev | 41 | 19_gitpython_3137_3141_github |
| 20 | assets - s3 - cachedassets - cached - fsspec | 40 | 20_assets_s3_cachedassets_cached |
| 21 | splitnamesfromstreaming - split - streaming - rename - names | 39 | 21_splitnamesfromstreaming_split_streaming_rename |
| 22 | statistics - stats - descriptive - splitdescriptivestatistics - class | 38 | 22_statistics_stats_descriptive_splitdescriptivestatistics |
| 23 | private - gated - datasets - public - gatedauto | 35 | 23_private_gated_datasets_public |
| 24 | metrics - healthcheck - port - adminmetrics - admin | 33 | 24_metrics_healthcheck_port_adminmetrics |
| 25 | steps - processing - step - triggers - graph | 32 | 25_steps_processing_step_triggers |
| 26 | ci - codecov - pr - fork - invalid | 31 | 26_ci_codecov_pr_fork |
| 27 | splits - split - list - configs - returned | 31 | 27_splits_split_list_configs |
| 28 | openapi - openapijson - spec - publish - spectral | 31 | 28_openapi_openapijson_spec_publish |
| 29 | queue - incremental - based - field - jobs | 31 | 29_queue_incremental_based_field |
| 30 | error - datasetwithscriptnotsupportederror - exist - no - datasetgenerationerror | 31 | 30_error_datasetwithscriptnotsupportederror_exist_no |
| 31 | ram - 5gb - heavy - reduce - overcommitment | 31 | 31_ram_5gb_heavy_reduce |
| 32 | workers - number - reduce - increase - heavy | 30 | 32_workers_number_reduce_increase |
| 33 | admin - ui - app - difficulty - prefix | 30 | 33_admin_ui_app_difficulty |
| 34 | chart - fixchart - helm - alb - featchart | 28 | 34_chart_fixchart_helm_alb |
| 35 | aiohttp - 386 - bump - 392 - 391 | 27 | 35_aiohttp_386_bump_392 |
| 36 | e2e - tests - test - ci - testmetrics | 27 | 36_e2e_tests_test_ci |
| 37 | huggingfacehub - upgrade - 0151 - version - branch | 27 | 37_huggingfacehub_upgrade_0151_version |
| 38 | test - tests - unit - pytestmemray - fixtures | 26 | 38_test_tests_unit_pytestmemray |
| 39 | webhook - webhooks - payload - visibility - hub | 26 | 39_webhook_webhooks_payload_visibility |
| 40 | migration - migrations - database - scripts - databases | 26 | 40_migration_migrations_database_scripts |
| 41 | refactor - dead - code - remove - abstractions | 25 | 41_refactor_dead_code_remove |
| 42 | retry - retryable - codes - every - createcommiterror | 25 | 42_retry_retryable_codes_every |
| 43 | log - logs - debug - level - crashes | 25 | 43_log_logs_debug_level |
| 44 | croissant - jsonld - fields - either - recordset | 25 | 44_croissant_jsonld_fields_either |
| 45 | pods - pod - number - scale - reverseproxy | 24 | 45_pods_pod_number_scale |
| 46 | scan - urls - spawning - presidio - optinouturls | 24 | 46_scan_urls_spawning_presidio |
| 47 | resources - feat - reduce - increase - production | 22 | 47_resources_feat_reduce_increase |
| 48 | download - manual - require - enum - extracted | 21 | 48_download_manual_require_enum |
| 49 | comment - issues - close - fix - tag | 20 | 49_comment_issues_close_fix |
| 50 | cache - entries - clean - hf - blocked | 19 | 50_cache_entries_clean_hf |
| 51 | worker - generic - workerjobtypesblocked - treccartools - dependencies | 19 | 51_worker_generic_workerjobtypesblocked_treccartools |
| 52 | datasetviewer - rename - datasetsserver - domain - server | 18 | 52_datasetviewer_rename_datasetsserver_domain |
| 53 | across - group - pip - directories - bump | 18 | 53_across_group_pip_directories |
| 54 | runner - runners - validation - job - parent | 18 | 54_runner_runners_validation_job |
| 55 | upgrade - datasets - feat - 221 - 1162dev0 | 18 | 55_upgrade_datasets_feat_221 |
| 56 | jwt - array - authorization - cookies - bypass | 18 | 56_jwt_array_authorization_cookies |
| 57 | allow - script - scriptbased - scripts - redpajamadata1t | 17 | 57_allow_script_scriptbased_scripts |
| 58 | unique - metrics - metric - cache - cron | 16 | 58_unique_metrics_metric_cache |
| 59 | aiohttp - libslibcommon - libslibapi - 386 - 385 | 16 | 59_aiohttp_libslibcommon_libslibapi_386 |
| 60 | pillow - 1001 - 1020 - bump - from | 16 | 60_pillow_1001_1020_bump |
| 61 | storage - disk - storageclient - storageadmin - client | 15 | 61_storage_disk_storageclient_storageadmin |
| 62 | resources - increase - 108010 - reduce - 2468 | 15 | 62_resources_increase_108010_reduce |
| 63 | poetry - dependabot - align - version - 20 | 14 | 63_poetry_dependabot_align_version |
| 64 | upgrade - datasets - 188 - pufanyimimicit - meaning | 14 | 64_upgrade_datasets_188_pufanyimimicit |
| 65 | auth - authentication - asynchronous - authcheck - 307 | 14 | 65_auth_authentication_asynchronous_authcheck |
| 66 | lock - locks - finishing - release - ttl | 14 | 66_lock_locks_finishing_release |
| 67 | nginx - proxy - reverse - reverseproxy - 1253 | 14 | 67_nginx_proxy_reverse_reverseproxy |
| 68 | orjson - 3915 - 390 - bump - from | 13 | 68_orjson_3915_390_bump |
| 69 | gradio - 3340 - 4110 - frontadminui - upgrade | 13 | 69_gradio_3340_4110_frontadminui |
| 70 | starlette - 0280 - 0362 - bump - 0231 | 13 | 70_starlette_0280_0362_bump |
| 71 | secrets - fixs3 - correct - secret - name | 13 | 71_secrets_fixs3_correct_secret |
| 72 | search - elastic - functionality - times - currently | 13 | 72_search_elastic_functionality_times |
| 73 | token - hftoken - app - secret - hf | 12 | 73_token_hftoken_app_secret |
| 74 | efs - nfs - mount - parquetmetadata - storage | 12 | 74_efs_nfs_mount_parquetmetadata |
| 75 | ruff - vscode - 045 - settings - ruffcache | 12 | 75_ruff_vscode_045_settings |
| 76 | kubernetes - kube - infrastructure - pdb - disruption | 12 | 76_kubernetes_kube_infrastructure_pdb |
</details>
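The labels in the table appear to follow one pattern: the topic ID followed by the first four keywords, joined with underscores. The sketch below reproduces that pattern and computes each topic's share of the 3066 training documents. The function names `default_label` and `share` are illustrative and not part of BERTopic.

```python
def default_label(topic_id, keywords, n_words=4):
    """Join the topic ID and the first `n_words` keywords with underscores."""
    return "_".join([str(topic_id), *keywords[:n_words]])


def share(frequency, total_documents=3066):
    """Fraction of the training documents assigned to a topic."""
    return frequency / total_documents


# Rows copied from the topic table: (topic ID, keywords, frequency).
rows = [
    (-1, ["jobs", "datasets", "cache", "fix", "pandas"], 11),
    (0, ["issue", "viewer", "dataset", "for", "bigsciencep3"], 534),
    (5, ["duckdb", "index", "splitduckdbindex", "fts", "try"], 78),
]

for topic_id, keywords, frequency in rows:
    print(default_label(topic_id, keywords), f"{share(frequency):.1%}")
```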
## Training hyperparameters
* calculate_probabilities: False
* language: english
* low_memory: False
* min_topic_size: 10
* n_gram_range: (1, 1)
* nr_topics: None
* seed_topic_list: None
* top_n_words: 10
* verbose: False
* zeroshot_min_similarity: 0.7
* zeroshot_topic_list: None
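The listed values correspond to keyword arguments of the `BERTopic` constructor, so a fresh, unfitted model with the same settings can be sketched as follows. This assumes a BERTopic release recent enough to accept the zero-shot arguments. The card does not list the embedding model or the UMAP and HDBSCAN settings, so this would not necessarily reproduce the published topics. The names `HYPERPARAMETERS` and `build_untrained_model` are illustrative.

```python
# Values copied from the hyperparameter list in this card.
HYPERPARAMETERS = {
    "calculate_probabilities": False,
    "language": "english",
    "low_memory": False,
    "min_topic_size": 10,
    "n_gram_range": (1, 1),
    "nr_topics": None,
    "seed_topic_list": None,
    "top_n_words": 10,
    "verbose": False,
    "zeroshot_min_similarity": 0.7,
    "zeroshot_topic_list": None,
}


def build_untrained_model():
    """Create a fresh, unfitted BERTopic instance with the settings above."""
    # Imported lazily; requires `pip install -U bertopic`.
    from bertopic import BERTopic

    return BERTopic(**HYPERPARAMETERS)
```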
## Framework versions
* Numpy: 1.26.4
* HDBSCAN: 0.8.38.post1
* UMAP: 0.5.6
* Pandas: 2.1.4
* Scikit-Learn: 1.5.2
* Sentence-transformers: 3.1.1
* Transformers: 4.44.2
* Numba: 0.60.0
* Plotly: 5.24.1
* Python: 3.10.12
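When loading the model in a different environment, it can help to compare installed package versions against the ones listed here. The sketch below uses only the standard library. The dictionary keys are the assumed PyPI distribution names for each framework, and `find_mismatches` is an illustrative helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from this card; keys are assumed PyPI distribution names.
EXPECTED = {
    "numpy": "1.26.4",
    "hdbscan": "0.8.38.post1",
    "umap-learn": "0.5.6",
    "pandas": "2.1.4",
    "scikit-learn": "1.5.2",
    "sentence-transformers": "3.1.1",
    "transformers": "4.44.2",
    "numba": "0.60.0",
    "plotly": "5.24.1",
}


def find_mismatches(expected=EXPECTED):
    """Return {package: (expected, installed)} for every differing package."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


for package, (wanted, installed) in find_mismatches().items():
    print(f"{package}: card lists {wanted}, found {installed or 'not installed'}")
```

A mismatch does not necessarily mean the model will fail to load. It only flags where the environment differs from the one used for training.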