---
license: cc-by-nc-sa-4.0
language:
- zh
pipeline_tag: summarization
tags:
- mT5
- summarization
---

# HeackMT5-ZhSum100k: A Summarization Model for Chinese Texts

This model, `heack/HeackMT5-ZhSum100k`, is an mT5 model fine-tuned for Chinese text summarization. It was trained on a diverse set of Chinese datasets and generates coherent, concise summaries for a wide range of texts.

## Model Details

- Model: mT5
- Language: Chinese
- Training data: mainly Chinese financial news sources, with no BBC or CNN material. The training data contains 1M lines.
- Fine-tuning epochs: 10

## Evaluation Results

The model achieved the following results:

- ROUGE-1: 56.46
- ROUGE-2: 45.81
- ROUGE-L: 52.98
- ROUGE-Lsum: 20.22

## Usage

Here is how you can use this model for text summarization:

```python
from transformers import MT5ForConditionalGeneration, T5Tokenizer

model = MT5ForConditionalGeneration.from_pretrained("heack/HeackMT5-ZhSum100k")
tokenizer = T5Tokenizer.from_pretrained("heack/HeackMT5-ZhSum100k")

chunk = """
财联社5月22日讯,据平安包头微信公众号消息,近日,包头警方发布一起利用人工智能(AI)实施电信诈骗的典型案例,福州市某科技公司法人代表郭先生10分钟内被骗430万元。
4月20日中午,郭先生的好友突然通过微信视频联系他,自己的朋友在外地竞标,需要430万保证金,且需要公对公账户过账,想要借郭先生公司的账户走账。
基于对好友的信任,加上已经视频聊天核实了身份,郭先生没有核实钱款是否到账,就分两笔把430万转到了好友朋友的银行卡上。郭先生拨打好友电话,才知道被骗。骗子通过智能AI换脸和拟声技术,佯装好友对他实施了诈骗。
值得注意的是,骗子并没有使用一个仿真的好友微信添加郭先生为好友,而是直接用好友微信发起视频聊天,这也是郭先生被骗的原因之一。骗子极有可能通过技术手段盗用了郭先生好友的微信。幸运的是,接到报警后,福州、包头两地警银迅速启动止付机制,成功止付拦截336.84万元,但仍有93.16万元被转移,目前正在全力追缴中。
"""

# The "summarize: " prefix is prepended to the text; input is truncated to 512 tokens.
inputs = tokenizer.encode("summarize: " + chunk, return_tensors='pt', max_length=512, truncation=True)
summary_ids = model.generate(inputs, max_length=150, num_beams=4, length_penalty=1.5, no_repeat_ngram_size=2)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(summary)
# Output:
# 包头警方发布一起利用AI实施电信诈骗典型案例:法人代表10分钟内被骗430万元
```

## Summarizing longer texts

The example above truncates its input at 512 tokens. To get a longer summary of a longer text, split the text into chunks of about 300 characters at punctuation boundaries, summarize each chunk, and join the results:

```python
from transformers import MT5ForConditionalGeneration, T5Tokenizer

model_heack = MT5ForConditionalGeneration.from_pretrained("heack/HeackMT5-ZhSum100k")
tokenizer_heack = T5Tokenizer.from_pretrained("heack/HeackMT5-ZhSum100k")

# Punctuation at which a chunk may end (ASCII and full-width period and comma).
_BREAK_CHARS = {'.', '。', ',', ','}


def _split_text(text, length):
    """Split text into chunks of roughly `length` characters, cutting at nearby punctuation."""
    chunks = []
    start = 0
    while start < len(text):
        if len(text) - start > length:
            pos_forward = start + length
            pos_backward = start + length
            pos = start + length
            # Search up to 20 characters in both directions for a break character.
            while ((pos_forward < len(text)) and (pos_backward >= 0)
                   and (pos_forward < 20 + pos) and (pos_backward + 20 > pos)
                   and text[pos_forward] not in _BREAK_CHARS
                   and text[pos_backward] not in _BREAK_CHARS):
                pos_forward += 1
                pos_backward -= 1
            if pos_forward - pos >= 20 and pos_backward <= pos - 20:
                # No break character within the window: cut at the nominal position.
                pos = start + length
            elif text[pos_backward] in _BREAK_CHARS:
                pos = pos_backward
            else:
                pos = pos_forward
            chunks.append(text[start:pos + 1])
            start = pos + 1
        else:
            chunks.append(text[start:])
            break
    # Combine last chunk with previous one if it's too short
    if len(chunks) > 1 and len(chunks[-1]) < 100:
        chunks[-2] += chunks[-1]
        chunks.pop()
    return chunks


def get_summary_heack(text, each_summary_length=150):
    """Summarize each ~300-character chunk and join the partial summaries with spaces."""
    chunks = _split_text(text, 300)
    summaries = []
    for chunk in chunks:
        inputs = tokenizer_heack.encode("summarize: " + chunk, return_tensors='pt', max_length=512, truncation=True)
        summary_ids = model_heack.generate(inputs, max_length=each_summary_length, num_beams=4, length_penalty=1.5, no_repeat_ngram_size=2)
        summary = tokenizer_heack.decode(summary_ids[0], skip_special_tokens=True)
        summaries.append(summary)
    return " ".join(summaries)
```

## Credits

This model is trained and maintained by KongYang from Shanghai Jiao Tong University. For any questions, please reach out to me at my WeChat ID: kongyang.

**License Agreement**

---

To sustain the open-source ecosystem and ensure that developers can keep improving model quality, we establish these terms:

## Definitions

**"Derivative Works"** refers to any variant derived directly or indirectly from this model through technical means such as quantization, pruning, distillation, or architectural modification, including but not limited to:

- Quantized format conversions (GGUF/GGML, etc.)
- Lightweight models obtained via knowledge distillation
- Architectural modifications based on this model's parameters (e.g., layer count changes, attention mechanism alterations)

1. **Data & Training Costs**

   Training a high-quality AI model consumes enormous resources:

   - Data cleaning and annotation account for over 60% of total project investment. All data come from **domestic compliant sources**, avoiding the distorted "hallucinated translations" of Chinese context by international media (such as the BBC).
   - This project commits to neutral, objective training data, with the aim of promoting technological inclusivity, human understanding, and mutual learning among civilizations.

2. **Commercial License**

   **Non-commercial use**: **Free**

   **For commercial applications** (including enterprise products/services):

   | Enterprise Type | Perpetual License Fee (CNY) |
   |------------|------------|
   | Startups or individuals (annual revenue < ¥1M) | 1,000 |
   | Mid-sized enterprises (non-listed, annual revenue ≥ ¥1M) | 5,000 |
   | Listed companies | 20,000 |

   - After you pay by QR code, your Hugging Face account is granted commercial usage rights
   - Each organization may bind only 1 primary account

   **Commercial authorization includes:** commercial use of derivative works, regardless of format conversions or architectural modifications

   **Payment method**: Alipay / WeChat Pay QR code

3. **Raw Data Access**

   For the uncleaned raw training datasets (including multimodal collections), pay **5000 CNY** via the QR code above and contact support@opentech.cn by email or kongyang on WeChat.

---

**Our Belief: Ethical Tech Thrives Through Open Collaboration**

## WeChat ID

kongyang

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{kongyang2023heackmt5zhsum100k,
  title={HeackMT5-ZhSum100k: A Large-Scale Multilingual Abstractive Summarization for Chinese Texts},
  author={Kong Yang},
  year={2023}
}
```