InsLLM / README.md
FrankRin's picture
Update README.md
81373ae verified
metadata
license: apache-2.0
datasets:
  - FrankRin/Insur-QA
language:
  - zh
base_model:
  - Qwen/Qwen1.5-14B-Chat
pipeline_tag: text-generation

This repository contains the InsLLM, version of Qwen1.5-14B-Chat as the base model.

InsQABench

InsQABench is the first large-scale specialized question-answering dataset and evaluation benchmark in the Chinese insurance sector, developed and open-sourced by the VLR Lab (Vision and Learning Representation Group) at Huazhong University of Science and Technology.

Overview

InsLLM is an intelligent insurance system equipped with capabilities for insurance-related question answering, database querying, and contract parsing. Designed for diverse user groups and application scenarios, it offers the following key features:

  • Insurance Text Processing: The system is capable of understanding and generating content related to complex professional terms and document formats specific to the insurance domain. This includes tasks like information extraction and document summarization. We have constructed fine-tuning datasets based on publicly available insurance data and real-world insurance documents.
  • Insurance Reasoning: Leveraging the SQL-ReAct method, the system can optimize and correct SQL queries based on user inputs, efficiently handling complex query tasks within insurance databases.
  • Insurance Knowledge Compliance: Equipped with the Insur-Know module, the system supports contract parsing and fact extraction enhanced by retrieval, ensuring accurate handling of complex issues in insurance contracts.

Additionally, our research offers the following contributions:

  • High-quality insurance question-answering training datasets and effective training paradigms
  • A comprehensive insurance model evaluation framework and evaluation datasets

Insur-QA Dataset

In the basic insurance knowledge section, we translated the InsuranceQA dataset to create the InsuranceQA_zh dataset.

For the insurance contract data section, we downloaded PDF insurance policies from various insurance companies available online and parsed them using the Adobe PDF Extract API. After restructuring the paragraph text from the parsed results, we used Gemini to generate QA pairs, forming <Q, A, E> triples.

The specific composition of the datasets is as follows:

Task Dataset Source Size
Basic Insurance Knowledge Q&A Training Set BX_GPT3.5 10k
Test Set Insurance_QA_zh 3k
Insurance Contract Q&A Training Set Insurance Contracts 40k
Test Set Insurance Contracts 100
Insurance Database Q&A Training Set Insurance Contracts 44k
Test Set Insurance Contracts 546

Citation

If you find our work helpful in your research, please consider citing it as follows:

@misc{
    
}

License

InsQABench is available under the Apache License.