[1] ZHANG Zilong, HU Xuanlang, NIU Linfeng, et al. Framework for Automated Generation of Fine-Tuning Instructions for Large Model in Polysemy Example Sentence Corpora Creation[J]. Journal of Huaqiao University (Natural Science), 2025, 46(3): 328-336. [doi:10.11830/ISSN.1000-5013.202411013]

Framework for Automated Generation of Fine-Tuning Instructions for Large Model in Polysemy Example Sentence Corpora Creation

Journal of Huaqiao University (Natural Science) [ISSN:1000-5013/CN:35-1079/N]

Volume:
Vol. 46
Issue:
2025, No. 3
Pages:
328-336
Publication Date:
2025-05-20

Article Info

Title:
Framework for Automated Generation of Fine-Tuning Instructions for Large Model in Polysemy Example Sentence Corpora Creation
Article Number:
1000-5013(2025)03-0328-09
Author(s):
ZHANG Zilong 1, HU Xuanlang 1, NIU Linfeng 1, HAO Yuxin 2, WANG Huazhen 1
1. School of Computer Science and Technology, Huaqiao University, Xiamen 361021, China; 2. Chinese Education Research Institute, Huaqiao University, Xiamen 361021, China
Keywords:
large language model; instruction generation; polysemy; example sentence generation; ChatGPT
CLC Number:
TP3
DOI:
10.11830/ISSN.1000-5013.202411013
Document Code:
A
Abstract:
First, a manual instruction set containing a body description set and a list of instruction examples is constructed as the initial input to the instruction pool. Then, instructions from the pool are fed into the large model to generate multiple machine instructions and their corresponding corpora, and the generated corpora undergo text correction to yield polysemy corpora that meet the requirements. Finally, an edit distance algorithm is used to deduplicate the machine instructions, and a spectral clustering algorithm is used to cluster the candidate machine instructions, thereby achieving automated generation of machine instructions. By continually updating the instruction pool, iterative generation of the polysemy example sentence corpus is realized. The results show that the constructed polysemy example sentence dataset and its corresponding large model machine instruction set exhibit good linguistic diversity and content diversity, and that the dataset meets the needs of second language learners in terms of sentence length, sentiment, graded vocabulary difficulty, and topic.
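
The deduplication and clustering steps named in the abstract can be illustrated with a short sketch. The Python code below is a minimal, hypothetical rendering, not the paper's implementation: difflib's ratio stands in for normalized edit-distance similarity, scikit-learn supplies the spectral clustering, and the 0.9 similarity threshold, the cluster count, and the `generate` callable (a wrapper around the large model) are illustrative assumptions.

# Minimal sketch of instruction deduplication, clustering, and one pool-update
# iteration as described in the abstract. All thresholds, counts, and the
# `generate` callable are illustrative assumptions, not details from the paper.
from difflib import SequenceMatcher

import numpy as np
from sklearn.cluster import SpectralClustering


def similarity(a: str, b: str) -> float:
    # Normalized string similarity in [0, 1]; acts as 1 - normalized edit distance.
    return SequenceMatcher(None, a, b).ratio()


def dedup(candidates: list[str], threshold: float = 0.9) -> list[str]:
    # Keep an instruction only if it is not a near-duplicate of one already kept.
    kept: list[str] = []
    for cand in candidates:
        if all(similarity(cand, k) < threshold for k in kept):
            kept.append(cand)
    return kept


def cluster_labels(instructions: list[str], n_clusters: int) -> np.ndarray:
    # Spectral clustering over a precomputed pairwise similarity (affinity) matrix.
    n = len(instructions)
    affinity = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            affinity[i, j] = affinity[j, i] = similarity(instructions[i], instructions[j])
    return SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed", random_state=0
    ).fit_predict(affinity)


def iterate_pool(pool: list[str], generate, n_clusters: int = 5) -> list[str]:
    # One iteration of the loop in the abstract: feed pool instructions to the
    # large model via `generate` (a user-supplied callable returning candidate
    # instructions), deduplicate, cluster, and grow the pool with one
    # representative per cluster. Pool items are listed first so dedup keeps them.
    candidates = dedup(pool + [c for seed in pool for c in generate(seed)])
    new = [c for c in candidates if c not in pool]
    if len(new) <= n_clusters:
        return pool + new
    labels = cluster_labels(new, n_clusters)
    first_per_cluster: dict[int, int] = {}
    for i, lab in enumerate(labels):
        first_per_cluster.setdefault(int(lab), i)
    return pool + [new[i] for i in sorted(first_per_cluster.values())]

In this sketch, `generate` would wrap the actual model call (for example, prompting ChatGPT with a seed instruction and parsing several candidate instructions from the reply); keeping it as an injected callable separates the deduplication and clustering logic from any particular model API.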


Memo:

Received: 2024-11-29
Corresponding author: WANG Huazhen (1978-), female, associate professor, Ph.D., working on deep learning with artificial neural networks, natural language processing, knowledge graphs, and artificial intelligence in education. E-mail: wanghuazhen@hqu.edu.cn.
Funding: 2021 International Chinese Language Education research project of the Center for Language Education and Cooperation, Ministry of Education (21YH30B)
Last Update: 2025-05-20