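Multilevel Automatic Classification Combining Indexing Experience and Machine Learning
He Lin, Hou Hanqing
(School of Humanities and Social Sciences, Nanjing Agricultural University, 210095)
Abstract  The Chinese Library Classification has a very large number of classes, and documents are distributed unevenly across them, so automatic classification based on statistical machine learning struggles with multilevel classification of this kind. Automatic classification based on manual indexing experience tries to solve the problem through the principle of compatibility and interconversion among information retrieval languages, but directly matching strings of indexing terms against classes has produced a series of problems in practice. This paper attempts to classify information resources by combining the two techniques. It proposes a relatedness measure to gauge the association between keywords and class concepts, and performs classification matching by building a matrix of (keyword, class number, membership degree) triples; the method achieved fairly good results on a small-scale test set. The paper discusses in detail the construction principle, construction method, and classification workflow of this classifier, and analyzes the shortcomings of the method.
Keywords  Chinese Library Classification; classification matrix; automatic classification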
Multilevel Auto-Classification Based on Indexing Experience and Machine Learning
He Lin, Hou Hanqing
(School of Humanities and Social Sciences, Nanjing Agricultural University)
Abstract
The large number of classes in the Chinese Library Classification and the uneven distribution of documents across those classes weaken machine-learning-based automatic classification on multilevel schemes of this kind. Automatic classification based on manual indexing experience tries to solve this problem through the principle of compatibility and interconversion among information retrieval languages. However, directly matching strings of indexing terms against classes causes a series of problems in practice. This article combines the two techniques: it proposes a relatedness measure to gauge the association between keywords and class concepts, and builds a matrix of (keyword, classification number, membership degree) triples for classification matching. The approach gave fairly good results on a small-scale test set. The construction principles, construction method, and classification workflow of the classifier are discussed in detail, and the shortcomings of the method are analyzed.
Keywords
Chinese Library Classification; classification matrix; auto-classification
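
As a rough illustration of the matrix of keyword, classification number, and membership degree mentioned in the abstract, the sketch below builds such triples from manually indexed records and uses them to match a new document. The abstract does not define the relatedness measure, so the membership degree here is an assumption: the share of a keyword's occurrences that fall under a given class number. The function names, the sample records, and the class numbers are likewise hypothetical, and the multilevel (hierarchical) aspect is not modeled.

```python
from collections import Counter, defaultdict


def build_triples(indexed_records):
    """Build (keyword, class_number, membership) triples from indexed records.

    Assumption: membership is the fraction of a keyword's records that carry
    the class number. This stands in for the paper's relatedness measure,
    which the abstract does not specify.
    """
    pair_counts = Counter()
    keyword_counts = Counter()
    for keywords, class_number in indexed_records:
        for kw in set(keywords):  # count each keyword once per record
            pair_counts[(kw, class_number)] += 1
            keyword_counts[kw] += 1
    return [
        (kw, cls, n / keyword_counts[kw])
        for (kw, cls), n in pair_counts.items()
    ]


def classify(keywords, triples):
    """Sum membership degrees per class number; return the top class or None."""
    wanted = set(keywords)
    scores = defaultdict(float)
    for kw, cls, degree in triples:
        if kw in wanted:
            scores[cls] += degree
    if not scores:
        return None  # no keyword of the document appears in the matrix
    return max(scores, key=scores.get)


# Hypothetical manually indexed records: (keywords, class number).
records = [
    (["rice", "yield"], "S511"),
    (["rice", "pest"], "S435"),
    (["rice", "breeding"], "S511"),
    (["text", "classification"], "TP391"),
]
triples = build_triples(records)
best = classify(["rice", "yield"], triples)
```

In this toy data "rice" alone is ambiguous between two class numbers, and the more specific keyword "yield" tips the summed score toward one of them, which is the kind of evidence combination a keyword-to-class matrix is meant to support.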