Document classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is used mainly in information science and computer science. The problems are overlapping, however, and there is therefore also interdisciplinary research on document classification.
The documents to be classified may be texts, images, music, etc. Each kind of document poses its own classification problems. When not otherwise specified, text classification is implied.
Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: The content based approach and the request based approach.
 "Content based" versus "request based" classification
Content based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. It is, for example, a rule in much library classification that at least 20% of the content of a book should be about the class to which the book is assigned. In automatic classification the weight could be the number of times given words appear in a document.
Request oriented classification (or -indexing) is classification in which the anticipated requests from users influence how documents are classified. The classifier asks: “Under which descriptors should this entity be found?” and “think of all the possible queries and decide for which ones the entity at hand is relevant” (Soergel, 1985, p. 230).
Request oriented classification may be classification that is targeted towards a particular audience or user group. For example, a library or a database for feminist studies may classify or index documents differently from a historical library. It is probably better, however, to understand request oriented classification as policy based classification: the classification is done according to some ideals and reflects the purpose of the library or database doing the classification. In this way it is not necessarily a kind of classification or indexing based on user studies. Only if empirical data about use or users are applied should request oriented classification be regarded as a user-based approach.
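The word-count idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class vocabularies in `CLASS_TERMS` are hypothetical, tokenization is a plain whitespace split, and the 0.2 threshold is borrowed loosely from the 20% library rule as an analogy (the rule itself concerns the share of a work's content, not the share of its tokens).

```python
from collections import Counter

# Hypothetical class vocabularies; a real system would learn or curate these.
CLASS_TERMS = {
    "astronomy": {"star", "planet", "orbit", "telescope"},
    "cooking": {"recipe", "oven", "flour", "simmer"},
}


def subject_weights(text):
    """Return, per class, the share of tokens found in that class's vocabulary."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {label: 0.0 for label in CLASS_TERMS}
    return {
        label: sum(counts[term] for term in terms) / total
        for label, terms in CLASS_TERMS.items()
    }


def classify_by_content(text, threshold=0.2):
    """Assign every class whose share of the text reaches the threshold."""
    weights = subject_weights(text)
    return sorted(label for label, weight in weights.items() if weight >= threshold)


doc = "the telescope tracked the planet along its orbit past a distant star"
assigned = classify_by_content(doc)
```

Because every class that clears the threshold is returned, a document can be assigned to more than one class, or to none.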
 Classification versus indexing
Sometimes a distinction is made between assigning documents to classes ("classification") and assigning subjects to documents ("subject indexing"), but as Frederick Wilfrid Lancaster has argued, this distinction is not fruitful. “These terminological distinctions,” he writes, “are quite meaningless and only serve to cause confusion” (Lancaster, 2003, p. 21). The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a thesaurus and vice versa (cf. Aitchison, 1986, 2004; Broughton, 2008; Riesthuis & Bliedung, 1991). Therefore, the act of labeling a document (say, by assigning it a term from a controlled vocabulary) is at the same time the act of assigning that document to the class of documents indexed by that term (all documents indexed or classified as X belong to the same class of documents).
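The equivalence between the two views can be illustrated with a small Python sketch. The document identifiers and terms below are hypothetical; the point is only that a document-to-terms index and a term-to-documents set of classes carry the same information and can be derived from one another.

```python
from collections import defaultdict

# Hypothetical subject index: document -> assigned controlled-vocabulary terms.
subject_index = {
    "doc1": {"thesauri", "classification"},
    "doc2": {"classification"},
    "doc3": {"thesauri"},
}


def invert(mapping):
    """Swap keys and members: {a: {x, y}} becomes {x: {a}, y: {a}}."""
    inverted = defaultdict(set)
    for key, members in mapping.items():
        for member in members:
            inverted[member].add(key)
    return dict(inverted)


# "Subject indexing" view turned into the "classification" view ...
classes = invert(subject_index)
# ... and back again, recovering the original index.
round_trip = invert(classes)
```

Inverting twice returns the original index as long as every document carries at least one term.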
 Automatic document classification
Automatic document classification tasks can be divided into three sorts: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification for documents; unsupervised document classification (also known as document clustering), where the classification must be done entirely without reference to external information; and semi-supervised document classification, where only some of the documents are labeled by the external mechanism.
Automatic document classification techniques include:
- Expectation maximization (EM)
- Naive Bayes classifier
- Latent semantic indexing
- Support vector machines (SVM)
- Artificial neural network
- K-nearest neighbour algorithms
- Decision trees such as ID3 or C4.5
- Concept Mining
- Rough set based classifier
- Soft set based classifier
- Multiple-instance learning
- Natural language processing approaches
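As an illustration of one technique from the list above, the following is a minimal sketch of a multinomial Naive Bayes text classifier in plain Python. The class name `NaiveBayesTextClassifier`, the whitespace tokenizer, the add-one (Laplace) smoothing, and the four toy training documents are all assumptions made for the example, not a reference implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one smoothing (illustrative sketch)."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)  # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            # log P(class) + sum of log P(word | class), smoothed by one.
            score = math.log(doc_count / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy labeled data (hypothetical), i.e. the supervised setting.
train_docs = [
    "win cash prize now",
    "cheap cash offer",
    "meeting agenda for monday",
    "lunch meeting tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayesTextClassifier().fit(train_docs, train_labels)
prediction = model.predict("cash prize offer")
```

Log-probabilities are summed rather than multiplying raw probabilities, which avoids numerical underflow on longer documents.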
Classification techniques have been applied to:
- spam filtering, a process which tries to discern E-mail spam messages from legitimate emails
- topic spotting, automatically determining the topic of a text
- language identification, automatically determining the language of a text
- genre classification, automatically determining the genre of a text
- readability assessment, automatically determining the degree of readability of a text, either to find suitable materials for different age groups or reader types or as part of a larger text simplification system
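To make one of these applications concrete, here is a deliberately simple sketch of language identification that counts how many tokens of a text appear in small per-language stopword lists. The `STOPWORDS` table and the function name are hypothetical, and practical systems typically rely on richer features such as character n-grams; this only shows the shape of the task.

```python
# Hypothetical, very small stopword lists; real systems use far larger resources.
STOPWORDS = {
    "english": {"the", "and", "of", "is", "to", "in"},
    "german": {"der", "die", "und", "ist", "das", "nicht"},
    "french": {"le", "la", "et", "est", "les", "des"},
}


def identify_language(text):
    """Return the language whose stopwords occur most often, or None if none occur."""
    tokens = text.lower().split()
    scores = {
        language: sum(token in words for token in tokens)
        for language, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


guess = identify_language("der Hund ist nicht im Garten und die Katze")
```

Returning `None` when no stopword matches keeps the sketch from guessing a language for text it has no evidence about.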
 Further reading
- Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.
- Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, 2010.
- Introduction to document classification
- Bibliography on Automated Text Categorization
- Bibliography on Query Classification
- Text Classification analysis page
- Learning to Classify Text - Chap. 6 of the book Natural Language Processing with Python (available online)
- Library of Congress (2008). The subject headings manual. Washington, DC.: Library of Congress, Policy and Standards Division. (Sheet H 180: "Assign headings only for topics that comprise at least 20% of the work.")
- Soergel, Dagobert (1985). Organizing information: Principles of data base and retrieval systems. Orlando, FL: Academic Press.
- Lancaster, F. W. (2003). Indexing and abstracting in theory and practice. Library Association, London.
- Aitchison, J. (1986). “A classification as a source for thesaurus: The Bibliographic Classification of H. E. Bliss as a source of thesaurus terms and structure.” Journal of Documentation, Vol. 42 No. 3, pp. 160-181.
- Aitchison, J. (2004). “Thesauri from BC2: Problems and possibilities revealed in an experimental thesaurus derived from the Bliss Music schedule.” Bliss Classification Bulletin, Vol. 46, pp. 20-26.
- Broughton, V. (2008). “A faceted classification as the basis of a faceted terminology: Conversion of a classified structure to thesaurus format in the Bliss Bibliographic Classification (2nd Ed.).” Axiomathes, Vol. 18 No.2, pp. 193-210.
- Riesthuis, G. J. A., & Bliedung, St. (1991). “Thesaurification of the UDC.” Tools for knowledge organization and the human interface, Vol. 2, pp. 109-117. Index Verlag, Frankfurt.
- Stephan Busemann, Sven Schmeier and Roman G. Arens (2000). Message classification in the call center. In Sergei Nirenburg, Douglas Appelt, Fabio Ciravegna and Robert Dale, eds., Proc. 6th Applied Natural Language Processing Conf. (ANLP'00), pp. 158-165, ACL.
- Santini, Marina; Rosso, Mark (2008), Testing a Genre-Enabled Application: A Preliminary Assessment, BCS IRSG Symposium: Future Directions in Information Access, London, UK, pp. 54–63, http://www.bcs.org/upload/pdf/ewic_fd08_paper7.pdf