
    Department of Information Management, National Central University, Jhongli 32001, Taiwan

    Received 26 August 2012; Accepted 19 September 2012

    Academic Editors: F. Camastra, J. A. Hernandez, P. Kokol, J. Wang, and S. Zhu

    Copyright © 2012 Chih-Fong Tsai. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    Abstract

    Content-based image retrieval (CBIR) systems require users to query images by their low-level visual content; this not only makes it hard for users to formulate queries, but can also lead to unsatisfactory retrieval results. To address this problem, image annotation was proposed. The aim of image annotation is to automatically assign keywords to images, so that image retrieval users can query images by keywords. Image annotation can be regarded as an image classification problem: images are represented by some low-level features, and supervised learning techniques are used to learn the mapping between the low-level features and high-level concepts (i.e., class labels). One of the most widely used feature representation methods is bag-of-words (BoW). This paper reviews related works on improving and/or applying BoW for image annotation. Moreover, many recent works (from 2006 to 2012) are compared in terms of the methodology of BoW feature generation and experimental design. In addition, several issues in using BoW are discussed, and some important directions for future research are identified.

    1. Introduction

    Advances in computer and multimedia technologies allow for the production of digital images and large repositories for image storage at little cost. This has led to a rapid increase in the size of image collections, including digital libraries, medical imaging, art and museum collections, journalism, advertising, home photo archives, and so forth. As a result, it is necessary to design image retrieval systems which can operate on a large scale. The main goal is to create, manage, and query image databases in an efficient and effective (i.e., accurate) manner.

    Content-based image retrieval (CBIR), which was proposed in the early 1990s, is a technique to automatically index images by extracting their (low-level) visual features, such as color, texture, and shape, and the retrieval of images is based solely upon the indexed image features [1–3]. Therefore, it is hypothesized that relevant images can be retrieved by calculating the similarity between the low-level image contents through browsing, navigation, query-by-example, and so forth. Typically, images are represented as points in a high dimensional feature space. Then, a metric is used to measure similarity or dissimilarity between images in this space. Thus, images close to the query in this space are considered similar to it and are retrieved. Although CBIR introduced automated image feature extraction and indexing, it does not overcome the so-called semantic gap described below.

    The semantic gap is the gap between the low-level features extracted and indexed by computers and the high-level concepts (or semantics) of users' queries. That is, automated CBIR systems cannot readily match users' requests. The notion of similarity in the user's mind is typically based on high-level abstractions, such as activities, entities/objects, events, or some evoked emotions, among others. Therefore, retrieval by similarity using low-level features like color or shape will not be very effective. In other words, human similarity judgments do not obey the requirements of the similarity metric used in CBIR systems. In addition, general users usually find it difficult to search or query images by using color, texture, and/or shape features directly. They usually prefer textual or keyword-based queries, since these are easier and more intuitive for representing their information needs [4–6]. However, it is very challenging to make computers capable of understanding or extracting high-level concepts from images as humans do.

    Consequently, the semantic gap problem has been approached by automatic image annotation. In automatic image annotation, computers are able to learn which low-level features correspond to which high-level concepts. Specifically, the aim of image annotation is to make the computers extract meanings from the low-level features by a learning process based on a given set of training data which includes pairs of low-level features and their corresponding concepts. Then, the computers can assign the learned keywords to images automatically. For the review of image annotation, please refer to Tsai and Hung [7], Hanbury [8], and Zhang et al. [9].

    Image annotation can be defined as the process of automatically assigning keywords to images. It can be regarded as an automatic classification of images by labeling images into one of a number of predefined classes or categories, where classes have assigned keywords or labels which can describe the conceptual content of images in that class. Therefore, the image annotation problem can be thought of as image classification or categorization.

    More specifically, image classification can be divided into object categorization [10] and scene classification. For example, object categorization focuses on classifying images into “concrete” categories, such as “agate”, “car”, “dog”, and so on. On the other hand, scene classification can be regarded as abstract keyword based image annotation [11, 12], where scene categories, such as “harbor”, “building”, and “sunset”, can be regarded as assemblages of multiple physical or entity objects treated as a single entity. The difference between object recognition/categorization and scene classification was defined by Quelhas et al. [13].

    However, image annotation performance is heavily dependent on image feature representation. Recently, the bag-of-words (BoW) or bag-of-visual-words model, a well-known and popular feature representation method for document representation in information retrieval, was first applied to the field of image and video retrieval by Sivic and Zisserman [14]. Moreover, BoW has generally shown promising performance for image annotation and retrieval tasks [15–22].

    The BoW feature is usually based on tokenizing keypoint-based features, for example, scale-invariant feature transform (SIFT) [23], to generate a visual-word vocabulary (or codebook). Then, the visual-word vector of an image records, for each visual word in the vocabulary, either its presence or absence in the image or its frequency, that is, the number of keypoints assigned to the corresponding cluster (visual word).

    Since 2003, BoW has been used extensively in image annotation, but there has not as yet been any comprehensive review of this topic. Therefore, the aim of this paper is to review the work of using BoW for image annotation from 2006 to 2012.

    The rest of this paper is organized as follows. Section 2 describes the process of extracting the BoW feature for image representation and annotation. Section 3 discusses some important extensions of BoW, including improvements of BoW per se and its application to other related research problems. Section 4 provides some comparisons of related work in terms of the methodology of constructing the BoW feature, including the detection method, the clustering algorithm, the number of visual words, and so forth, and the experimental setup, including the datasets used, the number of object or scene categories, and so forth. Finally, Section 5 concludes the paper.

    2. Bag-of-Words Representation

    The bag-of-words (BoW) methodology was first proposed in the text retrieval domain for text document analysis, and it was later adapted for computer vision applications [24]. For image analysis, a visual analogue of a word is used in the BoW model, which is based on the vector quantization process of clustering low-level visual features of local regions or points, such as color, texture, and so forth.

    Extracting the BoW feature from images involves the following steps: (i) automatically detect regions/points of interest, (ii) compute local descriptors over those regions/points, (iii) quantize the descriptors into words to form the visual vocabulary, and (iv) find the occurrences in the image of each specific word in the vocabulary in order to construct the BoW feature (a histogram of word frequencies) [24]. Figure 1 illustrates these four steps for extracting the BoW feature from images.

    Figure 1: Four steps for constructing the bag-of-words for image representation.

    The BoW model can be defined as follows. Given a training dataset $D$ containing $N$ images, represented by $D = \{d_1, d_2, \ldots, d_N\}$, where $d_j$ denotes the visual features extracted from the $j$th image, an unsupervised learning algorithm, such as $k$-means, is used to group the features into a fixed number of visual words (or categories) $W = \{w_1, w_2, \ldots, w_V\}$, where $V$ is the number of clusters. Then, we can summarize the data in a co-occurrence table of counts $N_{ij} = n(w_i, d_j)$, where $n(w_i, d_j)$ denotes how often the word $w_i$ occurred in image $d_j$.
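
    To make the four steps above concrete, the following is a minimal sketch of the BoW pipeline, assuming OpenCV (with SIFT available via cv2.SIFT_create) and scikit-learn; the function names, image paths, and the vocabulary size V = 1000 are illustrative assumptions, not prescriptions taken from the reviewed works.

```python
# Minimal sketch of the four-step BoW pipeline; all names and parameters are illustrative.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_sift(path):
    """Steps (i)+(ii): detect DoG keypoints and compute 128-D SIFT descriptors."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    _, descriptors = sift.detectAndCompute(gray, None)
    return descriptors  # shape: (num_keypoints, 128), or None if no keypoints found

def build_vocabulary(all_descriptors, V=1000):
    """Step (iii): quantize descriptors into V visual words with k-means."""
    kmeans = KMeans(n_clusters=V, n_init=10, random_state=0)
    kmeans.fit(np.vstack(all_descriptors))
    return kmeans

def bow_histogram(descriptors, kmeans):
    """Step (iv): count how often each visual word w_i occurs in image d_j."""
    words = kmeans.predict(descriptors)
    return np.bincount(words, minlength=kmeans.n_clusters)  # one row N_ij of the count table

# Usage sketch: training_paths is a hypothetical list of image file paths.
# descs = [d for d in (extract_sift(p) for p in training_paths) if d is not None]
# vocab = build_vocabulary(descs, V=1000)
# bow_vectors = np.array([bow_histogram(d, vocab) for d in descs])
```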

    2.1. Interest Point Detection

    The first step of the BoW methodology is to detect local interest regions or points. Features of interest points (or keypoints) are computed at predefined locations and scales. Several well-known region detectors described in the literature are discussed below [25, 26].

    (i) Harris-Laplace regions are detected by the scale-adapted Harris function and selected in scale-space by the Laplacian-of-Gaussian operator. Harris-Laplace detects corner-like structures.

    (ii) DoG regions are localized at local scale-space maxima of the difference-of-Gaussian. This detector is suitable for finding blob-like structures. In addition, the DoG point detector has previously been shown to perform well, and it is also faster and more compact (fewer feature points per image) than other detectors.

    (iii) Hessian-Laplace regions are localized in space at the local maxima of the Hessian determinant and in scale at the local maxima of the Laplacian-of-Gaussian.

    (iv) Salient regions are detected in scale-space at local maxima of the entropy. The entropy of pixel intensity histograms is measured for circular regions of various sizes at each image position.

    (v) Maximally stable extremal regions (MSERs) are components of connected pixels in a thresholded image. A watershed-like segmentation algorithm is applied to image intensities, and segment boundaries that are stable over a wide range of thresholds define the regions.
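
    As an illustration of how two of the detectors listed above might be invoked in practice, the following hedged sketch uses OpenCV: DoG keypoints are obtained through the SIFT detector and MSER regions through cv2.MSER_create, with library defaults rather than settings from any particular study, and the image path is a placeholder.

```python
# Illustrative only: DoG and MSER detection with OpenCV defaults.
import cv2

gray = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input image

# DoG: local scale-space maxima of the difference-of-Gaussian (blob-like structures).
dog_keypoints = cv2.SIFT_create().detect(gray, None)

# MSER: connected components stable over a wide range of intensity thresholds.
mser = cv2.MSER_create()
regions, bounding_boxes = mser.detectRegions(gray)

print(len(dog_keypoints), "DoG keypoints,", len(regions), "MSER regions")
```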

    Mikolajczyk et al. [27] compare six well-known detectors: detectors based on affine normalization around Harris and Hessian points, MSER, an edge-based region detector, a detector based on intensity extrema, and a detector of salient regions. They conclude that the Hessian-Affine detector performs best.

    On the other hand, according to Hörster and Lienhart [21], interest points can be detected by a sparse or a dense approach. For sparse features, interest points are detected at local extrema in the difference of a Gaussian pyramid [23]. A position and scale are automatically assigned to each point, and thus the extracted regions are invariant to these properties. For dense features, on the other hand, interest points are defined at evenly sampled grid points. Feature vectors are then computed based on three different neighborhood sizes, that is, at different scales, around each interest point.

    Some authors believe that a very precise segmentation of an image is not required for the scene classification problem [28], and some studies have shown that coarse segmentation is very suitable for scene recognition. In particular, Bosch et al. [29] compare four dense descriptors with the widely used sparse descriptor (i.e., the Harris detector) [14, 15] and show that the best results are obtained with the dense descriptors. This is because there is more information in scene images, and intuitively a dense image description is necessary to capture uniform regions such as sky, calm water, or road surface in many natural scenes. Similarly, Jurie and Triggs [30] show that sampling many patches on a regular dense grid (or a fixed number of patches) outperforms the use of interest points. In addition, Fei-Fei and Perona [31] and Bosch et al. [29] show that dense descriptors outperform sparse ones.
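
    A rough sketch of the dense sampling strategy discussed above is given below, assuming OpenCV: keypoints are placed on an evenly spaced grid at three neighborhood sizes and SIFT descriptors are computed at those fixed locations. The step size and scales are arbitrary illustrative values, not settings from the cited works.

```python
# Dense grid sampling sketch; step size and scales are illustrative choices.
import cv2

def dense_keypoints(gray, step=8, scales=(8, 16, 24)):
    """Place keypoints on a regular grid at three neighborhood sizes."""
    points = []
    for size in scales:
        for y in range(step, gray.shape[0] - step, step):
            for x in range(step, gray.shape[1] - step, step):
                points.append(cv2.KeyPoint(float(x), float(y), float(size)))
    return points

# gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)    # hypothetical image
# sift = cv2.SIFT_create()
# kps, descs = sift.compute(gray, dense_keypoints(gray))  # descriptors at grid points
```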

    2.2. Local Descriptors

    In most studies, a single local descriptor is extracted, with the Scale Invariant Feature Transform (SIFT) descriptor [23] being the most widely used. It combines a scale invariant region detector and a descriptor based on the gradient distribution in the detected regions. The descriptor is represented by a 3D histogram of gradient locations and orientations, and its dimensionality is 128.

    In order to reduce the dimensionality of the SIFT descriptor, which is usually 128 dimensions per keypoint, principal component analysis (PCA) can be used, which increases image retrieval accuracy and allows faster matching [32]. Specifically, Uijlings et al. [33] show that retrieval performance can be increased by using PCA to remove redundancy in the dimensions.
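
    As a simple illustration (not the PCA-SIFT method of [32] itself), the dimensionality of SIFT descriptors can be reduced with an off-the-shelf PCA, as sketched below with scikit-learn; the target dimensionality of 36 and the random stand-in data are assumptions made for illustration.

```python
# Dimensionality reduction of SIFT descriptors with PCA; data and target size are illustrative.
import numpy as np
from sklearn.decomposition import PCA

descriptors = np.random.rand(5000, 128).astype(np.float32)  # stand-in for real SIFT descriptors

pca = PCA(n_components=36)
reduced = pca.fit_transform(descriptors)            # shape: (5000, 36)
print(pca.explained_variance_ratio_.sum())          # fraction of variance retained
```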

    SIFT was found to work best [13, 25, 34, 35]. Specifically, Mikolajczyk and Schmid [34] compared 10 different descriptors extracted by the Harris-Affine detector, namely SIFT, gradient location and orientation histograms (GLOH) (i.e., an extension of SIFT), shape context, PCA-SIFT, spin images, steerable filters, differential invariants, complex filters, moment invariants, and cross-correlation of sampled pixel values. They show that the SIFT-based descriptors perform best.

    In addition, Quelhas et al. [13] confirm in practice that DoG + SIFT constitutes a reasonable choice. Very few studies consider extracting different descriptors. For example, Li et al. [36] combine (fuse) the SIFT descriptor with the concatenation of block- and blob-based HSV histograms and local binary patterns to generate the BoW.

    2.3. Visual Word Generation/Vector Quantization

    When the keypoints have been detected and their features extracted, such as with the SIFT descriptor, the final step of extracting the BoW feature from images is vector quantization. In general, the k-means clustering algorithm is used for this task, and the number of visual words generated is the number of clusters (i.e., k). Jiang et al. [17] conducted a comprehensive study on the representation choices of BoW for video and image annotation, including vocabulary size, weighting scheme (such as binary, term frequency (TF), and term frequency-inverse document frequency (TF-IDF)), stop word removal, feature selection, and so forth.
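
    The weighting schemes mentioned above can be illustrated with a small numerical sketch; the formulas below follow the standard text-retrieval definitions of binary, TF, and TF-IDF weighting rather than any particular paper's variant, and the count matrix is made up for illustration.

```python
# Binary, TF, and TF-IDF weighting of visual-word counts; the data is synthetic.
import numpy as np

counts = np.array([[3, 0, 2],       # each row: visual-word counts for one image
                   [0, 1, 0],
                   [4, 2, 0]], dtype=float)

binary = (counts > 0).astype(float)                   # presence/absence of each visual word
tf = counts / counts.sum(axis=1, keepdims=True)       # term frequency (normalized counts)
df = (counts > 0).sum(axis=0)                         # document frequency of each visual word
idf = np.log(counts.shape[0] / np.maximum(df, 1))     # inverse document frequency
tf_idf = tf * idf                                     # TF-IDF weighted BoW vectors
```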

    To generate visual words, many studies focus on capturing spatial information in order to improve the limitations of the conventional BoW model, such as Yang et al. [37], Zhang et al. [38], Chen et al. [39], S. Kim and D. Kim [40], Lu and Ip [41], Lu and Ip [42], Uijlings et al. [43], Cao and Fei-Fei [44], Philbin et al. [45], Wu et al. [46], Agarwal and Triggs [47], Lazebnik et al. [48], Marszałek and Schmid [49], and Monay et al. [50], in which spatial pyramid matching introduced by Lazebnik et al. [48] has been widely compared as one of the baselines.

    However, Van de Sande et al. [51] have shown that a severe drawback of the bag-of-words model is its high computational cost in the quantization step. In other words, the most expensive part in a state-of-the-art setup of the bag-of-words model is the vector quantization step, that is, finding the closest cluster for each data point in the k-means algorithm.

    Uijlings et al. [33] compare k-means and random forests for the word assignment task in terms of computational efficiency. Using different descriptors with different grid sizes, random forests are significantly faster than k-means. In addition, using random forests to generate BoW can provide a slightly better Mean Average Precision (MAP) than k-means. They also recommend two BoW pipelines for when the focus is on accuracy and on speed, respectively.

    In their seminal work, Philbin et al. [45] compare approximate k-means, hierarchical k-means, and (exact) k-means in terms of precision and computational cost, and find that approximate k-means works best (see Section 4.3 for further discussion).

    Chum et al. [52] observe that feature detection and quantization are noisy processes and this can result in variation in the particular visual words that appear in different images of the same object, leading to missed results.

    2.4. Learning Models

    After the BoW feature is extracted from images, it is fed into a classifier for training or testing. Besides constructing discriminative models as classifiers for image annotation, some Bayesian text models derived from Latent Semantic Analysis [53], such as probabilistic Latent Semantic Analysis (pLSA) [54] and Latent Dirichlet Allocation (LDA) [55], can be adapted to model object and scene categories.

    2.4.1. Discriminative Models

    The construction of discriminative models for image annotation is based on the supervised machine learning principle for pattern recognition. Supervised learning can be thought of as learning by examples or learning with a teacher [56]. The teacher has knowledge of the environment, which is represented by a set of input-output examples. In order to classify unknown patterns, a certain number of training samples are available for each class, and they are used to train the classifier [57].

    The learning task is to compute a classifier or model that approximates the mapping between the input-output examples and correctly labels the training set with some level of accuracy. This can be called the training or model generation stage. After the model is generated or trained, it is able to classify an unknown instance into one of the class labels learned from the training set. More specifically, the classifier calculates the similarity of the unlabeled instance to all trained classes and assigns it to the class with the highest similarity measure. The most widely used classifier is based on support vector machines (SVM) [58].
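
    A hedged sketch of this training/testing stage with an RBF-kernel SVM is given below, using scikit-learn on synthetic BoW histograms; the number of classes, the vocabulary size, and the hyperparameters are illustrative assumptions rather than settings from the reviewed works.

```python
# SVM training and prediction on (synthetic) BoW vectors; all parameters are illustrative.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
X_train = normalize(rng.random((200, 1000)))    # 200 BoW vectors over 1000 visual words
y_train = rng.integers(0, 5, size=200)          # 5 hypothetical class labels (concepts)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # Gaussian radial basis function kernel
clf.fit(X_train, y_train)                       # training / model generation stage

X_test = normalize(rng.random((10, 1000)))
predicted_labels = clf.predict(X_test)          # assign each unknown image to a learned class
```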

    2.4.2. Generative Models

    In text analysis, pLSA and LDA are used to discover topics in a document using the BoW document representation. For image annotation, documents and discovered topics are thought of as images and object categories, respectively. Therefore, an image containing instances of several objects is modeled as a mixture of topics. This topic distribution over the images is used to classify an image as belonging to a certain scene. For example, if an image contains “water with waves”, “sky with clouds”, and “sand”, it will be classified into the “coast” scene class [24].

    Following the previous definition of BoW, pLSA is a latent variable model for co-occurrence data which associates an unobserved class variable $z_k \in \{z_1, \ldots, z_K\}$ (a topic) with each observation. A joint probability model $P(w_i, d_j)$ over the visual words and images is defined by the mixture
    $$P(w_i, d_j) = P(d_j) \sum_{k=1}^{K} P(w_i \mid z_k)\, P(z_k \mid d_j),$$
    where the $P(w_i \mid z_k)$ are the topic-specific word distributions and each image is modeled as a mixture of topics through $P(z_k \mid d_j)$.

    On the other hand, LDA treats the multinomial weights $\theta$ over topics as latent random variables. In particular, the pLSA model is extended by sampling those weights from a Dirichlet distribution. This extension allows the model to assign probabilities to data outside the training corpus and uses fewer parameters, which can reduce the overfitting problem.

    The goal of LDA is to maximize the likelihood
    $$p(w \mid \alpha, \beta) = \int\!\!\int p(\theta \mid \alpha)\, p(\phi \mid \beta) \prod_{n} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \phi)\, d\theta\, d\phi,$$
    where $\theta$ and $\phi$ are the multinomial parameters over the topics and words, respectively, and $p(\theta \mid \alpha)$ and $p(\phi \mid \beta)$ are Dirichlet distributions parameterized by the hyperparameters $\alpha$ and $\beta$.
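
    As an illustrative sketch (not the exact inference procedures of [54, 55]), topic mixtures can be estimated from BoW count vectors with scikit-learn's variational LDA implementation; the corpus below is random stand-in data and the number of topics is an arbitrary choice.

```python
# Topic discovery on BoW count vectors with a variational LDA; data and topic count are illustrative.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
word_counts = rng.integers(0, 5, size=(100, 500))   # 100 images, 500 visual words

lda = LatentDirichletAllocation(n_components=10, random_state=0)
theta = lda.fit_transform(word_counts)   # per-image topic mixture, shape (100, 10)
phi = lda.components_                    # unnormalized topic-word distributions

# theta[j] can then be fed to a classifier, or the most probable topic can be
# used directly as the scene label for image j.
```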

    Bosch et al. [24] compare BoW + pLSA with different semantic modeling approaches, such as the traditional global feature representation and a block-based feature representation [59] with the k-nearest neighbor classifier. They show that BoW + pLSA performs best. Specifically, the HIS histogram + co-occurrence matrices + edge direction histogram are used as the image descriptors.

    However, it is interesting that Lu and Ip [41] and Quelhas et al. [60] show that pLSA does not perform better than BoW + SVM on the Corel dataset, where the former uses block-based HSV and Gabor texture features and the latter uses keypoint-based SIFT features.

    3. Extensions of BoW

    This section reviews the literature on using BoW for related problems. The studies are divided into five categories, namely, feature representation, vector quantization, visual vocabulary construction, image segmentation, and others.

    3.1. Feature Representation

    Since the annotation accuracy is heavily dependent on feature representation, using different region/point descriptors and/or the BoW feature representation will provide different levels of discriminative power for annotation. For example, Mikolajczyk and Schmid [34] compare 10 different local descriptors for object recognition. Jiang et al. [17] examine the classification accuracy of the BoW features using different numbers of visual words and different weighting schemes.

    Due to the drawbacks that vector quantization may reduce the discriminative power of images and that the BoW methodology ignores the geometric relationships among visual words, Zhong et al. [61] present a novel scheme in which SIFT features are bundled into local groups. These bundled features are repeatable and much more discriminative than an individual SIFT feature. In other words, a bundled feature provides a flexible representation that allows two groups of SIFT features to be partially matched.

    On the other hand, since image features generally carry mixed information from the entire image, which may contain multiple objects and background, annotation accuracy can be degraded by such noisy (or diluted) feature representations. Chen et al. [62] propose a novel feature representation, pseudo-objects. A pseudo-object is a subset of proximate feature points, with its own feature vector, that represents a local area approximating a candidate object in the image.

    Gehler and Nowozin [63] focus on feature combination, which is to combine multiple complementary features based on different aspects such as shape, color, or texture. They study several models that aim at learning the correct weighting of different features from training data. They provide insight into when combination methods can be expected to work and how the benefit of complementary features can be exploited most efficiently.

    Qin and Yung [64] use localized maximum-margin learning to fuse different types of features during the BoW modeling. Particularly, the region of interest is described by a linear combination of the dominant feature and other features extracted from each patch at different scales, respectively. Then, dominant feature clustering is performed to create contextual visual words, and each image in the training set is evaluated against the codebook using the localized maximum-margin learning method to fuse other features, in order to select a list of contextual visual words that best represents the patches of the image.

    As there is a relation between the composition of a photograph and its subject, similar subjects are typically photographed in a similar style. Van Gemert [65] exploits the assumption that images within a category share a similar style, such as colorfulness, lighting, depth of field, viewpoints, and saliency, and uses photographic style for category-level image classification. In particular, where the spatial pyramid groups features spatially [48], this work focuses on more general feature grouping, including these photographic style attributes.

    Rasiwasia and Vasconcelos [66] introduce an intermediate space, based on a low dimensional semantic “theme” image representation, which is learned with weak supervision from casual image annotations. Each theme induces a probability density on the space of low-level features, and images are represented as vectors of posterior theme probabilities.

    3.2. Vector Quantization

    In order to reduce the quantization noise, Jégou et al. [67] construct short codes using quantization. The goal is to estimate distances using vector-to-centroid distances; that is, the query vector is not quantized, and codes are assigned to the database vectors only. In other words, the feature space is decomposed into a Cartesian product of low-dimensional subspaces, and then each subspace is quantized separately. In particular, a vector is represented by a short code composed of its subspace quantization indices.
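
    The general idea can be sketched as follows; this is an illustration of the product-quantization principle under simple assumptions (m subspaces, k centroids per subspace, k-means quantizers), not the authors' implementation.

```python
# Product quantization sketch: split the space into m subspaces, quantize each separately,
# and represent every vector by its m centroid indices. All parameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def train_pq(X, m=4, k=256):
    """Fit one k-means quantizer per subspace of dimension D/m."""
    d_sub = X.shape[1] // m
    return [KMeans(n_clusters=k, n_init=4, random_state=0)
            .fit(X[:, i * d_sub:(i + 1) * d_sub]) for i in range(m)]

def encode_pq(X, quantizers):
    """Represent each vector by m short subspace indices."""
    d_sub = X.shape[1] // len(quantizers)
    codes = [q.predict(X[:, i * d_sub:(i + 1) * d_sub])
             for i, q in enumerate(quantizers)]
    return np.stack(codes, axis=1)   # shape: (num_vectors, m)

# X = np.random.rand(10000, 128).astype(np.float32)   # stand-in descriptors
# quantizers = train_pq(X, m=4, k=256)
# codes = encode_pq(X, quantizers)                    # compact codes for the database vectors
```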

    As abrupt quantization into discrete bins causes some aliasing, Agarwal and Triggs [47] focus on soft vector quantization, that is, softly voting into the cluster centers that lie close to the patch, for example, with Gaussian weights. They show that diagonal-covariance Gaussian mixtures fitted using expectation-maximization perform better than hard vector quantization.
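
    A rough sketch of such soft assignment with a diagonal-covariance Gaussian mixture is shown below, in the spirit of [47, 68] but not reproducing their exact algorithms; the descriptor data and the number of components are illustrative assumptions.

```python
# Soft vector quantization sketch: each descriptor votes into all components
# with its posterior probabilities instead of a single hard assignment.
import numpy as np
from sklearn.mixture import GaussianMixture

descriptors = np.random.rand(2000, 128)          # stand-in local descriptors for one image

gmm = GaussianMixture(n_components=100, covariance_type="diag", random_state=0)
gmm.fit(descriptors)

posteriors = gmm.predict_proba(descriptors)      # shape: (2000, 100), soft votes per descriptor
soft_bow = posteriors.sum(axis=0)                # soft visual-word histogram for the image
```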

    Similarly, Fernando et al. [68] propose a supervised learning algorithm based on a Gaussian mixture model, which not only generalizes k-means by allowing “soft assignments”, but also exploits supervised information to improve the discriminative power of the clusters. An EM-based procedure is used to optimize a convex combination of two criteria: the first is unsupervised and based on the likelihood of the training data, and the second is supervised and takes into account the purity of the clusters.

    On the other hand, Wu et al. [69] propose a Semantics-Preserving Bag-of-Words (SPBoW) model, which considers the distance between the semantically identical features as a measurement of the semantic gap and tries to learn a codebook by minimizing this semantic gap. That is, the codebook generation task is formulated as a distance metric learning problem. In addition, one visual feature can be assigned to multiple visual words in different object categories.

    In de Campos et al. [70], images are modeled as order-less sets of weighted visual features, where each visual feature is associated with a weight factor that indicates its relevance. In this approach, visual saliency maps are used to determine the relevance weight of a feature.

    Zheng et al. [71] argue that, for the BoW model used in information retrieval and document categorization, the textual word possesses semantics itself and documents are well-structured data regulated by grammar, linguistic, and lexicon rules. However, there appear to be no well-defined rules in the visual word composition of images. For instance, objects of the same class might have arbitrarily different shapes and visual appearances, while objects of different classes might share similar local appearances. To this end, a higher-level visual representation, the visual synset, is presented for object recognition. First, an intermediate visual descriptor, the delta visual phrase, is constructed from a frequently co-occurring visual word-set with similar spatial context. Second, the delta visual phrases are clustered into visual synsets based on their probabilistic “semantics”, that is, class probability distributions.

    Besides reducing the vector quantization noise, another severe drawback of the BoW model is its high computational cost. To address this problem, Moosmann et al. [72] introduce extremely randomized clustering forests based on ensembles of randomly created clustering trees and show that more accurate results can be obtained as well as much faster training and testing.

    Recently, Van de Sande et al. [51] proposed two algorithms to combine GPU hardware and a parallel programming model to accelerate the quantization and classification components of the visual categorization architecture.

    On the other hand, Hare et al. [73] show that the intensity inversion characteristics of the SIFT descriptor and local interest region detectors can be exploited to decrease the time it takes to create vocabularies of visual terms. In particular, they show that clustering inverted and noninverted (or minimum and maximum) features separately results in the same retrieval performance as clustering all the features as a single set (with the same overall vocabulary size).

    3.3. Visual Vocabulary Construction

    Since related studies, such as Jégou et al. [74], Marszałek and Schmid [49], Sivic and Zisserman [14], and Winn et al. [75], have shown that the commonly generated visual words are still not as expressive as text words, in Zhang et al. [76], images are represented as visual documents composed of repeatable and distinctive visual elements, which are comparable to text words. They propose descriptive visual words (DVWs) and descriptive visual phrases (DVPs) as the visual correspondences to text words and phrases, where visual phrases refer to frequently co-occurring visual word pairs.

    Gavves et al. [77] focus on identifying pairs of independent, distant words—the visual synonyms—that are likely to host image patches of similar visual reality. Specifically, landmark images are considered, where the image geometry guides the detection of synonym pairs. Image geometry is used to find those image features that lie in a nearly identical physical location, yet are assigned to different words of the visual vocabulary.

    On the other hand, López-Sastre et al. [78] present a novel method for constructing a visual vocabulary that takes into account the class labels of images. It consists of two stages: Cluster Precision Maximisation (CPM) and Adaptive Refinement. In the first stage, a Reciprocal Nearest Neighbours (RNN) clustering algorithm is guided towards class representative visual words by maximizing a new cluster precision criterion. Next, an adaptive threshold refinement scheme is proposed with the aim of increasing vocabulary compactness, while at the same time improving the recognition rate and further increasing the representativeness of the visual words for category-level object recognition. In other words, this is a correlation clustering based approach, which works as a kind of metaclustering and optimizes the cut-off threshold for each cluster separately.

    Constructing visual codebook ensembles is another approach to improving image annotation accuracy. In Luo et al. [18], three methods for constructing visual codebook ensembles are presented. The first is based on diverse individual visual codebooks built by randomly choosing interest points. The second uses random sub-training image datasets with random interest points. The third directly utilizes different patch information to construct an ensemble with high diversity. Consequently, different types of image representations are obtained. Then, a classification ensemble is learned from the different representations of the same training set.

    Bae and Juang [79] apply the idea of linguistic parsing to generate the BoW feature for image annotation. That is, images are represented by a number of variable-size patches by a multidimensional incremental parsing algorithm. Then, the occurrence pattern of these parsed visual patches is fed into the LSA framework.

    Since one major challenge in object categorization is to find class models that are “invariant” enough to incorporate naturally-occurring intraclass variations and yet “discriminative” enough to distinguish between different classes, Winn et al. [75] proposed a supervised learning algorithm, which automatically finds such models. In particular, it classifies a region according to the proportions of different visual words. The specific visual words and the typical proportions in each object are learned from a segmented training set.

    Kesorn and Poslad [80] propose a framework to enhance the visual word quality. First of all, visual words from representative keypoints are constructed by reducing similar keypoints. Second, domain specific noninformative visual words are detected, which are useless for representing the content of visual data but which can degrade the categorization capability. A noninformative visual word is defined as having a high document frequency and a small statistical association with all the concepts in the image collection. Third, the vector space model of visual words is restructured with respect to a structural ontology model in order to solve visual synonym and polysemy problems.

    Tirilly et al. [81] present a new image representation called visual sentences, which allows visual words to be “read” in a certain order, as in the case of text. In particular, simple spatial relations between visual words are considered. In addition, pLSA is used to eliminate the noisiest visual words.

    3.4. Image Segmentation

    Effective image segmentation can be an important factor affecting BoW feature generation. Uijlings et al. [43] study the role of context in the BoW approach. They observe that using the precise localization of object patches based on image segmentation is likely to yield better performance than the dense sampling strategy, which samples patches of 8 × 8 pixels at every 4th pixel.

    Besides point detection, an image can be segmented into several regions or a fixed number of regions or blocks. However, very few studies have compared the effect of image segmentation on generating the BoW feature. In Cheng and Wang [82], 20–50 regions per image are segmented, and each region is represented by an HSV histogram and co-occurrence texture features. By using contextual Bayesian networks to model the spatial relationships between local regions and integrating multiple attributes to infer the high-level semantics of an image, this approach performs comparably to or better than a number of works using SIFT descriptors and pLSA for image annotation.

    Similarly, Wu et al. [46] extract a texture histogram from the 8 × 8 blocks/patches of each image based on their proposed visual language modeling method, which utilizes the spatial correlation of visual words. This representation is compared with the BoW model, including pLSA and LDA using the SIFT descriptor. Neither image segmentation nor interest point detection is used in the visual language modeling method, which makes it not only very efficient but also very effective on the Caltech 7 dataset.

    In addition to using the BoW feature for image annotation, Larlus et al. [83] combine BoW with random fields and generative models, such as Dirichlet processes, for more effective object segmentation.

    3.5. Others
    3.5.1. BoW Applications

    Although the BoW model has been extensively studied for general object and scene categorization, it has also been considered in some domain specific applications, such as human action recognition [84], facial expression recognition [85], medical images [86], robot, sport image analysis [80], 3D image retrieval and classification [87, 88], image quality assessment [89], and so forth.

    3.5.2. Describing Objects/Scenes for Recognition

    Farhadi et al. [90] propose shifting the goal of recognition from naming to describing. That is, they focus on describing objects by their attributes, which is not only to name familiar objects, but also to report unusual aspects of a familiar object, such as “spotty dog”, not just “dog”, and to say something about unfamiliar objects, such as “hairy and four-legged”, not just “unknown”.

    On the other hand, Sudderth et al. [91] develop hierarchical, probabilistic models for objects, the parts composing them, and the visual scenes surrounding them. These models share information between object categories in three distinct ways. First, parts define distributions over a common low-level feature vocabulary. Second, objects are defined using a common set of parts. Finally, object appearance information is shared between the many scenes in which that object is found.

    3.5.3. Query Expansion

    Chum et al. [52] adopt the BoW architecture with spatial information for query expansion, which has proven successful in achieving high precision at low recall. On the other hand, Philbin et al. [92] quantize a keypoint to its k nearest visual words as a form of query expansion.

    3.5.4. Similarity Measure

    Based on the BoW feature representation, Jégou et al. [74] introduce a contextual dissimilarity measure (CDM), which is iteratively obtained by regularizing the average distance of each point to its neighborhood. In addition, CDM is learned in an unsupervised manner, without the need to learn the distance measure from a set of training images.

    3.5.5. Large Scale Image Databases

    Since the aim of image annotation is to support very large scale keyword-based image search, such as web image retrieval, it is critical to assess existing approaches on some large scale dataset(s). Chum et al. [52], Hörster and Lienhart [21], and Lienhart and Slaney [93] used datasets composed of 100,000 to 250,000 images belonging to 12 categories, downloaded from Flickr.

    Moreover, Philbin et al. [45] use over 1,000,000 images from Flickr in their experiments, and Zhang et al. [94] use about 370,000 images collected from Google, belonging to 1506 object or scene categories.

    On the other hand, Torralba and Efros [95] study some bias issues of object recognition datasets. They provide some suggestions for creating a new and high quality dataset to minimize the selection bias, capture bias, and negative set bias. Furthermore, they claim that, given the state of today’s datasets, there are virtually no studies demonstrating cross-dataset generalization, for example, training on ImageNet while testing on PASCAL VOC. This could be considered as an additional experimental setup for future works.

    3.5.6. Integration of Feature Selection and/or (Spatial) Feature Extraction

    Although modeling the spatial relationship between visual words can improve the recognition performance, the spatial features are expensive to compute. Liu et al. [96] propose a method that simultaneously performs feature selection and (spatial) feature extraction based on higher-order spatial features for speed and storage improvements.

    For the dimensionality reduction purpose, Elfiky et al. [97] present a novel framework for obtaining a compact pyramid representation. In particular, the divisive information theoretic feature clustering (DITC) algorithm is used to create a compact pyramid representation.

    Bosch et al. [98] investigate whether dimensionality reduction using a latent generative model is beneficial for the task of weakly supervised scene classification. In their approach, latent “topics” using pLSA are first of all discovered, and a generative model is then applied to the BoW representation for each image.

    In contrast to reducing the dimensionality of the feature representation, selecting more discriminative features (e.g., SIFT descriptors) from a given set of training images has been considered. Shang and Xiao [99] introduce a pairwise image matching scheme to select the discriminative features. Specifically, the feature weights are updated by the labeled information from the training set. As a result, the selected features corresponding to the foreground content of the images can highlight the information category of the images.

    3.5.7. Integration of Segmentation, Classification, and/or Retrieval

    Simultaneously learning object/scene category models and performing segmentation on the detected objects were studied in Cao and Fei-Fei [44]. They propose a spatially coherent latent topic model (Spatial-LTM), which represents an image containing objects in a hierarchical way by oversegmented image regions of homogeneous appearances and the salient image patches within the regions. It can provide a unified representation for spatially coherent BoW topic models and can simultaneously segment and classify objects.

    On the other hand, Tong et al. [100] propose a statistical framework for large-scale near duplicate image retrieval which unifies the step of generating a BoW representation and the step of image retrieval. In this approach, each image is represented by a kernel density function, and the similarity between the query image and a database image is then estimated as the query likelihood.

    Shotton et al. [101] utilize semantic texton forests, which are ensembles of decision trees that act directly on image pixels, where the nodes in the trees provide an implicit hierarchical clustering into semantic textons and an explicit local classification estimate. In addition, the bag of semantic textons combines a histogram of semantic textons over an image region with a region prior category distribution, and the bag of semantic textons is computed over the whole image for categorization and over local rectangular regions for segmentation.

    3.5.8. Discriminative Learning Models

    Romberg et al. [102] extend the standard single-layer pLSA to multiple layers, where the multiple layers handle multiple modalities and a hierarchy of abstractions. In particular, the multilayer multimodal pLSA (mm-pLSA) model is based on two leaf-pLSAs and a single top-level pLSA node merging the two leaf-pLSAs. In addition, SIFT features and image annotations (tags), as well as the combination of SIFT and HOG features, are considered as two pairs of different modalities.

    3.5.9. Novel Category Discovery

    In their study, Lee and Grauman [103] discover new categories with the help of previously known categories. That is, previously learned categories are used to assess their familiarity in unsegmented, unlabeled images. In their approach, two variants of a novel object-graph descriptor are proposed to encode the 2D and 3D spatial layout of object-level co-occurrence patterns relative to an unfamiliar region, and they are used to model the interaction between an image’s known and unknown objects for detecting new visual categories.

    3.5.10. Interest Point Detection

    Since interest point detection is an important step for extracting the BoW feature, Stottinger et al. [104] propose color interest points for sparse image representation. Particularly, light-invariant interest points are introduced to reduce the sensitivity to varying imaging conditions. Color statistics based on occurrence probability lead to color boosted points, which are obtained through saliency-based feature selection.

    4. Comparisons of Related Work

    This section compares related work in terms of how the BoW feature and the experimental setup are structured. These comparisons allow us to identify the most suitable interest point detector(s), clustering algorithm(s), and so forth used to extract the BoW feature from images. In addition, we can identify the most widely used dataset(s) and experimental settings for image annotation by BoW.

    4.1. Methodology of BoW Feature Generation

    Table 1 compares related work for the methodology of extracting the BoW feature. Note that we leave a blank if the information in our comparisons is not clearly described in these related works.

    Table 1: Comparisons of interest point detection, visual words generation, and learning models.

    From Table 1 we can observe that the most widely used interest point detector for generating the BoW feature is DoG, and the second and third most popular detectors are Harris-Laplace and Hessian-Laplace, respectively. Besides extracting sparse BoW features, many related studies have focused on dense BoW features.

    On the other hand, several studies used some region segmentation algorithms, such as NCuts [116] and Mean-shift [117], to segment an image into several regions to represent keypoints.

    For the local feature descriptor used to describe interest points, most studies used the 128-dimensional SIFT feature; some considered using PCA to reduce its dimensionality, while others “fuse” color features with SIFT, resulting in higher-dimensional features. Apart from SIFT-related features, some studies considered conventional color and texture features to represent local regions or points.

    Regarding vector quantization, k-means is the most widely used clustering algorithm for generating the codebook or visual vocabulary. However, in order to address the limitations of k-means, for example, clustering accuracy and computational cost, some studies used hierarchical k-means, approximate k-means, accelerated k-means, and so forth.

    For the number of visual words, related works have considered various numbers of clusters during vector quantization. This may be because the datasets used in these works are different. In Jiang et al. [17], different numbers of visual words were studied, and their results show that 1000 is a reasonable choice. Some related studies also used similar numbers of visual words to generate their BoW features.

    On the other hand, the most and second most widely used weighting schemes are TF and TF-IDF. This is consistent with Jiang et al. [17], who concluded that these two weighting schemes perform better than the other weighting schemes.

    Finally, SVM is no doubt the most popular classification technique as the learning model for image annotation. In particular, one of the most widely used kernel functions for constructing the SVM classifier is the Gaussian radial basis function. However, some other SVM classifiers, such as linear SVM and SVM with a polynomial kernel have also been considered in the literature.

    4.2. Experimental Design

    Table 2 compares related work for the experimental design. That is, the chosen dataset(s) and baseline(s) are examined.

    Table 2: Comparisons of datasets used and annotation performance.

    According to Table 2, most studies considered more than one dataset for their experiments, and many of them contained both object and scene categories. This is important for image annotation, since the annotated keywords should be broad enough for users to perform keyword-based queries for image retrieval.

    Specifically, the PASCAL, Caltech, and Corel datasets are the three most widely used benchmarks for image classification. However, the datasets used in most studies usually contain a small number of categories and images, except for the studies focusing on retrieval rather than classification. That is, similarity-based queries are used to retrieve relevant images instead of training a learning model to classify unknown images into one specific category.

    For the chosen baselines, most studies compared against BoW and/or spatial pyramid matching based BoW, since their aims were to propose novel approaches to improve these two feature representations. Specifically, the spatial pyramid matching based BoW proposed by Lazebnik et al. [48] is the most popular baseline.

    Besides improving the feature representation per se, some studies focused on improving the performance of the LDA and/or pLSA generative learning models. Another popular baseline is that of Fei-Fei and Perona [31], who proposed a Bayesian hierarchical model to represent each region as part of a “theme.”

    4.3. Discussion

    The above comparisons indicate several issues that have not been examined in the literature. Since local features can be represented by object-based regions obtained through region segmentation [143, 144] or by point-based regions obtained through point detection (cf. Section 2.1), it is unknown which type of local feature is more appropriate for tokenization in the BoW feature for large scale image annotation. (For large scale image annotation, this means that the number of annotated keywords is certainly large and their meanings are very broad, covering both object and scene concepts.)

    In addition, while the local feature descriptor is a key component for successful image annotation, the number of visual words (i.e., clusters) is another factor affecting image annotation performance. Although Jiang et al. [17] conducted a comprehensive study using various numbers of visual words, they only used one dataset, TRECVID, containing 20 concepts. Therefore, one important issue is to provide guidelines for determining the number of visual words over different kinds of image datasets with different image contents.

    The learning techniques can be divided into generative and discriminative models, but there are very few studies which assess their annotation performance over different kinds of image datasets, which is necessary in order to fully understand the value of these two kinds of learning models. On the other hand, combinations of generative and discriminative learning techniques [145], or hybrid models, could also be considered for the image annotation task.

    For the experimental setup, the target of most studies was not image retrieval. In other words, the performance evaluation was usually for small scale problems based on datasets containing a small number of categories, say 10. However, image retrieval users will not be satisfied with a system providing only 10 keyword-based queries to search relevant images. Some benchmarks are much more suitable for larger scale image annotation, such as the Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) by ImageNet (http://www.image-net.org/challenges/LSVRC/2012/index) and Photo Annotation and Retrieval 2012 by ImageCLEF (http://www.imageclef.org/2012/photo). In particular, the ImageNet dataset contains over 10,000 categories and 10,000,000 labeled images, and ImageCLEF uses a subset of the MIRFLICKR collection (http://press.liacs.nl/mirflickr/), which contains 25,000 images and 94 concepts.

    However, it is also possible that some smaller scale datasets composed of a relatively small number of images and/or categories could be combined into larger datasets. For example, the combination of Caltech 256 and Corel could be regarded as a benchmark that is closer to the real world problem.

    5. Conclusion

    In this paper, a number of recent related works using BoW for image annotation are reviewed. We can observe that this topic has been extensively studied recently. For example, there are many issues for improving the discriminative power of BoW feature representations by such techniques as image segmentation, vector quantization, and visual vocabulary construction. In addition, there are other directions for integrating the BoW feature for different applications, such as face detection, medical image analysis, 3D image retrieval, and so forth.

    From the comparisons of related work, we can identify the most widely used methodology to extract the BoW feature, which can be regarded as a baseline for future research. That is, DoG is used as the keypoint detector and each keypoint is represented by the SIFT feature. The vector quantization step is based on the k-means clustering algorithm with 1000 visual words. However, the number of visual words (i.e., the value of k) depends on the dataset used. Finally, the weighting scheme can be either TF or TF-IDF.

    On the other hand, for the dataset issue in the experimental design, which can affect the contribution and the final conclusion, the PASCAL, Caltech, and/or Corel datasets can be used for an initial study.

    According to the comparative results, there are some future research directions. First, the local feature descriptor used for vector quantization, usually the point-based SIFT feature, can be compared with other descriptors, such as region-based features or a combination of different features. Second, guidelines for determining the number of visual words for different kinds of datasets should be provided. The third issue is to assess the performance of generative and discriminative learning models over different kinds of datasets, for example, different dataset sizes and different image contents (a single object per image versus multiple objects per image). Finally, it is worth examining the scalability of the BoW feature representation for large scale image annotation.

    References

    1. A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, “Content-based image retrieval at the end of the early years,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1349–1380, 2000.
    2. M. L. Kherfi, D. Ziou, and A. Bernardi, “Image retrieval from the World Wide Web: issues, techniques, and systems,” ACM Computing Surveys, vol. 36, no. 1, pp. 35–67, 2004.
    3. R. Datta, D. Joshi, J. Li, and J. Z. Wang, “Image retrieval: ideas, influences, and trends of the new age,” ACM Computing Surveys, vol. 40, no. 2, article 5, 2008.
    4. Y. Choi and E. M. Rasmussen, “Users' relevance criteria in image retrieval in American history,” Information Processing and Management, vol. 38, no. 5, pp. 695–726, 2002.
    5. M. Markkula, M. Tico, B. Sepponen, K. Nirkkonen, and E. Sormunen, “A test collection for the evaluation of content-based image retrieval algorithms—a user and task-based approach,” Information Retrieval, vol. 4, no. 3-4, pp. 275–293, 2001.
    6. A. Goodrum and A. Spink, “Image searching on the Excite Web search engine,” Information Processing and Management, vol. 37, no. 2, pp. 295–311, 2001.
    7. C. F. Tsai and C. Hung, “Automatically annotating images with keywords: a review of image annotation systems,” Recent Patents on Computer Science, vol. 1, no. 1, pp. 55–68, 2008.
    8. A. Hanbury, “A survey of methods for image annotation,” Journal of Visual Languages and Computing, vol. 19, no. 5, pp. 617–627, 2008.
    9. D. Zhang, M. M. Islam, and G. Lu, “A review on automatic image annotation techniques,” Pattern Recognition, vol. 45, pp. 346–362, 2011.
    10. A. Pinz, “Object categorization,” Foundations and Trends in Computer Graphics and Vision, vol. 1, no. 4, pp. 255–353, 2006.
    11. C. F. Tsai, K. McGarry, and J. Tait, “CLAIRE: a modular support vector image indexing and classification system,” ACM Transactions on Information Systems, vol. 24, no. 3, pp. 353–379, 2006.
    12. W.-C. Lin, M. Oakes, J. Tait, and C.-F. Tsai, “Improving image annotation via useful representative feature selection,” Cognitive Processing, vol. 10, no. 3, pp. 233–242, 2009.
    13. P. Quelhas, F. Monay, J. M. Odobez, D. Gatica-Perez, and T. Tuytelaars, “A thousand words in a scene,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 9, pp. 1575–1589, 2007.
    14. J. Sivic and A. Zisserman, “Video Google: a text retrieval approach to object matching in videos,” in Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV '03), pp. 1470–1477, October 2003.
    15. J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman, “Discovering objects and their location in images,” in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), pp. 370–377, October 2005.
    16. R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman, “Learning object categories from Google's image search,” in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), pp. 1816–1823, October 2005.
    17. Y. G. Jiang, J. Yang, C. W. Ngo, and A. G. Hauptmann, “Representations of keypoint-based semantic concept detection: a comprehensive study,” IEEE Transactions on Multimedia, vol. 12, no. 1, pp. 42–53, 2010.
    18. H. L. Luo, H. Wei, and L. L. Lai, “Creating efficient visual codebook ensembles for object categorization,” IEEE Transactions on Systems, Man, and Cybernetics Part A, vol. 41, no. 2, pp. 238–253, 2010.
    19. J. Fan, Y. Gao, and H. Luo, “Multi-level annotation of natural scenes using dominant image components and semantic concepts,” in Proceedings of the 12th ACM International Conference on Multimedia (MM '04), pp. 540–547, October 2004.
    20. G. Wang, Y. Zhang, and L. Fei-Fei, “Using dependent regions for object categorization in a generative framework,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), pp. 1597–1604, June 2006.
    21. E. Hörster and R. Lienhart, “Fusing local image descriptors for large-scale image retrieval,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), pp. 1–8, June 2007.
    22. H. Jégou, M. Douze, and C. Schmid, “Improving bag-of-features for large scale image search,” International Journal of Computer Vision, vol. 87, no. 3, pp. 316–336, 2010.
    23. D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
    24. A. Bosch, X. Muñoz, and R. Martí, “Which is the best way to organize/classify images by content?” Image and Vision Computing, vol. 25, no. 6, pp. 778–791, 2007.
    25. K. Mikolajczyk, B. Leibe, and B. Schiele, “Local features for object class recognition,” in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), pp. 1792–1799, October 2005.
    26. T. Tuytelaars and K. Mikolajczyk, “Local invariant feature detectors: a survey,” Foundations and Trends in Computer Graphics and Vision, vol. 3, no. 3, pp. 177–280, 2007.
    27. K. Mikolajczyk, T. Tuytelaars, C. Schmid et al., “A comparison of affine region detectors,” International Journal of Computer Vision, vol. 65, no. 1-2, pp. 43–72, 2005.
    28. D. Gökalp and S. Aksoy, “Scene classification using bag-of-regions representations,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), pp. 1–8, June 2007.
    29. A. Bosch, A. Zisserman, and X. Munoz, “Scene classification via pLSA,” in European Conference on Computer Vision, pp. 517–530, 2006.
    30. F. Jurie and B. Triggs, “Creating efficient codebooks for visual recognition,” in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), pp. 604–610, October 2005.
    31. L. Fei-Fei and P. Perona, “A Bayesian hierarchical model for learning natural scene categories,” in Proceedings of the 6th IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 524–531, June 2005.
    32. Y. Ke and R. Sukthankar, “PCA-SIFT: a more distinctive representation for local image descriptors,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), pp. 506–513, July 2004.
    33. J. R. R. Uijlings, A. W. M. Smeulders, and R. J. H. Scha, “Real-time visual concept classification,” IEEE Transactions on Multimedia, vol. 12, no. 7, pp. 665–681, 2010.
    34. K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615–1630, 2005. View at Publisher · View at Google Scholar · View at Scopus
    35. J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid, “Local features and kernels for classification of texture and object categories: a comprehensive study,” International Journal of Computer Vision, vol. 73, no. 2, pp. 213–238, 2007. View at Publisher · View at Google Scholar · View at Scopus
    36. Z. Li, Z. Shi, X. Liu, Z. Li, and Z. Shi, “Fusing semantic aspects for image annotation and retrieval,”Journal of Visual Communication and Image Representation, vol. 21, no. 8, pp. 798–805, 2010. View at Publisher · View at Google Scholar · View at Scopus
    37. L. Yang, N. Zheng, and J. Yang, “A unified context assessing model for object categorization,” Computer Vision and Image Understanding, vol. 115, no. 3, pp. 310–322, 2011. View at Publisher · View at Google Scholar · View at Scopus
    38. S. Zhang, Q. Tian, G. Hua et al., “Modeling spatial and semantic cues for large-scale near-duplicated image retrieval,” Computer Vision and Image Understanding, vol. 115, no. 3, pp. 403–414, 2011. View at Publisher · View at Google Scholar · View at Scopus
    39. X. Chen, X. Hu, and X. Shen, “Spatial weighting for bag-or-visual-words and its application in content-based image retrieval,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 867–874, 2009.
    40. S. Kim and D. Kim, “Scene classification using pLSA with visterm spatial location,” in Proceedings of the 1st ACM International Workshop on Interactive Multimedia for Consumer Electronics (IMCE '09), pp. 57–66, October 2009. View at Publisher · View at Google Scholar · View at Scopus
    41. Z. Lu and H. H. S. Ip, “Image categorization with spatial mismatch kernels,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 397–404, June 2009. View at Publisher · View at Google Scholar · View at Scopus
    42. Z. Lu and H. H. S. Ip, “Image categorization by learning with context and consistency,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 2719–2726, June 2009. View at Publisher · View at Google Scholar · View at Scopus
    43. J. R. R. Uijlings, A. W. M. Smeulders, and R. J. H. Scha, “What is the spatial extent of an object?” inProceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 770–777, June 2009. View at Publisher · View at Google Scholar · View at Scopus
    44. L. Cao and L. Fei-Fei, “Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes,” in Proceedings of the IEEE 11th International Conference on Computer Vision (ICCV '07), pp. 1–8, October 2007. View at Publisher · View at Google Scholar · View at Scopus
    45. J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), pp. 1–8, June 2007. View at Publisher · View at Google Scholar · View at Scopus
    46. L. Wu, M. Li, Z. Li, W. Y. Ma, and N. Yu, “Visual language modeling for image classification,” inProceedings of the 9th ACM SIG Multimedia International Workshop on Multimedia Information Retrieval (MIR '07), pp. 115–124, September 2007. View at Publisher · View at Google Scholar · View at Scopus
    47. A. Agarwal and B. Triggs, “Hyperfeatures—multilevel local coding for visual recognition,” in Conference on Computer Vision, pp. 30–43, 2006.
    48. S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: spatial pyramid matching for recognizing natural scene categories,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), pp. 2169–2178, June 2006. View at Publisher ·View at Google Scholar · View at Scopus
    49. M. Marszałek and C. Schmid, “Spatial weighting for bag-of-features,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), pp. 2118–2125, June 2006. View at Publisher · View at Google Scholar · View at Scopus
    50. F. Monay, P. Quelhas, J. M. Odobez, and D. Gatica-Perez, “Integrating co-occurrence and spatial contexts on patch-based scene segmentation,” in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR '06), pp. 14–21, June 2006. View at Publisher · View at Google Scholar · View at Scopus
    51. K. E. A. Van De Sande, T. Gevers, and C. G. M. Snoek, “Empowering visual categorization with the GPU,” IEEE Transactions on Multimedia, vol. 13, no. 1, pp. 60–70, 2011. View at Publisher · View at Google Scholar · View at Scopus
    52. O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman, “Total recall: automatic query expansion with a generative feature model for object retrieval,” in Proceedings of the IEEE 11th International Conference on Computer Vision (ICCV '07), pp. 1–8, October 2007. View at Publisher · View at Google Scholar ·View at Scopus
    53. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, “Indexing by latent semantic analysis,” Journal for the American Society for InFormation Science, vol. 41, no. 6, pp. 391–407, 1990. View at Google Scholar
    54. T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, vol. 42, no. 1-2, pp. 177–196, 2001. View at Publisher · View at Google Scholar
    55. D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. 4-5, pp. 993–1022, 2003. View at Google Scholar · View at Scopus
    56. T. Mitchell, Machine Learning, McGraw-Hill, New York, NY, USA, 1997.
    57. S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, Upper Saddle River, NJ, USA, 2nd edition, 1999.
    58. V. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, NY, USA, 1998.
    59. M. Summer and R. W. Picard, “Indoor-outdoor image classification,” IEEE International Workshop on Content-Based Access of Image and Video Databases, pp. 42–50, 1998. View at Google Scholar
    60. P. Quelhas, F. Monay, J. M. Odobez, D. Gatica-Perez, T. Tuytelaars, and L. Van Gool, “Modeling scenes with local descriptors and latent aspects,” in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), pp. 883–890, October 2005. View at Publisher · View at Google Scholar ·View at Scopus
    61. W. Zhong, K. Qifa, M. Isard, and S. Jian, “Bundling features for large scale partial-duplicate web image search,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 25–32, June 2009. View at Publisher · View at Google Scholar · View at Scopus
    62. K. T. Chen, K. H. Lin, Y. H. Kuo, Y. L. Wu, and W. H. Hsu, “Boosting image object retrieval and indexing by automatically discovered pseudo-objects,” Journal of Visual Communication and Image Representation, vol. 21, no. 8, pp. 815–825, 2010. View at Publisher · View at Google Scholar · View at Scopus
    63. P. Gehler and S. Nowozin, “On feature combination for multiclass object classification,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV '09), pp. 221–228, 2009.
    64. J. Qin and N. H. Yung, “Feature fusion within local region using localized maximum-margin learning for scene categorization,” Pattern Recognition, vol. 45, pp. 1671–1683, 2012. View at Google Scholar
    65. J. C. Van Gemert, “Exploiting photographic style for category-level image classification by generalizing the spatial pyramid,” in Proceedings of the 1st ACM International Conference on Multimedia Retrieval (ICMR '11), pp. 1–8, April 2011. View at Publisher · View at Google Scholar · View at Scopus
    66. N. Rasiwasia and N. Vasconcelos, “Scene classification with low-dimensional semantic spaces and weak supervision,” in Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), pp. 1–8, June 2008. View at Publisher · View at Google Scholar · View at Scopus
    67. H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, 2011. View at Publisher · View at Google Scholar · View at Scopus
    68. B. Fernando, E. Fromont, D. Muselet, and M. Sebban, “Supervised learning of Gaussian mixture models for visual vocabulary generation,” Pattern Recognition, vol. 45, pp. 897–907, 2011. View at Publisher ·View at Google Scholar · View at Scopus
    69. L. Wu, S. C. H. Hoi, and N. Yu, “Semantics-preserving bag-of-words models and applications,” IEEE Transactions on Image Processing, vol. 19, no. 7, pp. 1908–1920, 2010. View at Publisher · View at Google Scholar · View at Scopus
    70. T. de Campos, G. Csurka, and F. Perronnin, “Images as sets of locally weighted features,” Computer Vision and Image Understanding, vol. 116, pp. 68–85, 2012. View at Google Scholar
    71. Y. T. Zheng, M. Zhao, S. Y. Neo, T. S. Chua, and Q. Tian, “Visual synset: towards a higher-level visual representation,” in Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), pp. 1–8, June 2008. View at Publisher · View at Google Scholar · View at Scopus
    72. F. Moosmann, B. Triggs, and F. Jurie, “Fast discriminative visual codebooks using randomized clustering forests,” in International Conference on Neural Information Processing Systems, pp. 985–992, 2006.
    73. J. S. Hare, S. Samangooei, and P. H. Lewis, “Efficient clustering and quantisation of SIFT features: exploiting characteristics of the SIFT descriptor and interest region detectors under image inversion,” inProceedings of the 1st ACM International Conference on Multimedia Retrieval (ICMR '11), pp. 1–8, April 2011. View at Publisher · View at Google Scholar · View at Scopus
    74. H. Jegou, H. Harzallah, and C. Schmid, “A contextual dissimilarity measure for accurate and efficient image search,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), pp. 1–8, June 2007. View at Publisher · View at Google Scholar · View at Scopus
    75. J. Winn, A. Criminisi, and T. Minka, “Object categorization by learned universal visual dictionary,” inProceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), pp. 1800–1807, October 2005. View at Publisher · View at Google Scholar · View at Scopus
    76. S. Zhang, Q. Tian, G. Hua, Q. Huang, and W. Guo, “Generating descriptive visual words and visual phrases for large-scale image applications,” IEEE Transactions on Image Processing, vol. 20, no. 9, pp. 2664–2677, 2011. View at Google Scholar
    77. E. Gavves, C. G. M. Snoek, and A. W. Smeulders, “Visual synonyms for landmark image retrieval,”Computer Vision and Image Understanding, vol. 116, pp. 238–249, 2012. View at Google Scholar
    78. R. J. López-Sastre, T. Tuytelaars, F. J. Acevedo-Rodríguez, and S. Maldonado-Bascón, “Towards a more discriminative and semantic visual vocabulary,” Computer Vision and Image Understanding, vol. 115, no. 3, pp. 415–425, 2011. View at Publisher · View at Google Scholar · View at Scopus
    79. S. H. Bae and B. H. Juang, “IPSILON: incremental parsing for semantic indexing of latent concepts,”IEEE Transactions on Image Processing, vol. 19, no. 7, pp. 1933–1947, 2010. View at Publisher · View at Google Scholar · View at Scopus
    80. K. Kesorn and S. Poslad, “An enhanced bag-of-visual words vector space model to represent visual content in athletics images,” IEEE Transactions on Multimedia, vol. 14, no. 1, pp. 211–222, 2012. View at Google Scholar
    81. P. Tirilly, V. Claveau, and P. Gros, “Language modeling for bag-of-visual words image categorization,” in Proceedings of the International Conference on Image and Video Retrieval (CIVR '08), pp. 249–258, July 2008. View at Publisher · View at Google Scholar · View at Scopus
    82. H. Cheng and R. Wang, “Semantic modeling of natural scenes based on contextual Bayesian networks,”Pattern Recognition, vol. 43, no. 12, pp. 4042–4054, 2010. View at Publisher · View at Google Scholar ·View at Scopus
    83. D. Larlus, J. Verbeek, and F. Jurie, “Category level object segmentation by combining bag-of-words models with dirichlet processes and random fields,” International Journal of Computer Vision, vol. 88, no. 2, pp. 238–253, 2010. View at Publisher · View at Google Scholar · View at Scopus
    84. Y. Wang and G. Mori, “Human action recognition by semilatent topic models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 10, pp. 1762–1774, 2009. View at Publisher · View at Google Scholar · View at Scopus
    85. B. Fasel, F. Monay, and D. Gatica-Perez, “Latent semantic analysis of facial action codes for automatic facial expression recognition,” in Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval (MIR '04), pp. 181–188, October 2004. View at Scopus
    86. J. Wang, Y. Li, Y. Zhang et al., “Bag-of-features based medical image retrieval via multiple assignment and visual words weighting,” IEEE Transactions on Medial Imaging, vol. 30, no. 11, pp. 1996–2011, 2011.View at Google Scholar
    87. X. Li and A. Godil, “Investigating the bag-of-words method for 3D shape retrieval,” EURASIP Journal on Advances in Signal Processing, vol. 2010, Article ID 108130, 2010. View at Publisher · View at Google Scholar · View at Scopus
    88. R. Toldo, U. Castellani, and A. Fusiello, “A bag of words approach for 3D object categorization,” inInternational Conference on Computer Vision/Computer Graphics Collaboration Techniques, pp. 116–127, 2009.
    89. P. Ye and D. Doermann, “No-reference image quality assessment using visual codebooks,” IEEE Transactions on Image Processing, vol. 21, no. 7, pp. 3129–3138, 2012. View at Google Scholar
    90. A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, “Describing objects by their attributes,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 1778–1785, June 2009. View at Publisher · View at Google Scholar · View at Scopus
    91. E. B. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky, “Describing visual scenes using transformed objects and parts,” International Journal of Computer Vision, vol. 77, no. 1–3, pp. 291–330, 2008. View at Publisher · View at Google Scholar · View at Scopus
    92. J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Lost in quantization: improving particular object retrieval in large scale image databases,” in Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), pp. 1–8, June 2008. View at Publisher · View at Google Scholar · View at Scopus
    93. R. Lienhart and M. Slaney, “PLSA on large scale image databases,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), pp. IV1217–IV1220, April 2007. View at Publisher · View at Google Scholar · View at Scopus
    94. S. Zhang, Q. Tian, G. Hua, Q. Huang, and S. Li, “Descriptive visual words and visual phrases for image applications,” in Proceedings of the 17th ACM International Conference on Multimedia (MM '09), pp. 75–84, October 2009. View at Publisher · View at Google Scholar · View at Scopus
    95. A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR '11), pp. 1521–1528, 2011.
    96. D. Liu, G. Hua, P. Viola, and T. Chen, “Integrated feature selection and higher-order spatial feature extraction for object categorization,” in Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), pp. 1–8, June 2008. View at Publisher · View at Google Scholar · View at Scopus
    97. N. M. Elfiky, F. S. Khan, J. van de Weijer, and J. Gonzalez, “Discriminative compact pyramids for object and scene recognition,” Pattern Recognition, vol. 45, pp. 1627–1636, 2012. View at Google Scholar
    98. A. Bosch, A. Zisserman, and X. Muñoz, “Scene classification using a hybrid generative/discriminative approach,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 712–727, 2008. View at Publisher · View at Google Scholar · View at Scopus
    99. L. Shang and B. Xiao, “Discriminative features for image classification and retrieval,” Pattern Recognition Letters, vol. 33, pp. 744–751, 2012. View at Google Scholar
    100. W. Tong, F. Li, R. Jin, and A. Jain, “Large-scale near-duplicate image retrieval by kernel density estimation,” International Journal of Multimedia Information Retrieval, vol. 1, pp. 45–58, 2012. View at Google Scholar
    101. J. Shotton, M. Johnson, and R. Cipolla, “Semantic texton forests for image categorization and segmentation,” in Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), pp. 1–8, June 2008. View at Publisher · View at Google Scholar · View at Scopus
    102. S. Romberg, R. Lienhart, and E. Horster, “Multimodal image retrieval: fusing modalities with multilayer multimodal pLSA,” International Journal of Multimedia Information Retrieval, vol. 1, no. 1, pp. 31–44, 2012. View at Google Scholar
    103. Y. J. Lee and K. Grauman, “Object-graphs for context-aware visual category discovery,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 2, pp. 346–358, 2012. View at Google Scholar
    104. J. Stottinger, A. Hanbury, N. Sebe, and T. Gevers, “Sparse color interest points for image retrieval and object categorization,” IEEE Transactions on Image Processing, vol. 21, no. 5, pp. 2681–2692, 2012. View at Google Scholar
    105. G. Ding, J. Wang, and K. Qin, “A visual word weighting scheme based on emerging itemsets for video annotation,” Information Processing Letters, vol. 110, no. 16, pp. 692–696, 2010. View at Publisher · View at Google Scholar · View at Scopus
    106. J. Qin and N. H. C. Yung, “Scene categorization via contextual visual words,” Pattern Recognition, vol. 43, no. 5, pp. 1874–1888, 2010. View at Publisher · View at Google Scholar · View at Scopus
    107. P. Tirilly, V. Claveau, and P. Gros, “Distances and weighting schemes for bag of visual words image retrieval,” in Proceedings of the ACM SIGMM International Conference on Multimedia Information Retrieval (MIR '10), pp. 323–332, March 2010. View at Publisher · View at Google Scholar · View at Scopus
    108. Y. Xiang, X. Zhou, T. S. Chua, and C. W. Ngo, “A revisit of generative model for automatic image annotation using markov random fields,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 1153–1160, June 2009. View at Publisher ·View at Google Scholar · View at Scopus
    109. M. Marszalek and C. Schmid, “Constructing category hierarchies for visual recognition,” in European Conference on Computer Vision, pp. 479–491, 2008.
    110. K. E. A. Van De Sande, T. Gevers, and C. G. M. Snoek, “A comparison of color features for visual concept classification,” in Proceedings of the International Conference on Image and Video Retrieval (CIVR '08), pp. 141–150, July 2008. View at Publisher · View at Google Scholar · View at Scopus
    111. L. J. Li and L. Fei-Fei, “What, where and who? Classifying events by scene and object recognition,” inProceedings of the IEEE 11th International Conference on Computer Vision (ICCV '07), pp. 1–8, October 2007. View at Publisher · View at Google Scholar · View at Scopus
    112. Y. Junsong, W. Ying, and Y. Ming, “Discovery of collocation patterns: from visual words to visual phrases,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), pp. 1–8, June 2007. View at Publisher · View at Google Scholar · View at Scopus
    113. F. Perronnin, C. Dance, G. Csurka, and M. Bressan, “Adapted vocabularies for generic visual categorization,” in European Conference on Computer Vision, pp. 464–475, 2006.
    114. H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-Up Robust Features (SURF),” Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008. View at Publisher · View at Google Scholar· View at Scopus
    115. H. Lee, G. Shim, Y. B. Kim, J. Park, and J. Kim, “A search ant and labor ant algorithm for clustering data,” in International Conference on Ant Colony Optimization and Swarm Intelligence, pp. 500–501, 2006.
    116. J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000. View at Publisher · View at Google Scholar ·View at Scopus
    117. D. Comaniciu and P. Meer, “Mean shift: a robust approach toward feature space analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603–619, 2002. View at Publisher · View at Google Scholar · View at Scopus
    118. P. Duygulu, K. Barnard, J. F. G. de Freitas, and D. A. Forsyth, “Object recognition as machine translation: learning a lexicon for a fixed image vocabulary,” in European Conference on Computer Vision, pp. 97–112, 2002.
    119. M. Stark and B. Schiele, “How good are local features for classes of geometric objects,” in Proceedings of the IEEE 11th International Conference on Computer Vision (ICCV '07), pp. 1–8, October 2007. View at Publisher · View at Google Scholar · View at Scopus
    120. S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and M. Hebert, “An empirical study of context in object detection,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '09), pp. 1271–1278, June 2009. View at Publisher · View at Google Scholar · View at Scopus
    121. D. Nister and H. Stewenius, “Scalable recognition with vocabulary tree,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR '06), pp. 1470–1477, 2006.
    122. J. Vogel and B. Schiele, “Semantic modeling of natural scenes for content-based image retrieval,”International Journal of Computer Vision, vol. 72, no. 2, pp. 133–157, 2007. View at Publisher · View at Google Scholar · View at Scopus
    123. M. R. Boutell, J. Luo, and C. M. Brown, “Scene parsing using region-based generative models,” IEEE Transactions on Multimedia, vol. 9, no. 1, pp. 136–146, 2007. View at Publisher · View at Google Scholar· View at Scopus
    124. J. Jeon, V. Lavrenko, and R. Manmatha, “Automatic image annotation and retrieval using cross-media relevance models,” in ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 119–126, 2003.
    125. V. Lavrenko, R. Manmatha, and J. Jeon, “A model for learning the semantics of pictures,” inInternational Conference on Neural Information Processing Systems, pp. 553–560, 2003.
    126. F. Monay and D. Gatica-Perez, “Modeling semantic aspects for cross-media image indexing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1802–1817, 2007. View at Publisher · View at Google Scholar · View at Scopus
    127. C. Siagian and L. Itti, “Gist: a mobile robotics application of context-based vision in outdoor environment,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 1063–1069, 2005.
    128. C. Siagian and L. Itti, “Rapid biologically-inspired scene classification using features shared with visual attention,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 2, pp. 300–312, 2007. View at Publisher · View at Google Scholar · View at Scopus
    129. A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, “Learning distance functions using equivalence relations,” in Proceedings of the 20th International Conference on Machine Learning, pp. 11–18, August 2003. View at Scopus
    130. J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information-theoretic metric learning,” inProceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 209–216, June 2007. View at Publisher · View at Google Scholar · View at Scopus
    131. J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, “Neighborhood component analysis,” inInternational Conference on Neural Information Processing Systems, pp. 513–520, 2004.
    132. K. Weinberger, J. Blitzer, and L. Saul, “Distance metric learning for large margin nearest neighbor classification,” in International Conference on Neural Information Processing Systems, pp. 1473–1480, 2006.
    133. J. Yang, Y. G. Jiang, A. G. Hauptmann, and C. W. Ngo, “Evaluating bag-of-visual-words representations in scene classification,” in Proceedings of the 9th ACM SIG Multimedia International Workshop on Multimedia Information Retrieval (MIR '07), pp. 197–206, September 2007. View at Publisher · View at Google Scholar · View at Scopus
    134. S. L. Feng, R. Manmatha, and V. Lavrenko, “Multiple Bernoulli relevance models for image and video annotation,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), pp. 1002–1009, July 2004. View at Scopus
    135. S. Savarese, J. Winn, and A. Criminisi, “Discriminative object class models of appearance and shape by correlatons,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), pp. 2033–2040, June 2006. View at Publisher · View at Google Scholar · View at Scopus
    136. J. Liu and M. Shah, “Scene modeling using co-clustering,” in Proceedings of the IEEE 11th International Conference on Computer Vision (ICCV '07), pp. 1–8, October 2007. View at Publisher · View at Google Scholar · View at Scopus
    137. A. Vailaya, M. A. T. Figueiredo, A. K. Jain, and H. J. Zhang, “Image classification for content-based indexing,” IEEE Transactions on Image Processing, vol. 10, no. 1, pp. 117–130, 2001. View at Publisher ·View at Google Scholar · View at Scopus
    138. J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, “Local features and kernels for classification of texture and object categories: an in-depth study,” Tech. Rep. RR-5737, INRIA Rhône-Alpes, 2005. View at Google Scholar
    139. A. Opelt, M. Fussenegger, A. Pinz, and P. Auer, “Weak hypotheses and boosting for generic object detection and recognition,” in European Conference on Computer Vision, pp. 71–84, 2004.
    140. J. Farquhar, S. Szedmak, H. Meng, and J. Shawe-Taylor, “Improving “bag-of-keypoints” image categorication,” Tech. Rep., University of Southampton, 2005. View at Google Scholar
    141. T. Deselaers, D. Keysets, and H. Ney, “Classification error rate for quantitative evaluation of content-based image retrieval systems,” in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 505–508, August 2004. View at Scopus
    142. F. Li, W. Tong, R. Jin, A. K. Jain, and J. E. Lee, “An efficient key point quantization algorithm for large scale image retrieval,” in Proceedings of the 1st ACM Workshop on Large-Scale Multimedia Retrieval and Mining (LS-MMRM '09), pp. 89–96, October 2009. View at Publisher · View at Google Scholar · View at Scopus
    143. A. K. Bhogal, N. Singla, and M. Kaur, “Comparison of algorithms for segmentation of complex scene images,” International Journal of Advanced Engineering Sciences and Technologies, vol. 8, no. 2, pp. 306–310, 2011. View at Google Scholar
    144. H. Zhang, J. E. Fritts, and S. A. Goldman, “Image segmentation evaluation: a survey of unsupervised methods,” Computer Vision and Image Understanding, vol. 110, no. 2, pp. 260–280, 2008. View at Publisher · View at Google Scholar · View at Scopus
    145. A. Perina, M. Cristani, U. Castellani, V. Murino, and N. Jojic, “Free energy score spaces: using generative information in discriminative classifiers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1249–1262, 2012. View at Google Scholar
    from: http://www.hindawi.com/journals/isrn/2012/376804/

    《Few-Shot Representation Learning for Out-Of-Vocabulary Words》

This paper was published at NAACL 2019 and targets the out-of-vocabulary (OOV) problem.

The idea feels similar in spirit to BERT and ELMo in that all of them distill word representations; the difference is that this paper applies representation learning to one specific problem, OOV words, which is a well-chosen entry point, and it proposes its own model for learning the embeddings. Earlier language models, by contrast, learn general linguistic knowledge from large corpora and then carry it over to downstream tasks.

The embeddings are produced by a hierarchical context encoder trained with the Model-Agnostic Meta-Learning (MAML) algorithm, and the method yields clear improvements on the downstream Rare-NER and part-of-speech tagging tasks.

The review is organized into four parts:

• Motivation
• Model
• Experiment
• Discussion

1. Motivation

In the real world, OOV words by definition appear rarely in the training corpus, so learning good representations for them is a challenge. The paper proposes a hierarchical attention architecture that represents a word from a limited number of observations: the word is encoded from its contexts, and only K observations per word are used to train the model, precisely so that it learns to represent infrequent words accurately.

To make the model more robust on new corpora, the training uses Model-Agnostic Meta-Learning (MAML).

2. Model

    2.1 The Few-Shot Regression Framework

    Problem formulation

On the training corpus, a standard embedding method is first used to produce word vectors; these serve as the target labels, the oracle embeddings. The episodic training procedure (used together with MAML) is as follows: select n words from a large corpus, and for each word let $S_t$ denote the set of all sentences containing it. To train the model for the OOV setting, randomly sample only K (2, 4, or 6) of each word's sentences to form an episode, i.e. a small few-shot sample; train the model on these episodes, then fine-tune it on the new test corpus $D_N$, and repeat. Character-level features are also added. Finally, cosine distance is chosen as the evaluation measure, the goal being that the word vector produced by the model is as close as possible to the oracle embedding.
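As a reading aid (my notation, not necessarily the paper's), the generic MAML update that this episodic training builds on takes an inner gradient step on one episode and an outer step that optimizes post-adaptation performance on another:

$$\theta' = \theta - \alpha \,\nabla_{\theta}\,\mathcal{L}_{\text{episode}_1}(\theta), \qquad \theta \leftarrow \theta - \beta \,\nabla_{\theta}\,\mathcal{L}_{\text{episode}_2}(\theta'),$$

where $\alpha$ and $\beta$ are the inner and outer learning rates and the per-word loss is the cosine distance to the oracle embedding, $\mathcal{L} = 1 - \cos(\hat{e}_{w_t}, e^{*}_{w_t})$. The point of the outer step here is that the resulting parameters remain easy to fine-tune on the new corpus $D_N$.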

    2.2 Hierarchical Context Encoding (HiCE)

The model structure is as follows:
[Figure: HiCE model architecture]
The model is trained much like a language model.
Input: the $K$ sentences $s_{t,k}$ that contain $w_t$, with $w_t$ masked out.
Output: the embedding of $w_t$.

The model has two parts: a context encoder and a multi-context aggregator.

The context encoder runs a Transformer encoder over each input sentence to obtain a per-sentence representation. The aggregator then combines these sentence representations, passes them through another Transformer encoder, concatenates the result with the character-level features, and the output layer produces the word embedding. The structure is fairly simple, yet it captures both the context within each sentence and the global information across sentences.

Both parts are standard Transformers, so the parameters and the computation are not spelled out here.
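Schematically, and again in my own notation rather than the paper's, the two stages can be written as

$$h_k = \mathrm{Enc}_{\text{ctx}}\big(s_{t,k}\ \text{with}\ w_t\ \text{masked}\big),\quad k = 1,\dots,K, \qquad \hat{e}_{w_t} = \mathrm{Out}\Big(\mathrm{Enc}_{\text{agg}}\big(h_1,\dots,h_K\big)\,\Vert\,\mathrm{char}(w_t)\Big),$$

where $\Vert$ is concatenation, $\mathrm{char}(w_t)$ stands for the character-level features mentioned above, and $\mathrm{Out}$ is the output layer that maps to the embedding space.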

3. Experiment

Two types of experiments evaluate the effectiveness of the proposed HiCE model:

• intrinsic evaluation: WikiText-103 (Merity et al., 2017) [1], used as $D_T$, contains 103 million words extracted from a selected set of articles
• extrinsic evaluation: downstream tasks

3.1 Intrinsic Evaluation: Evaluate OOV Embeddings on the Chimera Benchmark

HiCE is evaluated on Chimera (Lazaridou et al., 2017) [2], a widely used benchmark dataset for evaluating word embeddings of OOV words; each OOV word appears in only a few sentences, and Spearman correlation is used to judge the quality of the results, shown below.
[Figure: results on the Chimera benchmark]
1. Adapting with MAML improves performance when the number of context sentences is relatively large (i.e., 4- and 6-shot), as it mitigates the semantic gap between the source corpus $D_T$ and the target corpus $D_N$.

    3.2 Extrinsic Evaluation: Evaluate OOV Embeddings on Downstream Tasks

    Named Entity Recognition

• Rare-NER: focuses on unusual, previously unseen entities in the context of emerging discussions
• Bio-NER: focuses on technical terms in the biology domain

[Figure: NER results on Rare-NER and Bio-NER]
2. The experiment demonstrates that HiCE trained on $D_T$ already captures general language knowledge that can be transferred across domains, and adaptation with MAML further reduces the domain gap and improves performance.

The architecture is quite strong: as noted above, HiCE already learns cross-domain general knowledge, and MAML narrows the domain gap further.

4. Discussion

1. The paper picks a good entry point: it proposes its own solution to a small but real NLP problem, namely OOV words.

2. It distills word embeddings from a large corpus, using a new architecture as the language model to capture general linguistic knowledge.

3. To reduce the gap between domains, it trains with MAML, which strengthens the model's robustness.

4. The experiments should include an ablation of HiCE with and without MAML, since MAML fine-tuning affects both the morphological (character) features and the rest of the architecture.

5. It would be worth checking how BERT performs when applied directly to these tasks.

    Reference

[1] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In ICLR '17.
[2] Angeliki Lazaridou, Marco Marelli, and Marco Baroni. 2017. Multimodal word meaning induction from minimal exposure to natural text. Cognitive Science.

• R Language Notes, Part 1


Common Functions

object.size() ## size of an object in memory
names() ## names of a dataset's variables
head(x, 10), tail(x, 10) ## first/last 10 rows
summary() ## detailed summary statistics of a dataset
table(x$y) ## frequency table of the values of y
str() ## compact structure of a dataset or function
nrow(), ncol() ## number of rows/columns
sqrt(x) ## square root of x
abs(x) ## absolute value of x
names(vect2) <- c("foo","bar","norf") ## name the elements of a vector
identical(vect, vect2) ## TRUE; checks whether two vectors are identical
vect[c("foo","bar")] ## select vector elements by name
colnames(my_data) <- cnames ## set the column names of a data frame
t() ## transpose (swap the rows and columns of a data frame)
length("") counts elements rather than characters, so an empty string still counts as 1
nchar("") counts characters, so an empty string counts as 0
tolower() converts characters to lower case
toupper() converts characters to upper case
chartr("A","B",x): replace A with B in the string x
na.omit(): drop every observation that contains a missing value (row-wise, i.e. listwise deletion)
    paste()

    paste("Var",1:5,sep="")
    [1] "Var1" "Var2" "Var3" "Var4" "Var5"
    
    > x<-list(a='aaa',b='bbb',c="ccc")
    > y<-list(d="163.com",e="qq.com")
    > paste(x,y,sep="@")
    [1] "aaa@163.com" "bbb@qq.com"  "ccc@163.com"
    
#the collapse argument sets the separator used when the results are collapsed into a single string
    > paste(x,y,sep="@",collapse=';')
    [1] "aaa@163.com;bbb@qq.com;ccc@163.com"
    > paste(x,collapse=';')
    [1] "aaa;bbb;ccc"
    

strsplit(): splitting strings

strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
x is the character vector to be split
split is the character vector of separators; by default it is interpreted as a regular expression (fixed = FALSE)
fixed = TRUE matches split as plain text instead of a regular expression; plain-text matching is faster
perl = TRUE uses Perl-compatible regular expressions; for long patterns, writing the expression carefully and setting perl = TRUE can speed things up
useBytes controls whether matching is done byte by byte; the default FALSE matches by character rather than by byte
strsplit returns a list, and how to process it afterwards depends on the situation

String replacement: sub(), gsub()

Strictly speaking, R has no in-place string replacement:
arguments are passed by value, not by reference, so these functions return a modified copy.
sub replaces only the first match, while gsub replaces every match.
    > text<-c("Hello, Adam","Hi,Adam!","How are you,Ava")
    > sub(pattern="Adam",replacement="word",text)
    [1] "Hello, word"     "Hi,word!"        "How are you,Ava"
    > sub(pattern="Adam|Ava",replacement="word",text)
    [1] "Hello, word"      "Hi,word!"         "How are you,word"
    > gsub(pattern="Adam|Ava",replacement="word",text)
    [1] "Hello, word"      "Hi,word!"         "How are you,word"
    

String extraction: substr(), substring()

substr and substring split or extract strings by position; they do not themselves use regular expressions,
but combined with the regular-expression functions regexpr, gregexpr, or regexec they make it easy to pull the required pieces out of large amounts of text.
Syntax:
substr(x, start, stop)
substring(text, first, last = 1000000L)
The first argument is the character vector to operate on, the second the vector of start positions, and the third the vector of stop positions.
substr returns as many strings as the length of its first argument;
substring returns as many strings as the longest of its three arguments, recycling the shorter vectors.
    > x <- "123456789" 
    > substr(x, c(2,4), c(4,5,8)) 
    [1] "234" 
    > substring(x, c(2,4), c(4,5,8)) 
    [1] "234"     "45"      "2345678"
Because x has length 1, substr returns a single string:
only the first start/stop pair is used (start 2, stop 4).
For substring, the longest of the three arguments is c(4,5,8); by the recycling rule the first argument is effectively c(x,x,x)
and the second becomes c(2,4,2), so the extracted substrings run over positions 2-4, 4-5 and 2-8.
    

    Workspace and Files

ls() ## list the objects in the workspace
list.files(), dir() ## list all files in the working directory
dir.create("testdir") ## create the directory testdir
file.create("mytest.R") ## create the file mytest.R
file.exists("mytest.R") ## does the file exist?
file.info("mytest.R"), file.info("mytest.R")$mode ## all file metadata, or one specific field
file.rename("mytest.R","mytest2.R") ## rename to mytest2.R
file.remove("mytest.R") ## delete the file
file.copy("mytest2.R","mytest3.R") ## copy to mytest3.R
file.path("mytest3.R") ## give the relative path of a particular file among the working files
file.path("folder1","folder2") ## "folder1/folder2": builds a platform-independent path for R to use

    Create a directory in the current working directory called “testdir2” and a subdirectory for it called “testdir3”, all in one command by using dir.create() and file.path().

     dir.create(file.path('testdir2','testdir3'),recursive = TRUE)
    
 unlink("testdir2", recursive = TRUE)    ## delete the directory and all its contents (without recursive = TRUE, R refuses); the name comes from the Unix command
setwd('testdir')     ## make testdir the working directory
> old.dir <- getwd()
args()  ## show the arguments of a function
sample(x) ## with no other arguments, returns a random permutation of x
> sample(1:6, 4, replace = TRUE)
[1] 4 5 1 3

>flips <- sample(c(0,1),100,replace = TRUE, prob = c(0.3,0.7)) # prob sets the probabilities of drawing 0 and 1
    
    > flips
      [1] 1 1 1 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
     [47] 1 0 1 1 1 1 1 0 1 0 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 1 1 0 0 1 1 1
     [93] 1 1 1 1 1 0 1 1
    

    Sequence of Numbers

    > 1:10
     [1]  1  2  3  4  5  6  7  8  9 10
    
>pi:10   ## real numbers
    [1] 3.141593 4.141593 5.141593 6.141593 7.141593 8.141593 9.141593
    

?':'   ## look up the help page for the ':' operator:

    > seq(1,10)
     [1]  1  2  3  4  5  6  7  8  9 10
    
    > seq(0, 10, by=0.5)
     [1]  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5
    [19]  9.0  9.5 10.0
    
>my_seq<- seq(5,10,length=30)  ## 30 evenly spaced numbers between 5 and 10
    > 1:length(my_seq)
     [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
    > seq(along.with = my_seq)
     [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
    
    > seq_along(my_seq)  **
     [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
    
    >rep(c(0,1,2),times=10)
     [1] 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2
    >rep(c(0,1,2),each=10)
     [1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
    

    Vector

    > paste(1:3,c("X", "Y", "Z"),sep="")
    [1] "1X" "2Y" "3Z"
    

*Vector recycling!*

    > paste(LETTERS, 1:4, sep = "-")
     [1] "A-1" "B-2" "C-3" "D-4" "E-1" "F-2" "G-3" "H-4" "I-1" "J-2" "K-3" "L-4" "M-1" "N-2" "O-3"
    [16] "P-4" "Q-1" "R-2" "S-3" "T-4" "U-1" "V-2" "W-3" "X-4" "Y-1" "Z-2"
    

Data Types

Objects and Attributes

    Objects

    R has five basic or “atomic” classes of objects:

    • character
    • numeric (real numbers)
    • integer
    • complex
    • logical (True/False)

    The most basic object is a vector

    • A vector can only contain objects of the same class
    • BUT: The one exception is a list, which is represented as a vector but can contain objects of different classes (indeed, that’s usually why we use them)

    Empty vectors can be created with the vector() function.

    Numbers

• Numbers in R are generally treated as numeric objects (i.e. double precision real numbers)
    • If you explicitly want an integer, you need to specify the L suffix
    • Ex: Entering *1* gives you a numeric object; entering *1L* explicitly gives you an integer **
    • There is also a special number *Inf* which represents infinity; e.g. 1 / 0; Inf can be used in ordinary calculations; e.g. 1 / Inf is 0
    • The value *NaN* represents an undefined value (“not a number”); e.g. 0 / 0; *NaN* can also be thought of as a missing value (more on that later)
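A quick check of these points at the console:

    > class(1)
    [1] "numeric"
    > class(1L)
    [1] "integer"
    > 1 / 0
    [1] Inf
    > 1 / Inf
    [1] 0
    > 0 / 0
    [1] NaN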

    Attributes

    R objects can have attributes

    • names, dimnames
    • dimensions (e.g. matrices, arrays)
    • class
    • length
    • other user-defined attributes/metadata
      Attributes of an object can be accessed using the attributes() function
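For example, after naming a vector its names show up as an attribute:

    > x <- 1:3
    > names(x) <- c("a", "b", "c")
    > attributes(x)
    $names
    [1] "a" "b" "c"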

Vectors and Lists

    Creating Vectors

    The c() function can be used to create vectors of objects.

    > x <- c(0.5, 0.6) ## numeric
    > x <- c(TRUE, FALSE) ## logical
    > x <- c(T, F) ## logical
    > x <- c("a", "b", "c") ## character
    > x <- 9:29 ## integer
    > x <- c(1+0i, 2+4i) ## complex
    

    Using the vector() function

    > x <- vector("numeric", length = 10)
    > x
     [1] 0 0 0 0 0 0 0 0 0 0
    

    Mixing Objects

    When different objects are mixed in a vector, coercion occurs so that every element in the vector is of the same class.

    > y <- c(1.7, "a") ## character
    > y <- c(TRUE, 2) ## numeric
    > y <- c("a", TRUE) ## character
    

Explicit Coercion

    Objects can be explicitly coerced from one class to another using the as.* functions, if available.

    > x <- 0:6
    > class(x)
    [1] "integer"
    > as.numeric(x)
    [1] 0 1 2 3 4 5 6
    > as.logical(x)
    [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
    > as.character(x)
    [1] "0" "1" "2" "3" "4" "5" "6"
    

    Nonsensical coercion results in NAs

    > x <- c("a", "b", "c")
    > as.numeric(x)
    [1] NA NA NA
    Warning message:
    NAs introduced by coercion
    > as.logical(x)
    [1] NA NA NA
    > as.complex(x)
    [1] NA NA NA
    Warning message:
    NAs introduced by coercion 
    

    Lists

    Lists are a special type of vector that can contain elements of different classes. Lists are a very important data type in R and you should get to know them well.

    > x <- list(1, "a", TRUE, 1 + 4i)
    > x
    [[1]]
    [1] 1
    [[2]]
    [1] "a"
    [[3]]
    [1] TRUE
    [[4]]
    [1] 1+4i
    

Matrices

    Matrices

    Matrices are vectors with a dimension attribute. The dimension attribute is itself an integer vector of length 2 (nrow, ncol)

    > m <- matrix(nrow = 2, ncol = 3)
    > m
     [,1] [,2] [,3]
    [1,] NA NA NA
    [2,] NA NA NA
    > dim(m)
    [1] 2 3
    > attributes(m) **
    $dim
    [1] 2 3
    

    Matrices (cont’d)

    Matrices are constructed column-wise, so entries can be thought of starting in the “upper left” corner and running down the columns.

    > m <- matrix(1:6, nrow = 2, ncol = 3)
    > m
     [,1] [,2] [,3]
    [1,] 1 3 5
    [2,] 2 4 6
    

    Matrices can also be created directly from vectors by adding a dimension attribute.**

    > m <- 1:10
    > m
    [1] 1 2 3 4 5 6 7 8 9 10
    > dim(m) <- c(2, 5)  **
    > m
     [,1] [,2] [,3] [,4] [,5]
    [1,] 1 3 5 7 9
    [2,] 2 4 6 8 10
    

    cbind-ing and rbind-ing

    Matrices can be created by column-binding or row-binding with cbind() and rbind().

    > x <- 1:3
    > y <- 10:12
    > cbind(x, y)
     x y
    [1,] 1 10
    [2,] 2 11
    [3,] 3 12
    > rbind(x, y)
     [,1] [,2] [,3]
    x 1 2 3
    y 10 11 12
    

Factors

    Factors are used to represent categorical data. Factors can be unordered or ordered. One can think of a factor as an integer vector where each integer has a label.

    • Factors are treated specially by modelling functions like *lm()* and *glm()*
    • Using factors with labels is *better* than using integers because factors are self-describing; having a variable that has values “Male” and “Female” is better than a variable that has values 1 and 2.

       x <- factor(c("yes", "yes", "no", "yes", "no"))
       x
      [1] yes yes no yes no
      Levels: no yes
       table(x)
      x
      no yes
      2 3
       unclass(x)
      [1] 2 2 1 2 1
      attr(,"levels")
      [1] "no" "yes"
      

    The order of the levels can be set using the levels argument to factor(). This can be important in linear modelling because the first level is used as the baseline level.

    > x <- factor(c("yes", "yes", "no", "yes", "no"),
     levels = c("yes", "no")) **
    > x
    [1] yes yes no yes no
    Levels: yes no
    

Missing Values

    Missing values are denoted by NA or NaN for undefined mathematical operations.

    • is.na() is used to test objects if they are NA
    • is.nan() is used to test for NaN
    • NA values have a class also, so there are integer NA, character NA, etc
    • A NaN value is also NA but the converse is not true

      > x <- c(1, 2, NA, 10, 3)
      > is.na(x)
      [1] FALSE FALSE TRUE FALSE FALSE
      > is.nan(x)
      [1] FALSE FALSE FALSE FALSE FALSE
      > x <- c(1, 2, NaN, NA, 4)
      > is.na(x)
      [1] FALSE FALSE TRUE TRUE FALSE
      > is.nan(x)
      [1] FALSE FALSE TRUE FALSE FALSE
      

Data Frames

Data frames are used to store tabular data.

    • They are represented as a special type of list where every element of the list has to have the same length
    • Each element of the list can be thought of as a column and the length of each element of the list is the number of rows
    • Unlike matrices, data frames can store different classes of objects in each column (just like lists); matrices must have every element be the same class
    • Data frames also have a special attribute called *row.names*
    • Data frames are usually created by calling *read.table()* or *read.csv()*
    • Can be converted to a matrix by calling *data.matrix()* *

      > x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
      > x
       foo bar
      1 1 TRUE
      2 2 TRUE
      3 3 FALSE
      4 4 FALSE
      > nrow(x)
      [1] 4
      > ncol(x)
      [1] 2
      

Names Attribute

    Names

    R objects can also have names, which is very useful for writing readable code and self-describing objects.

    > x <- 1:3
    > names(x)
    NULL
    > names(x) <- c("foo", "bar", "norf")
    > x
    foo bar norf
     1 2 3
    > names(x)
    [1] "foo" "bar" "norf"
    

    Lists can also have names.

    > x <- list(a = 1, b = 2, c = 3)
    > x
    $a
    [1] 1
    $b
    [1] 2
    $c
    [1] 3
    

    And matrices.

    > m <- matrix(1:4, nrow = 2, ncol = 2)
    > dimnames(m) <- list(c("a", "b"), c("c", "d")) ***
    > m
     c d
    a 1 3
    b 2 4
    

    Summary

    Data Types

• atomic classes: numeric, logical, character, integer, complex
    • vectors, lists
    • factors
    • missing values
    • data frames
    • names

    Reading Writing Data

    Reading Data

    There are a few principal functions reading data into R.

    • *read.table()*, *read.csv()*, for reading tabular data
    • *readLines()*, for reading lines of a text file
    • *source()*, for reading in R code files (inverse of dump)**
    • *dget()*, for reading in R code files (inverse of dput)**
    • *load()*, for reading in saved workspaces
    • *unserialize()*, for reading single R objects in binary form

    Writing Data

    There are analogous functions for writing data to files.

    • write.table()
    • writeLines()
    • dump()
    • dput()
    • save()
    • serialize()
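These mirror the reading functions above. A minimal round trip with write.csv() and read.csv() (the file name here is arbitrary):

    > y <- data.frame(a = 1:2, b = c("x", "y"))
    > write.csv(y, file = "y.csv", row.names = FALSE)
    > read.csv("y.csv")
      a b
    1 1 x
    2 2 y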

    Reading Data Files with read.table *

    The read.table function is one of the most commonly used functions for reading data. It has a few important arguments:

    • *file*, the name of a file, or a connection
    • *header*, logical indicating if the file has a header line
    • *sep*, a string indicating how the columns are separated
    • *colClasses*, a character vector indicating the class of each column in the dataset
    • *nrows*, the number of rows in the dataset
• *comment.char*, a character string indicating the comment character
    • *skip*, the number of lines to skip from the beginning
    • *stringsAsFactors*, should character variables be coded as factors?

    read.table
    For small to moderately sized datasets, you can usually call read.table without specifying any other arguments.

    data <- read.table("foo.txt")

    R will automatically

    • skip lines that begin with a #
• figure out how many rows there are (and how much memory needs to be allocated)
• figure out what type of variable is in each column of the table. Telling R all these things directly makes R run faster and more efficiently.
    • *read.csv* is identical to *read.table* except that the default separator is a comma.

    Reading in Larger Datasets with read.table

    With much larger datasets, doing the following things will make your life easier and will prevent R from choking.

    • Read the help page for read.table, which contains many hints
    • Make a rough calculation of the memory required to store your dataset. If the dataset is larger than the amount of RAM on your computer, you can probably stop right here.
    • Set comment.char = "" if there are no commented lines in your file. **
• Use the *colClasses* argument. Specifying this option instead of using the default can make ’read.table’ run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are “numeric”, for example, then you can just set *colClasses = "numeric"*. A quick and dirty way to figure out the classes of each column is the following:
    initial <- read.table("datatable.txt", nrows = 100) ***
    classes <- sapply(initial, class)
    tabAll <- read.table("datatable.txt",
                          colClasses = classes)
    • Set *nrows*. This doesn’t make R run faster but it helps with memory usage. A mild overestimate is okay. You can use the Unix tool *wc* to calculate the number of lines in a file.
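One way to wire the last two hints together, assuming a Unix-like system where wc is available (datatable.txt is just a placeholder file name):

    n_lines <- as.integer(system("wc -l < datatable.txt", intern = TRUE))  ## count lines outside of R
    tab <- read.table("datatable.txt", nrows = n_lines + 10)               ## a mild overestimate is fine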

    Know Thy System

    In general, when using R with larger datasets, it’s useful to know a few things about your system.

    • How much memory is available?
    • What other applications are in use?
    • Are there other users logged into the same system?
    • What operating system?
    • Is the OS 32 or 64 bit?

    Calculating Memory Requirements

    I have a data frame with 1,500,000 rows and 120 columns, all of which are numeric data. Roughly, how much memory is required to store this data frame?
    1,500,000 × 120 × 8 bytes/numeric

= 1,440,000,000 bytes
= 1,440,000,000 bytes / 2^20 bytes/MB
= 1,373.29 MB
≈ 1.34 GB
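The same arithmetic at the R console:

    > rows <- 1.5e6; cols <- 120
    > bytes <- rows * cols * 8
    > bytes / 2^20   ## MB
    [1] 1373.291
    > bytes / 2^30   ## GB
    [1] 1.341105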

    Textual Formats

    • *dumping* and *dputing* are useful because the resulting textual format is edit-able, and in the case of corruption, potentially recoverable.
    • *Unlike* writing out a table or csv file, *dump* and *dput* preserve the *metadata* (sacrificing some readability), so that another user doesn’t have to specify it all over again.
    • *Textual* formats can work much better with version control programs like subversion or git which can only track changes meaningfully in text files
    • Textual formats can be longer-lived; if there is corruption somewhere in the file, it can be easier to fix the problem
    • Textual formats adhere to the “Unix philosophy”
    • Downside: The format is not very space-efficient

    dput-ting R Objects ?

    Another way to pass data around is by deparsing the R object with dput and reading it back in using dget.

    > y <- data.frame(a = 1, b = "a")
    > dput(y)
    structure(list(a = 1,
                     b = structure(1L, .Label = "a",
                                            class = "factor")),
                .Names = c("a", "b"), row.names = c(NA, -1L),
                class = "data.frame")
    > dput(y, file = "y.R")
    > new.y <- dget("y.R")
    > new.y
         a    b
    1   1    a
    

    Dumping R Objects ?

Multiple objects can be deparsed using the dump function and read back in using source.

    > x <- "foo"
    > y <- data.frame(a = 1, b = "a")
    > dump(c("x", "y"), file = "data.R")
    > rm(x, y)
    > source("data.R")
    > y
        a  b
    1  1  a
    > x
    [1] "foo"
    

    Interfaces to the Outside World

    Data are read in using connection interfaces. Connections can be made to files (most common) or to other more exotic things.

    • *file*, opens a connection to a file
    • *gzfile*, opens a connection to a file compressed with gzip
    • *bzfile*, opens a connection to a file compressed with bzip2
    • *url*, opens a connection to a webpage

    File Connections **

    > str(file)
    function (description = "", open = "", blocking = TRUE,
                encoding = getOption("encoding"))
    
     1. *description* is the name of the file
     2. *open* is a code indicating
        - “r” read only
        - “w” writing (and initializing a new file)
        - “a” appending
        - “rb”, “wb”, “ab” reading, writing, or appending in binary mode (Windows)
    

    Connections

    In general, connections are powerful tools that let you navigate files or other external objects. In practice, we often don’t need to deal with the connection interface directly.

    con <- file("foo.txt", "r") **
    data <- read.csv(con)
    close(con)

    is the same as

    data <- read.csv("foo.txt")

    Reading Lines of a Text File

    > con <- gzfile("words.gz")
    > x <- readLines(con, 10)
    > x
     [1] "1080"        "10-point"   "10th"         "11-point"
     [5] "12-point"  "16-point"   "18-point"  "1st"
     [9] "2"              "20-point"
    

writeLines takes a character vector and writes each element one line at a time to a text file.
readLines can be useful for reading in the lines of webpages:

    ## This might take time
    con <- url("http://www.jhsph.edu", "r")
    x <- readLines(con)
    > head(x)
    [1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\">"
    [2] ""
    [3] "<html>"
    [4] "<head>"
    [5] "\t<meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8
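For the writing direction, a minimal writeLines sketch (the file name is arbitrary):

    > con <- file("notes.txt", "w")
    > writeLines(c("first line", "second line"), con)
    > close(con)
    > readLines("notes.txt")
    [1] "first line"  "second line"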

    Subsetting

    There are a number of operators that can be used to extract subsets of R objects.

    • [ always returns an object of the same class as the original; can be used to select more than one element (there is one exception)
    • [[ is used to extract elements of a list or a data frame; it can only be used to extract a single element and the class of the returned object will not necessarily be a list or data frame
    • $ is used to extract elements of a list or data frame by name; semantics are similar to that of [[.

    > x <- c("a", "b", "c", "c", "d", "a")
    > x[1]
    [1] "a"
    > x[2]
    [1] "b"
    > x[1:4]
    [1] "a" "b" "c" "c"
    > x[x > "a"]
    [1] "b" "c" "c" "d"
    > u <- x > "a"
    > u
    [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE
    > x[u]
    [1] "b" "c" "c" "d"

    Subsetting Lists

    > x <- list(foo = 1:4, bar = 0.6)
    > x[1]
    $foo
    [1] 1 2 3 4
    > x[[1]]
    [1] 1 2 3 4
    > x$bar
    [1] 0.6
    > x[["bar"]]
    [1] 0.6
    > x["bar"]
    $bar
    [1] 0.6
    
    > x <- list(foo = 1:4, bar = 0.6, baz = "hello")
    > x[c(1, 3)]
    $foo
    [1] 1 2 3 4
    $baz
    [1] "hello"
    

    The [[ operator can be used with computed indices; $ can only be used with literal names.

    > x <- list(foo = 1:4, bar = 0.6, baz = "hello")
    > name <- "foo"
    > x[[name]]     ## computed index for ‘foo’
    [1] 1 2 3 4
    > x$name       ## element ‘name’ doesn’t exist!
    NULL
    > x$foo
    [1] 1 2 3 4       ## element ‘foo’ does exist
    

    Subsetting Nested Elements of a List
    The [[ can take an integer sequence.

    > x <- list(a = list(10, 12, 14), b = c(3.14, 2.81))
    > x[[c(1, 3)]]
    [1] 14
    > x[[1]][[3]]
    [1] 14
    > x[[c(2, 1)]]
    [1] 3.14
    

    Subsetting a Matrix

    Matrices can be subsetted in the usual way with (i,j) type indices.

    > x <- matrix(1:6, 2, 3)
    > x[1, 2]
    [1] 3
    > x[2, 1]
    [1] 2
    

    Indices can also be missing.

    > x[1, ]
    [1] 1 3 5
    > x[, 2]
    [1] 3 4
    

    By default, when a single element of a matrix is retrieved, it is returned as a vector of length 1 rather than a 1 × 1 matrix. This behavior can be turned off by setting drop = FALSE.

    > x <- matrix(1:6, 2, 3)
    > x[1, 2]
    [1] 3
    > x[1, 2, drop = FALSE]
         [,1]
    [1,]    3
    

    Similarly, subsetting a single column or a single row will give you a vector, not a matrix (by default).

    > x <- matrix(1:6, 2, 3)
    > x[1, ]
    [1] 1 3 5
    > x[1, , drop = FALSE]
         [,1] [,2] [,3]
    [1,]    1    3    5
    

    Partial Matching

    Partial matching of names is allowed with [[ and $

    > x <- list(aardvark = 1:5)
    > x$a
    [1] 1 2 3 4 5
    > x[["a"]]
    NULL
    > x[["a", exact = FALSE]] ***
    [1] 1 2 3 4 5 
    

    Removing NA Values

    A common task is to remove missing values (NAs).

    > x <- c(1, 2, NA, 4, NA, 5)
    > bad <- is.na(x)
    > x[!bad]
    [1] 1 2 4 5
    

    What if there are multiple things and you want to take the subset with no missing values?

    > x <- c(1, 2, NA, 4, NA, 5)
    > y <- c("a", "b", NA, "d", NA, "f")
    > good <- complete.cases(x, y)
    > good
    [1] TRUE TRUE FALSE TRUE FALSE TRUE
    > x[good]
    [1] 1 2 4 5
    > y[good]
    [1] "a" "b" "d" "f"
    
    > airquality[1:6, ]
      Ozone Solar.R Wind Temp Month Day
    1    41     190  7.4   67     5   1
    2    36     118  8.0   72     5   2
    3    12     149 12.6   74     5   3
    4    18     313 11.5   62     5   4
    5    NA      NA 14.3   56     5   5
    6    28      NA 14.9   66     5   6
    > good <- complete.cases(airquality)
    > airquality[good, ][1:6, ]
      Ozone Solar.R Wind Temp Month Day
    1    41     190  7.4   67     5   1
    2    36     118  8.0   72     5   2
    3    12     149 12.6   74     5   3
    4    18     313 11.5   62     5   4
    7    23     299  8.6   65     5   7
    

    Vectorized Operations

    Many operations in R are vectorized making code more efficient, concise, and easier to read.

    > x <- 1:4; y <- 6:9
    > x + y
    [1] 7 9 11 13
    > x > 2
    [1] FALSE FALSE TRUE TRUE
    > x >= 2
    [1] FALSE TRUE TRUE TRUE
    > y == 8
    [1] FALSE FALSE TRUE FALSE
    > x * y
    [1] 6 14 24 36
    > x / y
    [1] 0.1666667 0.2857143 0.3750000 0.4444444
    

    Vectorized Matrix Operations

    > x <- matrix(1:4, 2, 2); y <- matrix(rep(10, 4), 2, 2)
    > x * y             ## element-wise multiplication
         [,1] [,2]
    [1,]   10   30
    [2,]   20   40
    > x / y
         [,1] [,2]
    [1,]  0.1  0.3
    [2,]  0.2  0.4
    > x %*% y           ## true matrix multiplication
         [,1] [,2]
    [1,]   40   40
    [2,]   60   60
    

    Missing Values

    Use is.na(mydata) to test for missing values; mydata == NA does not work, because any comparison with NA yields NA rather than TRUE or FALSE.
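
    For example:

    > x <- c(1, NA, 3)
    > is.na(x)
    [1] FALSE  TRUE FALSE
    > x == NA       ## every comparison with NA is NA
    [1] NA NA NA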

    R uses “one-based indexing”, which (you guessed it!) means the first element of a vector is considered element 1.

    x[c(2, 10)]    ## select the 2nd and 10th elements of x
    x[c(-2, -10)]  ## select every element except the 2nd and 10th
    x[-c(2, 10)]   ## same as above
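
    For instance, with a made-up vector x of length 10:

    > x <- c(5, 2, 9, 1, 7, 3, 8, 6, 4, 0)
    > x[c(2, 10)]
    [1] 2 0
    > x[-c(2, 10)]
    [1] 5 9 1 7 3 8 6 4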


    Image Retrieval with Bag of Visual Words

    You can use the Computer Vision System Toolbox™ functions to search by image, also known as a content-based image retrieval (CBIR) system. CBIR systems are used to retrieve images from a collection that are similar to a query image. Applications of these systems can be found in many areas, such as web-based product search, surveillance, and visual place identification. The system first searches a collection of images to find the ones that are visually similar to the query image.

    The retrieval system uses a bag of visual words, a collection of image descriptors, to represent your data set of images. Images are indexed to create a mapping of visual words. The index maps each visual word to its occurrences in the image set. A comparison between the query image and the index provides the images most similar to the query image. By using the CBIR system workflow, you can evaluate the accuracy for a known set of image search results.
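
    Conceptually, the index works like an inverted index over visual words. The sketch below (written in R with made-up counts; it is not the toolbox implementation) shows how such an index can restrict the search to candidate images and rank them by histogram similarity:

    ## Toy bag-of-visual-words counts: 3 images, vocabulary of 5 visual words
    bow <- rbind(img1 = c(2, 0, 1, 0, 3),
                 img2 = c(0, 1, 0, 4, 0),
                 img3 = c(1, 0, 2, 0, 2))

    ## Inverted index: for each visual word, the images that contain it
    inv_index <- apply(bow > 0, 2, which)

    query <- c(1, 0, 1, 0, 2)   ## bag-of-visual-words histogram of the query image

    ## Only score images that share at least one visual word with the query
    candidates <- unique(unlist(inv_index[query > 0]))
    cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
    scores <- sapply(candidates, function(i) cosine(bow[i, ], query))
    rownames(bow)[candidates[order(scores, decreasing = TRUE)]]
    ## "img1" "img3"   (img1 is the closest match for this toy query)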

    Retrieval System Workflow

    1. Create an image set that represents the images available for retrieval. Use imageSet to store the image data. Use a large number of images that represent various viewpoints of the object; a large and diverse image set helps train the bag of visual words and increases the accuracy of the image search.

    2. Choose the type of feature. The indexImages function creates the bag of visual words using speeded-up robust features (SURF). For other types of features, you can use a custom extractor and then use bagOfFeatures to create the bag of visual words. See the Create Search Index Using Custom Bag of Features example.

      You can use the original imgSet or a different collection of images for the training set. To use a different collection, create the bag of visual words before creating the image index, using the bagOfFeatures function. The advantage of using the same set of images is that the visual vocabulary is tailored to the search set. The disadvantage of this approach is that the retrieval system must relearn the visual vocabulary to use on a drastically different set of images. With an independent set, the visual vocabulary is better able to handle the additions of new images into the search index.

    3. Index the images. The indexImages function creates a search index that maps visual words to their occurrences in the image collection. When you create the bag of visual words using an independent or subset collection, include the bag as an input argument to indexImages. If you do not create an independent bag of visual words, then the function creates the bag based on the entire imgSet input collection. You can add and remove images directly to and from the image index using the addImages and removeImages methods.

    4. Search the data set for similar images. Use the retrieveImages function to search the image set for images that are similar to the query image. Use the NumResults property to control the number of results; for example, set NumResults to 10 to return the top 10 similar images. Use the ROI property to search with a smaller region of the query image; a smaller region is useful for isolating a particular object in an image that you want to search for.

    Evaluate Image Retrieval

    Use the evaluateImageRetrieval function to evaluate image retrieval by using a query image with a known set of results. If the results are not what you expect, you can modify or augment the image features used by the bag of visual words. Examine the type of features retrieved: the type of feature used for retrieval depends on the type of images within the collection. For example, if you are searching an image collection made up of scenes, such as beaches, cities, or highways, use a global image feature. A global image feature, such as a color histogram, captures the key elements of the entire scene. To find specific objects within the image collection, use local image features extracted around object keypoints instead.
