Sep 5, 2015 - AbstractâThe downfall of many supervised learn- ... we present, a novel approach to supervised learning through the method of ... with...

0 downloads 12 Views 109KB Size

1

Introduction to Gravitational Clustering

arXiv:1509.01659v1 [cs.LG] 5 Sep 2015

Armen Aghajanyan

Abstract—The downfall of many supervised learning algorithms, such as neural networks, is the inherent need for a large amount of training data (Benediktsson et al., 1993). Although there is a lot of buzz about big data, there is still the problem of doing classification from a small data-set. Other methods such as support vector machines, although capable of dealing with few samples, are inherently binary classifiers (Cortes and Vapnik, 1995), and are in need of learning strategies such as One vs All in the case of multi-classification. In the presence of a large number of classes this can become problematic. In this paper we present, a novel approach to supervised learning through the method of clustering. Unlike traditional methods such as K-Means (MacQueen, 1967), Gravitational Clustering does not require the initial number of clusters, and automatically builds the clusters, individual samples can be arbitrarily weighted and it requires only few samples while staying resilient to over-fitting. Keywords—Machine Learning, Classification, Clustering.

→ dynamic position − x and a static class θ. Mathematically: m r − → x θ

∈ ∈ ∈ ∈

R R Rn Z

P

=

− {m, r, → x , θ}

Our universe will simply consist of a set of planets. The universe will also hold a couple of global constants. The initial radius of a planet that has just been created which we will denote with r′ . The so called percent step, which represents the amount a test mass moves before recalculating the new forces on the test mass. We will denote this with the Greek α. The amount of steps taken or iterations will be denoted with β. The distance between planets will → → be calculated with the function denoted D(− x,− y ). III.

I.

Introduction

The name of this algorithm is derived from the metaphor that the algorithm was built upon. Each cluster is symbolic of a planet,and each planet has a mass and a radius as well as the class that it represents. But unlike real life planets, our planets are static with respect to other planets. The process of training can be conceptually thought of as building a universe. The process of predicting is simply placing a mass in the universe and tracing what planet it will appear on. This 1) 2) 3)

algorithm exhibits three nice properties: Ability to learn from a few samples. Ability to weight the importance of training vectors. The nature of the algorithm makes it resilient to overfitting. The ability to weight the importance of training vectors as well as the ability to learn from a few samples allows us to model a system that supports the notion of prototypes, e.g. Eleanor Rosch (Lakoff, 1987)(P. 41). II.

Definition

Let us start by mathematically defining what each one of our symbolic structures will be. The most important structure is our cluster or our planet. We will define the planet as containing a dynamic mass m, dynamic radius r,

(1)

Training Model

One of the better aspects of the model is its ability to rate your feature vectors. To do so, let us define a hybrid feature vector h. → h = {− x , m, θ} (2) The m variable allows us to rate the value of the feature vector. For example if you have a probabilistic diagnosis, each feature vector will contain the class of the diagnosis as well as the probability of the diagnosis represented by the mass. The training is quite simple. Below is the pseudocode. − → nearplanets ← Find Planets in Radius of h x ; nearplanets ← nearplanets where Pθ = hθ ; if nearplanets is Empty then Universe Add Planet − → − {m = hm , r = r′ , → x = h x , θ = hθ } else p ← planet that generates most force ∈ nearplanets ; Universe update p ← − → → → p x + hmm h x } x = pmm − {m = pm + hm , r = m ppmr , − end Algorithm 1: Training Algorithm The new position is a weighted sum of the two position vectors with respect to their weight.

PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. Y, FEBRUARY 2015

A. Asymptotic Analysis Our training simply traverses through all of the planets in the universe and computes the distance from the training sample. Saying N is the amount of planets and D is the dimensionality of our feature vectors. Assuming that the planet exists, we get O (D ∗ N ) = O (N )

(3)

Using a KD-Tree (Bentley, 1975) will allow us to train with the average asymptotic of O (D ∗ log N ) = O (log N )

(4)

On the flip side, assuming we have to add the planet: O (D ∗ N + Nnear ) = O (N + Nnear )

(5)

KD-Tree (Bentley, 1975) O (D ∗ log N + Nnear ) = O (log N + Nnear )

(6)

This is the asymptotic of adding a single train vector. Stating that Ns is the number of samples we end up with the final equation being. O (Ns (log N + Nn ))

Big O Online Training Variant Importance

• •

O (Ns (log N + Nn ))

K-Means O(nDk+1 log n)

To restate, α is the percent step taken with respect to the force. Now let us describe the simulation algorithm: − → pos ← l x ; for i in i [0, β] step 1 do → − P force ← p∈Universe pm ∗( pr2x −pos) ; ; α norm ← ||f orce|| f orce ; pos ← pos + norm end nearplanets ← Find Planets in Radius of pos; if nearplanets is not Empty then return mode[nearplanets θ] else return [planet closest to pos] θ end

A. Asymptotic Analysis of Simulation Testing Model Let us state that N is the number of planets and D is the dimensionality of our feature vector. Calculating the force takes up SVM O(n3 )

Decision Trees O(ns D log(ns ))

Yes

Yes

No

Partial

Yes

No

No

No

Nn is synonymous with Nnear Ns is synonymous with Nsamples IV.

− → → Where r is D(− p x , l x ). We define the total normalized force on our test mass with the custom equation. → − → − P Fnet = p∈Universe pm ∗( prx2 − l x ) (10) α Fnorm = ||Fnet || Fnet

(7)

B. Comparison of Training Times Gravitational Clustering

2

Simulation Testing Model

Metaphorically, predicting the class of a new point is equivalent to dropping a piece of mass into the universe and tracing the mass until it collides with a planet. In this metaphor, we assume that the planets are infinitely small and therefore there will be no interference. Our test → point will simply be defined as l = {− x } Let us first define getting the normalized directional force vector. Recall from physics that the gravitational force between two planets is m1 m2 (8) F =G 2 r In our case, we will assume that the mass of each test point is equal to every other, therefore we can disregard the mass. We can also remove the G constant. Our hybrid force equation per planet p is now: pm (9) F = 2 r

O (4D ∗ N ) = O (N )

(11)

The 4 comes from the vector arithmetic that needed to be done. One subtraction, one multiplication, one distance squared, one division. The N term came from the summation. The total simulation next becomes. O (7D ∗ N ∗ β + N )

(12)

The 3 more D terms come from: finding the magnitude, multiplying by force (simultaneously multiplying by α) and the update summation. The next N came from finding the planets with the radius containing pos. We can disregard the final if statement since they do not directly affect N. We get: O (7D ∗ N ∗ β + N ) = O (N (7D ∗ β + 1)) = O (N ) (13) V. Probabilistic Non-Simulating Model We propose an different method of computing the class of the test point, without the need of simulation and through purely statistical methods. We first make an assumption that a planet or cluster is normally distributed from the center and the standard deviation is some function of the radius of the planet σ(pr ). Therefore let us the define the probability density function. → −

PDFp =

− −D(→ p x , l x )2 1 e 2σ(pr )2 2π ∗ σ(pr )

(14)

PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. Y, FEBRUARY 2015 Accuracy Per Data-set

Now to define our prediction equation: Universe Y

MAXθ [

p | pθ =θn

1 e 2π ∗ σ(pr )

→ − − −D(→ p x , l x )2 2σ(pr )2

]

(15)

To account for the fact that different classes have different amounts of planets, we will transform this function into: MAXθ [

log

QUniverse

p | pθ =θn

e

→ − − −D(→ p x , l x )2 ( p 2σ(p m r )2

)

]

|p | pθ = θn |

(16)

We removed the normalization constant, due to the fact that this is a relative measure. The bottom of the fraction is the number of planets per class which insures that there is no bias due to the different amounts of clusters with varying radius’s. The mass term is added to insure that greater planets have a greater impact on the rating. Through trial and error we found the best function for σ(pr ) was simply p2r . The asymptotic will simply be O (DN ) = O (N ) VI.

(17)

Testing Results

We tested the algorithm out on the Wisconsin breast cancer data-set (Wolberg and Mangasarian, 1990) (Lichman, 2013). Below are the results. Gravitational Clustering Simulated Model Probabilistic Model

r ′ = 50 α = 0.01 β = 100 89.65% 92.78%

r ′ = 5000 α = 0.001 β = 1000 90.59% 72.41%

It is interesting to note that the larger the clusters and smaller the amount of clusters the less accurate the probabilistic model will be. Unless of course the clusters perfectly model the data that they encapsulate. We continued our testing by comparing the outputs of some popular out of the box methods. All the other algorithms were implemented in the scikit-learn library (Pedregosa et al., 2011). The data-sets we used were the popular Iris data-set (Lichman, 2013), digits data-set(Pedregosa et al., 2011), Ollivetti data-set (Bevilacqua et al., 2006). Algorithm Data-sets

GC Prob

GC Sim

Iris Digits Olivetti

98.41% 86.95% 65.5%

96.82% 91.04% 77.5%

SVM (poly) 94.66% 98.99% 7.5%

3

SVM (rbf) 97.33% 25.61% 8.5%

Naive Bayes (Gaussian) 96% 83.85% 99.5%

To show that our algorithm can handle very few samples, we tested the following data-sets again, but this time we only used 1 sample per each class as the training data. Below are the results.

Algorithm Type Iris Digits Olivetti

GC Prob

GC Sim

93.33% 59.96% 63.5%

92.00% 58.18% 53.75%

VII. Conclusion In this paper we introduced a novel technique to clustering and supervised learning that can learn from a few samples, while maintaining a low asymptotic run-time and inherently allowing for arbitrary sample weighting. We compared it to current techniques for classification and showed both the strengths of the algorithm as well as the weaknesses. From the test results we can infer that our algorithm acts consistently in both low and high dimensional data, as well as staying consistent in a range of multi-class data-sets. All the code written, including the tests and the algorithm itself can be found on https://github.com/ArmenAg/GravitationalClustering/ Thank you for reading. References Benediktsson, J. A., SWAIN, P. H., and ERSOY, O. K. (1993). Conjugate-gradient neural networks in classification of multisource and very-high-dimensional remote sensing data. International Journal of Remote Sensing, 14(15):2883–2903. Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509–517. Bevilacqua, V., Mastronardi, G., Pedone, A., Romanazzi, G., and Daleno, D. (2006). Hidden markov models for recognition using artificial neural networks. 4113:126– 134. Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297. Lakoff, G. (1987). Women, Fire, and Dangerous Things. The University of Chicago Press. Lichman, M. (2013). UCI machine learning repository. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. pages 281–297. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830. Wolberg, W. and Mangasarian, O. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology,. pages 9193–9196.