  • Implementing the K-means clustering algorithm in Scala

    2016-11-09 09:55:30
    A small K-means program implemented in Scala

     K-means is a simple clustering algorithm that is still widely used. Its strengths are fast convergence and little need for human intervention; its weaknesses are just as obvious: the value of K must be known in advance, and the clustering result is not stable.

    For the underlying theory, see:

    http://blog.csdn.net/qll125596718/article/details/8243404/

    That blog post explains the theory very clearly and also gives C/C++ and Java versions, so you can dig deeper there. Still, I want to stress a few details as I understand them:

           1. K-means first picks K cluster centers at random; this is a pre-assignment step. The real clustering happens as new centers are computed.

            2. When K-means finishes, the final cluster centers are generally not points from the given input set; they are computed values (see the small sketch below).
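
    For example, here is a tiny sketch of point 2 (my own illustration, not part of the program below): the new center is the coordinate-wise mean of the cluster's members, which in general is not one of the inputs:

    val cluster = Array(Vector(1.0, 2.0), Vector(3.0, 4.0))
    val center = cluster
      .reduceLeft((a, b) => a.zip(b).map { case (x, y) => x + y })  // coordinate-wise sum
      .map(_ / cluster.length)                                      // divide by the cluster size
    // center == Vector(2.0, 3.0), which is not among the input points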

    Here I will implement the K-means algorithm in Scala. I give the data the program needs to run, the intermediate results, and, what everyone cares about most, detailed comments on the code.

    I developed it with Eclipse; the project layout is shown in the screenshot below:

    [project structure screenshot]
    You can also download the whole project from the link below, no points required:

    http://download.csdn.net/detail/u014512572/9677346

           Since I am still a rookie myself, I know very well how painful it is to figure things out alone, so the code comments are aimed mainly at newcomers (experts, please bear with me). Before the code, though, I need to explain the map() and reduce() methods:
    1.  map() is a mapping method: use it whenever you need to process a dataset element by element, for example:
    points.map(_.toDouble)
    points.map(a => a+1)
    points.map(a => {
      // function body
    })
    Whether you write the wildcard "_" or a name such as a, it stands for each element of the dataset.
           
    2.  reduce() is a reduction method: use it when you need to fold a dataset into a single value. I usually prefer reduceLeft(), which gives me a sense of order. For example, this reduceLeft() call appears in the program:

    centers.reduceLeft((a, b) =>
      if (vectorDis(a, point) < vectorDis(b, point)) a else b
    )

    Here a and b are elements of the dataset: a reduction always combines two values at a time, so (a, b) stands for the pair currently drawn from the dataset, and the condition picks the one you want to keep.
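
    To make both methods concrete, here is a tiny self-contained sketch (the values are made up purely for illustration):

    val nums = Vector(3.0, 1.0, 4.0, 1.5)
    println(nums.map(_ + 1))                                 // Vector(4.0, 2.0, 5.0, 2.5)
    println(nums.reduceLeft((a, b) => if (a < b) a else b))  // 1.0, i.e. the minimum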

    Now for the code:

    
    
    /**
     * @author weixu_000
     */
    
    import java.util.Random
    import scala.io.Source
    import java.io._
    
    object Kmeans {
    
      val k = 5
      val dim = 41                  // dimension of each data point in my dataset
      val shold = 0.0000000001      // hand-picked threshold, used at the end to judge how far the centers moved
      
      val centers =new Array[Vector[Double]](k)
      
      def main(args:Array[String]){
          
          //------------------------------------input data ------------------------
    
          val fileName = "data/testData.txt"
          val lines = Source.fromFile(fileName).getLines()
      val points =lines.map(line => {
             val parts = line.split(" ").map(_.toDouble)     // "_" is the wildcard: it stands for the element currently being processed, whatever its type
             var vector = Vector[Double]()                   // Vector is immutable but growable: each text line becomes one Vector, so the whole file becomes an Array[Vector[Double]]
             for( i <- 0 to dim-1)                           // collecting a line into a single Vector lets map() hand us one whole data point at a time
             vector ++= Vector(parts(i))
             vector
      }).toArray
     
          findCenters(points)
          kmeans(points,centers)
          putout(points,centers)
          
        }
      
      //-------------------------find centers----------------------------------  
      def findCenters(points:Array[Vector[Double]])={
     val rand = new Random(System.currentTimeMillis())
     for(i <- 0 to k-1){
        centers(i) = points(rand.nextInt(points.length))   // nextInt(n) already yields an index in 0 .. n-1
         }
    
         val writerCenters = new PrintWriter(new File("data/centers.txt"))
         for(i <- 0 to k-1){
         writerCenters.println(centers(i))
         }
         writerCenters.close()
       }
       
      //-----------------------------doing cluster---------------------------- 
      def kmeans(points:Array[Vector[Double]],centers:Array[Vector[Double]])={
         var bool = true
         var index = 0
         while(bool){                                                
          
       //group the points by their nearest center with groupBy(); the resulting cluster is a Map[Vector[Double], Array[Vector[Double]]]
       //it holds up to k (here 5) entries: each key is a center, and its value is the set of points assigned to that center
           val cluster = points.groupBy { closestCenter(centers,_) } 
           
       //fetch each group from the Map with get() and compute the new center by pattern matching; once more: Vector is immutable, so data stored in one cannot be changed,
       //which is why the Vector arithmetic has to be defined by hand (vectorAdd and vectorDivide below)
           val newCenters = centers.map { oldCenter => 
             cluster.get(oldCenter) match{
               case Some(pointsInCluster) => 
                 vectorDivide(pointsInCluster.reduceLeft(vectorAdd(_,_)),pointsInCluster.length)
               case None => oldCenter
             }
            }
        
           var movement = 0d
           for(i <- 0 to k-1){
         movement += vectorDis(centers(i),newCenters(i))    // accumulate how far each center moved
             centers(i) = newCenters(i) 
           }
           if(movement <= shold){
             bool = false
           }
          index += 1
         }
       }
      
      //---------------------------putout----------------------------------------- 
    //what we finally output is the cluster assignment: each point is written out as a label, and points sharing a label belong to the same cluster
    //I could not think of a better way than computing the nearest center one more time
       
      def putout(points:Array[Vector[Double]],centers:Array[Vector[Double]])={
         val pointsNum = points.length
         val pointLable = new Array[Int](pointsNum)
         for(i <- 0 to pointsNum-1){
            val temp = centers.reduceLeft((a,b) => 
            if ((vectorDis(a,points(i))) < (vectorDis(b,points(i))))  a
            else  b)
            pointLable(i) = centers.indexOf(temp)
         }
    
         val writerLable = new PrintWriter(new File("data/output.txt"))
         for(i <- 0 to pointsNum-1){
         writerLable.println(pointLable(i))
         }
          writerLable.close()
         
       }
        
  def vectorDis(v1:Vector[Double],v2:Vector[Double]):Double={
     var distance = 0d
     for(i <- 0 to dim-1){
        distance += (v1(i)-v2(i))*(v1(i)-v2(i))
     }
     math.sqrt(distance)
   }
       
      def vectorAdd(v1:Vector[Double],v2:Vector[Double])={
          val len=v1.length
          val av1=v1.toArray
          val av2=v2.toArray
          val av3=Array.fill(len)(0.0)
          var vector = Vector[Double]()
          for(i<-0 to len-1){
            av3(i)=av1(i)+av2(i)
            vector ++= Vector(av3(i))
          }
          vector
       }
       
      def vectorDivide(v1:Vector[Double],num:Int)={
          val av1=v1.toArray
          val len=v1.size
          val av2=Array.fill(len)(0.0)
          var vector = Vector[Double]()
          for(i<-0 to len-1){
            av2(i)=av1(i)/num
            vector ++= Vector(av2(i))
          }
          vector
       }
       
       /*
       def vectorAdd(v1:Vector[Double],v2:Vector[Double])={
         val  sumVector = Vector.fill(dim)(0.0)
            for(i <- 0 to dim-1){
              sumVector.updated(i, v1(i)+v2(i))
            }
         sumVector
       }
    
       def vectorDivide(v1:Vector[Double],num:Int)={
          for(i <- 0 to dim-1){
            v1.updated(i, v1(i)/num)
          }
          v1
       }
       * 
       */
    
      def closestCenter(centers:Array[Vector[Double]],point:Vector[Double])
       :Vector[Double]={
               centers.reduceLeft((a, b) => 
                if ((vectorDis(a,point)) < (vectorDis(b,point))) a else b
            )
            
       } 
       
      
    }
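
    The reader in main() expects data/testData.txt to hold one point per line as space-separated doubles, dim = 41 values each. If you would rather not download the archive, a hypothetical generator along these lines (my own sketch, not part of the original project) produces a compatible file:

    import java.io.{File, PrintWriter}
    import java.util.Random

    object MakeTestData {
      def main(args: Array[String]): Unit = {
        val rand = new Random()
        val writer = new PrintWriter(new File("data/testData.txt"))
        for (_ <- 1 to 1000)                                                // 1000 random points
          writer.println(Array.fill(41)(rand.nextDouble()).mkString(" "))   // 41 doubles per line
        writer.close()
      }
    }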
    tips:

    1. You can see that while writing this program I used println() a lot to dump intermediate results; that, too, is a way of hunting bugs.

    2. I did not understand the Vector type at first, which is exactly how I produced the two broken vectorAdd() and vectorDivide() methods commented out at the end.

    3. For readability I wrote the program as separate function blocks, which forced me to introduce some global variables and makes the memory footprint larger; if resource usage matters to you, restructure it as you see fit.
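
    On tip 2: once you treat Vector as an ordinary immutable collection, the two helpers can be written without any mutable state at all. A minimal alternative sketch (same signatures as the methods in the program above):

    def vectorAdd(v1: Vector[Double], v2: Vector[Double]): Vector[Double] =
      v1.zip(v2).map { case (a, b) => a + b }   // element-wise sum

    def vectorDivide(v1: Vector[Double], num: Int): Vector[Double] =
      v1.map(_ / num)                           // element-wise division by a scalar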

  • Gaussian mixture clustering implemented in Scala

     A Gaussian mixture clustering implemented in Scala; the results are decent. For the theory, see 西瓜书 (Zhou Zhihua's *Machine Learning*), pp. 206-210.
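
    For reference, these are the standard formulas the code below implements (following the 西瓜书 treatment cited above). A d-dimensional Gaussian component has density

    $$p(\mathbf{x} \mid \boldsymbol\mu_i, \boldsymbol\Sigma_i) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol\mu_i)^{\mathsf T}\boldsymbol\Sigma_i^{-1}(\mathbf{x}-\boldsymbol\mu_i)\right)$$

    and the E-step responsibility of component i for point x_j is

    $$\gamma_{ji} = \frac{\alpha_i\,p(\mathbf{x}_j \mid \boldsymbol\mu_i,\boldsymbol\Sigma_i)}{\sum_{l=1}^{k}\alpha_l\,p(\mathbf{x}_j \mid \boldsymbol\mu_l,\boldsymbol\Sigma_l)}$$

    For d = 2 the normalizer is 2π·sqrt(det Σ), which is what getPx computes below; getPm computes the responsibilities.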

    import breeze.linalg.{DenseMatrix, DenseVector, det, inv}
    import org.apache.spark.{SparkConf, SparkContext}
    
    import scala.collection.mutable.ArrayBuffer
    
    object GaussCluster {
      var a1 = 1.0/3.0
      var a2 = 1.0/3.0
      var a3 = 1.0/3.0
    
      var cov1 = DenseMatrix((0.1, 0.0), (0.0, 0.1))
      var cov2 = DenseMatrix((0.1, 0.0), (0.0, 0.1))
      var cov3 = DenseMatrix((0.1, 0.0), (0.0, 0.1))
    
      var u1 = DenseVector(0.403, 0.237)
      var u2 = DenseVector(0.714, 0.346)
      var u3 = DenseVector(0.532, 0.472)
    
      var array = Array(0.0, 0.0, 0.0)
      var result = new collection.mutable.ArrayBuffer[Array[Double]]()
  def getPx(cov:DenseMatrix[Double],x:DenseVector[Double],u:DenseVector[Double]):Double = {
    // 2-D Gaussian density: exp(-0.5 * (x-u)^T cov^-1 (x-u)) / (2 * pi * sqrt(det(cov)))
    val exp = (mMv(inv(cov),(x-u))).t * (x-u)
    val px = Math.exp(-0.5*exp) / (2*Math.PI*Math.sqrt(det(cov)))
    px
  }
    
      def mMv(m:DenseMatrix[Double],v:DenseVector[Double]):DenseVector[Double] = {
        val cols = v.length
        val rows = m.rows
        val array = new collection.mutable.ArrayBuffer[Double]()
        val tv = v.t
        for(i <- 0 to cols - 1){
          var sum = 0.0
          for(j <- 0 to rows - 1){
            sum += tv(j)*m(j,i)
          }
          array.append(sum)
        }
    //    println(array.toArray)
        DenseVector(array.toArray)
      }
    
      def getPm(x:DenseVector[Double]):Array[Double] = {
        val p1 = a1 * getPx(cov1,x,u1)
        val p2 = a2 * getPx(cov2,x,u2)
        val p3 = a3 * getPx(cov3,x,u3)
    
        val pm1 = p1 / (p1 + p2 + p3)
        val pm2 = p2 / (p1 + p2 + p3)
        val pm3 = p3 / (p1 + p2 + p3)
    
        array = Array(pm1,pm2,pm3)
    //    var max = 0.0
    //    for(i <- 0 to array.length-1){
    //      if(array(i)>max){
    //        max = array(i)
    //      }
    //    }
    
    //    println(pm1 + ":" + pm2 + ":" + pm3)
    
        array
      }
    
      def updateCoef(x:Array[DenseVector[Double]]) = {
        val pmArray = new collection.mutable.ArrayBuffer[Array[Double]]()
        val rarray = new collection.mutable.ArrayBuffer[Array[Double]]()
        val m = x.length.toDouble
        for(i <- 0 to x.length-1){
          pmArray.append(getPm(x(i)))
          rarray.append(getPm(x(i)))
        }
    
        result = rarray
    
        var pmSum1 = 0.0
        var pmX1 = DenseVector(0.0,0.0)
        for(j <- 0 to pmArray.length-1){
          pmSum1 += pmArray(j)(0)
    //      println((x(j) * pmArray(j)(0)))
          pmX1 = pmX1 :+ (x(j) * pmArray(j)(0))
        }
        u1 = pmX1 :/ pmSum1
    
        var pmSum2 = 0.0
        var pmX2 = DenseVector(0.0,0.0)
        for(j <- 0 to pmArray.length-1){
      pmSum2 += pmArray(j)(1)
          pmX2 =  pmX2 + (x(j) * pmArray(j)(1))
        }
        u2 = pmX2 :/ pmSum2
    
        var pmSum3 = 0.0
        var pmX3 = DenseVector(0.0,0.0)
        for(j <- 0 to pmArray.length-1){
      pmSum3 += pmArray(j)(2)
          pmX3 = pmX3 + (x(j) * pmArray(j)(2))
        }
        u3 = pmX3 :/ pmSum3
    
        a1 = pmSum1 / m
        a2 = pmSum2 / m
        a3 = pmSum3 / m
    
        var Ncov1 = DenseMatrix((0.0,0.0),(0.0,0.0))
        var Ncov2 = DenseMatrix((0.0,0.0),(0.0,0.0))
        var Ncov3 = DenseMatrix((0.0,0.0),(0.0,0.0))
    
    
        for(k <- 0 to x.length-1 ){
          Ncov1 = Ncov1 :+ pmArray(k)(0)*((x(k)-u1) * (x(k)-u1).t)
      Ncov2 = Ncov2 :+ pmArray(k)(1)*((x(k)-u2) * (x(k)-u2).t)
      Ncov3 = Ncov3 :+ pmArray(k)(2)*((x(k)-u3) * (x(k)-u3).t)
    
        }
    
        cov1 = Ncov1 :/ pmSum1
        cov2 = Ncov2 :/ pmSum2
        cov3 = Ncov3 :/ pmSum3
    
    
      }
    
      def getNewCov(pmSum:Double,pmArray:ArrayBuffer[Double],u:DenseVector[Double],x:Array[DenseVector[Double]]) = {
      }
    
      def main(args: Array[String]): Unit = {
    //    val m = DenseMatrix((1.0,3.0),(2.0,4.0))
    //    val v = DenseVector(1.0,2.0)
    //    val temp = mMv(m,v)
    //    val result = temp.t * v
    //    println(result)
        val conf = new SparkConf().setMaster("local[4]").setAppName(s"${this.getClass.getSimpleName}")
        val sc = new SparkContext(conf)
        sc.setLogLevel("ERROR")
    
        val oriData = sc.textFile("C:\\Users\\dell\\Desktop\\data\\gaussCluster.txt")
        val data = oriData.map(_.split(" ")).map(s => s.map(_.toDouble)).map(s => {
          DenseVector(s(1),s(2))
        })
        var c = 0
        val x = data.collect()
    
        while(c<5){
          for (i <-0 to x.length-1){
            getPm(x(i))
          }
          updateCoef(x)
          c += 1
        }
    
        for (i <- 0 to result.length-1){
          val temp = result(i)
          var max = 0
          for (j <- 1 to temp.length-1){
            if(temp(j)>temp(max)){
              max = j
            }
          }
          println(max)
        }
      }
    
    }

  • Clustering with Spark in Scala

    2018-04-17 22:00:00

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.evaluation.ClusteringEvaluator
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._
    import org.apache.spark._
    import org.apache.spark.ml.feature.VectorAssembler

    // Loads data.
    val dataset = sc.parallelize(List(List(1.0,8.0),List(8.0,2.0),List(2.0,10.0),
    List(5.0,15.0),List(9.0,1.0),List(9.0,7.0),List(1.0,3.0)))
    //val rdd= sc.textFile("input/textdata.txt")


    case class data1(length:Double,wide:Double)
    val df = dataset.map(x=>data1(x(0),x(1))).toDF

    val assembler = (new VectorAssembler().
    setInputCols(Array("length", "wide")).
    setOutputCol("features"))

    val df2 = assembler.transform(df)

    // Trains a k-means model.
    val kmeans = new KMeans().setK(3).setSeed(1L)
    val model = kmeans.fit(df2)

    // Make predictions
    val predictions = model.transform(df2)

    val ret1=predictions.groupBy("prediction").agg(Map("length"->"avg","wide"->"avg"))
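
    ClusteringEvaluator is imported above but never used; assuming Spark 2.3 or later, where it is available, it could score the model by silhouette like this:

    val evaluator = new ClusteringEvaluator()
    val silhouette = evaluator.evaluate(predictions)   // closer to 1.0 means better-separated clusters
    println(s"Silhouette with squared euclidean distance = $silhouette")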

     


    // Save the DataFrame to a file

    scala> data1.select("gender", "age", "education").write.format("csv").save("hdfs://ns1/datafile/wangxiao/data123.csv")

    Reposted from: https://www.cnblogs.com/zhangbojiangfeng/p/8870301.html

  • k-means implemented in Scala, dataset included: 0 1 22 9 181 5450 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 8 8 0.00 0.00 0.00 0.00 1.00 0.00 0.00 9 9 1.00 0.00 0.11 0.00 0.00 0.00 0.00 0.00 0 1 22 9 239 486 0 0 0 0 0 1 0 ...
  • Clustering algorithms implemented in Scala with the Spark framework. The algorithms handle only two-dimensional (x and y) data. DBSCAN program arguments: <input_file> <min>. K-means program arguments: <input_file> <number>. Datasets: sample dataset files are included - data....
  • Hierarchical clustering: theory and a Scala implementation

    How hierarchical clustering works

    Hierarchical clustering builds a nested tree of clusters by computing the similarity between data points of different groups. In the clustering tree, the original data points form the lowest level, and the top of the tree is a single root cluster. The tree can be built either bottom-up by merging or top-down by splitting.
    Hierarchical clustering algorithms generally fall into two families:
    Divisive, or top-down, hierarchical clustering: at the start all objects belong to a single cluster, which is repeatedly split by some criterion into several clusters, until every object is a cluster of its own.
    Agglomerative, or bottom-up, hierarchical clustering: every object starts as its own cluster, and by some criterion the two closest clusters are repeatedly merged into a new cluster, until all objects belong to one cluster.
    (Figure from https://www.biaodianfu.com/hierarchical-clustering.html)
    The principle is simple, and the implementation is not hard either.
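
    The implementation below uses between-group average linkage (the baverage_distance method): the distance between clusters A and B is the mean pairwise squared Euclidean distance

    $$d(A, B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} \lVert a - b \rVert^{2}$$

    and at each step the pair with the smallest d(A, B) is merged.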

    
    ```scala
    import java.util.Date
    import scala.collection.mutable.ListBuffer
    import scala.util.control.Breaks._

    class Hierarchical(var T: Double //number of clusters to stop at
                       , var data:List[Array[Double]] //the dataset
    
                      ) {
       var finalResult = initFinalResult
     // data=standardization01(data)
      var initialList=data//the initial list of samples

      var firstCluster=new ListBuffer[Int]()//cluster labels, in order of first appearance
      var nextCluster=new ListBuffer[Array[Int]]()//positions of first appearance (cluster 1 and cluster 2)
      var table=new ListBuffer[Array[Double]]()//the agglomeration table
      var clusterHistory=ListBuffer[ListBuffer[ListBuffer[Array[Double]]]]()//history of each iteration
      var stage=0;
      //initialization: each of the N initial samples forms its own cluster
      private def initFinalResult:ListBuffer[ListBuffer[Array[Double]]] = {
        val startResult =new ListBuffer[ListBuffer[Array[Double]]]
        //first treat every sample as a singleton cluster
        for (aData <- data) {
          val list=new ListBuffer[Array[Double]]
          list.append(aData)
          startResult.append(list)
        }
        startResult
      }
    
      def hierarchical: ListBuffer[ListBuffer[Array[Double]]] = {
        if (finalResult.size == 1) return finalResult
        //compute the distances between all pairs of clusters, kept in a 2-D array
        var distanceArray =Array.ofDim[Double](finalResult.size,finalResult.size)
        //minimum distance, initialized to the distance between the first two clusters
       // var min_dis = min_distance(finalResult(0), finalResult(1))
        //between-group average linkage
        var min_dis=baverage_distance(finalResult(0),finalResult(1))
        //indices of the clusters about to be merged
        var index1 = 0
        var index2 = 1
        for (i <- 0 until finalResult.size) {
          for (j <- (i + 1) until finalResult.size) {
            distanceArray(i)(j) = baverage_distance(finalResult(i), finalResult(j))
            if (distanceArray(i)(j) < min_dis) {
              min_dis = distanceArray(i)(j)
              index1 = i
              index2 = j
            }
          }
        }
        distanceArray=null//note: do not remove this line, or enough iterations will throw an OOM error
        //check whether the requested number of clusters has been reached
        if (finalResult.size == T) return finalResult
        else { //merge the pair of clusters at the minimum distance
          merge(finalResult(index1), finalResult(index2))
          //println("  " + min_dis.formatted("%.3f"))
          table(stage-1)(2)=min_dis
          finalResult.remove(index2)
          clusterHistory.append(finalResult)
          hierarchical
        }
        finalResult
      }
    
      //merge the two clusters at the minimum distance
      private def merge(list1: ListBuffer[Array[Double]], list2: ListBuffer[Array[Double]]): Unit = {
        list1++=list2
        stage=stage+1;
        //print(stage)
        var cluster1 = 0
        var cluster2=0
        var cluster11=0
        var cluster12=0
         breakable{
        for(i:Int <- 0 until initialList.size)
          {if(initialList(i).deep==list1(0).deep)
           { cluster1=i+1;
             break
           }}}
        for(index:Int <- 0 until initialList.size)
        {
          if(initialList(index).deep==list2(0).deep)
        {
          cluster2=index+1;
        }}
        breakable{
          for(i:Int <- (0 until firstCluster.size).reverse)
          {
            if(cluster1==firstCluster(i))
              {
                cluster11=i+1
                break
              }
          }}
        breakable{
          for(i:Int <- (0 until firstCluster.size).reverse)
          {
            if(cluster2==firstCluster(i))
            {
              cluster12=i+1
              break
            }
          }}
        firstCluster.append(cluster1)
        var nums=new Array[Int](2)
        nums(0)=cluster11
        nums(1)=cluster12
        nextCluster.append(nums)
       // print(" "+cluster1+"->"+cluster2+" "+cluster11+" "+cluster12)
        var tempArr=new Array[Double](5)
        tempArr(0)=cluster1.toDouble
        tempArr(1)=cluster2.toDouble
        tempArr(2)=(-1.0)
        tempArr(3)=cluster11.toDouble
        tempArr(4)=cluster12.toDouble
    table.append(tempArr)
      }
    
      //minimum distance between two clusters (single linkage)
      private def min_distance(list1: ListBuffer[Array[Double]], list2: ListBuffer[Array[Double]]):Double = {
        var min_dis = euclideanDistance(list1(0), list2(0))
    
        for (i <- 0 until list1.size) {
          for (j <- 0 until list2.size) {
            val dis_temp = euclideanDistance(list1(i), list2(j))
            if (dis_temp < min_dis) {min_dis = dis_temp
    
            }
          }
        }
        min_dis
      }
      //average distance between two clusters (between-group linkage)
      private def baverage_distance(list1: ListBuffer[Array[Double]], list2: ListBuffer[Array[Double]]):Double = {
        var dis = 0.0
    
        for (i <- 0 until list1.size) {
          for (j <- 0 until list2.size) {
            val dis_temp = squareEuclideanDistance(list1(i), list2(j))
           dis=dis+dis_temp
          }
        }
        dis/(list1.size*list2.size).toDouble
      }
      //Euclidean distance
      private def euclideanDistance(array1: Array[Double], array2: Array[Double]):Double = {
        /*math.sqrt(array1.zip(array2).
          map(p => p._1 - p._2).map(d => d*d).sum)*/
        var distance = 0.0
    
        for (i <- 0 until array1.length) {
          distance += Math.pow(array1(i) - array2(i), 2)
        }
        distance = Math.sqrt(distance)
        distance
      }
      //squared Euclidean distance
      private def squareEuclideanDistance(array1: Array[Double], array2: Array[Double]):Double = {
        var distance = 0.0
        for (i <- 0 until array1.length) {
          distance += Math.pow(array1(i) - array2(i), 2)
        }
        //distance = Math.sqrt(distance)
        distance
      }
    
    }
    object Hierarchical{
      def main(args: Array[String]): Unit = {
        var start=new Date().getTime
        var T=1;
        var data:ListBuffer[Array[Double]]=ListBuffer()
       val x1 = Array(2270.72, 377.81, 1162.96, 202.36, 930.33, 883.33, 709.22, 127.29)
        val x2 = Array(1368.93, 292.32, 699.21, 133.61, 202.87, 322.27, 301.06, 82.73)
        val x3 = Array(1192.93, 203.72, 696.12, 131.92, 326.73, 230.07, 219.32, 62.28)
        val x4 = Array(1206.69, 276.23, 286.73, 138.26, 328.72, 380.70, 210.32, 69.83)
        val x5 = Array(1283.61, 239.96, 369.60, 128.80, 206.72, 399.33, 320.62, 69.23)
        val x6 = Array(1329.00,298.82,601.71,138.91,226.27,387.97,283.37,107.78)
        val x7 = Array(1362.22,232.03,330.69,122.80,333.38,321.70,380.71,93.27)
        val x8 = Array(1267.68,308.29,871.31,130.00,393.02,237.37,331.03,83.21)
        val x9 = Array(3731.27,267.33,1806.08,303.96,879.37,833.30,697.11,179.06)
       val x10 = Array(2202.38,276.39,860.33,230.11,612.23,713.23,290.93,120.36)
        val x11 = Array(2779.10,232.79,1639.88,362.03,831.06,727.00,332.06,126.12)
        val x12 = Array(1232.18,180.02,630.31,163.33,280.63,292.82,199.22,38.92)
        val x13= Array(2162.30,263.39,777.31,222.86,332.68,390.13,197.83,113.01)
        val x14 = Array(1633.12,137.73,339.39,133.00,301.68,236.01,203.68,60.38)
        val x15 = Array(1331.77,230.29,802.73,220.91,232.33,217.27,280.29,79.00)
        val x16 = Array(1163.81,209.73,712.61,169.61,290.79,212.38,213.00,66.27)
        val x17= Array(1711.32,187.07,631.30,232.92,290.22,267.13,210.36,99.80)
        val x18= Array(1927.32,169.06,629.73,171.11,286.01,278.67,222.17,78.67)
        val x19 = Array(2388.91,177.67,962.33,189.01,283.66,272.87,239.00,136.82)
        val x20 = Array(1392.67,91.19,333.23,122.01,261.83,172.73,132.32,30.81)
        val x21 = Array(1337.33,89.89,391.02,102.07,261.37,288.29,123.82,86.67)
        val x22 = Array(1337.39,160.32,328.97,167.72,238.23,211.83,197.13,22.87)
        val x23 = Array(1627.38,172.39,269.73,163.99,236.08,173.26,209.22,33.29)
        val x24= Array(1119.62,112.26,227.20,92.36,139.61,122.10,96.38,33.73)
        val x25 = Array(1283.16,119.63,626.12,118.97,228.23,168.33,181.97,23.97)
        val x26 = Array(1133.37,228.68,322.07,120.06,127.21,62.26,33.82,70.09)
        val x27 = Array(1113.66,173.30,398.39,133.07,270.63,331.99,231.23,60.70)
        val x28= Array(1126.69,218.61,292.77,97.38,276.31,168.99,222.39,26.22)
        val x29= Array(1132.33,132.66,387.83,93.38,232.69,219.91,162.72,31.03)
        val x30 = Array(1220.02,200.26,368.79,110.33,316.73,128.86,270.06,61.32)
        val x31 = Array(1288.27,217.17,382.27,123.91,299.29,192.37,318.77,72.20)
        data.append(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31)
        val hier=new Hierarchical(T,data.toList)
        var list=hier.hierarchical
        var lists=hier.clusterHistory
        var end=new Date().getTime
        println("elapsed: "+(end-start))
    
      /*var printCluster=new util_printCluster
        printCluster.printCluster(list)
        for(i<-hier.getApproximationMatrix)
          {for(j<-i)
            print(j.formatted("%.3f")+"\t\t")
          println()
          }*/
    
    }}
    
    
    ```

    ```scala
    import scala.collection.mutable.ListBuffer

    case class util_printCluster(){
      def printCluster(finalresult: ListBuffer[ListBuffer[Array[Double]]]): Unit = {
        for (aFinalresult <- finalresult) {
          println("size: " + aFinalresult.size)        // how many points this cluster holds
          for (point <- aFinalresult) {
            print("(" + point.mkString(",") + ")")     // print each point as a tuple
          }
          println("\n")
        }
      }
    }
    ```
  • Metrics for evaluating clustering divide broadly into internal and external indices. Internal indices, used during development and typically for choosing the number of clusters, include the silhouette coefficient, the Calinski-Harabasz index, and so on. External indices come in two kinds: evaluation against labeled results, including...
  • A K-means I had previously written in Java; I ported this example to try my hand. My takeaway is that I am still not very familiar with Scala and need to keep strengthening it; Scala's List, Array, Map and other collections need deeper study. Scala's foreach is quite convenient to use; as for defining data types...
  • A K-means clustering implementation on Flink (Scala version). Cluster analysis is an entry-level machine-learning algorithm in the unsupervised-learning family. With traditional IT techniques, the algorithm can only run on a single compute node, and the time cost is so constraining that sampling the source data is the only option...
  • Study notes on the DBSCAN algorithm and a Scala implementation

    2017-11-28 20:41:28
    I. Algorithm overview. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based spatial clustering algorithm; compared with other clustering methods, density-based clustering can find clusters in noisy data...
  • K-means clustering via the MLlib library. Tools used - IDE: IntelliJ IDEA - Scala interpreter: scala-2.12.3 - Java JDK: jdk1.8.0_66 - Spark lib: spark-assembly-1.4.1-hadoop2.6.0. Steps: fetch the dataset wget ...
  • The number of clusters can be set in the program and passed to the KMeans algorithm; then the Within Set Sum of Squared Error is computed (a criterion for how good the clustering is: the smaller the value, the smaller the distances between instances in the same cluster. -- translator's note) import org.apache...
  • Lab image: ... ... ... A brief introduction to the Spark machine-learning library ... The Spark ML library provides implementations of common machine-learning algorithms, including clustering, classification, regression, collaborative filtering, and dimensionality reduction. Doing machine-learning work with it is very simple, ...
  • Based on a K-means clustering example in spark-master on GitHub, with some improvements. Step one: generate the data. For simplicity, Python was used to generate four groups of normally distributed random numbers with means (2,2), (-2,2), (2,-2), (-2,-2) and variance 2. def ...
  • import org.apache.spark.sql.functions._ import scala.collection.mutable /** * Cluster articles by title and body for topic analysis: * 1. HanLP word segmentation * 2. Word2Vec model training * 3. K-means model training * 4. K-means model prediction */ object ...
  • 聚类(spectral clustering)及其实现详解

    2016-11-01 16:19:52
    Preface: I have many topics started, drafts written, while I kept wondering how to put them on CSDN, partly because of company technical confidentiality, partly because of the audience... Spectral clustering runs from constructing the normalized Laplacian matrix to clustering the eigenvector matrix; its theory, though concise and clear, embodies a powerful logical structure.
  • This post analyzes the Machine-Learning-Databases datasets provided by UCI, clusters them with a K-means model, and finally prints the cluster centers in a formatted way. It mainly covers: using VectorAssembler to merge multiple columns into a single features column...
  • val k = args(2).toInt // number of clusters var s = 0d // clustering quality metric val shold = 0.1 // convergence threshold var s1 = Double.MaxValue var times = 0 var readyForIteration = true val func1 = (x: (newVector, Int, ...
  • TF-IDF + K-Means Chinese-text clustering example - scala

    2018-10-08 18:55:17
    <scala.tools.version>2.10</scala.tools.version> <scala.version>2.10.6</scala.version> <hbase.version>1.2.2 <!-- <groupId>org.apache.spark <artifactId>spark-mllib_2.11 <version>2.1.0 ...
  • import org.apache.spark.ml.clustering.KMeans import org.apache.spark.ml.evaluation.ClusteringEvaluator // Loads data. val dataset = spark.read.format("libsvm").load("...
  • K-means example implementation. Software environment: scala 2.10.4 + spark 1.6.3. import org.apache.log4j.{Level, Logger} import org.apache.spark.mllib.clustering.{KMeans, KMeansModel} import org.apache.spark.mllib.linalg.Vectors ...
