2019-05-11 16:42:02 qq_30784919

 

NumPy

NumPy is an extension library for Python that supports large, multi-dimensional arrays and matrices, together with a large collection of mathematical functions for operating on them. Internally it releases Python's GIL (Global Interpreter Lock) and is implemented as C/C++ extensions, so computation is very efficient; it is the foundational library for many machine-learning frameworks.
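As a minimal illustration of the array and matrix operations described above (the array contents are made up for the example):

import numpy as np

a = np.arange(6).reshape(2, 3)   # 2x3 array holding 0..5
b = np.ones((3, 2))              # 3x2 array of ones
print(a @ b)                     # matrix product, computed in compiled code
print(a.mean(axis=0))            # column-wise means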

Cheat sheet:

 

Pandas

Pandas is a tool built on top of NumPy, created mainly for data-analysis tasks. It includes a number of standard data structures and provides the tools needed to work with large datasets efficiently.
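A small sketch of the kind of workflow this enables (the toy DataFrame is invented for the example):

import pandas as pd

df = pd.DataFrame({"city": ["A", "B", "A", "B"], "sales": [10, 20, 30, 40]})
print(df.describe())                        # summary statistics
print(df.groupby("city")["sales"].sum())    # aggregate sales per city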

Cheat sheet 1:

Cheat sheet 2:

SciPy

SciPy is a set of higher-level modules built on NumPy. It implements many mathematical algorithms and functions for standard problems in scientific computing, such as numerical integration and differential-equation solving, extended matrix computations, optimization, probability distributions and statistical functions, and even signal processing.
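For example, numerical integration and optimization, two of the standard problems mentioned above, look like this (a minimal sketch):

import numpy as np
from scipy import integrate, optimize

val, err = integrate.quad(np.sin, 0, np.pi)           # integrate sin(x) over [0, pi] -> 2.0
res = optimize.minimize_scalar(lambda x: (x - 2)**2)  # minimum of a simple quadratic, at x = 2
print(val, res.x)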

Cheat sheet:

 

Matplotlib

Matplotlib is a plotting library for Python. It contains a large set of tools for creating all kinds of figures, from simple scatter plots and sine curves to three-dimensional plots. The Python scientific-computing community uses it extensively for data visualization.
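A minimal sketch covering the two plot types mentioned above, a sine curve and a scatter plot:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)
plt.plot(x, np.sin(x), label="sin(x)")   # sine curve
plt.scatter(x[::20], np.sin(x[::20]))    # a few scatter points on top
plt.legend()
plt.show()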

Cheat sheet:

 

sklearn

sklearn (scikit-learn) is a very powerful third-party machine-learning library for Python that covers everything from data preprocessing to model training. Using scikit-learn in practice saves a great deal of coding time and code volume, leaving more energy for analyzing data distributions, tuning models, and adjusting hyperparameters.
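A small sketch of the preprocessing-to-training pipeline described above, using the built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))  # preprocessing + model
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on the held-out split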

Cheat sheet:

 

PySpark

Spark is a parallel big-data computing framework based on in-memory computation. Because it computes in memory, it improves the real-time performance of data processing in big-data environments while remaining highly fault-tolerant and scalable, allowing users to deploy Spark on clusters of inexpensive hardware. PySpark is the Python API for Spark.
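A minimal PySpark sketch (local mode, toy data) showing the API style:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").master("local[2]").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "key"])
df.groupBy("key").count().show()   # aggregation executed by the Spark engine
spark.stop()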

Cheat sheet:

 

Keras

Keras is a high-level neural-network API written in Python that can run on top of TensorFlow, CNTK, or Theano. Keras was developed with a focus on enabling fast experimentation: being able to go from idea to result with minimal delay is key to doing good research.
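A minimal Keras sketch that defines and compiles a small classifier, to illustrate how compact the API is:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=784))   # hidden layer
model.add(Dense(10, activation='softmax'))               # 10-class output
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.summary()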

Cheat sheet:

 

dplyr

The dplyr package is the work of Hadley Wickham (author of the ggplot2 package, sometimes called "the man who changed R") and describes itself as a grammar of data manipulation. It splits out and strengthens functions such as ddply() from the earlier plyr package, focuses on data-frame objects, is considerably faster, and offers a more robust interface to other database backends.

tidyr

The tidyr package, also by Hadley Wickham, is used to "tidy" your data and is commonly used together with dplyr.

Cheat sheet 1:

Cheat sheet 2:

Neural Network

Artificial neural networks (ANNs) have been a major research focus in artificial intelligence since the 1980s. They abstract the human brain's network of neurons from an information-processing perspective, build simple models of it, and form different networks by connecting the units in different ways.

Cheat sheet:


2019-11-07 14:06:47 frank110503

Contents

1. Background

2. PySpark environment

3. Problem description

4. Root cause analysis


1. Background

The machine-learning platform uses the underlying big-data platform for compute resources. On top of it, Spark MLlib is used to implement 90+ components across 7 categories (source/sink, statistical analysis, data preprocessing, feature engineering, machine learning, utilities, text analysis), each component representing a built-in piece of logic. The goal is to reduce repeated development work and lower the barrier to entry for machine learning.

 

2. PySpark environment

Among the utility components there is a "PySpark component". Underneath, each component instance runs in a virtual environment created with Anaconda on a virtual machine.

 

3. Problem description

When using AutoEncoder, a neural-network-based dimensionality-reduction method in Keras (TensorFlow), to detect outliers, the following error occurred:

 

4. Root cause analysis

Keras was installed in the virtual machine with the conda command "conda install -c conda-forge keras", as shown below:

 

The correspondence between Keras versions and TensorFlow versions (https://docs.floydhub.com/guides/environments/) is shown below:

 

Therefore a specific Keras version can be installed with the command "conda install -c conda-forge keras=2.2.0".
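Not from the original post, but a quick way to confirm inside the conda environment which versions actually ended up installed:

import keras
import tensorflow as tf
print(keras.__version__)   # should print 2.2.0 after pinning the package
print(tf.__version__)      # should be a TensorFlow release compatible with Keras 2.2.0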

 

 

 

2018-08-16 10:49:19 whale52hertz

The Keras library provides a relatively simple interface for deep learning and has made neural networks accessible to a broad audience. One of the challenges we face, however, is turning exploratory Keras models into production models. Keras is written in Python, and until recently support outside that language was limited. While tools such as Flask, PySpark, and Cloud ML can productionize models directly in Python, I usually prefer to deploy models with Java.

Projects like ONNX are moving deep learning toward standardization, but the runtimes that support these formats are still limited. A common approach is to convert a Keras model into a TensorFlow graph and then use that graph in other runtimes that support TensorFlow. I recently discovered the Deeplearning4j (DL4J) project, which supports Keras models natively, making it easy to get deep learning up and running in Java.

One deep-learning use case I have been exploring is training a Keras model in Python and then serving it from Java. This is useful when deep learning needs to run directly on the client, for example on an Android device running the model, or when you want to take advantage of an existing production system written in Java.

This article gives an overview of training a Keras model in Python and deploying it with Java. I use Jetty to serve real-time predictions and Google's Dataflow to build a batch prediction system. The full code and data needed to run these examples are available on GitHub.
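The Python side of that workflow is ordinary Keras training plus an HDF5 export that a Java runtime such as DL4J can later import. The sketch below uses made-up data and file names and is not the article's actual model:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(100, 10)            # toy features
y = (X.sum(axis=1) > 5).astype(int)    # toy binary labels

model = Sequential()
model.add(Dense(16, activation='relu', input_dim=10))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(X, y, epochs=5, verbose=0)
model.save('model.h5')                 # architecture + weights in one HDF5 file for later import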

 

For the full code and details, see the original article: Deploying a trained Keras deep-learning model with Java.

2019-07-18 20:58:23 bluehatihati

CTR prediction with DeepFM

In an ad click-through-rate (CTR) setting, a DeepFM model is used to predict favorites and purchases.
A ready-made implementation from GitHub:
https://github.com/xxxmin/ctr_Keras/blob/master/deepfm_weight.py
It is implemented with Keras; the points I found hard to understand are recorded below.

pd.DataFrame.values

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html
DataFrame.values
Return a Numpy representation of the DataFrame.
Only the values in the DataFrame will be returned, the axes labels will be removed.

Returns:
numpy.ndarray
The values of the DataFrame.
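A tiny example of what that means in practice (invented values):

import pandas as pd
df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
arr = df.values        # numpy.ndarray; the index and column labels are gone
print(type(arr), arr)  # dtype is float64, the common type of both columns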

The Keras Embedding layer

https://juejin.im/entry/5acc23f26fb9a028d1416bb3 (a good explanation)
https://keras.io/zh/layers/embeddings/ (official documentation)

It requires the input data to be integer-encoded, so that each word is represented by a unique integer. This data-preparation step can be performed with the Tokenizer API provided by Keras.

The Embedding layer is defined as the first hidden layer of the network. Three arguments must be specified:

input_dim: the number of possible values in the vocabulary of the text data. For example, if the data is integer-encoded with values between 0 and 9, the vocabulary size is 10 words.
output_dim: the size of the vector space into which the words are embedded. It defines the size of this layer's output vector for each word, e.g. 32 or 100 or even larger; it can be treated as a hyperparameter of the problem.
input_length: the length of the input sequences, just as you would define for any input layer of a Keras model, i.e. how many words are fed in at a time. For example, if every input document consists of 1000 words, input_length is 1000.

To make this concrete: below we define an Embedding layer with a vocabulary of 200 (words integer-encoded from 0 to 199 inclusive, so at most 200 distinct words can occur), which is input_dim; a 32-dimensional embedding space (each word is mapped to a 32-dimensional vector), which is output_dim; and sentences of 50 words each, which is input_length. A fuller runnable sketch follows the one-liner below.

e = Embedding(input_dim=200, output_dim=32, input_length=50)
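Putting the Tokenizer step and the Embedding layer together, a runnable sketch (the three toy documents are made up; the 200/32/50 values mirror the example above):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding

docs = ["good movie", "bad movie", "great plot"]
tok = Tokenizer(num_words=200)                                  # vocabulary capped at 200 -> input_dim
tok.fit_on_texts(docs)
seqs = pad_sequences(tok.texts_to_sequences(docs), maxlen=50)   # every document padded to 50 -> input_length

model = Sequential()
model.add(Embedding(input_dim=200, output_dim=32, input_length=50))
model.compile('rmsprop', 'mse')
print(model.predict(seqs).shape)   # (3, 50, 32): one 32-dimensional vector per word position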

Value plots and distribution plots

During feature analysis I need to look at value plots (y-axis: value, x-axis: sample index) and distribution plots (y-axis: proportion, x-axis: value range) for some of the features.
https://blog.csdn.net/jinruoyanxu/article/details/53390943

Plotting the values directly:

data_lxy['UserInfo_259'].plot()
plt.title("UserInfo_259")
plt.show()

Just call .plot() on the DataFrame column directly.

Plotting the distribution

max_data = data_lxy['UserInfo_259'].max()
bins = np.linspace(0, max_data, max_data+1)

plt.hist(data_lxy['UserInfo_259'], bins, density=True, color="#FF0000", alpha=.9, histtype="stepfilled")  # density=True replaces the deprecated normed=True

plt.show()

I used a histogram here; bins can be understood as the bin edges along the x-axis (one bin per integer value in this case), and histtype="stepfilled" keeps the horizontal steps from being drawn as long bars.

Figure settings

Figure size
plt.figure(figsize=(12,10))
Subplots (similar to MATLAB)
plt.subplot(2,1,1): 2x1 grid, first panel
plt.subplot(2,1,2): 2x1 grid, second panel
Clearing the figure
plt.clf()

Combining the plots

plt.figure(figsize=(12,10))
plt.subplot(2,1,1)
data_lxy['UserInfo_259'].plot()
plt.title("UserInfo_259")
#plt.show()
#plt.clf()
# distribution plot
plt.subplot(2,1,2)
max_data = data_lxy['UserInfo_259'].max()
bins = np.linspace(0, max_data, max_data+1)
plt.hist(data_lxy['UserInfo_259'], bins, density=True, color="#FF0000", alpha=.9, histtype="stepfilled")
plt.show()


Filling missing values in pandas

While doing feature analysis I kept hitting this error:
Python Error1: ValueError: range parameter must be finite.

The problem was NaN missing values in the feature table.

pandas.DataFrame.fillna
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html#pandas-dataframe-fillna

There is a pitfall here: simply calling a.fillna(0) is not enough.
By default this function does not modify the original DataFrame; you have to pass inplace=True for it to overwrite the original DataFrame.

data_lxy.fillna(method='ffill',inplace=True)

method='ffill' means forward fill: each missing value is filled with the previous (earlier) non-missing value.
http://www.voidcn.com/article/p-rohjupzu-bto.html

Discretizing continuous features and one-hot encoding

https://blog.csdn.net/tongjinrui/article/details/79679727
pd.cut(data, 4) splits the data into 4 equal-width bins; that line produces categorical values whose categories are the 4 intervals.
An explanation of pd.get_dummies: https://www.jianshu.com/p/c324f4101785
Official documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html. My understanding is that it is essentially one-hot encoding.
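A small sketch of those two calls on made-up data:

import pandas as pd

data = pd.Series([1, 7, 5, 4, 6, 3])
bins = pd.cut(data, 4)            # 4 equal-width intervals; each value is mapped to its interval
onehot = pd.get_dummies(bins)     # one 0/1 indicator column per interval
print(onehot)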

Other approaches:
https://blog.csdn.net/u014135752/article/details/80789251

2018-05-09 17:23:09 u010159842

Elephas: Distributed Deep Learning with Keras & Spark Build Status

Elephas is an extension of Keras, which allows you to run distributed deep learning models at scale with Spark. Elephas currently supports a number of applications, including:

  • Data-parallel training of deep learning models
  • Distributed hyper-parameter optimization
  • Distributed training of ensemble models

Schematically, elephas works as follows.

[Figure: Elephas schematic]

Table of contents:

Introduction

Elephas brings deep learning with Keras to Spark. Elephas intends to keep the simplicity and high usability of Keras, thereby allowing for fast prototyping of distributed models, which can be run on massive data sets. For an introductory example, see the following iPython notebook.

ἐλέφας is Greek for ivory and an accompanying project to κέρας, meaning horn. If this seems weird mentioning, like a bad dream, you should confirm it actually is at the Keras documentation. Elephas also means elephant, as in stuffed yellow elephant.

Elephas implements a class of data-parallel algorithms on top of Keras, using Spark's RDDs and data frames. Keras models are initialized on the driver, then serialized and shipped to the workers, along with data and broadcast model parameters. Spark workers deserialize the model, train on their chunk of data, and send their gradients back to the driver. The "master" model on the driver is updated by an optimizer, which takes in gradients either synchronously or asynchronously.

Getting started

Installation

Install elephas from PyPI with

pip install elephas

Depending on what OS you are using, you may need to install some prerequisite modules (LAPACK, BLAS, fortran compiler) first.

For example, on Ubuntu Linux:

sudo apt-get install liblapack-dev libblas-dev gfortran

A quick way to install Spark locally is to use homebrew on Mac

brew install spark

or linuxbrew on linux.

brew install apache-spark

The brew version of Spark may be outdated at times. To build from source, simply follow the instructions at the Spark download section or use the following commands.

wget http://apache.mirrors.tds.net/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz -P ~
sudo tar zxvf ~/spark-* -C /usr/local
sudo mv /usr/local/spark-* /usr/local/spark

After that, make sure to add these path variables to your shell profile (e.g. ~/.zshrc):

export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin

Using Docker

Install and get Docker running by following the instructions here (https://www.docker.com/).

Building

The build takes quite a while to run the first time, since many packages need to be downloaded and installed. In the same directory as the Dockerfile, run the following command

docker build . -t pyspark/elephas

Running

The following command starts a container with the Notebook server listening for HTTP connections on port 8899 (since local Jupyter notebooks use 8888) without authentication configured.

docker run -d -p 8899:8888 pyspark/elephas

Settings

  • Memory In the Dockerfile the following lines can be adjusted to configure memory settings.
ENV SPARK_OPTS --driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info

Basic example

After installing both Elephas and Spark, training a model is done schematically as follows:

  • Create a local pyspark context
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('Elephas_App').setMaster('local[8]')
sc = SparkContext(conf=conf)
  • Define and compile a Keras model
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD
model = Sequential()
model.add(Dense(128, input_dim=784))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer=SGD())
  • Create an RDD from numpy arrays
from elephas.utils.rdd_utils import to_simple_rdd
rdd = to_simple_rdd(sc, X_train, Y_train)
  • A SparkModel is defined by passing a Spark context and a Keras model. Additionally, one has to choose an optimizer used for updating the elephas model, an update frequency, a parallelization mode and the degree of parallelism, i.e. the number of workers.
from elephas.spark_model import SparkModel
from elephas import optimizers as elephas_optimizers

adagrad = elephas_optimizers.Adagrad()
spark_model = SparkModel(sc,model, optimizer=adagrad, frequency='epoch', mode='asynchronous', num_workers=2)
spark_model.train(rdd, nb_epoch=20, batch_size=32, verbose=0, validation_split=0.1)
  • Run your script using spark-submit
spark-submit --driver-memory 1G ./your_script.py

Increasing the driver memory even further may be necessary, as the set of parameters in a network may be very large and collecting them on the driver eats up a lot of resources. See the examples folder for a few working examples.

Spark MLlib example

Following up on the last example, to create an RDD of LabeledPoints for supervised training from pairs of numpy arrays, use

from elephas.utils.rdd_utils import to_labeled_point
lp_rdd = to_labeled_point(sc, X_train, Y_train, categorical=True)

Training a given LabeledPoint-RDD is very similar to what we've seen already

from elephas.spark_model import SparkMLlibModel
adadelta = elephas_optimizers.Adadelta()
spark_model = SparkMLlibModel(sc,model, optimizer=adadelta, frequency='batch', mode='hogwild', num_workers=2)
spark_model.train(lp_rdd, nb_epoch=20, batch_size=32, verbose=0, validation_split=0.1, categorical=True, nb_classes=nb_classes)

Spark ML example

To train a model with a SparkML estimator on a data frame, use the following syntax.

df = to_data_frame(sc, X_train, Y_train, categorical=True)
test_df = to_data_frame(sc, X_test, Y_test, categorical=True)

adadelta = elephas_optimizers.Adadelta()
estimator = ElephasEstimator(sc,model,
        nb_epoch=nb_epoch, batch_size=batch_size, optimizer=adadelta, frequency='batch', mode='asynchronous', num_workers=2,
        verbose=0, validation_split=0.1, categorical=True, nb_classes=nb_classes)

fitted_model = estimator.fit(df)

Fitting an estimator results in a SparkML transformer, which we can use for predictions and other evaluations by calling the transform method on it.

prediction = fitted_model.transform(test_df)
pnl = prediction.select("label", "prediction")
pnl.show(100)

from pyspark.mllib.evaluation import MulticlassMetrics

prediction_and_label = pnl.map(lambda row: (row.label, row.prediction))
metrics = MulticlassMetrics(prediction_and_label)
print(metrics.precision())
print(metrics.recall())

Usage of data-parallel models

In the first example above we have seen that an elephas model is instantiated like this

spark_model = SparkModel(sc,model, optimizer=adagrad, frequency='epoch', mode='asynchronous', num_workers=2)

So, apart from the canonical Spark context and Keras model, Elephas models have four parameters to tune and we will describe each of them next.

Model updates (optimizers)

optimizer: The optimizers module in elephas is an adaptation of the same module in keras, i.e. it provides the user with the following list of optimizers:

  • SGD
  • RMSprop
  • Adagrad
  • Adadelta
  • Adam

Once constructed, each of these can be passed to the optimizer parameter of the model. Updates in keras are computed with the help of theano, so most of the data structures in keras optimizers stem from theano. In elephas, gradients have already been computed by the respective workers, so it makes sense to work entirely with numpy arrays internally.

Note that in order to set up an elephas model, you have to specify two optimizers, one for elephas and one for the underlying keras model. Individual workers produce updates according to keras optimizers and the "master" model on the driver uses elephas optimizers to aggregate them. For starters, we recommend keras models with SGD and elephas models with Adagrad or Adadelta.
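Concretely, reusing sc and model from the basic example above, that recommendation looks like this:

from keras.optimizers import SGD
from elephas import optimizers as elephas_optimizers
from elephas.spark_model import SparkModel

# worker-side optimizer: plain SGD inside the Keras model
model.compile(loss='categorical_crossentropy', optimizer=SGD())

# driver-side optimizer: Adagrad aggregates the updates sent back by the workers
adagrad = elephas_optimizers.Adagrad()
spark_model = SparkModel(sc, model, optimizer=adagrad,
                         frequency='epoch', mode='asynchronous', num_workers=2)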

Update frequency

frequency: The user can decide how often updates are passed to the master model by controlling the frequency parameter. To update every batch, choose 'batch' and to update only after every epoch, choose 'epoch'.

Update mode

mode: Currently, there are three different modes available in elephas, each corresponding to a different heuristic or parallelization scheme, controlled by the mode parameter. The default is 'asynchronous'.

Asynchronous updates with read and write locks (mode='asynchronous')

This mode implements the algorithm described as downpour in [1], i.e. each worker can send updates whenever they are ready. The master model makes sure that no update gets lost when multiple updates arrive at the "same" time by locking the master parameters while reading and writing them. This idea has been used in Google's DistBelief framework.

Asynchronous updates without locks (mode='hogwild')

Essentially the same procedure as above, but without requiring the locks. This heuristic assumes that we still fare well enough, even if we lose an update here or there. Updating parameters lock-free for SGD in a non-distributed setting goes by the name 'Hogwild!' [2]; its distributed extension is called 'Dogwild!' [3].

Synchronous updates (mode='synchronous')

In this mode each worker sends a new batch of parameter updates at the same time, which are then processed on the master. Accordingly, this algorithm is sometimes called bulk synchronous parallel, or just BSP.

Degree of parallelization (number of workers)

num_workers: Lastly, the degree to which we parallelize our training data is controlled by the parameter num_workers.

Distributed hyper-parameter optimization

Hyper-parameter optimization with elephas is based on hyperas, a convenience wrapper for hyperopt and keras. Make sure to have at least version 0.1.2 of hyperas installed. Each Spark worker executes a number of trials, the results get collected, and the best model is returned. As the distributed mode in hyperopt (using MongoDB) is somewhat difficult to configure and error-prone at the time of writing, we chose to implement parallelization ourselves. Right now, the only available optimization algorithm is random search.

The first part of this example is more or less directly taken from the hyperas documentation. We define data and model as functions, hyper-parameter ranges are defined through braces. See the hyperas documentation for more on how this works.

from __future__ import print_function
from hyperopt import Trials, STATUS_OK, tpe
from hyperas.distributions import choice, uniform

def data():
    '''
    Data providing function:

    Make sure to have every relevant import statement included here and return data as
    used in model function below. This function is separated from model() so that hyperopt
    won't reload data for each evaluation run.
    '''
    from keras.datasets import mnist
    from keras.utils import np_utils
    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    X_train = X_train.reshape(60000, 784)
    X_test = X_test.reshape(10000, 784)
    X_train = X_train.astype('float32')
    X_test = X_test.astype('float32')
    X_train /= 255
    X_test /= 255
    nb_classes = 10
    Y_train = np_utils.to_categorical(y_train, nb_classes)
    Y_test = np_utils.to_categorical(y_test, nb_classes)
    return X_train, Y_train, X_test, Y_test


def model(X_train, Y_train, X_test, Y_test):
    '''
    Model providing function:

    Create Keras model with double curly brackets dropped-in as needed.
    Return value has to be a valid python dictionary with two customary keys:
        - loss: Specify a numeric evaluation metric to be minimized
        - status: Just use STATUS_OK and see hyperopt documentation if not feasible
    The last one is optional, though recommended, namely:
        - model: specify the model just created so that we can later use it again.
    '''
    from keras.models import Sequential
    from keras.layers.core import Dense, Dropout, Activation
    from keras.optimizers import RMSprop
    import pickle  # needed to serialize the trained weights in the return value below

    model = Sequential()
    model.add(Dense(512, input_shape=(784,)))
    model.add(Activation('relu'))
    model.add(Dropout({{uniform(0, 1)}}))
    model.add(Dense({{choice([256, 512, 1024])}}))
    model.add(Activation('relu'))
    model.add(Dropout({{uniform(0, 1)}}))
    model.add(Dense(10))
    model.add(Activation('softmax'))

    rms = RMSprop()
    model.compile(loss='categorical_crossentropy', optimizer=rms)

    model.fit(X_train, Y_train,
              batch_size={{choice([64, 128])}},
              nb_epoch=1,
              show_accuracy=True,
              verbose=2,
              validation_data=(X_test, Y_test))
    score, acc = model.evaluate(X_test, Y_test, show_accuracy=True, verbose=0)
    print('Test accuracy:', acc)
    return {'loss': -acc, 'status': STATUS_OK, 'model': model.to_yaml(), 'weights': pickle.dumps(model.get_weights())}

Once the basic setup is defined, running the minimization is done in just a few lines of code:

from hyperas import optim
from elephas.hyperparam import HyperParamModel
from pyspark import SparkContext, SparkConf

# Create Spark context
conf = SparkConf().setAppName('Elephas_Hyperparameter_Optimization').setMaster('local[8]')
sc = SparkContext(conf=conf)

# Define hyper-parameter model and run optimization
hyperparam_model = HyperParamModel(sc)
hyperparam_model.minimize(model=model, data=data, max_evals=5)

Distributed training of ensemble models

Building on the last section, it is possible to train ensemble models with elephas by running hyper-parameter optimization on large search spaces and defining a voting classifier over the top-n performing models. With data and model defined as above, this is as simple as running

result = hyperparam_model.best_ensemble(nb_ensemble_models=10, model=model, data=data, max_evals=5)

In this example an ensemble of 10 models is built, based on optimization of at most 5 runs on each of the Spark workers.

Discussion

Premature parallelization may not be the root of all evil, but it is not always the best idea either. Keep in mind that more workers mean less data per worker, and parallelizing a model is no substitute for the model actually learning. So if your data fits comfortably into memory and you are happy with the training speed of the model, consider just using keras.

One exception to this rule may be that you're already working within the Spark ecosystem and want to leverage what's there. The above SparkML example shows how to use evaluation modules from Spark, and maybe you wish to further process the outcome of an elephas model down the road. In this case, we recommend using elephas as a simple wrapper by setting num_workers=1.

Note that right now elephas restricts itself to data-parallel algorithms for two reasons. First, Spark simply makes it very easy to distribute data. Second, neither Spark nor Theano makes it particularly easy to split up the actual model into parts, which makes model parallelism practically impossible to realize.

Having said all that, we hope you come to appreciate elephas as a playground for data-parallel deep-learning algorithms that is pretty easy to set up and use.

Future work & contributions

Constructive feedback and pull requests for elephas are very welcome. Here are a few things we have in mind for future development:

  • Benchmarks for training speed and accuracy.
  • Some real-world tests on EC2 instances with large data sets like imagenet.

Literature

[1] J. Dean, G.S. Corrado, R. Monga, K. Chen, M. Devin, QV. Le, MZ. Mao, M’A. Ranzato, A. Senior, P. Tucker, K. Yang, and AY. Ng. Large Scale Distributed Deep Networks.

[2] F. Niu, B. Recht, C. Re, S.J. Wright HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

[3] C. Noel, S. Osindero. Dogwild! — Distributed Hogwild for CPU & GPU
