cuFFT
    1. Introduction

    This document describes cuFFT, the NVIDIA® CUDA™ Fast Fourier Transform (FFT) product. It consists of two separate libraries: cuFFT and cuFFTW. The cuFFT library is designed to provide high performance on NVIDIA GPUs. The cuFFTW library is provided as a porting tool to enable users of FFTW to start using NVIDIA GPUs with a minimum amount of effort.

    The FFT is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets. It is one of the most important and widely used numerical algorithms in computational physics and general signal processing. The cuFFT library provides a simple interface for computing FFTs on an NVIDIA GPU, which allows users to quickly leverage the floating-point power and parallelism of the GPU in a highly optimized and tested FFT library.

    The cuFFT product supports a wide range of FFT inputs and options efficiently on NVIDIA GPUs. This version of the cuFFT library supports the following features:

    • Algorithms highly optimized for input sizes that can be written in the form 2^a × 3^b × 5^c × 7^d. In general, the smaller the prime factor, the better the performance, i.e., powers of two are fastest.
    • An O(n log n) algorithm for every input data size
    • Half-precision (16-bit floating point), single-precision (32-bit floating point) and double-precision (64-bit floating point). Transforms of lower precision have higher performance.
    • Complex and real-valued input and output. Real-valued input or output requires fewer computations and less data than complex values and often has a faster time to solution. Types supported are:
      • C2C - Complex input to complex output
      • R2C - Real input to complex output
      • C2R - Symmetric complex input to real output
    • 1D, 2D and 3D transforms
    • Execution of multiple 1D, 2D and 3D transforms simultaneously. These batched transforms have higher performance than single transforms.
    • In-place and out-of-place transforms
    • Arbitrary intra- and inter-dimension element strides (strided layout)
    • FFTW compatible data layout
    • Execution of transforms across multiple GPUs
    • Streamed execution, enabling asynchronous computation and data movement

    The cuFFTW library provides the FFTW3 API to facilitate porting of existing FFTW applications.

    2. Using the cuFFT API

    This chapter provides a general overview of the cuFFT library API. For more complete information on specific functions, see cuFFT API Reference. Users are encouraged to read this chapter before continuing with more detailed descriptions.

    The discrete Fourier transform (DFT) maps a complex-valued vector x_n (time domain) into its frequency domain representation given by:

    X_k = Σ_{n=0}^{N−1} x_n · e^(−2πi·kn/N)

    where X_k is a complex-valued vector of the same size. This is known as a forward DFT. If the sign on the exponent of e is changed to be positive, the transform is an inverse transform. Depending on N, different algorithms are deployed for the best performance.

    The cuFFT API is modeled after FFTW, which is one of the most popular and efficient CPU-based FFT libraries. cuFFT provides a simple configuration mechanism called a plan that uses internal building blocks to optimize the transform for the given configuration and the particular GPU hardware selected. Then, when the execution function is called, the actual transform takes place following the plan of execution. The advantage of this approach is that once the user creates a plan, the library retains whatever state is needed to execute the plan multiple times without recalculation of the configuration. This model works well for cuFFT because different kinds of FFTs require different thread configurations and GPU resources, and the plan interface provides a simple way of reusing configurations.

    Computing a number BATCH of one-dimensional DFTs of size NX using cuFFT will typically look like this:

    #define NX 256
    #define BATCH 10
    #define RANK 1
    ...
    {
        cufftHandle plan;
        cufftComplex *data;
        int n[RANK] = {NX};
        ...
        cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*BATCH);
        /* NULL inembed/onembed selects the basic (contiguous) data layout */
        cufftPlanMany(&plan, RANK, n,
                      NULL, 1, NX,   /* inembed, istride, idist */
                      NULL, 1, NX,   /* onembed, ostride, odist */
                      CUFFT_C2C, BATCH);
        ...
        cufftExecC2C(plan, data, data, CUFFT_FORWARD);
        cudaDeviceSynchronize();
        ...
        cufftDestroy(plan);
        cudaFree(data);
    }
    
    

    2.1. Accessing cuFFT

    The cuFFT and cuFFTW libraries are available as shared libraries. They consist of compiled programs ready for users to incorporate into applications with the compiler and linker. cuFFT can be downloaded from http://developer.nvidia.com/cufft. By selecting Download CUDA Production Release, users can install the package containing the CUDA Toolkit, SDK code samples and development drivers. The CUDA Toolkit contains cuFFT and the samples include simplecuFFT.

    The Linux release for simplecuFFT assumes that the root install directory is /usr/local/cuda and that the locations of the products are contained there as follows. Modify the Makefile as appropriate for your system.

    Product                              | Location and name         | Include file
    nvcc compiler                        | /bin/nvcc                 |
    cuFFT library                        | {lib, lib64}/libcufft.so  | inc/cufft.h
    cuFFT library with Xt functionality  | {lib, lib64}/libcufft.so  | inc/cufftXt.h
    cuFFTW library                       | {lib, lib64}/libcufftw.so | inc/cufftw.h

    The most common case is for developers to modify an existing CUDA routine (for example, filename.cu) to call cuFFT routines. In this case the include file cufft.h or cufftXt.h should be inserted into filename.cu file and the library included in the link line. A single compile and link line might appear as

    • /usr/local/cuda/bin/nvcc [options] filename.cu … -I/usr/local/cuda/inc -L/usr/local/cuda/lib -lcufft

    Of course there will typically be many compile lines and the compiler g++ may be used for linking so long as the library path is set correctly.

    Users of the FFTW interface (see FFTW Interface to cuFFT) should include cufftw.h and link with both cuFFT and cuFFTW libraries.

    For the best performance input data should reside in device memory. Therefore programs in the cuFFT library assume that the data is in GPU memory. For example, if one of the execution functions is called with data in host memory, the program will return CUFFT_EXEC_FAILED. Programs in the cuFFTW library assume that the input data is in host memory since this library is a porting tool for users of FFTW. If the data resides in GPU memory, the program will abort.

    2.2. Fourier Transform Setup

    The first step in using the cuFFT Library is to create a plan using one of the following:

    • cufftPlan1D() / cufftPlan2D() / cufftPlan3D() - Create a simple plan for a 1D/2D/3D transform respectively.
    • cufftPlanMany() - Creates a plan supporting batched input and strided data layouts.
    • cufftXtMakePlanMany() - Creates a plan supporting batched input and strided data layouts for any supported precision.

    Among the plan creation functions, cufftPlanMany() allows use of more complicated data layouts and batched executions. Execution of a transform of a particular size and type may take several stages of processing. When a plan for the transform is generated, cuFFT derives the internal steps that need to be taken. These steps may include multiple kernel launches, memory copies, and so on. In addition, all the intermediate buffer allocations (on CPU/GPU memory) take place during planning. These buffers are released when the plan is destroyed. In the worst case, the cuFFT Library allocates space for 8*batch*n[0]*..*n[rank-1] cufftComplex or cufftDoubleComplex elements (where batch denotes the number of transforms that will be executed in parallel, rank is the number of dimensions of the input data (see Multidimensional Transforms) and n[] is the array of transform dimensions) for single and double-precision transforms respectively. Depending on the configuration of the plan, less memory may be used. In some specific cases, the temporary space allocations can be as low as 1*batch*n[0]*..*n[rank-1] cufftComplex or cufftDoubleComplex elements. This temporary space is allocated separately for each individual plan when it is created (i.e., temporary space is not shared between the plans).
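The worst-case bound above can be turned into a quick back-of-the-envelope estimate. The helper below is our own illustration of the formula, not a cuFFT API function (cuFFT's actual usage is reported by cufftGetSize() and is often lower):

```c
#include <stddef.h>

/* Worst-case cuFFT temporary-space estimate in bytes:
 * 8 * batch * n[0] * ... * n[rank-1] elements of the given element size.
 * Use sizeof(cufftComplex) == 8 for single precision,
 * sizeof(cufftDoubleComplex) == 16 for double precision. */
static size_t worst_case_work_bytes(int batch, int rank,
                                    const int *n, size_t elem_size)
{
    size_t elems = 8 * (size_t)batch;
    for (int i = 0; i < rank; i++)
        elems *= (size_t)n[i];
    return elems * elem_size;
}
```

For the earlier example (NX = 256, BATCH = 10, single precision), this gives 8 × 10 × 256 × 8 bytes, i.e. 160 KiB in the worst case.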

    The next step in using the library is to call an execution function such as cufftExecC2C() (see Parameter cufftType) which will perform the transform with the specifications defined at planning.

    One can create a cuFFT plan and perform multiple transforms on different data sets by providing different input and output pointers. Once the plan is no longer needed, the cufftDestroy() function should be called to release the resources allocated for the plan.

    Free memory requirement

    The first program call to any cuFFT function causes the initialization of the cuFFT kernels. This can fail if there is not enough free memory on the GPU. It is advisable to initialize cuFFT first (e.g., by creating a plan) and then allocate memory.

    2.3. Fourier Transform Types

    Apart from the general complex-to-complex (C2C) transform, cuFFT implements efficiently two other types: real-to-complex (R2C) and complex-to-real (C2R). In many practical applications the input vector is real-valued. It can be easily shown that in this case the output satisfies Hermitian symmetry (X_k = X*_{N−k}, where the star denotes complex conjugation). The converse is also true: for complex-Hermitian input the inverse transform will be purely real-valued. cuFFT takes advantage of this redundancy and works only on the first half of the Hermitian vector.

    Transform execution functions for single and double-precision are defined separately as:

    • cufftExecC2C() / cufftExecZ2Z() - complex-to-complex transforms for single/double precision.
    • cufftExecR2C() / cufftExecD2Z() - real-to-complex forward transform for single/double precision.
    • cufftExecC2R() / cufftExecZ2D() - complex-to-real inverse transform for single/double precision.

    Each of those functions demands different input data layout (see Data Layout for details).

    Functions cufftXtExec() and cufftXtExecDescriptor() can perform transforms on any of the supported types.

    2.3.1. Half precision cuFFT Transforms

    Half precision transforms have the following limitations:

    • Minimum GPU architecture is SM_53
    • Sizes are restricted to powers of two only
    • Strides are not supported
    • More than one GPU is not supported

    CUDA Toolkit provides cuda_fp16.h header with types and intrinsic functions for handling half precision arithmetic.

    2.4. Data Layout

    In the cuFFT Library, data layout depends strictly on the configuration and the transform type. In the case of a general complex-to-complex transform, both the input and output data shall be a cufftComplex/cufftDoubleComplex array in single- and double-precision modes respectively. In C2R mode an input array (x_1, x_2, …, x_{⌊N/2⌋+1}) of only non-redundant complex elements is required. The output array (X_1, X_2, …, X_N) consists of cufftReal/cufftDoubleReal elements in this mode. Finally, R2C demands an input array (X_1, X_2, …, X_N) of real values and returns an array (x_1, x_2, …, x_{⌊N/2⌋+1}) of non-redundant complex elements.

    In real-to-complex and complex-to-real transforms the size of input data and the size of output data differ. For out-of-place transforms a separate array of appropriate size is created. For in-place transforms the user should use a padded data layout. This layout is FFTW compatible.

    In the padded layout output signals begin at the same memory addresses as the input data. Therefore input data for real-to-complex and output data for complex-to-real must be padded.

    Expected sizes of input/output data for 1-d transforms are summarized in the table below:

    FFT type | Input data size          | Output data size
    C2C      | x cufftComplex           | x cufftComplex
    C2R      | ⌊x/2⌋ + 1 cufftComplex   | x cufftReal
    R2C*     | x cufftReal              | ⌊x/2⌋ + 1 cufftComplex

    The real-to-complex transform is implicitly a forward transform. For an in-place real-to-complex transform where FFTW compatible output is desired, the input size must be padded to ⌊N/2⌋ + 1 complex elements. For out-of-place transforms, input and output sizes match the logical transform size N and the non-redundant size ⌊N/2⌋ + 1, respectively.

    The complex-to-real transform is implicitly inverse. For in-place complex-to-real FFTs where FFTW compatible output is selected (default padding mode), the input size is assumed to be ⌊N/2⌋ + 1 cufftComplex elements. Note that in-place complex-to-real FFTs may overwrite arbitrary imaginary input point values when non-unit input and output strides are chosen. For out-of-place transforms, input and output sizes match the logical transform non-redundant size ⌊N/2⌋ + 1 and size N, respectively.
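The element counts implied by the padded layout can be written down directly. These two helpers are our own illustration, not part of the cuFFT API:

```c
/* Non-redundant complex element count: R2C output / C2R input
 * for a 1D transform of logical size N. */
static int complex_elems(int N) { return N / 2 + 1; }

/* Real element count for an in-place R2C/C2R buffer: the real data
 * must be padded so the buffer can hold (N/2 + 1) complex values,
 * i.e. 2*(N/2 + 1) cufftReal slots. */
static int padded_real_elems(int N) { return 2 * (N / 2 + 1); }
```

For example, an in-place R2C transform of size 256 needs a buffer of 258 cufftReal elements, two more than the 256 real inputs it logically holds.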

    2.5. Multidimensional Transforms

    Multidimensional DFTs map a d-dimensional array x_n, where n = (n_1, n_2, …, n_d), into its frequency domain array given by:

    X_k = Σ_{n=0}^{N−1} x_n · e^(−2πi·k·(n/N))

    where n/N = (n_1/N_1, n_2/N_2, …, n_d/N_d), and the summation denotes the set of nested summations

    Σ_{n_1=0}^{N_1−1} Σ_{n_2=0}^{N_2−1} … Σ_{n_d=0}^{N_d−1}

    cuFFT supports one-dimensional, two-dimensional and three-dimensional transforms, which can all be called by the same cufftExec* functions (see Fourier Transform Types).

    Similar to the one-dimensional case, the frequency domain representation of real-valued input data satisfies Hermitian symmetry, defined as: x(n_1, n_2, …, n_d) = x*(N_1 − n_1, N_2 − n_2, …, N_d − n_d).

    C2R and R2C algorithms take advantage of this fact by operating only on half of the elements of the signal array, namely on: x_n for n ∈ {1, …, N_1} × … × {1, …, N_{d−1}} × {1, …, ⌊N_d/2⌋ + 1}.

    The general rules of data alignment described in Data Layout apply to higher-dimensional transforms. The following table summarizes input and output data sizes for multidimensional DFTs:

    Dims | FFT type | Input data size                    | Output data size
    1D   | C2C      | N_1 cufftComplex                   | N_1 cufftComplex
    1D   | C2R      | ⌊N_1/2⌋ + 1 cufftComplex           | N_1 cufftReal
    1D   | R2C      | N_1 cufftReal                      | ⌊N_1/2⌋ + 1 cufftComplex
    2D   | C2C      | N_1 N_2 cufftComplex               | N_1 N_2 cufftComplex
    2D   | C2R      | N_1 (⌊N_2/2⌋ + 1) cufftComplex     | N_1 N_2 cufftReal
    2D   | R2C      | N_1 N_2 cufftReal                  | N_1 (⌊N_2/2⌋ + 1) cufftComplex
    3D   | C2C      | N_1 N_2 N_3 cufftComplex           | N_1 N_2 N_3 cufftComplex
    3D   | C2R      | N_1 N_2 (⌊N_3/2⌋ + 1) cufftComplex | N_1 N_2 N_3 cufftReal
    3D   | R2C      | N_1 N_2 N_3 cufftReal              | N_1 N_2 (⌊N_3/2⌋ + 1) cufftComplex

    For example, static declaration of a three-dimensional array for the output of an out-of-place real-to-complex transform will look like this:

    cufftComplex odata[N1][N2][N3/2+1];

    2.6. Advanced Data Layout

    The advanced data layout feature allows transforming only a subset of an input array, or outputting to only a portion of a larger data structure. It can be set by calling function:

    cufftResult cufftPlanMany(cufftHandle *plan, int rank, int *n, int *inembed,
        int istride, int idist, int *onembed, int ostride,
        int odist, cufftType type, int batch);
    

    Passing inembed or onembed set to NULL is a special case and is equivalent to passing n for each. This is the same as the basic data layout, and other advanced parameters such as istride are ignored.

    If the advanced parameters are to be used, then all of the advanced interface parameters must be specified correctly. Advanced parameters are defined in units of the relevant data type (cufftReal, cufftDoubleReal, cufftComplex, or cufftDoubleComplex).

    Advanced layout can be perceived as an additional layer of abstraction above the access to input/output data arrays. An element of coordinates [z][y][x] in signal number b in the batch will be associated with the following addresses in the memory:

    • 1D

      input[ b * idist + x * istride]

      output[ b * odist + x * ostride]

    • 2D

      input[b * idist + (x * inembed[1] + y) * istride]

      output[b * odist + (x * onembed[1] + y) * ostride]

    • 3D

      input[b * idist + ((x * inembed[1] + y) * inembed[2] + z) * istride]

      output[b * odist + ((x * onembed[1] + y) * onembed[2] + z) * ostride]

    The istride and ostride parameters denote the distance between two successive input and output elements in the least significant (that is, the innermost) dimension respectively. In a single 1D transform, if every input element is to be used in the transform, istride should be set to 1 ; if every other input element is to be used in the transform, then istride should be set to 2 . Similarly, in a single 1D transform, if it is desired to output final elements one after another compactly, ostride should be set to 1 ; if spacing is desired between the least significant dimension output data, ostride should be set to the distance between the elements.

    The inembed and onembed parameters define the number of elements in each dimension in the input array and the output array respectively. inembed[rank-1] contains the number of elements in the least significant (innermost) dimension of the input data, excluding the istride elements; the total number of elements in the least significant dimension of the input array is then istride*inembed[rank-1]. inembed[0] or onembed[0] corresponds to the most significant (that is, the outermost) dimension and is effectively ignored since the idist or odist parameter provides this information instead. Note that the size of each dimension of the transform should be less than or equal to the inembed and onembed values for the corresponding dimension, that is n[i] ≤ inembed[i], n[i] ≤ onembed[i], where i ∈ { 0 , … , rank − 1 }.

    The idist and odist parameters indicate the distance between the first element of two consecutive batches in the input and output data.
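The 2D addressing rule above can be expressed as a small helper (our own illustration, not a cuFFT function; note the convention in these formulas, where x indexes the outer dimension and y the innermost one):

```c
#include <stddef.h>

/* Linear index of the input element of signal b in a batched 2D advanced
 * layout, per the rule: input[b*idist + (x*inembed[1] + y)*istride]. */
static size_t advanced_index_2d(int b, int x, int y,
                                int idist, const int *inembed, int istride)
{
    return (size_t)b * idist + ((size_t)x * inembed[1] + y) * istride;
}
```

With istride = 1, inembed = n, and idist = n[0]*n[1], this reduces to the ordinary row-major index of a contiguous batch, which is the basic layout as a special case of the advanced one.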

    2.7. Streamed cuFFT Transforms

    Every cuFFT plan may be associated with a CUDA stream. Once so associated, all launches of the internal stages of that plan take place through the specified stream. Streaming of cuFFT execution allows for potential overlap between transforms and memory copies. (See the NVIDIA CUDA Programming Guide for more information on streams.) If no stream is associated with a plan, launches take place in stream(0), the default CUDA stream. Note that many plan executions require multiple kernel launches.

    cufftSetStream() returns an error in the multiple GPU case as multiple GPU plans perform operations in their own streams.

    2.8. Multiple GPU cuFFT Transforms

    cuFFT supports using up to eight GPUs connected to a CPU to perform Fourier Transforms whose calculations are distributed across the GPUs. An API has been defined to allow users to write new code or modify existing code to use this functionality.

    Some existing functions such as the creation of a plan using cufftCreate() also apply in the multiple GPU case. Multiple GPU routines contain Xt in their name.

    The memory on the GPUs is managed by helper functions cufftXtMalloc()/cufftXtFree() and cufftXtMemcpy() using the cudaLibXtDesc descriptor.

    Performance is a function of the bandwidth between the GPUs, the computational ability of the individual GPUs, and the type and number of FFTs to be performed. The highest performance is obtained using NVLink interconnect (http://www.nvidia.com/object/nvlink.html). The second best option is using PCI Express 3.0 between the GPUs and ensuring that both GPUs are on the same switch. Note that multiple GPU execution is not guaranteed to solve a given size problem in a shorter time than single GPU execution.

    The multiple GPU extensions to cuFFT are built on the extensible cuFFT API. The general steps in defining and executing a transform with this API are:

    • cufftCreate() - create an empty plan, as in the single GPU case
    • cufftXtSetGPUs() - define which GPUs are to be used
    • Optional: cufftEstimate{1d,2d,3d,Many}() - estimate the sizes of the work areas required. These are the same functions used in the single GPU case although the definition of the argument workSize reflects the number of GPUs used.
    • cufftMakePlan{1d,2d,3d,Many}() - create the plan. These are the same functions used in the single GPU case although the definition of the argument workSize reflects the number of GPUs used.
    • Optional: cufftGetSize{1d,2d,3d,Many}() - refined estimate of the sizes of the work areas required. These are the same functions used in the single GPU case although the definition of the argument workSize reflects the number of GPUs used.
    • Optional: cufftGetSize() - check workspace size. This is the same function used in the single GPU case although the definition of the argument workSize reflects the number of GPUs used.
    • Optional: cufftXtSetWorkArea() - do your own workspace allocation.
    • cufftXtMalloc() - allocate descriptor and data on the GPUs
    • cufftXtMemcpy() - copy data to the GPUs
    • cufftXtExecDescriptorC2C()/cufftXtExecDescriptorZ2Z() - execute the plan
    • cufftXtMemcpy() - copy data from the GPUs
    • cufftXtFree() - free any memory allocated with cufftXtMalloc()
    • cufftDestroy() - free cuFFT plan resources

    2.8.1. Plan Specification and Work Areas

    In the single GPU case a plan is created by a call to cufftCreate() followed by a call to cufftMakePlan*(). For multiple GPUs, the GPUs to use for execution are identified by a call to cufftXtSetGPUs() and this must occur after the call to cufftCreate() and prior to the call to cufftMakePlan*().

    Note that when cufftMakePlan*() is called for a single GPU, the work area is on that GPU. In a multiple GPU plan, the returned work area has multiple entries, one value per GPU. That is, workSize points to a size_t array, one entry per GPU. Also, the strides and batches apply to the entire plan across all GPUs associated with the plan.

    Once a plan is locked by a call to cufftMakePlan*(), different descriptors may be specified in calls to cufftXtExecDescriptor*() to execute the plan on different data sets, but the new descriptors must use the same GPUs in the same order.

    As in the single GPU case, cufftEstimateSize{Many,1d,2d,3d}() and cufftGetSize{Many,1d,2d,3d}() give estimates of the work area sizes required for a multiple GPU plan and in this case workSize points to a size_t array, one entry per GPU.

    Similarly the actual work size returned by cufftGetSize() is a size_t array, one entry per GPU in the multiple GPU case.

    2.8.2. Helper Functions

    Multiple GPU cuFFT execution functions assume a certain data layout in terms of what input data has been copied to which GPUs prior to execution, and what output data resides in which GPUs post execution. cuFFT provides functions to assist users in manipulating data on multiple GPUs. These must be called after the call to cufftMakePlan*().

    On a single GPU users may call cudaMalloc() and cudaFree() to allocate and free GPU memory. To provide similar functionality in the multiple GPU case, cuFFT includes cufftXtMalloc() and cufftXtFree() functions. The function cufftXtMalloc() returns a descriptor which specifies the location of these memories.

    On a single GPU users may call cudaMemcpy() to transfer data between host and GPU memory. To provide similar functionality in the multiple GPU case, cuFFT includes cufftXtMemcpy() which allows users to copy between host and multiple GPU memories or even between the GPU memories.

    All single GPU cuFFT FFTs return the output data in natural order, that is, the ordering of the result is the same as if a DFT had been performed on the data. Some fast Fourier transforms produce intermediate results where the data is left in a permutation of the natural output. When batch is one, data is left in GPU memory in a permutation of the natural output.

    When cufftXtMemcpy() is used to copy data from GPU memory back to host memory, the results are in natural order regardless of whether the data on the GPUs is in natural order or permuted. Using CUFFT_COPY_DEVICE_TO_DEVICE allows users to copy data from the permuted data format produced after a single transform to the natural order on GPUs.

    2.8.3. Multiple GPU 2D and 3D Transforms on Permuted Input

    For single 2D or 3D transforms on multiple GPUs, when cufftXtMemcpy() distributes the data to the GPUs, the array is divided on the X axis. E.g., for two GPUs, half of the X dimension points, for all Y (and Z) values, are copied to each of the GPUs. When the transform is computed, the data are permuted such that they are divided on the Y axis. I.e., half of the Y dimension points, for all X (and Z) values, are on each of the GPUs.

    When cuFFT creates a 2D or 3D plan for a single transform on multiple GPUs, it actually creates two plans. One plan expects input to be divided on the X axis. The other plan expects data to be divided on the Y axis. This is done because many algorithms compute a forward FFT, then perform some point-wise operation on the result, and then compute the inverse FFT. A memory copy to restore the data to the original order would be expensive. To avoid this, cufftXtMemcpy and cufftXtExecDescriptor() keep track of the data ordering so that the correct operation is used.

    The ability of cuFFT to process data in either order makes the following sequence possible.

    • cufftCreate() - create an empty plan, as in the single GPU case
    • cufftXtSetGPUs() - define which GPUs are to be used
    • cufftMakePlan{1d,2d,3d,Many}() - create the plan.
    • cufftXtMalloc() - allocate descriptor and data on the GPUs
    • cufftXtMemcpy() - copy data to the GPUs
    • cufftXtExecDescriptorC2C()/cufftXtExecDescriptorZ2Z() - compute the forward FFT
    • userFunction() - modify the data in the frequency domain
    • cufftXtExecDescriptorC2C()/cufftXtExecDescriptorZ2Z() - compute the inverse FFT
    • Note that it was not necessary to copy/permute the data between execute calls
    • cufftXtMemcpy() - copy data to the host
    • cufftXtFree() - free any memory allocated with cufftXtMalloc()
    • cufftDestroy() - free cuFFT plan resources

    2.8.4. Supported Functionality

    Since version 7.0 only a subset of single GPU functionality is supported for multiple GPU execution.

    Supported functionality:

    • Plans operating on two, four or eight GPUs are supported.
    • Up to eight GPUs are supported for 2D and 3D transforms or when number of batches is greater than 1.
    • All GPUs must have the same CUDA architecture level.
    • The GPUs must support the Unified Virtual Address Space.
    • On Windows, the GPU boards must be operating in Tesla Compute Cluster (TCC) mode.
    • Running cuFFT on multiple GPUs is not compatible with an application that uses the CUDA Driver API.
    • Strided input and output are not supported.
    • When the number of batches is 1:
      • Only C2C and Z2Z transform types are supported.
      • Only in-place transforms are supported.
      • For 1D transforms, the dimension must be a power of 2 greater than 64. For eight GPUs the transform size must be greater than 128.
      • For 2D and 3D transforms, the dimensions must factor into primes less than or equal to 127. The X and Y dimensions of the transform must be greater than or equal to 32.

    General guidelines are:

    • Parameter whichGPUs of cufftXtSetGPUs() function determines ordering of the GPUs with respect to data decomposition (first data chunk is placed on GPU denoted by first element of whichGPUs)
    • The data for the entire transform must fit within the memory of the GPUs assigned to it.
    • For batch size m on n GPUs:
      • The first m % n GPUs execute ⌊m/n⌋ + 1 transforms.
      • The remaining GPUs execute ⌊m/n⌋ transforms.
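The distribution rule above can be sketched as a helper (a hypothetical function of our own, not part of the cuFFT API):

```c
/* Number of transforms GPU g (0-based) executes for batch size m on
 * n GPUs: the first m % n GPUs get m/n + 1, the rest get m/n. */
static int transforms_on_gpu(int m, int n, int g)
{
    return m / n + (g < m % n ? 1 : 0);
}
```

For example, a batch of 10 transforms on 4 GPUs is split 3, 3, 2, 2; the per-GPU counts always sum to m.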

    Batch size output differences:

    Single GPU cuFFT results are always returned in natural order. When multiple GPUs are used to perform more than one transform, the results are also returned in natural order. When multiple GPUs are used to perform a single transform the results are returned in a permutation of the normal results to reduce communication time.

    Number of GPUs Number of transforms Output Order on GPUs
    One One or multiple transforms Natural order
    Multiple One Permuted results
    Multiple Multiple Natural order

    Producing natural order results in GPU memory in the 1D single transform case requires calling cufftXtMemcpy() with CUFFT_COPY_DEVICE_TO_DEVICE.

    2D and 3D transforms support execution of a transform given permuted order results as input. After execution in this case, the output will be in natural order. It is also possible to use cufftXtMemcpy() with CUFFT_COPY_DEVICE_TO_DEVICE to return 2D or 3D data to natural order.

    The only supported multiple GPU configurations are 2, 4, or 8 GPUs, all with the same CUDA architecture level.

    See the cuFFT Code Examples section for single GPU and multiple GPU examples.

    2.9. cuFFT Callback Routines

    Callback routines are user-supplied kernel routines that cuFFT will call when loading or storing data. They allow the user to do data pre- or post- processing without additional kernel calls.

    2.9.1. Overview of the cuFFT Callback Routine Feature

    cuFFT provides a set of APIs that allow the cuFFT user to provide CUDA functions that redirect or manipulate the data as it is loaded prior to processing the FFT, or stored once the FFT has been done. For the load callback, cuFFT passes the callback routine the address of the input data and the offset to the value to be loaded from device memory, and the callback routine returns the value it wishes cuFFT to use instead. For the store callback, cuFFT passes the callback routine the value it has computed, along with the address of the output data and the offset to the value to be written to device memory, and the callback routine modifies the value and stores the modified result.

    In order to provide a callback to cuFFT, a plan is created and configured normally using the extensible plan APIs. After the call to cufftCreate and cufftMakePlan, the user may associate a load callback routine, or a store callback routine, or both, with the plan, by calling cufftXtSetCallback. The caller also has the option to specify a device pointer to an opaque structure they wish to associate with the plan. This pointer will be passed to the callback routine by the cuFFT library. The caller may use this structure to remember plan dimensions and strides, or have a pointer to auxiliary data, etc.

    With some restrictions, the callback routine is allowed to request shared memory for its own use. If the requested amount of shared memory is available, cuFFT will pass a pointer to it when it calls the callback routine.

    cuFFT allows for 8 types of callback routine, one for each possible combination of: load or store, real or complex, single or double precision. It is the caller's responsibility to provide a routine that matches the function prototype for the type of routine specified. If there is already a callback of the specified type associated with the plan, the set callback function will replace it with the new one.

    The callback routine extensions to cuFFT are built on the extensible cuFFT API. The general steps in defining and executing a transform with callbacks are:

    • cufftCreate() - create an empty plan, as in the single GPU case
    • cufftMakePlan{1d,2d,3d,Many}() - create the plan. These are the same functions used in the single GPU case.
    • cufftXtSetCallback() - called for load and/or store callback for this plan
    • cufftExecC2C() etc. - execute the plan
    • cufftDestroy() - free cuFFT plan resources
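Put together, the host-side sequence might look like the following sketch (error checking omitted; nx, d_in, d_out, and hostCopyOfCallbackPtr are illustrative names assumed to be set up elsewhere, the last as described in the next section):

```cpp
// Sketch: single-precision C2C transform with a load callback.
cufftHandle plan;
size_t workSize;
cufftCreate(&plan);                                  // empty plan
cufftMakePlan1d(plan, nx, CUFFT_C2C, 1, &workSize);  // generate the plan
cufftXtSetCallback(plan, (void **)&hostCopyOfCallbackPtr,
                   CUFFT_CB_LD_COMPLEX, NULL);       // attach load callback
cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD);      // execute
cufftDestroy(plan);                                  // free plan resources
```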

    Callback functions are not supported on transforms with a dimension size that does not factor into primes smaller than 127. Callback functions on plans whose dimensions' prime factors are limited to 2, 3, 5, and 7 can safely call __syncthreads(). On other plans, results are not defined.

    NOTE: The callback API is available only in the statically linked cuFFT library, and only on 64-bit Linux operating systems.

    2.9.2. Specifying Load and Store Callback Routines

    In order to associate a callback routine with a plan, it is necessary to obtain a device pointer to the callback routine.

    As an example, if the user wants to specify a load callback for an R2C transform, they would write the device code for the callback function, and define a global device variable that contains a pointer to the function:

     __device__  cufftReal myOwnCallback(void *dataIn, 
                                         size_t offset, 
                                         void *callerInfo,
                                         void *sharedPtr) {
         cufftReal ret;
         // use offset, dataIn, and optionally callerInfo to 
         // compute the return value
         return ret;
     }
     __device__ cufftCallbackLoadR myOwnCallbackPtr = myOwnCallback;
    

    From the host side, the user then has to get the address of the callback routine, which is stored in myOwnCallbackPtr. This is done with cudaMemcpyFromSymbol, as follows:

    cufftCallbackLoadR hostCopyOfCallbackPtr;
    
    cudaMemcpyFromSymbol(&hostCopyOfCallbackPtr, 
                         myOwnCallbackPtr, 
                         sizeof(hostCopyOfCallbackPtr));
    

    hostCopyOfCallbackPtr then contains the device address of the callback routine, which should be passed to cufftXtSetCallback. Note that, for multi-GPU transforms, hostCopyOfCallbackPtr will need to be an array of pointers, and the cudaMemcpyFromSymbol will have to be invoked for each GPU. Please note that __managed__ variables are not suitable to pass to cufftXtSetCallback due to restrictions on variable usage (See the NVIDIA CUDA Programming Guide for more information about __managed__ variables).
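For the R2C example above, the association might look like the following sketch (devCallerInfoPtr is an illustrative name for an optional device pointer to caller data; pass NULL if no caller data is needed):

```cpp
// Attach the load callback (and optional caller data) to the plan.
cufftXtSetCallback(plan, (void **)&hostCopyOfCallbackPtr,
                   CUFFT_CB_LD_REAL, (void **)&devCallerInfoPtr);
```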

    2.9.3. Callback Routine Function Details

    Below are the function prototypes and typedefs for pointers to the user-supplied callback routines that cuFFT calls to load data prior to the transform.

    typedef  cufftComplex (*cufftCallbackLoadC)(void *dataIn, 
                                                size_t offset, 
                                                void *callerInfo,
                                                void *sharedPointer);
    
     typedef  cufftDoubleComplex (*cufftCallbackLoadZ)(void *dataIn, 
                                                       size_t offset, 
                                                       void *callerInfo,
                                                       void *sharedPointer);
    
     typedef  cufftReal (*cufftCallbackLoadR)(void *dataIn,
                                              size_t offset, 
                                              void *callerInfo,
                                              void *sharedPointer);
    
     typedef  cufftDoubleReal (*cufftCallbackLoadD)(void *dataIn,
                                                    size_t offset, 
                                                    void *callerInfo,
                                                    void *sharedPointer);
    

    Parameters for all of the load callbacks are defined as below:

    • offset: offset of the input element from the start of the input data. This is not a byte offset, rather it is the number of elements from start of data.
    • dataIn: device pointer to the start of the input array that was passed in the cufftExecute call.
    • callerInfo: device pointer to the optional caller specified data passed in the cufftXtSetCallback call.
    • sharedPointer: pointer to shared memory, valid only if the user has called cufftXtSetCallbackSharedSize().

    Below are the function prototypes and typedefs for pointers to the user-supplied callback routines that cuFFT calls to store data after completion of the transform. Note that the store callback functions do not return a value. This is because a store callback function is responsible not only for transforming the data as desired, but also for writing the data to the desired location. This allows the store callback to rearrange the data, for example to shift the zero-frequency result to the center of the output.

    typedef  void (*cufftCallbackStoreC)(void *dataOut, 
                                         size_t offset,
                                         cufftComplex element,
                                         void *callerInfo,
                                         void *sharedPointer);
    
    typedef  void (*cufftCallbackStoreZ)(void *dataOut, 
                                         size_t offset,
                                         cufftDoubleComplex element,
                                         void *callerInfo,
                                         void *sharedPointer);
    
    typedef  void (*cufftCallbackStoreR)(void *dataOut,
                                         size_t offset,
                                         cufftReal element,
                                         void *callerInfo,
                                         void *sharedPointer);
    
    typedef  void (*cufftCallbackStoreD)(void *dataOut,
                                         size_t offset,
                                         cufftDoubleReal element,
                                         void *callerInfo,
                                         void *sharedPointer);
    

    Parameters for all of the store callbacks are defined as below:

    • offset: offset of the output element from the start of output data. This is not a byte offset, rather it is the number of elements from start of data.
    • dataOut: device pointer to the start of the output array that was passed in the cufftExecute call.
    • element: the real or complex result computed by cuFFT for the element specified by the offset argument.
    • callerInfo: device pointer to the optional caller specified data passed in the cufftXtSetCallback call.
    • sharedPointer: pointer to shared memory, valid only if the user has called cufftXtSetCallbackSharedSize().
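As an example, a single precision complex store callback that scales each output element by 1/N (a common normalization after an inverse transform) might look like the following sketch; the assumption here is that the caller passes a device pointer to a float holding 1/N as callerInfo:

```cpp
__device__ void myStoreCallback(void *dataOut,
                                size_t offset,
                                cufftComplex element,
                                void *callerInfo,
                                void *sharedPointer) {
    // callerInfo is assumed to point to a float holding 1/N
    float scale = *(float *)callerInfo;
    element.x *= scale;
    element.y *= scale;
    // the store callback is responsible for writing the result
    ((cufftComplex *)dataOut)[offset] = element;
}
__device__ cufftCallbackStoreC myStoreCallbackPtr = myStoreCallback;
```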

    2.9.4. Coding Considerations for the cuFFT Callback Routine Feature

    cuFFT supports callbacks on all types of transforms, regardless of dimension, batch size, stride between elements, or number of GPUs. Callbacks are supported for transforms of single and double precision.

    cuFFT supports a wide range of parameters, and based on those for a given plan, it attempts to optimize performance. The number of kernels launched, and for each of those, the number of blocks launched and the number of threads per block, will vary depending on how cuFFT decomposes the transform. For some configurations, cuFFT will load or store (and process) multiple inputs or outputs per thread. For some configurations, threads may load or store inputs or outputs in any order, and cuFFT does not guarantee that the inputs or outputs handled by a given thread will be contiguous. These characteristics may vary with transform size, transform type (e.g. C2C vs C2R), number of dimensions, and GPU architecture. These variations may also change from one library version to the next.

    cuFFT will call the load callback routine, for each point in the input, once and only once. Similarly it will call the store callback routine, for each point in the output, once and only once. If cuFFT is implementing a given FFT in multiple phases, it will only call the load callback routine from the first phase kernel(s), and it will only call the store callback routine from the last phase kernel(s).

    When cuFFT is using only a single kernel, both the load and store callback routines will be called from the same kernel. In this case, if the transform is being done in-place (i.e. input data and output data are in the same memory location) the store callback cannot safely write outside the confines of the specified element, unless it is writing the data to a completely separate output buffer.

    When more than one kernel is used to implement a transform, the thread and block structure of the first kernel (the one that does the load) is often different from the thread and block structure of the last kernel (the one that does the store).

    One common use of callbacks is to reduce the amount of data read from or written to memory, either by selective filtering or via type conversions. When more than one kernel is used to implement a transform, cuFFT alternates between the workspace and the output buffer to write intermediate results. This means that the output buffer must always be large enough to accommodate the entire transform.
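For example, a load callback can read a smaller storage type and widen it on the fly, reducing the bytes fetched from global memory. A sketch that loads signed 8-bit samples for a real transform (an assumption here is that the caller passes the 8-bit buffer as the input pointer to the exec call):

```cpp
__device__ cufftReal myLoad8Bit(void *dataIn,
                                size_t offset,
                                void *callerInfo,
                                void *sharedPointer) {
    // Read one int8 sample and widen it to a normalized float, so
    // only one byte per point is fetched instead of four.
    signed char v = ((signed char *)dataIn)[offset];
    return (cufftReal)v * (1.0f / 128.0f);
}
__device__ cufftCallbackLoadR myLoad8BitPtr = myLoad8Bit;
```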

    For multi-GPU transforms, the index passed to the callback routine is the element index from the start of data on that GPU, not from the start of the entire input or output data array.

    For transforms whose dimensions can be factored into powers of 2, 3, 5, or 7, cuFFT guarantees that it will call the load and store callback routines from points in the kernel where it is safe to call __syncthreads() from within the callback routine. The caller is responsible for guaranteeing that all threads have converged at the point of the call, to avoid deadlock. For plans whose dimensions factor into higher primes, the results of a callback routine calling __syncthreads() are not defined.

    2.10. Thread Safety

    cuFFT APIs are thread safe as long as different host threads execute FFTs using different plans and the output data are disjoint.

    2.11. Static Library and Callback Support

    Starting with release 6.5, the cuFFT libraries are also delivered in static form as libcufft_static.a and libcufftw_static.a on Linux and Mac. Static libraries are not supported on Windows. The static cuFFT and cuFFTW libraries depend on the thread abstraction layer library libculibos.a.

    For example, on Linux, to compile a small application using cuFFT against the dynamic library, the following command can be used:

        nvcc myCufftApp.c  -lcufft  -o myCufftApp
    

    For cuFFTW on Linux, to compile a small application against the dynamic library, the following command can be used:

        nvcc myCufftwApp.c  -lcufftw  -lcufft  -o myCufftwApp 
    

    To compile against the static cuFFT library, extra steps need to be taken: the library needs to be device-linked. This may happen during the build and link of a simple program, or as a separate step. The entire process is described in Using Separate Compilation in CUDA.

    To compile against the static cuFFT library, the following command has to be used:

         
        nvcc myCufftApp.c  -lcufft_static   -lculibos -o myCufftApp\
            -gencode arch=compute_20,\"code=sm_20\"\
            -gencode arch=compute_30,\"code=sm_30\"\
            -gencode arch=compute_35,\"code=sm_35\"\
            -gencode arch=compute_50,\"code=sm_50\"\
            -gencode arch=compute_60,\"code=sm_60\"\
            -gencode arch=compute_60,\"code=compute_60\"
    

    Similarly, to compile against the static cuFFTW library, the following command has to be used:

         
        nvcc myCufftwApp.c    libcufftw_static.a  libcufft_static.a   libculibos.a  -o myCufftwApp\
            -gencode arch=compute_20,\"code=sm_20\"\
            -gencode arch=compute_30,\"code=sm_30\"\
            -gencode arch=compute_35,\"code=sm_35\"\
            -gencode arch=compute_50,\"code=sm_50\"\
            -gencode arch=compute_60,\"code=sm_60\"\
            -gencode arch=compute_60,\"code=compute_60\"
    

    Please note that the cuFFT library might not contain code for certain architectures as long as there is code for a lower architecture that is binary compatible (e.g., SM37, SM52, SM61). This is reflected in the link commands above. To determine if a specific SM is included in the cuFFT library, one may use the cuobjdump utility. For example, if you wish to know if SM_50 is included, the command to run is cuobjdump -arch sm_50 libcufft_static.a. Some kernels are built only on select architectures (e.g., kernels with half precision arithmetic are present only for SM53 and above). This can cause warnings at link time that architectures are missing from these kernels. These warnings can safely be ignored.

    It is also possible to use the native Host C++ compiler and perform device link as a separate step. Please consult NVCC documentation for more details. Depending on the Host Operating system, some additional libraries like pthread or dl might be needed on the linking line.

    Note that in this case, the library cuda is not needed. The CUDA Runtime will try to explicitly open the cuda library if needed. In the case of a system which does not have the CUDA driver installed, this allows the application to gracefully manage this issue and potentially run if a CPU-only path is available.

    The cuFFT static library supports user-supplied callback routines. The callback routines are CUDA device code, and must be separately compiled with NVCC and linked with the cuFFT library. Refer to the NVCC documentation regarding separate compilation for details. If you specify an SM when compiling your callback functions, you must specify one of the SMs that cuFFT includes.

    2.12. Accuracy and Performance

    A DFT can be implemented as a matrix-vector multiplication that requires O ( N 2 ) operations. However, the cuFFT library employs the Cooley-Tukey algorithm to reduce the number of required operations and optimize the performance of particular transform sizes. This algorithm expresses the DFT matrix as a product of sparse building block matrices. The cuFFT library implements the following building blocks: radix-2, radix-3, radix-5, and radix-7. Hence the performance of any transform size that can be factored as 2 a × 3 b × 5 c × 7 d (where a, b, c, and d are non-negative integers) is optimized in the cuFFT library. There are also radix-m building blocks for other primes, m, whose value is < 128. When the length cannot be decomposed as multiples of powers of primes from 2 to 127, Bluestein's algorithm is used. Since the Bluestein implementation requires more computations per output point than the Cooley-Tukey implementation, the accuracy of the Cooley-Tukey algorithm is better. The pure Cooley-Tukey implementation has excellent accuracy, with the relative error growing proportionally to log 2 ( N ) , where N is the transform size in points.
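A quick way to reason about which code path a given size takes is to look at its largest prime factor; the following is a plain C sketch of that check (not part of the cuFFT API):

```c
/* Largest prime factor of n, by trial division (n >= 1). */
int largest_prime_factor(int n) {
    int p = 2, largest = 1;
    while (n > 1) {
        while (n % p == 0) { largest = p; n /= p; }
        p++;
    }
    return largest;
}

/* Size uses only the specialized radix-2/3/5/7 building blocks. */
int uses_fast_radices(int n) { return largest_prime_factor(n) <= 7; }

/* Size avoids Bluestein's algorithm (all prime factors below 128). */
int avoids_bluestein(int n) { return largest_prime_factor(n) < 128; }
```

For example, 11264 = 2^10 × 11 misses the fastest radices but still avoids Bluestein's algorithm, while 262 = 2 × 131 does not.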

    For sizes handled by the Cooley-Tukey code path, the most efficient implementation is obtained by applying the following constraints (listed in order from the most generic to the most specialized constraint, with each subsequent constraint providing the potential of an additional performance improvement).

    Half precision transforms might not be suitable for all kinds of problems due to the limited range representable by half precision floating point arithmetic. Please note that the first element of the FFT result is the sum of all input elements, and it is likely to overflow for certain inputs.
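To illustrate the overflow concern: the first (DC) output element of a forward FFT equals the sum of all inputs, and the largest finite half precision value is 65504, so the bound is easy to check ahead of time. A plain C sketch (not a cuFFT API):

```c
#define HALF_MAX 65504.0f  /* largest finite IEEE binary16 value */

/* DC bin of a constant input of `value` repeated n times. */
float dc_of_constant(float value, int n) {
    return value * (float)n;
}

/* Would the DC bin exceed the half precision range? */
int overflows_half(float value, int n) {
    return dc_of_constant(value, n) > HALF_MAX;
}
```

For instance, 2048 samples of value 32.0 already sum to 65536, which is outside the half precision range.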

    • All transforms: Use single precision transforms. Single precision transforms require less bandwidth per computation than double precision transforms.
    • All transforms: Restrict the size along all dimensions to be representable as 2 a × 3 b × 5 c × 7 d . The cuFFT library has highly optimized kernels for transforms whose dimensions have these prime factors. In general the best performance occurs when using powers of 2, followed by powers of 3, then 5, 7.
    • All transforms: Restrict the size along each dimension to use fewer distinct prime factors. A transform of size 2 n or 3 n will usually be faster than one of size 2 i × 3 j even if the latter is slightly smaller, due to the composition of specialized paths.
    • All transforms: Restrict the data to be contiguous in memory when performing a single transform; when performing multiple transforms, make the individual datasets contiguous. The cuFFT library has been optimized for this data layout.
    • All transforms: Perform multiple (i.e., batched) transforms. Additional optimizations are performed in batched mode.
    • Real-to-complex or complex-to-real transforms: Ensure the problem size of the x dimension is a multiple of 4. This scheme uses more efficient kernels to implement the conjugate symmetry property.
    • Real-to-complex or complex-to-real transforms: Use out-of-place mode. This scheme uses more efficient kernels than in-place mode.
    • Multiple GPU transforms: Use PCI Express 3.0 between GPUs and ensure the GPUs are on the same switch. The faster the interconnect between the GPUs, the faster the performance.

    3. cuFFT API Reference

    This chapter specifies the behavior of the cuFFT library functions by describing their input/output parameters, data types, and error codes. The cuFFT library is initialized upon the first invocation of an API function, and cuFFT shuts down automatically when all user-created FFT plans are destroyed.

    3.1. Return value cufftResult

    All cuFFT Library return values except for CUFFT_SUCCESS indicate that the current API call failed and the user should reconfigure to correct the problem. The possible return values are defined as follows:

    typedef enum cufftResult_t {
        CUFFT_SUCCESS        = 0,  //  The cuFFT operation was successful
        CUFFT_INVALID_PLAN   = 1,  //  cuFFT was passed an invalid plan handle
        CUFFT_ALLOC_FAILED   = 2,  //  cuFFT failed to allocate GPU or CPU memory
        CUFFT_INVALID_TYPE   = 3,  //  No longer used
        CUFFT_INVALID_VALUE  = 4,  //  User specified an invalid pointer or parameter
        CUFFT_INTERNAL_ERROR = 5,  //  Driver or internal cuFFT library error
        CUFFT_EXEC_FAILED    = 6,  //  Failed to execute an FFT on the GPU
        CUFFT_SETUP_FAILED   = 7,  //  The cuFFT library failed to initialize
        CUFFT_INVALID_SIZE   = 8,  //  User specified an invalid transform size
        CUFFT_UNALIGNED_DATA = 9,  //  No longer used
        CUFFT_INCOMPLETE_PARAMETER_LIST = 10, //  Missing parameters in call
        CUFFT_INVALID_DEVICE = 11, //  Execution of a plan was on different GPU than plan creation
        CUFFT_PARSE_ERROR    = 12, //  Internal plan database error 
        CUFFT_NO_WORKSPACE   = 13, //  No workspace has been provided prior to plan execution
        CUFFT_NOT_IMPLEMENTED = 14, // Function does not implement functionality for parameters given.
        CUFFT_LICENSE_ERROR  = 15, // Used in previous versions.
        CUFFT_NOT_SUPPORTED  = 16  // Operation is not supported for parameters given.
    } cufftResult;
    

    Users are encouraged to check return values from cuFFT functions for errors as shown in cuFFT Code Examples.
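A small helper for such checks might map result codes to their symbolic names for error reporting; in this sketch the enum is duplicated from the listing above so the snippet is self-contained (in real code, include cufft.h instead of redefining the type):

```c
typedef enum cufftResult_t {
    CUFFT_SUCCESS = 0,  CUFFT_INVALID_PLAN = 1,  CUFFT_ALLOC_FAILED = 2,
    CUFFT_INVALID_TYPE = 3,  CUFFT_INVALID_VALUE = 4,
    CUFFT_INTERNAL_ERROR = 5,  CUFFT_EXEC_FAILED = 6,
    CUFFT_SETUP_FAILED = 7,  CUFFT_INVALID_SIZE = 8,
    CUFFT_UNALIGNED_DATA = 9,  CUFFT_INCOMPLETE_PARAMETER_LIST = 10,
    CUFFT_INVALID_DEVICE = 11,  CUFFT_PARSE_ERROR = 12,
    CUFFT_NO_WORKSPACE = 13,  CUFFT_NOT_IMPLEMENTED = 14,
    CUFFT_LICENSE_ERROR = 15,  CUFFT_NOT_SUPPORTED = 16
} cufftResult;

/* Return the symbolic name of a cufftResult code. */
const char *cufftResultString(cufftResult r) {
    static const char *names[] = {
        "CUFFT_SUCCESS", "CUFFT_INVALID_PLAN", "CUFFT_ALLOC_FAILED",
        "CUFFT_INVALID_TYPE", "CUFFT_INVALID_VALUE", "CUFFT_INTERNAL_ERROR",
        "CUFFT_EXEC_FAILED", "CUFFT_SETUP_FAILED", "CUFFT_INVALID_SIZE",
        "CUFFT_UNALIGNED_DATA", "CUFFT_INCOMPLETE_PARAMETER_LIST",
        "CUFFT_INVALID_DEVICE", "CUFFT_PARSE_ERROR", "CUFFT_NO_WORKSPACE",
        "CUFFT_NOT_IMPLEMENTED", "CUFFT_LICENSE_ERROR", "CUFFT_NOT_SUPPORTED"
    };
    if ((int)r < 0 || (int)r > 16) return "unknown cufftResult";
    return names[r];
}
```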

    3.2. cuFFT Basic Plans

    3.2.1. Function cufftPlan1d()

    cufftResult 
        cufftPlan1d(cufftHandle *plan, int nx, cufftType type, int batch);
    

    Creates a 1D FFT plan configuration for a specified signal size and data type. The batch input parameter tells cuFFT how many 1D transforms to configure.

    Input
    plan: Pointer to a cufftHandle object
    nx: The transform size (e.g. 256 for a 256-point FFT)
    type: The transform data type (e.g., CUFFT_C2C for single precision complex to complex)
    batch: Number of transforms of size nx. Please consider using cufftPlanMany for multiple transforms.

    Output
    plan: Contains a cuFFT 1D plan handle value

    Return Values
    CUFFT_SUCCESS: cuFFT successfully created the FFT plan.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: The nx or batch parameter is not a supported size.

    3.2.2. Function cufftPlan2d()

    cufftResult 
        cufftPlan2d(cufftHandle *plan, int nx, int ny, cufftType type);
    

    Creates a 2D FFT plan configuration according to specified signal sizes and data type.

    Input
    plan: Pointer to a cufftHandle object
    nx: The transform size in the x dimension. This is the slowest changing dimension of a transform (strided in memory).
    ny: The transform size in the y dimension. This is the fastest changing dimension of a transform (contiguous in memory).
    type: The transform data type (e.g., CUFFT_C2R for single precision complex to real)

    Output
    plan: Contains a cuFFT 2D plan handle value

    Return Values
    CUFFT_SUCCESS: cuFFT successfully created the FFT plan.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: Either or both of the nx or ny parameters is not a supported size.

    3.2.3. Function cufftPlan3d()

    cufftResult 
        cufftPlan3d(cufftHandle *plan, int nx, int ny, int nz, cufftType type);
    

    Creates a 3D FFT plan configuration according to specified signal sizes and data type. This function is the same as cufftPlan2d() except that it takes a third size parameter nz.

    Input
    plan: Pointer to a cufftHandle object
    nx: The transform size in the x dimension. This is the slowest changing dimension of a transform (strided in memory).
    ny: The transform size in the y dimension
    nz: The transform size in the z dimension. This is the fastest changing dimension of a transform (contiguous in memory).
    type: The transform data type (e.g., CUFFT_R2C for single precision real to complex)

    Output
    plan: Contains a cuFFT 3D plan handle value

    Return Values
    CUFFT_SUCCESS: cuFFT successfully created the FFT plan.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: One or more of the nx, ny, or nz parameters is not a supported size.

    3.2.4. Function cufftPlanMany()

    cufftResult 
        cufftPlanMany(cufftHandle *plan, int rank, int *n, int *inembed,
            int istride, int idist, int *onembed, int ostride,
            int odist, cufftType type, int batch);
    

    Creates an FFT plan configuration of dimension rank, with sizes specified in the array n. The batch input parameter tells cuFFT how many transforms to configure. With this function, batched plans of 1, 2, or 3 dimensions may be created.

    The cufftPlanMany() API supports more complicated input and output data layouts via the advanced data layout parameters: inembed, istride, idist, onembed, ostride, and odist.

    If inembed and onembed are set to NULL, all other stride information is ignored, and default strides are used. The default assumes contiguous data arrays.

    All arrays are assumed to be in CPU memory.
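The layout these parameters describe can be summarized by the addressing formula for a 1D batch: element x of batch b is read from input[b * idist + x * istride] (the output side uses ostride and odist analogously). A plain C sketch of that index computation:

```c
#include <stddef.h>

/* Index of element x of batch b of a 1D input under the advanced
   data layout: b * idist + x * istride. */
size_t input_index_1d(size_t b, size_t x, size_t idist, size_t istride) {
    return b * idist + x * istride;
}
```

For contiguous batched data of length nx, the default strides correspond to istride = 1 and idist = nx.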

    Input
    plan: Pointer to a cufftHandle object
    rank: Dimensionality of the transform (1, 2, or 3).
    n: Array of size rank, describing the size of each dimension, n[0] being the size of the outermost and n[rank-1] the innermost (contiguous) dimension of a transform.
    inembed: Pointer of size rank that indicates the storage dimensions of the input data in memory. If set to NULL all other advanced data layout parameters are ignored.
    istride: Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension
    idist: Indicates the distance between the first element of two consecutive signals in a batch of the input data
    onembed: Pointer of size rank that indicates the storage dimensions of the output data in memory. If set to NULL all other advanced data layout parameters are ignored.
    ostride: Indicates the distance between two successive output elements in the output array in the least significant (i.e., innermost) dimension
    odist: Indicates the distance between the first element of two consecutive signals in a batch of the output data
    type: The transform data type (e.g., CUFFT_R2C for single precision real to complex)
    batch: Batch size for this transform

    Output
    plan: Contains a cuFFT plan handle

    Return Values
    CUFFT_SUCCESS: cuFFT successfully created the FFT plan.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: One or more of the parameters is not a supported size.

    3.3. cuFFT Extensible Plans

    This API separates handle creation from plan generation. This makes it possible to change plan settings, which may alter the outcome of the plan generation phase, before the plan is actually generated.

    3.3.1. Function cufftCreate()

    cufftResult 
        cufftCreate(cufftHandle *plan);
    

    Creates only an opaque handle, and allocates small data structures on the host. The cufftMakePlan*() calls actually do the plan generation.

    Input
    plan: Pointer to a cufftHandle object

    Output
    plan: Contains a cuFFT plan handle value

    Return Values
    CUFFT_SUCCESS: cuFFT successfully created the FFT plan.
    CUFFT_ALLOC_FAILED: The allocation of resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.

    3.3.2. Function cufftMakePlan1d()

    cufftResult 
        cufftMakePlan1d(cufftHandle plan, int nx, cufftType type, int batch, 
            size_t *workSize);
    

    Following a call to cufftCreate() makes a 1D FFT plan configuration for a specified signal size and data type. The batch input parameter tells cuFFT how many 1D transforms to configure.

    If cufftXtSetGPUs() was called prior to this call with multiple GPUs, then workSize will contain multiple sizes. See sections on multiple GPUs for more details.

    Input
    plan: cufftHandle returned by cufftCreate
    nx: The transform size (e.g. 256 for a 256-point FFT). For multiple GPUs, this must be a power of 2.
    type: The transform data type (e.g., CUFFT_C2C for single precision complex to complex). For multiple GPUs this must be a complex to complex transform.
    batch: Number of transforms of size nx. Please consider using cufftMakePlanMany for multiple transforms.
    *workSize: Pointer to the size(s), in bytes, of the work areas. For example for two GPUs worksize must be declared to have two elements.

    Output
    *workSize: Pointer to the size(s) of the work areas.

    Return Values
    CUFFT_SUCCESS: cuFFT successfully created the FFT plan.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle. Handle is not valid when multi-GPU restrictions are not met.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: The nx or batch parameter is not a supported size.

    3.3.3. Function cufftMakePlan2d()

    cufftResult 
        cufftMakePlan2d(cufftHandle plan, int nx, int ny, cufftType type, 
            size_t *workSize);
    

    Following a call to cufftCreate() makes a 2D FFT plan configuration according to specified signal sizes and data type.

    If cufftXtSetGPUs() was called prior to this call with multiple GPUs, then workSize will contain multiple sizes. See sections on multiple GPUs for more details.

    Input
    plan: cufftHandle returned by cufftCreate
    nx: The transform size in the x dimension. This is the slowest changing dimension of a transform (strided in memory). For multiple GPUs, this must be factorable into primes less than or equal to 127.
    ny: The transform size in the y dimension. This is the fastest changing dimension of a transform (contiguous in memory). For 2 GPUs, this must be factorable into primes less than or equal to 127.
    type: The transform data type (e.g., CUFFT_C2R for single precision complex to real). For multiple GPUs this must be a complex to complex transform.
    *workSize: Pointer to the size(s), in bytes, of the work areas. For example for two GPUs worksize must be declared to have two elements.

    Output
    *workSize: Pointer to the size(s) of the work areas.

    Return Values
    CUFFT_SUCCESS: cuFFT successfully created the FFT plan.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: Either or both of the nx or ny parameters is not a supported size.

    3.3.4. Function cufftMakePlan3d()

    cufftResult 
        cufftMakePlan3d(cufftHandle plan, int nx, int ny, int nz, cufftType type,
            size_t *workSize);
    

    Following a call to cufftCreate() makes a 3D FFT plan configuration according to specified signal sizes and data type. This function is the same as cufftMakePlan2d() except that it takes a third size parameter nz.

    If cufftXtSetGPUs() was called prior to this call with multiple GPUs, then workSize will contain multiple sizes. See sections on multiple GPUs for more details.

    Input
    plan: cufftHandle returned by cufftCreate
    nx: The transform size in the x dimension. This is the slowest changing dimension of a transform (strided in memory). For multiple GPUs, this must be factorable into primes less than or equal to 127.
    ny: The transform size in the y dimension. For multiple GPUs, this must be factorable into primes less than or equal to 127.
    nz: The transform size in the z dimension. This is the fastest changing dimension of a transform (contiguous in memory). For multiple GPUs, this must be factorable into primes less than or equal to 127.
    type: The transform data type (e.g., CUFFT_R2C for single precision real to complex). For multiple GPUs this must be a complex to complex transform.
    *workSize: Pointer to the size(s), in bytes, of the work areas. For example for two GPUs worksize must be declared to have two elements.

    Output
    *workSize: Pointer to the size(s) of the work area(s).

    Return Values
    CUFFT_SUCCESS: cuFFT successfully created the FFT plan.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: One or more of the nx, ny, or nz parameters is not a supported size.

    3.3.5. Function cufftMakePlanMany()

    cufftResult 
        cufftMakePlanMany(cufftHandle plan, int rank, int *n, int *inembed,
            int istride, int idist, int *onembed, int ostride,
            int odist, cufftType type, int batch, size_t *workSize);
    

    Following a call to cufftCreate() makes an FFT plan configuration of dimension rank, with sizes specified in the array n. The batch input parameter tells cuFFT how many transforms to configure. With this function, batched plans of 1, 2, or 3 dimensions may be created.

    The cufftPlanMany() API supports more complicated input and output data layouts via the advanced data layout parameters: inembedistrideidistonembedostride, and odist.

    If inembed and onembed are set to NULL, all other stride information is ignored, and default strides are used. The default assumes contiguous data arrays.

    If cufftXtSetGPUs() was called prior to this call with multiple GPUs, then workSize will contain multiple sizes. See sections on multiple GPUs for more details.

    All arrays are assumed to be in CPU memory.

    Input
    plan: cufftHandle returned by cufftCreate
    rank: Dimensionality of the transform (1, 2, or 3)
    n: Array of size rank, describing the size of each dimension, n[0] being the size of the outermost and n[rank-1] the innermost (contiguous) dimension of a transform. For multiple GPUs and rank equal to 1, the sizes must be a power of 2. For multiple GPUs and rank equal to 2 or 3, the sizes must be factorable into primes less than or equal to 127.
    inembed: Pointer of size rank that indicates the storage dimensions of the input data in memory, inembed[0] being the storage dimension of the innermost dimension. If set to NULL all other advanced data layout parameters are ignored.
    istride: Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension
    idist: Indicates the distance between the first element of two consecutive signals in a batch of the input data
    onembed: Pointer of size rank that indicates the storage dimensions of the output data in memory, onembed[0] being the storage dimension of the innermost dimension. If set to NULL all other advanced data layout parameters are ignored.
    ostride: Indicates the distance between two successive output elements in the output array in the least significant (i.e., innermost) dimension
    odist: Indicates the distance between the first element of two consecutive signals in a batch of the output data
    type: The transform data type (e.g., CUFFT_R2C for single precision real to complex). For 2 GPUs this must be a complex to complex transform.
    batch: Batch size for this transform
    *workSize: Pointer to the size(s), in bytes, of the work areas. For example, for two GPUs workSize must be declared to have two elements.
    Output
    *workSize: Pointer to the size(s) of the work areas.
    Return Values
    CUFFT_SUCCESS: cuFFT successfully created the FFT plan.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle. The handle is not valid when the multi-GPU restrictions are not met.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: One or more of the parameters is not a supported size.
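As a sketch of the advanced data layout parameters, the following configures a batch of 1D C2C transforms over rows stored with padding. All sizes and the helper name are illustrative, not values required by cuFFT:

```c
#include <cufft.h>

/* Hypothetical helper: 1000 batched 1D C2C transforms of length 256,
 * where each signal is stored in a padded row of 320 complex elements.
 * inembed/onembed describe the padded storage dimension, istride/ostride
 * the element spacing, and idist/odist the start-to-start distance
 * between consecutive signals in the batch. */
cufftResult make_batched_plan(cufftHandle plan, size_t *workSize) {
    int n[1]       = { 256 };   /* logical transform size            */
    int inembed[1] = { 320 };   /* padded input storage dimension    */
    int onembed[1] = { 320 };   /* padded output storage dimension   */
    int istride = 1, ostride = 1;   /* elements are contiguous       */
    int idist = 320, odist = 320;   /* distance between signals      */
    int batch = 1000;

    return cufftMakePlanMany(plan, 1, n, inembed, istride, idist,
                             onembed, ostride, odist,
                             CUFFT_C2C, batch, workSize);
}
```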

    3.3.6. Function cufftMakePlanMany64()

    cufftResult 
        cufftMakePlanMany64(cufftHandle plan, int rank, 
            long long int *n, 
            long long int *inembed, long long int istride, long long int idist, 
            long long int *onembed, long long int ostride, long long int odist, 
            cufftType type, 
            long long int batch, size_t *workSize);
    

    Following a call to cufftCreate(), this function makes an FFT plan configuration of dimension rank, with sizes specified in the array n. The batch input parameter tells cuFFT how many transforms to configure. With this function, batched plans of 1, 2, or 3 dimensions may be created.

    This API is identical to cufftMakePlanMany except that the arguments specifying sizes and strides are 64 bit integers. This API makes very large transforms possible. cuFFT includes kernels that use 32 bit indexes, and kernels that use 64 bit indexes. cuFFT planning selects 32 bit kernels whenever possible to avoid any overhead due to 64 bit arithmetic.

    All sizes and types of transform are supported by this interface, with two exceptions. For transforms whose total size exceeds 4G elements, the dimensions specified in the array n must be factorable into primes that are less than or equal to 127. For real to complex and complex to real transforms whose total size exceeds 2G elements, the fastest changing dimension must be even.

    The cufftMakePlanMany64() API supports more complicated input and output data layouts via the advanced data layout parameters: inembed, istride, idist, onembed, ostride, and odist.

    If inembed and onembed are set to NULL, all other stride information is ignored, and default strides are used. The default assumes contiguous data arrays.

    If cufftXtSetGPUs() was called prior to this call with multiple GPUs, then workSize will contain multiple sizes. See sections on multiple GPUs for more details.

    All arrays are assumed to be in CPU memory.

    Input
    plan: cufftHandle returned by cufftCreate
    rank: Dimensionality of the transform (1, 2, or 3)
    n: Array of size rank, describing the size of each dimension. For multiple GPUs and rank equal to 1, the sizes must be a power of 2. For multiple GPUs and rank equal to 2 or 3, the sizes must be factorable into primes less than or equal to 127.
    inembed: Pointer of size rank that indicates the storage dimensions of the input data in memory. If set to NULL all other advanced data layout parameters are ignored.
    istride: Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension
    idist: Indicates the distance between the first element of two consecutive signals in a batch of the input data
    onembed: Pointer of size rank that indicates the storage dimensions of the output data in memory. If set to NULL all other advanced data layout parameters are ignored.
    ostride: Indicates the distance between two successive output elements in the output array in the least significant (i.e., innermost) dimension
    odist: Indicates the distance between the first element of two consecutive signals in a batch of the output data
    type: The transform data type (e.g., CUFFT_R2C for single precision real to complex). For 2 GPUs this must be a complex to complex transform.
    batch: Batch size for this transform
    *workSize: Pointer to the size(s), in bytes, of the work areas. For example, for two GPUs workSize must be declared to have two elements.
    Output
    *workSize: Pointer to the size(s) of the work areas.
    Return Values
    CUFFT_SUCCESS: cuFFT successfully created the FFT plan.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle. The handle is not valid when the multi-GPU restrictions are not met.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: One or more of the parameters is not a supported size.
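A sketch of when the 64-bit variant is needed: a 1D transform whose length does not fit in a 32-bit int. The length and helper name are illustrative only, and a real call would need a GPU with enough memory for the data:

```c
#include <cufft.h>

/* Hypothetical helper: plan a single very large 1D C2C transform using
 * the 64-bit size/stride interface. Passing NULL for inembed/onembed
 * selects the default contiguous layout, as described above. */
cufftResult make_large_plan(cufftHandle plan, size_t *workSize) {
    long long int n[1] = { 1LL << 31 };  /* larger than INT_MAX */
    return cufftMakePlanMany64(plan, 1, n,
                               NULL, 1LL, n[0],   /* input layout: default  */
                               NULL, 1LL, n[0],   /* output layout: default */
                               CUFFT_C2C, 1LL, workSize);
}
```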

    3.3.7. Function cufftXtMakePlanMany()

    cufftResult
        cufftXtMakePlanMany(cufftHandle plan, int rank, long long int *n, long long int *inembed,
            long long int istride, long long int idist, cudaDataType inputtype,
            long long int *onembed, long long int ostride, long long int odist,
            cudaDataType outputtype, long long int batch, size_t *workSize,
            cudaDataType executiontype);
    

    Following a call to cufftCreate(), this function makes an FFT plan configuration of dimension rank, with sizes specified in the array n. The batch input parameter tells cuFFT how many transforms to configure. With this function, batched plans of 1, 2, or 3 dimensions may be created.

    Type specifiers inputtype, outputtype and executiontype dictate the type and precision of the transform to be performed. Not all combinations of parameters are supported. Currently all three parameters need to match in precision. Parameters inputtype and outputtype need to match the transform type: complex-to-complex, real-to-complex or complex-to-real. Parameter executiontype needs to match the precision and be of a complex type. Example: for a 16-bit real-to-complex transform, the parameters inputtype, outputtype and executiontype would have values of CUDA_R_16F, CUDA_C_16F and CUDA_C_16F respectively.

    The cufftXtMakePlanMany() API supports more complicated input and output data layouts via the advanced data layout parameters: inembed, istride, idist, onembed, ostride, and odist.

    If inembed and onembed are set to NULL, all other stride information is ignored, and default strides are used. The default assumes contiguous data arrays.

    If cufftXtSetGPUs() was called prior to this call with multiple GPUs, then workSize will contain multiple sizes. See sections on multiple GPUs for more details.

    All arrays are assumed to be in CPU memory.

    Input
    plan: cufftHandle returned by cufftCreate
    rank: Dimensionality of the transform (1, 2, or 3)
    n: Array of size rank, describing the size of each dimension, n[0] being the size of the outermost dimension. For multiple GPUs and rank equal to 1, the sizes must be a power of 2. For multiple GPUs and rank equal to 2 or 3, the sizes must be factorable into primes less than or equal to 127.
    inembed: Pointer of size rank that indicates the storage dimensions of the input data in memory, inembed[0] being the storage dimension of the innermost dimension. If set to NULL all other advanced data layout parameters are ignored.
    istride: Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension
    idist: Indicates the distance between the first element of two consecutive signals in a batch of the input data
    inputtype: Type of input data.
    onembed: Pointer of size rank that indicates the storage dimensions of the output data in memory, onembed[0] being the storage dimension of the innermost dimension. If set to NULL all other advanced data layout parameters are ignored.
    ostride: Indicates the distance between two successive output elements in the output array in the least significant (i.e., innermost) dimension
    odist: Indicates the distance between the first element of two consecutive signals in a batch of the output data
    outputtype: Type of output data.
    batch: Batch size for this transform
    *workSize: Pointer to the size(s), in bytes, of the work areas. For example, for two GPUs workSize must be declared to have two elements.
    executiontype: Type of data to be used for computations.
    Output
    *workSize: Pointer to the size(s) of the work areas.
    Return Values
    CUFFT_SUCCESS: cuFFT successfully created the FFT plan.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle. The handle is not valid when the multi-GPU restrictions are not met.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: One or more of the parameters is not a supported size.
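The 16-bit real-to-complex example in the text can be sketched as follows (the helper name and the size 4096 are illustrative assumptions):

```c
#include <cufftXt.h>

/* Hypothetical helper: configure a half-precision R2C 1D plan with a
 * contiguous default layout. The inputtype/outputtype/executiontype
 * arguments follow the CUDA_R_16F / CUDA_C_16F / CUDA_C_16F
 * combination described above. */
cufftResult make_half_r2c_plan(cufftHandle plan, size_t *workSize) {
    long long int n[1] = { 4096 };
    return cufftXtMakePlanMany(plan, 1, n,
                               NULL, 1LL, n[0], CUDA_R_16F,  /* input layout + type  */
                               NULL, 1LL, n[0], CUDA_C_16F,  /* output layout + type */
                               1LL, workSize, CUDA_C_16F);   /* batch, size, exec    */
}
```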

    3.4. cuFFT Estimated Size of Work Area

    During plan execution, cuFFT requires a work area for temporary storage of intermediate results. The cufftEstimate*() calls return an estimate for the size of the work area required, given the specified parameters, and assuming default plan settings. Some problem sizes require much more storage than others. In particular, powers of 2 are very efficient in terms of temporary storage. Large prime numbers, however, use different algorithms and may need up to eight times the storage of a similarly sized power of 2. These routines return estimated workSize values which may still be smaller than the actual values needed, especially for values of n that are not multiples of powers of 2, 3, 5 and 7. More refined values are given by the cufftGetSize*() routines, but these values may still be conservative.

    3.4.1. Function cufftEstimate1d()

    cufftResult 
        cufftEstimate1d(int nx, cufftType type, int batch, size_t *workSize);
    

    During plan execution, cuFFT requires a work area for temporary storage of intermediate results. This call returns an estimate for the size of the work area required, given the specified parameters, and assuming default plan settings.

    Input
    nx: The transform size (e.g. 256 for a 256-point FFT)
    type: The transform data type (e.g., CUFFT_C2C for single precision complex to complex)
    batch: Number of transforms of size nx. Please consider using cufftEstimateMany for multiple transforms.
    *workSize: Pointer to the size, in bytes, of the work space.
    Output
    *workSize: Pointer to the size of the work space
    Return Values
    CUFFT_SUCCESS: cuFFT successfully returned the size of the work space.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: The nx parameter is not a supported size.
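To illustrate the storage differences noted above, one might compare the estimate for a power-of-2 size against a nearby size that is not a multiple of small primes. The sizes are chosen only for illustration, and the program requires cuFFT to run:

```c
#include <cufft.h>
#include <stdio.h>

int main(void) {
    size_t pow2Size = 0, otherSize = 0;

    /* Estimate the work area for a single C2C transform of each size. */
    cufftEstimate1d(1 << 20, CUFFT_C2C, 1, &pow2Size);        /* 2^20     */
    cufftEstimate1d((1 << 20) - 3, CUFFT_C2C, 1, &otherSize); /* 2^20 - 3 */

    printf("2^20: %zu bytes, 2^20 - 3: %zu bytes\n", pow2Size, otherSize);
    return 0;
}
```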

    3.4.2. Function cufftEstimate2d()

    cufftResult 
        cufftEstimate2d(int nx, int ny, cufftType type, size_t *workSize);
    

    During plan execution, cuFFT requires a work area for temporary storage of intermediate results. This call returns an estimate for the size of the work area required, given the specified parameters, and assuming default plan settings.

    Input
    nx: The transform size in the x dimension (number of rows)
    ny: The transform size in the y dimension (number of columns)
    type: The transform data type (e.g., CUFFT_C2R for single precision complex to real)
    *workSize: Pointer to the size, in bytes, of the work space.
    Output
    *workSize: Pointer to the size of the work space
    Return Values
    CUFFT_SUCCESS: cuFFT successfully returned the size of the work space.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: Either or both of the nx or ny parameters is not a supported size.

    3.4.3. Function cufftEstimate3d()

    cufftResult 
        cufftEstimate3d(int nx, int ny, int nz, cufftType type, size_t *workSize);
    

    During plan execution, cuFFT requires a work area for temporary storage of intermediate results. This call returns an estimate for the size of the work area required, given the specified parameters, and assuming default plan settings.

    Input
    nx: The transform size in the x dimension
    ny: The transform size in the y dimension
    nz: The transform size in the z dimension
    type: The transform data type (e.g., CUFFT_R2C for single precision real to complex)
    *workSize: Pointer to the size, in bytes, of the work space.
    Output
    *workSize: Pointer to the size of the work space
    Return Values
    CUFFT_SUCCESS: cuFFT successfully returned the size of the work space.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: One or more of the nx, ny, or nz parameters is not a supported size.

    3.4.4. Function cufftEstimateMany()

    cufftResult 
        cufftEstimateMany(int rank, int *n, int *inembed,
            int istride, int idist, int *onembed, int ostride,
            int odist, cufftType type, int batch, size_t *workSize);
    

    During plan execution, cuFFT requires a work area for temporary storage of intermediate results. This call returns an estimate for the size of the work area required, given the specified parameters, and assuming default plan settings.

    The cufftEstimateMany() API supports more complicated input and output data layouts via the advanced data layout parameters: inembed, istride, idist, onembed, ostride, and odist.

    All arrays are assumed to be in CPU memory.

    Input
    rank: Dimensionality of the transform (1, 2, or 3)
    n: Array of size rank, describing the size of each dimension
    inembed: Pointer of size rank that indicates the storage dimensions of the input data in memory. If set to NULL all other advanced data layout parameters are ignored.
    istride: Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension
    idist: Indicates the distance between the first element of two consecutive signals in a batch of the input data
    onembed: Pointer of size rank that indicates the storage dimensions of the output data in memory. If set to NULL all other advanced data layout parameters are ignored.
    ostride: Indicates the distance between two successive output elements in the output array in the least significant (i.e., innermost) dimension
    odist: Indicates the distance between the first element of two consecutive signals in a batch of the output data
    type: The transform data type (e.g., CUFFT_R2C for single precision real to complex)
    batch: Batch size for this transform
    *workSize: Pointer to the size, in bytes, of the work space.
    Output
    *workSize: Pointer to the size of the work space
    Return Values
    CUFFT_SUCCESS: cuFFT successfully returned the size of the work space.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: One or more of the parameters is not a supported size.

    3.5. cuFFT Refined Estimated Size of Work Area

    The cufftGetSize*() routines give a more accurate estimate of the work area size required for a plan than the cufftEstimate*() routines as they take into account any plan settings that may have been made. As discussed in the section cuFFT Estimated Size of Work Area, the workSize value(s) returned may be conservative especially for values of n that are not multiples of powers of 2, 3, 5 and 7.

    3.5.1. Function cufftGetSize1d()

    cufftResult 
        cufftGetSize1d(cufftHandle plan, int nx, cufftType type, int batch, 
            size_t *workSize);
    

    This call gives a more accurate estimate of the work area size required for a plan than cufftEstimate1d(), given the specified parameters, and taking into account any plan settings that may have been made.

    Input
    plan: cufftHandle returned by cufftCreate
    nx: The transform size (e.g. 256 for a 256-point FFT)
    type: The transform data type (e.g., CUFFT_C2C for single precision complex to complex)
    batch: Number of transforms of size nx. Please consider using cufftGetSizeMany for multiple transforms.
    *workSize: Pointer to the size(s), in bytes, of the work areas. For example, for two GPUs workSize must be declared to have two elements.
    Output
    *workSize: Pointer to the size of the work space
    Return Values
    CUFFT_SUCCESS: cuFFT successfully returned the size of the work space.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: The nx parameter is not a supported size.

    3.5.2. Function cufftGetSize2d()

    cufftResult 
        cufftGetSize2d(cufftHandle plan, int nx, int ny, cufftType type, 
            size_t *workSize);
    

    This call gives a more accurate estimate of the work area size required for a plan than cufftEstimate2d(), given the specified parameters, and taking into account any plan settings that may have been made.

    Input
    plan: cufftHandle returned by cufftCreate
    nx: The transform size in the x dimension (number of rows)
    ny: The transform size in the y dimension (number of columns)
    type: The transform data type (e.g., CUFFT_C2R for single precision complex to real)
    *workSize: Pointer to the size(s), in bytes, of the work areas. For example, for two GPUs workSize must be declared to have two elements.
    Output
    *workSize: Pointer to the size of the work space
    Return Values
    CUFFT_SUCCESS: cuFFT successfully returned the size of the work space.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: Either or both of the nx or ny parameters is not a supported size.

    3.5.3. Function cufftGetSize3d()

    cufftResult 
        cufftGetSize3d(cufftHandle plan, int nx, int ny, int nz, cufftType type,
            size_t *workSize);
    

    This call gives a more accurate estimate of the work area size required for a plan than cufftEstimate3d(), given the specified parameters, and taking into account any plan settings that may have been made.

    Input
    plan: cufftHandle returned by cufftCreate
    nx: The transform size in the x dimension
    ny: The transform size in the y dimension
    nz: The transform size in the z dimension
    type: The transform data type (e.g., CUFFT_R2C for single precision real to complex)
    *workSize: Pointer to the size(s), in bytes, of the work areas. For example, for two GPUs workSize must be declared to have two elements.
    Output
    *workSize: Pointer to the size of the work space.
    Return Values
    CUFFT_SUCCESS: cuFFT successfully returned the size of the work space.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: One or more of the nx, ny, or nz parameters is not a supported size.

    3.5.4. Function cufftGetSizeMany()

    cufftResult 
        cufftGetSizeMany(cufftHandle plan, int rank, int *n, int *inembed,
            int istride, int idist, int *onembed, int ostride,
            int odist, cufftType type, int batch, size_t *workSize);
    

    This call gives a more accurate estimate of the work area size required for a plan than cufftEstimateMany(), given the specified parameters, and taking into account any plan settings that may have been made.

    Input
    plan: cufftHandle returned by cufftCreate
    rank: Dimensionality of the transform (1, 2, or 3)
    n: Array of size rank, describing the size of each dimension
    inembed: Pointer of size rank that indicates the storage dimensions of the input data in memory. If set to NULL all other advanced data layout parameters are ignored.
    istride: Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension
    idist: Indicates the distance between the first element of two consecutive signals in a batch of the input data
    onembed: Pointer of size rank that indicates the storage dimensions of the output data in memory. If set to NULL all other advanced data layout parameters are ignored.
    ostride: Indicates the distance between two successive output elements in the output array in the least significant (i.e., innermost) dimension
    odist: Indicates the distance between the first element of two consecutive signals in a batch of the output data
    type: The transform data type (e.g., CUFFT_R2C for single precision real to complex)
    batch: Batch size for this transform
    *workSize: Pointer to the size(s), in bytes, of the work areas. For example, for two GPUs workSize must be declared to have two elements.
    Output
    *workSize: Pointer to the size of the work area
    Return Values
    CUFFT_SUCCESS: cuFFT successfully returned the size of the work space.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: One or more of the parameters is not a supported size.

    3.5.5. Function cufftGetSizeMany64()

    cufftResult 
        cufftGetSizeMany64(cufftHandle plan, int rank, 
            long long int *n, 
            long long int *inembed, long long int istride, long long int idist, 
            long long int *onembed, long long int ostride, long long int odist, 
            cufftType type, 
            long long int batch, size_t *workSize);
    

    This call gives a more accurate estimate of the work area size required for a plan than cufftEstimateMany(), given the specified parameters, and taking into account any plan settings that may have been made.

    This API is identical to cufftGetSizeMany() except that the arguments specifying sizes and strides are 64 bit integers. This API makes very large transforms possible. cuFFT includes kernels that use 32 bit indexes, and kernels that use 64 bit indexes. cuFFT planning selects 32 bit kernels whenever possible to avoid any overhead due to 64 bit arithmetic.

    All sizes and types of transform are supported by this interface, with two exceptions. For transforms whose total size exceeds 4G elements, the dimensions specified in the array n must be factorable into primes that are less than or equal to 127. For real to complex and complex to real transforms whose total size exceeds 2G elements, the fastest changing dimension must be even.

    Input
    plan: cufftHandle returned by cufftCreate
    rank: Dimensionality of the transform (1, 2, or 3)
    n: Array of size rank, describing the size of each dimension
    inembed: Pointer of size rank that indicates the storage dimensions of the input data in memory. If set to NULL all other advanced data layout parameters are ignored.
    istride: Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension
    idist: Indicates the distance between the first element of two consecutive signals in a batch of the input data
    onembed: Pointer of size rank that indicates the storage dimensions of the output data in memory. If set to NULL all other advanced data layout parameters are ignored.
    ostride: Indicates the distance between two successive output elements in the output array in the least significant (i.e., innermost) dimension
    odist: Indicates the distance between the first element of two consecutive signals in a batch of the output data
    type: The transform data type (e.g., CUFFT_R2C for single precision real to complex)
    batch: Batch size for this transform
    *workSize: Pointer to the size(s), in bytes, of the work areas. For example, for two GPUs workSize must be declared to have two elements.
    Output
    *workSize: Pointer to the size of the work area
    Return Values
    CUFFT_SUCCESS: cuFFT successfully returned the size of the work space.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: One or more of the parameters is not a supported size.

    3.5.6. Function cufftXtGetSizeMany()

    cufftResult
        cufftXtGetSizeMany(cufftHandle plan, int rank, long long int *n, long long int *inembed,
            long long int istride, long long int idist, cudaDataType inputtype,
            long long int *onembed, long long int ostride, long long int odist,
            cudaDataType outputtype, long long int batch, size_t *workSize,
            cudaDataType executiontype);
    

    This call gives a more accurate estimate of the work area size required for a plan than cufftEstimateMany(), given the specified parameters (which match the signature of the cufftXtMakePlanMany() function), and taking into account any plan settings that may have been made.

    For more information about valid combinations of the inputtype, outputtype and executiontype parameters, please refer to the documentation of the cufftXtMakePlanMany() function.

    Input
    plan: cufftHandle returned by cufftCreate
    rank: Dimensionality of the transform (1, 2, or 3)
    n: Array of size rank, describing the size of each dimension
    inembed: Pointer of size rank that indicates the storage dimensions of the input data in memory. If set to NULL all other advanced data layout parameters are ignored.
    istride: Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension
    idist: Indicates the distance between the first element of two consecutive signals in a batch of the input data
    inputtype: Type of input data.
    onembed: Pointer of size rank that indicates the storage dimensions of the output data in memory. If set to NULL all other advanced data layout parameters are ignored.
    ostride: Indicates the distance between two successive output elements in the output array in the least significant (i.e., innermost) dimension
    odist: Indicates the distance between the first element of two consecutive signals in a batch of the output data
    outputtype: Type of output data.
    batch: Batch size for this transform
    *workSize: Pointer to the size(s), in bytes, of the work areas. For example, for two GPUs workSize must be declared to have two elements.
    executiontype: Type of data to be used for computations.
    Output
    *workSize: Pointer to the size of the work area
    Return Values
    CUFFT_SUCCESS: cuFFT successfully returned the size of the work space.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_SIZE: One or more of the parameters is not a supported size.

    3.6. Function cufftGetSize()

    cufftResult 
        cufftGetSize(cufftHandle plan, size_t *workSize);
    

    Once plan generation has been done, either with the original API or the extensible API, this call returns the actual size of the work area required to support the plan. Callers who choose to manage work area allocation within their application must use this call after plan generation, and after any cufftSet*() calls subsequent to plan generation, if those calls might alter the required work space size.

    Input
    plan: cufftHandle returned by cufftCreate
    *workSize: Pointer to the size(s), in bytes, of the work areas. For example, for two GPUs workSize must be declared to have two elements.
    Output
    *workSize: Pointer to the size of the work space
    Return Values
    CUFFT_SUCCESS: cuFFT successfully returned the size of the work space.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.

    3.7. cuFFT Caller Allocated Work Area Support

    3.7.1. Function cufftSetAutoAllocation()

    cufftResult 
        cufftSetAutoAllocation(cufftHandle plan, int autoAllocate);
    

    cufftSetAutoAllocation() indicates that the caller intends to allocate and manage work areas for plans that have been generated. cuFFT default behavior is to allocate the work area at plan generation time. If cufftSetAutoAllocation() has been called with autoAllocate set to 0 ("false") prior to one of the cufftMakePlan*() calls, cuFFT does not allocate the work area. This is the preferred sequence for callers wishing to manage work area allocation.

    Input
    plan: cufftHandle returned by cufftCreate
    autoAllocate: Indicates whether to allocate work area.
    Return Values
    CUFFT_SUCCESS: cuFFT successfully allows user to manage work area.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.

    3.7.2. Function cufftSetWorkArea()

    cufftResult 
        cufftSetWorkArea(cufftHandle plan, void *workArea);
    

    cufftSetWorkArea() overrides the work area pointer associated with a plan. If the work area was auto-allocated, cuFFT frees the auto-allocated space. The cufftExecute*() calls assume that the work area pointer is valid and that it points to a contiguous region in device memory that does not overlap with any other work area. If this is not the case, results are indeterminate.

    Input
    plan: cufftHandle returned by cufftCreate
    workArea: Pointer to workArea. For multiple GPUs, multiple work area pointers must be given.
    Return Values
    CUFFT_SUCCESS: cuFFT successfully allows user to override workArea pointer.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
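    Putting cufftSetAutoAllocation() and cufftSetWorkArea() together, a caller-managed work area might be set up as sketched below. This is a minimal illustration, not a complete program: the transform size (1024) is arbitrary and error handling is abbreviated.

    ```cuda
    #include <cufft.h>
    #include <cuda_runtime.h>

    int setup_managed_workarea(void)
    {
        cufftHandle plan;
        size_t workSize = 0;
        void *workArea = NULL;

        cufftCreate(&plan);
        /* Disable automatic work area allocation BEFORE plan generation. */
        cufftSetAutoAllocation(plan, 0);

        /* Plan generation reports the required work area size. */
        if (cufftMakePlan1d(plan, 1024, CUFFT_C2C, 1, &workSize) != CUFFT_SUCCESS)
            return -1;

        /* Allocate and attach a caller-owned work area. */
        if (cudaMalloc(&workArea, workSize) != cudaSuccess)
            return -1;
        if (cufftSetWorkArea(plan, workArea) != CUFFT_SUCCESS)
            return -1;

        /* ... cufftExec*() calls using this plan ... */

        cufftDestroy(plan);
        cudaFree(workArea);
        return 0;
    }
    ```

    Because the work area is caller-owned, it can be shared between plans that are never executed concurrently, which is the usual motivation for this sequence.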

    3.8. Function cufftDestroy()

    cufftResult 
        cufftDestroy(cufftHandle plan);
    

    Frees all GPU resources associated with a cuFFT plan and destroys the internal plan data structure. This function should be called once a plan is no longer needed, to avoid wasting GPU memory.

    Input
    plan: The cufftHandle object of the plan to be destroyed.
    Return Values
    CUFFT_SUCCESS: cuFFT successfully destroyed the FFT plan.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.

    3.9. cuFFT Execution

    3.9.1. Functions cufftExecC2C() and cufftExecZ2Z()

    cufftResult 
        cufftExecC2C(cufftHandle plan, cufftComplex *idata, 
            cufftComplex *odata, int direction);
    cufftResult 
        cufftExecZ2Z(cufftHandle plan, cufftDoubleComplex *idata, 
            cufftDoubleComplex *odata, int direction);
    

    cufftExecC2C() (cufftExecZ2Z()) executes a single-precision (double-precision) complex-to-complex transform plan in the transform direction as specified by the direction parameter. cuFFT uses the GPU memory pointed to by the idata parameter as input data. This function stores the Fourier coefficients in the odata array. If idata and odata are the same, this method does an in-place transform.

    Input
    plan: cufftHandle returned by cufftCreate
    idata: Pointer to the complex input data (in GPU memory) to transform
    odata: Pointer to the complex output data (in GPU memory)
    direction: The transform direction: CUFFT_FORWARD or CUFFT_INVERSE
    Output
    odata: Contains the complex Fourier coefficients
    Return Values
    CUFFT_SUCCESS: cuFFT successfully executed the FFT plan.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_INVALID_VALUE: At least one of the parameters idata, odata, and direction is not valid.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_EXEC_FAILED: cuFFT failed to execute the transform on the GPU.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.

    3.9.2. Functions cufftExecR2C() and cufftExecD2Z()

    cufftResult 
        cufftExecR2C(cufftHandle plan, cufftReal *idata, cufftComplex *odata);
    cufftResult 
        cufftExecD2Z(cufftHandle plan, cufftDoubleReal *idata, cufftDoubleComplex *odata);
    

    cufftExecR2C() (cufftExecD2Z()) executes a single-precision (double-precision) real-to-complex, implicitly forward, cuFFT transform plan. cuFFT uses as input data the GPU memory pointed to by the idata parameter. This function stores the nonredundant Fourier coefficients in the odata array. Pointers to idata and odata are both required to be aligned to the cufftComplex data type in single-precision transforms and the cufftDoubleComplex data type in double-precision transforms. If idata and odata are the same, this method does an in-place transform. Note the data layout differences between in-place and out-of-place transforms as described in Parameter cufftType.

    Input
    plan: cufftHandle returned by cufftCreate
    idata: Pointer to the real input data (in GPU memory) to transform
    odata: Pointer to the complex output data (in GPU memory)
    Output
    odata: Contains the complex Fourier coefficients
    Return Values
    CUFFT_SUCCESS: cuFFT successfully executed the FFT plan.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_INVALID_VALUE: At least one of the parameters idata and odata is not valid.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_EXEC_FAILED: cuFFT failed to execute the transform on the GPU.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.

    3.9.3. Functions cufftExecC2R() and cufftExecZ2D()

    cufftResult 
        cufftExecC2R(cufftHandle plan, cufftComplex *idata, cufftReal *odata);
    cufftResult 
        cufftExecZ2D(cufftHandle plan, cufftDoubleComplex *idata, cufftDoubleReal *odata);
    

    cufftExecC2R() (cufftExecZ2D()) executes a single-precision (double-precision) complex-to-real, implicitly inverse, cuFFT transform plan. cuFFT uses as input data the GPU memory pointed to by the idata parameter. The input array holds only the nonredundant complex Fourier coefficients. This function stores the real output values in the odata array. Pointers to idata and odata are both required to be aligned to the cufftComplex data type in single-precision transforms and the cufftDoubleComplex type in double-precision transforms. If idata and odata are the same, this method does an in-place transform.

    Input
    plan: cufftHandle returned by cufftCreate
    idata: Pointer to the complex input data (in GPU memory) to transform
    odata: Pointer to the real output data (in GPU memory)
    Output
    odata: Contains the real output data
    Return Values
    CUFFT_SUCCESS: cuFFT successfully executed the FFT plan.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_INVALID_VALUE: At least one of the parameters idata and odata is not valid.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_EXEC_FAILED: cuFFT failed to execute the transform on the GPU.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.

    3.9.4. Function cufftXtExec()

    cufftResult 
        cufftXtExec(cufftHandle plan, void *input, 
            void *output, int direction);
    

    Function cufftXtExec() executes any cuFFT transform regardless of precision and type. In the case of complex-to-real and real-to-complex transforms, the direction parameter is ignored. cuFFT uses the GPU memory pointed to by the input parameter as input data. This function stores the Fourier coefficients in the output array. If input and output are the same, this method does an in-place transform.

    Input
    plan: cufftHandle returned by cufftCreate
    input: Pointer to the input data (in GPU memory) to transform
    output: Pointer to the output data (in GPU memory)
    direction: The transform direction: CUFFT_FORWARD or CUFFT_INVERSE. Ignored for complex-to-real and real-to-complex transforms.
    Output
    output: Contains the complex Fourier coefficients
    Return Values
    CUFFT_SUCCESS: cuFFT successfully executed the FFT plan.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_INVALID_VALUE: At least one of the parameters input, output, and direction is not valid.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_EXEC_FAILED: cuFFT failed to execute the transform on the GPU.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.

    3.9.5. Function cufftXtExecDescriptor()

    cufftResult 
        cufftXtExecDescriptor(cufftHandle plan, cudaLibXtDesc *input, 
            cudaLibXtDesc *output, int direction);
    

    Function cufftXtExecDescriptor() executes any cuFFT transform regardless of precision and type. In case of complex-to-real and real-to-complex transforms direction parameter is ignored. cuFFT uses the GPU memory pointed to by cudaLibXtDesc *input descriptor as input data and cudaLibXtDesc *output as output data.

    Input
    plan: cufftHandle returned by cufftCreate
    input: Pointer to the complex input data (in GPU memory) to transform
    output: Pointer to the complex output data (in GPU memory)
    direction: The transform direction: CUFFT_FORWARD or CUFFT_INVERSE. Ignored for complex-to-real and real-to-complex transforms.
    Output
    output: Contains the complex Fourier coefficients
    Return Values
    CUFFT_SUCCESS: cuFFT successfully executed the FFT plan.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_INVALID_VALUE: At least one of the parameters input and direction is not valid.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_EXEC_FAILED: cuFFT failed to execute the transform on the GPU.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_DEVICE: An invalid GPU index was specified in a descriptor.

    3.10. cuFFT and Multiple GPUs

    3.10.1. Function cufftXtSetGPUs()

    cufftResult 
        cufftXtSetGPUs(cufftHandle plan, int nGPUs, int *whichGPUs);
    

    cufftXtSetGPUs() identifies which GPUs are to be used with the plan. As in the single GPU case, cufftCreate() creates a plan and cufftMakePlan*() does the plan generation. This call will return an error if a non-default stream has been associated with the plan.

    Note that the call to cufftXtSetGPUs() must occur after the call to cufftCreate() and prior to the call to cufftMakePlan*(). The whichGPUs parameter of the cufftXtSetGPUs() function determines the ordering of the GPUs with respect to data decomposition (the first data chunk is placed on the GPU denoted by the first element of whichGPUs).

    Input
    plan: cufftHandle returned by cufftCreate
    nGPUs: Number of GPUs to use
    whichGPUs: The GPUs to use
    Return Values
    CUFFT_SUCCESS: cuFFT successfully set the GPUs to use.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle, or a non-default stream has been associated with the plan.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_VALUE: The requested number of GPUs was less than 2 or more than 8.
    CUFFT_INVALID_DEVICE: An invalid GPU index was specified.
    CUFFT_INVALID_SIZE: The transform size the plan was created for does not meet minimum size criteria.

    3.10.2. Function cufftXtSetWorkArea()

    cufftResult 
        cufftXtSetWorkArea(cufftHandle plan, void **workArea);
    

    cufftXtSetWorkArea() overrides the work areas associated with a plan. If the work area was auto-allocated, cuFFT frees the auto-allocated space. The cufftXtExec*() calls assume that the work area is valid and that it points to a contiguous region in each device memory that does not overlap with any other work area. If this is not the case, results are indeterminate.

    Input
    plan: cufftHandle returned by cufftCreate
    workArea: Pointer to the pointers to workArea
    Return Values
    CUFFT_SUCCESS: cuFFT successfully allows user to override workArea pointer.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_DEVICE: A GPU associated with the plan could not be selected.

    3.10.3. cuFFT Multiple GPU Execution

    3.10.3.1. Functions cufftXtExecDescriptorC2C() and cufftXtExecDescriptorZ2Z()

    cufftResult 
        cufftXtExecDescriptorC2C(cufftHandle plan, cudaLibXtDesc *input, 
            cudaLibXtDesc *output, int direction);
    cufftResult 
        cufftXtExecDescriptorZ2Z(cufftHandle plan, cudaLibXtDesc *input, 
            cudaLibXtDesc *output, int direction);
    

    cufftXtExecDescriptorC2C() (cufftXtExecDescriptorZ2Z()) executes a single-precision (double-precision) complex-to-complex transform plan in the transform direction as specified by the direction parameter. cuFFT uses the GPU memory pointed to by cudaLibXtDesc *input as input data. Since only in-place multiple GPU functionality is supported, this function also stores the result in the cudaLibXtDesc *input arrays.

    Input
    plan: cufftHandle returned by cufftCreate
    input: Pointer to the complex input data (in GPU memory) to transform
    output: Pointer to the complex output data (in GPU memory)
    direction: The transform direction: CUFFT_FORWARD or CUFFT_INVERSE
    Output
    input: Contains the complex Fourier coefficients
    Return Values
    CUFFT_SUCCESS: cuFFT successfully executed the FFT plan.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_INVALID_VALUE: At least one of the parameters input and direction is not valid.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_EXEC_FAILED: cuFFT failed to execute the transform on the GPU.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_DEVICE: An invalid GPU index was specified in a descriptor.

    3.10.4. Memory Allocation and Data Movement Functions

    Multiple GPU cuFFT execution functions assume a certain data layout in terms of what input data has been copied to which GPUs prior to execution, and what output data resides in which GPUs post execution. The following functions assist in allocation, setup and retrieval of the data. They must be called after the call to cufftMakePlan*().

    3.10.4.1. Function cufftXtMalloc()

    cufftResult 
        cufftXtMalloc(cufftHandle plan, cudaLibXtDesc **descriptor, 
            cufftXtSubFormat format);
    

    cufftXtMalloc() allocates a descriptor, and all memory for data in GPUs associated with the plan, and returns a pointer to the descriptor. Note the descriptor contains an array of device pointers so that the application may preprocess or postprocess the data on the GPUs. The enumerated parameter cufftXtSubFormat_t indicates if the buffer will be used for input or output.

    Input
    plan: cufftHandle returned by cufftCreate
    descriptor: Pointer to a pointer to a cudaLibXtDesc object
    format: cufftXtSubFormat value
    Output
    descriptor: Pointer to a pointer to a cudaLibXtDesc object
    Return Values
    CUFFT_SUCCESS: cuFFT successfully allows user to allocate descriptor and GPU memory.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle, or it is not a multiple GPU plan.
    CUFFT_ALLOC_FAILED: The allocation of GPU resources for the plan failed.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_DEVICE: An invalid GPU index was specified in the descriptor.

    3.10.4.1.1. Parameter cufftXtSubFormat

    cufftXtSubFormat_t is an enumerated type that indicates if the buffer will be used for input or output and the ordering of the data.

    typedef enum cufftXtSubFormat_t {
        CUFFT_XT_FORMAT_INPUT,              //by default input is in linear order across GPUs
        CUFFT_XT_FORMAT_OUTPUT,             //by default output is in scrambled order depending on transform
        CUFFT_XT_FORMAT_INPLACE,            //by default inplace is input order, which is linear across GPUs
        CUFFT_XT_FORMAT_INPLACE_SHUFFLED,   //shuffled output order after execution of the transform
        CUFFT_FORMAT_UNDEFINED
    } cufftXtSubFormat;
    

    3.10.4.2. Function cufftXtFree()

    cufftResult 
        cufftXtFree(cudaLibXtDesc *descriptor);
    

    cufftXtFree() frees the descriptor and all memory associated with it. The descriptor and memory must have been returned by a previous call to cufftXtMalloc().

    Input
    descriptor: Pointer to a cudaLibXtDesc object
    Return Values
    CUFFT_SUCCESS: cuFFT successfully allows user to free descriptor and associated GPU memory.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.

    3.10.4.3. Function cufftXtMemcpy()

    cufftResult 
        cufftXtMemcpy(cufftHandle plan, void *dstPointer, void *srcPointer, 
            cufftXtCopyType type);
    

    cufftXtMemcpy() copies data between buffers on the host and GPUs or between GPUs. The enumerated parameter cufftXtCopyType_t indicates the type and direction of transfer.

    Input
    plan: cufftHandle returned by cufftCreate
    dstPointer: Pointer to the destination address(es)
    srcPointer: Pointer to the source address(es)
    type: cufftXtCopyType value
    Return Values
    CUFFT_SUCCESS: cuFFT successfully allows user to copy memory between host and GPUs or between GPUs.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle.
    CUFFT_INVALID_VALUE: One or more invalid parameters were passed to the API.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.
    CUFFT_INVALID_DEVICE: An invalid GPU index was specified in a descriptor.

    3.10.4.3.1. Parameter cufftXtCopyType

    cufftXtCopyType_t is an enumerated type for multiple GPU functions that specifies the type of copy for cufftXtMemcpy().

    CUFFT_COPY_HOST_TO_DEVICE copies data from a contiguous host buffer to multiple device buffers, in the layout cuFFT requires for input data. dstPointer must point to a cudaLibXtDesc structure, and srcPointer must point to a host memory buffer.

    CUFFT_COPY_DEVICE_TO_HOST copies data from multiple device buffers, in the layout cuFFT produces for output data, to a contiguous host buffer. dstPointer must point to a host memory buffer, and srcPointer must point to a cudaLibXtDesc structure.

    CUFFT_COPY_DEVICE_TO_DEVICE copies data from multiple device buffers, in the layout cuFFT produces for output data, to multiple device buffers, in the layout cuFFT requires for input data. dstPointer and srcPointer must point to different cudaLibXtDesc structures (and therefore memory locations). That is, the copy cannot be in-place.

    typedef enum cufftXtCopyType_t {
        CUFFT_COPY_HOST_TO_DEVICE,
        CUFFT_COPY_DEVICE_TO_HOST,
        CUFFT_COPY_DEVICE_TO_DEVICE
    } cufftXtCopyType;
    
    

    3.10.5. General Multiple GPU Descriptor Types

    3.10.5.1. cudaXtDesc

    A descriptor type used in multiple GPU routines that contains information about the GPUs and their memory locations.

        struct cudaXtDesc_t{
        int version;                             //descriptor version
        int nGPUs;                               //number of GPUs
        int GPUs[MAX_CUDA_DESCRIPTOR_GPUS];      //array of device IDs
        void *data[MAX_CUDA_DESCRIPTOR_GPUS];    //array of pointers to data, one per GPU
        size_t size[MAX_CUDA_DESCRIPTOR_GPUS];   //array of data sizes, one per GPU
        void *cudaXtState;                       //opaque CUDA utility structure
    };
    typedef struct cudaXtDesc_t cudaXtDesc;
    

    3.10.5.2. cudaLibXtDesc

    A descriptor type used in multiple GPU routines that contains information about the library used.

    struct cudaLibXtDesc_t{
        int version;                //descriptor version
        cudaXtDesc *descriptor;     //multi-GPU memory descriptor
        libFormat library;          //which library recognizes the format
        int subFormat;              //library specific enumerator of sub formats
        void *libDescriptor;        //library specific descriptor e.g. FFT transform plan object
    };
    typedef struct cudaLibXtDesc_t cudaLibXtDesc;
    

    3.11. cuFFT Callbacks

    3.11.1. Function cufftXtSetCallback()

    cufftResult 
        cufftXtSetCallback(cufftHandle plan, void **callbackRoutine, cufftXtCallbackType type, void **callerInfo)
    

    cufftXtSetCallback() specifies a load or store callback to be used with the plan. This call is valid only after a call to cufftMakePlan*(), which does the plan generation. If there was already a callback of this type associated with the plan, this new callback routine replaces it. If the new callback requires shared memory, you must call cufftXtSetCallbackSharedSize with the amount of shared memory it needs. cuFFT will not retain the amount of shared memory associated with the previous callback.

    Input
    plan: cufftHandle returned by cufftCreate
    callbackRoutine: Array of callback routine pointers, one per GPU
    type: Type of callback routine
    callerInfo: Optional array of device pointers to caller-specific information, one per GPU
    Return Values
    CUFFT_SUCCESS: cuFFT successfully associated the callback function with the plan.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle, or a non-default stream has been associated with the plan.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_SETUP_FAILED: The cuFFT library failed to initialize.

    3.11.2. Function cufftXtClearCallback()

    cufftResult 
        cufftXtClearCallback(cufftHandle plan, cufftXtCallbackType type)
    

    cufftXtClearCallback() instructs cuFFT to stop invoking the specified callback type when executing the plan. Only the specified callback is cleared. If no callback of this type had been specified, the return code is CUFFT_SUCCESS.

    Input
    plan: cufftHandle returned by cufftCreate
    type: Type of callback routine
    Return Values
    CUFFT_SUCCESS: cuFFT successfully disassociated the callback function from the plan.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle, or a non-default stream has been associated with the plan.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.

    3.11.3. Function cufftXtSetCallbackSharedSize()

    cufftResult 
        cufftXtSetCallbackSharedSize(cufftHandle plan, cufftXtCallbackType type, size_t sharedSize)
    

    cufftXtSetCallbackSharedSize() instructs cuFFT to dynamically allocate shared memory at launch time, for use by the callback. The maximum allowable amount of shared memory is 16 KB. cuFFT passes a pointer to this shared memory to the callback routine at execution time. This shared memory is only valid for the life of the load or store callback operation. During execution, cuFFT may overwrite this shared memory for its own purposes.

    Input
    plan: cufftHandle returned by cufftCreate
    type: Type of callback routine
    sharedSize: Amount of shared memory requested
    Return Values
    CUFFT_SUCCESS: cuFFT will invoke the callback routine with a pointer to the requested amount of shared memory.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle, or a non-default stream has been associated with the plan.
    CUFFT_INTERNAL_ERROR: An internal driver error was detected.
    CUFFT_ALLOC_FAILED: cuFFT will not be able to allocate the requested amount of shared memory.

    3.12. Function cufftSetStream()

    cufftResult 
        cufftSetStream(cufftHandle plan, cudaStream_t stream);
    

    Associates a CUDA stream with a cuFFT plan. All kernel launches made during plan execution are now done through the associated stream, enabling overlap with activity in other streams (e.g. data copying). The association remains until the plan is destroyed or the stream is changed with another call to cufftSetStream(). This call will return an error for multiple GPU plans.

    Input
    plan: The cufftHandle object to associate with the stream
    stream: A valid CUDA stream created with cudaStreamCreate(), or 0 for the default stream
    Status Returned
    CUFFT_SUCCESS: The stream was associated with the plan.
    CUFFT_INVALID_PLAN: The plan parameter is not a valid handle, or it is a multiple GPU plan.
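    A minimal sketch of associating a plan with a non-default stream (the helper name is illustrative; the plan is assumed to have been generated already):

    ```cuda
    #include <cufft.h>
    #include <cuda_runtime.h>

    /* Launch the transform in a non-default stream so it can overlap
       with copies or kernels issued in other streams. */
    void exec_in_stream(cufftHandle plan, cufftComplex *data)
    {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        cufftSetStream(plan, stream);                  /* kernels now launch in 'stream' */
        cufftExecC2C(plan, data, data, CUFFT_FORWARD);

        cudaStreamSynchronize(stream);                 /* wait for this stream only */
        cudaStreamDestroy(stream);
    }
    ```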

    3.13. Function cufftGetVersion()

    cufftResult 
        cufftGetVersion(int *version);
    

    Returns the version number of cuFFT.

    Input
    version: Pointer to the version number
    Output
    version: Contains the version number
    Return Values
    CUFFT_SUCCESS: cuFFT successfully returned the version number.

    3.14. Function cufftGetProperty()

    cufftResult 
        cufftGetProperty(libraryPropertyType type, int *value);
    
    

    Returns in *value the value of the property described by type for the dynamically linked cuFFT library.

    Input
    type: CUDA library property
    Output
    value: Contains the integer value for the requested property
    Return Values
    CUFFT_SUCCESS: The property value was successfully returned.
    CUFFT_INVALID_TYPE: The property type is not recognized.
    CUFFT_INVALID_VALUE: value is NULL.
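    A short host-only sketch querying the library version both ways (output format is illustrative):

    ```cuda
    #include <cufft.h>
    #include <library_types.h>   /* libraryPropertyType */
    #include <stdio.h>

    int main(void)
    {
        int version, major, minor, patch;

        /* Single packed version number... */
        cufftGetVersion(&version);

        /* ...or the individual components. */
        cufftGetProperty(MAJOR_VERSION, &major);
        cufftGetProperty(MINOR_VERSION, &minor);
        cufftGetProperty(PATCH_LEVEL, &patch);

        printf("cuFFT %d (%d.%d.%d)\n", version, major, minor, patch);
        return 0;
    }
    ```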

    3.15. cuFFT Types

    3.15.1. Parameter cufftType

    The cuFFT library supports complex- and real-data transforms. The cufftType data type is an enumeration of the types of transform data supported by cuFFT.

    typedef enum cufftType_t {
        CUFFT_R2C = 0x2a,  // Real to complex (interleaved) 
        CUFFT_C2R = 0x2c,  // Complex (interleaved) to real 
        CUFFT_C2C = 0x29,  // Complex to complex (interleaved) 
        CUFFT_D2Z = 0x6a,  // Double to double-complex (interleaved) 
        CUFFT_Z2D = 0x6c,  // Double-complex (interleaved) to double 
        CUFFT_Z2Z = 0x69   // Double-complex to double-complex (interleaved)
    } cufftType;
    

    3.15.2. Parameters for Transform Direction

    The cuFFT library defines forward and inverse Fast Fourier Transforms according to the sign of the complex exponential term.

        #define CUFFT_FORWARD -1
        #define CUFFT_INVERSE  1
    

    cuFFT performs un-normalized FFTs; that is, performing a forward FFT on an input data set followed by an inverse FFT on the resulting set yields data that is equal to the input, scaled by the number of elements. Scaling either transform by the reciprocal of the size of the data set is left for the user to perform as seen fit.
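    The scaling convention can be checked by hand with a 2-point transform. The following plain C snippet is independent of cuFFT and exists purely to illustrate the arithmetic: the forward N = 2 DFT is a sum and a difference, and applying the un-normalized inverse (the same butterfly, with no 1/N factor) returns the input multiplied by N = 2.

    ```c
    #include <assert.h>

    /* Forward DFT for N = 2, then the un-normalized inverse; writes into y. */
    static void dft2_roundtrip(const double x[2], double y[2])
    {
        double X0 = x[0] + x[1];   /* forward: X[0] */
        double X1 = x[0] - x[1];   /* forward: X[1] */
        y[0] = X0 + X1;            /* inverse, no 1/N factor */
        y[1] = X0 - X1;
    }

    int main(void)
    {
        double x[2] = {3.0, 5.0}, y[2];
        dft2_roundtrip(x, y);
        /* Round trip yields the input scaled by N = 2. */
        assert(y[0] == 2.0 * x[0] && y[1] == 2.0 * x[1]);
        return 0;
    }
    ```

    Dividing y by N recovers the original data, which is exactly the user-side normalization step the text describes.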

    3.15.3. Type definitions for callbacks

    The cuFFT library supports callback functions for all combinations of single or double precision, real or complex data, load or store. These are enumerated in the parameter cufftXtCallbackType.

    typedef enum cufftXtCallbackType_t {
        CUFFT_CB_LD_COMPLEX = 0x0,
        CUFFT_CB_LD_COMPLEX_DOUBLE = 0x1,
        CUFFT_CB_LD_REAL = 0x2,
        CUFFT_CB_LD_REAL_DOUBLE = 0x3,
        CUFFT_CB_ST_COMPLEX = 0x4,
        CUFFT_CB_ST_COMPLEX_DOUBLE = 0x5,
        CUFFT_CB_ST_REAL = 0x6,
        CUFFT_CB_ST_REAL_DOUBLE = 0x7,
        CUFFT_CB_UNDEFINED = 0x8
    } cufftXtCallbackType;
    

    The corresponding function prototypes and pointer type definitions are as follows:

    typedef cufftComplex (*cufftCallbackLoadC)(void *dataIn, size_t offset, void *callerInfo, void *sharedPointer);
    
    typedef cufftDoubleComplex (*cufftCallbackLoadZ)(void *dataIn, size_t offset, void *callerInfo, void *sharedPointer);
    
    typedef cufftReal (*cufftCallbackLoadR)(void *dataIn, size_t offset, void *callerInfo, void *sharedPointer);
    
    typedef cufftDoubleReal(*cufftCallbackLoadD)(void *dataIn, size_t offset, void *callerInfo, void *sharedPointer);
    
    
    typedef void (*cufftCallbackStoreC)(void *dataOut, size_t offset, cufftComplex element, void *callerInfo, void *sharedPointer);
    
    typedef void (*cufftCallbackStoreZ)(void *dataOut, size_t offset, cufftDoubleComplex element, void *callerInfo, void *sharedPointer);
    
    typedef void (*cufftCallbackStoreR)(void *dataOut, size_t offset, cufftReal element, void *callerInfo, void *sharedPointer);
    
    typedef void (*cufftCallbackStoreD)(void *dataOut, size_t offset, cufftDoubleReal element, void *callerInfo, void *sharedPointer);
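    As a sketch of how these types fit together, a load callback that scales single-precision complex input might look like the following. Note that callbacks require the statically linked cuFFT library and relocatable device code (compile with -dc); the function names and the use of a scale factor here are illustrative:

    ```cuda
    #include <cufft.h>
    #include <cufftXt.h>
    #include <cuda_runtime.h>

    /* Device-side load callback: read the element and scale it. */
    __device__ cufftComplex scale_on_load(void *dataIn, size_t offset,
                                          void *callerInfo, void *sharedPointer)
    {
        float scale = *(float *)callerInfo;              /* per-plan user data */
        cufftComplex v = ((cufftComplex *)dataIn)[offset];
        v.x *= scale;
        v.y *= scale;
        return v;
    }

    /* cuFFT needs a device-side pointer to the callback function. */
    __device__ cufftCallbackLoadC d_loadCallbackPtr = scale_on_load;

    /* Host side: fetch the device function pointer and attach it to the plan
       (plan generation via cufftMakePlan*() must already have happened). */
    void attach_callback(cufftHandle plan, float *d_scale)
    {
        cufftCallbackLoadC h_ptr;
        cudaMemcpyFromSymbol(&h_ptr, d_loadCallbackPtr, sizeof(h_ptr));
        cufftXtSetCallback(plan, (void **)&h_ptr, CUFFT_CB_LD_COMPLEX,
                           (void **)&d_scale);
    }
    ```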
    
    

    3.15.4. Other cuFFT Types

    3.15.4.1. cufftHandle

    A handle type used to store and access cuFFT plans. The user receives a handle after creating a cuFFT plan and uses this handle to execute the plan.

    typedef unsigned int cufftHandle;

    3.15.4.2. cufftReal

    A single-precision, floating-point real data type.

    typedef float cufftReal;

    3.15.4.3. cufftDoubleReal

    A double-precision, floating-point real data type.

    typedef double cufftDoubleReal;

    3.15.4.4. cufftComplex

    A single-precision, floating-point complex data type that consists of interleaved real and imaginary components.

    typedef cuComplex cufftComplex;

    3.15.4.5. cufftDoubleComplex

    A double-precision, floating-point complex data type that consists of interleaved real and imaginary components.

    typedef cuDoubleComplex cufftDoubleComplex;

    3.16. Common types

    3.16.1. cudaDataType

    The cudaDataType data type is an enumeration of the types supported by CUDA libraries.

    typedef enum cudaDataType_t
    {
            CUDA_R_16F = 2, // 16-bit real
            CUDA_C_16F = 6, // 16-bit complex
            CUDA_R_32F = 0, // 32-bit real
            CUDA_C_32F = 4, // 32-bit complex
            CUDA_R_64F = 1, // 64-bit real
            CUDA_C_64F = 5, // 64-bit complex
            CUDA_R_8I  = 3, // 8-bit real as a signed integer
            CUDA_C_8I  = 7, // 8-bit complex as a pair of signed integers
            CUDA_R_8U  = 8, // 8-bit real as an unsigned integer
            CUDA_C_8U  = 9  // 8-bit complex as a pair of unsigned integers
    } cudaDataType;
    

    3.16.2. libraryPropertyType

    The libraryPropertyType data type is an enumeration of library property types. (I.e., CUDA version X.Y.Z would yield MAJOR_VERSION=X, MINOR_VERSION=Y, PATCH_LEVEL=Z.)

    typedef enum libraryPropertyType_t
    {
            MAJOR_VERSION,
            MINOR_VERSION,
            PATCH_LEVEL
    } libraryPropertyType;
    

    4. cuFFT Code Examples

    This chapter provides multiple simple examples of complex and real 1D, 2D, and 3D transforms that use cuFFT to perform forward and inverse FFTs.

    1D Complex-to-Complex Transforms

    In this example a one-dimensional complex-to-complex transform is applied to the input data. Afterwards an inverse transform is performed on the computed frequency domain representation.

    #define NX 256
    #define BATCH 1
    
    cufftHandle plan;
    cufftComplex *data;
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*BATCH);
    if (cudaGetLastError() != cudaSuccess){
    	fprintf(stderr, "Cuda error: Failed to allocate\n");
    	return;	
    }
    
    if (cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS){
    	fprintf(stderr, "CUFFT error: Plan creation failed");
    	return;	
    }	
    
    ...
    
    /* Note:
     *  Identical pointers to input and output arrays implies in-place transformation
     */
    
    if (cufftExecC2C(plan, data, data, CUFFT_FORWARD) != CUFFT_SUCCESS){
    	fprintf(stderr, "CUFFT error: ExecC2C Forward failed");
    	return;	
    }
    
    if (cufftExecC2C(plan, data, data, CUFFT_INVERSE) != CUFFT_SUCCESS){
    	fprintf(stderr, "CUFFT error: ExecC2C Inverse failed");
    	return;	
    }
    
    /* 
     *  Results may not be immediately available so block device until all 
     *  tasks have completed
     */
    
    if (cudaDeviceSynchronize() != cudaSuccess){
    	fprintf(stderr, "Cuda error: Failed to synchronize\n");
    	return;	
    }	
    
    /* 
     *  Divide by number of elements in data set to get back original data
     */
    
    ...
    	
    cufftDestroy(plan);
    cudaFree(data);
    
    

    1D Real-to-Complex Transforms

    In this example a one-dimensional real-to-complex transform is applied to the input data.

    #define NX 256
    #define BATCH 1
    
    cufftHandle plan;
    cufftComplex *data;
    cudaMalloc((void**)&data, sizeof(cufftComplex)*(NX/2+1)*BATCH);
    if (cudaGetLastError() != cudaSuccess){
    	fprintf(stderr, "Cuda error: Failed to allocate\n");
    	return;	
    }
    
    if (cufftPlan1d(&plan, NX, CUFFT_R2C, BATCH) != CUFFT_SUCCESS){
    	fprintf(stderr, "CUFFT error: Plan creation failed");
    	return;	
    }	
    
    ...
    
    /* Use the CUFFT plan to transform the signal in place. */
    if (cufftExecR2C(plan, (cufftReal*)data, data) != CUFFT_SUCCESS){
    	fprintf(stderr, "CUFFT error: ExecR2C Forward failed");
    	return;	
    }
    
    if (cudaDeviceSynchronize() != cudaSuccess){
    	fprintf(stderr, "Cuda error: Failed to synchronize\n");
    	return;	
    }
    
    ...
    
    cufftDestroy(plan);
    cudaFree(data);
    
    

    2D Complex-to-Real Transforms

    In this example a two-dimensional complex-to-real transform is applied to the input data arranged according to the requirements of the default FFTW padding mode.

    #define NX 256
    #define NY 128
    #define NRANK 2
    #define BATCH 1
    
    cufftHandle plan;
    cufftComplex *data;
    int n[NRANK] = {NX, NY};
    
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*(NY/2+1));
    if (cudaGetLastError() != cudaSuccess){
    	fprintf(stderr, "Cuda error: Failed to allocate\n");
    	return;	
    }
    
    /* Create a 2D FFT plan. */
    if (cufftPlanMany(&plan, NRANK, n,
    				  NULL, 1, 0,
    				  NULL, 1, 0,
    				  CUFFT_C2R,BATCH) != CUFFT_SUCCESS){
    	fprintf(stderr, "CUFFT Error: Unable to create plan\n");
    	return;	
    }
    
    ...
    
    if (cufftExecC2R(plan, data, data) != CUFFT_SUCCESS){
    	fprintf(stderr, "CUFFT Error: Unable to execute plan\n");
    	return;		
    }
    
    if (cudaDeviceSynchronize() != cudaSuccess){
    	fprintf(stderr, "Cuda error: Failed to synchronize\n");
    	return;
    }
    
    ...
    
    cufftDestroy(plan);
    cudaFree(data);
    
    

    3D Complex-to-Complex Transforms

    In this example a three-dimensional complex-to-complex transform is applied to the input data.

    #define NX 64
    #define NY 128
    #define NZ 128
    #define BATCH 10
    #define NRANK 3
    
    cufftHandle plan;
    cufftComplex *data;
    int n[NRANK] = {NX, NY, NZ};
    
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*NY*NZ*BATCH);
    if (cudaGetLastError() != cudaSuccess){
    	fprintf(stderr, "Cuda error: Failed to allocate\n");
    	return;	
    }
    
    /* Create a 3D FFT plan. */
    if (cufftPlanMany(&plan, NRANK, n, 
    				  NULL, 1, NX*NY*NZ, // *inembed, istride, idist 
    				  NULL, 1, NX*NY*NZ, // *onembed, ostride, odist
    				  CUFFT_C2C, BATCH) != CUFFT_SUCCESS){
    	fprintf(stderr, "CUFFT error: Plan creation failed");
    	return;	
    }	
    
    /* Use the CUFFT plan to transform the signal in place. */
    if (cufftExecC2C(plan, data, data, CUFFT_FORWARD) != CUFFT_SUCCESS){
    	fprintf(stderr, "CUFFT error: ExecC2C Forward failed");
    	return;	
    }
    
    if (cudaDeviceSynchronize() != cudaSuccess){
    	fprintf(stderr, "Cuda error: Failed to synchronize\n");
    	return;	
    }	
    
    ...
    	
    cufftDestroy(plan);
    cudaFree(data);
    
    

    2D Advanced Data Layout Use

    In this example a two-dimensional complex-to-complex transform is applied to the input data arranged according to the requirements of the advanced layout.

    #define NX 128
    #define NY 256
    #define BATCH 10
    #define NRANK 2
    
    /* Advanced interface parameters, arbitrary strides */
    #define ISTRIDE 2   // distance between successive input elements in innermost dimension
    #define OSTRIDE 1   // distance between successive output elements in innermost dimension
    #define IX (NX+2)
    #define IY (NY+1)
    #define OX (NX+3)
    #define OY (NY+4)
    #define IDIST (IX*IY*ISTRIDE+3) // distance between first element of two consecutive signals in a batch of input data
    #define ODIST (OX*OY*OSTRIDE+5) // distance between first element of two consecutive signals in a batch of output data
    
    cufftHandle plan;
    cufftComplex *idata, *odata;
    int isize = IDIST * BATCH;
    int osize = ODIST * BATCH;
    int n[NRANK] = {NX, NY};
    int inembed[NRANK] = {IX, IY}; // pointer that indicates storage dimensions of input data
    int onembed[NRANK] = {OX, OY}; // pointer that indicates storage dimensions of output data
    
    cudaMalloc((void **)&idata, sizeof(cufftComplex)*isize);
    cudaMalloc((void **)&odata, sizeof(cufftComplex)*osize);
    if (cudaGetLastError() != cudaSuccess){
    	fprintf(stderr, "Cuda error: Failed to allocate\n");
    	return;	
    }
    
    /* Create a batched 2D plan */
    if (cufftPlanMany(&plan, NRANK, n,
    				  inembed,ISTRIDE,IDIST,
    				  onembed,OSTRIDE,ODIST,
    				  CUFFT_C2C,BATCH) != CUFFT_SUCCESS){
    	fprintf(stderr, "CUFFT Error: Unable to create plan\n");
    	return;	
    }
    
    ...
    
    /* Execute the transform out-of-place */
    if (cufftExecC2C(plan, idata, odata, CUFFT_FORWARD) != CUFFT_SUCCESS){
    	fprintf(stderr, "CUFFT Error: Failed to execute plan\n");
    	return;		
    }
    
    if (cudaDeviceSynchronize() != cudaSuccess){
    	fprintf(stderr, "Cuda error: Failed to synchronize\n");
    	return;
    }
    
    ...
    
    cufftDestroy(plan);
    cudaFree(idata);
    cudaFree(odata);
    
    

    3D Complex-to-Complex Transforms using Two GPUs

    In this example a three-dimensional complex-to-complex transform is applied to the input data using two GPUs.

    // Demonstrate how to use CUFFT to perform 3-d FFTs using 2 GPUs
    //
    // cufftCreate() - Create an empty plan
        cufftHandle plan_input; cufftResult result;
        result = cufftCreate(&plan_input);
        if (result != CUFFT_SUCCESS) { printf ("*Create failed\n"); return; }
    //
    // cufftXtSetGPUs() - Define which GPUs to use
        int nGPUs = 2, whichGPUs[2];
        whichGPUs[0] = 0; whichGPUs[1] = 1;
        result = cufftXtSetGPUs (plan_input, nGPUs, whichGPUs);
        if (result != CUFFT_SUCCESS) { printf ("*XtSetGPUs failed\n"); return; }
    //
    // Initialize FFT input data
        size_t worksize[2];
        cufftComplex *host_data_input, *host_data_output;
        int nx = 64, ny = 128, nz = 32;
        int size_of_data = sizeof(cufftComplex) * nx * ny * nz;
        host_data_input = (cufftComplex *)malloc(size_of_data);
        if (host_data_input == NULL) { printf ("malloc failed\n"); return; }
        host_data_output = (cufftComplex *)malloc(size_of_data);
        if (host_data_output == NULL) { printf ("malloc failed\n"); return; }
        initialize_3d_data (nx, ny, nz, host_data_input, host_data_output);
    //
    // cufftMakePlan3d() - Create the plan
        result = cufftMakePlan3d (plan_input, nz, ny, nx, CUFFT_C2C, worksize);
        if (result != CUFFT_SUCCESS) { printf ("*MakePlan* failed\n"); return; }
    //
    // cufftXtMalloc() - Malloc data on multiple GPUs
        cudaLibXtDesc *device_data_input;
        result = cufftXtMalloc (plan_input, &device_data_input,
            CUFFT_XT_FORMAT_INPLACE);
        if (result != CUFFT_SUCCESS) { printf ("*XtMalloc failed\n"); return; }
    //
    // cufftXtMemcpy() - Copy data from host to multiple GPUs
        result = cufftXtMemcpy (plan_input, device_data_input,
            host_data_input, CUFFT_COPY_HOST_TO_DEVICE);
        if (result != CUFFT_SUCCESS) { printf ("*XtMemcpy failed\n"); return; }
    //
    // cufftXtExecDescriptorC2C() - Execute FFT on multiple GPUs
        result = cufftXtExecDescriptorC2C (plan_input, device_data_input,
            device_data_input, CUFFT_FORWARD);
        if (result != CUFFT_SUCCESS) { printf ("*XtExec* failed\n"); return; }
    //
    // cufftXtMemcpy() - Copy data from multiple GPUs to host
        result = cufftXtMemcpy (plan_input, host_data_output,
            device_data_input, CUFFT_COPY_DEVICE_TO_HOST);
        if (result != CUFFT_SUCCESS) { printf ("*XtMemcpy failed\n"); return; }
    //
    // Print output and check results
        int output_return = output_3d_results (nx, ny, nz,
            host_data_input, host_data_output);
        if (output_return != 0) { return; }
    //
    // cufftXtFree() - Free GPU memory
        result = cufftXtFree(device_data_input);
        if (result != CUFFT_SUCCESS) { printf ("*XtFree failed\n"); return; }
    //
    // cufftDestroy() - Destroy FFT plan
        result = cufftDestroy(plan_input);
        if (result != CUFFT_SUCCESS) { printf ("*Destroy failed\n"); return; }
        free(host_data_input); free(host_data_output);
    
    

    1D Complex-to-Complex Transforms using Two GPUs with Natural Order

    In this example a one-dimensional complex-to-complex transform is applied to the input data using two GPUs. The output data is in natural order in GPU memory.

    // Demonstrate how to use CUFFT to perform 1-d FFTs using 2 GPUs 
    // Output on the GPUs is in natural order
    // Function return codes should be checked for errors in actual code
    //
    // cufftCreate() - Create an empty plan
        cufftHandle plan_input; cufftResult result;
        result = cufftCreate(&plan_input);
    //
    // cufftXtSetGPUs() - Define which GPUs to use
        int nGPUs = 2, whichGPUs[2];
        whichGPUs[0] = 0; whichGPUs[1] = 1;
        result = cufftXtSetGPUs (plan_input, nGPUs, whichGPUs);
    //
    // Initialize FFT input data
        size_t worksize[2];
        cufftComplex *host_data_input, *host_data_output;
        int nx = 1024, batch = 1, rank = 1, n[1];
        int inembed[1], istride, idist, onembed[1], ostride, odist;
        n[0] = nx;
        int size_of_data = sizeof(cufftComplex) * nx * batch;
        host_data_input = (cufftComplex *)malloc(size_of_data);
        host_data_output = (cufftComplex *)malloc(size_of_data);
        initialize_1d_data (nx, batch, rank, n, inembed, &istride, &idist,
            onembed, &ostride, &odist, host_data_input, host_data_output);
    //
    // cufftMakePlanMany() - Create the plan
        result = cufftMakePlanMany (plan_input, rank, n, inembed, istride, idist,
            onembed, ostride, odist, CUFFT_C2C, batch, worksize);
    //
    // cufftXtMalloc() - Malloc data on multiple GPUs
        cudaLibXtDesc *device_data_input, *device_data_output;
        result = cufftXtMalloc (plan_input, &device_data_input,
            CUFFT_XT_FORMAT_INPLACE);
        result = cufftXtMalloc (plan_input, &device_data_output,
            CUFFT_XT_FORMAT_INPLACE);
    //
    // cufftXtMemcpy() - Copy data from host to multiple GPUs
        result = cufftXtMemcpy (plan_input, device_data_input,
            host_data_input, CUFFT_COPY_HOST_TO_DEVICE);
    //
    // cufftXtExecDescriptorC2C() - Execute FFT on multiple GPUs
        result = cufftXtExecDescriptorC2C (plan_input, device_data_input,
            device_data_input, CUFFT_FORWARD);
    //
    // cufftXtMemcpy() - Copy the data to natural order on GPUs
        result = cufftXtMemcpy (plan_input, device_data_output,
            device_data_input, CUFFT_COPY_DEVICE_TO_DEVICE);
    //
    // cufftXtMemcpy() - Copy natural order data from multiple GPUs to host
        result = cufftXtMemcpy (plan_input, host_data_output,
            device_data_output, CUFFT_COPY_DEVICE_TO_HOST);
    //
    // Print output and check results
        int output_return = output_1d_results (nx, batch,
            host_data_input, host_data_output);
    //
    // cufftXtFree() - Free GPU memory
        result = cufftXtFree(device_data_input);
        result = cufftXtFree(device_data_output);
    //
    // cufftDestroy() - Destroy FFT plan
        result = cufftDestroy(plan_input);
        free(host_data_input); free(host_data_output);
    
    

    1D Complex-to-Complex Convolution using Two GPUs

    In this example a one-dimensional convolution is calculated using complex-to-complex transforms.

    
    
    //
    // Demonstrate how to use CUFFT to perform a convolution using 1-d FFTs and
    // 2 GPUs. The forward FFTs use both GPUs, while the inverse FFT uses one.
    // Function return codes should be checked for errors in actual code.
    //
    // cufftCreate() - Create an empty plan
        cufftResult result; cudaError_t cuda_status;
        cufftHandle plan_forward_2_gpus, plan_inverse_1_gpu;
        result = cufftCreate(&plan_forward_2_gpus);
        result = cufftCreate(&plan_inverse_1_gpu);
    //
    // cufftXtSetGPUs() - Define which GPUs to use
        int nGPUs = 2, whichGPUs[2];
        whichGPUs[0] = 0; whichGPUs[1] = 1;
        result = cufftXtSetGPUs (plan_forward_2_gpus, nGPUs, whichGPUs);
    //
    // Initialize FFT input data
        size_t worksize[2];
        cufftComplex *host_data_input, *host_data_output;
        int nx = 1048576, batch = 2, rank = 1, n[1];
        int inembed[1], istride, idist, onembed[1], ostride, odist;
        n[0] = nx;
        int size_of_one_set = sizeof(cufftComplex) * nx;
        int size_of_data = size_of_one_set * batch;
        host_data_input = (cufftComplex*)malloc(size_of_data);
        host_data_output = (cufftComplex*)malloc(size_of_one_set);
        initialize_1d_data (nx, batch, rank, n, inembed, &istride, &idist,
            onembed, &ostride, &odist, host_data_input, host_data_output);
    //
    // cufftMakePlanMany(), cufftPlan1d - Create the plans
        result = cufftMakePlanMany (plan_forward_2_gpus, rank, n, inembed,
            istride, idist, onembed, ostride, odist, CUFFT_C2C, batch, worksize);
        result = cufftPlan1d (&plan_inverse_1_gpu, nx, CUFFT_C2C, 1);
    //
    // cufftXtMalloc(), cudaMallocHost - Allocate data for GPUs
        cudaLibXtDesc *device_data_input; cufftComplex *GPU0_data_from_GPU1;
        result = cufftXtMalloc (plan_forward_2_gpus, &device_data_input,
            CUFFT_XT_FORMAT_INPLACE);
        int device0 = device_data_input->descriptor->GPUs[0];
        cudaSetDevice(device0);
        cuda_status = cudaMallocHost ((void**)&GPU0_data_from_GPU1,size_of_one_set);
    //
    // cufftXtMemcpy() - Copy data from host to multiple GPUs
        result = cufftXtMemcpy (plan_forward_2_gpus, device_data_input,
            host_data_input, CUFFT_COPY_HOST_TO_DEVICE);
    //
    // cufftXtExecDescriptorC2C() - Execute forward FFTs on multiple GPUs
        result = cufftXtExecDescriptorC2C (plan_forward_2_gpus, device_data_input,
            device_data_input, CUFFT_FORWARD);
    //
    // cudaMemcpy result from GPU1 to GPU0
        cufftComplex *device_data_on_GPU1;
        device_data_on_GPU1 = (cufftComplex*)
            (device_data_input->descriptor->data[1]);
        cuda_status = cudaMemcpy (GPU0_data_from_GPU1, device_data_on_GPU1,
            size_of_one_set, cudaMemcpyDeviceToDevice);
    //
    // Continued on next page
    //
    
    

    
    
    //
    // Demonstrate how to use CUFFT to perform a convolution using 1-d FFTs and
    // 2 GPUs. The forward FFTs use both GPUs, while the inverse FFT uses one.
    // Function return codes should be checked for errors in actual code.
    //
    // Part 2
    //
    // Multiply results and scale output
        cufftComplex *device_data_on_GPU0;
        device_data_on_GPU0 = (cufftComplex*)
            (device_data_input->descriptor->data[0]);
        cudaSetDevice(device0);
        ComplexPointwiseMulAndScale<<<32, 256>>>((cufftComplex*)device_data_on_GPU0,
            (cufftComplex*) GPU0_data_from_GPU1, nx);
    //   
    // cufftExecC2C() - Execute inverse FFT on one GPU
        result = cufftExecC2C (plan_inverse_1_gpu, GPU0_data_from_GPU1,
            GPU0_data_from_GPU1, CUFFT_INVERSE);
    //
    // cudaMemcpy() - Copy results from GPU0 to host
        cuda_status = cudaMemcpy(host_data_output, GPU0_data_from_GPU1,
            size_of_one_set, cudaMemcpyDeviceToHost);
    //
    // Print output and check results
        int output_return = output_1d_results (nx, batch,
            host_data_input, host_data_output);
    //
    // cufftDestroy() - Destroy FFT plans
        result = cufftDestroy(plan_forward_2_gpus);
        result = cufftDestroy(plan_inverse_1_gpu);
    //
    // cufftXtFree(), cudaFreeHost(), free() - Free GPU and host memory
        result = cufftXtFree(device_data_input);
        cuda_status = cudaFreeHost (GPU0_data_from_GPU1);
        free(host_data_input); free(host_data_output);
    
    

    // 
    // Utility routine to perform complex pointwise multiplication with scaling
    __global__ void ComplexPointwiseMulAndScale
        (cufftComplex *a, cufftComplex *b, int size)
    {
        const int numThreads = blockDim.x * gridDim.x;
        const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
        float scale = 1.0f / (float)size;
        cufftComplex c;
        for (int i = threadID; i < size; i += numThreads)
        {
            c = cuCmulf(a[i], b[i]);
            b[i] = make_cuFloatComplex(scale*cuCrealf(c), scale*cuCimagf(c));
        }
        return;
    }
    
    

    5. Multiple GPU Data Organization

    This chapter explains how data are distributed between the GPUs, before and after a multiple GPU transform. For simplicity, it is assumed in this chapter that the caller has specified GPU 0 and GPU 1 to perform the transform.

    5.1. Multiple GPU Data Organization for Batched Transforms

    For batches of transforms, each individual transform is executed on a single GPU. If possible the batches are evenly distributed among the GPUs. For a batch of size m performed on n GPUs, where m is not divisible by n, the first m % n GPUs will perform ⌊m/n⌋ + 1 transforms, and the remaining GPUs will perform ⌊m/n⌋ transforms. For example, in a batch of 15 transforms performed on 4 GPUs, the first three GPUs would perform 4 transforms each, and the last GPU would perform 3 transforms. This approach removes the need for data exchange between the GPUs, and results in nearly perfect scaling for cases where the batch size is divisible by the number of GPUs.

    5.2. Multiple GPU Data Organization for Single 2D and 3D Transforms

    Single transforms performed on multiple GPUs require the data to be divided between the GPUs. Then execution takes place in phases. For example with 2 GPUs, for 2D and 3D transforms with even sized dimensions, each GPU does half of the transform in (rank - 1) dimensions. Then data are exchanged between the GPUs so that the final dimension can be processed.

    Since 2D and 3D transforms support sizes other than powers of 2, it is possible that the data cannot be evenly distributed among the GPUs. In general, for the case of n GPUs, a dimension of size m that is not a multiple of n is distributed such that the first m % n GPUs each get one extra row (for 2D transforms) or one extra plane (for 3D transforms).

    Take, for example, a 2D transform on 4 GPUs, using an array declared in C as data[x][y], where x is 65 and y is 99. The surface is distributed prior to the transform such that GPU 0 receives a surface with dimensions [17][99], and GPUs 1...3 receive surfaces with dimensions [16][99]. After the transform, each GPU again has a portion of the surface, but divided in the y dimension. GPUs 0...2 have surfaces with dimensions [65][25], and GPU 3 has a surface with dimensions [65][24].

    For a 3D transform on 4 GPUs, consider an array declared in C as data[x][y][z], where x is 103, y is 122, and z is 64. The volume is distributed prior to the transform such that GPUs 0...2 each receive a volume with dimensions [26][122][64], and GPU 3 receives a volume with dimensions [25][122][64]. After the transform, each GPU again has a portion of the volume, but divided in the y dimension. GPUs 0 and 1 have volumes with dimensions [103][31][64], and GPUs 2 and 3 have volumes with dimensions [103][30][64].

    5.3. Multiple-GPU Data Organization for Single 1D Transforms

    By default for 1D transforms, the initial distribution of data to the GPUs is similar to the 2D and 3D cases. For a transform of dimension x on two GPUs, GPU 0 receives data ranging from 0...(x/2-1). GPU 1 receives data ranging from (x/2)...(x-1). Similarly, with 4 GPUs, the data are evenly distributed among all 4 GPUs.

    Before computation can begin, data are redistributed among the GPUs. It is possible to perform this redistribution in the copy from host memory, in cases where the application does not need to pre-process the data prior to the transform. To do this, the application can create the data descriptor with cufftXtMalloc using the sub-format CUFFT_XT_FORMAT_1D_INPUT_SHUFFLED. This can significantly reduce the time it takes to execute the transform.

    cuFFT performs multiple GPU 1D transforms by decomposing the transform size into factors Factor1 and Factor2, and treating the data as a grid of size Factor1 x Factor2. The four steps done to calculate the 1D FFT are: Factor1 transforms of size Factor2, data exchange between the GPUs, a pointwise twiddle multiplication, and Factor2 transforms of size Factor1.

    To gain efficiency by overlapping computation with data exchange, cuFFT breaks the whole transform into independent segments, or strings, which can be processed while others are in flight. A side effect of this algorithm is that the output of the transform is not in linear order. The output in GPU memory is arranged in strings, each of which is composed of Factor2 substrings of equal size. Each substring contains contiguous results, starting Factor1 elements after the start of the previous substring. Each string starts one substring size after the start of the previous string. The strings appear in order, the first half on GPU 0, and the second half on GPU 1. See the example below:

    transform size = 1024
    number of strings = 8
    Factor1 = 64
    Factor2 = 16
    substrings per string for output layout is Factor2 (16)
    string size = 1024/8 = 128
    substring size = 128/16 = 8
    stride between substrings = 1024/16 = Factor1 (64)
    
    On GPU 0:
    string 0 has substrings with indices 0...7   64...71   128...135 ... 960...967
    string 1 has substrings with indices 8...15  72...79   136...143 ... 968...975
    ...
    On GPU 1:
    string 4 has substrings with indices 32...39  96...103 160...167 ... 992...999
    ...
    string 7 has substrings with indices 56...63 120...127 184...191 ... 1016...1023
        

    The cufftXtQueryPlan API allows the caller to retrieve a structure containing the number of strings, the decomposition factors, and (in the case of power of 2 size) some useful mask and shift elements. The example below shows how cufftXtQueryPlan is invoked. It also shows how to translate from an index in the host input array to the corresponding index on the device, and vice versa.

    /* 
     * These routines demonstrate the use of cufftXtQueryPlan to get the 1D 
     * factorization and convert between permuted and linear indexes.
     */
    /*
     * Set up a 1D plan that will execute on GPU 0 and GPU 1, and query
     * the decomposition factors
     */
    int main(int argc, char **argv){
        cufftHandle plan;
        cufftResult stat;
        size_t workSizes[2];
        int size = 1024;  /* example transform size */
        int whichGPUs[2] = { 0, 1 };
        cufftXt1dFactors factors;
        stat = cufftCreate( &plan );
        if (stat != CUFFT_SUCCESS) {
            printf("Create error %d\n",stat);
            return 1;
        }
        stat = cufftXtSetGPUs( plan, 2, whichGPUs );
        if (stat != CUFFT_SUCCESS) {
            printf("SetGPU error %d\n",stat);
            return 1;
        }
        stat = cufftMakePlan1d( plan, size, CUFFT_C2C, 1, workSizes );
        if (stat != CUFFT_SUCCESS) {
            printf("MakePlan error %d\n",stat);
            return 1;
        }
        stat = cufftXtQueryPlan( plan, (void *) &factors, CUFFT_QUERY_1D_FACTORS );
        if (stat != CUFFT_SUCCESS) {
            printf("QueryPlan error %d\n",stat);
            return 1;
        }
        printf("Factor1 %zu, Factor2 %zu\n",factors.factor1,factors.factor2);
        cufftDestroy(plan);
        return 0;
    }
    
        

    /* 
     * Given an index into a permuted array, and the GPU index return the 
     * corresponding linear index from the beginning of the input buffer.
     * 
     * Parameters:
     *      factors     input:  pointer to cufftXt1dFactors as returned by 
     *                          cufftXtQueryPlan
     *      permutedIx  input:  index of the desired element in the device output
     *                          array
     *      linearIx    output: index of the corresponding input element in the  
     *                          host array
     *      GPUix       input:  index of the GPU containing the desired element
     */
    cufftResult permuted2Linear( cufftXt1dFactors * factors,
                                 size_t permutedIx,
                                 size_t *linearIx,
                                 int GPUIx ) {
        size_t indexInSubstring;
        size_t whichString;
        size_t whichSubstring;
        // the low order bits of the permuted index match those of the linear index
        indexInSubstring = permutedIx & factors->substringMask;
        // the next higher bits are the substring index
        whichSubstring = (permutedIx >> factors->substringShift) &
                          factors->factor2Mask;
        // the next higher bits are the string index on this GPU
        whichString = (permutedIx >> factors->stringShift) & factors->stringMask;
        // now adjust the index for the second GPU
        if (GPUIx) {
            whichString += factors->stringCount/2;
        }
        // linear index low order bits are the same
        // next higher linear index bits are the string index
        *linearIx = indexInSubstring + ( whichString << factors->substringShift );
        // next higher bits of linear address are the substring index
        *linearIx += whichSubstring << factors->factor1Shift;
        return CUFFT_SUCCESS;
    }
    
        

    /* 
     * Given a linear index into a 1D array, return the GPU containing the permuted 
     * result, and index from the start of the data buffer for that element.
     * 
     * Parameters:
     *      factors     input:  pointer to cufftXt1dFactors as returned by 
     *                          cufftXtQueryPlan
     *      linearIx    input:  index of the desired element in the host input 
     *                          array
     *      permutedIx  output: index of the corresponding result in the device 
     *                          output array
     *      GPUix       output: index of the GPU containing the result
     */
    cufftResult linear2Permuted( cufftXt1dFactors * factors,
                                 size_t linearIx,
                                 size_t *permutedIx,
                                 int *GPUIx ) {
        size_t indexInSubstring;
        size_t whichString;
        size_t whichSubstring;
        size_t whichStringMask;
        int whichStringShift;
        if (linearIx >= factors->size) {
            return CUFFT_INVALID_VALUE;
        }
        // get a useful additional mask and shift count
        whichStringMask = factors->stringCount -1;
        whichStringShift = (factors->factor1Shift + factors->factor2Shift) -
                            factors->stringShift ;
        // the low order bits identify the index within the substring
        indexInSubstring = linearIx & factors->substringMask;
        // first determine which string has our linear index.
        // the low order bits identify the index within the substring.
        // the next higher order bits identify which string.
        whichString = (linearIx >> factors->substringShift) & whichStringMask;
        // the first stringCount/2 strings are in the first GPU, 
        // the rest are in the second.
        *GPUIx = whichString/(factors->stringCount/2);
        // next determine which substring within the string has our index
        // the substring index is in the next higher order bits of the index
        whichSubstring = (linearIx >>(factors->substringShift + whichStringShift)) &
                          factors->factor2Mask;
        // now we can re-assemble the index 
        *permutedIx = indexInSubstring;
        *permutedIx += whichSubstring << factors->substringShift;
        if ( !*GPUIx ) {
            *permutedIx += whichString << factors->stringShift;
        } else {
            *permutedIx += (whichString - (factors->stringCount/2) ) <<
                            factors->stringShift;
        }
        return CUFFT_SUCCESS;
    }
    
        

    6. FFTW Conversion Guide

    cuFFT differs from FFTW in that FFTW has many plans and a single execute function while cuFFT has fewer plans, but multiple execute functions. The cuFFT execute functions determine the precision (single or double) and whether the input is complex or real valued. The following table shows the relationship between the two interfaces.

    FFTW function cuFFT function
    fftw_plan_dft_1d(), fftw_plan_dft_r2c_1d(), fftw_plan_dft_c2r_1d() cufftPlan1d()
    fftw_plan_dft_2d(), fftw_plan_dft_r2c_2d(), fftw_plan_dft_c2r_2d() cufftPlan2d()
    fftw_plan_dft_3d(), fftw_plan_dft_r2c_3d(), fftw_plan_dft_c2r_3d() cufftPlan3d()
    fftw_plan_dft(), fftw_plan_dft_r2c(), fftw_plan_dft_c2r() cufftPlanMany()
    fftw_plan_many_dft(), fftw_plan_many_dft_r2c(), fftw_plan_many_dft_c2r() cufftPlanMany()
    fftw_execute() cufftExecC2C(), cufftExecZ2Z(), cufftExecR2C(), cufftExecD2Z(), cufftExecC2R(), cufftExecZ2D()
    fftw_destroy_plan() cufftDestroy()

    7. FFTW Interface to cuFFT

    NVIDIA provides FFTW3 interfaces to the cuFFT library. This allows applications using FFTW to use NVIDIA GPUs with minimal modifications to program source code. To use the interface, first do the following:

    • It is recommended that you replace the include file fftw3.h with cufftw.h
    • Instead of linking with the double/single precision FFTW libraries (fftw3/fftw3f), link with both the cuFFT and cuFFTW libraries
    • Ensure the search path includes the directory containing cuda_runtime_api.h
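
    As a sketch of the build-line change this implies (compiler and install paths are assumptions; adjust for your system):

```shell
# Before: linking against FFTW
#   gcc app.c -o app -lfftw3 -lfftw3f -lm
#
# After: include cufftw.h instead of fftw3.h, then link both the cuFFTW
# and cuFFT libraries (the CUDA runtime comes in as a dependency):
#   gcc app.c -o app -I/usr/local/cuda/include \
#       -L/usr/local/cuda/lib64 -lcufftw -lcufft -lcudart -lm
```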

    After an application is working using the FFTW3 interface, users may want to modify their code to move data to and from the GPU and use the routines documented in the FFTW Conversion Guide for the best performance.

    The following tables show which components and functions of FFTW3 are supported in cuFFT.

    Section in FFTW manual Supported Unsupported
    Complex numbers fftw_complex, fftwf_complex types  
    Precision double fftw3, single fftw3f long double fftw3l, quad precision fftw3q are not supported since CUDA functions operate on double- and single-precision floating-point quantities
    Memory Allocation   fftw_malloc(), fftw_free(), fftw_alloc_real(), fftw_alloc_complex(), fftwf_alloc_real(), fftwf_alloc_complex()
    Multi-threaded FFTW   fftw3_threads, fftw3_omp are not supported
    Distributed-memory FFTW with MPI   fftw3_mpi,fftw3f_mpi are not supported

    Note that for each of the double precision functions below there is a corresponding single precision version with the letters fftw replaced by fftwf.

    Section in FFTW manual Supported Unsupported
    Using Plans fftw_execute(), fftw_destroy_plan(), fftw_cleanup(), fftw_print_plan() fftw_cost(), fftw_flops() exist but are not functional
    Basic Interface    
    Complex DFTs fftw_plan_dft_1d(), fftw_plan_dft_2d(), fftw_plan_dft_3d(), fftw_plan_dft()  
    Planner Flags   Planner flags are ignored and the same plan is returned regardless
    Real-data DFTs fftw_plan_dft_r2c_1d(), fftw_plan_dft_r2c_2d(), fftw_plan_dft_r2c_3d(), fftw_plan_dft_r2c(), fftw_plan_dft_c2r_1d(), fftw_plan_dft_c2r_2d(), fftw_plan_dft_c2r_3d(), fftw_plan_dft_c2r()  
    Real-data DFT Array Format   Not supported
    Real-to-Real Transforms   Not supported
    Real-to-Real Transform Kinds   Not supported
    Advanced Interface    
    Advanced Complex DFTs fftw_plan_many_dft() with multiple 1D, 2D, 3D transforms fftw_plan_many_dft() with 4D or higher transforms or a 2D or higher batch of embedded transforms
    Advanced Real-data DFTs fftw_plan_many_dft_r2c(), fftw_plan_many_dft_c2r() with multiple 1D, 2D, 3D transforms fftw_plan_many_dft_r2c(), fftw_plan_many_dft_c2r() with 4D or higher transforms or a 2D or higher batch of embedded transforms
    Advanced Real-to-Real Transforms   Not supported
    Guru Interface    
    Interleaved and split arrays Interleaved format Split format
    Guru vector and transform sizes fftw_iodim struct  
    Guru Complex DFTs fftw_plan_guru_dft(), fftw_plan_guru_dft_r2c(), fftw_plan_guru_dft_c2r() with multiple 1D, 2D, 3D transforms fftw_plan_guru_dft(), fftw_plan_guru_dft_r2c(), fftw_plan_guru_dft_c2r() with 4D or higher transforms or a 2D or higher batch of transforms
    Guru Real-data DFTs   Not supported
    Guru Real-to-real Transforms   Not supported
    64-bit Guru Interface   Not supported
    New-array Execute Functions fftw_execute_dft(), fftw_execute_dft_r2c(), fftw_execute_dft_c2r() with interleaved format Split format and real-to-real functions
    Wisdom   fftw_export_wisdom_to_file(), fftw_import_wisdom_from_file() exist but are not functional. Other wisdom functions do not have entry points in the library.

    Deprecated Functionality

    Function cufftSetCompatibilityMode is deprecated.

    Notices

    Notice

    ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

    Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

    Trademarks

    NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.



    FFTW and cuFFT Usage Notes

    2021-10-31 20:52:15

    It has been a while since I wrote anything; today, feeling inspired, I will summarize how to call FFTW and cuFFT.

    A review of some fundamentals

    Understanding the relationship among the FT, DTFT, and DFT

    The Fourier transform (FT) expresses a function, under certain conditions, as a linear combination of trigonometric functions or their integrals. It maps a continuous domain to a continuous domain.

    The discrete-time Fourier transform (DTFT) transforms a function of the discrete time variable nT into a continuous frequency domain; the spectrum is periodically extended.

    The discrete Fourier transform (DFT) turns samples of a time-domain signal into samples of its DTFT spectrum. Formally, both ends are finite-length sequences; in practice, both sequences should be regarded as the principal-value sequences of discrete periodic signals. Even when applying the DFT to a finite-length discrete signal, it should be viewed as periodically extended into a periodic signal before the transform.

    Ultimately, all of this is so that a computer can handle the Fourier transform, which requires the signal to be discrete in both the time and frequency domains.

    From the discrete Fourier series to the discrete Fourier transform (with the derivation)

    Treat the discrete aperiodic signal to be processed as the principal-value sequence of a discrete periodic signal, apply the DFS to it, and take the principal-value sequence of the result (the DFS output is itself periodic).

    Steps for computing a circular convolution:

    1. Construct periodic sequences from the finite-length sequences
    2. Compute the periodic convolution
    3. Take the principal value of the periodic convolution

    Circular convolution requires the two sequences to have the same length, that is, the same period.

    Using circular convolution to compute a linear convolution: for two sequences of lengths N and M, take L >= N + M - 1 as the common period, zero-pad both sequences to length L, compute the periodic convolution, and take the principal-value sequence; that circular convolution equals the linear convolution.

    From the DFT to the FFT:

    1. Split the samples into even and odd halves;
    2. The DFT is the principal-value sequence of the DFS and is implicitly periodic, so the other half of the results can be obtained cheaply;
    3. This brings the complexity down from quadratic to O(n log n);

    An FFT implementation I once found online makes this concrete, and I understand it better now:

    1. First apply a bit-reversal permutation to the input positions, using bit operations and dynamic programming;
    2. Then compute stage by stage; in that code, h denotes half of the length currently being processed;

    FFTW Implementation

    FFTW is an open-source fft library. First download the Windows package from the official site, then generate the .lib files with the VS Command Prompt tool: cd into the unpacked directory and run lib /machine:x64 /def:libfftw3l-3.def. One pitfall here: on Linux you can cd straight into the target directory, but on Windows you must first switch to the D: drive and only then cd.

    Basic workflow

    1. Create the plan handle; it can be thought of as an object that records the input (in) and output (out) addresses;
    2. Initialize the data
    3. Execute
    4. Destroy
    #include <fftw3.h>
    ...
    {
    fftw_complex *in, *out;
    fftw_plan p;
    ...
    in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    ...
    fftw_execute(p); /* repeat as needed */
    ...
    fftw_destroy_plan(p);
    fftw_free(in); fftw_free(out);
    }
    

    A quick test

    Run an 8-point FFT, once through the real-input interface and once through the complex interface:

    #include <iostream>
    #include <cstdio>
    
    #include "fftw3.h"
    #pragma comment(lib, "libfftw3-3.lib")  // double version
    // #pragma comment(lib, "libfftw3f-3.lib") // float version
    // #pragma comment(lib, "libfftw3l-3.lib") // long double version
    
    
    const double PI = acos(-1); // a neat way to define pi
    
    
    int main() {
    	int len = 8;
    
    	// To use the float version, link the float .lib first, then append an 'f'
    	// suffix to the fftw names.
    	double *in = NULL;
    	fftw_complex *out = NULL;  // fftwf_complex --> the float version
    	fftw_complex *in2 = NULL;
    	fftw_complex *out2 = NULL; // fftwf_complex --> the float version
    	fftw_plan p, p2;
    
    	// Allocate the buffers
    	in = (double *)fftw_malloc(sizeof(double) * len);
    	out = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) * len);
    	in2 = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) * len);
    	out2 = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) * len);
    
    
    	// Create the plans
    	p = fftw_plan_dft_r2c_1d(len, in, out, FFTW_ESTIMATE);
    	p2 = fftw_plan_dft_1d(len, in2, out2, FFTW_FORWARD, FFTW_ESTIMATE);
    
    	// Input
    	printf("in:\n");
    	for (int i = 0; i < len; i++)
    	{
    		//in[i] = sin(2 * PI * dx*i) + sin(4 * PI * dx*i);
    		in[i] = i + 1;
    		printf("%.2f ", in[i]);
    	}
    	printf("\n\n");
    
    	printf("in2:\n");
    	for (int i = 0; i < len; i++)
    	{
    		in2[i][0] = i + 1;
    		in2[i][1] = 0;
    		printf("%.2f %.2f\n", in2[i][0], in2[i][1]);
    	}
    	printf("\n\n");
    
    	// Execute
    	fftw_execute(p);
    	fftw_execute(p2);
    
    	// Output
    	printf("out1:\n");
    	for (int i = 0; i < len; i++)
    	{
    		printf("%.5f ,%.5f  \n", out[i][0], out[i][1]);
    	}
    	printf("\n");
    
    	printf("out2:\n");
    	for (int i = 0; i < len; i++)
    	{
    		printf("%.5f ,%.5f  \n", out2[i][0], out2[i][1]);
    	}
    	printf("\n");
    
    	// Release the resources
    	fftw_destroy_plan(p);
    	fftw_free(in);
    	fftw_free(out);
    
    	fftw_destroy_plan(p2);
    	fftw_free(in2);
    	fftw_free(out2);
    
    	//system("pause");
    	return 0;
    
    }
    

    For a long data file, the FFT should be applied segment by segment; what you get is effectively a short-time Fourier transform (STFT). A plain Fourier spectrum carries no time information, while the STFT does.

    cuFFT Implementation

    The idea behind cuFFT is essentially the same as FFTW's; the only real difference is that with cuFFT the data must be copied from the CPU to the GPU, computed there, and finally copied back from the GPU to the CPU:

    1. Create the handle and allocate memory
    2. Initialize the data and copy it CPU -> GPU
    3. Execute
    4. Copy the result GPU -> CPU
    5. Destroy
    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"
    #include "cufft.h"
    
    #include <cstdio>
    #include <cstdlib>
    
    
    // Store the result once so the call is not evaluated twice.
    // Note: the cuFFT calls below return cufftResult rather than cudaError_t;
    // the comparison still works because CUFFT_SUCCESS is also 0.
    #define CHECK(call)\
    {\
    	const int err = (int)(call);\
    	if (err != cudaSuccess)\
    	{\
    		printf("Error: %s:%d, ", __FILE__, __LINE__);\
    		printf("code:%d, reason: %s\n", err, cudaGetErrorString(cudaGetLastError()));\
    		exit(1);\
    	}\
    }
    
    
    int main() {
    
    	const int NX = 8;
    	const int BATCH = 1;
    
    	cufftHandle plan;
    	cufftComplex *data;
    	cufftComplex *data_cpu;
    
    	data_cpu = (cufftComplex *)malloc(sizeof(cufftComplex) * NX * BATCH);
    	if (data_cpu == NULL) return -1;
    
    	CHECK(cudaMalloc((void**)&data, sizeof(cufftComplex) * NX * BATCH));
    	
    	CHECK(cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH)); 
    
    	// Input data
    	for (int i = 0; i < NX; ++i) {
    		data_cpu[i].x = i + 1;
    		data_cpu[i].y = 0;
    	}
    
    	// Copy the data CPU -> GPU
    	CHECK(cudaMemcpy(data, data_cpu, sizeof(cufftComplex) * NX * BATCH, cudaMemcpyHostToDevice));
    	CHECK(cudaDeviceSynchronize());
    	
    	CHECK(cufftExecC2C(plan, data, data, CUFFT_FORWARD)); 
    	//CHECK(cufftExecC2C(plan, data, data, CUFFT_INVERSE));
    
    	// Copy the result GPU -> CPU
    	CHECK(cudaMemcpy(data_cpu, data, sizeof(cufftComplex) * NX * BATCH, cudaMemcpyDeviceToHost));
    	CHECK(cudaDeviceSynchronize());
    	
    
    	cufftDestroy(plan);
    	cudaFree(data);
    
    	printf("CUFFT_FORWARD:\n");
    	for (int i = 0; i < NX; ++i) {
    		printf("%f , %f\n", data_cpu[i].x, data_cpu[i].y);
    	}
    	free(data_cpu);
    
    	system("pause");
    	return 0;
    
    }
    


    Preface

    The Fourier transform (FT) is a very important mathematical tool in digital signal processing, used to take a signal from the time domain to the frequency domain.

    The discrete Fourier transform (DFT) is the form the continuous Fourier transform takes in a discrete system. Because of its large computational cost, its use was severely limited for a long time.

    The fast Fourier transform (FFT) is a method for rapidly computing the DFT of a sequence, or its inverse. It became widely known after Cooley and Tukey published it in 1965. It reduces the complexity of computing the DFT from the O(n^2) required by the naive definition to O(n log n), where n is the data size, which greatly sped up the DFT and let it flourish in practical applications.

    On the CPU, the FFT is usually computed by calling the FFTW library; on the GPU, by calling the cuFFT library.

    Choosing optimal sizes

    The cuFFT documentation states this explicitly; note the first bullet:

    Algorithms highly optimized for input sizes that can be written in the form 2^a × 3^b × 5^c × 7^d. In general the smaller the prime factor, the better the performance, i.e., powers of two are fastest.

    In other words, to get good performance out of cuFFT, the size of the data should be a product of powers of the primes 2, 3, 5, and 7; sizes that are pure powers of two are fastest. (This property ultimately follows from the divide-and-conquer structure of the Cooley-Tukey algorithm.)

    In practice, this performance rule should be followed strictly.

    On the other hand, it is worth asking how cuFFT behaves when the data size is not purely a product of powers of 2, 3, 5, and 7.

    The documentation addresses this as well:

    • When the size factors only into the primes 2, 3, 5, and 7, cuFFT uses specialized, highly optimized Cooley-Tukey kernels; this is the most efficient path;
    • When the size also contains other prime factors smaller than 128, cuFFT falls back to a generic Cooley-Tukey implementation;
    • When the size contains prime factors of 128 or larger, cuFFT uses the Bluestein algorithm instead. Note that Bluestein requires more computation than Cooley-Tukey, and its accuracy is lower.

    A Stack Overflow answer describes cuFFT's performance behavior in similarly concrete terms.

    Therefore, when computing Fourier transforms with cuFFT, adjust the data size toward the prime-factor combinations described in the documentation wherever possible, so that the fastest algorithm is selected.

    If transforms of many different sizes must be computed together, try especially hard to keep cuFFT off the Bluestein path: every Bluestein invocation performs several implicit device-memory allocations and frees, which seriously degrades performance.

    dft in OpenCV

    OpenCV implements a CPU-based dft (cv::dft) and a CUDA-based dft (cv::cuda::dft), and it uses the same optimal-size idea:

    namespace cv
    {
    
    static const int optimalDFTSizeTab[] = {
    1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 25, 27, 30, 32, 36, 40, 45, 48,
    50, 54, 60, 64, 72, 75, 80, 81, 90, 96, 100, 108, 120, 125, 128, 135, 144, 150, 160,
    162, 180, 192, 200, 216, 225, 240, 243, 250, 256, 270, 288, 300, 320, 324, 360, 375,
    384, 400, 405, 432, 450, 480, 486, 500, 512, 540, 576, 600, 625, 640, 648, 675, 720,
    729, 750, 768, 800, 810, 864, 900, 960, 972, 1000, 1024, 1080, 1125, 1152, 1200,
    1215, 1250, 1280, 1296, 1350, 1440, 1458, 1500, 1536, 1600, 1620, 1728, 1800, 1875,
    1920, 1944, 2000, 2025, 2048, 2160, 2187, 2250, 2304, 2400, 2430, 2500, 2560, 2592,
    2700, 2880, 2916, 3000, 3072, 3125, 3200, 3240, 3375, 3456, 3600, 3645, 3750, 3840,
    3888, 4000, 4050, 4096, 4320, 4374, 4500, 4608, 4800, 4860, 5000, 5120, 5184, 5400,
    5625, 5760, 5832, 6000, 6075, 6144, 6250, 6400, 6480, 6561, 6750, 6912, 7200, 7290,
    7500, 7680, 7776, 8000, 8100, 8192, 8640, 8748, 9000, 9216, 9375, 9600, 9720, 10000,
    10125, 10240, 10368, 10800, 10935, 11250, 11520, 11664, 12000, 12150, 12288, 12500,
    12800, 12960, 13122, 13500, 13824, 14400, 14580, 15000, 15360, 15552, 15625, 16000,
    16200, 16384, 16875, 17280, 17496, 18000, 18225, 18432, 18750, 19200, 19440, 19683,
    20000, 20250, 20480, 20736, 21600, 21870, 22500, 23040, 23328, 24000, 24300, 24576,
    25000, 25600, 25920, 26244, 27000, 27648, 28125, 28800, 29160, 30000, 30375, 30720,
    31104, 31250, 32000, 32400, 32768, 32805, 33750, 34560, 34992, 36000, 36450, 36864,
    37500, 38400, 38880, 39366, 40000, 40500, 40960, 41472, 43200, 43740, 45000, 46080,
    46656, 46875, 48000, 48600, 49152, 50000, 50625, 51200, 51840, 52488, 54000, 54675,
    55296, 56250, 57600, 58320, 59049, 60000, 60750, 61440, 62208, 62500, 64000, 64800,
    65536, 65610, 67500, 69120, 69984, 72000, 72900, 73728, 75000, 76800, 77760, 78125,
    78732, 80000, 81000, 81920, 82944, 84375, 86400, 87480, 90000, 91125, 92160, 93312,
    93750, 96000, 97200, 98304, 98415, 100000, 101250, 102400, 103680, 104976, 108000,
    ···
    };
    }
    
    int cv::getOptimalDFTSize( int size0 )
    {
        int a = 0, b = sizeof(optimalDFTSizeTab)/sizeof(optimalDFTSizeTab[0]) - 1;
        if( (unsigned)size0 >= (unsigned)optimalDFTSizeTab[b] )
            return -1;
    
        while( a < b )
        {
            int c = (a + b) >> 1;
            if( size0 <= optimalDFTSizeTab[c] )
                b = c;
            else
                a = c+1;
        }
    
        return optimalDFTSizeTab[b];
    }
    

    Copyright

    This is an original article, published exclusively at blog.csdn.net/TracelessLe. Reproduction without the author's permission is prohibited. For help, email tracelessle@163.com.

    A Brief Look at cuFFT

    2019-05-18 14:07:54

    1. Workflow
    Create a handle with cufftHandle.
    Configure the handle with cufftPlan1d(), cufftPlan2d(), cufftPlan3d(), or cufftPlanMany(); this mainly sets the signal length, the signal type, the storage layout in memory, and so on.
    cufftPlan1d(): a single 1-D signal
    cufftPlan2d(): a single 2-D signal
    cufftPlan3d(): a single 3-D signal
    cufftPlanMany(): several signals transformed at once
    Execute the fft with the cufftExec() functions.
    Release the GPU resources with cufftDestroy().
    2. FFT of a single 1-D signal
    Suppose the signal data_dev to be transformed has length N and has already been transferred to GPU memory, and that its element type is cufftComplex. The host-side buffer can be produced as follows:

        cufftComplex *data_Host = (cufftComplex*)malloc(NX*BATCH*sizeof(cufftComplex)); // host-side data pointer
    
        // Initialize the data
        for (int i = 0; i < NX; i++)
        {
            data_Host[i].x = float((rand() * rand()) % NX) / NX;
            data_Host[i].y = float((rand() * rand()) % NX) / NX;
        }


    Then copy the host-side data_Host to the device-side data_dev with cudaMemcpy(), and run the fft like this:

        cufftHandle plan; // create the cuFFT handle
        cufftPlan1d(&plan, N, CUFFT_C2C, BATCH);
        cufftExecC2C(plan, data_dev, data_dev, CUFFT_FORWARD); // run cuFFT (forward transform)


    cufftPlan1d():

    The first argument is the cuFFT handle to configure;
    the second is the length of the signal to transform;
    the third, CUFFT_C2C, means the input and output of the fft are both complex; CUFFT_C2R means complex input, real output; CUFFT_R2C means real input, complex output (cuFFT has no real-to-real type; the double-precision counterparts are CUFFT_Z2Z, CUFFT_D2Z, and CUFFT_Z2D);
    the fourth, BATCH, is the number of signals to transform; newer versions use cufftPlanMany() to run several signals' ffts at once.
    cufftExecC2C()

    The first argument is the configured cuFFT handle;
    the second is the address of the input signal;
    the third is the address of the output signal;
    the fourth is CUFFT_FORWARD for the forward fft and CUFFT_INVERSE for the inverse fft.
    Note that after an inverse fft, every value in the result must be multiplied by 1/N.

    3. FFT of several 1-D signals
    To run ffts over several signals at once you need the cufftPlanMany function. It has quite a few parameters, which deserve a closer look:

    cufftPlanMany(cufftHandle *plan, int rank, int *n, 
                  int *inembed, int istride, int idist, 
                  int *onembed, int ostride, int odist, 
                  cufftType type, int batch);


    To make the description precise, the original post refers to a figure of the input layout in memory (not reproduced here): the data is stored row-major in a matrix with 16 columns; each column holds one signal; the last four, grayed-out columns are irrelevant data; and each of the first 12 columns is to be transformed independently.

    plan: the cufft handle
    rank: the dimensionality of each signal: 1 for 1-D, 2 for 2-D, 3 for 3-D signals. For the layout above, rank = 1
    n: the number of rows, columns, and pages of each signal, written as an array. For example, if each signal has (m, n, k) rows, columns, and pages, then int n[rank] = {m, n, k}. For the layout above, int n[1] = {5}
    inembed: the [pages, columns, rows] of the input data for 3-D signals; [columns, rows] for 2-D; [rows] for 1-D. inembed[0] is ignored, so inembed here may be {0}, {1}, {2}, and so on.
    istride: the distance between two adjacent elements of one input signal; here istride = 16 (adjacent elements of a signal are 16 apart)
    idist: the distance between the starting elements of two consecutive input signals; here idist = 1 (the first elements of the first and second signals are 1 apart). If each row of the layout were treated as one signal instead, idist would be 16
    onembed: the [pages, columns, rows] of the output data, with the same conventions as inembed; onembed[0] is likewise ignored.
    ostride: the distance between two adjacent elements of one output signal; here ostride = 16
    odist: the distance between the starting elements of two consecutive output signals; here odist = 1 (or 16 if each row were one signal)
    The formulas below give the index of element [z][y][x] (column z, row y, page x) of the b-th signal (because of how arrays are declared in C and C++, array[X][Y][Z] has X pages, Y rows, and Z columns):

    ‣ 1D

    input[ b * idist + x * istride ] 
    output[ b * odist + x * ostride ]

    ‣ 2D

    input[ b * idist + (x * inembed[1] + y) * istride ] 
    output[ b * odist + (x * onembed[1] + y) * ostride ]

    ‣ 3D

    input[b * idist + (x * inembed[1] * inembed[2] + y * inembed[2] + z) * istride] 
    output[b * odist + (x * onembed[1] * onembed[2] + y * onembed[2] + z) * ostride]
    
        /* Create the cufft handle */
        cufftHandle plan_Nfft_Many; // cuFFT handle
        const int rank = 1; // 1-D fft
        int n[rank] = { Nfft }; // each signal to transform has length Nfft
        int inembed[1] = { 0 }; // input [pages, columns, rows] (3-D); [columns, rows] (2-D)
        int onembed[1] = { 0 }; // output [pages, columns, rows]; [columns, rows] (2-D)
        int istride = NXWITH0; // distance between adjacent elements of one input signal
        int idist = 1; // distance between the first elements of two input signals
        int ostride = NXWITH0; // distance between adjacent elements of one output signal
        int odist = 1; // distance between the first elements of two output signals
        int batch = NX; // number of signals to transform
        cufftPlanMany(&plan_Nfft_Many, rank, n, inembed, istride, idist, onembed, ostride, odist, CUFFT_C2C, batch);
    
        /* Core part */
    
        cudaMemcpy(data_dev, data_Host, Nfft * NXWITH0 * sizeof(cufftComplex), cudaMemcpyHostToDevice);
        cufftExecC2C(plan_Nfft_Many, data_dev, data_dev, CUFFT_FORWARD); // run cuFFT (forward)
        cufftExecC2C(plan_Nfft_Many, data_dev, data_dev, CUFFT_INVERSE); // run cuFFT (inverse)
        CufftComplexScale<<<dimGrid2D_NXWITH0_Nfft, dimBlock2D>>>(data_dev, data_dev, 1.0f / Nfft); // scale by 1/Nfft
        cudaMemcpy(resultIFFT, data_dev, Nfft * NXWITH0 * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

     

    References
    The official CUDA document "CUFFT Library"

    https://blog.csdn.net/endlch/article/details/46724811
