• Sunando Sengupta

    I am a computer vision scientist at Vicon Motion Systems, with interests in computer vision, robotics and machine learning. This is my research blog. Please leave any comments, suggestions or queries.

Install Theano on Windows 8.1 with Visual Studio 2013, CUDA 7.5

Theano is a Python deep learning library that can run on a GPGPU. These install instructions worked on my machine; for details, please refer to http://deeplearning.net/software/theano/install_windows.html and http://deeplearning.net/software/theano/install.html#install

Machine: Windows 8.1 (64-bit), Visual Studio 2013 (CUDA 7.5 does not currently support VS2015), CUDA 7.5, GeForce GT 650M

The Windows installation that worked for me was as follows:

  1. Install 64-bit Python
  2. Install the precompiled .whl packages from http://www.lfd.uci.edu/~gohlke/pythonlibs
    1. Numpy+MKL library
    2. scipy
    3. nose
    4. blas (cvxopt)
    5. pycuda
  3. Get GCC from http://tdm-gcc.tdragon.net/
  4. Download Theano from https://codeload.github.com/Theano/Theano/zip/master
    1. Alternatively, you can use pip install theano
  5. cd into the Theano directory
  6. Run python setup.py develop (install is another option, but I wanted a development environment)
  7. Open the Python shell; import theano should now work
  8. Check for GPU usage: Use the script from http://deeplearning.net/software/theano/tutorial/using_gpu.html#testing-theano-with-gpu
    1. If it is using the CPU, create a .theanorc.txt file in the C:\Users\UserName folder with the following options:


      [global]
      device = gpu
      floatX = float32

      [nvcc]
      fastmath = True
      compiler_bindir=C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\cl.exe

    2. Run the test script again; it should now be using the GPU
  9. Common errors faced while using the GPU (I faced them in order)
    1. nvcc cannot find cl.exe. Solution: add the location of cl.exe to the PATH environment variable.
    2. nvcc fatal : Microsoft Visual Studio configuration file 'vcvars64.bat' could not be found. Solution: copy the folder C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\x86_amd64 to C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64 (note the change in path) and create the file 'vcvars64.bat' containing the command 'CALL setenv /x64'.
    3. nvcc fatal : some include or lib is missing. nvcc requires some Windows SDK includes and libraries, so you might need to install the Microsoft SDK. Solution: I added these lines to the file "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\bin\nvcc.profile":

      INCLUDES += "-I$(TOP)/include" $(_SPACE_) "-IC:/Program Files (x86)/Microsoft Visual Studio 12.0/VC/include" $(_SPACE_) "-IC:\Program Files (x86)\Microsoft SDKs\Windows\v7.1A\Include" $(_SPACE_)

      LIBRARIES =+ $(_SPACE_) "/LIBPATH:$(TOP)/lib/$(_WIN_PLATFORM_)" $(_SPACE_) "/LIBPATH:C:/Program Files (x86)/Microsoft Visual Studio 12.0/VC/lib/amd64" $(_SPACE_) "/LIBPATH:C:\Program Files (x86)\Microsoft SDKs\Windows\v7.1A\Lib\x64" $(_SPACE_)

  10. Finally, my GPU was being used by Theano. This might be a slightly dirty way to get Theano to use my GPU, so please let me know if there are better options.
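
To double-check the configuration file from step 8, it can help to generate it and read it back with a script. The following is only a minimal sketch using the Python standard library. It assumes the options belong in [global] and [nvcc] sections, as in the Theano configuration documentation; write_theanorc and read_theanorc are hypothetical helper names (not part of Theano), and a temporary folder stands in for C:\Users\UserName.

```python
import configparser
import os
import tempfile


def write_theanorc(path, compiler_bindir):
    """Write a minimal Theano config file (hypothetical helper, not part of Theano)."""
    config = configparser.ConfigParser()
    config.optionxform = str  # keep option names such as floatX case-sensitive
    config["global"] = {"device": "gpu", "floatX": "float32"}
    config["nvcc"] = {"fastmath": "True", "compiler_bindir": compiler_bindir}
    with open(path, "w") as handle:
        config.write(handle)


def read_theanorc(path):
    """Read the file back so the settings can be inspected."""
    config = configparser.ConfigParser()
    config.optionxform = str
    config.read(path)
    return {section: dict(config[section]) for section in config.sections()}


# Assumption: a temporary folder stands in for the user's home folder.
demo_path = os.path.join(tempfile.mkdtemp(), ".theanorc.txt")
write_theanorc(demo_path, r"C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\cl.exe")
settings = read_theanorc(demo_path)
print(settings["global"]["device"], settings["global"]["floatX"])
```

To use it for real, pass the path of .theanorc.txt in your own user folder and the compiler location of your own Visual Studio install.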

Optimize Cuda for Beginners

We all think: take the CPU code, put it on the GPU and voilà, you have a 10x faster application. In reality the process is a bit more involved. My very first CUDA code was 3 times slower than the normal single-core CPU version; with some simple steps it became 10x faster, and I would like to share those steps in this post. Let me assume we are all beginners here and want to make our existing lousy CPU code fast. Before we begin, it is always better to go through the CUDA guidelines, the CUDA programming practices, etc. on the Nvidia site; they are a really good source and worth visiting. And PLEASE have a look at the memory layout of your generation of graphics card.

The Golden Rule

(taken from http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#axzz31nBBo0da)

1. Analyse your existing code. Run the Visual Studio profiler (it comes with most versions of VS2012). Find the CPU part that takes the most time. Attack that part.

2. Once you have identified that hard/slow bit, think about how to parallelise it. If it is image-processing related, it can be parallelised at the pixel level, the image level or over groups of pixels. Write an appropriate kernel for that operation.

3. Now comes the tricky part: optimise. There are a few easy steps which will readily give you a speedup.

Step 1. Check the launch configuration. This can be done easily by looping over the kernel with different launch parameters.

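cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
for (int p = 1; p < MAX_GRID; p++) {      // loop over kernel launch configs
  for (int q = 1; q < MAX_THREAD; q++) {
    cudaEventRecord(start, 0);
    Kernel<<<p, q>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);           // wait until the kernel has finished
    cudaEventElapsedTime(&time, start, stop);
    printf("grid %d, block %d, kernel time: %f ms\n", p, q, time);
  }
}
cudaEventDestroy(start);
cudaEventDestroy(stop);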

Take the configuration with the lowest time; generally it is one where the number of threads is a multiple of 32. Hopefully by this step you will have a reasonable occupancy for the GPU. The occupancy can be viewed in the Nvidia Visual Profiler through the kernel latency analysis.
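
The search can be narrowed by only trying block sizes that are multiples of the warp size (32 threads) and deriving the grid size from the problem size. Here is a minimal sketch of that arithmetic in Python. It assumes a hypothetical 1D problem with one thread per element and a limit of 1024 threads per block (check the actual limit of your device); candidate_launch_configs is a made-up name for illustration.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def candidate_launch_configs(num_elements, max_threads_per_block=1024):
    """Return (blocks, threads) pairs that cover num_elements, one thread each.

    Hypothetical helper: max_threads_per_block is an assumed device limit.
    """
    configs = []
    for threads in range(WARP_SIZE, max_threads_per_block + 1, WARP_SIZE):
        blocks = (num_elements + threads - 1) // threads  # ceiling division
        configs.append((blocks, threads))
    return configs


# Example: one thread per pixel of a 640 x 480 image.
configs = candidate_launch_configs(640 * 480)
print(len(configs), configs[0], configs[-1])
```

Each pair can then be timed with the event-based loop above instead of trying every possible combination.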

Step 2. Compiler flags

In the VS editor, Project Properties->CUDA C++->Device:

change 'max used register' to 65536 (or whatever maximum your device allows; you can find it on the Nsight device page)

In the VS editor, Project Properties->CUDA C++->Host:

use fast maths: -use_fast_math

change the optimization to /O2

Step 3. Open the awesome nvvp profiler and run your exe within it.

Do the full analysis, and then the kernel analysis.

  1. Look at the number of registers your chunkiest kernel is using. Now go to the kernel latency analysis. There, a chart shows how varying the register count changes the number of blocks that can run simultaneously. We want to be at the top of that chart.


Most probably the CUDA compiler will decide the register usage, but if you have some constant values that are read by all the threads, declare them as shared variables. Shared memory is fast, and this reduces the register count.

After tweaking the number of registers, you can check again for the best launch configuration, as in step 1.
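
To see why the register count matters, here is a rough back-of-the-envelope estimate in Python of how registers per thread limit the number of warps that can be resident on one multiprocessor. The limits used (65536 registers and 64 warps per multiprocessor) are assumed values for illustration only, and real hardware allocates registers in chunks, which this sketch ignores. Treat the profiler's numbers as the reference.

```python
WARP_SIZE = 32


def estimated_occupancy(regs_per_thread, regs_per_sm=65536, max_warps_per_sm=64):
    """Estimate occupancy when registers are the only limiting resource.

    regs_per_sm and max_warps_per_sm are assumed device limits; look up the
    real values for your card. Allocation granularity is ignored.
    """
    regs_per_warp = regs_per_thread * WARP_SIZE
    warps_by_registers = regs_per_sm // regs_per_warp
    resident_warps = min(max_warps_per_sm, warps_by_registers)
    return resident_warps / max_warps_per_sm


for regs in (32, 64, 128):
    print(regs, "registers per thread ->", estimated_occupancy(regs))
```

Under these assumptions, doubling the registers per thread beyond 32 halves the estimated occupancy, which is the trend the profiler chart makes visible.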


Step 4. Use asynchronous streams to launch multiple kernels in parallel

Now let us try to improve things further. CUDA provides streams so that multiple kernels can execute simultaneously on devices with CUDA compute capability 2.0 or higher. This is quite powerful and helps significantly. The syntax is easy (short example with 2 streams):

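cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// p, q: launch configuration from step 1; kernels in different streams may overlap
Kernel<<<p, q, 0, stream1>>>();
Kernel<<<p, q, 0, stream2>>>();
// wait till the stream operations are finished
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
cudaStreamDestroy(stream1); cudaStreamDestroy(stream2);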

Now, if the kernel can be divided, you can launch multiple kernels in parallel through multiple streams. A point to note: memory management with streams can be tricky. Reading global memory is fine, but in my experience atomics did not work across streams and led to concurrency errors.
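
Dividing the work between streams mostly comes down to giving each stream its own contiguous chunk of the data. Here is a minimal Python sketch of that bookkeeping, assuming a hypothetical 1D buffer. The (offset, count) pairs would be the arguments handed to each kernel launch, and split_for_streams is a made-up helper name.

```python
def split_for_streams(num_elements, num_streams):
    """Return one (offset, count) chunk per stream, covering all elements.

    The first (num_elements % num_streams) chunks get one extra element.
    """
    base, extra = divmod(num_elements, num_streams)
    chunks = []
    offset = 0
    for stream_index in range(num_streams):
        count = base + (1 if stream_index < extra else 0)
        chunks.append((offset, count))
        offset += count
    return chunks


chunks = split_for_streams(1000003, 2)
print(chunks)
```

Because the chunks do not overlap, each stream writes to its own region of memory.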

A good source for streams: http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf

Hopefully your code will be reasonably fast now, and you can take a break.

Cuda api error while working through a remote desktop on Windows

Very recently I had to run my code on a GPU on a remote machine. I logged in through Windows Remote Desktop and, while trying to execute my code, encountered the error 'No Cuda Device available.' After a bit of googling I found out that Windows Remote Desktop unloads any third-party display drivers, which resulted in my Nvidia drivers being unloaded and hence no CUDA device.

Possible workaround:

Use a different remote-access application such as VNC.

There is a similar problem with OpenGL; the Microsoft social forum has some discussion. Please have a look.



KinectFusion with PCL on windows

This article is to help get KinectFusion going using the KinFu project of the Point Cloud Library (PCL). My system: Windows 7, 64-bit, VS 2010, CUDA 4.2

The KinFu project is in their trunk. So we need to download the sources and compile them. The following steps should be helpful to properly get the kinectfusion working 🙂

Edit – Also got it working in Windows 8. Please see below for windows 8 problems.

1) Install the dependencies

The following dependencies need to be installed. All the installers can be found at http://pointclouds.org/downloads/windows.html

a) Boost 1.50.0

b)  Eigen

c) Flann

d) VTK

e) QT

f) QHull

g) OpenNi both the

h) NVIDIA device drivers and CUDA toolkit. I have tested on GeForce 580 and GeForce 590.

2) Download the latest trunk. Please follow the steps for setting up svn and downloading the latest code. The steps can be found at


3) Run cmake and generate the VS project files. Please make sure that Build_apps, Build_tools, Build_visualizations and Build_gpu are ticked. Issues that you might face when generating the project files:

  • QT_QMAKE_EXECUTABLE not found. Set the qmake path manually, e.g. C:/Qt/4.8.0/bin/qmake.exe
  • It is better to build all the CUDA-related stuff.
  • I had a problem when ticking BUILD_all_in_one_installer. It tries to download the executables for OpenNI and fails. As we have installed the dependencies separately, we don't need this option. So UNTICK BUILD_all_in_one_installer

4) Open PCL.sln and build it. It will take quite a lot of time. The main project of interest is PCL_Kinfu_app. Alternatively, you can build only this project, but it is safer to build the whole of PCL once.

5) Run pcl_kinfu_app_debug.exe / pcl_kinfu_app_release.exe. It should work.


a) The exe threw an exception. Make sure the DLLs are in the proper path. The main DLLs needed are cudart64_42_9.dll, pcl_common_release.dll, pcl_io.dll, pcl_gpu_kinfu.dll, pcl_kdtree.dll, pcl_range_image.dll, pcl_visualization.dll, pcl_gpu_utils.dll, pcl_gpu_octree.dll, pcl_gpu_containers.dll.
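
A quick way to check this before launching the exe is to list which of the DLLs named above are missing from a folder. Below is a minimal Python sketch of that check. The function name is made up, and the temporary demo folder is a placeholder for your own build output directory. It only looks in the folder you give it, not in the rest of the PATH.

```python
import os
import tempfile

REQUIRED_DLLS = [
    "cudart64_42_9.dll", "pcl_common_release.dll", "pcl_io.dll",
    "pcl_gpu_kinfu.dll", "pcl_kdtree.dll", "pcl_range_image.dll",
    "pcl_visualization.dll", "pcl_gpu_utils.dll", "pcl_gpu_octree.dll",
    "pcl_gpu_containers.dll",
]


def missing_dlls(folder, required=REQUIRED_DLLS):
    """Return the required DLL names not present in folder (case-insensitive)."""
    present = {name.lower() for name in os.listdir(folder)}
    return [name for name in required if name.lower() not in present]


# Demo on a temporary folder standing in for the directory containing the exe.
demo_folder = tempfile.mkdtemp()
open(os.path.join(demo_folder, "pcl_io.dll"), "w").close()
print(missing_dlls(demo_folder))
```

An empty list means all the listed DLLs sit next to the exe.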


Can't open depth source

Sometimes Windows installs its own Kinect drivers, and this can prevent the OpenNI drivers from accessing the device. Check the device drivers in Device Manager. If there is a PrimeSense entry with Kinect Camera and Kinect Device inside it, then the problem is somewhere else. If the Kinect appears under Microsoft Kinect, update the drivers for Kinect Camera and Kinect Device, selecting the proper files from SensorKinect (e.g. C:\Program Files\PrimeSense\SensorKinect\Driver). Try again; it should be working.

Error in Windows 8:

It is very common for Windows 8 not to allow the Kinect to communicate with the OpenNI device drivers. This is particularly irritating: firstly, the OpenNI drivers are unsigned and Windows does not allow unsigned drivers; secondly, Windows 8 has inbuilt Kinect drivers that prevent any new drivers from talking to the Kinect. The solution is to disable driver signing in Windows 8 and then install SensorKinect.

I hope it will work then. For any problems in Windows 8, please let me know.

Cuda 5.0 with Visual Studio 2010

This article is to help create a CUDA 5.0 project with Visual Studio 2010. CUDA source files are generally .cu files, and for them to be compiled in Visual Studio you need nvcc (the Nvidia CUDA compiler).

My machine – 64 bit windows7

The basic steps are

1. Install the 64-bit CUDA 5.0.32 SDK (the latest release of CUDA 5.0). It can be found at https://developer.nvidia.com/cuda-downloads

2. Set up a Visual Studio console project.

3. Create a .cu file and add it to your project.

4. Check the build customizations: Project->Build Customizations.

Select the CUDA 5.0 build customization. If the CUDA toolkit is properly installed, the CUDA build customization option should be there. However, if it is not, please copy the .targets and .props files from the location

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\extras\visual_studio_integration\MSBuildExtensions" to "C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\BuildCustomizations"

5. Go to the property page for the file "cudaFile.cu" and change the item type to CUDA C/C++.


6. Add cudart.lib and cuda.lib to the linker input.


You should now be able to compile your empty project. In case of any error, check Project Properties->CUDA C/C++->Device and ensure the code generation is set to compute_10,sm_10


Happy Cuda coding…

Automatic Labelling Environment

The segmentation code provided at http://www.robots.ox.ac.uk/~lubor/ALE.zip is the code which every computer vision guy working in segmentation would love to have. It is very simple to use and runs effortlessly on Windows (apparently it compiles and runs on Linux too, but I have not tried). All you need to do is

1) Use visual studio to open the solution file,

2) create a dataset folder. A sample dataset folder can be downloaded from http://www.robots.ox.ac.uk/~lubor/Msrc.zip. Unzip it and keep it in the same directory as source.

3) Press F5. It should do the segmentation/training and evaluation, so better get a cup of tea.



Photosynth/Bundler in windows


While using this software from http://phototour.cs.washington.edu/bundler/, which is used in Microsoft Photosynth for Structure from Motion, I ran into some compilation issues with VS2008 on Windows 7. It is a very nice piece of software with all the code available to download.

1) Download from http://phototour.cs.washington.edu/bundler/

2)  Hit build, should compile easily.

3) I found a few Windows-related compilation errors, which were easy to fix (e.g. rename the files sysdep1.h0 to sysdep1.h and signal1.h0 to signal1.h).

3.a) You might need the zlib1.dll to be placed in the executable directory. Download it from http://www.dll-files.com/dllindex/dll-files.shtml?zlib1

4) Use David Lowe's SIFT to get the keypoints. Note: if you use the script ToSift.sh, edit it so that it does not gzip its output, as the .gz format varies between Cygwin versions. This removes the need to call functions like gzread/gzopen/gzclose, which might behave differently depending on the zlib1.dll.

5) Use KeyMatch to generate matches and then Bundler for SFM. (go through the readme file)

I used meshlab to visualize the points.

Please tell me of any issues, as I was able to run it easily.
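
The renames mentioned in step 3 can be scripted. Here is a minimal Python sketch, assuming the .h0 files sit somewhere under the source tree you point it at. rename_h0_headers is a hypothetical helper, and the demo runs on a temporary folder rather than the real Bundler sources. Note that it renames every *.h0 file it finds, which may be more than you need, and that on Windows the rename fails if the .h file already exists.

```python
import os
import tempfile


def rename_h0_headers(root):
    """Rename every *.h0 file under root to *.h and return the new paths."""
    renamed = []
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            if name.endswith(".h0"):
                target = os.path.join(folder, name[:-1])  # drop the trailing "0"
                os.rename(os.path.join(folder, name), target)
                renamed.append(target)
    return sorted(renamed)


# Demo: a temporary folder with the two header names mentioned in step 3.
demo_root = tempfile.mkdtemp()
for header in ("sysdep1.h0", "signal1.h0"):
    open(os.path.join(demo_root, header), "w").close()
renamed = rename_h0_headers(demo_root)
print([os.path.basename(path) for path in renamed])
```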


Using Sparse Bundle Adjustment (SBA) in windows

These are the steps to compile SBA on Windows using Visual Studio 2008.

1. Download SBA from http://www.ics.forth.gr/~lourakis/sba/
2. Download the precompiled LAPACK libraries (you will need BLAS.LIB, CLAPACK.lib, LIBF2C.lib) from http://www.netlib.org/clapack/LIB_WINDOWS/prebuilt_libraries_windows.html
2a) Download them into a specific folder, say xxx\lib. Make sure you use either all debug versions or all release versions. Any mismatch might lead to link errors in VS.
3. Extract the SBA package and create a VS project. You can do this in two ways:
3a) Use CMake: download the CMake GUI from http://www.cmake.org/
Use CMake to configure and generate the VS projects. In the configure stage, make sure you specify the paths for the BLAS and LAPACK libraries. F2C should be on.
3b) Alternatively, create a new project yourself. It should be a console project without precompiled headers, as the SBA files are essentially C files.
4) Check the library dependencies in the linker settings. Make sure the link directory is set to xxx\lib. Add the lib files to the additional link libraries in the order libf2c.lib, blas.lib, lapack.lib
5) If you created the project on your own, add the source files and header files.
6) Press F7, then F5….


** In case you get link errors such as the following conflict with msvcrt, use the setting given after the log:

LIBCMT.lib(tidtable.obj) : error LNK2005: __encode_pointer already defined in MSVCRTD.lib(MSVCR90D.dll)
1>LIBCMT.lib(tidtable.obj) : error LNK2005: __decode_pointer already defined in MSVCRTD.lib(MSVCR90D.dll)
1>LIBCMT.lib(invarg.obj) : error LNK2005: __invoke_watson already defined in MSVCRTD.lib(MSVCR90D.dll)
1>LIBCMT.lib(setlocal.obj) : error LNK2005: __configthreadlocale already defined in MSVCRTD.lib(MSVCR90D.dll)
1>LIBCMT.lib(crt0dat.obj) : error LNK2005: __amsg_exit already defined in MSVCRTD.lib(MSVCR90D.dll)
1>LIBCMT.lib(crt0dat.obj) : error LNK2005: __initterm_e already defined in MSVCRTD.lib(MSVCR90D.dll)
1>LIBCMT.lib(crt0dat.obj) : error LNK2005: _exit already defined in MSVCRTD.lib(MSVCR90D.dll)
1>LIBCMT.lib(crt0dat.obj) : error LNK2005: __exit already defined in MSVCRTD.lib(MSVCR90D.dll)
1>LIBCMT.lib(crt0dat.obj) : error LNK2005: __cexit already defined in MSVCRTD.lib(MSVCR90D.dll)
1>LIBCMT.lib(mlock.obj) : error LNK2005: __unlock already defined in MSVCRTD.lib(MSVCR90D.dll)
1>LIBCMT.lib(mlock.obj) : error LNK2005: __lock already defined in MSVCRTD.lib(MSVCR90D.dll)
1>LIBCMT.lib(winxfltr.obj) : error LNK2005: __XcptFilter already defined in MSVCRTD.lib(MSVCR90D.dll)
1>LIBCMT.lib(crt0init.obj) : error LNK2005: ___xi_a already defined in MSVCRTD.lib(cinitexe.obj)
1>LIBCMT.lib(crt0init.obj) : error LNK2005: ___xi_z already defined in MSVCRTD.lib(cinitexe.obj)
1>LIBCMT.lib(crt0init.obj) : error LNK2005: ___xc_a already defined in MSVCRTD.lib(cinitexe.obj)
1>LIBCMT.lib(crt0init.obj) : error LNK2005: ___xc_z already defined in MSVCRTD.lib(cinitexe.obj)

Setting: in Linker->Input->Ignore Specific Library, add LIBCMT.lib



GPU in computer vision

As vision moves ahead to meet industrial expectations (Microsoft Kinect made a major leap in bridging this gap), real-time implementation of vision algorithms has become a must. Some of these implementations/works are listed here:

HOG features on GPU (http://www.cs.unc.edu/~jmf/CVPR2009_CVGPU.pdf)

Deep learning for image classification (http://portal.acm.org/citation.cfm?id=1553486)

SVM on GPU (http://conflate.net/icml/paper/2008/491)

BMVC 2010