Saturday, April 10, 2010

Matrix and linear algebra in F#, Part V: Sparse matrix implementation in PowerPack, and PInvoke a large scale SVD library as an application.

In the previous posts of this series. We mainly talked about dense matrices in F#:

Part I: a tutorial on using matrix type in F#, mainly focused on dense matrix.

Part II: doing dense matrix linear algebra using math providers.

Part III: eigenface application

Part IV: performance issues

In this post, we move to sparse matrices. In many applications in data mining, the data table is stored as a sparse matrix. For example, in text mining, a document can be represented by a row-vector(or column-vector), the value at index i means that how many times word i occurs in this document. While the vocabulary could be quite large (tens of thousands), thus the document vector is highly sparse. If we store the documents in a matrix, then the matrix is also highly sparse. In DNA sequence analysis applications, the same situation occurs, that is we need to deal with sparse matrices!

We first dig into the implementation of F# sparse matrix in PowerPack. This is important, as we need to know how the sparse matrices are represented in F#, what algorithms are supported, etc. Based on the F# implementation, we can build our own routines, e.g. change the row based sparse representation to column based. At last, we PInvoke a SVD library in C as a example to show how to link existing stable linear algebra routines for sparse matrices.

The sparse matrix implementation in PowerPack

Sparse matrices have several different representation, which also have different strength. A good introduction is this Wikipedia page.

F# uses Yale Sparse Matrix Format, or another name, Compressed Sparse Row. (ref to the above Wiki article.)

Let’s check this out:

> let B = Matrix.initSparse 4 3 [0,0,2.3; 2,0,3.8; 1,1,1.3; 0,2,4.2; 1,2,2.2; 2,2,0.5]
val B : matrix = matrix [[2.3; 0.0; 4.2]
[0.0; 1.3; 2.2]
[3.8; 0.0; 0.5]
[0.0; 0.0; 0.0]]
> B.InternalSparseValues
val it : float [] = [|2.3; 4.2; 1.3; 2.2; 3.8; 0.5|]
> B.InternalSparseColumnValues
val it : int [] = [|0; 2; 1; 2; 0; 2|]
> B.InternalSparseRowOffsets
val it : int [] = [|0; 2; 4; 6; 6|]

Non-zero entries are sequentially in B.InternalSparseValues in a row-by-row fashion. B.InternalSparseColumnValues stores their column indices. B.InternalSparseRowOffsets stores the row offsets, for row i, it starts at B.InternalSparseRowOffsets[i] and ends at B.InternalSparseRowOffsets[i+1]-1 in the value array(B.InternalSparseValues).

This representation allows to take out one row of a sparse matrix very efficiently. In data mining, data instances/samples are usually stored as a row vector in a matrix, so it supports efficient handling on the instance level.


However, this representation does not support efficient column operations. In numerical computing, taking columns is more often (following the old Fortran tradition), thus a lot of libraries use Compressed Sparse Column representation for sparse matrices. It is also called Harwell-Boeing sparse matrix format (HB format for short) in the numerical computing community. An example is here.

PowerPack’s aim for providing the Internal* member functions of matrix type is to let users to interpolate with other libraries. However, it does not provide a function to convert the row representation to the column representation. Thus, this could be our first exercise:
let toColumnSparse (A:matrix) =
if A.IsDense then failwith "only for sparse matrix!"
let nrow = A.NumRows
let ncol = A.NumCols
let cidx = A.InternalSparseColumnValues
let roffset = A.InternalSparseRowOffsets
let vals = A.InternalSparseValues
let nval = vals.Length

// 1. scan to get column offset
let colCnt = Array.create ncol 0
for i=0 to nrow-1 do
for
idx=roffset.[i] to roffset.[i+1]-1 do
colCnt.[cidx.[idx]] <- colCnt.[cidx.[idx]] + 1
let colOffset = Array.zeroCreate (ncol+1)
for i=1 to ncol do
colOffset.[i] <- colOffset.[i-1] + colCnt.[i-1]
Array.Clear(colCnt, 0, ncol) // clear cnt array

// 2. score the value
let rowIdx = Array.create nval 0
let vals2 = Array.create nval 0.0
for i=0 to nrow-1 do
for
idx=roffset.[i] to roffset.[i+1]-1 do
let
j = cidx.[idx]
vals2.[colOffset.[j] + colCnt.[j]] <- vals.[idx]
rowIdx.[colOffset.[j] + colCnt.[j]] <- i
colCnt.[j] <- colCnt.[j] + 1

vals2, rowIdx, colOffset
This implementation is quite efficient. It does two scans and returns three arrays for the HB format. This implementation currently only supports sparse matrices, it should also support dense matrices as we often need to transform a dense matrix into a sparse one. We can add the support in this function, or we can first transfer a dense matrix into a sparse one and use the above function. Both are OK. However, PowerPack currently does not support dense to sparse operation (it supports sparse to dense, which is used more often). Let this dense-to-sparse operation be our second exercise:

let
toSparse (A:matrix) =
    if A.IsSparse then failwith "A should be a desne matrix"
let l =
seq {
let B = A.InternalDenseValues // use the internal array
for i=0 to (Array2D.length1 B)-1 do
for
j=0 to (Array2D.length2 B)-1 do
if
B.[i,j] > 1e-30 then
yield
(i,j,B.[i,j])
}
Matrix.initSparse A.NumRows A.NumCols l


To get the HB format of a dense matrix A, we simply use:

let vals, rowIdx, colOffset = A |> toSparse |> toColumnSparse

At last of this section, let’s have a look at the annotated implementation for Matrix.initSparse in matrix.fs to have a better understanding of the F# sparse matrix type:


/// Create a matrix from a sparse sequence
let initSparseMatrixGU maxi maxj ops s =
(* 1. the matrix is an array of dictionarys
tab[i] is the dictionary for row i *)
let tab = Array.create maxi null
let
count = ref 0
for (i,j,v) in s do
if
i < 0 || i >= maxi || j <0 || j >= maxj then failwith "initial value out of range";
count := !count + 1;
let tab2 =
match tab.[i] with
| null ->
let
tab2 = new Dictionary<_,_>(3)
tab.[i] <- tab2;
tab2
| tab2 -> tab2
tab2.[j] <- v
// 2. calcuate the offset for each row
// need to be optimized!
let offsA =
let rowsAcc = Array.zeroCreate (maxi + 1)
let mutable acc = 0
// 3. this loop could be optimized using
// sorted map for tab
for i = 0 to maxi-1 do
rowsAcc.[i] <- acc;
acc <- match tab.[i] with
| null -> acc
| tab2 -> acc+tab2.Count
rowsAcc.[maxi] <- acc;
rowsAcc
// 4. get the column indices and values
let colsA,valsA =
let colsAcc = new ResizeArray<_>(!count)
let valsAcc = new ResizeArray<_>(!count)
for i = 0 to maxi-1 do
match
tab.[i] with
| null -> ()
| tab2 -> tab2 |> Seq.toArray |> Array.sortBy (fun kvp -> kvp.Key) |> Array.iter (fun kvp -> colsAcc.Add(kvp.Key); valsAcc.Add(kvp.Value));
colsAcc.ToArray(), valsAcc.ToArray()
// 5. call the SparseMatrix constructor
SparseMatrix(opsData=ops, sparseValues=valsA, sparseRowOffsets=offsA, ncols=maxj, columnValues=colsA)
As noted in the code, one possible optimization is to use SortedMap for table, rather than an array. But this SortedMap has more overhead, the current implementation is already good. The other possible way is to sort the (i,j,val) sequence, which avoids the overhead in using a Dictionary structure.

Only a few matrix operations are implemented for sparse matrices, e.g. +, – and * are supported. However, map, columns and rows are not supported. This does not quite matter as when we need sparse matrices, we will be usually dealing with large datasets. For large datasets, calling a specialized library or writing the code ourselves is a better solution, as we will see the SVD example blow:


PInvoke SVDLIBC


In a previous post, We already know how to write a simple matrix multiplication in C, and call it from F# using P/Invoke. Here we move to a more useful one, a large scale SVD library, SVDLIBC. For a small dense SVD, using lapack’s svd is just fine. However, for a 10000-by-10000 sparse matrix, we need a more powerful one. (ARPACK project is dedicated to this kind of decompositions. SVDLIBC is a C translation of a small part ARPACK.)


The SVDLIBC is a very good svd solver. It also provides a command line tool to do SVD for sparse or dense matrices. However, it uses some non-standard headers for I/O. To make it compile, we need to delete some code for IO. The main svd solver (in las2.c) is:

SVDRec svdLAS2A(SMat A, longdimensions)

A wrapper with a clear interface is need for this solver:


#define CHECK(ptr)  if (!(ptr)) return 0;

__declspec(dllexport)
int svds(int nrow, int ncol, int nval, double *val, int *row, int *offset,
int dim,
double *Uval,
double *Sval,
double *Vval)
{
SMat A;
SVDRec res;
DMat U, V; double *S;
FILE *fp;


A = svdNewSMat(nrow, ncol, nval);
CHECK(A);
memcpy(A->value, val, sizeof(double) * nval);
memcpy(A->rowind, row, sizeof(int) * nval);
memcpy(A->pointr, offset, sizeof(int) * (ncol+1));


res = svdLAS2A(A, dim);
CHECK(res);
CHECK(res->d == dim); // the dimension passed in must be correct!

S = res->S;
memcpy(Sval, S, sizeof(double) * dim);

U = svdTransposeD(res->Ut); CHECK(U);
V = svdTransposeD(res->Vt); CHECK(V);
memcpy(Uval, &U->value[0][0], sizeof(double) * (U->rows * U->cols));
memcpy(Vval, &V->value[0][0], sizeof(double) * (V->rows * U->cols));
svdFreeDMat(U);
svdFreeDMat(V);
svdFreeSMat(A);
svdFreeSVDRec(res);
return 1; // successful
}
and in F#:

module Native =
[<System.Runtime.InteropServices.DllImport(@"svdlibc.dll",EntryPoint="svds")>]
extern int svds(int nrow, int ncol, int nval, double *vals, int *row, int *offset, int dim, double *Uval, double *Sval, double *Vval);


module LinearAlgebra =
// Sparse SVD
// A: F# sparse matrix
// di: the dimension
let svds (A:matrix) (di:int)=
/// A = U * S * Vt
if A.IsDense then failwith "only for sparse matrix!"
else
let
nrow = A.NumRows
let ncol = A.NumCols
let nval = A.InternalSparseValues.Length
// let dim = min nrow ncol
let dim = max 1 (min (min nrow ncol) di) // choose a valid value
let U = Matrix.zero nrow dim
let S = Vector.zero dim
let V = Matrix.zero ncol dim

let colVals, rowIdx, colOffset = MatrixUtility.toColumnSparse A
let valsP = NativeUtilities.pinA colVals
let rowP = NativeUtilities.pinA rowIdx
let offsetP = NativeUtilities.pinA colOffset
let Up, Vp = NativeUtilities.pinMM U V
let Sp = NativeUtilities.pinV S

let ret = Native.svds(nrow, ncol, nval, valsP.Ptr, rowP.Ptr, offsetP.Ptr, dim, Up.Ptr, Sp.Ptr, Vp.Ptr)
valsP.Free()
rowP.Free()
offsetP.Free()
Up.Free()
Vp.Free()
Sp.Free()
if ret = 0 then
failwith "error in pinvoke svds"
U, S, V
// default Sparse SVD
// A: F# sparse matrix
let svds0 A =
svds A (min A.NumCols A.NumRows)

test it:

let test2() =
let r = new System.Random()
let nrow = 1000
let ncol = 1000
let nval = 100000

let l =
seq {
for i=1 to nval do
yield
(r.Next()%nrow, r.Next()%ncol, r.NextDouble())
}
let A = Matrix.initSparse nrow ncol l
tic()
let U, S, V = svds A 100
toc("svds")
()


We will in later posts to show the data mining applications of SVD.


NativeUtilities module

I used utility functions in NativeUtilities to convert the F# array/matrix/vector and their native array representations. Here’s its implementation from F# math-provider source code:
open System
open System.Runtime.InteropServices
open Microsoft.FSharp.NativeInterop
open Microsoft.FSharp.Math


// from math-provider source code
module NativeUtilities = begin
let
nativeArray_as_CMatrix_colvec (arr: 'T NativeArray) = new CMatrix<_>(arr.Ptr,arr.Length,1)
let nativeArray_as_FortranMatrix_colvec (arr: 'T NativeArray) = new FortranMatrix<_>(arr.Ptr,arr.Length,1)
let pinM m = PinnedArray2.of_matrix(m)
let pinV v = PinnedArray.of_vector(v)
let pinA arr = PinnedArray.of_array(arr)

let pinA2 arr = PinnedArray2.of_array2D(arr)

let pinMV m1 v2 = pinM m1,pinV v2
let pinVV v1 v2 = pinV v1,pinV v2
let pinAA v1 v2 = pinA v1,pinA v2
let pinMVV m1 v2 m3 = pinM m1,pinV v2,pinV m3
let pinMM m1 m2 = pinM m1,pinM m2
let pinMMM m1 m2 m3 = pinM m1,pinM m2,pinM m3
let freeM (pA: PinnedArray2<'T>) = pA.Free()
let freeV (pA: PinnedArray<'T>) = pA.Free()
let freeA (pA: PinnedArray<'T>) = pA.Free()

let freeA2 a = freeM a

let freeMV (pA: PinnedArray2<'T>,pB : PinnedArray<'T>) = pA.Free(); pB.Free()
let freeVV (pA: PinnedArray<'T>,pB : PinnedArray<'T>) = pA.Free(); pB.Free()
let freeAA (pA: PinnedArray<'T>,pB : PinnedArray<'T>) = pA.Free(); pB.Free()
let freeMM (pA: PinnedArray2<'T>,(pB: PinnedArray2<'T>)) = pA.Free();pB.Free()
let freeMMM (pA: PinnedArray2<'T>,(pB: PinnedArray2<'T>),(pC: PinnedArray2<'T>)) = pA.Free();pB.Free();pC.Free()
let freeMVV (pA: PinnedArray2<'T>,(pB: PinnedArray<'T>),(pC: PinnedArray<'T>)) = pA.Free();pB.Free();pC.Free()

let matrixDims (m:Matrix<_>) = m.NumRows, m.NumCols
let matrixDim1 (m:Matrix<_>) = m.NumRows
let matrixDim2 (m:Matrix<_>) = m.NumCols
let vectorDim (v:Vector<_>) = v.Length

let assertDimensions functionName (aName,bName) (a,b) =
if a=b then () else
failwith (sprintf "Require %s = %s, but %s = %d and %s = %d in %s" aName bName aName a bName b functionName)
end


A Note on Compressed Sparse Row and Compressed Sparse Column(HB format)


We can see both representations have deficiency, is there a magic structure or an engraining trick taking the advantage of both? Let’s check the standard software Matlab.


In Matlab, we usually write A(:,j) to take the j-th column of sparse matrix A or A(i,:) to take the i-th row. Does Matlab have a magic to let both operations run efficiently. Then answer is NO. The following script is used to test this:

function [ ret ] = testOctaveSvds ()

nrow = 10000;
ncol = 10000;
nval = 100000;
rows = floor(rand(1, nval)*nrow)+1;
cols = floor(rand(1, nval)*ncol)+1;
vals = rand(1, nval);
m = sparse(rows, cols, vals, nrow, ncol);
tic;
s = 0;
for i=1:10000,
c = floor(rand(1) * ncol) + 1;
s = s + sum(m(:,c));
end
fprintf('cols: ')
toc;

tic;
s = 0;
for i=1:10000,
r = floor(rand(1) * nrow) + 1;
s = s + sum(m(r,:));
end
fprintf('rows: ')
toc;

endfunction


In Matlab 2009a, the output is:

cols: Elapsed time is 0.204693 seconds.

rows: Elapsed time is 16.058717 seconds.


Octave also has similar result. From the result, we can deduce that Matlab/Octave use HB format for sparse matrix and it does not do heavy optimization for the operation for taking the rows. This is reasonable, as mentioned before, that when using sparse matrices, the user/programmer has the obligation to optimize the program, rather than the matrix library.

Monday, April 5, 2010

PInvoke native C/C++ libraries in F#

In this blog post, we explore on PInvoke in F#. Before going into PInvoke, an matrix multiplication example is also given to show how to build a DLL for PInvoke, which is important as in most of the cases we don't have pre-built DLLs.

PInvoke stands for Platform Invoke, meaning calling naive libraries. The two major reasons to use PInvoke is 1) native performance, native code is usually faster than .Net managed code, especially for numerical computing routines, and 2) existing stable(well tested) libraries can be used on the fly on .Net.

Of course, it also has some disadvantages, e.g. 1) extra instructions are used for each PInvoke call, this is not important as long as P/Invoke is not used too frequently, 2) calling native code means not safe anymore, e.g. no exception, no memory access checking.

There are several tutorials online for C#: here, here and here. However, these are more focused on existing windows dlls, their motivation for PInvoke is that some functions are not available on .Net, thus calling an existing native DLL, e.g. Win32 API. However, we are talking about performance in this series, PInvoke means that we rewrite the performance critical part in C/C++, or compile an existing library (maybe we also need to make a C wrapper of it in the meantime) and call it using PInvoke. So we should start with the native code part as we don’t have the native dll yet.

Matrix multiplication

Because C and C++ are slightly different, and C++ usually compiles C code well. So we’ll deal with C++ mode in VC++ compiler, i.e. all files have .cpp extensions.

First we write mattrixproduct.cpp:

__declspec(dllexport) int minus(int a, int b)
{
return a - b;
}

extern "C" __declspec(dllexport) int add(int a, int b)
{
return a + b;
}

extern "C" __declspec(dllexport)
void matmul(double *a, int an, int am,
double *b, int bn, int bm,
double *c)
{
int i, j, k;

for (i=0; i<an; i++)
for (j=0; j<bm; j++) {
double s = 0;
for (k=0; k<am; k++)
s += a[i*am + k] * b[k*bm + j];
c[i*bm+j] = s;
}
}
__declspec(dllexport) is the header we must add to each function to tell the compiler to generate this function in the DLL. extern "C" is to tell the compiler that this is a C code. We use command line:
cl.exe /LD mattrixproduct.cpp

to get mattrixproduct.dll. We can also create a new Visual C++ project and set the target to .dll.

Here’s the information for this dll: (use command line: link /DUMP /EXPORTS mattrixproduct.dll)

ordinal hint RVA name
1 0 00001000 ?minus@@YAHHH@Z = ?minus@@YAHHH@Z (int __cdecl minus(int,int))
2 1 00001010 add = _add
3 2 00001020 matmul = _matmul

We can see that minus function (without extern “C”) has a non-readable name, while the other two have correct names.

Now, we have the DLL, let’s move on to the PInvoke in F#. Here is the code:

open System
open System.Runtime.InteropServices
open Microsoft.FSharp.NativeInterop
open Microsoft.FSharp.Math

module Native =
[<System.Runtime.InteropServices.DllImport(@"mattrixproduct.dll",
EntryPoint="add")>]
extern int add(int a, int b);

// notice the wired name for minus
[<System.Runtime.InteropServices.DllImport(@"mattrixproduct.dll",
EntryPoint="?minus@@YAHHH@Z")>]
extern int minus(int a, int b);

[<System.Runtime.InteropServices.DllImport(@"mattrixproduct.dll",
EntryPoint="matmul")>]
extern void matmul(double *a, int an, int am, double *b, int bn,
int bm, double *c);


let a = Native.add(10,20)
let b = Native.minus(10,20)
let A = matrix [[1.0;2.0;]; [3.0;4.0;]]
let B = matrix [[3.0;2.0;1.0]; [1.0;1.0;1.0]]
let C = Matrix.zero 2 3
let Ap = PinnedArray2.of_matrix(A)
let Bp = PinnedArray2.of_matrix(B)
let Cp = PinnedArray2.of_matrix(C)
Native.matmul(Ap.Ptr, 2, 2, Bp.Ptr, 2, 3, Cp.Ptr)
Ap.Free()
Bp.Free()
Cp.Free()

printfn "a = %A\nb = %A\nC = %A" a b C
There are several things to mention:

1. For basic type, like int, double, as in the add and minus, they don’t need any special treatment.

2. The hard part is the array. This introduces concept called data marshaling. Because .Net arrays and native arrays are different, so they cannot be directly mapped. This is why we open namespaces for Interop. Another thing is that data marshaling does not support 2 dimensional arrays directly! This is why I used 1-D array in the C implementation. High performance code usually only use 1-D arrays, so this inconvenience does not quite matter to us. To marshal the array, we get a one dimensional pointer for a matrix using PinnedArray2.of_matrix, which is defined in F# PowerPack (in NativeArrayExtensions.fs), and then pass this pointer in the function call.

The performance

Let's now test the performance of the wrapped native matrix multiplication. First write some test on both small matrices and big matrices:

let mm A B =
let Ap = PinnedArray2.of_matrix(A)
let Bp = PinnedArray2.of_matrix(B)
let C = Matrix.zero (A.NumRows) (B.NumCols)
let Cp = PinnedArray2.of_matrix(C)
Native.matmul(Ap.Ptr, A.NumRows, A.NumCols, Bp.Ptr, B.NumRows, B.NumCols, Cp.Ptr)
Ap.Free()
Bp.Free()
Cp.Free()
C

let test() =
// use small matrices
let A = matrix [[1.0;2.0;]; [3.0;4.0;]]
let B = matrix [[3.0;2.0;1.0]; [1.0;1.0;1.0]]
tic()
mm A B |> ignore
toc("my native matrix *")
A * B |> ignore
toc("F# *")

// use big matrices
let r = new System.Random()
let A1 = Matrix.init (92*112) 280 (fun i j -> r.NextDouble())
let A' = A1.Transpose
tic()
let C1 = mm A' A1
toc("my native matrix *")
let C2 = A' * A1
toc("F# *")

if C1.Equals(C2) then true
else false
The timing functions tic() and toc() are defined in Part IV of the matrix and linear algebra series. Run it in F# interactive:

> test();;
my native matrix * 1.758400 ms
F# * 0.005100 ms
my native matrix * 7207.073800 ms
F# * 16524.670200 ms
val it : bool = true

First the return value is true, indicating that my C implementation is correct. In the small matrix case, the overhead of PInvoke is obvious. In the big matrix case, the performance boost using native code is also obvious.

Note 1: C++ class

So far so good. Because we didn’t touch C++ yet! Classes, templates are more complex.

To work with these. One method is to write C wrappers for C++ classes. This method is commonly used in practice.

We can also directly add __declspec(dllexport) to the class definition. However, creating objects via constructors is not supported anymore. One pattern is to write two static methods: one for creating an object and the other for freeing an object. Member functions are used by providing this pointer explicitly. If you know how to enable object style programming in C, this patter would be familiar to you. Anyway, in this style, the accessing ability is equivalent to C’s. As this pattern requires modifying existing class definitions(although only a little), it is less commonly used, thus I don’t give detailed example here.

Note 2: Pure C

The above example is given in C++ as in most of the cases we are dealing with C++ files, at least C files could be compiled in C++.

For Pure C files (with .c extensions), things are actually easier. Just removing extern "C" would be OK.

Using existing C/C++ code without PInvoke

We can use C++/CLI to compile a C/C++ library without modifying anything into a unsafe managed dll. I will write a separate blog for this later.

Remark

In this tutorial, we know the basics of P/Invoke in F#. We don’t touch two things:

1) Marshaling complex parameters, e.g. different format of strings, non-array pointers, C++ classes, etc.

2) MEMORY! MEMORY! MEMORY! The hard and dangerous part of P/Invoke is memory management. I don’t have enough experience of this part yet. Currently I just follow the style in Math-Provider.

Saturday, April 3, 2010

Matrix and linear algebra in F#, Part IV: profile your program, find the bottleneck and speed it up: using matrix multiplication as an example

We will use the face recognition program we developed in Part III to illustrate how to profile programs.

The performance problem of the face recognition program is that it takes about 25 seconds to process 280 faces (each face has 10304=112*92 pixels). I want to find which part takes most of the time. Is it due to the eigen decomposition?

Profile

Profiling means knowing how much time and memory every part of your program take. Good profiling tools could give a detail report on the running behavior of a program. In .Net, there are several good tools for this purpose. Some of them are free.

However, these tools are too big to use in a explorative/interactive programming style. Following Matlab’s tic and toc, I have written them for F#:

let stopWatch = new System.Diagnostics.Stopwatch()

let tic() =
stopWatch.Reset()
stopWatch.Start()

let toc(msg:string) =
stopWatch.Stop()
printfn "%s %f ms" msg stopWatch.Elapsed.TotalMilliseconds
stopWatch.Reset()
stopWatch.Start()
I also use the same style for my C programs:

#include <time.h>
time_t stopWatch;
void tic()
{
stopWatch = clock();
}
void toc(const char *msg)
{
double s = (clock() - stopWatch) * 1000.0 / CLOCKS_PER_SEC;
printf("%s: %.2lf ms\n", msg, s);
stopWatch = clock();
}
This implementation enables us to use a sequence of tocs to profile a code block line by line without the need to write tic every time. However this simple profiler can not be nested as there is only a global timer.

After putting several tocs into my eigCov:


tic()
let colm = colMean faces
toc("mean face")
let A = Matrix.init nrow ncol (fun i j -> faces.[i,j] - colm.[i])
toc("minus mean face")
let L = A.Transpose * A
toc("get L")
let val1, vec1 = eig L
toc("lapack eig")
let v = val1 * (1.0 / float (ncol - 1))
let u = A * vec1.Transpose
toc("get u")
// normalize eigen vectors
let mutable cnt = 0
for i=0 to ncol-1 do
if
abs v.[i] < 1e-8 then
u.[0..nrow-1,i..i] <- Matrix.create nrow 1 0.0
else
cnt <- cnt + 1
let norm = Vector.norm (u.[0..nrow-1,i..i].ToVector())
u.[0..nrow-1,i..i] <- u.[0..nrow-1,i..i] * ( 1.0 / float norm )
toc("normolize u")
It gets the following output:

mean face 639.891700 ms
minus mean face 195.484600 ms
get L 12885.751500 ms
lapack eig 247.015700 ms
get u 7169.749100 ms
normolize u 623.919700 ms

The bottle neck is in "Get L” and “get u”, both of which are matrix multiplication operations. The main computation part “lapack eig” actually costs very little time (as the 280-by-280 matrix is not big).

The matrix multiplication in “get L” is 280x10304 multiplies 10304x280, which is quite large!


Matrix Multiplication

Matrix multiplication is a great example to show that why we need well tuned libraries to perform numerical computations.

Table: the time cost for self-made(not optimized) implementations





















implementations operator * for matrix type 3 F# loops(use matrix) 3 F# loops(use float Array2D) 3 C loops (native)
time(seconds) 13.07 38.81 13.52 8.10
I used the following 3 loops:

for (i=0; i<N; i++)
for (j=0; j<N; j++) {
double s = 0;
for (k=0; k<M; k++)
s += a[i][k] * b[k][j];
c[i][j] = s;
}
The two F# programs are similar to the above one in C. We can find that if we use a lot of F# matrix element access operator [,], the performance is very poor. The * operator of matrix type does not have any optimization in it, so it costs the same amount of time as 3-loops (Array2D) does. Native code is faster than the managed ones as it does not do boundary checking for index.

Matlab, R and Numpy/Python all wrap optimized matrix implementations. Let us next see how they perform:

Table: the time cost for optimized implementations

















Software Matlab 2009a R 2.10.1 Numpy 1.3.0
time 0.25 1.98 0.35
Matlab is the fastest, but it uses 200% CPU. While R and Numpy are single threaded. R takes 2 seconds. Matlab uses Intel MKL, Numpy uses Lapack(maybe optimized version), I don’t know what algebra routines R uses.

We can see the difference between optimized versions and non-optimized ones is very large. A similar report from a .Net math software vender is here. The conclusion is that we should use optimized matrix multiplication procedure when matrices are large.


Add optimized matrix multiplication support to math-provider

The math-provider has made an wrapper for dgemm_, which is a BLAS procedure for computing:

C  :=
alpha*op( A )*op( B ) + beta*C
where op (A) = A or A’.
This is general case of matrix multiplication. The math-provider also has an wrapper for dgemv_, matrix-vector multiplication.

However, they are not directly callable from Lapack service provider.

add the following code to linear_algebra_service.fs:
let
MM a b =
  Service().dgemm_(a,b)

and add the following to linear_algebra.fs:
let MM A B =
if HaveService() then
LinearAlgebraService.MM A B
else
A * B
and add the following to linear_algebra.fsi:
/// BLAS matrix multiplication
val MM: A:matrix -> B:matrix -> matrix
Now we define a local name for it:

let mm = LinearAlgebra.MM 
Using BLAS matrix multiplication routine costs about 4 seconds, which is faster than native C’s performance(8 seconds), but worse than Matlab’s or Numpy’s. The reason is that the BLAS(Netlib BLAS 3.1.1) is not optimized for my platform. If I use ATLAS for BLAS and use an Intel Fortran Compiler, the performance will be close to Numpy’s or Matlab’s.

By using BLAS’s matrix multiplication routine for the face recognition, the total running reduces from 25 seconds to 9 seconds.


Attachment



wait.

Matrix and linear algebra in F#, Part III: Eigen decomposition and face recognition

The theory

Face recognition is one of the most import research problems in computer vision. It is also an important application for everyday use. A lot of security system has a face recognition component, as well as other parts, e.g., finger print recognition.

While this application has a lot of algorithms, the most famous one must be Eigen face proposed in

M. Turk and A. Pentland (1991). "Eigenfaces for recognition". Journal of Cognitive Neuroscience 3 (1): 71–86.

The theory of Eigen face is very simple if one has learned the concept of Principal Component Analysis (PCA). Given a set of faces, each face is a n-by-m matrix, if we reshape the matrix into a vector, then each face can be viewed as a point in the (n*m) dimensional space. The idea of Eigen face is to find some directions, which are orthogonal, in this space such that the points(faces) have a minimal representation, or have maximum variance along these directions. As you can see, this is an optimization problem. The magic part is that this optimization is solved by Singular Value Decomposition (SVD).

The following is a summary of the procedures to get the eigen faces:

1. for a set of faces, F = {f1,…,fn}, calculate its mean face m. F is a d-by-n matrix, where d is the number of pixels of an face and n is the number of faces in the dataset. fi is a column vector.

2. subtract the mean face from F, get A = {f1-m, … fn-m}. We want to get the eigenvalues and eigenvectors of Cov(A) = A * A’. The eigenvectors associated with big eigenvalues are eigenfaces. (In the program part, we will visualize these eigenfaces).

However, Cov(A) is usually too big. (If there are 10K pixels in each face, then the size of Cov(A) would be 10K-by-10K). A trick is to calculate the eigenvalues and eigenvectors of A’ * A (which is of small size if the number of faces n is small.) and calculate Cov(A)’s based on them. This trick is well documented in the program section.

Once we get the Eigen faces, we can do face recognition. As said before, each eigen face is actually an direction, we can map our new face and the face in the database into the space defined by the set of Eigen faces. For example, if we have 10 Eigen faces, and we know that they are orthogonal to each other, thus a 10-D real space is defined by the 10 Eigen faces. Once real faces are mapped into this 10-D space, we can measure the distance between the new face and the existing faces, and find the closet one as the label face for the new face.

Here are some reference for interested readers:

1. Eigenface wiki page. It has the mathematics and links to programs and papers.

2. A Matab Tutorial. and this one. These two are very good tutorials using Matlab.

3. Principal Component Analysis (PCA), Singular Value Decomposition and Eigendecompsition. If you want to know more theory of it, you can move on to have a good understanding of PCA.

The ORL face dataset

There are more than 10 face datasets available online, which range from different complexity. I use an quite old one the ORL face dataset. This one is well pre-processed, e.g. under the same light condition, of the same size, etc.

Using this dataset greatly eases our work in processing the images and enables us to focus on the Eigen face algorithm in F#.

This dataset contains 400 faces from 40 subjects, with each having 10 faces. The images are of .PGM format. For the I/O of the images, I wrote a separate blog post on reading, writing and visualizing this image format. That post also features on how to write a simple GUI in F#.

The data structures for the dataset:

type pgm = matrix
type person = matrix array
type dataset = person array
Each PGM image is an F# matrix, each person is an array of matrices, and the dataset is an array of persons.

Then we define functions to read the dataset for a given folder:

let readPerson folder =
seq {
for i=1 to 10 do
let
name = folder + string(i) + ".pgm"
yield readPgm name
} |> Seq.toArray

let readDataset folder =
seq {
for i=1 to 40 do
let
name = folder + "s" + string(i) + "\\"
yield readPerson name
} |> Seq.toArray
We also define functions to split the dataset into training data and testing data. Training data are used to get the eigenfaces, and then faces in the testing data are recognized using the eigenfaces. For each person, the first 7 faces are used for training, and the 3 left ones are used for testing.

let splitDataSet (ds:dataset) =
let train = seq {
for p in ds do
yield
(
seq {
for i=0 to 9 do
if
i<7 then yield p.[i]
} |> Seq.toArray )
}

let test = seq {
for p in ds do
yield
(
seq {
for i=0 to 9 do
if
i>=7 then yield p.[i]
} |> Seq.toArray )
}
train |> Seq.toArray, test |> Seq.toArray

let strip (data: 'a array array) =
let dat = seq {
for x in data do
for
y in x do
yield
y
}
let label = seq {
let i = ref 0
for x in data do
for
y in x do
yield
!i
i := !i + 1
}
dat |> Seq.toArray, label |> Seq.toArray



Function strip is used to transform a data set (matrix array array) into a flat array (matrix array) and remember their labels.


The program

First we write some help functions:

let mat2vec (A:matrix) =
let n = A.NumRows
let m = A.NumCols
Vector.init (n*m) (fun i->A.[i%n, i/n])

let vec2mat (vec:vector) nrow ncol =
if vec.Length <> nrow * ncol then failwith "vec length != nrow * ncol "
Matrix.init nrow ncol (fun i j -> vec.[j*nrow+i])

let colSum (m:matrix) =
let nrow = m.NumRows
Matrix.foldByRow (fun a b -> a+b) (Vector.create nrow 0.0) m

let colMean (m:matrix) =
let ncol = m.NumCols
colSum m * (1.0 / float ncol)
mat2vec is concate the columns of a matrix into a long column vector, while vec2mat does the inverse. Their behaviors are similar to the reshape function in Matlab. colSum and colMean are similar to sum and mean in Matlab.

To use the eigen decomposition for a symmetric matrix, we call lapack via math-provider. Read Part II of this F# matrix series to know to how use them.

open Microsoft.FSharp.Math
open Microsoft.FSharp.Math.Experimental
let isSucc = Experimental.LinearAlgebra.Lapack.Start()
printfn "service = %A" (if isSucc then "Netlib lapack" else "Managed code")
let eig = LinearAlgebra.EigenSpectrumWhenSymmetric
eig is a shorthand for EigenSpectrumWhenSymmetric. Notice that eig gets the eigenvectors in a matrix, each row of which is an eigenvector. This may be a bug in the math-provider wrapper, but since it only deals with square and symmetric matrices, this incontinence does not hurt too much.

The main function:

let eigCov (faces:matrix) numVecs =
    // each column is a face

let nrow = faces.NumRows
let ncol = faces.NumCols
let colm = colMean faces
let A = Matrix.init nrow ncol (fun i j -> faces.[i,j] - colm.[i])
// A - each column is a difference face
// Cov(A) = A * A', a big matrix
// trick: A'*A*v(i) = lambda(i)*v(i)
// A*A'*A*u(i) = lambda(i)*A*u(i)
// (A*u(i)) is an eignvector of Cov(A)
// L(A) = A' * A
let L = A.Transpose * A
let val1, vec1 = eig L // each row of vec1 is an eigenvector
// get the eigen values and eigen vectors for Cov(A)
let v = val1 * (1.0 / float (ncol - 1))
let u = A * vec1.Transpose

// normalize eigen vectors
let mutable cnt = 0
for i=0 to ncol-1 do
if
abs v.[i] < 1e-8 then
u.[0..nrow-1,i..i] <- Matrix.create nrow 1 0.0
else
cnt <- cnt + 1
let norm = Vector.norm (u.[0..nrow-1,i..i].ToVector())
u.[0..nrow-1,i..i] <- u.[0..nrow-1,i..i] * ( 1.0 / float norm )

// sort
// let u' = u.PermuteColumns (fun j -> j ) //u.NumCols - j - 1)
// the PermuteColumns function is buggy
let u' = Matrix.init u.NumRows u.NumCols (fun i j -> u.[i, u.NumCols - j - 1])

let v' = v.Permute (fun j -> v.NumRows - j - 1)

let select = min cnt numVecs
//printfn "%A" select
if cnt < numVecs then
printfn "warning: numvec is %d; only %d exist." numVecs cnt

let v1 = v'.[0..(select-1)]
let u1 = u'.Columns(0, select)
let transformed = (u1.Transpose) * A
v1, u1, colm, transformed
and:

let eigenface (p:matrix array) numVecs =
let B = p |> Array.map mat2vec
let nrow = B.[0].Length // nrow is big == # of elements in a face
let ncol = B.Length // # of faces for a person
// put a 2d face into a vector as a column in matrix A
let A = Matrix.init nrow ncol (fun i j -> B.[j].[i])

let v, u, mface, transformed = eigCov A numVecs
(*
(new PGMViewer(vec2mat mface 112 92, "mean face")).Show()
for i=0 to 2 do
let eface = scale (u.Column(i))
let f = vec2mat eface 112 92
(new PGMViewer(f, "eigenface " + i.ToString())).Show()
*)

v, u, mface, transformed
This function transform an image matrix into a vector and call the eigCov to get the eigenfaces. The commented lines in this function is used to output the mean face and the first 3 eigenfaces:

eigenfaces



Use eigenfaces to do recognition for a new face:

let recognition (u:matrix) (mface:vector) (transformed:matrix) (face:vector) =
    let d = face - mface
let t = u.Transpose * d

let distance a b = Vector.norm (a - b)
{0..transformed.NumCols-1}
|> Seq.map (fun j -> distance t (transformed.Column(j)), j)
|> Seq.min
|> snd
To put the recognition into a learning and testing framework:

let learning data =
let train, test = splitDataSet data
let dat, lab = strip train
let dat0, lab0 = strip test
let v, d, mface, tran = eigenface dat 10

let correctCnt =
dat0
|> Seq.mapi (fun i f ->
let
idx = recognition d mface tran (mat2vec f)
let label = lab.[idx]
if label = lab0.[i] then 1 else 0
)
|> Seq.sum
float correctCnt / (float dat0.Length)
Using 10 most significant eigen faces from the 280(7 * 40) faces, we recognize the remaining 120(3 * 40) faces, and get 113 faces recognized to the correct person, thus a 94.17% accuracy, which is reasonable good considering that we have done any normalization on the faces, and the classifier is only 1 nearest neighbor.

Remark

In this post, we have seen that how F# could be used as an prototype tool like Matlab. In the matrix computation part, we need to write some utility functions ourselves, while in Matlab they are built in. However, the strength of F# comes from the language part, we could control the training and testing more freely than in Matlab. Moreover, once we have developed this face recognition component, it is just straightforward to use C# or VB.Net to call it.

Although we have a high accuracy in the final test, the result is only for a well prepared dataset. In real situations, things would be very hard. Maybe over 90% time is spent on tuning best parameters and preprocessing the face images.



Attachment

wait.

A PGM Image Viewer in F#, with PGM writer and reader

This post is to serve the face recognition project.

In that project, we need a little tool to visualize/display Eigen faces. It would be very helpful during debugging/testing to display the Eigen faces and mean faces visually after they are calculated. We would also like to display an image to see if our IO part is correct or not.

The ORL face data set I use is of PGM format, so I need to load these images and display them.

First let’s get familiar with the PGM image format. PGM stands for Netpbm grayscale image. Its format is every easy:

P5 or P2
# 0 or more lines for comments
nrows ncolumns
max-value
image content

The format of image content part could be either text(P2) or binary(P5). More detail is here and here.


Before reading the file, let’s define the data structure for a PGM image. Because it is grey scale, thus it is actually an F# matrix:


type
pgm = matrix


We would use two readers: a BinaryReader for binary image content and a StreamReader for text image content. (I think we could also use StreamReader for both, But I started using BinaryReader for binary image content and added the text image content later, so two readers are used.)

exception
PgmFormatError of string
let readPgm path =
   let rec skipCommentLine br =
let l = getline br
if l.StartsWith("#") then
skipCommentLine br
else
l
let br = new BinaryReader(File.Open(path, FileMode.Open))
let pgmType = getline br
let sizeStr = skipCommentLine br
let s = sizeStr.Split ' '
let ncol = int (s.[0])
let nrow = int (s.[1])
printfn "nrow = %d ncol = %d" nrow ncol
let maxStr = getline br

if (pgmType.StartsWith("P5")) then
// binary format
let mutable b = Array.create (nrow * ncol) 0uy
let nread = br.Read(b, 0, (nrow*ncol))
let bb = b
Matrix.init nrow ncol (fun i j -> float (bb.[i*ncol+j]))
elif (pgmType.StartsWith("P2")) then
// text format
br.Close() // close the file and re-open
use sr = new StreamReader(path)
sr.ReadLine() |> ignore
let mutable line = sr.ReadLine()
while line.StartsWith("#") do
line <- sr.ReadLine()
sr.ReadLine() |> ignore

let str =
seq {
while not sr.EndOfStream do
yield
sr.ReadLine()
}
|> Seq.fold (fun a b -> a + b) ""

let pic = Matrix.create nrow ncol 0.0
let vals = str.Split([|' ';'\t';'\r';'\n'|], StringSplitOptions.RemoveEmptyEntries)
printfn "%A" vals
Matrix.init nrow ncol (fun i j -> float (vals.[i*ncol+j]))
else
// unknown format
raise (PgmFormatError(path))

The getline function for BinaryReader:

let
getline (br:BinaryReader) =
   let chSeq = seq {
let ch = ref (br.ReadChar())
while !ch <> '\n' do
yield
!ch
ch := br.ReadChar()
}

let chArr = chSeq |> Seq.toArray
let line = new String(chArr)

line
Next we write the writer for a pgm image, I only implemented the writer for the text content case as this is mainly used for outputting some intermediate results, thus need to be human-readable.

let writePgm (path:string) (p:pgm) =
use fs = new FileStream(path, FileMode.Create)
   use sw = new StreamWriter(fs, Encoding.ASCII)
sw.Write("P2\n{0} {1}\n255\n", p.NumCols, p.NumRows)

for i=0 to (p.NumRows - 1) do
for
j=0 to (p.NumCols - 1) do
sw.Write(" " + string (p.[i,j]))
sw.WriteLine()

Finally, the viewer. F# actually is quite good at doing small GUI programs. First inherits from a form:

type
PGMViewer(pic:matrix, name:string) as this =
   inherit Form(Width = 300, Height = 300)

then create a main menu, and menu items in it, a PictureBox to display bitmap images and add a SaveFileDialog:

let
mainmenu = new MainMenu()
let mFile = new MenuItem()
let miFileSave = new MenuItem()
let miExit = new MenuItem()
let pb = new PictureBox()
let filesave = new SaveFileDialog()
Set necessary properties of these objects (put these code in a do block)

do
// set the property of picture box
let nrow = pic.NumRows
let ncol = pic.NumCols
let image = new Bitmap(ncol, nrow)
printfn "%A %A" nrow ncol
for i=0 to nrow-1 do
for
j=0 to ncol-1 do
image.SetPixel(j, i, Color.FromArgb(int pic.[i,j], int pic.[i,j], int pic.[i,j]))
pb.Image <- image
pb.Height <- nrow
pb.Width <- ncol

// add picture box
this.Controls.Add(pb)

// set memu property
miFileSave.Index <- 0
miFileSave.Text <- "&Save"
miExit.Index <- 1
miExit.Text <- "&Exit"
mFile.Index <- 0
mFile.Text <- "&File"
mFile.MenuItems.Add(miFileSave) |> ignore
mFile.MenuItems.Add(miExit) |> ignore
mainmenu.MenuItems.Add(mFile) |> ignore
// add menu
this.Menu <- mainmenu
And add event handler for click event of the save menu item:
// savefile dialog
filesave.DefaultExt <- "pgm"
filesave.Filter <- "PGM Files|*.pgm"
let filesaveClick e =
if pb.Image = null then
()
else
if
filesave.ShowDialog() = DialogResult.OK then
let
bm = pb.Image :?> Bitmap
let pic = Matrix.init (bm.Height) (bm.Width) (
fun i j ->
let
c = bm.GetPixel(j,i)
float c.R * 0.3 + float c.G * 0.59 + float c.B * 0.11
)

writePgm filesave.FileName pic
miFileSave.Click.Add(filesaveClick)
Put them together:

open System.Windows.Forms
open System.Drawing

type PGMViewer(pic:matrix, name:string) as this =
inherit Form(Width = 300, Height = 300)

let mainmenu = new MainMenu()
let mFile = new MenuItem()
let miFileSave = new MenuItem()
let miExit = new MenuItem()
let pb = new PictureBox()
let filesave = new SaveFileDialog()

do
// set the property of picture box
let nrow = pic.NumRows
let ncol = pic.NumCols
let image = new Bitmap(ncol, nrow)
printfn "%A %A" nrow ncol
for i=0 to nrow-1 do
for
j=0 to ncol-1 do
image.SetPixel(j, i, Color.FromArgb(int pic.[i,j], int pic.[i,j], int pic.[i,j]))
pb.Image <- image
pb.Height <- nrow
pb.Width <- ncol

// add picture box
this.Controls.Add(pb)

// set memu property
miFileSave.Index <- 0
miFileSave.Text <- "&Save"
miExit.Index <- 1
miExit.Text <- "&Exit"
mFile.Index <- 0
mFile.Text <- "&File"
mFile.MenuItems.Add(miFileSave) |> ignore
mFile.MenuItems.Add(miExit) |> ignore
mainmenu.MenuItems.Add(mFile) |> ignore
// add menu
this.Menu <- mainmenu

// savefile dialog
filesave.DefaultExt <- "pgm"
filesave.Filter <- "PGM Files|*.pgm"
let filesaveClick e =
if pb.Image = null then
()
else
if
filesave.ShowDialog() = DialogResult.OK then
let
bm = pb.Image :?> Bitmap
let pic = Matrix.init (bm.Height) (bm.Width) ( fun i j ->
let
c = bm.GetPixel(j,i)
float c.R * 0.3 + float c.G * 0.59 + float c.B * 0.11
)

writePgm filesave.FileName pic
miFileSave.Click.Add(filesaveClick)


// set title of the dialog
this.Text <- "PGMViewer" + " - " + name

And let’s test the viewer now:

(new PGMViewer(readPgm @"C:\temp\temp4.pgm")).Show()

pgmviewer

(Copyright: this face is from the ORL face database.)

Attachment

wait.