3

I have to perform a [complex] basis transformation on a large number of [real] diagonal matrices: $$ \langle b_i | A | b_j \rangle = \sum_k \langle b_i | \bar{b}_k\rangle \langle\bar{b}_k | A | \bar{b}_k \rangle \langle\bar{b}_k | b_j \rangle $$
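Equivalently, in matrix notation: writing $C_{kj} = \langle\bar{b}_k | b_j\rangle$ for the change-of-basis matrix and $D$ for the diagonal matrix with entries $\langle\bar{b}_k | A | \bar{b}_k\rangle$, this is $$ A' = C^\dagger D C. $$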

What is the most efficient way to perform this operation, ideally relying on BLAS/LAPACK or similar libraries?

It may be important that the $\bar{B}$ basis is much larger than the $B$ basis (that's why we're doing it, after all). Also, only the upper or lower triangle of the matrix $\langle b_i | A | b_j \rangle$ is actually needed, as it will end up in an eigenvalue solver. $\langle\bar{b}_k | b_j \rangle$ is stored with $k$ as the fast index and $j$ as the slow one. On distributed systems, the $k$ index is distributed very close to evenly.

I'll put my current implementation as an answer below. I don't think it is bad, but this accounts for a large enough fraction of run-time that any improvement would be awesome.

David Z
Max Hutchinson
  • Since you're using BLAS/LAPACK you may find this useful. A surprisingly large fraction of time can be spent in mundane ways you wouldn't have guessed, that can be easily fixed. – Mike Dunlavey Sep 01 '12 at 14:46

2 Answers

5

Unfortunately the desired routine is not part of BLAS but is very similar to zherk, which performs a Hermitian rank-k update. In particular, zherk supports operations which look like

$$A := \alpha B B^H + \beta A,$$

where only one triangle of $A$ is updated (the traditional variable names are different in the zherk prototype, where usually $C$ is the matrix to be updated). What you want is the generalization

$$A := \alpha B D B^H + \beta A,$$

where $D$ is a real diagonal matrix. I needed something similar for computing Hermitian matrix functions and wrote my own version which computes

$$A := \alpha B (D B)^H + \beta A,$$

where the user can specify that, even though the left and right matrices for the outer product are different, the result will still be Hermitian. You can see my distributed-memory implementation here. I call the generalization a Triangular Rank K update (Trrk). It is also useful for $LDL$ factorizations.
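
When $D$ is nonnegative, you can also get by with plain zherk by absorbing $\sqrt{D}$ into the basis matrix, since $B D B^H = (B\sqrt{D})(B\sqrt{D})^H$. A minimal Fortran sketch of that workaround (routine and variable names here are illustrative; it assumes the $N_2 \times N_1$ matrix C stores $\langle\bar{b}_k|b_j\rangle$ with $k$ fast, as in the question):

  ! sketch: Anew = C^H D C via a single ZHERK, assuming d(k) >= 0
  subroutine diag_transform_zherk(N1, N2, d, C, Anew)
    implicit none
    integer, intent(in) :: N1, N2
    double precision, intent(in) :: d(N2)            ! real, nonnegative diagonal
    complex(kind(0d0)), intent(in)  :: C(N2, N1)
    complex(kind(0d0)), intent(out) :: Anew(N1, N1)  ! only upper triangle is set
    complex(kind(0d0)) :: Cs(N2, N1)
    integer :: k

    ! absorb sqrt(D) into the basis matrix, so that Cs^H Cs = C^H D C
    do k = 1, N2
      Cs(k, :) = sqrt(d(k)) * C(k, :)
    end do

    ! Hermitian rank-N2 update of the upper triangle: Anew := Cs^H Cs
    call zherk('U', 'C', N1, N2, 1.0d0, Cs, N2, 0.0d0, Anew, N1)
  end subroutine diag_transform_zherk

For an indefinite (or complex) diagonal this factorization is unavailable, and a scaled zgemm is the natural fallback.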

Jack Poulson
  • I've found that this solution works quite well. If $D$ is positive, I scale $B$ by $\sqrt{D}$ and then call zherk. If $D$ has negative elements or is complex, I scale $B^H$ by $D$ and then just call ZGEMM, blocking along the diagonal if $B$ is large enough. I still wish this were just in BLAS... – Max Hutchinson Jan 31 '13 at 06:04
0

I do the right side of the transformation column by column ($b_j$ by $b_j$) and then the left side through a matrix-vector call. After accumulating into a condensed (packed triangular) vector, and summing across ranks on distributed systems, I re-expand into the upper triangle.

  ! trans(k,j) is $<\bar{b}_k | b_j>$, stored with k fast (N2 x N1);
  ! A(1:N2) holds the real diagonal elements $<\bar{b}_k | A | \bar{b}_k>$
  ij = 1
  do j = 1, N1
    ! right side of transformation
    jtmp(1:N2) = A(1:N2) * conjg(trans(1:N2, j))
    ! left side of transformation: res(ij:ij+j-1) = trans(:,1:j)^T * jtmp,
    ! which (since A is real) is the conjugate of column j of the result;
    ! the expansion below conjugates it back
    call ZGEMV('T', N2, j, one, &
                trans, N2, &
                jtmp, 1, zero, &
                res(ij), 1)
    ij = ij + j
  enddo

  ! global sum if needed
  [global sum(res)]

  ! expand packed result to the upper triangle (loop bound is N1, not N2)
  ij = 0
  do j = 1, N1; do i = 1, j
    ij = ij + 1
    A_new(i,j) = conjg(res(ij))
  enddo; enddo
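
For what it's worth, an equivalent formulation of the loop body (a sketch along the same lines, relying on the diagonal being real as stated) feeds ZGEMV the conjugate transpose directly, so res holds the result without the final conjg:

    ! variant: multiply by the unconjugated column and use 'C', so that
    ! res(ij:ij+j-1) = trans(:,1:j)^H * jtmp is column j of A_new directly
    jtmp(1:N2) = A(1:N2) * trans(1:N2, j)
    call ZGEMV('C', N2, j, one, trans, N2, jtmp, 1, zero, res(ij), 1)

Whether 'T' plus a final conjugation or 'C' is faster likely depends on the BLAS implementation.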

I am unhappy with:

  • BLAS not knowing that the right and left sides of the transformation are the same matrix (though the cache might, on machines with a cache).

  • The element-wise multiply not being a library call. I've noticed that BLAS's ZDOTC noticeably outperforms Fortran's intrinsic dot_product (MKL, Intel compilers), so I don't have a ton of confidence in that line being optimally vectorized. One possible MKL-based alternative is sketched below.
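
If MKL is available anyway, one candidate replacement (an assumption on my part, not something benchmarked here) is the VML routine vzMulByConj, which fuses the elementwise multiply and the conjugation into one vectorized call:

  ! sketch: assumes A is declared complex(kind(0d0)) and the MKL VML
  ! Fortran interface is available (include 'mkl_vml.f90');
  ! computes jtmp(k) = A(k) * conjg(trans(k,j))
  call vzMulByConj(N2, A, trans(:, j), jtmp)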

Max Hutchinson