

Libraries and Learning Services

## University of Auckland Research Repository, ResearchSpace

#### **Suggested Reference**

Wang, H., & Sinnen, O. (2015). *FPGA based acceleration of FDAS module for pulsar search*. Poster session presented at the meeting of 2015 International Conference on Field Programmable Technology (FPT). Queenstown, New Zealand.

#### Copyright

Items in ResearchSpace are protected by copyright, with all rights reserved, unless otherwise indicated. Previously published items are made available in accordance with the copyright policy of the publisher.

For more information, see General copyright.



# FPGA based Acceleration of FDAS Module for Pulsar Search



Haomiao Wang, Oliver Sinnen
Department of Electrical and Computer Engineering, University of Auckland

## Introduction

The Square Kilometre Array (SKA<sup>1</sup>), currently in the pre-construction phase, will be the world's largest telescope array for radio astronomy. The Fourier domain acceleration search (FDAS) module is a sub-module of the Non-imaging Processing Pulsar Search Sub-element (NIP PSS) of the SKA1-MID Central Signal Processor (CSP) element. Its purpose is to minimize the effect of a potential cyclic Doppler shift on pulsar signals. Its main function is to process the preprocessed input using a correlation technique and then identify pulsar candidates<sup>2</sup>.

## Relaxing Requirements

Tab 2. Influence of Different Relaxation Methods

| Factor    | Relaxation Method     | Reduced Number |
|-----------|-----------------------|----------------|
| Precision | 16+16-bit fixed-point | 54.69%         |

## Results (Cont'd)

- Performance of the general 64-tap FIR filter kernel: 110 GFLOPS, 80x speedup
- The fixed-point kernel outperforms the single-precision floating-point (SPF) kernel
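In the 16+16-bit fixed-point relaxation of Tab 2, each complex sample is held as a 16-bit real integer and a 16-bit imaginary integer. The Python sketch below quantises values into that format and multiplies them. It assumes a Q1.15 layout (one sign bit and 15 fractional bits per component) with truncation after each product. The poster specifies neither choice, and all names are hypothetical.

```python
FRAC_BITS = 15                    # assumed Q1.15: 1 sign bit + 15 fractional bits
SCALE = 1 << FRAC_BITS
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1


def saturate16(v: int) -> int:
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, v))


def to_fixed(z: complex) -> tuple[int, int]:
    """Quantise a complex value to a (16-bit real, 16-bit imaginary) integer pair."""
    return saturate16(round(z.real * SCALE)), saturate16(round(z.imag * SCALE))


def from_fixed(p: tuple[int, int]) -> complex:
    """Convert a Q1.15 integer pair back to a Python complex number."""
    re, im = p
    return complex(re / SCALE, im / SCALE)


def fixed_cmul(a: tuple[int, int], b: tuple[int, int]) -> tuple[int, int]:
    """Complex multiply on integer pairs: full-width products, then shift back to Q1.15."""
    (ar, ai), (br, bi) = a, b
    re = (ar * br - ai * bi) >> FRAC_BITS   # arithmetic shift truncates towards -infinity
    im = (ar * bi + ai * br) >> FRAC_BITS
    return saturate16(re), saturate16(im)


a, b = 0.5 + 0.25j, -0.3 + 0.7j
print(from_fixed(fixed_cmul(to_fixed(a), to_fixed(b))), a * b)
```

The fixed-point result differs from `a * b` by roughly 2<sup>-15</sup>. That is the precision given up in exchange for halving the data width relative to two 32-bit floats.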





## Optimization Techniques

- Loop pipelining
- Complete loop unrolling

Fig 2. Single Work-item Convolution Kernel Structure (flowchart labels: Load Coefficients? → Yes: load coefficients; core computation of FIR filter, completely unrolled; store one complex output; loop pipelining)
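The kernel structure in Fig 2 can be modelled behaviourally in Python. This is a sketch, not the authors' OpenCL code. The outer loop stands for the pipelined loop of the single work-item kernel. The inner tap loop stands for the core FIR computation, which the compiler would fully unroll (in Altera's OpenCL C this is typically requested with `#pragma unroll`). The "Load Coefficients?" branch is simplified here to a single load before the loop, and the function name is hypothetical.

```python
def fir_single_work_item(samples, coeffs):
    """Behavioural model of a single work-item complex FIR kernel (cf. Fig 2)."""
    taps = len(coeffs)
    h = list(coeffs)               # "Load Coefficients?": simplified to one load before the loop
    shift_reg = [0j] * taps        # delay line: shift_reg[k] holds x[n - k]
    out = []
    for x in samples:              # pipelined loop: one new input, one output per iteration
        shift_reg = [x] + shift_reg[:-1]
        acc = 0j
        for k in range(taps):      # core FIR computation; fully unrolled in hardware
            acc += h[k] * shift_reg[k]
        out.append(acc)            # store one complex output
    return out


x = [complex(n, -n) for n in range(1, 9)]
h = [0.5 + 0.5j, -1j, 2.0]
print(fir_single_work_item(x, h))
```

Full unrolling instantiates one complex multiply per tap. As a result, the number of taps that fit on the device is bounded by the available DSP blocks.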



- The conjugate-roots kernel achieves a 1.5x speedup



Fig 6. Performance Comparison of SPF Kernel and Fixed-point Kernels



Fig 1. Signal Flow Diagram of FDAS Module

## First Performance Estimate

Tab 1. FDAS Module Parameters

| Parameter | Description                                 | Value                  |
|-----------|---------------------------------------------|------------------------|
| B         | Number of beams                             | 1000–2000              |
| N         | Number of complex samples in one data group | 2<sup>22</sup>         |
| M         | Number of templates                         | 84                     |
| K         | Average template length                     | 222                    |
| W         | Overall workload for one beam               | 6.26 × 10<sup>11</sup> |
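The values in Tab 1 can be cross-checked with a few lines of arithmetic. The calculation assumes 8 real floating-point operations per complex multiply-accumulate: 4 multiplications and 2 additions for the complex product, plus 2 additions to accumulate. Under that assumption, W = 8·N·M·K reproduces the 6.26 × 10<sup>11</sup> in Tab 1. Dividing W by the 88 ms time limit gives the 7.11 TFLOPS figure. Dividing that by the 110 GFLOPS of the general 64-tap kernel gives the 65 FPGAs of the maximum performance method. This reading of how the poster's numbers relate is an assumption, although it matches them. The function names are hypothetical.

```python
import math

# Tab 1 parameters
N = 2 ** 22          # complex samples in one data group
M = 84               # number of templates
K = 222              # average template length
FLOP_PER_CMAC = 8    # assumption: complex multiply (4 mul + 2 add) plus complex accumulate (2 add)


def workload_per_beam(n: int = N, m: int = M, k: int = K) -> int:
    """FLOP to correlate one data group with every template using a direct-form FIR."""
    return FLOP_PER_CMAC * n * m * k


def fpgas_needed(workload: float, t_limit_s: float, flops_per_fpga: float) -> int:
    """Smallest FPGA count whose combined throughput finishes `workload` within `t_limit_s`."""
    required = workload / t_limit_s
    return math.ceil(required / flops_per_fpga)


W = workload_per_beam()
P = W / 0.088                            # t_limit = 88 ms
n_fpga = fpgas_needed(W, 0.088, 110e9)   # 110 GFLOPS: general 64-tap kernel result
print(f"W = {W:.3e} FLOP, P = {P / 1e12:.2f} TFLOPS, N_FPGA = {n_fpga}")
# -> W = 6.257e+11 FLOP, P = 7.11 TFLOPS, N_FPGA = 65
```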

## Regularity in Coefficients

- Symmetric
- Conjugate Roots
- Common Sub-expression Elimination (CSE)
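Coefficient regularity reduces the number of multiplications needed per output sample. The sketch below interprets both regularity terms, although the poster defines neither:

- "Symmetric" is taken to mean h[k] = h[L−1−k], the usual linear-phase FIR property.
- "Conjugate Roots" is read as coefficient pairs with h[L−1−k] = conj(h[k]). This reading is an assumption.

Under the conjugate reading, h·a + conj(h)·b = Re(h)(a + b) + j·Im(h)(a − b), so each coefficient pair needs four real multiplications instead of eight. In the code, `window[k]` holds x[n−k], and the function names are hypothetical.

```python
def fir_output_direct(window, h):
    """Reference: y[n] = sum_k h[k] * x[n-k], with window[k] = x[n-k]."""
    return sum(hk * xk for hk, xk in zip(h, window))


def fir_output_symmetric(window, h):
    """Assumes h[k] == h[L-1-k]: pre-add paired samples, one complex multiply per pair."""
    L = len(h)
    acc = 0j
    for k in range(L // 2):
        acc += h[k] * (window[k] + window[L - 1 - k])
    if L % 2:                                  # odd length: unpaired centre tap
        acc += h[L // 2] * window[L // 2]
    return acc


def fir_output_conjugate(window, h):
    """Assumes h[L-1-k] == conj(h[k]): h*a + conj(h)*b = Re(h)(a+b) + j*Im(h)(a-b)."""
    L = len(h)
    acc = 0j
    for k in range(L // 2):
        a, b = window[k], window[L - 1 - k]
        # real scalar times complex = 2 real multiplies each; multiplying by j is a swap/negate
        acc += h[k].real * (a + b) + 1j * (h[k].imag * (a - b))
    if L % 2:                                  # centre tap equals its own conjugate, so it is real
        acc += h[L // 2].real * window[L // 2]
    return acc
```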

Fig 3. Example of General 3-bit Binary based Horizontal CSE<sup>3</sup>
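Horizontal CSE shares a bit pattern that recurs within a single constant's binary representation. A constant multiplication built from shifts and adds then needs fewer adders. The sketch below is a simplified greedy illustration with one fixed 3-bit pattern (`0b101`, chosen arbitrarily). It is not the vertical-horizontal algorithm of reference 3, and the function names are hypothetical.

```python
def set_bits(c: int) -> list[int]:
    """Positions of the 1 bits of a non-negative integer."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]


def multiply_plain(x: int, c: int) -> tuple[int, int]:
    """x * c as a sum of shifted copies of x; returns (product, adders used)."""
    terms = [x << i for i in set_bits(c)]
    return sum(terms), max(len(terms) - 1, 0)


def multiply_horizontal_cse(x: int, c: int, pattern: int = 0b101) -> tuple[int, int]:
    """x * c where each non-overlapping occurrence of `pattern` in c reuses one shared term."""
    width = pattern.bit_length()
    mask = (1 << width) - 1
    shared = sum(x << j for j in set_bits(pattern))   # common sub-expression, built once
    terms, rest, i = [], c, 0
    while i + width <= rest.bit_length():
        if ((rest >> i) & mask) == pattern:
            terms.append(shared << i)                 # reuse the shared term, shifted
            rest &= ~(pattern << i)                   # these bits are now covered
            i += width
        else:
            i += 1
    uses = len(terms)
    terms += [x << j for j in set_bits(rest)]         # leftover bits added directly
    adders = (len(set_bits(pattern)) - 1 if uses else 0) + max(len(terms) - 1, 0)
    return sum(terms), adders


print(multiply_plain(3, 645), multiply_horizontal_cse(3, 645))
# -> (1935, 3) (1935, 2)
```

For 645 = 0b1010000101, the pattern `101` occurs twice, so the shared term x + (x << 2) is built once and reused, saving one adder. The saving grows with how often a pattern recurs in a constant.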

## First Performance Estimate (Cont'd)

- Required performance per beam (W / t<sub>limit</sub>): 7.11 TFLOPS
- t<sub>limit</sub>: time limit for executing one sample group of all beams, 88 ms
- N<sub>FPGA</sub>: number of FPGAs needed to execute one beam
  - Maximum Performance Method: $N_{\text{A7 FPGA}} = 65$
  - Parallelisation of Multiplication Method: $N_{\text{A7 FPGA}} = 59$

## Results

- Altera SDK for OpenCL<sup>4</sup> version 15.0.0.145
- Terasic DE5 board featuring an Altera Stratix V GX FPGA (5SGXEA7N2F45C2)
- 3.7GHz Intel Core i7-4820K CPU, 32GB RAM and SSD
- Ubuntu 14.04 LTS 64-bit

Fig 4. Terasic DE5 Board<sup>5</sup> with Altera Stratix V GXA7 FPGA

Fig 5. Performance Comparison of General Kernel and Conjugate Roots Kernel

## Discussion

- OpenCL-based Altera FPGA development is applied
- The limited number of DSP blocks is a major barrier for SPF multiplications
- Loop pipelining and complete unrolling of the single work-item kernel are the two main factors in achieving high performance
- Future focus is on the implementation of the large-tap FIR filter and optimization techniques

## References

1. http://www.skatelescope.org/
2. S. M. Ransom, S. S. Eikenberry, and J. Middleditch, "Fourier techniques for very long astrophysical time-series analysis," *The Astronomical Journal*, vol. 124, no. 3, p. 1788, 2002.
3. I. Hatai, I. Chakrabarti, and S. Banerjee, "An efficient constant multiplier architecture based on vertical-horizontal binary common sub-expression elimination algorithm for reconfigurable FIR filter synthesis," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 62, no. 4, pp. 1071-1080, 2015.
4. https://www.altera.com/products/design-software/embeddedsoftware-developers/opencl/overview.tablet.html
5. http://www.terasic.com.tw/cgibin/page/archive.pl?Language=English&No=526