**Euclidean Distance Definition**

The Euclidean Distance between two points, \((x_1,y_1)\) and \((x_2,y_2)\), is \(\sqrt{(x_1-x_2)^2 + (y_1-y_2)^2}\). This classic equation extends to 3-dimensions as well, \(\sqrt{(x_1-x_2)^2 + (y_1-y_2)^2+(z_1-z_2)^2}\). There are many applications for the Euclidean distance equation. Anything from collision detection in video games to space travel, to even machine learning algorithms (see blog post K-Means Clustering Post).

The Euclidean distance is calculated often in these applications. For collision detection the distance is calculated between the character to all the objects the character can interact with in the game. Once the character moves the distance is calculated again. This simple equation when calculated repeatedly can tax a processor that is also tasked with other algorithms for the game.

For the machine learning algorithm K-Means Clustering the distance of every point to it’s centroid and to every other point in it’s clustering is very laborious. At the heart of the algorithm the maximization of cluster distance and a minimization of inter-cluster distance. Here we will dedicate hardware on an FPGA to calculate the Euclidean distance.

**VHDL Entity Definition**

First we need to decide the inputs and outputs of the VHDL entity. The block will be a synchronous design so we will have a clock and a data-valid line. Next we will consider the coordinates of the two points. In general we are deciding between two architectures, The first architecture is where all the data are valid on the same clock event alternatively the data is valid on different clock events. This is a common architecture debate since the decision is really balancing hardware usage versus system latency.

The balance between hardware usage and system latency is what makes developing for the FPGA difficult and powerful. For this distance calculation we will have both sets of coordinates valid on a single clock event. We are choosing to reduce system latency and increase the hardware usage by processing all coordinates in parallel.

So to recap we have a clock, a data-valid flag, and two sets of coordinates where each set has three eight-bit values. As for outputs we have an output data-valid flag and the distance result.

```
-- dist3d.vhd
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity dist3d is
generic(g_bitwidth : integer)
port(i_clk : in std_logic;
i_dv : in std_logic;
i_x1 : in std_logic_vector(g_bitwidth-1 downto 0);
i_y1 : in std_logic_vector(g_bitwidth-1 downto 0);
i_z1 : in std_logic_vector(g_bitwidth-1 downto 0);
i_x2 : in std_logic_vector(g_bitwidth-1 downto 0);
i_y2 : in std_logic_vector(g_bitwidth-1 downto 0);
i_z2 : in std_logic_vector(g_bitwidth-1 downto 0);
o_dv : out std_logic;
o_dist : out std_logic_vector(g_bitwidth-1 downto 0)
);
end entity dist3d;
```

We also anticipate that this implementation will require the highest throughput possible so we will use a pipelined architecture. In a pipelined architecture registers are instantiated for every operation in the Euclidean distance equation. The naming convention used here is a single ‘f’ prefix for a register and if a registered result is operated on and the result is registered another ‘f’ is prefixed to the name. The purpose and benefit of this naming convention is that as the design gets more complex it is easy to tell how many clock cycles are needed for the processing to be completed, this also shows the latency of the entity.

```
architecture rtl of dist3d is
-- Output = sqrt{ (x_1-x_2)^2 + (y_1-y_2)^2 + (z_1-z_2)^2 }
-- Pipline Stage 1 - Difference
signal f_x_diff : signed(g_bitwidth-1 downto 0) := (others => '0');
signal f_y_diff : signed(g_bitwidth-1 downto 0) := (others => '0');
signal f_z_diff : signed(g_bitwidth-1 downto 0) := (others => '0');
-- Pipline Stage 2 - Squaring
signal ff_x_sq : signed(2*g_bitwidth-1 downto 0) := (others => '0');
signal ff_y_sq : signed(2*g_bitwidth-1 downto 0) := (others => '0');
signal ff_z_sq : signed(2*g_bitwidth-1 downto 0) := (others => '0');
-- Pipline Stage 3 - Addition
signal fff_xy : signed(2*g_bitwidth-1 downto 0) := (others => '0');
signal fff_z : signed(2*g_bitwidth-1 downto 0) := (others => '0');
-- Pipline Stage 4 - Addition
signal f4_dist : signed(2*g_bitwidth-1 downto 0) := (others => '0');
begin
p_calc : process(i_clk)
begin
if rising_edge(i_clk) then
f_x_diff <= signed(i_x1) - signed(i_x2);
f_y_diff <= signed(i_y1) - signed(i_y2);
f_z_diff <= signed(i_z1) - signed(i_z2);
ff_x_sq <= f_x_diff * f_x_diff;
ff_y_sq <= f_y_diff * f_y_diff;
ff_z_sq <= f_z_diff * f_z_diff;
fff_xy <= ff_x_sq + ff_y_sq;
fff_z <= ff_z_sq;
f4_dist <= fff_xy + fff_z;
end if;
end process;
end rtl;
```

After writing VHDL the next step is to simulate the code in a functional simulator. By functionally simulating the code we will ensure the code behaves the way we want it to. Below is a picture of the Euclidean Distance VHDL entity. We can see the inputs and outputs of the code under the ‘TB’ divider and the internal signal values, the only thing this implementation is lacking is the final square-root calculation. Since taking the square-root is resource intensive and has high latency we will not include the calculation in the implementation. Not every usage of this VHDL entity will be able to handle the added resources and latency but he the square-root were truly needed we could use a vendor (Xilinx or Altera) specific CORDIC to do this calculation.

Pingback: Getting Started with FPGA Programming for Free - Jason Pennington, Phd