One hot encoding for state machines

Designers have long found that there are many tradeoffs for area and performance. One-hot-state-encoding is an approach to state machine design that attempts to gain performance by simplifying decode logic. The simplest example is comparing a 4 bit counter to it's one-hot equivalent, which is a ring of 16 flops setup as a continual shift register. The counter may be smaller, but performance is limited by 4 bits of incrementer logic. The shift register performance is better, but it requires 16 flops instead of 4.

But one-hot encoding of a state machine requires more than just defining the states as 00001, 00010, 00100, 01000, and 10000. The RTL code style affects whether or not the logic will really be as minimal as possible.

This example looks like a one-hot encoded state machine, but it turns out to be a poor implementation.


case (state)
4'b0001: begin
if (in1) state <= 4'b0010 ;
out1a <= 1 ;
out1b <= in1 ;
end
4'b0010: begin
if (in2) state <= 4'b0100 ;
else state <= 4'b0001 ;
out1a <= 0 ;
out1b <= 0 ;
out23 <= 0 ;
end
4'b0100: begin
state <= 4'b1000 ;
end
4'b1000: begin
if (in4) state <= 4'b0001 ;
out23 <= 0 ;
endcase

The first goal of the one-hot encoding is that the decode of the state to other logic should be minimal. "out1a" should just be a one-cycle delay of state[0], but as written above, it ends up much more complex:

surprising? there's no "out1a <= 0 ;" in all of the other states and default - hey, there's no default, for that matter. synthesis doesn't look at the state sequence and so no synthesis tool can "see" to make it a delayed copy of state[0]

The decode of current state to next state should also be minimal, but instead of state[3] being a one-cycle delay of state[2], it is instead a 4 bit function:

state[3] <= (state == 4'b0100).


One improvement is to add a synthesis directive to the case statement so that synthesis knows that all other values of state can be ignored:

case (state) // synthesis full_case

This may help with the logic for state[3], but it won't necessarily always help. (The actual syntax may vary depending on the synthesis tool; the usual syntax is "synopsys" instead of "synthesis", but that will depend on the vendor; most vendors will recognize the Synopsys syntax).


Instead, we can make the decode simpler by changing the RTL style:

case (1'b1) // synthesis full_case parallel_case

state[0]: begin ... end
state[1]: begin ... end
state[2]: begin ... end
state[3]: begin ... end

endcase

Here we use both full_case and parallel_case. Both are often misused without obvious error. full_case is redundant if there is a default clause. Also, there may be times when a designer deliberately omits some case values, but still adds the full_case pragma out of habit - simulation ignores it, so won't show the error. But in that case, the synthesized results would be incorrect, with an error that would only be revealed by enough simulation with gate models.


parallel_case only applies to a case statement where the individual cases are variables (like our second example) instead of constants (like our first example).

(Using parallel_case on constant cases is a common error). With variables as cases, more than one may be true, but the verilog language definition is that only the first true case will be executed. This implies an if (), else if (), else if ()... structure - a priority encoder. parallel_case is a directive to the synthesis tool that the designer is sure that the cases are mutually exclusive, so a simple OR structure is good enough, instead of the priority encoder.


For example:

case (1'b1) // synthesis parallel_case

state[0]: out <= in1 ;
state[1]: out <=in2 ;

endcase

With the parallel_case pragma, that is equivalent to:

out <= (state[0] & in1) | (state[1] & in2) ;

Without the parallel case pragma, extra logic is required to handle the condition when both state[0] and [1] are asserted:

out <= (state[0] & in1) | (state[1] & ! state[0] & in2) ;

(not a big difference, but this is simple example).

Sometimes one-hot encoding does not have to be taken to the extreme of one flop per state - several consecutive states might be collapsed into one state with a "side" counter (up or down). This is particularly usefull if some other logic would use the OR of those state bits to enable some function.