
[SCFToCalyx] Lower SCF Parallel Op To Calyx #7409

Draft · wants to merge 1 commit into main

Conversation

jiahanxie353
Contributor

This patch lowers the scf.parallel op to Calyx by invoking multiple calyx.components in parallel. I'm not pushing test files for now because some required features are not upstream yet.

But I can give an example, say we have:

module {
  func.func @main() {
    %c2 = arith.constant 2 : index
    %c1 = arith.constant 1 : index
    %c3 = arith.constant 3 : index
    %c0 = arith.constant 0 : index
    %alloc = memref.alloc() : memref<6xi32>
    %alloc_1 = memref.alloc() : memref<6xi32>
    scf.parallel (%arg2, %arg3) = (%c0, %c0) to (%c3, %c2) step (%c2, %c1) {
      %4 = arith.shli %arg3, %c2 : index
      %5 = arith.addi %4, %arg2 : index
      %6 = memref.load %alloc_1[%5] : memref<6xi32>
      %7 = arith.shli %arg2, %c1 : index
      %8 = arith.addi %7, %arg3 : index
      memref.store %6, %alloc[%8] : memref<6xi32>
      scf.reduce
    }
    return
  }
}

The output is:

module attributes {calyx.entrypoint = "main"} {
  calyx.component @main(%clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
    %c2_i32 = hw.constant 2 : i32
    %c1_i32 = hw.constant 1 : i32
    %c0_i32 = hw.constant 0 : i32
    %mem_1.addr0, %mem_1.write_data, %mem_1.write_en, %mem_1.clk, %mem_1.reset, %mem_1.read_data, %mem_1.done = calyx.memory @mem_1 <[6] x 32> [3] {external = true} : i3, i32, i1, i1, i1, i32, i1
    %mem_0.addr0, %mem_0.write_data, %mem_0.write_en, %mem_0.clk, %mem_0.reset, %mem_0.read_data, %mem_0.done = calyx.memory @mem_0 <[6] x 32> [3] {external = true} : i3, i32, i1, i1, i1, i32, i1
    %par_func_0_3_instance.in0, %par_func_0_3_instance.in1, %par_func_0_3_instance.in2, %par_func_0_3_instance.in4, %par_func_0_3_instance.clk, %par_func_0_3_instance.reset, %par_func_0_3_instance.go, %par_func_0_3_instance.done = calyx.instance @par_func_0_3_instance of @par_func_0_3 : i32, i32, i32, i32, i1, i1, i1, i1
    %par_func_0_2_instance.in0, %par_func_0_2_instance.in1, %par_func_0_2_instance.in2, %par_func_0_2_instance.in4, %par_func_0_2_instance.clk, %par_func_0_2_instance.reset, %par_func_0_2_instance.go, %par_func_0_2_instance.done = calyx.instance @par_func_0_2_instance of @par_func_0_2 : i32, i32, i32, i32, i1, i1, i1, i1
    %par_func_0_1_instance.in0, %par_func_0_1_instance.in1, %par_func_0_1_instance.in2, %par_func_0_1_instance.in4, %par_func_0_1_instance.clk, %par_func_0_1_instance.reset, %par_func_0_1_instance.go, %par_func_0_1_instance.done = calyx.instance @par_func_0_1_instance of @par_func_0_1 : i32, i32, i32, i32, i1, i1, i1, i1
    %par_func_0_0_instance.in0, %par_func_0_0_instance.in1, %par_func_0_0_instance.in2, %par_func_0_0_instance.in4, %par_func_0_0_instance.clk, %par_func_0_0_instance.reset, %par_func_0_0_instance.go, %par_func_0_0_instance.done = calyx.instance @par_func_0_0_instance of @par_func_0_0 : i32, i32, i32, i32, i1, i1, i1, i1
    calyx.wires {
    }
    calyx.control {
      calyx.seq {
        calyx.par {
          calyx.seq {
            calyx.invoke @par_func_0_0_instance[arg_mem_0 = mem_1, arg_mem_1 = mem_0](%par_func_0_0_instance.in0 = %c0_i32, %par_func_0_0_instance.in1 = %c2_i32, %par_func_0_0_instance.in2 = %c0_i32, %par_func_0_0_instance.in4 = %c1_i32) -> (i32, i32, i32, i32)
          }
          calyx.seq {
            calyx.invoke @par_func_0_1_instance[arg_mem_0 = mem_1, arg_mem_1 = mem_0](%par_func_0_1_instance.in0 = %c1_i32, %par_func_0_1_instance.in1 = %c2_i32, %par_func_0_1_instance.in2 = %c0_i32, %par_func_0_1_instance.in4 = %c1_i32) -> (i32, i32, i32, i32)
          }
          calyx.seq {
            calyx.invoke @par_func_0_2_instance[arg_mem_0 = mem_1, arg_mem_1 = mem_0](%par_func_0_2_instance.in0 = %c0_i32, %par_func_0_2_instance.in1 = %c2_i32, %par_func_0_2_instance.in2 = %c2_i32, %par_func_0_2_instance.in4 = %c1_i32) -> (i32, i32, i32, i32)
          }
          calyx.seq {
            calyx.invoke @par_func_0_3_instance[arg_mem_0 = mem_1, arg_mem_1 = mem_0](%par_func_0_3_instance.in0 = %c1_i32, %par_func_0_3_instance.in1 = %c2_i32, %par_func_0_3_instance.in2 = %c2_i32, %par_func_0_3_instance.in4 = %c1_i32) -> (i32, i32, i32, i32)
          }
        }
      }
    }
  } {toplevel}
  calyx.component @par_func_0_0(%in0: i32, %in1: i32, %in2: i32, %in4: i32, %clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
    %true = hw.constant true
    %std_slice_1.in, %std_slice_1.out = calyx.std_slice @std_slice_1 : i32, i3
    %std_slice_0.in, %std_slice_0.out = calyx.std_slice @std_slice_0 : i32, i3
    %std_add_1.left, %std_add_1.right, %std_add_1.out = calyx.std_add @std_add_1 : i32, i32, i32
    %std_lsh_1.left, %std_lsh_1.right, %std_lsh_1.out = calyx.std_lsh @std_lsh_1 : i32, i32, i32
    %load_0_reg.in, %load_0_reg.write_en, %load_0_reg.clk, %load_0_reg.reset, %load_0_reg.out, %load_0_reg.done = calyx.register @load_0_reg : i32, i1, i1, i1, i32, i1
    %std_add_0.left, %std_add_0.right, %std_add_0.out = calyx.std_add @std_add_0 : i32, i32, i32
    %std_lsh_0.left, %std_lsh_0.right, %std_lsh_0.out = calyx.std_lsh @std_lsh_0 : i32, i32, i32
    %arg_mem_1.addr0, %arg_mem_1.write_data, %arg_mem_1.write_en, %arg_mem_1.clk, %arg_mem_1.reset, %arg_mem_1.read_data, %arg_mem_1.done = calyx.memory @arg_mem_1 <[6] x 32> [3] : i3, i32, i1, i1, i1, i32, i1
    %arg_mem_0.addr0, %arg_mem_0.write_data, %arg_mem_0.write_en, %arg_mem_0.clk, %arg_mem_0.reset, %arg_mem_0.read_data, %arg_mem_0.done = calyx.memory @arg_mem_0 <[6] x 32> [3] : i3, i32, i1, i1, i1, i32, i1
    calyx.wires {
      calyx.group @bb0_2 {
        calyx.assign %std_slice_1.in = %std_add_0.out : i32
        calyx.assign %arg_mem_0.addr0 = %std_slice_1.out : i3
        calyx.assign %load_0_reg.in = %arg_mem_0.read_data : i32
        calyx.assign %load_0_reg.write_en = %true : i1
        calyx.assign %std_add_0.left = %std_lsh_0.out : i32
        calyx.assign %std_lsh_0.left = %in0 : i32
        calyx.assign %std_lsh_0.right = %in1 : i32
        calyx.assign %std_add_0.right = %in2 : i32
        calyx.group_done %load_0_reg.done : i1
      }
      calyx.group @bb0_5 {
        calyx.assign %std_slice_0.in = %std_add_1.out : i32
        calyx.assign %arg_mem_1.addr0 = %std_slice_0.out : i3
        calyx.assign %arg_mem_1.write_data = %load_0_reg.out : i32
        calyx.assign %arg_mem_1.write_en = %true : i1
        calyx.assign %std_add_1.left = %std_lsh_1.out : i32
        calyx.assign %std_lsh_1.left = %in2 : i32
        calyx.assign %std_lsh_1.right = %in4 : i32
        calyx.assign %std_add_1.right = %in0 : i32
        calyx.group_done %arg_mem_1.done : i1
      }
    }
    calyx.control {
      calyx.seq {
        calyx.seq {
          calyx.enable @bb0_2
          calyx.enable @bb0_5
        }
      }
    }
  }
calyx.component @par_func_0_1(%in0: i32, %in1: i32, %in2: i32, %in4: i32, %clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
    // same idea
  }
calyx.component @par_func_0_2(%in0: i32, %in1: i32, %in2: i32, %in4: i32, %clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
    // same idea
  }
calyx.component @par_func_0_3(%in0: i32, %in1: i32, %in2: i32, %in4: i32, %clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
    // same idea
  }
}

The design choice is worth discussing:
Re. why I created multiple FuncOps for the body of scf.parallel: there are two candidates for invoking something in parallel under Calyx's par:

  • calyx.group
  • calyx.component

The former is not expressive enough to hold the body of scf.parallel; the latter is, and since we know each FuncOp will eventually be lowered to a calyx.component, we can simply invoke the components in parallel in the control section, merely changing the operands passed to each one.

Question:
Can we invoke the same calyx.component together? If so, I can only create one @par_func instance instead of multiple ones.

jiahanxie353 added the Calyx (The Calyx dialect) label on Jul 30, 2024
@rachitnigam
Contributor

@jiahanxie353 this patch is marked as a draft. Is that intentional? It would be better to review it once it is in a state where the design is mostly fixed and on the path to being merged.

@cgyurgyik
Member

I have a few C++/coding related comments, but I'll postpone these since you're in draft mode.

So currently, you've created N components that have the same M cells, which are subsequently invoked in parallel in some main component, i.e.,

component main() -> () {
  cells { 
    A0 = A0(); A1 = A1(); A2 = A2(); ...; An = An(); 
  }
  control {
    par { invoke A0; invoke A1; invoke A2; ...; invoke An; }
  }
}

component A0() -> () { ... }
...
component An() -> () { ... }

The alternative is what you've questioned,

Can we invoke the same calyx.component together? If so, I can only create one @par_func instance instead of multiple ones.

i.e., create a component that takes as input M cells. This should be possible, but would require wiring the correct inputs and outputs.

component main() -> () {
  cells {
    A0 = A(); A1 = A(); ... ; An = A();
  }
  control {
    par { invoke A0(...); invoke A1(...); ...; invoke An(...); }
  }
}

component A() -> () { ... }

@jiahanxie353
Contributor Author

this patch is marked as a draft. Is that intentional? It would be better to review it once it is in a state where the design is mostly fixed and on the path to being merged.

I marked it as a draft for two reasons:

  1. The test cases can only be added once some other PRs are merged first, since we are missing some required features here;
  2. The high-level design is not fixed yet; it'd be great to get feedback from reviewers, and once we have settled on a design (for instance, how many components to create), we can move it out of draft. With that said, the implementation of the current design is solid.

This should be possible, but would require wiring the correct inputs and outputs.

Makes sense!


Discussion needed:
After translating to Calyx, I ran into the issue of a Calyx memory's write_en being driven by multiple sources, the root cause of which is memref.store %6, %alloc[%8] : memref<6xi32>.
This is because I'm passing the same memory by reference to different components. Under scf.parallel's semantics, we are certain that different memref.stores will not write to the same memory address. However, when lowering to Calyx, we lose this information and only have

calyx.group @bb0_5 {
  ....
  calyx.assign %arg_mem_1.write_en = %true : i1
  ....
}

across different components. And since the signal has multiple drivers, we can't lower it to Verilog.

Any thoughts?

@rachitnigam
Contributor

Okay, the reasons for marking it a draft make sense @jiahanxie353!

In the scf.parallel's semantics, we are certain that we won't be storing to the same memory address by different memref.stores.

It is not sufficient to show that we are going to write to different memory locations. The problem is that a hardware memory can only support a single read or write in a clock cycle (because it is a single-ported memory). For cases like this, you need to "bank" the memory into separate parts so that parallel accesses go to entirely different physical memories. First, I'd recommend reading the Dahlia paper to get a sense of what is going on. Next, we'd need to think more carefully about how this should be supported. @andrewb1999's work is adding this capability in a more general fashion, but we might need to bank memories somehow.
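
For intuition, here is a minimal sketch of one common scheme, cyclic (interleaved) banking; it only illustrates the idea and is not what this patch implements, and the helper names (bank_of, local_offset, conflict) are hypothetical. Each logical address maps to a bank index plus a local offset, and two accesses can happen in the same cycle only if their bank indices differ.

# Sketch of cyclic (interleaved) memory banking; illustration only.
BANKING_FACTOR = 4

def bank_of(addr: int) -> int:
    # Which physical memory (bank) services this logical address.
    return addr % BANKING_FACTOR

def local_offset(addr: int) -> int:
    # Address within that physical bank.
    return addr // BANKING_FACTOR

def conflict(addr_a: int, addr_b: int) -> bool:
    # Two accesses conflict (cannot be serviced in the same cycle) iff they
    # hit the same single-ported bank.
    return bank_of(addr_a) == bank_of(addr_b)

print(bank_of(4), local_offset(4))  # bank 0, offset 1
print(bank_of(3), local_offset(3))  # bank 3, offset 0
print(conflict(4, 3))               # False: these two accesses could go in parallel

Under such a split, one logical memref would become BANKING_FACTOR physical memories, and the lowering would have to route each access to its bank.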

Also tagging @ethanuppal since he's been thinking about similar problems in his language.

@jiahanxie353
Contributor Author

Is there a general algorithm to determine an "optimal" banking factor for each memory?

Consider a somewhat made-up example of moderate complexity:

module {
  func.func @main(%buffer : memref<16xi32>) {
    %idx_one = arith.constant 1 : index
    %idx_two = arith.constant 2 : index
    %idx_three = arith.constant 3 : index
    %idx_four = arith.constant 4 : index
    %idx_five = arith.constant 5 : index

    %one = arith.constant 1 : i32
    %six = arith.constant 6 : i32
    %val = arith.addi %one, %six : i32

    %step1 = arith.constant 1 : index
    %step2 = arith.constant 2 : index
    %lb1 = arith.constant 0 : index
    %lb2 = arith.constant 1 : index
    %ub1 = arith.constant 5 : index
    %ub2 = arith.constant 4 : index
    %mem1 = memref.alloc() : memref<25xi32>

    scf.parallel (%iv1, %iv2) = (%lb1, %lb2) to (%ub1, %ub2) step (%step1, %step2) {
      %mul1 = arith.muli %iv1, %idx_five : index
      %load_idx1 = arith.addi %mul1, %idx_four : index
      %load_val1 = memref.load %mem1[%load_idx1] : memref<25xi32>

      %mul2 = arith.muli %iv1, %idx_two : index
      %load_idx2 = arith.addi %mul2, %idx_three : index
      %load_val2 = memref.load %mem1[%load_idx2] : memref<25xi32>
      %sum1 = arith.addi %load_val1, %load_val2 : i32

      %mul3 = arith.muli %iv2, %idx_three : index
      %load_idx3 = arith.addi %mul3, %idx_one : index
      %load_val3 = memref.load %mem1[%load_idx3] : memref<25xi32>

      %mul4 = arith.muli %iv1, %idx_four : index
      %load_idx4 = arith.addi %mul4, %idx_two : index
      %load_val4 = memref.load %buffer[%load_idx4] : memref<16xi32>
      %sum2 = arith.addi %load_val3, %load_val4 : i32

      %store_idx = arith.muli %iv2, %idx_three : index
      %store_val = arith.addi %sum1, %sum2 : i32
      memref.store %store_val, %buffer[%store_idx] : memref<16xi32>
    }

    return
  }
}

And I have generated the memory access patterns:

| Memory access        | (iv1=0, iv2=1) | (0, 3) | (1, 1) | (1, 3) | (2, 1) | (2, 3) | (3, 1) | (3, 3) | (4, 1) | (4, 3) |
| -------------------- | -------------- | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| load mem1[5iv1+4]    | 4              | 4      | 9      | 9      | 14     | 14     | 19     | 19     | 24     | 24     |
| load mem1[2iv1+3]    | 3              | 3      | 5      | 5      | 7      | 7      | 9      | 9      | 11     | 11     |
| load mem1[3iv2+1]    | 4              | 10     | 4      | 10     | 4      | 10     | 4      | 10     | 4      | 10     |
| load buffer[4iv1+2]  | 2              | 2      | 6      | 6      | 10     | 10     | 14     | 14     | 18     | 18     |
| store buffer[3iv2]   | 3              | 9      | 3      | 9      | 3      | 9      | 3      | 9      | 3      | 9      |
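
For reference, a small Python sketch (an illustration, not part of the patch) that enumerates the scf.parallel iteration space above and recomputes the logical addresses, which is how this table can be reproduced:

# Reproduce the access-pattern table: iterate the scf.parallel space
# (%lb1, %ub1, %step1) = (0, 5, 1) and (%lb2, %ub2, %step2) = (1, 4, 2).
for iv1 in range(0, 5, 1):
    for iv2 in range(1, 4, 2):
        print(
            (iv1, iv2),
            5 * iv1 + 4,   # load mem1[5iv1+4]
            2 * iv1 + 3,   # load mem1[2iv1+3]
            3 * iv2 + 1,   # load mem1[3iv2+1]
            4 * iv1 + 2,   # load buffer[4iv1+2]
            3 * iv2,       # store buffer[3iv2]
        )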

And I realize that, for this example, according to the documentation:

Semantically we require that the iteration space can be iterated in any order, and the loop body can be executed in parallel. If there are data races, the behavior is undefined.

Since (iv1=0, iv2=1) and (iv1=0, iv2=3) both access mem1's 4th and 3rd elements, there is a potential data race and we cannot do any parallelism here. Granted, this example is completely artificial and maybe I'm over-complicating it. But if this were to happen, should we reject this kind of access pattern in the first place?


And if we take a step back to consider a somewhat simplified access pattern with only one induction variable:

| Memory access     | iv1=0 | 1 | 2  | 3  | 4  |
| ----------------- | ----- | - | -- | -- | -- |
| load mem1[5iv1+4] | 4     | 9 | 14 | 19 | 24 |
| load mem1[2iv1+3] | 3     | 5 | 7  | 9  | 11 |

What is the algorithm for determining the banking factor?
If we were to use a brute-force algorithm, we would search from bankingFactor = the number of iterations in the access pattern (5 in the example above) down to 1.
Let's arbitrarily take `bankingFactor = 4`; we find that:

  • Group 1 (iv1 = 0): load mem1[5iv1+4] = mem1[4], load mem1[2iv1+3] = mem1[3].
    "Consumes" two banks: 1. the first bank (4 % 4 = 0); 2. the fourth bank (3 % 4 = 3).

  • Group 2 (iv1 = 1): load mem1[5iv1+4] = mem1[9], load mem1[2iv1+3] = mem1[5].
    "Consumes" one bank, i.e., the second bank (9 % 4 = 1, 5 % 4 = 1).
    Groups 1 & 2 can be executed in parallel, since they do not share any bank:

par {
  component1 {
    iv1 = 0;
    load mem1[5iv1+4];
    load mem1[2iv1+3];
  }
  component2 {
    iv1 = 1;
    load mem1[5iv1+4];
    load mem1[2iv1+3];
  }
}
  • Group 3 (iv1 = 2): load mem1[5iv1+4] = mem1[14], load mem1[2iv1+3] = mem1[7].
    "Consumes" two banks: 1. the third bank (14 % 4 = 2); 2. the fourth bank (7 % 4 = 3).
    So Group 3 cannot be executed in parallel with Group 1, but can be executed in parallel with Group 2.

  • Group 4 (iv1 = 3): load mem1[5iv1+4] = mem1[19], load mem1[2iv1+3] = mem1[9].
    "Consumes" two banks: 1. the fourth bank (19 % 4 = 3); 2. the second bank (9 % 4 = 1).
    So Group 4 conflicts with Group 1 (both use the fourth bank), Group 2 (both use the second bank), and Group 3 (both use the fourth bank).

  • Group 5 (iv1 = 4): load mem1[5iv1+4] = mem1[24], load mem1[2iv1+3] = mem1[11].
    "Consumes" two banks: 1. the first bank (24 % 4 = 0); 2. the fourth bank (11 % 4 = 3).
    So Group 5 conflicts with Group 4, forcing Group 4 to be executed in sequence, isolated from the other groups; and so on with further analysis.

Putting the different groups together and determining which ones can be executed in parallel almost sounds like a graph coloring problem (see the sketch after the bullet points below). Is there a general way to do it?

  • In addition, we might move on to other bankingFactors and check whether we can achieve higher parallelism.
  • Worst case, if the work described above is too tedious, or if no feasible assignment exists, we fall back to a banking factor of 1, meaning there will be no parallelism at all: there will be only one component in the par group, and everything inside scf.parallel will be unrolled and executed in sequence inside that component.
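
To make the brute-force idea concrete, here is a minimal sketch of the analysis for the simplified one-variable pattern. It is my own illustration under the cyclic-banking assumption (bank = addr % bankingFactor), not code from this patch: it assigns each group's accesses to banks, builds a conflict graph between groups that share a bank, and greedily colors it, so that each color class is a set of groups that could sit in the same calyx.par.

from itertools import combinations

# Accesses per group for the simplified pattern:
#   group iv1 loads mem1[5*iv1+4] and mem1[2*iv1+3], for iv1 in 0..4.
groups = {iv1: [5 * iv1 + 4, 2 * iv1 + 3] for iv1 in range(5)}

def banks_used(addrs, banking_factor):
    # Cyclic banking assumption: address addr lives in bank addr % banking_factor.
    return {addr % banking_factor for addr in addrs}

def conflict_edges(groups, banking_factor):
    # Two groups conflict if they touch a common bank.
    return {
        (a, b)
        for a, b in combinations(sorted(groups), 2)
        if banks_used(groups[a], banking_factor) & banks_used(groups[b], banking_factor)
    }

def greedy_color(nodes, edges):
    # Greedy graph coloring; each color class could become one calyx.par step.
    color = {}
    for n in nodes:
        neighbors = {b for a, b in edges if a == n} | {a for a, b in edges if b == n}
        used = {color[m] for m in neighbors if m in color}
        c = 0
        while c in used:
            c += 1
        color[n] = c
    return color

for factor in range(5, 0, -1):
    edges = conflict_edges(groups, factor)
    coloring = greedy_color(sorted(groups), edges)
    steps = max(coloring.values()) + 1
    print(f"bankingFactor={factor}: {steps} sequential step(s), coloring={coloring}")

Trying banking factors from the iteration count down to 1 and keeping the best result is the brute-force search; the greedy coloring here is only a heuristic and does not guarantee the optimal grouping.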

Any suggestion to tackle this?

@rachitnigam
Contributor

You can look at the TraceBank paper for details on how to automatically determine good banking factors. I think optimal solutions exist, but for large programs you might have to use heuristics.

@ethanuppal

I second TraceBank.
