
[SCFToCalyx] Lower SCF Parallel Op To Calyx #7409

Draft · wants to merge 1 commit into main

Conversation

jiahanxie353
Contributor

This patch lowers the scf.parallel op to Calyx by invoking multiple calyx.components in parallel. I'm not pushing test files for now because some required features are not upstream yet.

But I can give an example, say we have:

module {
  func.func @main() {
    %c2 = arith.constant 2 : index
    %c1 = arith.constant 1 : index
    %c3 = arith.constant 3 : index
    %c0 = arith.constant 0 : index
    %alloc = memref.alloc() : memref<6xi32>
    %alloc_1 = memref.alloc() : memref<6xi32>
    scf.parallel (%arg2, %arg3) = (%c0, %c0) to (%c3, %c2) step (%c2, %c1) {
      %4 = arith.shli %arg3, %c2 : index
      %5 = arith.addi %4, %arg2 : index
      %6 = memref.load %alloc_1[%5] : memref<6xi32>
      %7 = arith.shli %arg2, %c1 : index
      %8 = arith.addi %7, %arg3 : index
      memref.store %6, %alloc[%8] : memref<6xi32>
      scf.reduce
    }
    return
  }
}

The output is:

module attributes {calyx.entrypoint = "main"} {
  calyx.component @main(%clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
    %c2_i32 = hw.constant 2 : i32
    %c1_i32 = hw.constant 1 : i32
    %c0_i32 = hw.constant 0 : i32
    %mem_1.addr0, %mem_1.write_data, %mem_1.write_en, %mem_1.clk, %mem_1.reset, %mem_1.read_data, %mem_1.done = calyx.memory @mem_1 <[6] x 32> [3] {external = true} : i3, i32, i1, i1, i1, i32, i1
    %mem_0.addr0, %mem_0.write_data, %mem_0.write_en, %mem_0.clk, %mem_0.reset, %mem_0.read_data, %mem_0.done = calyx.memory @mem_0 <[6] x 32> [3] {external = true} : i3, i32, i1, i1, i1, i32, i1
    %par_func_0_3_instance.in0, %par_func_0_3_instance.in1, %par_func_0_3_instance.in2, %par_func_0_3_instance.in4, %par_func_0_3_instance.clk, %par_func_0_3_instance.reset, %par_func_0_3_instance.go, %par_func_0_3_instance.done = calyx.instance @par_func_0_3_instance of @par_func_0_3 : i32, i32, i32, i32, i1, i1, i1, i1
    %par_func_0_2_instance.in0, %par_func_0_2_instance.in1, %par_func_0_2_instance.in2, %par_func_0_2_instance.in4, %par_func_0_2_instance.clk, %par_func_0_2_instance.reset, %par_func_0_2_instance.go, %par_func_0_2_instance.done = calyx.instance @par_func_0_2_instance of @par_func_0_2 : i32, i32, i32, i32, i1, i1, i1, i1
    %par_func_0_1_instance.in0, %par_func_0_1_instance.in1, %par_func_0_1_instance.in2, %par_func_0_1_instance.in4, %par_func_0_1_instance.clk, %par_func_0_1_instance.reset, %par_func_0_1_instance.go, %par_func_0_1_instance.done = calyx.instance @par_func_0_1_instance of @par_func_0_1 : i32, i32, i32, i32, i1, i1, i1, i1
    %par_func_0_0_instance.in0, %par_func_0_0_instance.in1, %par_func_0_0_instance.in2, %par_func_0_0_instance.in4, %par_func_0_0_instance.clk, %par_func_0_0_instance.reset, %par_func_0_0_instance.go, %par_func_0_0_instance.done = calyx.instance @par_func_0_0_instance of @par_func_0_0 : i32, i32, i32, i32, i1, i1, i1, i1
    calyx.wires {
    }
    calyx.control {
      calyx.seq {
        calyx.par {
          calyx.seq {
            calyx.invoke @par_func_0_0_instance[arg_mem_0 = mem_1, arg_mem_1 = mem_0](%par_func_0_0_instance.in0 = %c0_i32, %par_func_0_0_instance.in1 = %c2_i32, %par_func_0_0_instance.in2 = %c0_i32, %par_func_0_0_instance.in4 = %c1_i32) -> (i32, i32, i32, i32)
          }
          calyx.seq {
            calyx.invoke @par_func_0_1_instance[arg_mem_0 = mem_1, arg_mem_1 = mem_0](%par_func_0_1_instance.in0 = %c1_i32, %par_func_0_1_instance.in1 = %c2_i32, %par_func_0_1_instance.in2 = %c0_i32, %par_func_0_1_instance.in4 = %c1_i32) -> (i32, i32, i32, i32)
          }
          calyx.seq {
            calyx.invoke @par_func_0_2_instance[arg_mem_0 = mem_1, arg_mem_1 = mem_0](%par_func_0_2_instance.in0 = %c0_i32, %par_func_0_2_instance.in1 = %c2_i32, %par_func_0_2_instance.in2 = %c2_i32, %par_func_0_2_instance.in4 = %c1_i32) -> (i32, i32, i32, i32)
          }
          calyx.seq {
            calyx.invoke @par_func_0_3_instance[arg_mem_0 = mem_1, arg_mem_1 = mem_0](%par_func_0_3_instance.in0 = %c1_i32, %par_func_0_3_instance.in1 = %c2_i32, %par_func_0_3_instance.in2 = %c2_i32, %par_func_0_3_instance.in4 = %c1_i32) -> (i32, i32, i32, i32)
          }
        }
      }
    }
  } {toplevel}
  calyx.component @par_func_0_0(%in0: i32, %in1: i32, %in2: i32, %in4: i32, %clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
    %true = hw.constant true
    %std_slice_1.in, %std_slice_1.out = calyx.std_slice @std_slice_1 : i32, i3
    %std_slice_0.in, %std_slice_0.out = calyx.std_slice @std_slice_0 : i32, i3
    %std_add_1.left, %std_add_1.right, %std_add_1.out = calyx.std_add @std_add_1 : i32, i32, i32
    %std_lsh_1.left, %std_lsh_1.right, %std_lsh_1.out = calyx.std_lsh @std_lsh_1 : i32, i32, i32
    %load_0_reg.in, %load_0_reg.write_en, %load_0_reg.clk, %load_0_reg.reset, %load_0_reg.out, %load_0_reg.done = calyx.register @load_0_reg : i32, i1, i1, i1, i32, i1
    %std_add_0.left, %std_add_0.right, %std_add_0.out = calyx.std_add @std_add_0 : i32, i32, i32
    %std_lsh_0.left, %std_lsh_0.right, %std_lsh_0.out = calyx.std_lsh @std_lsh_0 : i32, i32, i32
    %arg_mem_1.addr0, %arg_mem_1.write_data, %arg_mem_1.write_en, %arg_mem_1.clk, %arg_mem_1.reset, %arg_mem_1.read_data, %arg_mem_1.done = calyx.memory @arg_mem_1 <[6] x 32> [3] : i3, i32, i1, i1, i1, i32, i1
    %arg_mem_0.addr0, %arg_mem_0.write_data, %arg_mem_0.write_en, %arg_mem_0.clk, %arg_mem_0.reset, %arg_mem_0.read_data, %arg_mem_0.done = calyx.memory @arg_mem_0 <[6] x 32> [3] : i3, i32, i1, i1, i1, i32, i1
    calyx.wires {
      calyx.group @bb0_2 {
        calyx.assign %std_slice_1.in = %std_add_0.out : i32
        calyx.assign %arg_mem_0.addr0 = %std_slice_1.out : i3
        calyx.assign %load_0_reg.in = %arg_mem_0.read_data : i32
        calyx.assign %load_0_reg.write_en = %true : i1
        calyx.assign %std_add_0.left = %std_lsh_0.out : i32
        calyx.assign %std_lsh_0.left = %in0 : i32
        calyx.assign %std_lsh_0.right = %in1 : i32
        calyx.assign %std_add_0.right = %in2 : i32
        calyx.group_done %load_0_reg.done : i1
      }
      calyx.group @bb0_5 {
        calyx.assign %std_slice_0.in = %std_add_1.out : i32
        calyx.assign %arg_mem_1.addr0 = %std_slice_0.out : i3
        calyx.assign %arg_mem_1.write_data = %load_0_reg.out : i32
        calyx.assign %arg_mem_1.write_en = %true : i1
        calyx.assign %std_add_1.left = %std_lsh_1.out : i32
        calyx.assign %std_lsh_1.left = %in2 : i32
        calyx.assign %std_lsh_1.right = %in4 : i32
        calyx.assign %std_add_1.right = %in0 : i32
        calyx.group_done %arg_mem_1.done : i1
      }
    }
    calyx.control {
      calyx.seq {
        calyx.seq {
          calyx.enable @bb0_2
          calyx.enable @bb0_5
        }
      }
    }
  }
calyx.component @par_func_0_1(%in0: i32, %in1: i32, %in2: i32, %in4: i32, %clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
    // same idea
  }
calyx.component @par_func_0_2(%in0: i32, %in1: i32, %in2: i32, %in4: i32, %clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
    // same idea
  }
calyx.component @par_func_0_3(%in0: i32, %in1: i32, %in2: i32, %in4: i32, %clk: i1 {clk}, %reset: i1 {reset}, %go: i1 {go}) -> (%done: i1 {done}) {
    // same idea
  }
}

The design choice is worth discussing:
Re. why I created multiple FuncOps for the body of scf.parallel: there are two candidates for invoking something in parallel under Calyx's par:

  • calyx.group
  • calyx.component

The former is not expressive enough to hold the body of scf.parallel; the latter is, and since we know each FuncOp will eventually be lowered to a calyx.component, we can simply invoke the components in parallel in the control section, merely changing the operands passed to each one.

Question:
Can we invoke the same calyx.component together? If so, I can only create one @par_func instance instead of multiple ones.

jiahanxie353 added the Calyx (The Calyx dialect) label on Jul 30, 2024
@rachitnigam
Contributor

@jiahanxie353 this patch is marked as a draft. Is that intentional? It would be better to review it once it is in a state where the design is mostly fixed and on the path to being merged.

@cgyurgyik
Member

I have a few C++/coding related comments, but I'll postpone these since you're in draft mode.

So currently, you've created N components that have the same M cells, which are subsequently invoked in parallel in some main component, i.e.,

component main() -> () {
  cells { 
    A0 = A0(); A1 = A1(); A2 = A2(); ...; An = An(); 
  }
  control {
    par { invoke A0; invoke A1; invoke A2; ...; invoke An; }
  }
}

component A0() -> () { ... }
...
component An() -> () { ... }

The alternative is what you've questioned,

Can we invoke the same calyx.component together? If so, I can only create one @par_func instance instead of multiple ones.

i.e., create a component that takes as input M cells. This should be possible, but would require wiring the correct inputs and outputs.

component main() -> () {
  cells {
    A0 = A(); A1 = A(); ... ; An = A();
  }
  control {
    par { invoke A0(...); invoke A1(...); ...; invoke An(...); }
  }
}

component A() -> () { ... }

@jiahanxie353
Contributor Author

this patch is marked as a draft. Is that intentional? It would be better to review it once it is in a state where the design is mostly fixed and on the path to being merged.

I marked it as a draft for two reasons:

  1. The test cases can only be added once some other PRs are merged first, since we are missing some required features here;
  2. The high-level design is not fixed yet; it'd be great to get feedback from reviewers, and once we have settled on a design (for instance, how many components to create), we can move it out of draft. With that said, the implementation of the current design is solid.

This should be possible, but would require wiring the correct inputs and outputs.

Makes sense!


Discussion needed:
After translating to Calyx, I ran into the issue of a Calyx memory's write_en being driven by multiple sources, the root cause of which is memref.store %6, %alloc[%8] : memref<6xi32>.
This is because I'm passing the same memory by reference to different components. Under scf.parallel's semantics, we are certain that different memref.stores will not write to the same memory address. However, when lowering to Calyx, we lose this information and only have

calyx.group @bb0_5 {
  ....
  calyx.assign %arg_mem_1.write_en = %true : i1
  ....
}

across different components. And since the signal has multiple drivers, we can't lower it to Verilog.

Any thoughts?

@rachitnigam
Contributor

Okay, the reasons for marking it a draft make sense @jiahanxie353!

In the scf.parallel's semantics, we are certain that we won't be storing to the same memory address by different memref.stores.

It is not sufficient to show that we are going to write to different memory locations. The problem is that a hardware memory can only support a single read or write in a clock cycle (because it is a single-ported memory). For cases like this, you need to "bank" the memory into separate parts so that parallel accesses go to entirely different physical memories. First, I'd recommend reading the Dahlia paper to get a sense of what is going on. Next, we'd need to think more carefully about how this should be supported. @andrewb1999's work is adding this capability in a more general fashion, but we might need to bank memories somehow.
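
For intuition, here is a minimal sketch of one common scheme, cyclic (interleaved) banking; it only illustrates the idea and is not what this patch implements, and the helper names (bank_of, local_offset, conflict) are hypothetical. Each logical address maps to a bank index plus a local offset, and two accesses can happen in the same cycle only if their bank indices differ.

# Sketch of cyclic (interleaved) memory banking; illustration only.
BANKING_FACTOR = 4

def bank_of(addr: int) -> int:
    # Which physical memory (bank) services this logical address.
    return addr % BANKING_FACTOR

def local_offset(addr: int) -> int:
    # Address within that physical bank.
    return addr // BANKING_FACTOR

def conflict(addr_a: int, addr_b: int) -> bool:
    # Two accesses conflict (cannot be serviced in the same cycle) iff they
    # hit the same single-ported bank.
    return bank_of(addr_a) == bank_of(addr_b)

print(bank_of(4), local_offset(4))  # bank 0, offset 1
print(bank_of(3), local_offset(3))  # bank 3, offset 0
print(conflict(4, 3))               # False: these two accesses could go in parallel

Under such a split, one logical memref would become BANKING_FACTOR physical memories, and the lowering would have to route each access to its bank.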

Also tagging @ethanuppal since he's been thinking about similar problems in his language.

@jiahanxie353
Contributor Author

Is there a general algorithm to determine an "optimal" banking factor for each memory?

Consider a somewhat made-up example of moderate complexity:

module {
  func.func @main(%buffer : memref<16xi32>) {
    %idx_one = arith.constant 1 : index
    %idx_two = arith.constant 2 : index
    %idx_three = arith.constant 3 : index
    %idx_four = arith.constant 4 : index
    %idx_five = arith.constant 5 : index

    %one = arith.constant 1 : i32
    %six = arith.constant 6 : i32
    %val = arith.addi %one, %six : i32

    %step1 = arith.constant 1 : index
    %step2 = arith.constant 2 : index
    %lb1 = arith.constant 0 : index
    %lb2 = arith.constant 1 : index
    %ub1 = arith.constant 5 : index
    %ub2 = arith.constant 4 : index
    %mem1 = memref.alloc() : memref<25xi32>

    scf.parallel (%iv1, %iv2) = (%lb1, %lb2) to (%ub1, %ub2) step (%step1, %step2) {
      %mul1 = arith.muli %iv1, %idx_five : index
      %load_idx1 = arith.addi %mul1, %idx_four : index
      %load_val1 = memref.load %mem1[%load_idx1] : memref<25xi32>

      %mul2 = arith.muli %iv1, %idx_two : index
      %load_idx2 = arith.addi %mul2, %idx_three : index
      %load_val2 = memref.load %mem1[%load_idx2] : memref<25xi32>
      %sum1 = arith.addi %load_val1, %load_val2 : i32

      %mul3 = arith.muli %iv2, %idx_three : index
      %load_idx3 = arith.addi %mul3, %idx_one : index
      %load_val3 = memref.load %mem1[%load_idx3] : memref<25xi32>

      %mul4 = arith.muli %iv1, %idx_four : index
      %load_idx4 = arith.addi %mul4, %idx_two : index
      %load_val4 = memref.load %buffer[%load_idx4] : memref<16xi32>
      %sum2 = arith.addi %load_val3, %load_val4 : i32

      %store_idx = arith.muli %iv2, %idx_three : index
      %store_val = arith.addi %sum1, %sum2 : i32
      memref.store %store_val, %buffer[%store_idx] : memref<16xi32>
    }

    return
  }
}

And I have generated the memory access patterns:

| Memory access        | (iv1=0, iv2=1) | (0, 3) | (1, 1) | (1, 3) | (2, 1) | (2, 3) | (3, 1) | (3, 3) | (4, 1) | (4, 3) |
| -------------------- | -------------- | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| load mem1[5iv1+4]    | 4              | 4      | 9      | 9      | 14     | 14     | 19     | 19     | 24     | 24     |
| load mem1[2iv1+3]    | 3              | 3      | 5      | 5      | 7      | 7      | 9      | 9      | 11     | 11     |
| load mem1[3iv2+1]    | 4              | 10     | 4      | 10     | 4      | 10     | 4      | 10     | 4      | 10     |
| load buffer[4iv1+2]  | 2              | 2      | 6      | 6      | 10     | 10     | 14     | 14     | 18     | 18     |
| store buffer[3iv2]   | 3              | 9      | 3      | 9      | 3      | 9      | 3      | 9      | 3      | 9      |
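
For reference, a small Python sketch (an illustration, not part of the patch) that enumerates the scf.parallel iteration space above and recomputes the logical addresses, which is how this table can be reproduced:

# Reproduce the access-pattern table: iterate the scf.parallel space
# (%lb1, %ub1, %step1) = (0, 5, 1) and (%lb2, %ub2, %step2) = (1, 4, 2).
for iv1 in range(0, 5, 1):
    for iv2 in range(1, 4, 2):
        print(
            (iv1, iv2),
            5 * iv1 + 4,   # load mem1[5iv1+4]
            2 * iv1 + 3,   # load mem1[2iv1+3]
            3 * iv2 + 1,   # load mem1[3iv2+1]
            4 * iv1 + 2,   # load buffer[4iv1+2]
            3 * iv2,       # store buffer[3iv2]
        )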

And I realize that, for this example, according to the documentation:

Semantically we require that the iteration space can be iterated in any order, and the loop body can be executed in parallel. If there are data races, the behavior is undefined.

Since (iv1=0, iv2=1) and (iv1=0, iv2=3) both access mem1's 4th and 3rd elements, there is a potential data race and we cannot do any parallelism here. Granted, this example is completely artificial and maybe I'm over-complicating it. But if this were to happen, should we reject this kind of access pattern in the first place?


And if we take a step back to consider a somewhat simplified access pattern with only one induction variable:

| Memory access     | iv1=0 | 1 | 2  | 3  | 4  |
| ----------------- | ----- | - | -- | -- | -- |
| load mem1[5iv1+4] | 4     | 9 | 14 | 19 | 24 |
| load mem1[2iv1+3] | 3     | 5 | 7  | 9  | 11 |

What is the algorithm for determining the banking factor?
If we were to use a brute-force algorithm, we would search from bankingFactor = the number of iterations in the access pattern (5 in the example above) down to 1.
Let's arbitrarily take `bankingFactor = 4`; we find that:

  • Group 1 (iv1 = 0): load mem1[5iv1+4] = mem1[4], load mem1[2iv1+3] = mem1[3].
    "Consumes" two banks: 1. the first bank (4 % 4 = 0); 2. the fourth bank (3 % 4 = 3).

  • Group 2 (iv1 = 1): load mem1[5iv1+4] = mem1[9], load mem1[2iv1+3] = mem1[5].
    "Consumes" one bank, i.e., the second bank (9 % 4 = 1, 5 % 4 = 1).
    Groups 1 & 2 can be executed in parallel, since they do not share any bank:

par {
  component1 {
    iv1 = 0;
    load mem1[5iv1+4];
    load mem1[2iv1+3];
  }
  component2 {
    iv1 = 1;
    load mem1[5iv1+4];
    load mem1[2iv1+3];
  }
}
  • Group 3 (iv1 = 2): load mem1[5iv1+4] = mem1[14], load mem1[2iv1+3] = mem1[7].
    "Consumes" two banks: 1. the third bank (14 % 4 = 2); 2. the fourth bank (7 % 4 = 3).
    So Group 3 cannot be executed in parallel with Group 1, but can be executed in parallel with Group 2.

  • Group 4 (iv1 = 3): load mem1[5iv1+4] = mem1[19], load mem1[2iv1+3] = mem1[9].
    "Consumes" two banks: 1. the fourth bank (19 % 4 = 3); 2. the second bank (9 % 4 = 1).
    So Group 4 conflicts with Group 1 (both use the fourth bank), Group 2 (both use the second bank), and Group 3 (both use the fourth bank).

  • Group 5 (iv1 = 4): load mem1[5iv1+4] = mem1[24], load mem1[2iv1+3] = mem1[11].
    "Consumes" two banks: 1. the first bank (24 % 4 = 0); 2. the fourth bank (11 % 4 = 3).
    So Group 5 conflicts with Group 4, forcing Group 4 to be executed in sequence, isolated from the other groups; and so on with further analysis.

Putting the different groups together and determining which ones can be executed in parallel almost sounds like a graph coloring problem (see the sketch after the bullet points below). Is there a general way to do it?

  • In addition, we might move on to other bankingFactors and check whether we can achieve higher parallelism.
  • Worst case, if the work described above is too tedious, or if no feasible assignment exists, we fall back to a banking factor of 1, meaning there will be no parallelism at all: there will be only one component in the par group, and everything inside scf.parallel will be unrolled and executed in sequence inside that component.
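
To make the brute-force idea concrete, here is a minimal sketch of the analysis for the simplified one-variable pattern. It is my own illustration under the cyclic-banking assumption (bank = addr % bankingFactor), not code from this patch: it assigns each group's accesses to banks, builds a conflict graph between groups that share a bank, and greedily colors it, so that each color class is a set of groups that could sit in the same calyx.par.

from itertools import combinations

# Accesses per group for the simplified pattern:
#   group iv1 loads mem1[5*iv1+4] and mem1[2*iv1+3], for iv1 in 0..4.
groups = {iv1: [5 * iv1 + 4, 2 * iv1 + 3] for iv1 in range(5)}

def banks_used(addrs, banking_factor):
    # Cyclic banking assumption: address addr lives in bank addr % banking_factor.
    return {addr % banking_factor for addr in addrs}

def conflict_edges(groups, banking_factor):
    # Two groups conflict if they touch a common bank.
    return {
        (a, b)
        for a, b in combinations(sorted(groups), 2)
        if banks_used(groups[a], banking_factor) & banks_used(groups[b], banking_factor)
    }

def greedy_color(nodes, edges):
    # Greedy graph coloring; each color class could become one calyx.par step.
    color = {}
    for n in nodes:
        neighbors = {b for a, b in edges if a == n} | {a for a, b in edges if b == n}
        used = {color[m] for m in neighbors if m in color}
        c = 0
        while c in used:
            c += 1
        color[n] = c
    return color

for factor in range(5, 0, -1):
    edges = conflict_edges(groups, factor)
    coloring = greedy_color(sorted(groups), edges)
    steps = max(coloring.values()) + 1
    print(f"bankingFactor={factor}: {steps} sequential step(s), coloring={coloring}")

Trying banking factors from the iteration count down to 1 and keeping the best result is the brute-force search; the greedy coloring here is only a heuristic and does not guarantee the optimal grouping.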

Any suggestion to tackle this?

@rachitnigam
Contributor

You can look at the TraceBank paper for details on how to automatically determine good banking factors. I think optimal solutions exist, but for large programs you might have to use heuristics.

@ethanuppal

I second TraceBank.
