Metadata-Version: 2.1
Name: sodac
Version: 0.0.20200428.dev2
Summary: Stencil with optimized dataflow architecture
Home-page: https://github.com/Blaok/soda
Author: Blaok Chi
License: UNKNOWN
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: System :: Hardware
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: cached-property
Requires-Dist: haoda (>=0.0.20200428.dev1)
Requires-Dist: pulp
Requires-Dist: textx
Requires-Dist: toposort

# SODA: Stencil with Optimized Dataflow Architecture

## Publication

+ Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou. [SODA: Stencil with Optimized Dataflow Architecture](https://doi.org/10.1145/3240765.3240850). In ICCAD, 2018. (Best Paper Candidate) [[PDF]](https://about.blaok.me/pub/iccad18.pdf) [[Slides]](https://about.blaok.me/pub/iccad18.slides.pdf)

## SODA DSL Example

    # comments start with hashtag(#)

    kernel: blur      # the kernel name, will be used as the kernel name in HLS
    burst width: 512  # DRAM burst I/O width in bits, for Xilinx platform by default it's 512
    unroll factor: 16 # how many pixels are generated per cycle

    # specify the dram bank, type, name, and dimension of the input tile
    # the last dimension is not needed and a placeholder '*' must be given
    # dram bank is optional
    # multiple inputs can be specified but 1 and only 1 must specify the dimensions
    input dram 0 uint16: input(2000, *)

    # specify an intermediate stage of computation, may appear 0 or more times
    local uint16: tmp(0, 0) = (input(-1, 0) + input(0, 0) + input(1, 0)) / 3

    # specify the output
    # dram bank is optional
    output dram 1 uint16: output(0, 0) = (tmp(0, -1) + tmp(0, 0) + tmp(0, 1)) / 3

    # how many times the whole computation is repeated (only works if input matches output)
    iterate: 2

    # how to deal with border, currently only 'ignore' is available
    border: ignore

    # how to cluster modules, currently only 'none' is available
    cluster: none

    # constant values that may be referenced as coefficients or lookup tables (implementation currently broken)
    # array partitioning information can be passed to HLS code
    param uint16, partition cyclic factor=2 dim=1, partition cyclic factor=2 dim=2: p1[20][30]
    # keyword 'dup' allows simultaneous access to the same parameter
    param uint16, dup 3, partition complete: p2[20]
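
As a plain-software cross-check of what the `blur` example above computes, here is a minimal Python sketch of one pass of the two stages. It is not generated by SODA. The function name `blur_reference` is hypothetical, the first stencil index is assumed to be the fastest-varying (x) dimension, `/ 3` on `uint16` is assumed to truncate as in C, and only pixels whose full 3x3 neighborhood is in bounds are returned, so the sketch makes no claim about what `border: ignore` does at the edges.

```python
def blur_reference(image):
    """One pass of the two-stage blur; image[y][x] holds non-negative ints."""
    height, width = len(image), len(image[0])

    # Stage `tmp`: tmp(0, 0) = (input(-1, 0) + input(0, 0) + input(1, 0)) / 3
    tmp = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(1, width - 1):
            tmp[y][x] = (image[y][x - 1] + image[y][x] + image[y][x + 1]) // 3

    # Stage `output`: output(0, 0) = (tmp(0, -1) + tmp(0, 0) + tmp(0, 1)) / 3
    # Only interior pixels are produced (assumption, see the note above).
    result = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            row.append((tmp[y - 1][x] + tmp[y][x] + tmp[y + 1][x]) // 3)
        result.append(row)
    return result


sample = [[0, 3, 6],
          [3, 6, 9],
          [6, 9, 12]]
print(blur_reference(sample))  # [[6]]
```

A reference like this may be handy for comparing against C-Sim results on small inputs.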

## TODOs

+ [x] support multiple inputs & outputs
+ [x] use RTL flow to accelerate HLS

## Design Considerations

+ All keywords are mandatory, except that intermediate `local` stages and extra `param` declarations are optional
+ For a non-iterative stencil, choose the `unroll factor` so that the kernel saturates the external DRAM bandwidth, since on-chip resources are usually not the bottleneck
+ For an iterative stencil, whether to use more PEs in a single iteration or to implement more iterations is yet to be explored
+ Currently `math.h` functions can be parsed, but type inference is not fully implemented
+ Note that `2.0` will be a `double` number. To generate `float`, use `2.0f`. This may help reduce DSP usage
+ SODA is tiling-based and the size of the tile is specified in the `input` keyword. The last dimension is omitted because it is not needed in the reuse buffer generation
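
The bandwidth consideration above can be turned into a rough back-of-the-envelope calculation. The sketch below is only an estimate under stated assumptions: each bank delivers one `burst width` word per cycle, and each generated pixel consumes exactly one input element. The helper name `saturating_unroll_factor` and its `banks` parameter are hypothetical and not part of SODA.

```python
def saturating_unroll_factor(burst_width_bits, element_bits, banks=1):
    """Pixels per cycle needed to consume one burst word per bank per cycle."""
    if burst_width_bits % element_bits != 0:
        raise ValueError("burst width must be a multiple of the element width")
    return banks * burst_width_bits // element_bits


# 512-bit bursts of uint16 pixels (assumed single bank)
print(saturating_unroll_factor(512, 16))  # 32
# 512-bit bursts of 32-bit float pixels (assumed single bank)
print(saturating_unroll_factor(512, 32))  # 16
```

Actual achievable bandwidth depends on the platform and memory controller, so treat the result as a starting point rather than a guaranteed optimum.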

## Getting Started

### Prerequisites

+ Python 3.6+
+ Python dependencies installed via `python3 -m pip install -r requirements.txt`
+ SDAccel 2018.3 (earlier versions might work but won't be supported)

### Clone the Repo
    git clone https://github.com/UCLA-VAST/soda.git
    cd soda

### Generate HLS kernel code
    make kernel

### Run C-Sim
    make csim

### Generate HDL code
    make hls SYNTHESIS_FLOW=rtl

### Run Co-Sim
    make cosim SYNTHESIS_FLOW=rtl

### Generate FPGA Bitstream
    make bitstream SYNTHESIS_FLOW=rtl

### Run Bitstream
    make hw SYNTHESIS_FLOW=rtl # requires actual FPGA hardware and driver

## Code Snippets

### Configuration

+ 5-point 2D Jacobi: `t0(0, 0) = (t1(0, 1) + t1(1, 0) + t1(0, 0) + t1(0, -1) + t1(-1, 0)) * 0.2f`
+ tile size is `(2000, *)`

Each function in the code snippets below is synthesized into an RTL module.
Their arguments are all `hls::stream` FIFOs.
Without unrolling, a simple line-buffer pipeline is generated, producing 1 pixel per cycle.
With an unroll factor of 2, a SODA microarchitecture pipeline is generated, producing 2 pixels per cycle.

### Without Unrolling

    #pragma HLS dataflow
    Module1Func(
      /*output*/ &from_t1_offset_0_to_t1_offset_1999,
      /*output*/ &from_t1_offset_0_to_t0_pe_0,
      /* input*/ &from_super_source_to_t1_offset_0);
    Module2Func(
      /*output*/ &from_t1_offset_1999_to_t1_offset_2000,
      /*output*/ &from_t1_offset_1999_to_t0_pe_0,
      /* input*/ &from_t1_offset_0_to_t1_offset_1999);
    Module3Func(
      /*output*/ &from_t1_offset_2000_to_t1_offset_2001,
      /*output*/ &from_t1_offset_2000_to_t0_pe_0,
      /* input*/ &from_t1_offset_1999_to_t1_offset_2000);
    Module3Func(
      /*output*/ &from_t1_offset_2001_to_t1_offset_4000,
      /*output*/ &from_t1_offset_2001_to_t0_pe_0,
      /* input*/ &from_t1_offset_2000_to_t1_offset_2001);
    Module4Func(
      /*output*/ &from_t1_offset_4000_to_t0_pe_0,
      /* input*/ &from_t1_offset_2001_to_t1_offset_4000);
    Module5Func(
      /*output*/ &from_t0_pe_0_to_super_sink,
      /* input*/ &from_t1_offset_0_to_t0_pe_0,
      /* input*/ &from_t1_offset_1999_to_t0_pe_0,
      /* input*/ &from_t1_offset_2000_to_t0_pe_0,
      /* input*/ &from_t1_offset_4000_to_t0_pe_0,
      /* input*/ &from_t1_offset_2001_to_t0_pe_0);

In the above code snippet, `Module1Func` to `Module4Func` are forwarding modules; they constitute the line buffer.
The line buffer size is approximately two lines of pixels, i.e. 4000 pixels.
`Module5Func` is a computing module; it implements the computation kernel.
The whole design is fully pipelined; however, with only 1 computing module, it can only produce 1 pixel per cycle.
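
The offsets that appear in the FIFO names above (0, 1999, 2000, 2001, 4000) are consistent with linearizing each stencil tap over a 2000-pixel-wide tile and measuring its distance from the most recently read pixel. The Python sketch below reproduces those numbers. It is an interpretation of the snippet, not SODA's actual implementation, and the name `reuse_offsets` is hypothetical.

```python
def reuse_offsets(taps, tile_width):
    """Distance of each tap from the newest pixel in a row-major stream."""
    # Linearize (x, y) with x as the fastest-varying dimension (assumption).
    linear = [x + y * tile_width for x, y in taps]
    newest = max(linear)
    return sorted(newest - value for value in linear)


# Taps of the 5-point 2D Jacobi kernel from the configuration above
JACOBI_TAPS = [(0, 1), (1, 0), (0, 0), (0, -1), (-1, 0)]
offsets = reuse_offsets(JACOBI_TAPS, 2000)
print(offsets)      # [0, 1999, 2000, 2001, 4000]
print(offsets[-1])  # 4000, the reuse distance spanned by the line buffer
```

The largest offset matches the stated line buffer size of roughly two lines of pixels.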

### Unroll 2 Times

    #pragma HLS dataflow
    Module1Func(
      /*output*/ &from_t1_offset_1_to_t1_offset_1999,
      /*output*/ &from_t1_offset_1_to_t0_pe_0,
      /* input*/ &from_super_source_to_t1_offset_1);
    Module1Func(
      /*output*/ &from_t1_offset_0_to_t1_offset_2000,
      /*output*/ &from_t1_offset_0_to_t0_pe_1,
      /* input*/ &from_super_source_to_t1_offset_0);
    Module2Func(
      /*output*/ &from_t1_offset_1999_to_t1_offset_2001,
      /*output*/ &from_t1_offset_1999_to_t0_pe_1,
      /* input*/ &from_t1_offset_1_to_t1_offset_1999);
    Module3Func(
      /*output*/ &from_t1_offset_2000_to_t1_offset_2002,
      /*output*/ &from_t1_offset_2000_to_t0_pe_1,
      /*output*/ &from_t1_offset_2000_to_t0_pe_0,
      /* input*/ &from_t1_offset_0_to_t1_offset_2000);
    Module4Func(
      /*output*/ &from_t1_offset_2001_to_t1_offset_4001,
      /*output*/ &from_t1_offset_2001_to_t0_pe_1,
      /*output*/ &from_t1_offset_2001_to_t0_pe_0,
      /* input*/ &from_t1_offset_1999_to_t1_offset_2001);
    Module5Func(
      /*output*/ &from_t1_offset_2002_to_t1_offset_4000,
      /*output*/ &from_t1_offset_2002_to_t0_pe_0,
      /* input*/ &from_t1_offset_2000_to_t1_offset_2002);
    Module6Func(
      /*output*/ &from_t1_offset_4001_to_t0_pe_0,
      /* input*/ &from_t1_offset_2001_to_t1_offset_4001);
    Module7Func(
      /*output*/ &from_t0_pe_0_to_super_sink,
      /* input*/ &from_t1_offset_1_to_t0_pe_0,
      /* input*/ &from_t1_offset_2000_to_t0_pe_0,
      /* input*/ &from_t1_offset_2001_to_t0_pe_0,
      /* input*/ &from_t1_offset_4001_to_t0_pe_0,
      /* input*/ &from_t1_offset_2002_to_t0_pe_0);
    Module8Func(
      /*output*/ &from_t1_offset_4000_to_t0_pe_1,
      /* input*/ &from_t1_offset_2002_to_t1_offset_4000);
    Module7Func(
      /*output*/ &from_t0_pe_1_to_super_sink,
      /* input*/ &from_t1_offset_0_to_t0_pe_1,
      /* input*/ &from_t1_offset_1999_to_t0_pe_1,
      /* input*/ &from_t1_offset_2000_to_t0_pe_1,
      /* input*/ &from_t1_offset_4000_to_t0_pe_1,
      /* input*/ &from_t1_offset_2001_to_t0_pe_1);

In the above code snippet, `Module1Func` to `Module6Func` and `Module8Func` are forwarding modules; they constitute the line buffers of the SODA microarchitecture.
Although unrolled, the line buffer size is still approximately two lines of pixels, i.e. 4000 pixels.
`Module7Func` is a computing module; it is instantiated twice.
The whole design is fully pipelined and can produce 2 pixels per cycle.
In general, the unroll factor can be set to any number that satisfies the throughput requirement.
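
The FIFO names in the unrolled snippet suggest a simple pattern: each PE reads the base reuse offsets shifted by a per-PE amount, and the offsets are split into one forwarding chain per residue modulo the unroll factor. The Python sketch below reproduces the names seen above under that reading. The shift rule, the chain grouping, and the function name `unrolled_layout` are inferred from the snippet for illustration and are not taken from SODA's source.

```python
def unrolled_layout(base_offsets, unroll_factor):
    """Return per-PE input offsets and per-residue forwarding chains."""
    # Assumed rule: PE p reads every base offset shifted by (unroll_factor - 1 - p).
    pe_inputs = {
        pe: sorted(offset + unroll_factor - 1 - pe for offset in base_offsets)
        for pe in range(unroll_factor)
    }
    all_offsets = sorted({o for inputs in pe_inputs.values() for o in inputs})
    # Assumed rule: offsets with the same residue form one forwarding chain.
    chains = {
        residue: [o for o in all_offsets if o % unroll_factor == residue]
        for residue in range(unroll_factor)
    }
    return pe_inputs, chains


# Base offsets of the 5-point Jacobi kernel with a 2000-pixel-wide tile
pe_inputs, chains = unrolled_layout([0, 1999, 2000, 2001, 4000], 2)
print(pe_inputs[0])  # [1, 2000, 2001, 2002, 4001]
print(pe_inputs[1])  # [0, 1999, 2000, 2001, 4000]
print(chains[0])     # [0, 2000, 2002, 4000]
print(chains[1])     # [1, 1999, 2001, 4001]

# Each chain holds every unroll_factor-th pixel, so the total buffered
# pixel count stays near two lines.
buffered = sum((c[-1] - c[0]) // 2 for c in chains.values())
print(buffered)      # 4000
```

Under these assumptions the total buffered pixel count is unchanged by unrolling, which agrees with the statement that the line buffer size is still about 4000 pixels.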

## Projects Using SODA

+ Yi-Hsiang Lai, Yuze Chi, Yuwei Hu, Jie Wang, Cody Hao Yu, Yuan Zhou, Jason Cong, Zhiru Zhang. [HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing](https://doi.org/10.1145/3289602.3293910). In FPGA, 2019. (Best Paper Candidate) [[PDF]](https://about.blaok.me/pub/fpga19-heterocl.pdf) [[Slides]](https://about.blaok.me/pub/fpga19-heterocl.slides.pdf)
+ Yuze Chi, Young-kyu Choi, Jason Cong, Jie Wang. [Rapid Cycle-Accurate Simulator for High-Level Synthesis](https://doi.org/10.1145/3289602.3293918). In FPGA, 2019. [[PDF]](https://about.blaok.me/pub/fpga19-flash.pdf) [[Slides]](https://about.blaok.me/pub/fpga19-flash.slides.pdf)


