HPCA 26 Paper #494 Reviews and Comments
===========================================================================
Paper #494: HyGCN: A GCN Accelerator with Hybrid Architecture


Review #494A
===========================================================================

Overall merit (pre-rebuttal)
----------------------------
3. Weak accept

Overall merit (post rebuttal)
-----------------------------
2. Weak reject

Reviewer expertise
------------------
2. Some familiarity

Paper summary
-------------
A hybrid accelerator is proposed to efficiently execute the two very different phases of GCN processing, namely Aggregation and Combination. A programming model is also presented to leverage the proposed specialized hardware. The Aggregation engine mainly consists of a number of SIMD units and the intermediate buffers needed to access memory and to communicate with the Combination engine. To optimize the Aggregation phase, a data-aware access scheduling scheme is proposed to address the sparsity and irregularity of accesses. The Combination engine consists of multiple systolic arrays. Both the programming model and the hybrid architecture are designed to support fine-grained pipelining of the Aggregation and Combination phases.

Strengths
---------
The paper clearly motivates the necessity of designing a hybrid accelerator for GCNs. The paper also contains a detailed design and evaluation methodology.

Weaknesses
----------
Only the inference phase of GCNs is discussed, not the training phase.

The evaluation should include speedups relative to NN architectures, not only to a general-purpose CPU.

The writing should be improved.

Detailed Assessment and Feedback for Authors
--------------------------------------------
i. Regarding Section 3.1, Parallelism Exploitation: when running GCNs on a conventional processor, is there any way to quantify the waiting time among threads due to data copies and synchronization?

ii. The paper should compare against an implementation that uses only systolic arrays, or against more recent accelerators that use some form of dataflow architecture.

iii. Section 5.1, Experimental methodology: Is the architecture simulator execution-driven? What changes (if any) were made to the PyTorch Geometric libraries to realize the programming model proposed in the paper?

iv. Section 4.5.2, Coordination of off-chip memory access: When the memory access priority is specified as edges > input features > weights > output features, how is it ensured that low-priority accesses do not starve in the presence of high-priority accesses coming from different vertices?

v. The paper states that its main focus is the inference stage. It would be good to discuss what would differ in the training phase. How could the proposed design be leveraged for training, and is any design change needed to support it?


Review #494B
===========================================================================

Overall merit (pre-rebuttal)
----------------------------
3. Weak accept

Overall merit (post rebuttal)
-----------------------------
3. Weak accept

Reviewer expertise
------------------
2. Some familiarity

Paper summary
-------------
The paper performs some characterization of GCNs and proposes an architecture.
Strengths
---------
Brief but useful characterization of GCNs.

Weaknesses
----------
The paper mixes up algorithmic advances with the architecture.

The architecture seems to be basically an attachment of a multicore chip to many SIMD cores.

Detailed Assessment and Feedback for Authors
--------------------------------------------
The paper does a good job of introducing GCNs, and the early characterization is good. However, there is much room for improvement to make this an even better paper.

1) For typical GCN applications, what percentage of time is spent in aggregation and what percentage in combination on a traditional CPU?

2) Provide a more detailed analysis and comparison of aggregation with traditional "irregular workloads" such as sparse matrix-vector multiplication or big-graph processing, how specialized architectures and parallel implementations have approached them, and why those ideas may or may not apply here.

3) Finally, Section 4.3, from what I can tell, implicitly develops a parallel algorithm for aggregation. How much speedup would one obtain on a traditional CPU using this algorithmic advancement you have made?


Review #494C
===========================================================================

Overall merit (pre-rebuttal)
----------------------------
3. Weak accept

Overall merit (post rebuttal)
-----------------------------
3. Weak accept

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
This paper identifies the bottlenecks of processing graph convolutional neural networks (GCNs) on conventional architectures. To address these issues, the paper proposes an Edge- and MVM- (matrix-vector multiplication) centric programming model together with the HyGCN accelerator. Evaluations show that HyGCN achieves significant performance and energy improvements over the PyTorch Geometric framework running on an Intel Xeon CPU.

Strengths
---------
1. The motivation for a new programming model and the corresponding HyGCN accelerator is clear and strong.

2. Using PyTorch Geometric as the baseline is a fair comparison, as it is one of the newest platforms for analyzing graphs via neural networks.

Weaknesses
----------
1. The idea of this work is fairly simple, and the novelty is low.

2. The writing quality of the paper is low; many figures in the evaluation are not very readable.

3. The datasets used in the evaluation are relatively small.

Detailed Assessment and Feedback for Authors
--------------------------------------------
1. I like the motivation analysis of the work, which illustrates well the bottlenecks of current GCNs on conventional systems.

2. The figures in this paper should be improved, as most of them in the evaluation are hard to read.

3. Some questions regarding the evaluation:

a) What programming language is used to implement the GCN models?

b) The datasets used in the evaluation are relatively small compared to real-world graph datasets. Can you justify the selection of these small graph datasets? According to the paper, the memory capacity of the Linux workstation is 400 GB, so you should be able to use larger datasets, right?

c) The performance (speedup) and energy consumption data are normalized to the baseline. Could you show the absolute values of the numbers in the results? It is very difficult to draw insights from these normalized numbers.


Review #494D
===========================================================================

Overall merit (pre-rebuttal)
----------------------------
4. Accept
Overall merit (post rebuttal)
-----------------------------
4. Accept

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
The paper presents an accelerator for efficient execution of the computational phases of graph convolutional networks. The accelerator consists of an aggregation engine and a combination engine that is built as a systolic array. The aggregation engine handles graph traversal and feature aggregation and feeds the systolic array. The paper introduces interval-shard graph partitioning to increase data reuse in a memory hierarchy that uses an eDRAM as a shared buffer. The paper presents comparison results for standard datasets and GCN networks running on a Xeon CPU.

Strengths
---------
The paper is well written and makes the case for handling the aggregation phase as a distinct phase with its own processing needs.

The coordination of off-chip memory accesses coming from the edge buffer, input buffer, weight buffer, and output buffer to optimize the traffic sent to DRAM is a good idea.

The latency-aware and energy-aware pipelining of the aggregation engine and combination engine is a good addition to the coupling of the two hybrid engines, allowing optimization for either latency or energy.

Weaknesses
----------
A comparison against a GPU is missing.

A breakdown of the time taken on the CPU for aggregation and combination is missing, given that the PyG framework runs the workload phase by phase. Part of the framework inefficiency is reflected as CPU inefficiency.

Detailed Assessment and Feedback for Authors
--------------------------------------------
The paper is well written, and a hybrid engine that handles the aggregation phase and forms the aggregated feature vector that is later fed to a systolic array for computing the neural network (MLP) is unique. Having dedicated edge, weight, and input buffers and maximizing reuse is a good way to improve the efficiency of the overall engine.

The coordination of off-chip memory accesses coming from the edge, input, weight, and output buffers to optimize the traffic sent to DRAM is a good idea, as it allows for optimal use of the row buffers in the DRAM.

A comparison of the proposed accelerator against the workload running on a GPU is missing.

The PyG framework runs the workload phase by phase and does not take advantage of any sparsity removal; its blind prefetching simply pollutes the larger caches in the CPU's memory hierarchy. Part of the framework inefficiency is reflected as CPU inefficiency.


Review #494E
===========================================================================

Overall merit (pre-rebuttal)
----------------------------
2. Weak reject

Overall merit (post rebuttal)
-----------------------------
2. Weak reject

Reviewer expertise
------------------
2. Some familiarity

Paper summary
-------------
A new approach to machine learning called graph convolutional neural networks (GCNs) has emerged that specifically targets graph datasets. There are two primary and demanding computational phases, aggregation and combination, where the former behaves like graph processing and the latter like neural networks. Both are very demanding and execute inefficiently on standard hardware. This paper introduces HyGCN, a hybrid GCN accelerator that targets both computational phases and offers a programming model to support it.
The performance of the accelerator is compared with a software framework running on an Intel Xeon, and several orders of magnitude higher performance and energy improvements are observed.

Strengths
---------
- The paper seems to be the first to propose an accelerator for the emerging class of GCNs.
- The paper seems to propose many innovative ideas for GCN acceleration.

Weaknesses
----------
- The paper is not self-contained with respect to the concepts it builds on.
- It is not clear which concepts are new in graph processing and machine learning.
- It is not clear whether GCN algorithms are mature enough to warrant acceleration at this point in time.

Detailed Assessment and Feedback for Authors
--------------------------------------------
GCNs are emerging as the primary approach to applying machine learning to graph analytics. The paper very nicely analyzes the computational behavior of the two computationally demanding phases: aggregation and combination. It notes that aggregation faces the same issues as graph analytics, with poor cache behavior, and that combination faces similar issues as neural networks. However, it is claimed that because of the irregular behavior of the aggregation phase, there is not any parallelism (page 4). I do not follow that argument. First, it is not clear what irregular behavior refers to and why parallelism is absent. The opposite is claimed for the combination phase, and the reasoning is hard to follow. This needs to be clarified. (The concepts of dynamic/irregular and static/regular in Table 3 are not clear either.) Nevertheless, an important observation is that the two identified computational phases have very different computational behavior when it comes to compute vs. memory access. This is an important point.

The paper is not very careful in introducing concepts. Section 4.1 first discusses gather- vs. scatter-based processing methods and quickly concludes that gather-based methods are preferable, for a reason that escapes this reviewer. Later on, the concept of edge-level parallelism is introduced. This is not defined in the paper, and again this reviewer is lost with unclear concepts that seem central to understanding the design choices made. A paper must be self-contained to make an impact, especially when it combines concepts from two application domains - graph analytics and machine learning - as this paper does.

Section 4.2 introduces concepts that are superficially treated. For example, a prefetcher is added to prefetch graph and weight data. This is not trivial, and I expected to understand how that is done! Further, in Section 4.3 it is mentioned that vertex latency and workload imbalance are mitigated. First of all, vertex latency is a concept that must be defined, and it would be good to understand how it is mitigated. In general, the paper is not self-contained but builds on terminology that is only available to experts on graph processing or machine learning, which limits its accessibility to, for example, (machine learning) architecture experts from other domains.

Overall, I sense that there are many clever ideas in this paper, but it is hard to distinguish them from what is known in prior art. The paper does not do a good job of distilling what is new from the perspective of graph analytics and what is new from the perspective of machine learning. The related work section is too short to be of any help in making the case for the contribution of the paper.
However, the evaluation seems to make a solid case that the proposed hybrid accelerator offers orders of magnitude higher performance compared to running the workload on commodity hardware. That is impressive. However, GCN development is a moving target, and the question is whether the time is ripe for settling on an accelerator for this domain.


Response by Mingyu Yan (Author) (800 words)
---------------------------------------------------------------------------
We thank the reviewers for their valuable comments!

---

### [Contribution] Summary
(1) ***Quantitative characterization*** to motivate a GCN accelerator.
(2) A ***new programming model*** for graph-processing-like Aggregation and NN-like Combination.
(3) A ***hybrid architecture***: in the Aggregation Engine, we propose ***data-aware scheduling*** with sparsity elimination to optimize irregular and redundant data accesses; in the Combination Engine, we assemble ***multigranular systolic arrays*** to maximize data reuse.
(4) A flexible ***inter-engine pipeline*** and efficient DRAM ***access coordination*** to improve overall performance.

### [Comparison] to graph-analytics/NN accelerators
(1) Comparison to **graph-analytics** accelerators (Section 4.3.4): The length of the vertex feature vectors in GCNs is ***orders of magnitude larger*** than in traditional graph analytics, which significantly increases the ***intra-vertex parallelism*** (Fig. 4) and amplifies the gains from reduced feature loads via ***graph partitioning*** and ***sparsity elimination*** (Fig. 5). Moreover, the special ***Sample*** operation further increases the sparsity.
(2) Comparison to **NN (i.e., MLP here)** accelerators (Section 4.4.3): Weights cannot be shared in an MLP but are ***fully shared by vertices*** in the Combination phase of GCNs, which creates ***opportunities for data reuse***. Second, the ***multigranular systolic array*** design (Fig. 6) distinguishes our work from NN accelerators and enables flexible latency-/energy-aware working modes (Fig. 7).
(3) Besides, we ***co-optimize*** the two engines with a running pipeline (Fig. 8) and memory coordination (Fig. 9).

### [Challenges] on CPU
PyTorch Geometric might gain a slight speedup on a CPU with our algorithmic optimization, but it is still challenged by (1) an inefficient, ***workload-agnostic*** memory subsystem (Reference [16] & Section 3.1), (2) ***difficulty in exploiting data reuse*** as systolic arrays do (Reference [9] & Fig. 6), (3) ***expensive online preprocessing*** for workload reorganization and streaming, and (4) high power (***240 W vs. 6.7 W for ours***).

### Common Questions

#### @[CQ-1]: Reviewers [B/D] [Execution time of Aggregation/Combination on PyTorch Geometric (CPU)]
The execution time of Aggregation/Combination is ***57~82%***/***18~43%*** on graphs with long feature vectors (e.g., Cora/Citeseer), and ***88~99.9%***/***0.1~12%*** on the others. (A minimal sketch of the two phases follows these common questions.)

#### @[CQ-2]: Reviewers [C/D] [Absolute result values & GPU comparison]
We will add absolute result values and a GPU comparison in the revision. For example, compared to PyTorch Geometric on an ***NVIDIA V100***, our work achieves a ***6.67x (0.0015 s vs. 0.010 s)*** speedup and consumes only ***0.22% (0.0018 J vs. 0.82 J)*** of the energy when performing GraphSage on the COLLAB dataset.

#### @[CQ-3]: Reviewers [A/C/E] [Figure/writing polish; clarifications/details/comparisons to append]
In the revision, we will polish the figures and writing for better readability and provide the clarifications, details, and comparisons requested by the reviewers.
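For concreteness, the following is a minimal sketch of one GCN layer under a standard gather-based, sum-aggregator formulation, showing which loops correspond to the Aggregation and Combination phases referenced in [CQ-1]. The CSR graph layout, data types, and aggregator choice are illustrative assumptions, not the HyGCN programming model or the PyTorch Geometric implementation.

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch of one GCN layer: gather-based Aggregation (sum of
// neighbor feature vectors) followed by Combination (dense matrix-vector
// multiplication with weights shared by all vertices, plus ReLU).
// The CSR layout, types, and aggregator are illustrative assumptions.
void gcn_layer(const std::vector<std::size_t>& row_ptr,        // CSR offsets, size V+1
               const std::vector<std::size_t>& col_idx,        // CSR neighbor IDs
               const std::vector<std::vector<float>>& x,       // input features, V x F_in
               const std::vector<std::vector<float>>& w,       // weights, F_in x F_out
               std::vector<std::vector<float>>& y) {           // output features, V x F_out
  const std::size_t V = row_ptr.size() - 1;
  const std::size_t F_in = w.size();
  const std::size_t F_out = w[0].size();
  y.assign(V, std::vector<float>(F_out, 0.0f));

  for (std::size_t v = 0; v < V; ++v) {
    // Aggregation phase: irregular, data-dependent gather over the
    // incoming neighbors of vertex v (graph-processing-like behavior).
    std::vector<float> agg(F_in, 0.0f);
    for (std::size_t e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
      const std::vector<float>& nbr = x[col_idx[e]];
      for (std::size_t f = 0; f < F_in; ++f) agg[f] += nbr[f];
    }
    // Combination phase: regular matrix-vector multiplication; the same
    // weight matrix w is reused for every vertex (NN-like behavior).
    for (std::size_t i = 0; i < F_in; ++i)
      for (std::size_t j = 0; j < F_out; ++j) y[v][j] += agg[i] * w[i][j];
    for (std::size_t j = 0; j < F_out; ++j)
      if (y[v][j] < 0.0f) y[v][j] = 0.0f;                      // ReLU
  }
}
```

The neighbor gather is the data-dependent, irregular part that dominates the CPU time reported in [CQ-1], whereas the dense matrix-vector product reuses the same weights across all vertices, which is the regular part mapped onto systolic arrays.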
# Reviewer Questions

### @[Reviewer A]

#### [A-1] [How to quantify waiting time for data copies and synchronization]
Profile KMP_Barrier_Function & MKL_BLAS_Copy_Function with ***Intel VTune***.

#### [A-2] [Comparison to NN architectures]
The graph-based Aggregation, with its irregular compute/memory pattern (Reference [16] & Section 3), is incompatible with current NN accelerators, which are mainly tailored for regular patterns (Reference [9]); this makes it difficult to implement GCNs on them.

#### [A-3] [Is the architecture simulator execution-driven?]
Yes, see Section 5.1.

#### [A-4] [Changes to the PyTorch Geometric libraries]
PyTorch Geometric would need to significantly modify its coarse-grained message-passing mechanism to stream Aggregation-Combination for each vertex, and it would still suffer from the inefficiencies above (see [Challenges]). This is out of our scope, so on the CPU we use the standard programming models of the GCN domain rather than ours.

#### [A-5] [Starvation of low-priority accesses]
We execute access requests batch by batch. Low-priority accesses in the current batch are therefore handled before the high-priority accesses arriving in the next batch, rather than always serving high-priority accesses first.

#### [A-6] [Training discussion]
Training involves forward > backward > update passes with data dependencies, whose compute/memory pattern is more complex than that of inference, which has only the forward pass. Besides, gradient propagation through the graph is far more complicated than in layer-by-layer NNs. These concerns make training unsuitable for an initial work exploring GCN hardware. Training accelerators can leverage our architecture for the forward pass, and additionally need specialized blocks for the other passes and an efficient memory hierarchy to connect them.

#### [A-7] [Paper writing improvement]
See [CQ-3].

### @[Reviewer B]

#### [B-1] [Architecture design advances]
See [Contribution] & [Comparison].

#### [B-2] [Time breakdown of Aggregation/Combination on CPU]
See [CQ-1].

#### [B-3] [Comparison to graph processing]
See [Comparison].

#### [B-4] [Applying our algorithm on the CPU]
We will implement this in the revision, but the inefficiencies on the CPU still exist (see [Challenges]).

### @[Reviewer C]

#### [C-1] [Novelty of our work]
See [Contribution] & [Comparison].

#### [C-2] [Figure/writing revision for readability improvement]
See [CQ-3].

#### [C-3] [Programming language used to implement the GCN models]
A C++ interface/library.

#### [C-4] [Why these datasets]
The datasets (Table 5) are standard in the GCN domain. Due to the long vertex feature vectors, they are actually not small compared to those used for conventional graph analytics.

#### [C-5] [Absolute values of the result numbers]
See [CQ-2].

### @[Reviewer D]

#### [D-1] [Comparison to GPU]
See [CQ-2].

#### [D-2] [Time breakdown of Aggregation/Combination on CPU]
See [CQ-1].

#### [D-3] [Framework inefficiency vs. CPU inefficiency]
We will clarify this in the revision.

### @[Reviewer E]

#### [E-1] [Novelty; comparison to graph analytics or neural networks; related work & writing improvement]
See [Contribution] & [Comparison] & [CQ-3].

#### [E-2] [Why we need a GCN accelerator]
GCNs are showing great potential in various tasks (Reference [14,18,24,34,33,39]). Some companies have deployed GCNs in data centers (Reference [3,27,41]), which reflects their increasing importance and upcoming wide application. In this context, an efficient architecture is timely to achieve high performance and stimulate GCN development, much like the early DianNao (ASPLOS 2014) in machine learning.

#### [E-3] [Concept clarification & writing improvement]
Thanks for the kind reminders.
We will add definitions and clarify all the points below in the revision.

**(1) Different execution behavior of Aggregation/Combination**
Here 'irregular & dynamic' means that the computational-graph and memory-access patterns vary for each vertex. Aggregation's irregular behavior is caused by randomly traversing neighboring vertices, while Combination's regular behavior is due to the identical connection pattern of every neuron within an NN layer. Regarding Aggregation parallelism, our claim only means that this irregularity can impede parallelism exploitation on general-purpose processors. In fact, Aggregation does contain parallelism, which we exploit in our design. Specifically: (a) **Intra-vertex parallelism**: due to the long vertex feature vectors, the aggregation of the ***elements*** can be performed ***in parallel*** (Fig. 4); (b) **Edge-level parallelism**: as each vertex possesses many incoming edges (neighbors), an ***edge-by-edge pipeline*** (Fig. 4) is enabled.

**(2) Why a gather-based implementation**
Executing the scatter of multiple vertices concurrently requires atomic operations to avoid read-after-write conflicts and requires a final synchronization, which hinders parallelism exploitation (Reference [16]).

**(3) Prefetcher implementation**
The prefetch function is integrated into eSched (the edge scheduler) and the SIMD cores. Its workflow is as follows (a minimal illustrative sketch of this address computation appears at the end of this thread):
1) eSched requests edge data (i.e., the NeighborID) from the Edge Buffer for the SIMD cores: (a) EdgeOffset = EdgeIndex * EdgeDataSize; (b) EdgeAccessAddress = EdgeOffset + EdgeArrayBaseAddress; (c) access the edge data (i.e., the NeighborID).
2) The SIMD cores request the feature from the Input Buffer: (a) FeatureOffset = NeighborID * FeatureDataSize; (b) FeatureAccessAddress = FeatureOffset + FeatureArrayBaseAddress; (c) access the feature data.

**(4) Mitigation of workload imbalance and vertex latency**
We assign the aggregation of the elements inside a vertex feature vector to the SIMD cores for parallel execution (Fig. 4). If a vertex cannot occupy all cores, the free cores can be assigned to other vertices. Thus, the SIMD cores are always busy and there is no workload imbalance. The vertex latency is the execution time to complete one vertex's computation. Since the SIMD cores perform a vertex's element aggregation in parallel, the vertex latency is smaller than when processing multiple vertices together.


Comment @A1 by Reviewer D
---------------------------------------------------------------------------
[Summary] This paper was discussed during the HPCA PC meeting. The reviewers were of the opinion that the writing quality of the paper needs to improve, that the paper needs to be better organized to discuss the algorithmic improvements independently of the implementation, and that the comparisons need to be done against the same algorithm implemented on CPU/GPU vs. the accelerator. Concerns were also raised about the size of the datasets, the energy efficiency claims, and the profiling data provided in the rebuttal. The paper was conditionally accepted with shepherding and needs to be improved prior to final submission. Prof. Per Stenstrom will be the PC member shepherding the paper. He can be reached at per.stenstrom@chalmers.se, and he will be in contact with the authors of the paper regarding future revisions and the timeline for revising the paper.
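As a concrete illustration of the two-step address computation described in point (3) of the author response above, the following is a minimal sketch. The struct and function names, data types, and configuration fields are assumptions made for illustration only; they are not taken from the HyGCN design.

```cpp
#include <cstdint>

// Illustrative sketch of the two-step indirect addressing from point (3)
// of the author response: eSched first forms the address of an edge record
// to obtain a NeighborID, and only then can the SIMD cores form the address
// of that neighbor's feature vector. All names and widths are assumptions.
struct PrefetchConfig {
  uint64_t edge_array_base;     // base address of the edge array
  uint64_t feature_array_base;  // base address of the feature array
  uint32_t edge_data_size;      // bytes per edge record
  uint32_t feature_data_size;   // bytes per vertex feature vector
};

// Step 1 (eSched): EdgeAccessAddress = EdgeIndex * EdgeDataSize + EdgeArrayBaseAddress.
inline uint64_t edge_access_address(const PrefetchConfig& c, uint64_t edge_index) {
  return c.edge_array_base + edge_index * c.edge_data_size;
}

// Step 2 (SIMD cores): FeatureAccessAddress = NeighborID * FeatureDataSize +
// FeatureArrayBaseAddress, computed only after the edge record fetched in
// Step 1 has returned the NeighborID.
inline uint64_t feature_access_address(const PrefetchConfig& c, uint64_t neighbor_id) {
  return c.feature_array_base + neighbor_id * c.feature_data_size;
}
```

The essential point is the data dependence between the two steps: the feature address cannot be formed until the edge fetch has returned the NeighborID, which is why the prefetch logic is split between eSched and the SIMD cores.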