An interpretation of the first HiPChips workshop, 2022
The first HiPChips workshop (High Performance Chiplet and Interconnect Architectures) was held on June 19, 2022, in New York, USA, together with the 49th ISCA conference. It explored the future impact of chiplets and interconnect technology, with the aim of accelerating cooperation between industry and academia and building a chiplet ecosystem.
This HiPChips is also the first chiplet-themed workshop to appear at a computer architecture conference. It attracted participants from Google, Meta (Facebook), Intel, AMD, NVIDIA, ETH Zurich, the University of Illinois (UIUC), the University of California, Los Angeles (UCLA), Georgia Tech, IIT Bombay, and others, who presented cutting-edge research and progress in standardization and other areas.
Agenda and slides of the first workshop:
HiPChips Chiplet Workshop @ ISCA Conference
Chiplet-Based Accelerator Level Parallelism (ALP)
Chiplet Architecture for Large Scale System Design
Physical and Logical Inter-Die Interface Design for Heterogeneous Architectures
Coherent and Non-Coherent Data Sharing Protocols via Fast Chiplet Interconnection
Chiplet Architectures for In-Memory Computing Technology
ODSA-Based 3D Architecture for Efficient ML Acceleration
Chiplet-Based Security Computing
Power Evaluation and Performance Modeling for Chiplet Architecture
Software Optimization Framework with Fast Inter-Chiplet Network
Chiplet Topology Aware ML Optimizations
Scheduling for Massive Heterogeneous Chiplet-Based Processors
How to divide data between Chiplets and optimize data migration for more efficient parallel processing is the key to success.
62.7% of the system's energy consumption is spent on data movement.
Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu, "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks," Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018.
Speaker: Bob Brennan (VP, Customer Solutions Engineering, Intel Foundry Services)
A variety of factors drive the move from monolithic designs to chiplets
Manufacturing cost of large chips
Different chips have different process requirements
The bandwidth and power gap in AI computing
Chip design cost and time to market at advanced process nodes
Optimization of system-level high-speed IO interfaces
Evolution from single-core to multi-core systems: standard interfaces, interface protocols, software stack
IO case study: disaggregated PCIe and memory
Server case study: Multi-Core Userver
Networking/storage case study: IPU/DPU
AI case study: cache-based inference architecture
Chiplets open up worldwide cooperation. Intel's talk shows its confidence that chip architecture across the industry is moving toward chiplets and presents use cases from multiple domains, but it also indicates that chiplet development still faces challenges that require cooperation across the ecosystem to solve.
Speakers: Dharmesh Jani (Open Ecosystem Lead @ Meta), Ravi Agarwal (Technical Sourcing Manager)
Typical human behavior scenarios
Cognition: understanding the world and building a cognitive model requires a great deal of training and mining. Finding targets in various types of data requires synthesis and the creation of new things.
Challenges of AI application scenarios
Use scenarios, challenges, and case studies for heterogeneously integrated large-scale chips. The talk illustrates the dilemma that computing platforms face with AI workloads and points to packaging technology as the way to break through it.
Speakers: Tawfik Rahal-Arabi, Anshuman Mittal (@ AMD)
Moore's Law is failing, but data center demand for computing power is still growing
Power management comparison of clients and data centers
Chiplet technology can be used to distribute and manage chip power consumption
Chiplet technology allows power to be partitioned and managed at fine granularity.
In the two scenarios shown, using chiplets can save 25% of power consumption, but how to partition the cores remains a problem
Power distribution in different ways
Manage power consumption through algorithms
The road to high efficiency in the data center. From the perspective of product power consumption, AMD elaborates on how it builds its chiplet architecture and argues that chiplets help software and hardware partition power consumption and perform power management better.
Speaker: Rakesh Kumar (@ University of Illinois Urbana-Champaign)
The evolution of wafer-scale computing
Earlier wafer-scale chips aimed to build the whole system on a single wafer, at high cost
UCLA's wafer-scale silicon interconnect technology
Wafer-scale GPU architecture
Yield comparison of interconnect architectures:
Thread block and data layout strategy
"MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability", A. Arunkumar et al., ISCA 2017
Chiplet-based wafer-scale computing. The talk explains the limitations of earlier wafer-scale chips, introduces wafer-scale chip interconnect technology, thermal design, and so on, and uses a wafer-scale GPU as an example to show the advantages of wafer-scale computing, although its connection to chiplets does not seem strong.
Speaker: Puneet Gupta (@ UCLA)
2048-chiplet architecture
Specific details of wafer-scale chip design
Wafer-scale design must consider chip power delivery, clocking, pre- and post-silicon testing, and the IO die architecture.
Designing a large-scale system based on chiplets. Compared with the chiplet-based wafer-scale computing talk, this talk describes the chip architecture more clearly, along with the clocking, testing, and other problems that wafer-scale design needs to solve.
Speaker:
Big data processing has put forward higher requirements for hardware platforms.
In-memory computing (IMC) provides a practical way to alleviate the von Neumann bottleneck.
The crossbar architecture provides a good platform for computing deep learning networks.
IMC accelerators use a fixed on-chip (monolithic) architecture. An IMC chip therefore consumes more power because of its larger area, so a chiplet design in a 2.5D package is an alternative option.
RRAM- and SRAM-based implementations have been explored for chiplet-based IMC architectures.
Users can tune parameters to adjust the architecture, including mapping, architecture partitioning, and IMC cell characteristics.
The DNN parameters and architecture parameters are fed into SIAM. SIAM allocates parameters and resources, including the on-chip and board-level interconnect, and builds a computing platform. Its evaluation tools then assess characteristics such as performance and latency for the overall architecture.
SIAM input includes
The calculation architecture of SIAM is as follows:
Component granularity can be refined down to the individual IMC cell, spanning a wide range of levels.
The big-little chiplet architecture
Characteristics of DNNs: the inherently non-linear distributions of weights and activations hurt IMC utilization, require more hardware resources and power, and raise the cost of the overall system.
Algorithm mapping aims to maximize IMC utilization
Little chiplet groups that integrate smaller IMC arrays handle the initial or smaller layers and suit scenarios with heavy data movement; big chiplet groups with larger IMC arrays handle the larger, deeper layers and suit scenarios with light data movement.
The network-on-package (NoP) carries large volumes of data to each chiplet group.
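The big-little mapping idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not give the actual mapping algorithm, so the `Layer` record, the `assign_chiplet_group` rule, the weight threshold, and the layer sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    weights: int      # number of weights the layer stores (hypothetical values)
    activations: int  # activation volume passed to the next layer (hypothetical)


def assign_chiplet_group(layer: Layer, weight_threshold: int) -> str:
    """Hypothetical rule: small layers go to little chiplets, large ones to big chiplets."""
    return "big" if layer.weights > weight_threshold else "little"


# Early layers: few weights but many activations, so heavy data movement.
# Deep layers: many weights but few activations, so light data movement.
layers = [
    Layer("conv1", 9_408, 802_816),
    Layer("conv4", 2_359_296, 100_352),
    Layer("fc", 2_048_000, 1_000),
]

mapping = {layer.name: assign_chiplet_group(layer, weight_threshold=1_000_000)
           for layer in layers}
print(mapping)
```

A real mapper would also weigh IMC utilization and NoP traffic rather than a single threshold; the sketch only shows the direction of the assignment.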
Comparison with chips of the same type
ACM database page
A heterogeneous chiplet architecture for DNN computing. To accelerate deep learning networks, the talk proposes a new chiplet architecture, a heterogeneous mix of big and little IMC cores, and builds a software simulation environment, SIAM, for the architecture that integrates performance and area evaluation tools.
Compared with GPUs and other accelerators, the architecture improves the performance and power efficiency of DNN models by about 100 times.
Speakers: Tianqi Tang, Yuan Xie (@ University of California)
Chiplet cost breakdown
Cost exploration of chiplet architectures under advanced packaging. The talk builds a mathematical model of chiplet manufacturing cost and conducts case studies on homogeneous and heterogeneous chiplet systems. Chiplets have different costs in different applications.
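To make the cost trade-off concrete, here is a minimal sketch of a chiplet cost comparison. It is not the model from the talk: it assumes the textbook negative-binomial yield model, and every number (defect density, cost per mm², die-to-die area overhead, packaging cost) is a hypothetical placeholder.

```python
def die_yield(area_mm2: float, defect_density_per_mm2: float, alpha: float = 3.0) -> float:
    """Negative-binomial yield model: Y = (1 + D*A/alpha) ** -alpha."""
    return (1.0 + defect_density_per_mm2 * area_mm2 / alpha) ** (-alpha)


def silicon_cost(area_mm2: float, cost_per_mm2: float, defect_density_per_mm2: float) -> float:
    """Cost of one known-good die: raw silicon cost divided by yield."""
    return area_mm2 * cost_per_mm2 / die_yield(area_mm2, defect_density_per_mm2)


def system_cost(total_area_mm2: float, n_chiplets: int, cost_per_mm2: float,
                defect_density_per_mm2: float, d2d_area_overhead: float = 0.0,
                package_cost: float = 0.0) -> float:
    """Split the logic area over n dies, add D2D interface area, then add packaging cost."""
    die_area = total_area_mm2 / n_chiplets * (1.0 + d2d_area_overhead)
    dies = n_chiplets * silicon_cost(die_area, cost_per_mm2, defect_density_per_mm2)
    return dies + package_cost


# Hypothetical 800 mm^2 design: one monolithic die versus four chiplets.
monolithic = system_cost(800.0, 1, cost_per_mm2=0.1, defect_density_per_mm2=0.001)
chiplets = system_cost(800.0, 4, cost_per_mm2=0.1, defect_density_per_mm2=0.001,
                       d2d_area_overhead=0.1, package_cost=10.0)
print(f"monolithic: {monolithic:.2f}, chiplets: {chiplets:.2f}")
```

With these placeholder numbers the four-chiplet split is cheaper for the large design, while for a small design (for example 100 mm² in total) the same overheads make the monolithic die cheaper, which is one way to read the statement that chiplet cost depends on the application.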
Speaker: Allan Cantle (@ Nallasway)
Comparison of architectures with on-package and off-package memory configurations
Traditional local memory and RDMA
Shared memory interconnected through CXL
Shared CXL memory connected through co-packaged optics
Local memory connected through an OIF-VSR chiplet interface
Co-packaged optics, CXL shared memory, and local memory interconnected through an OIF-VSR interface
The comparison leads to the following conclusions:
Architectures using silicon photonic interconnect consume less power
The target architecture
On-Package architecture
Off-Package architecture
Computing architecture inside and beyond the package boundary. The talk compares the performance and power consumption of several on-package and off-package combinations and concludes that the target architecture of the future must be a hybrid of on-package and off-package.
Speakers: Edi Roytman, Ajaya Durg, Thomas Liljeberg, Ling Liao, Robert Munoz (all @ Intel Corporation)
From the characteristics of HBM/DDR to the ideal system memory of AI/HPC nodes
Directly accessible by all compute and communication types; modular, composable, and scalable; shareable and poolable; HBM-like bandwidth; DDR-like capacity, latency, and ECC; LPDDR-like efficiency
Intel's optical compute interconnect solution
Schematic of the optical module's integrated chip and package:
Memory access architecture diagram:
Direct memory access -> memory of common nodes -> shared/pooled memory and IO devices
Advanced memory architecture can get better performance
When should optical interconnect technology be used?
Optical interconnect suits scenarios with more compute nodes and higher power consumption.
Integrated optical connectivity: opportunities for chiplets in AI and HPC. The talk emphasizes that an optically connected, high-bandwidth, large-capacity memory architecture can reach 5 times the performance, or 2 to 3 times the performance at the same cost. It therefore calls for research on the sensitivity of AI/HPC workloads and on reference designs for optical interconnect, and for jointly developing highly interoperable, usable chiplet interface standards for XPU and optical interfaces.
Speaker:
Pruek Vanna-iampikul, Serhat Erdogan, Mohanalingam Kathaperumal, Madhavan Swaminathan, and Sung Kyu Lim (@ Georgia Institute of Technology)
Ram Gupta, Ravi Agarwal (@ Meta)
Praveen Anmula, Kevin Rinebold (@ Siemens)
A TSV-free 3D stacked packaging method
Footprint, PPA+SI/PI comparison
Smaller Footprint.
Glass interposer integration of logic and memory chiplets: PPA and power/signal integrity benefits. Glass is a new interposer material for 2.5D and 3D packaging. A performance comparison between glass and silicon shows that glass can support 3D chiplet integration at lower cost, with better PPA and SI/PI.
Speakers: Raja Swaminathan, John Wuu (@ AMD Senior Fellow)
The road to AMD 3D V-Cache. The AMD Zen 3 CPU uses a chiplet to extend the L3 cache with V-Cache, from 32 MB to 96 MB (32 MB + 64 MB), and generally achieves a 15% performance improvement.
Speaker: Alex Burlak (VP Test & Analytics @ proteanTecs)
Heterogeneous integration faces challenges in quality and reliability
In-chip monitoring and deep data analytics for high-bandwidth D2D reliability. The talk mainly introduces proteanTecs' work on high-resolution per-lane monitoring, visualization from product to yield, advanced feature detection, and coverage-oriented test optimization.
Speakers: Shahab Ardalan (LMNS), Bapi Vinnakota (BRCM), Tawfik Arabi (AMD), Elad Alon (BCA)
A comparative study for evaluating D2D interfaces. This talk has many figures but little information. What can be seen is a comparison based on the energy efficiency and latency of unidirectional and bidirectional links; its key conclusion is that energy efficiency and latency are what matter.
Speaker: Elad Alon
BoW (Bunch of Wires) is a physical-layer standard for D2D parallel interfaces
The BoW interface for D2D application scenarios. This talk reads much like an ODSA promotional talk for BoW, introducing BoW's advantages and contributions in low latency, timing design, interoperability and packaging flexibility, RX/TX channel signal compatibility, and so on.
Speaker: Dharmesh Jani (Open Ecosystem Lead @ Meta, Co-Chair, OCP Incubation Committee)
The arrival of the DSA era
In 2018, John Hennessy and David Patterson predicted the coming of the domain-specific architecture (DSA) era
ODSA's responsibility
OCP works mainly at the module, subsystem, system, and data center levels. Since 2019, it has worked on module-level components through ODSA.
ODSA promotes the chiplet marketplace ecosystem through open D2D interfaces, chiplet reference designs, and reference workflows, which in turn supports the development of other OCP work.
ODSA technology stack
ODSA builds the chiplet ecosystem under the OCP organization. The talk introduces how ODSA, under OCP, is building an open chiplet system-on-chip ecosystem, and covers ODSA's work, including chiplet packaging technology, interface protocol technology, and use cases.
Speaker: John Wilson (NVIDIA)
Goal of computing architecture upgrades: higher compute performance per watt
Evolution of package bandwidth:
High-bandwidth-density, energy-efficient, short-reach signaling for massively parallel computing. From a process perspective, the talk discusses the bandwidth limits of off-chip and off-package links. For interface PHY design, it proposes single-ended signaling to raise chip-to-chip bandwidth in data-intensive scenarios: ground-referenced signaling at the package and PCB levels, and simultaneous bidirectional signaling at the interposer level. It also shows that even with 2.5D packaging, chip-to-chip data transfer still consumes a lot of power, which remains a challenge.
Speaker:
Shekar Geedimatla, Robin James Payyappillil, Devi Sreekumar, and Shalabh Gupta (Department of Electrical Engineering, IIT Bombay, Mumbai 400076, India)
The BoW standard supports high-density signal interconnect on the substrate
Each BoW slice has 16 signal lines, each providing 16 Gbps, so one slice provides up to 256 Gbps of bandwidth.
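The slice arithmetic can be checked with a short Python sketch. The 16 signals and 16 Gbps per line come from the text; the helper names and the four-slice example are hypothetical and only show how aggregate bandwidth scales with slice count.

```python
SIGNALS_PER_SLICE = 16  # signal lines in one BoW slice (from the text)
GBPS_PER_SIGNAL = 16    # per-line data rate in Gbps (from the text)


def slice_bandwidth_gbps(signals: int = SIGNALS_PER_SLICE,
                         rate_gbps: int = GBPS_PER_SIGNAL) -> int:
    """Peak bandwidth of one slice: number of lines times per-line rate."""
    return signals * rate_gbps


def interface_bandwidth_gbps(n_slices: int) -> int:
    """Peak bandwidth of an interface built from n identical slices."""
    return n_slices * slice_bandwidth_gbps()


print(slice_bandwidth_gbps())       # 256
print(interface_bandwidth_gbps(4))  # 1024 (hypothetical four-slice interface)
```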
Dual Stripline configuration
A half-pitch offset reduces signal crosstalk.
An efficient signal routing method for the BoW interface based on a dual-stripline configuration. The dual-stripline configuration increases the routing density of the package, and the half-pitch offset reduces the effect of crosstalk. Simulation results show that the eye diagrams and crosstalk meet the requirements.
Speaker: Ken Chang, Scott Huss (@ Cadence)
Chiplet IO type classification
Compared with serial PAM4 differential signaling and parallel interfaces on advanced packaging, parallel interfaces on standard packaging offer a compromise in both energy efficiency and bandwidth density.
Cadence's D2D interface, UltraLink
6/7-bit encoding keeps the signal as DC-balanced as possible.
Design space of chiplet IO interfaces. The talk divides current chiplet IO types into three categories by packaging and signal encoding, and introduces the Cadence D2D interface in detail. Its 6/7-bit encoding gives the interface a relatively balanced advantage across low latency, energy efficiency, bandwidth density, and low cost. It also expects UCIe to unify chiplet interfaces in the future.
Researchers: Weiming Zhao, Weifeng Zhang (@ Alibaba Cloud)
Software development challenges
Vertical scaling: a fragmented software ecosystem
Different hardware from different vendors; huge porting effort; lack of interoperability; long time to market
Workload parallelization for chiplet systems as distributed computing systems
Goals and solutions for the software upgrade
Reduce the workload while making full use of AI performance
Vertical scaling to more kinds of hardware, with sensible scheduling of work across different chips; horizontal scaling to more processors of the same kind
No dependence on a specific AI framework; smaller memory footprint; shorter run time
Unified AI computing programming model: Open Deep Learning API (ODLA)
Optimizing compiler framework: Heterogeneity Aware Lowering & Optimization (HALO)
Compiles AI algorithms to ODLA-based code; builds an optimization workflow; applies classic compiler optimizations to AI algorithms; supports optimization and sharing across heterogeneous devices
HALO architecture
HALO components
Performance comparison after using HALO
A software compilation framework for heterogeneous architectures. Alibaba Cloud's HALO/ODLA, under development since 2017, is tailorable and extensible, and therefore offers a scalable lightweight interface, a minimal memory footprint, and built-in support for heterogeneous parallelism. It suits hardware/software co-designed computing platforms such as chiplet-based acceleration systems.
To solve the problem that AI models are not portable across different AI computing platforms, the various AI models are compiled to an API described in C++. Through the runtime library that implements the API, the C++ program of an AI model can run on different computing platforms.
Speaker: Rishi Chugh (@ Cadence)
Parameters of the chiplet system automation engine
Configurable IO chiplet architecture. The talk feels like a promotional presentation of Cadence's chiplet products, which offer full-flow tools such as interface IP and performance evaluation, so customers only need to care about the architecture level.
Speaker: Rohit Mittal & Cliff Young (@ Google)
The case for the evolution of general-purpose chiplets. Google's talk proposes chiplets as the future direction for custom chips, but the ecosystem's chicken-and-egg problem must be broken first.
Speaker: Dr. Duncan Haldane (@ JITX)
A chiplet system compiler
Compiler framework:
Using BoW as an example, it describes slices, die component architecture, links, and other information
Software-defined chiplet system design. The talk describes integrating and optimizing chiplet, package, and board through the Chisel language and a software-defined approach. Its system design representation (ESIR) and chiplet compiler make automatic verification and optimization of chiplet systems more efficient and convenient. Presumably, the tapeout flow still involves traditional EDA tools.
"Chiplet Integration" from the software and hardware perspectives, the chiplet ecosystem: observations from the ISCA 2022 HiPChips workshop
CEREBRAS: Labor chip
Jitx_esir