BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20211207T054746Z
LOCATION:263
DTSTART;TZID=America/Chicago:20211116T160000
DTEND;TZID=America/Chicago:20211116T163000
UID:submissions.supercomputing.org_SC21_sess264_exforum112@linklings.com
SUMMARY:Eliminate Variance, Keep Your SLAs: Domain-Specific Networks for M
 achine Learning
DESCRIPTION:Exhibitor Forum\n\nEliminate Variance, Keep Your SLAs: Domain-
 Specific Networks for Machine Learning\n\nAbts\n\nComputation- and communi
 cation-intensive workloads like machine learning (ML) and high-performance
  computing (HPC) require strict adherence to customer service level agreem
 ents (SLAs). With SLAs confounded by variability of run-to-run performance
 , loosely characterized as 99th percentile “tail latency”, optimizing thes
 e workloads requires eliminating sources of latency and performance varian
 ce. Groq’s emerging novel tensor streaming processor (TSP) architecture an
 d its RealScale™ synchronous network allow robust SLA delivery without ex
 ecution time variability to support batch-1 inference of giga-scale ML wor
 kloads. \n\nThis talk will give a guided tour of networking, both inside a
 nd out, for ML on Groq’s TSP system architecture. Data movement is used fo
 r fine-grained communication between processing elements for reshaping ten
 sors in ML workloads. We’ll discuss the interconnection network in terms o
 f topology, routing and flow control, focusing on the GroqChip™ processor’
 s unique on-chip and off-chip network. The on-chip network makes use of ha
 rdware support for tensor data types, which are lowered to a rank-2 tensor
  for the purpose of efficiently mapping to the underlying hardware, and pr
 ovides over 60 terabytes/sec of on-chip stream bandwidth to stream tensors
  to the functional units consuming them, and 3.6 terabytes/sec of off-chip
  bisection bandwidth interconnecting a rack of 72 GroqChips. Further, we w
 ill discuss instruction set architecture (ISA) support and software stack 
 for tensor re-shapes, optimizing tensor elements through rearrangement and
  efficiently parallelizing the workload. The resulting tensor streaming mu
 ltiprocessor allows modern giga-scale ML workloads to operate efficiently 
 at scale exploiting both model and data parallelism.\n\nTag: Correctness, 
 Machine Learning and Artificial Intelligence, Parallel Programming Languag
 es and Models\n\nRegistration Category: Tech Program Reg Pass, Exhibit Hal
 l Only
END:VEVENT
END:VCALENDAR