Imzy
Copyright © 2017 Saurus, Inc. All rights reserved.
Technology

Latest developments and curiosities in the world of technology

19786 members
Posted by zhemao in /technology, Apr 05 at 11:56 AM

Quantifying the performance of the TPU, our first machine learning chip

We've been using compute-intensive machine learning in our products for the past 15 years. We use it so much that we even designed an entirely new class of custom machine learning accelerator, the Tensor Processing Unit. Just how fast is the TPU, actually?

googleblog.com
Comments (3)
  • zhemao, Apr 05 at 2:46 PM

    A bit of a summary after skimming the paper. Academic papers can be kind of a slog.

    Most of you are likely familiar with the basics of how neural networks operate. The network is programmed with a set of weights, which it uses to make a decision based on a provided input. For instance, you could feed the network an image and ask it whether it's a cat picture. There is a "learning" phase, during which you feed it a bunch of inputs for which the answer is already known. The network generates a decision based on its current weights, checks against the correct answer, and then updates its weights accordingly. In this way, you "train" the neural network to make better decisions.
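    The training loop described above can be sketched in a few lines. This is a minimal illustration with a single logistic neuron and made-up toy data, not anything from the paper; all names are mine.

```python
import math

def predict(weights, x):
    """Weighted sum passed through a sigmoid -> probability it's a 'cat'."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, lr=0.5, epochs=200):
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for x, label in samples:
            p = predict(weights, x)            # decision from current weights
            error = label - p                  # check against the known answer
            weights = [w + lr * error * xi     # update the weights accordingly
                       for w, xi in zip(weights, x)]
    return weights

# Toy "cat" / "not cat" inputs with known answers.
samples = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
w = train(samples)
print(predict(w, [1.0, 0.0]) > 0.5)  # True
print(predict(w, [0.0, 1.0]) > 0.5)  # False
```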

    There is then an "inference" phase, in which you take the trained neural network with the perfected weights and put it into production. Now it takes in real inputs for which there is no known answer. The weights do not change at this point. This is the phase the TPU accelerates. The accelerator takes advantage of a few key aspects of the inference process.

    1. The weights don't change, so you can keep them in read-only memory.
    2. NN inference tolerates reduced precision, so you don't need very precise numbers. The TPU uses 8-bit integers for its computation, which is much less expensive than single-precision floating point.
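    To make point 2 concrete, here's a sketch of how float weights get mapped to 8-bit integers. I'm assuming a simple symmetric scaling scheme for illustration; the paper's actual quantization details may differ.

```python
def quantize(weights):
    """Map float weights into signed 8-bit range [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the 8-bit values."""
    return [qi * scale for qi in q]

weights = [0.5, -1.27, 0.03]
q, scale = quantize(weights)
print(q)  # [50, -127, 3]
```

    The 8-bit values are what the MACs actually multiply; the scale factor is applied once at the end, which is where the hardware savings come from.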

    The other key finding about their workload is that inference is often latency sensitive and not throughput sensitive. Developers do care about how long it takes to get a response for a single input, not just the raw number of inputs processed per unit time. This kind of makes sense for a web application.

    The accelerator design is essentially a systolic array performing matrix multiply. The array has a bunch of 8-bit multiply-accumulate units (MACs). Data flows in from the left and weights flow in from the top. The chart on page 3 has more detail.
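    Here's a toy cycle-by-cycle model of the idea: weights sit in the grid, inputs stream through with a one-cycle skew per row, and each diagonal wavefront of MAC operations completes per cycle. This is my own simplified simulation, not the TPU's actual microarchitecture.

```python
def systolic_matmul(A, B):
    """Compute A @ B (A is m x k, B is k x n) as skewed wavefronts of MACs."""
    m, k, n = len(A), len(B), len(B[0])
    acc = [[0] * n for _ in range(m)]
    for t in range(m + k - 1):                    # one wavefront per cycle
        for i in range(m):
            j = t - i                             # row i's input is skewed i cycles
            if 0 <= j < k:
                a = A[i][j]                       # operand entering from the left
                for col in range(n):
                    acc[i][col] += a * B[j][col]  # MAC at PE (j, col)
    return acc

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))  # [[19, 22], [43, 50]]
```

    The nice property is that each MAC only ever talks to its neighbors, so there's no big shared bus or cache hierarchy in the inner loop.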

    As you might expect, they get much better performance/watt than even a top-of-the-line Nvidia GPU or Intel CPU.

    • irokie, Apr 06 at 3:29 AM

      Great digest, very much appreciated!

  • bluedepth, Apr 05 at 3:42 PM

    Slowly approaching the showdown between organic and inorganic intelligences. Thanks for the writeup @zhemao!
