Special-purpose chips for ML applications are gaining popularity these days. Google has an interesting article on their ASIC, the Tensor Processing Unit (TPU). The primary thing these chips seem to do well is matrix multiplication, which appears to be the bottleneck for inference with neural networks:
During the execution of this massive matrix multiply, all intermediate results are passed directly between 64K ALUs without any memory access, significantly reducing power consumption and increasing throughput. As a result, the CISC-based matrix processor design delivers an outstanding performance-per-watt ratio: TPU provides an 83X better ratio compared with contemporary CPUs and a 29X better ratio than contemporary GPUs.
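To back up a step: the reason matrix multiplication matters so much is that at inference time a neural network's layers are mostly just matmuls. Here's a minimal sketch in Python with NumPy (the layer sizes are made up for illustration) of a dense layer's forward pass:

```python
import numpy as np

# Made-up layer sizes, for illustration only.
batch, d_in, d_out = 32, 784, 256

x = np.random.randn(batch, d_in)   # input activations
W = np.random.randn(d_in, d_out)   # weights, fixed at inference time
b = np.random.randn(d_out)         # bias

# The forward pass is one big matrix multiply plus a cheap
# elementwise bias-add and activation. The matmul dominates the
# arithmetic (batch * d_in * d_out multiply-adds), which is why
# a dedicated matmul engine speeds up inference so much.
y = np.maximum(x @ W + b, 0.0)     # ReLU(x @ W + b)
```

Stack a few of these layers and virtually all of the FLOPs in a forward pass are matrix multiplies.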
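And here's a rough sketch of the dataflow idea in the quote. This is not how the TPU actually schedules work cycle by cycle; it's a toy model of a weight-stationary systolic multiply, where each partial sum is handed from cell to cell instead of round-tripping through memory:

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy model of a weight-stationary systolic array.

    The weights B stay pinned in the grid of multiply-accumulate
    cells; rows of A stream through, and the running partial sum
    is passed from cell to cell rather than written to memory.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):              # each input row streams through the array
        partial = np.zeros(m)       # partial sums live in the array, not in memory
        for p in range(k):          # one rank of MAC cells per reduction step
            # Cell rank p holds the weights B[p, :]; it multiplies the
            # incoming activation A[i, p] and adds the product to the
            # partial sum flowing past it.
            partial += A[i, p] * B[p]
        C[i] = partial              # the finished row exits at the array's edge
    return C

A = np.random.randn(4, 3)
B = np.random.randn(3, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

On real hardware all of this happens in parallel, with inputs skewed so every cell is busy every cycle; the point of the toy is just that nothing between the input and output edges of the array ever touches memory.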
The other interesting thing is that the chips are much simpler than your average CPU or GPU:
"As compared to CPUs and GPUs, the single-threaded TPU has none of the sophisticated microarchitectural features that consume transistors and energy to improve the average case but not the 99th-percentile case: no caches, branch prediction, out-of-order execution, multiprocessing, speculative prefetching, address coalescing, multithreading, context switching and so forth. Minimalism is a virtue of domain-specific processors." (p. 8)