Date: 2:00pm, March 3, 2017
Location: MUDD 327
Speaker: Behnaz Arzani, PhD Candidate, University of Pennsylvania
Abstract: Components within a data center and the Internet can fail. When failures occur, users of the network (clients) do not have access to the various components of this complex distributed system to determine the cause of failures. In this talk, I present my work on endpoint diagnosis, where the aim is to provide tools to help clients find the cause of these failures. The proposed solution is based on a two-step approach in which the endpoint identifies the entity responsible for failures without requiring any support from the network or the remote end hosts. If the network is determined to be the cause of the failure, a second step is triggered to identify the device responsible for the failure. To achieve this goal, we tackle the research challenge of inferring the cause of data center failures using only TCP statistics collected at one of the endpoints. To validate this approach, we have developed two monitoring tools, NetPoirot and Vigil. NetPoirot detects the right entity (storage, compute, or network) to assist in the failure resolution process. Vigil further closes the network diagnostics gap by pinpointing the specific network entity that causes the failure. Our results on a large production data center show that NetPoirot and Vigil can effectively identify faults in a data center with low overhead while providing accuracies as high as 90%.