NO.176 AIOps – Best Practices and Future Trends
October 12 - 15, 2020 (Check-in: October 11, 2020)
- Ahmed E. Hassan
- Queen’s University, Canada
- Yasutaka Kamei
- Kyushu University, Japan
- Zhen Ming (Jack) Jiang
- York University, Canada
Cloud computing is now ubiquitous. IDG reports that 73% of enterprises leverage the cloud in their IT infrastructure. On the one hand, cloud computing provides benefits like lower infrastructure costs and high elasticity. On the other hand, failures in the cloud are difficult to detect and can be quite costly when they occur. Because cloud deployments are so large, such failures are estimated to cost $700 billion yearly.
Since failures can happen at different levels (hardware, operating system, container or VM, and application level), different probes are installed to ensure the Quality of Service (QoS) for cloud offerings. Some probes passively collect resource usage data (e.g., CPU and memory) or performance measures (e.g., response time and throughput), whereas others proactively check the health of the system by periodically performing heartbeats or sanity checks. The raw monitoring data can be extremely large (in the range of tens of terabytes on a daily basis).
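As a minimal sketch of the proactive kind of probe described above, the following Python snippet tracks heartbeats per node and flags nodes that have been silent for too long. The class name `HeartbeatMonitor`, the node names, and the 30-second timeout are all hypothetical choices made for illustration; they are not taken from any particular monitoring tool.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat time per node and flags silent nodes."""
    timeout_s: float                      # hypothetical silence threshold, in seconds
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        # Remember when this node last reported in.
        self.last_seen[node] = now

    def silent_nodes(self, now: float) -> list:
        # A node is suspect when its last heartbeat is older than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


# Illustrative usage with made-up timestamps (seconds since some start time).
monitor = HeartbeatMonitor(timeout_s=30.0)
monitor.record("node-a", now=0.0)
monitor.record("node-b", now=0.0)
monitor.record("node-a", now=25.0)   # node-a reports again, node-b stays silent
suspects = monitor.silent_nodes(now=40.0)
```

Here only `node-b` is flagged at time 40, since `node-a` reported 15 seconds earlier. A real probe would also need to handle clock skew and network delays, which this sketch ignores.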
Transforming such monitoring data into actionable insights is very challenging due to its size and complexity. For example, parsing the recorded monitoring data is challenging, as the data format is highly tool-dependent. Furthermore, studies show that there can be many redundancies in the monitoring data. Detecting and diagnosing problems usually requires correlating various data sources (e.g., different nodes or probes), and the signal-to-noise ratio can be very low.
To cope with this challenge, AIOps (Artificial Intelligence for IT Operations) leverages data analytics and machine learning (ML) techniques to help DevOps engineers improve the quality of computing platforms in a cost-effective manner. There have been quite a few recent research efforts in this area from different research communities (e.g., data mining, networking, software engineering [10, 11], and computer systems [7, 14]) as well as from industry (e.g., Alibaba, BlackBerry, IBM, Facebook, and Microsoft).
Even with all of the above work, there is no central venue that brings cross-disciplinary researchers and practitioners together. There is therefore a pressing need to build a community around this research area. The proposed Shonan seminar will invite researchers and practitioners in multiple disciplines (e.g., networking, computer systems, computer architecture, software engineering, data mining, and machine learning) from around the world to discuss the current best practices and future trends of AIOps. Each of our potential invitees has conducted research or has extensive industrial experience in this area. All three organizers have prior experience in organizing successful Shonan seminars [18, 19, 21], workshops, and research summits. We envision this seminar as the first step towards building this cross-disciplinary research subject, which is very important in both research and practice.
Emerging Challenges to be Discussed in the Meeting
In addition to building high-performance ML models within an AIOps solution, there are several emerging challenges that we plan to discuss during our NII Shonan meeting:
- How do we make AIOps solutions Trustable? AIOps solutions must incorporate years of field-tested, engineer-trusted domain expertise into their ML models, instead of simply employing sophisticated models on raw data.
- How do we make AIOps solutions Interpretable? AIOps solutions need to be interpretable, even at the cost of lower performance. Interpretable models enable DevOps engineers to reason about model recommendations, to gain upper management support for following such recommendations, and, more importantly, to improve the status quo (e.g., by improving and optimizing their monitoring solutions).
- How do we make AIOps solutions Maintainable? AIOps solutions need to require minimal maintenance and fine-tuning, since DevOps engineers are usually not ML experts, and ML experts are already overcommitted on many company-wide ML initiatives.
- How do we make AIOps solutions Scalable? AIOps solutions need to be scalable and efficient as they must analyze the monitoring data from thousands to millions of nodes and react to changes in microseconds.
- How do we evaluate AIOps solutions In-context? AIOps solutions must be evaluated in a context that resembles their actual production usage. Traditional ML evaluation techniques (e.g., cross-validation) are rarely applicable, as they do not account for the real-life peculiarities of production usage.
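The last point can be illustrated with a small sketch. The choice of temporal ordering as the real-life peculiarity is an assumption made here for illustration, not something stated above: a model in production only ever sees past monitoring data when it predicts future failures, whereas shuffled cross-validation can place future samples in the training set. The function name `time_ordered_split` and the sample data are hypothetical.

```python
def time_ordered_split(samples, train_fraction=0.5):
    """Split (timestamp, label) samples so that training strictly precedes testing.

    Unlike a shuffled cross-validation fold, no test sample is older than
    any training sample, which mimics how a deployed model is actually used.
    """
    ordered = sorted(samples, key=lambda s: s[0])   # oldest first
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Made-up samples: (timestamp, observed node state), deliberately out of order.
samples = [(3, "ok"), (1, "ok"), (6, "fail"), (2, "ok"), (5, "ok"), (4, "fail")]
train, test = time_ordered_split(samples, train_fraction=0.5)

# Every training timestamp is earlier than every test timestamp.
latest_train = max(ts for ts, _ in train)
earliest_test = min(ts for ts, _ in test)
```

With these six samples, the three oldest records form the training set and the three newest form the test set. This is only one possible way to make an evaluation resemble production usage; other peculiarities would need their own treatment.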
We expect the NII Shonan meeting to have lively discussions about various emerging challenges in order to identify key issues that can be solved by academics and that are of great importance to practitioners. Furthermore, through discussions with industrial participants, researchers would be able to access valuable industrial monitoring datasets, which might not otherwise be possible. We also expect researchers to be able to identify, among the other invitees, collaborators who are suitable for the problems on which they wish to work and who may or may not come from the same research community. We expect that these collaborations will push the boundaries of AIOps research through many high-impact publications. In addition, we expect to come up with an agenda on how to design and teach courses in the area of AIOps.
Gartner projects that by 2022, 40% of global enterprises will have strategically implemented AIOps solutions to support their IT operations. Industrial participants can benefit greatly from this seminar by learning and discussing existing state-of-the-art research as well as finding suitable potential collaborators for their problems.
[1] “2018 Cloud Computing Survey,” https://www.idg.com/tools-for-marketers/2018-cloud-computing-survey/, Last accessed 04/17/2019.
[2] M. Machowinski, “How predictive maintenance can eliminate downtime,” https://technology.ihs.com/572369/businesses-losing-700-billion-a-year-to-it-downtime-says-ihs, Last accessed 04/17/2019.
[3] M. G. Gouda and T. M. McGuire, “Accelerated Heartbeat Protocols,” in Proceedings of the 18th International Conference on Distributed Computing Systems (ICDCS), 1998.
[4] G. Amvrosiadis, A. Oprea, and B. Schroeder, “Practical scrubbing: Getting to the bad sector at the right time,” in IEEE/IFIP International Conference on Dependable Systems and Networks. 2012.
[5] A. Hassan, D. Martin, P. Flora, P. Mansfield and D. Dietz, "An Industrial Case Study of Customizing Operational Profiles Using Log Compression," 2008 ACM/IEEE 30th International Conference on Software Engineering, Leipzig, 2008, pp. 713-723.
[6] V. Nair, A. Raul, S. Khanduja, V. Bahirwani, O. Shao, S. Sellamanickam, S. Keerthi, S. Herbert, and S. Dhulipalla, “Learning a Hierarchical Monitoring System for Detecting and Diagnosing Service Issues”. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 2015.
[7] J. Xue, R. Birke, L. Y. Chen and E. Smirni, "Spatial–Temporal Prediction Models for Active Ticket Managing in Data Centers," in IEEE Transactions on Network and Service Management. 2018.
[8] Pankaj Prasad and Charley Rich, “Market Guide for AIOps Platforms,” https://www.gartner.com/doc/3892967/market-guide-aiops-platforms, November 2018.
[9] N. El-Sayed, H. Zhu and B. Schroeder, "Learning from Failure Across Multiple Clusters: A Trace-Driven Approach to Understanding, Predicting, and Mitigating Job Terminations," IEEE 37th International Conference on Distributed Computing Systems (ICDCS). 2017.
[10] W. Shang, Z. M. Jiang, H. Hemmati, B. Adams, A. E. Hassan and P. Martin. "Assisting developers of Big Data Analytics Applications when deploying on Hadoop clouds," 2013 35th International Conference on Software Engineering (ICSE). 2013.
[11] Q. Lin, K. Hsieh, Y. Dang, H. Zhang, K. Sui, Y. Xu, J. Lou, C. Li, Y. Wu, R. Yao, M. Chintalapati, and D. Zhang. Predicting Node failure in cloud service systems. In Proceedings of the 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). 2018.
[12] M. Syer, W. Shang, Z. Jiang, and A. Hassan. Continuous validation of performance test workloads. Automated Software Engineering. 2017.
[13] D. Liu, Y. Zhao, K. Sui, L. Zou, D. Pei, Q. Tao, X. Chen, and D. Tan. "FOCUS: Shedding light on the high search response time in the wild", The IEEE 35th Annual IEEE International Conference on Computer Communications (INFOCOM). 2016.
[14] Karthik Nagaraj, Charles Killian, and Jennifer Neville. Structured comparative analysis of systems logs to diagnose performance problems. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI). 2012.
[15] AIOps for Big Data, from Alibaba. https://medium.com/@alitech_2017/aiops-for-big-data-from-alibabae147455f71dd. 2018.
[16] K. Veeraraghavan, J. Meza, D. Chou, W. Kim, S. Margulis, S. Michelson, R. Nishtala, D. Obenshain, D. Perelman, and Y. Jiun Song. “Kraken: leveraging live traffic tests to identify and resolve resource utilization bottlenecks in large scale web services”. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI). 2016.
[17] International Workshop on Load Testing and Benchmarking of Software Systems (LTB). http://ltb2017.eecs.yorku.ca.
[18] Shonan seminar on Software Analytics: Principles and Practice. https://shonan.nii.ac.jp/seminars/037/.
[19] Shonan seminar on Mobile App Store Analytics. https://shonan.nii.ac.jp/seminars/070/.
[20] MSR Vision 2020. https://msrcanada.wordpress.com/msrvision2020/.
[21] Shonan seminar on “Data-Driven Search-Based Software Engineering”. https://shonan.nii.ac.jp/seminars/105/