Research Center of Fault Tolerant and Mobile Computing
Editor: Jia Yan (贾岩)  Updated: 2016-03-02

1. Mission

Nowadays, Big Data, Cloud Computing, the Internet of Things and Artificial Intelligence have set off a wave of new technologies in computer science, and China has proposed several innovation-driven development strategies. The Internet, Cloud Computing, Big Data and Artificial Intelligence are upgrading traditional industries through innovation. All of these create new demands and challenges for the processing capacity and usability of computer systems. Traditional computer systems can no longer meet these requirements, in terms of either processing capacity or performance per watt. Our research mainly focuses on new computer systems and intelligent wearable computer systems. The former includes large-scale computing systems based on new devices (such as memristors and NVM) and brain-like computer architectures based on neuromorphic devices and processors. The latter includes four parts: new fault-tolerant computers, evaluation techniques for fault-tolerant computer systems based on new types of devices, intelligent fault prediction and diagnosis, and the development of related prototype systems. Both lines of research are expected to play an important role in relevant national industries and systems.

2. Overview

Computer architecture is a set of rules and methods that describe the functionality, organization, and implementation of computer systems. Fault-tolerant computing can be defined as the process by which a computing system continues to perform its specified tasks correctly in the presence of faults with the goal of improving the dependability of the system.

The school's research in this field covers the design and evaluation of fault-tolerant computers, cloud computing testing, and wireless sensor node chips. Building on theoretical and applied computing achievements since the last century, the faculty members in this area have produced a series of milestone works, such as the measurement and evaluation of high-end fault-tolerant computers, test tools for cloud systems, and a low-power wireless sensor node chip.

There are 6 professors, 2 associate professors and 2 lecturers working in this field, of whom 6 are doctoral supervisors and 8 are master's supervisors. Currently, more than 20 doctoral students and 30 master's students study in this field. Leading professors include Prof. Decheng Zuo, Prof. Hongwei Liu and Prof. Weizhe Zhang.

From 2011 to 2015, the faculty members in this field were granted more than 20 research projects, including 4 funded by the National Natural Science Foundation of China (NSFC) and 3 funded by the National High-tech R&D Program of China (863 Program). They have published more than 100 research papers in international and domestic academic journals and conferences.

Thirty Ph.D. students and more than 500 master's students have graduated from this research field. Outstanding alumni include Dr. Dongsheng Wang and Dr. Yibo Xue, both professors at Tsinghua University.

The detailed information of the research work in this field can be found in

3. Research Topics

  • Fault tolerant computer design: Computer system architecture design covering fault tolerant computing.

  • Evaluation of computer system: Measurement and evaluation for computer performance, availability and reliability.

  • Mobile computing and wearable computing: Solving key technical problems in mobile and wearable computing.

  • Evaluation of cloud computing: Measurement and evaluation for cloud platform performance, availability and reliability.

  • Software reliability modeling: Analyzing the failure process of software and giving the reliability evaluation.

4. The Faculty

Prof. Decheng Zuo

Personal Website:

He is a professor and an assistant dean of the School of Computer Science and Technology at HIT, director of the Fault Tolerant Computing Lab, and deputy director of the Fault-Tolerance Committee of the China Computer Federation. For the past five years he has also been a member of the expert committee for the Information Field of the National High-tech R&D Program of China (863 Program).

His research focuses on computer system architecture. He works on a wide range of subjects, including parallel computing and architecture, fault-tolerant computers, and the theory and technology of computer system architecture evaluation.

He has led or participated in more than 10 research projects in fault-tolerant computing and mobile computing. He has also led projects funded by the 863 Program and by the National Natural Science Foundation of China (NSFC). He has published more than 70 papers, over 50 of which are indexed by SCI and EI. He has received a Third Prize of the National Defense Science and Technology Progress Award and an Outstanding Contribution Award for the Implementation of the National Science and Technology Plan during the 11th Five-Year Plan.

Prof. Hongwei Liu

Personal Website:

He is a professor and an assistant dean of the School of Computer Science and Technology at HIT, and director of the hardware teaching and research section. He is also a special member of the Fault-Tolerance Committee of the China Computer Federation and a member of the standing committee of the China Computer Federation's Computer System Architecture Committee.

His research covers a wide range of subjects, including parallel computing and architecture, fault-tolerant computers, resource allocation and optimization in cloud computing systems, evaluation theory and technology for cloud computing systems, mobile computing, and software reliability modeling.

He has led or participated in more than 10 research projects in fault-tolerant computing and mobile computing. He has also led projects funded by the 863 Program and by the National Natural Science Foundation of China (NSFC). He has published more than 90 papers, over 50 of which are indexed by SCI and EI. He has received a Third Prize of the National Defense Science and Technology Progress Award and a Second Prize of the Teaching and Learning Excellence Award of Harbin Institute of Technology.

Other researchers

  • Prof. Zhibo Wu, working on mobile computing and software reliability modeling;

  • Prof. Ling Wang, working on network-on-chip and wireless sensor networks;

  • Prof. Jian Dong, working on high-availability computer systems and fault-tolerant computing;

  • Associate Prof. Dongxin Wen, working on availability modeling of mass storage and mobile computing;

  • Associate Prof. Zhan Zhang, working on evaluation of cloud computing and wearable computing.

5. Selected Publications

The faculty members in this field publish their findings in leading journals such as IEEE Transactions on Computers, IEEE Transactions on Cloud Computing, Sensors and the Journal of Systems and Software, and at leading conferences such as QSIC, ICA3PP and CLUSTER.

5.1 Selected Journal Papers

  1. Jinyong Wang, Zhibo Wu, Yanjun Shu, Zhan Zhang. An Imperfect Software Debugging Model Considering Log-logistic Distribution Fault Content Function. Journal of Systems and Software, 2015, 100: 167-181.

  2. Jian Li, Haiying Zhou, Decheng Zuo, K. M. Hou, C. De Vaulx. Ubiquitous Health Monitoring and Real-time Cardiac Arrhythmias Detection: A Case Study. Bio-Medical Materials and Engineering, 2014, 24(1): 1027-1033.

  3. Jian Dong, Xiao Ren, Decheng Zuo, Hongwei Liu. An Adaptive Failure Detector Based on Quality of Service in Peer-to-Peer Networks. Sensors, 2014, 14(9): 16617-16629.

  4. Yiwei Ci, Zhan Zhang, Decheng Zuo, Zhibo Wu. A Multi-Cycle Checkpointing Protocol that Ensures Strict 1-rollback. Information Processing Letters, 2012, 112(20): 788-793.

  5. Ling Wang, Chunda Ding, Yingtao Jiang. A High Performance, Low Area Reconfiguration Controller for Network-on-chip-based Partial Dynamically Reconfigurable SoC Designs. International Journal of Electronics, 2010, 97(10): 1207-1225.

  6. Yiwei Ci, Zhan Zhang, Decheng Zuo, Xiaozong Yang. Message Fragment based Causal Message Logging. Journal of Parallel and Distributed Computing, 2009, 69(11): 915-921.

5.2 Selected Top Conference Papers

  1. Yanjun Shu, Yan Zhao, Hongwei Liu, Decheng Zuo, Xiaozong Yang. A Hybrid QoS Evaluation Tool Based on the Cloud Computing Platform. Proceedings of the 15th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2015), 2015.11.18-2015.11.20, Zhangjiajie, China, 822-835.

  2. Jinyong Wang, Zhibo Wu, Yanjun Shu, Zhan Zhang. A General Imperfect Software Debugging Model Considering the Nonlinear Process of Fault Introduction. Proceedings of the 14th International Conference on Quality Software (QSIC 2014), 2014.10.2-2014.10.3, Dallas, USA, 222-227.

  3. Jinyong Wang, Zhibo Wu, Yanjun Shu, Lixin Xue. A Study on Software Reliability Prediction based on Triple Exponential Smoothing Method. Proceedings of the 46th Summer Computer Simulation Conference (SCSC 2014), 2014.7.6-2014.7.10, Monterey, CA, USA, 440-448.

  4. Ling Wang, Zhen Wang, Yingtao Jiang. Flow Control Mechanism for Wireless Network-on-Chip. Proceedings of the 10th International Conference on Information Technology: New Generations (ITNG 2013), 2013.4.15-2013.4.17, Las Vegas, Nevada, USA, 483-488.

  5. Ling Wang, Feng Wu, Jianwen Zhang, Yingtao Jiang. Reconfigurable Global Network Local Bus (RGNLB): A Hybrid On-chip Communication Architecture for Area-efficient, Dynamically Reconfigurable SoC Designs. Proceedings of the 2009 International Conference on Embedded Systems and Applications (ESA 09), 2009.7.13-2009.7.16, Las Vegas, Nevada, USA, 211-217.

  6. Chunyan Hou, Gang Cui, Hongwei Liu, Xiaozong Yang. Reliability Analysis of Component Software based on Testing Data Transformation. Proceedings of the 8th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2009), 2009.6.1-2009.6.3, Shanghai, China, 955-960.

6. Selected Research Projects

  1. National Natural Science Foundation of China (NSFC), Study of Availability Model of High-Performance Fault-Tolerance Computers based on QoS (Grant No. 61173020). $0.01M, 2012-2015.

    This project focuses on availability testing of high-performance fault-tolerant computers. Its main theoretical achievements include: availability evaluation fault sets for transactional fault-tolerant computer systems; a fault coverage estimation model based on extreme value theory; an importance evaluation model for multi-state system components under imperfect fault coverage; a state classification method for transactional fault-tolerant computer systems; a non-consistency modular failure correlation model related to load behavior; and a fault propagation model for core software.

  2. National High-tech R&D Program of China (863 Program), Testing and Evaluating System for Cloud Computing (Grant No. 2013AA01A215). $0.89M, 2013-2015.

    This project achieved breakthroughs in technologies and methods for measuring the performance, availability, security, compatibility and standards compliance of cloud computing systems. These technologies and methods were implemented and used to analyze the core hardware/software and application service features of cloud computing systems from the perspective of key hardware/software products and cloud application services. A comprehensive evaluation infrastructure for cloud computing systems was established, covering cloud servers, cloud storage, core software and cloud service platforms, and test outlines and test instructions were drawn up. Based on these results, the project team carried out a series of tests, which have helped promote the development of domestic cloud computing systems.

  3. National High-tech R&D Program of China (863 Program), Measurement and Evaluation of High-End Fault-Tolerance Computer (Grant No. 2008AA01A204). $5M, 2008-2010.

    This project solved several key issues in the evaluation of high-end fault-tolerant computer systems and established, for the first time, a domestic evaluation system and evaluation standard for high-end fault-tolerant computers. Its main innovations are: availability testing methods based on the measurement of field-replaceable units, combined hardware-software fault injection methods, a TPC-C test method with automatic tuning and bottleneck diagnosis, and industrial application benchmarks for banking and telecommunications. Comparison tests among HP Superdome, Sun M8000, Inspur and Huawei high-end fault-tolerant computers showed that the research results and evaluation infrastructure satisfy the assessment requirements for high-end fault-tolerant computers and help promote the high-end computer industry.

  4. National High-tech R&D Program of China (863 Program), General Technology of High-End Fault-Tolerance Computer (Grant No. 2008AA01A201). $3.5M, 2008-2010.

    This project focuses on research into high-end fault-tolerant computers, including their architecture and general technologies, the drafting of relevant criteria and standards, and development strategies and policy suggestions. It supports the development of China's high-end fault-tolerant computer technology, industry, applications and market, and promotes the development of high-end fault-tolerant computers in many ways.
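The availability testing based on field-replaceable unit (FRU) measurements mentioned for the high-end fault-tolerant computer evaluation project (Grant No. 2008AA01A204) can be illustrated with textbook steady-state availability arithmetic. The Python sketch below is not the project's method: it assumes that each FRU's mean time between failures (MTBF) and mean time to repair (MTTR) have already been measured, that FRUs fail and are repaired independently, and that the system is a simple series/parallel arrangement. The class and function names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FieldReplaceableUnit:
    """Hypothetical record of measured behaviour for one FRU."""
    name: str
    mtbf_hours: float  # measured mean time between failures
    mttr_hours: float  # measured mean time to repair or replace

    def availability(self) -> float:
        # Steady-state availability of one repairable unit: MTBF / (MTBF + MTTR).
        return self.mtbf_hours / (self.mtbf_hours + self.mttr_hours)


def series_availability(units: list[FieldReplaceableUnit]) -> float:
    """The group is up only if every unit is up (independence assumed)."""
    result = 1.0
    for unit in units:
        result *= unit.availability()
    return result


def parallel_availability(units: list[FieldReplaceableUnit]) -> float:
    """The redundant group is up if at least one unit is up (independence assumed)."""
    all_down = 1.0
    for unit in units:
        all_down *= 1.0 - unit.availability()
    return 1.0 - all_down


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the project.
    psu = FieldReplaceableUnit("power supply", mtbf_hours=999.0, mttr_hours=1.0)
    board = FieldReplaceableUnit("CPU board", mtbf_hours=4999.0, mttr_hours=1.0)
    redundant_psus = parallel_availability([psu, psu])
    system = redundant_psus * board.availability()
    print(f"PSU pair: {redundant_psus:.6f}, system: {system:.6f}")
```

With these illustrative numbers, duplicating the power supply raises that group's availability from 0.999 to roughly 0.999999, so the single non-redundant board then dominates system downtime. This is the kind of conclusion FRU-level measurements make possible.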

7. Selected Awards

In 2011, the Ministry of Science and Technology of the People's Republic of China granted Prof. Decheng Zuo the Outstanding Contribution Award for the Implementation of the National Science and Technology Plan during the 11th Five-Year Plan. The award recognizes advanced individuals in carrying out the 11th Five-Year National Science and Technology Plan. As a member of the expert committee for the Information Field of the National High-tech R&D Program of China (863 Program), he also made positive contributions to building an innovative country and supporting economic and social development.

8. Social Contribution

Our research focuses on the basic theory and key technologies of reliability for information systems related to aerospace, national defense and key national industries. It has formed a distinctive research direction represented by fault-tolerant computing, and has appointed Mr. Wang Endong, chief scientist of Inspur Group and academician of the Chinese Academy of Engineering, as the academic leader of this direction. In recent years, the work has mainly focused on high-availability spaceborne and shipborne computer systems, new technologies for the reliability and availability of high-performance fault-tolerant computer systems, and availability evaluation technologies for XX systems. During this period, we have undertaken more than 20 research projects, such as “Highly Available XX Computer Technology” and “XX Computer System Evaluation Technology”, including three key projects and two key funds, with total funding of more than 80 million RMB.

The representative achievements in defense, aerospace and key national industries obtained in the past five years are described as follows:

  1. Domestically-made high availability computer

Facing the needs of national information construction, the team broke through the technical bottleneck of high-availability computer systems by developing an autonomous, controllable, high-performance and highly available computer system based on domestic processors and operating systems. It also broke through the bottleneck of real-time status monitoring for control-system computing platforms. Key technologies such as rapid system fault recovery, computer unit design based on domestic CPUs, high-reliability server design and fault-injection-based availability verification meet the system's high-performance and high-availability processing requirements. The work provides technical reserves for defense informatization and a powerful domestic hardware platform for integrated land, sea, air and space systems.

  2. Evaluation and measurement for high-end fault-tolerant computer

At the core of the evaluation cluster is an evaluation management subsystem that provides unified control. The system's goals are achieved by combining subsystems for test planning, evaluation models, test benchmark sets, hardware and software fault injection, result collection and analysis, load input and load simulation. A self-developed load simulation system for specific industry applications such as finance and telecommunications runs on the evaluation cluster, while a typical financial and telecommunications application simulation system runs on the target machine. At the same time, modified robustness test programs (Ballista, LTP, etc.), LINPACK, STREAM and Iometer run on the target machine. The entire evaluation environment uses an InfiniBand high-speed network to interconnect all devices, and an independently developed software fault injection system injects faults into the target machine. This work has reached an internationally advanced level in the evaluation and measurement of high-end fault-tolerant computers.
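The evaluation environment relies on software fault injection, and one result of the NSFC availability project is a fault coverage estimation model. As a hedged illustration of the basic idea only (not the group's injector, and not its extreme-value-theory model), the Python sketch below runs a small software-implemented fault injection campaign: it flips randomly chosen bits in a data buffer and counts how often a detection mechanism notices. The mechanism under test (an additive checksum modulo 256) and all function names are hypothetical choices made for the example.

```python
from __future__ import annotations

import random


def checksum(data: bytes) -> int:
    """Hypothetical error-detection mechanism under test: additive checksum mod 256."""
    return sum(data) % 256


def flip_bits(data: bytes, positions: list[int]) -> bytes:
    """Return a copy of data with the given bit positions inverted (the injected fault)."""
    corrupted = bytearray(data)
    for pos in positions:
        corrupted[pos // 8] ^= 1 << (pos % 8)
    return bytes(corrupted)


def run_campaign(data: bytes, bits_per_fault: int, trials: int, seed: int = 0) -> float:
    """Inject random multi-bit faults and return the fraction the checksum detects."""
    rng = random.Random(seed)  # fixed seed makes the campaign reproducible
    reference = checksum(data)
    detected = 0
    for _ in range(trials):
        # Choose distinct bit positions so no flip undoes another on the same bit.
        positions = rng.sample(range(len(data) * 8), bits_per_fault)
        if checksum(flip_bits(data, positions)) != reference:
            detected += 1
    return detected / trials


if __name__ == "__main__":
    payload = bytes(range(64))
    for bits in (1, 2):
        print(bits, "bit(s) per fault -> estimated coverage:", run_campaign(payload, bits, 2000))
```

Every single-bit flip changes the sum by a nonzero power of two modulo 256, so the single-bit campaign estimates a coverage of 1.0, whereas some double-bit flips cancel out and the estimate drops below 1.0. A coverage figure is therefore only meaningful together with the fault model used to obtain it.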

  3. Intelligent storage technology research and industrialization

The team researched and developed key technologies and equipment for storage virtualization, storage automation, disaster recovery and green storage, and built intelligent and green storage systems that scale dynamically, balance load and optimize themselves automatically. Integrated storage management functions give the system automatic and intuitive resource configuration and system management.

Disaster recovery and backup technologies ensure the security of enterprise data and business continuity. At the same time, the impact of the disaster recovery system on the performance of the production system must be considered: the disaster recovery system is kept independent of the production system, and its performance impact is reduced to a level acceptable to users. The team also researched and developed mainstream energy-saving technologies and equipment, such as high-quality power supply units, dynamic power management, efficient heat dissipation management and solid-state drives, to reduce users' energy consumption.

  4. Wearable Computer

The team mastered integrated design technology based on high-performance embedded CPUs, reaching the level of mainstream international products, and adopted an independently designed I/O processing platform that incorporates system reliability and energy-saving technologies. It completed research on a Linux-based embedded operating system with a micro-kernel and modular design, implemented operating system reliability enhancements based on checkpointing and other fault-tolerance mechanisms, and completed a Linux-based geographic information system. On this basis, display and plotting functions were implemented, together with a complete application software development platform for the wearable computer.

In the experimental system, flexible wires, soft switches and various sensors based on new textile materials were used to achieve multi-sensor fusion, and the ergonomic design of the wearable computer reached a leading level in China. A human body monitoring system was implemented that can monitor physiological indicators. Based on these key technologies, a prototype system has been completed and tested in several real applications, where it performed well.
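The wearable computer's operating system uses checkpointing, among other mechanisms, to enhance reliability. The Python sketch below illustrates only the general checkpoint/rollback idea under simple assumptions (a single process, state that fits in memory, and crashes that are detected immediately); it is not the group's kernel implementation, and the `CheckpointedTask` class and `run` function are hypothetical.

```python
from __future__ import annotations

import copy


class CheckpointedTask:
    """Hypothetical long-running computation protected by periodic checkpoints."""

    def __init__(self, interval: int) -> None:
        self.interval = interval  # number of steps between checkpoints
        self.state = {"step": 0, "total": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def take_checkpoint(self) -> None:
        # Save a private copy so later updates cannot alter the saved state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        # Restore the most recent checkpoint after a detected failure.
        self.state = copy.deepcopy(self._checkpoint)

    def step(self) -> None:
        # One unit of work: accumulate 1 + 2 + ... + step.
        self.state["step"] += 1
        self.state["total"] += self.state["step"]
        if self.state["step"] % self.interval == 0:
            self.take_checkpoint()


def run(task: CheckpointedTask, n_steps: int, fail_at: set[int]) -> int:
    """Run until n_steps; each step listed in fail_at simulates one crash, once."""
    pending = set(fail_at)
    while task.state["step"] < n_steps:
        task.step()
        if task.state["step"] in pending:
            pending.discard(task.state["step"])
            # Simulated crash: progress since the last checkpoint is lost.
            task.rollback()
    return task.state["total"]


if __name__ == "__main__":
    print(run(CheckpointedTask(interval=5), n_steps=10, fail_at={7}))
```

Because the task rolls back to the last saved state and recomputes from there, the final result equals that of a failure-free run. The checkpoint interval trades checkpointing overhead against the amount of work lost per failure.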