Massive Data Computing Research Center (MDC)
Editor: 贾岩  Updated: 2016-03-03



Big data has received extensive attention from academia and industry. However, big data computing is still in its infancy, and many research directions remain unexplored. The main reason is that the large scale, high speed, diversity, and value sparsity of big data pose serious challenges to big data computing.

(1) Deep Understanding of Big Data: Deep understanding of big data is the basis of big data computing, the purpose of big data analysis, and the bridge between big data computing theory and big data applications. However, the massiveness and value sparsity of big data make such understanding increasingly difficult. First, current acquisition and fusion methods cannot cope with the large scale, high speed, diversity, and value sparsity of big data, so acquisition and fusion methods designed for big data are lacking. Second, existing data representation models struggle to accommodate the diversity and high speed of big data, and there is no unified representation model for big data. Third, current metrics for big data are qualitative and cannot quantify its characteristics, so a quantitative metric model is lacking. Finally, because of the massiveness and value sparsity of big data, current knowledge extraction methods have difficulty obtaining useful knowledge from it, and semantic extraction methods for big data are lacking. In general, the theory and methods for deep understanding of big data are still missing in four areas: big data acquisition and fusion, unified representation of big data, big data measurement models, and big data knowledge extraction methods.

(2) Big Data Computing Theory and Algorithms: Big data grows faster than computing and storage performance. Therefore, under resource constraints such as CPU, memory, storage, and energy consumption, determining whether a big data computing task is feasible becomes a basic challenge for big data computing. Although the correctness of results in traditional computability theory, complexity theory, and algorithm theory does not depend on the size of the data, the large scale and high speed of big data impose stricter requirements on running time and space. In addition, communication, energy, and response time have become new computational bottlenecks. Traditional complexity theory is therefore insufficient to describe and guide research and practice in the big data environment, because it lacks a theory of complexity and computability for big data. Furthermore, traditional optimization problems are computationally intensive, but in the big data environment they become data-intensive problems. Any algorithm whose complexity significantly exceeds linear is unacceptable for large-scale input data. At present, algorithm design and analysis theories for the data-intensive characteristics of big data are still in their infancy; a computing model for big data and novel sublinear, approximate, and data-stream algorithms are lacking.
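To make the idea of approximate, data-stream algorithms concrete, the following is a minimal Python sketch of reservoir sampling, a standard textbook technique and not a method attributed to the center. It reads the stream once and keeps only k items in memory, so its space cost does not grow with the data size. The function names `reservoir_sample` and `approximate_mean` and the `seed` parameter are illustrative assumptions.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    The stream is read once and only k items are kept in memory,
    regardless of how long the stream is.
    """
    rng = random.Random(seed)  # fixed seed only to make runs repeatable
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def approximate_mean(stream: Iterable[float], k: int, seed: int = 0) -> float:
    """Estimate the mean of a numeric stream from a k-item sample."""
    sample = reservoir_sample(stream, k, seed)
    if not sample:
        raise ValueError("cannot estimate the mean of an empty stream")
    return sum(sample) / len(sample)
```

The estimate trades exactness for bounded memory; how large k must be for a given error tolerance depends on the data distribution and is not addressed by this sketch.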

(3) Big Data Collection and Transmission: Existing content matching technology can accurately detect duplicate content on small-scale data. However, its high time and space complexity cannot meet the requirements of the big data environment, which makes duplicate detection one of the key challenges for large-scale data collection. Current multimedia data collection methods focus mainly on accumulating volumes of media data and do not fully consider data repetitiveness, synchronization relationships, or the internal correlation between data. The result is high data redundancy, loose distribution, and poor correlation, which makes the rules hidden in the data difficult to mine. The main challenge for data transmission lies in the huge scale of the data, which, together with the diversity of heterogeneous data transmission services, puts pressure on computing resources, transmission resources, performance, and transmission infrastructure. First, traditional transmission control protocols and traffic engineering technologies are designed mainly for an unreliable, stateless network layer, whereas big data transmission requires a highly reliable and controllable environment. This mismatch between protocol and environment makes ultra-high-speed big data transmission difficult to achieve. Second, traditional network infrastructure and data center systems offer high versatility at the expense of specialized transmission service capabilities, which makes it challenging to handle heterogeneous multi-service data on current network infrastructure.

(4) Big Data Cleaning: The theoretical foundation for multimodal data integration is weak, so research on multimodal data integration technologies lacks guidance. In addition, it is necessary to characterize the interactions among data quality characteristics (consistency, completeness, accuracy, currency, and entity resolution), and to establish comprehensive data quality characteristics, data quality models, evaluation algorithms, and theory and algorithms for error detection and repair.

(5) Big Data Analysis and Mining: Models and theory for the analysis and mining of structured and semi-structured big data are still lacking. A complete data reduction model based on the Less-is-More idea, an energy consumption model for big data analysis and mining, and an architecture for big data analysis and mining systems on new hardware architectures have not yet been established. The common operations required by big data analysis and mining algorithms are not clear, and a correct and complete theoretical system is missing. The micro-programming and macro-programming models for big data analysis and mining based on basic building blocks are still immature and difficult for users to use.

2. Overview

Massive data computing and database theory is one of the fundamental subjects of computer science. This field covers the theory and techniques of massive data acquisition, transmission, quality control, storage, management, analysis, and mining.

The faculty members of this field in the school focus on the theory and techniques of massive data computing and databases. The objective is to acquire data from the physical world, manage the massive data and discover knowledge from it efficiently and effectively, and feed the discovered knowledge back to the physical world. The research achievements include parallel databases and data warehouses, compressed databases and data warehouses, massive XML and graph data management, uncertain graph mining, wireless sensor networks, and data cleaning theory and algorithms.

There are 4 professors, 6 associate professors, and 2 lecturers working in this field, among whom 4 are doctoral supervisors and 10 are master's supervisors. Currently, more than 20 doctoral students and 40 master's students are studying in this field of the school. Leading professors include Prof. Jianzhong Li and Prof. Hong GAO.

Over the past four years (2011-2015), the faculty members of this field in the school have been awarded more than 20 research projects, of which 10 are funded by the National Natural Science Foundation of China (NSFC), 1 by the National Program on Key Basic Research Projects (973 Program), and 1 by the National Sci-Tech Support Plan. The accomplishments include 1 Second Prize at the national level and 1 First Prize at the provincial level. More than 500 research papers have been published in international and domestic academic journals and conferences, including more than 70 papers in top journals and conferences. 4 papers have won best paper awards at international conferences.

45 doctoral students and more than 300 master's students have graduated from this research field of the school. Among them, 3 doctoral students have won CCF outstanding dissertation awards. Outstanding alumni include Dr. Jianzhong Li, now the dean of the Department of Computer Science and Technology, Heilongjiang University, and Dr. Hong GAO, an assistant professor at the University of New York, Buffalo, USA.

The detailed information of the research work in this field can be found in


3. Research Topics

  • Parallel data management: Storing and processing massive data on shared-nothing clusters to increase scalability and efficiency.

  • Compressed data management: Compressing data to reduce the number of I/O operations and speed up processing.

  • Data cleaning: Detecting and correcting errors in data.

  • Wireless sensor networks: Collecting, transforming, and managing data within wireless sensor networks.

  • Large graph management: Managing and processing queries on very large graphs.

  • Web data management: Harvesting data from massive numbers of data sources on the web and processing queries over it.

  • Information integration: Integrating information from autonomous and heterogeneous data sources.

  • Uncertain graph mining: Discovering knowledge from uncertain graphs.
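As a small illustration of the compressed data management topic in the list above, the following Python sketch uses run-length encoding, a generic textbook scheme chosen here as an assumption and not the group's actual compression method. It shows how a simple count query can be answered directly on the compressed form, touching one entry per run instead of one entry per row. The names `rle_encode`, `count_equal`, and `rle_decode` are hypothetical.

```python
from itertools import groupby
from typing import Hashable, Iterable, List, Tuple

Run = Tuple[Hashable, int]  # (value, number of consecutive repetitions)


def rle_encode(column: Iterable[Hashable]) -> List[Run]:
    """Compress a column into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(column)]


def count_equal(runs: List[Run], target: Hashable) -> int:
    """Count rows equal to target without decompressing the column."""
    return sum(length for value, length in runs if value == target)


def rle_decode(runs: List[Run]) -> List[Hashable]:
    """Expand the runs back into the original column."""
    return [value for value, length in runs for _ in range(length)]


column = ["a", "a", "a", "b", "b", "a"]
runs = rle_encode(column)
matches = count_equal(runs, "a")
```

Run-length encoding only pays off when the column contains long runs of repeated values, for example after sorting; other data would call for a different scheme.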

4. The Faculty

Prof. Jianzhong Li

Personal website:

He is a professor and the director of the Department of Computer Science and Engineering at HIT. He was a visiting scholar at the University of California, Berkeley, USA, in 1985. From 1986 to 1987 and from 1992 to 1993, he was a staff scientist in the Information Research Group at Lawrence Berkeley National Laboratory, Berkeley, USA. He was also a visiting professor at the University of Minnesota, Minneapolis, Minnesota, USA, from 1991 to 1992 and from 1998 to 1999.

His current research interests include database management systems, data warehousing and data mining, sensor networks, and data-intensive supercomputing.

He has authored three books, on parallel database systems, principles of database systems, and digital libraries, and has published more than 200 technical papers in peer-reviewed journals and conference proceedings, such as VLDB Journal, Algorithmica, IEEE Transactions on Knowledge and Data Engineering, Parallel and Distributed Database, ACM SIGMOD, VLDB, ICDE, INFOCOM, and ICDCS, in the areas of databases, data warehousing, data mining, sensor networks, and data-intensive supercomputing.

He has delivered a number of invited presentations and participated in panel discussions on these topics. His professional activities include serving on the program committees of many conferences, such as VLDB, ICDE, INFOCOM, ICDCS, and WWW. He is an associate editor of Knowledge and Data Engineering and has refereed papers for various journals and proceedings.

Prof. Hong GAO

Personal website: http

She is a professor and a doctoral supervisor at HIT, a deputy director of the Database Specialty Committee of China Computer Academy, and a member of the executive editorial board of Journal of Software. She has served several times as chair of international academic conferences and as program committee chair. She has long been engaged in research on massive data computation and quality management, wireless sensor networks, and graph data management and computation.

She has led 10 projects, funded as National Natural Science Foundation Key Projects, National 973 Projects, and National Natural Science Foundation Projects, and has obtained a series of basic research results.

She has published over 100 papers in important domestic and international journals and conferences, including SIGMOD, VLDB, INFOCOM, IEEE Transactions on Knowledge and Data Engineering, and IEEE Transactions on Parallel and Distributed Systems, with more than 30 papers in first-class international journals and conferences. Her papers have been cited more than 2000 times by others. She has applied her basic research results in practice, developing China's first cluster parallel database system, parallel data warehouse system, and wireless sensor network data management system, which have achieved good economic and social benefits.

She has won 1 Second Prize of National Science and Technology Progress Award, 1 First Prize of Ministerial and Provincial Natural Science Award, 1 National Excellent Journal Paper Award and many other awards.

Other researchers

  • Prof. Hongzhi Wang, working on data quality, big data management and analysis;

  • Associate Prof. Zhaonian Zou, working on uncertain graph mining;

  • Associate Prof. Siyao Cheng, working on wireless sensor networks;

  • Associate Prof. Yan Zhang, working on data quality;

  • Associate Prof. Shengfei Shi, working on wireless sensor networks;

  • Associate Prof. Jizhou Luo, working on compressed databases;

  • Associate Prof. Wei Zhang, working on graph management.

5. Selected Publications

The faculty members of this field in the school publish their findings in top journals such as VLDB Journal, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Parallel and Distributed Systems, and ACM Transactions on Database Systems, and at top conferences such as VLDB, SIGMOD, ICDE, KDD, and INFOCOM.

5.1 Selected Journal Papers

  1. Lingli Li, Jianzhong Li, Hong Gao. Rule-Based Method for Entity Resolution. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(1): 250-263.

  2. Xixian Han, Jianzhong Li, Hong Gao. Efficient Top-k Retrieval on Massive Data. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(10): 2687-2699.

  3. Jianzhong Li, Guohua Li, Hong Gao. Novel $\varepsilon$-Approximation to Data Streams in Sensor Networks. IEEE Transactions on Parallel and Distributed Systems, 2015, 26(6): 1654-1667.

  4. Hong Gao, Xiaolin Fang, Jianzhong Li, Yingshu Li. Data Collection in Multi-Application Sharing Wireless Sensor Networks. IEEE Transactions on Parallel and Distributed Systems, 2015, 26(2): 403-412.

  5. Lei Yu, Jianzhong Li, Siyao Cheng, Shuguang Xiong, Haiying Shen. Secure Continuous Aggregation in Wireless Sensor Networks. IEEE Transactions on Parallel and Distributed Systems, 2014, 25(3): 762-774.

  6. Jianzhong Li, Siyao Cheng, Hong Gao, Zhipeng Cai. Approximate Physical World Reconstruction Algorithms in Sensor Networks. IEEE Transactions on Parallel and Distributed Systems, 2014, 25(12): 3099-3110.

  7. Xixian Han, Jianzhong Li, Donghua Yang, Jinbao Wang. Efficient Skyline Computation on Big Data. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(11): 2521-2535.

  8. Cheng Sheng, Yufei Tao, Jianzhong Li. Exact and Approximate Algorithms for the Most Connected Vertex Problem. ACM Transactions on Database Systems, 2012, 37(2): 1-39.

  9. Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, Wenyuan Yu. Towards Certain Fixes with Editing Rules and Master Data. VLDB Journal, 2012, 21(2): 213-238.

10. Jianzhong Li, Zhaonian Zou, Hong Gao. Mining Frequent Subgraphs over Uncertain Graph Databases under Probabilistic Semantics. VLDB Journal, 2012, 21(6): 753-777.

5.2 Selected Top Conference Papers

  1. Jinglin Peng, Hongzhi Wang, Jianzhong Li, Hong Gao. Set-based Similarity Search for Time Series. Proceedings of the 35th ACM Special Interest Group on Management of Data (SIGMOD 2016), 2016.6.26-2016.7.1, San Francisco USA.

  2. Zeyu Li, Hongzhi Wang, Wei Shao, Jianzhong Li, Hong Gao. Repairing Data through Regular Expressions. Proceedings of the International Conference on Very Large Data Bases Endowment (PVLDB 9(5)), 2015.8.31-2015.9.4, Kohala, Hawaii USA, 432-443.

  3. Jing Gao, Jianzhong Li, Zhipeng Cai, Hong Gao. Composite Event Coverage in Wireless Sensor Networks with Heterogeneous Sensors. Proceedings of the 34th IEEE Conference on Computer Communications (INFOCOM 2015), 2015.4.26-2015.5.1, Hong Kong, China, 217-225.

  4. Siyao Cheng, Zhipeng Cai, Jianzhong Li, Xiaolin Fang. Drawing dominant dataset from big sensory data in wireless sensor networks. Proceedings of the 34th IEEE Conference on Computer Communications (INFOCOM 2015), 2015.4.26-2015.5.1, Hong Kong China, 531-539.

  5. Yajun Yang, Hong Gao, Jeffrey Xu Yu, Jianzhong Li. Finding the Cost-Optimal Path with Time Constraint over Time-Dependent Graphs. The Proceedings of the International Conference on Very Large Data Bases Endowment (PVLDB 7(9)), 2013.8.26-2013.8.30, Trento Italy, 673-684.

  6. Xiaolin Fang, Hong Gao, Jianzhong Li, Yingshu Li. Approximate multiple count in Wireless Sensor Networks. Proceedings of the 33rd IEEE Conference on Computer Communications (INFOCOM 2014), 2014.4.27-2014.5.2, Toronto Canada, 1474-1482.

  7. Zhao Sun, Hongzhi Wang, Haixun Wang, Bin Shao, Jianzhong Li. Efficient Subgraph Matching on Billion Node Graphs. The Proceedings of the International Conference on Very Large Data Bases Endowment (PVLDB 5(9)), 2012.8.27-2012.8.31, Istanbul Turkey, 788-799.

  8. Jinbao Wang, Sai Wu, Hong Gao, Jianzhong Li, Beng Chin Ooi. Indexing Multi-Dimensional Data in a Cloud System. Proceedings of the 29th ACM Special Interest Group on Management of Data (SIGMOD 2010), 2010.6.6-2010.6.11, Indianapolis USA, 591-602.

  9. Zhaonian Zou, Jianzhong Li, Hong Gao, Shuo Zhang. Finding top-k maximal cliques in an uncertain graph. Proceedings of 26th IEEE International Conference on Data Engineering (ICDE 2010), 2010.3.1-2010.3.6, Long Beach USA, 649-652.

10. Zhaonian Zou, Hong Gao, Jianzhong Li. Discovering frequent subgraphs over uncertain graph databases under probabilistic semantics. Proceedings of 16th ACM Knowledge Discovery and Data Mining (KDD 2010), 2010.7.25-2010.7.28, Washington USA, 633-642.

6. Selected Research Projects

  1. National Program on Key Basic Research Project (973 Program), Research on the Theory and Key Techniques for Massive Information Usability (Grant No. 2012CB316200). $5M, 2012-2016.

    This project addresses the usability of massive information, motivated by the fact that low-quality data leads to incorrect computation results and can be disastrous in applications. Its objectives are evaluating the usability of massive data, cleaning massive data, computing on low-usability massive data with quality assurance, and managing low-usability knowledge.

  2. National Natural Science Foundation of China (NSFC), Key Project. Research on Theory and Key Technologies for CPS (Grant No. 61033015). $0.5M, 2011-2015.

    This project focuses on data harvesting, connection, knowledge discovery, and data-based control in CPS. It is motivated by the wide application of CPS and the strong requirements of the Internet of Things (IoT). The objective is to build a theoretical foundation for CPS and to develop complete and effective techniques for CPS with applications.

  3. National Natural Science Foundation of China (NSFC), Outstanding Young Investigator Award. Research on Parallel Computing Environment and Parallel Database Systems (Grant No. 69525203). $0.2M, 1995-1997.

    This project aims to build a high-performance parallel computing environment and parallel databases, motivated by the requirements of high-performance data-intensive computing. It adopts a shared-nothing cluster to construct a low-cost, high-performance computation system, and builds an efficient database on the cluster with efficient query processing and optimization techniques.

7. Selected Awards

  1. National Scientific and Technological Progress Award, second-class prize. Parallel Database Systems based on Clusters. Prof. Jianzhong Li, Prof. Hong Gao. 2004.

    This award is for novel techniques and applications of parallel database systems based on clusters. The techniques include the hardware design of a shared-nothing cluster system, efficient data partitioning and storage methods, and efficient query processing techniques. The parallel database system has been applied in the fields of finance, network security, and tax administration.

  2. Heilongjiang Province Natural Science Award, first-class prize. Theory and Algorithms of Massive Data Management. Prof. Jianzhong Li, Prof. Wenfei Fan, Prof. Hong Gao, Prof. Hongzhi Wang. 2011.

    This award is for work on the theory and algorithms of massive data management, including compressed data management, parallel data management, and massive data quality management. The contributions cover various approaches to efficient massive data processing, including parallel query processing techniques on shared-nothing clusters, query processing techniques that operate on compressed data without decompression, and efficient rule-based data cleaning.

  3. Heilongjiang Province Scientific and Technological Progress Award, first-class prize. Research on Parallel Database Systems. Prof. Jianzhong Li. 2000.

    This prize is for the theory, algorithms, systems and applications for parallel database systems. The contributions include the novel data partition strategy with efficient support of query processing, parallel algorithms for basic data operators and query optimization algorithms for parallel query processing.

8. Social Contribution

The massive data computation group is one of the best-known massive data computation groups in China. Prof. Jianzhong Li is the chair of ACM SIGMOD China and of the CCF committee of Internet of Things. Prof. Hong Gao is the vice chair of the CCF committee of database. The group pays great attention to applying its research results. Representative applications in recent years are as follows.

  1. Cluster and Parallel Database Systems

    Both the hardware and the software of the cluster and parallel database system are covered by independently owned intellectual property rights. The system has been applied in more than 20 tax information systems, at ICBC (Industrial and Commercial Bank of China), and at the National Audit Office of China. It received the National Prize for Progress in Science and Technology of China.

  2. Wireless Sensor Networks

    We designed a wireless sensor network system comprising sensors, a data collection system, and a thin-network computation system. The system has been applied in intelligent agriculture, covering more than 8,000 acres of blueberry fields in more than 5 cities.