Chen Qiang: Strengthening Corpus Development for the Healthy Growth of Artificial Intelligence
Thu, Nov 28, 2024
A forum on artificial intelligence (AI) corpora was held for the first time in Shanghai at the 2024 World Artificial Intelligence Conference. The three basic elements of large AI models are computing power, algorithms, and corpora (data). China is the world’s most digitally diverse country, with an abundance of real-world application scenarios generating vast amounts of data, yet the quality and consistency of these corpora remain uneven.
In this context, Professor Chen Qiang from the Department of Management Science and Engineering, Tongji SEM, published an article in Guangming Daily arguing that strengthening corpus development from the input end can promote the healthy growth of AI. The original text follows.
A forum on AI corpora was recently held for the first time in Shanghai at the 2024 World Artificial Intelligence Conference. One year earlier, China’s first Corpus Data Alliance for Large Model, jointly initiated by the Shanghai Artificial Intelligence Laboratory, People’s Daily Online, the National Center for Meteorology and others, was officially established at the 2023 World Artificial Intelligence Conference. Large AI models have three basic elements: computing power, algorithms, and corpora (data). China is the world’s most digitally diverse country, with an abundance of real-world application scenarios generating vast amounts of data. This has positioned China as a major player in corpus development; however, the quality of these corpora remains uneven and needs further standardization. Specific inputs to large AI models often generate specific outputs, reflecting the law of “you reap what you sow”. Therefore, strengthening corpus development from the input end can promote the healthy growth of AI.
Corpora play two essential roles in AI development: “empowerment” and “education”. The empowerment function allows AI to acquire knowledge, make connections, and develop advanced skills through comprehensive knowledge transfer and training. The education function, on the other hand, enables AI to become more reasonable and understanding by embedding emotions and aligning values across multiple dimensions. For example, as China’s population ages, elderly care robots with healthcare functions are becoming an integral part of daily life. Increasingly, elderly people seek not only professional care but also the warmth of family-like attention. This is where the educational function of corpora should come into play: by combining corpora with specific incentive algorithms, these robots can be designed to respond to elderly care needs with an amiable attitude and attentive service. It is thus clear that as AI technologies achieve rapid and intensive breakthroughs, corpus development plays a crucial role in shaping the “body and soul” of AI. This requires accomplishing three tasks: expanding sources, improving quality, and integrating values.
Firstly, expanding sources. Corpus development is a multifaceted and systematic endeavor characterized by multiple sources, high dimensionality, and heterogeneity, as well as transboundary and transfinite features. It requires the extensive mobilization and collaboration of government departments, industry organizations, and enterprises to create a unified effort. While many regions in China have already rushed into action, there is still a need to further expand source channels and to uncover the various corpus resources hidden behind industry barriers and in untapped areas. This will help power the iteration and upgrading of large AI models.
Secondly, improving quality. Large AI models need not only ample data but also high-quality data. Industry-specific professional databases are essential to providing AI models with “premium nourishment”.
Finally, integrating values. Corpora play a subtle but powerful role in shaping AI’s “ways of thinking” and “behavioral patterns”. Therefore, steps should be taken early to accelerate the expansion of high-quality Chinese corpus resources, ensuring that they reflect the value orientation of socialist culture with Chinese characteristics.