Events | News | 03.10.2025

SotaTek Contributes to ViGen: Vietnam’s First Open-Source Vietnamese Language Dataset for AI

Hanoi, October 2025 – Vietnam has unveiled the first experimental version of ViGen, an open-source Vietnamese language dataset designed to accelerate the development of artificial intelligence (AI) applications. The launch took place during Vietnam Innovation Day 2025 and was jointly announced by the National Innovation Center (NIC) under the Ministry of Planning and Investment, Meta, and the AI for Vietnam organization.

The opening ceremony gathered representatives from government agencies, research institutions, universities, technology enterprises, and international partners. The Embassy of the United States in Vietnam also took part in activities held as part of the event, underscoring the role of international cooperation in advancing Vietnam’s AI ecosystem.

Building Vietnam’s AI Ecosystem

ViGen is a central element of the country’s National Strategy on Artificial Intelligence to 2030, which sets ambitious goals for making Vietnam a regional hub in AI research and application. A key requirement of that strategy is the creation of open, large-scale

ViGen addresses this gap by providing a standardized, high-quality dataset that captures the richness of the Vietnamese language and is openly available to the AI community.

According to NIC, the experimental phase introduces several key components. Chief among them is Vi-Primer 1.0, the largest open Vietnamese pre-training dataset developed to date. Alongside this, the project unveiled five assessment frameworks comprising 4,020 curated samples, enabling objective benchmarking of AI models trained on Vietnamese. The ViGen platform also incorporates a mechanism for verified data contribution through VNeID, ensuring transparency and trust in community participation. Looking ahead, the roadmap extends through 2026–2027, with plans to expand dataset size, improve quality, and deepen integration into both academic and industrial applications.

Broad Partnership Across Sectors

The development of ViGen reflects a model of public–private international cooperation. NIC coordinated the initiative, bringing together ministries, enterprises, and universities to pool resources and expertise. Meta contributed technical know-how in large-scale dataset engineering and open-source best practices, while AI for Vietnam supported community-building and advocacy.

SotaTek, a key partner in building the open Vietnamese dataset, supported the research, development, and application of artificial intelligence around it. The company collaborated with research institutions and enterprises to ensure the dataset meets international technical standards while remaining closely aligned with Vietnamese contexts.

Other domestic partners, including universities and AI-focused startups, contributed to data collection, annotation, and evaluation. Their joint efforts strengthened the diversity of applications, from education and e-commerce to smart governance and content moderation.

The participation of the U.S. Embassy in Vietnam in the event further underscored the importance of international collaboration. This presence reflects broader U.S.–Vietnam engagement in technology and innovation, particularly following the elevation of the two countries’ relationship to a Comprehensive Strategic Partnership in 2023.

Open Data for Inclusive Innovation

As an open-source release, ViGen is expected to lower entry barriers for Vietnamese developers and startups that might otherwise lack access to costly proprietary datasets. This democratization of AI resources encourages inclusive innovation and empowers small and medium enterprises to participate in the digital economy.

Universities and research institutions will also benefit. ViGen offers a rich training ground for the next generation of AI engineers, enabling hands-on learning in natural language processing and large-model development. For businesses, it creates opportunities to develop applications in Vietnamese that are more accurate, context-aware, and culturally appropriate.

Globally, ViGen adds to the diversity of languages represented in AI research. By ensuring Vietnamese is included in the corpus of open data, the project contributes to making AI more inclusive and representative across cultures and languages.

Next Steps

With the experimental version now released, NIC and its partners will collect feedback from the developer and research communities. Updates will focus on expanding the dataset’s scale, refining quality, and aligning with international best practices.

For SotaTek, contributing to ViGen reaffirms its commitment to open innovation and to supporting Vietnam’s national digital transformation journey. By joining forces with domestic and international partners, the company aims to advance AI applications that generate both economic and social value.

As Vietnam advances its National AI Strategy, ViGen stands as an early but significant milestone. It illustrates how government, academia, enterprises, and international partners can co-develop critical digital infrastructure, infrastructure that will shape not only the country’s AI capabilities but also its standing in the global AI landscape.

About our author
SotaTek IT team
With over 1,300 talented employees, we bridge technology and business, uniting our diverse talents with a shared goal – empowering businesses worldwide to thrive with State of the Art technology.