Audio-visual Korean isolated word database - Downloads

Introduction

Audio-visual speech recognition has received attention because it can improve recognition performance by exploiting the complementary nature of audio and visual speech. Audio-visual speech databases are crucial for developing and evaluating such recognition systems. Since few databases are both publicly available and suitable for our research, we collected our own database, which we release here to the research community.

Description

Speakers

In total, 56 speakers (39 males and 17 females) participated in collecting the database. All are native Koreans, aged between 20 and 40.

Vocabulary

The database is composed of two parts: DIGIT and CITY. The DIGIT part contains eleven Korean digits (including two versions of zero) and the CITY part contains sixteen famous Korean city names. The following tables show the vocabulary of each part.

DIGIT

No.	1	2	3	4	5	6	7	8	9	0	0'
Transcription	il	i	sam	sa	o	yuk	chil	pal	gu	gong	young
In Korean	일	이	삼	사	오	육	칠	팔	구	공	영

CITY

No.	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
Transcription	seoul	daejeon	daegu	busan	ulsan	incheon	gwangju	jeonju	gyoungju	jeju	gangneung	chuncheon	suwon	chungju	namwon	gongju
In Korean	서울	대전	대구	부산	울산	인천	광주	전주	경주	제주	강릉	춘천	수원	청주	남원	공주
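
For classification experiments, the two vocabularies can be turned into label maps. The following is a minimal sketch in Python: the word lists are copied from the tables above, but the variable names, the helper `build_label_map`, and the zero-based label numbering are illustrative assumptions and are not part of the database.

```python
# Transcriptions copied from the DIGIT and CITY tables, in table order.
DIGIT = ["il", "i", "sam", "sa", "o", "yuk", "chil", "pal", "gu", "gong", "young"]
CITY = [
    "seoul", "daejeon", "daegu", "busan", "ulsan", "incheon", "gwangju", "jeonju",
    "gyoungju", "jeju", "gangneung", "chuncheon", "suwon", "chungju", "namwon", "gongju",
]


def build_label_map(words):
    """Map each transcription to a zero-based class index (hypothetical convention)."""
    return {word: index for index, word in enumerate(words)}


DIGIT_LABELS = build_label_map(DIGIT)
CITY_LABELS = build_label_map(CITY)
```

The two versions of zero ("gong" and "young") receive separate labels here. Whether to merge them into one class depends on the experiment.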

Recording environment

For the recording session of each speaker, a movie camera mounted on a tripod and a microphone placed in front of the speaker were used to record video and audio data, respectively. The camera was focused on the face region around the speaker's lips. No special markers were placed on the speaker's face. The lighting conditions differed between sessions, since no professionally controlled lighting system was employed.

Data format

Each utterance is stored in a video file. The audio track is sampled at 32 kHz in 16-bit mono PCM format. The video track consists of 720x480 color image sequences at a frame rate of 30 Hz.
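
When audio and visual features are aligned, it helps to know how many audio samples fall within one video frame. The sketch below works this out from the rates stated above (32 kHz audio, 30 Hz video). It assumes that both tracks start at the same instant and run at exactly their nominal rates, which this page does not guarantee. The function name and the 3-bytes-per-pixel figure for uncompressed color frames are also assumptions.

```python
from fractions import Fraction

AUDIO_RATE_HZ = 32_000       # audio sampling rate stated above
AUDIO_BYTES_PER_SAMPLE = 2   # 16-bit PCM, mono
VIDEO_FPS = 30               # video frame rate stated above
FRAME_WIDTH, FRAME_HEIGHT = 720, 480

# 32000 / 30 is not an integer, so keep it exact as a fraction (3200/3).
SAMPLES_PER_FRAME = Fraction(AUDIO_RATE_HZ, VIDEO_FPS)

# Data rates of the uncompressed signals.
AUDIO_BYTES_PER_SECOND = AUDIO_RATE_HZ * AUDIO_BYTES_PER_SAMPLE
# Assumption: 3 bytes per pixel (24-bit color) once a frame is decoded.
RAW_FRAME_BYTES = FRAME_WIDTH * FRAME_HEIGHT * 3


def frame_to_sample_range(frame_index):
    """Return the half-open range [start, end) of audio samples covering one frame.

    Integer arithmetic avoids accumulating rounding error over long clips.
    """
    start = (frame_index * AUDIO_RATE_HZ) // VIDEO_FPS
    end = ((frame_index + 1) * AUDIO_RATE_HZ) // VIDEO_FPS
    return start, end
```

Because one frame spans 1066 and 2/3 samples, consecutive frames cover 1066 or 1067 samples each, and every three frames cover exactly 3200 samples.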

License note

Permission is hereby granted, without written agreement and without license or royalty fees, to use the data provided and its documentation for research purposes only. The data may not be used for any commercial purposes.

Users of the dataset may not distribute the dataset or portions thereof in any way, with the exception of using small portions of data for the exclusive purpose of clarifying academic publications or presentations.

In no event shall the administrators of the data be liable to any party for direct, indirect, special, incidental, or consequential damages arising out of the use of the data and its documentation.

The administrators specifically disclaim any warranties. The data provided is on an "as is" basis and the administrators have no obligation to provide maintenance, support, updates, enhancements, or modifications.

The use of the data is conditional on users explicitly and clearly citing and acknowledging the following publications in their own publications:

J.-S. Lee and C. H. Park, "Robust audio-visual speech recognition based on late integration," IEEE Transactions on Multimedia, vol. 10, no. 5, pp. 767-779, Aug. 2008
J.-S. Lee and C. H. Park, "Hybrid simulated annealing and its application to optimization of hidden Markov models for visual speech recognition," IEEE Transactions on Systems, Man, and Cybernetics: Part B, vol. 40, no. 4, pp. 1188-1196, Aug. 2010

Download

The files are too big to be made available here. Please send an email to jong-seok.lee AT yonsei DOT ac DOT kr.