Audio-visual speech recognition has received attention as a way to improve speech recognition performance by exploiting the complementary nature of audio and visual speech. Audio-visual speech databases are crucial for developing and evaluating such recognition systems. Since few publicly available databases were suitable for our research, we collected our own database, which is released here to the research community.
In total, 56 speakers (39 male and 17 female) took part in the recordings. All are native Koreans aged between 20 and 40.
The database is composed of two parts: DIGIT and CITY. The DIGIT part contains eleven Korean digits (including two versions of zero), and the CITY part contains the names of sixteen well-known Korean cities. The following tables show the vocabulary of each part.
DIGIT part:

No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 | 0' |
---|---|---|---|---|---|---|---|---|---|---|---|
Transcription | il | i | sam | sa | o | yuk | chil | pal | gu | gong | young |
In Korean | 일 | 이 | 삼 | 사 | 오 | 육 | 칠 | 팔 | 구 | 공 | 영 |

CITY part:

No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Transcription | seoul | daejeon | daegu | busan | ulsan | incheon | gwangju | jeonju | gyoungju | jeju | gangneung | chuncheon | suwon | chungju | namwon | gongju |
In Korean | 서울 | 대전 | 대구 | 부산 | 울산 | 인천 | 광주 | 전주 | 경주 | 제주 | 강릉 | 춘천 | 수원 | 청주 | 남원 | 공주 |

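
For users who want the vocabulary in machine-readable form, the two tables can be written down as a minimal Python sketch. The transcriptions and Korean words are copied from the tables. The names `DIGIT`, `CITY`, `LABEL_INDEX`, and `part_of` are hypothetical helpers introduced here for illustration. They are not part of the database, and the class numbering is only one possible convention.

```python
# Vocabulary of the two parts, keyed by the transcriptions used in the tables.
# All names in this block are illustrative and not shipped with the database.
DIGIT = {
    "il": "일", "i": "이", "sam": "삼", "sa": "사", "o": "오",
    "yuk": "육", "chil": "칠", "pal": "팔", "gu": "구",
    "gong": "공", "young": "영",  # the two versions of zero
}

CITY = {
    "seoul": "서울", "daejeon": "대전", "daegu": "대구", "busan": "부산",
    "ulsan": "울산", "incheon": "인천", "gwangju": "광주", "jeonju": "전주",
    "gyoungju": "경주", "jeju": "제주", "gangneung": "강릉", "chuncheon": "춘천",
    "suwon": "수원", "chungju": "청주", "namwon": "남원", "gongju": "공주",
}

# One possible class numbering: digits first, then cities (an assumption).
LABEL_INDEX = {word: idx for idx, word in enumerate(list(DIGIT) + list(CITY))}


def part_of(word: str) -> str:
    """Return which part of the database a transcription belongs to."""
    if word in DIGIT:
        return "DIGIT"
    if word in CITY:
        return "CITY"
    raise KeyError(f"unknown word: {word}")
```

With this layout, `part_of("seoul")` returns `"CITY"`, and `LABEL_INDEX` covers the 27 words of both parts.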
For each speaker's recording session, a movie camera mounted on a tripod recorded the video and a microphone placed in front of the speaker recorded the audio. The camera was focused on the face region around the speaker's lips. No special markers were placed on the speaker's face. Lighting conditions varied between sessions, since no professionally controlled lighting system was employed.
Each utterance is stored in a video file. The audio track is sampled at 32 kHz in 16-bit PCM mono format. The visual part contains 720x480 color image sequences at a frame rate of 30 frames per second.
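
Because the audio and video streams run at different rates, it can help to work out how they line up. The following is a rough sketch of that arithmetic, using only the figures stated above. It assumes an exact 30 frames per second, perfectly synchronized streams starting at the same instant, and 24-bit color (3 bytes per pixel) for the uncompressed size estimate. None of these assumptions is confirmed by the recordings, and the function name `audio_span_for_frame` is hypothetical.

```python
from fractions import Fraction

# Figures stated in the database description.
AUDIO_RATE_HZ = 32_000
AUDIO_BYTES_PER_SAMPLE = 2      # 16-bit PCM
AUDIO_CHANNELS = 1              # mono
FRAME_WIDTH, FRAME_HEIGHT = 720, 480
VIDEO_FPS = 30

# Assumption for illustration only: 24-bit color, uncompressed.
BYTES_PER_PIXEL = 3

# Audio samples per video frame: 32000 / 30 is not an integer.
samples_per_frame = Fraction(AUDIO_RATE_HZ, VIDEO_FPS)

# Uncompressed data rates in bytes per second.
audio_bytes_per_sec = AUDIO_RATE_HZ * AUDIO_BYTES_PER_SAMPLE * AUDIO_CHANNELS
video_bytes_per_sec = FRAME_WIDTH * FRAME_HEIGHT * BYTES_PER_PIXEL * VIDEO_FPS


def audio_span_for_frame(frame_index: int) -> tuple[int, int]:
    """Return the half-open range [start, end) of audio samples that
    overlaps the given video frame, using integer arithmetic so that
    consecutive frames tile the audio track without gaps or overlap."""
    start = frame_index * AUDIO_RATE_HZ // VIDEO_FPS
    end = (frame_index + 1) * AUDIO_RATE_HZ // VIDEO_FPS
    return start, end
```

Since one frame corresponds to 3200/3 audio samples, the spans alternate between 1066 and 1067 samples. Rounding each boundary down, rather than rounding a fixed window length, keeps the alignment from drifting over long utterances.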
Permission is hereby granted, without written agreement and without license or royalty fees, to use the data provided and its documentation for research purposes only. The data may not be used for any commercial purposes.
Users of the dataset may not distribute the dataset or portions thereof in any way, with the exception of using small portions of data for the exclusive purpose of clarifying academic publications or presentations.
In no event shall the administrators of the data be liable to any party for direct, indirect, special, incidental, or consequential damages arising out of the use of the data and its documentation.
The administrators specifically disclaim any warranties. The data is provided on an "as is" basis and the administrators have no obligation to provide maintenance, support, updates, enhancements, or modifications.
Use of the data is conditional on users explicitly and clearly citing and acknowledging the following publication in their publications:
The files are too large to be made available here. To obtain the database, please send an email to jong-seok.lee AT yonsei DOT ac DOT kr.