In an effort to extend traditional human-computer interfaces, research
has introduced embodied agents that use the modalities of everyday
human-human communication, such as facial expressions, gestures and body
postures. However, giving computer agents a human-like body introduces
new challenges. Since human users are very sensitive to, and critical
of, bodily behavior, the agents must act naturally and
individually in order to be believable.
This dissertation focuses on conversational gestures. It shows how to
generate conversational gestures for an animated embodied agent based
on annotated text input. The central idea is to imitate the gestural
behavior of a human individual. TV show recordings serve as empirical
data from which key gestural parameters are extracted for the generation of
natural and individual gestures. The gesture generation task is solved
in three stages: observation, modeling and generation. For each stage,
a software module was developed.
For observation, the video annotation research tool ANVIL was
created. It allows the efficient transcription of gesture, speech and
other modalities on multiple layers. ANVIL is application-independent
because users can define their own annotation schemes; it also provides
various import/export facilities and is extensible via its plug-in
interface. Therefore, the tool is suitable for a wide variety of
research fields. For this work, selected clips of the TV talk show
"Das Literarische Quartett" were transcribed and analyzed, yielding
a total of 1,056 gestures. For the modeling stage, the NOVALIS
module was created to compute individual gesture profiles from these
transcriptions with statistical methods. A gesture profile models the
handedness, timing and function of gestures for a single human
individual using estimated conditional probabilities. The profiles are
based on a shared lexicon of 68 gestures, assembled from the
data. Finally, for generation, the NOVA generator was devised to
create gestures based on gesture profiles in an
overgenerate-and-filter approach. Annotated text input is represented as
a graph and processed in multiple stages: semantic data is added, the
locations of potential gestures are determined by heuristic rules, and
gestures are added and then filtered according to a gesture
profile. NOVA outputs a linear, player-independent action script in
XML.
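To make the idea of a gesture profile built from "estimated conditional probabilities" more concrete, here is a minimal sketch of relative-frequency estimation over transcribed gestures. It is not NOVALIS's actual model. The record type `GestureAnnotation`, the conditioning variables (a semantic tag of the co-occurring speech for lemma choice, the lemma for handedness), the lemma names and the toy annotations are all hypothetical and were invented for illustration. Timing is omitted entirely.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


# Hypothetical, simplified record for one transcribed gesture. A real
# multi-layer transcription keeps speech and gesture on separate,
# time-aligned layers; here the relevant attributes are flattened together.
@dataclass(frozen=True)
class GestureAnnotation:
    start: float       # onset in seconds (unused below: timing is not modeled)
    end: float         # offset in seconds
    lemma: str         # entry in a shared gesture lexicon (illustrative names)
    handedness: str    # "LH", "RH" or "2H" (assumed coding)
    semantic_tag: str  # assumed tag of the co-occurring speech


def estimate_conditional(pairs):
    """Relative-frequency (maximum-likelihood) estimate of P(outcome | condition)."""
    counts = defaultdict(Counter)
    for condition, outcome in pairs:
        counts[condition][outcome] += 1
    return {
        condition: {outcome: n / sum(c.values()) for outcome, n in c.items()}
        for condition, c in counts.items()
    }


def build_profile(annotations):
    """Toy profile for one speaker: P(lemma | tag) and P(handedness | lemma)."""
    annotations = list(annotations)
    return {
        "lemma_given_tag": estimate_conditional(
            (a.semantic_tag, a.lemma) for a in annotations),
        "hand_given_lemma": estimate_conditional(
            (a.lemma, a.handedness) for a in annotations),
    }


# Invented toy data, for illustration only.
toy = [
    GestureAnnotation(0.0, 0.8, "PalmUp", "RH", "question"),
    GestureAnnotation(1.2, 1.9, "PalmUp", "2H", "question"),
    GestureAnnotation(2.5, 3.1, "Wipe", "RH", "negation"),
    GestureAnnotation(4.0, 4.6, "PalmUp", "RH", "question"),
]
profile = build_profile(toy)
print(profile["hand_given_lemma"]["PalmUp"])  # RH ≈ 0.67, 2H ≈ 0.33
```

Because each speaker's transcriptions produce a separate profile, two speakers who share the same lexicon can still differ in which gestures they prefer and which hand they use.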
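The overgenerate-and-filter strategy behind NOVA can be illustrated with a minimal sketch. It rests on several assumptions that do not reproduce NOVA's actual rules or output format. It uses a pre-tagged word list in place of NOVA's graph-based representation and a single invented heuristic rule for gesture locations. Its filter is an argmax-plus-threshold step over a toy profile with made-up probabilities, and its XML element names (`script`, `word`, `gesture`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy profile; the probabilities are invented for illustration.
PROFILE = {
    "lemma_given_tag": {
        "question": {"PalmUp": 0.7, "Wipe": 0.1},
        "negation": {"Wipe": 0.6, "PalmUp": 0.2},
    },
    "hand_given_lemma": {
        "PalmUp": {"2H": 0.6, "RH": 0.4},
        "Wipe": {"RH": 0.8, "2H": 0.2},
    },
}


def candidate_locations(words):
    """Assumed heuristic rule: any word carrying a semantic tag may get a gesture."""
    return [i for i, (_, tag) in enumerate(words) if tag is not None]


def overgenerate(words, profile):
    """Propose every lemma the profile associates with each location's tag."""
    candidates = []
    for i in candidate_locations(words):
        for lemma, p in profile["lemma_given_tag"].get(words[i][1], {}).items():
            candidates.append((i, lemma, p))
    return candidates


def filter_candidates(candidates, threshold=0.3):
    """Keep the most probable candidate per location if it reaches the threshold."""
    best = {}
    for i, lemma, p in candidates:
        if p >= threshold and (i not in best or p > best[i][1]):
            best[i] = (lemma, p)
    return {i: lemma for i, (lemma, _) in sorted(best.items())}


def to_action_script(words, selected, profile):
    """Serialize words and gestures as one linear XML sequence (hypothetical tags).

    Each gesture element is emitted just before the word it is aligned with,
    using the profile's most probable handedness for that lemma.
    """
    root = ET.Element("script")
    for i, (word, _) in enumerate(words):
        if i in selected:
            hands = profile["hand_given_lemma"][selected[i]]
            ET.SubElement(root, "gesture", lemma=selected[i],
                          hand=max(hands, key=hands.get))
        ET.SubElement(root, "word").text = word
    return ET.tostring(root, encoding="unicode")


words = [("Is", None), ("this", None), ("really", "question"),
         ("good", None), ("No", "negation")]
selected = filter_candidates(overgenerate(words, PROFILE))
print(to_action_script(words, selected, PROFILE))
```

Generating every plausible candidate first keeps the placement rules simple and leaves the speaker-specific decision of which gestures survive to the profile. Swapping in a different profile changes the output without touching the rules.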
Michael Kipp studied Computer Science, Mathematics and Psychology at
Saarland University, Germany, and the University of Edinburgh,
UK. From 1997 on, he worked at the German Research Center for
Artificial Intelligence (DFKI) in fields as diverse as neural
networks, machine translation, embodied agents and virtual
theater. After completing his Doctor of Engineering degree in 2003, he
embarked on an entirely new career as a director's assistant at the
National Theater of the Saarland.
http://www.michaelkipp.de