Tutorials

Monday, November 12, 09:00 - 12:30

T1: A 3-Hour Primer on Real World Autonomous Driving: Hands-on Introduction to Autoware

Ekim Yurtsever, Jacob Lambert, Yoshiki Ninomiya, Kazuya Takeda

Autonomous driving systems are already on public roads despite being at an early, experimental stage. Autonomous driving is an extensive, interdisciplinary field, and its current state would greatly benefit from the experience of researchers in areas not traditionally associated with intelligent transportation technologies. The aim of this tutorial is therefore to introduce the fundamentals of autonomous driving and to provide hands-on experience suited to a wide audience. The tutorial will outline the various sensors used, the tasks to be solved, and state-of-the-art methods. Finally, we intend to challenge the audience with a real-world task through an interactive session based around the open-source software Autoware.

Monday, November 12, 09:00 - 12:30

T2: Sparse Signal Processing: Recent Advances and Applications

Zhi Tian

The sparsity that characterizes many natural and man-made signals has been exploited over the years in a broad range of statistical inference and signal representation applications, leading to useful results on efficient reconstruction of high-dimensional signals at low sensing costs. This tutorial is motivated by the exciting resurgence of interest in the topic, which is propelled by both the latest theoretical advances and emerging applications. In the emerging era of data deluge, wireless systems such as 5G and the Internet of Things (IoT) have to be able to sense and process an unprecedentedly large amount of data in real time, which renders traditional communication and signal processing techniques inefficient or inapplicable. Meanwhile, there are exciting new developments in the theory and algorithms of sparse signal processing and compressive sensing, which offer powerful tools to deal effectively with high-dimensional signals, large-size problems, and big-volume data. In this tutorial, we will introduce basic concepts and recent results related to sparse signal processing, emphasizing recent theoretical results on structure-based compressive sensing beyond sparsity, compressive covariance sensing, and super-resolution gridless compressive sensing. We will illustrate these new concepts and techniques using various applications, such as wideband spectrum sensing in wireless cognitive radios, sparse channel estimation for massive MIMO in millimeter-wave communication systems, direction-of-arrival (DOA) estimation using large-antenna linear or rectangular arrays, and decentralized and cooperative sparse event detection in wireless sensor networks. Finally, we will discuss open research issues and future directions on this topic.

Monday, November 12, 09:00 - 12:30

T3: Voice Conversion: Challenges and Opportunities

Hemant A. Patil, Hideki Kawahara

Voice Conversion (VC) modifies the perceived speaker identity in a given speech signal from a source speaker to a particular target speaker. This tutorial will give an overview of VC and the various technological challenges associated with it. Though progress in VC techniques has been observed over the last four decades, mimicking a particular target speaker exactly, with a high-quality converted voice, remains an unsolved problem. One possible reason is that knowledge of speech production (such as nonlinear source-filter interaction) and perception, along with speech prosody and speaking style, has not been fully exploited. One of the key goals of this tutorial is to explain to what extent state-of-the-art VC techniques utilize speech production knowledge and what the key associated research issues are. Speech analysis/synthesis techniques, which are primarily motivated by speech production-perception mechanisms, will be discussed in detail. In particular, state-of-the-art high-quality vocoders, namely Speech Transformation and Representation using Adaptive Interpolation of Weight Spectrum (STRAIGHT) (originally proposed by Prof. H. Kawahara), glottal, sinusoidal, and statistical vocoders such as the WaveNet vocoder, will be discussed along with their relative comparisons. Furthermore, this tutorial will discuss the strengths and weaknesses of the state-of-the-art mapping techniques used in VC to transform spectral as well as source-related information. Modification of prosodic features, such as time-scale, pitch, and energy, is equally important for transforming perceptual speaker identity in VC. However, less attention has been given in the literature to transforming prosodic features. This tutorial will therefore also present earlier work related to prosodic feature modification in VC.
Furthermore, this tutorial will give an overview of the two international Voice Conversion Challenges held as special sessions at INTERSPEECH 2016 and Speaker Odyssey 2018. In addition, because improvements in VC can directly become a threat to speaker recognition systems, this tutorial will also give a brief overview of countermeasures that protect automatic speaker verification (ASV) systems against converted voices. The tutorial will conclude with a summary of the current state of the art in the field and a discussion of future research directions.

Monday, November 12, 13:30 - 17:00

T4: 360° Video for Immersive Media Communications

Yo-Sung Ho

In recent years, 360° video has become an active topic of research and development with the emerging market of AR/VR imaging products. 360° video plays a key role in providing more realistic and immersive perceptual experiences than its existing 2D counterpart, and it has many applications in immersive media environments. In this tutorial lecture, we will introduce the current MPEG-I visual activities for 360° video: definitions of 3DoF/3DoF+/omnidirectional 6DoF/windowed 6DoF/6DoF, exploration experiments, and test materials. After defining the basic requirements for 3D realistic multimedia services, we will cover various multi-modal immersive media processing techniques for 360° video applications.

Monday, November 12, 13:30 - 17:00

T5: Fundamental Concepts in Machine Learning

Nathan Srebro

Machine learning is becoming increasingly central in many application domains, including computer vision, speech recognition, medical imaging, and language processing. As such, many researchers with a background in signal and information processing are using machine learning methods on a daily basis. In fact, one of the advantages of such methods is that they are relatively easy to use even without specialized training. In this tutorial, I will step back and give an overview of the fundamental principles of machine learning. I will discuss machine learning as an engineering paradigm; contrast relying on expert knowledge with relying on data and relate this to the error decomposition at the core of machine learning; talk about the centrality of computational limitations for machine learning and discuss why, in contrast to statistics, machine learning would be trivial in the absence of such considerations; discuss how deep learning fits into this view of machine learning and what its promises and limitations are; and discuss the relationship between machine learning and optimization, statistics, and knowledge discovery. The aim of the tutorial is thus to give students and researchers already familiar with some of the basic techniques used in machine learning a more thorough understanding of the underlying principles, so that they can use machine learning in a more principled and reasoned way.

Monday, November 12, 13:30 - 17:00

T6: Generative Adversarial Network and its Applications to Speech Signal and Natural Language Processing

Hung-yi Lee, Yu Tsao

The generative adversarial network (GAN) is a new idea for training models, in which a generator and a discriminator compete against each other to improve the generation quality. Recently, GANs have shown impressive results in image generation, and a wide variety of new ideas, techniques, and applications have been developed based on them. Although there are only a few successful cases so far, GANs have great potential to be applied to text and speech generation to overcome limitations in conventional methods. This tutorial has three parts. In the first part, we will introduce GANs and provide a thorough review of the technology. In the second and third parts, we will focus on the applications of GANs to speech signal processing and natural language processing, respectively.