# IT-Forum

## IT-Forum presents Uncoupled isotonic regression and Wasserstein deconvolution

Topic:
Uncoupled isotonic regression and Wasserstein deconvolution
Abstract / Description:

Isotonic regression is a standard problem in shape-constrained estimation where the goal is to estimate an unknown nondecreasing regression function f from independent pairs (x_i, y_i) with 𝔼[y_i] = f(x_i), i = 1, ..., n. While this problem is well understood both statistically and computationally, much less is known about its uncoupled counterpart, where one is given only the unordered sets {x_1, ..., x_n} and {y_1, ..., y_n}. In this work, we leverage tools from optimal transport theory to derive minimax rates under weak moment conditions on y_i and to give an efficient algorithm achieving the optimal rates. Both the upper and lower bounds employ moment-matching arguments that are also pertinent to learning mixtures of distributions and to deconvolution.
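As a point of reference, here is a minimal NumPy sketch (the regression function and noise model are illustrative choices, not taken from the talk): the coupled fit uses the pool-adjacent-violators algorithm, while a natural uncoupled estimator simply matches sorted y-values to sorted x-values, which is the one-dimensional optimal transport coupling. The sorted-matching estimator is biased under noise, which is where the deconvolution perspective enters.

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: L2 projection of y onto nondecreasing sequences."""
    blocks = []                      # each block is [mean, size]
    for v in np.asarray(y, dtype=float):
        blocks.append([v, 1.0])
        # merge while the last two blocks violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(w1 * m1 + w2 * m2) / (w1 + w2), w1 + w2])
    return np.concatenate([np.full(int(w), m) for m, w in blocks])

rng = np.random.default_rng(0)
n = 500
x = np.sort(rng.uniform(0, 1, n))
y = x**2 + 0.1 * rng.normal(size=n)  # f(t) = t^2, an illustrative choice

coupled = pava(y)        # classical isotonic fit from the pairs (x_i, y_i)
uncoupled = np.sort(y)   # uncoupled fit: 1-D optimal transport matches the
                         # i-th smallest y to the i-th smallest x
```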

Date and Time:
Friday, October 26, 2018 - 1:15pm
Venue:
Packard 202

## IT-Forum: Arbitrarily Varying Broadcast and Relay Channels

Topic:
Arbitrarily Varying Broadcast and Relay Channels
Abstract / Description:

Two models of an arbitrarily varying channel (AVC) are studied; both are relevant to modern networks under jamming attacks by an adversary or a hacker. The arbitrarily varying broadcast channel is considered when state information is available at the transmitter in a causal manner. Inner and outer bounds are established, on both the random code capacity region and the deterministic code capacity region with degraded message sets. The form of the bounds raises the question whether the minimax theorem can be generalized to rate regions, i.e. whether the order of the intersection over state distributions and the union over Shannon strategies can be interchanged. A sufficient condition is given, under which this assertion holds and the random code capacity region is determined. As an example, the arbitrarily varying binary symmetric broadcast channel is examined, showing that there are cases where the condition holds, hence the capacity region is determined, and other cases where there is a gap between the bounds. The gap implies that the minimax theorem does not always hold for rate regions.

In the second part of the talk, a new model is introduced, namely, the arbitrarily varying relay channel. The results include the cutset bound, decode-forward bound and partial decode-forward bound on the random code capacity, which require modification of the usual methods for the AVC to fit the block Markov coding scheme. The random code capacity is further determined for special cases. Then, deterministic coding schemes are considered, and the deterministic code capacity is derived under certain conditions, for the degraded and reversely degraded relay channel, and the case of orthogonal sender components. The following question is addressed: If the encoder-decoder and encoder-relay marginals are both symmetrizable, does that necessarily imply zero capacity? We show and explain why the answer is no. The random code capacity is determined for the arbitrarily varying Gaussian relay channel with sender frequency division, and the deterministic code capacity is bounded using the techniques of Csiszár and Narayan's 1991 paper on the Gaussian AVC. It is observed that the gap vanishes as the input becomes less constrained. It is commonly believed that the primitive relay channel "captures most essential features and challenges of relaying, and thus serves as a good testbed for new relay coding techniques" (Kim, 2007). It is observed that in the arbitrarily varying case, this may no longer be true.

This work is part of a Ph.D. thesis under the supervision of Yossef Steinberg.

Date and Time:
Friday, October 12, 2018 - 1:15pm
Venue:
Packard 202

## IT-Forum: Structured Cooperation for Channels with Feedback and beyond

Topic:
Structured Cooperation for Channels with Feedback and beyond
Abstract / Description:

The capacities of fundamental communication problems, such as channels with feedback and two-way communication channels, are characterized by multi-letter expressions. The challenge in simplifying these expressions is their exhaustive dependence on all the information accumulated throughout the communication. In this talk, we aim to simplify such capacities by imposing structure on the accumulated data via a new sequential quantization technique on a directed graph.

The first application of this method is to channels with memory and feedback. We will show single-letter upper and lower bounds on the capacity. The bounds are expressed via a structured auxiliary random variable (r.v.), a notion suited to problems of a sequential nature. In all cases where the capacity is known, the bounds are tight (with a small cardinality of the structured auxiliary r.v.). This reveals a simple capacity formula that captures the major role of structure in feedback problems. We will also present a simple sequential coding scheme, based on the posterior matching principle, that achieves the lower bound (and the capacity in many cases).
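The posterior matching principle has a classical single-user instance, the Horstein scheme for the binary symmetric channel with noiseless feedback. The discretized toy implementation below (not the structured graph-based scheme of the talk) conveys the idea: the message is a point in [0,1], each channel input indicates whether the message lies above the current posterior median, and both terminals track the same posterior thanks to the feedback.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.1                                  # BSC crossover probability
grid = np.linspace(0.0, 1.0, 10001)      # discretized message interval
theta = 0.3141                           # message point in [0, 1]
posterior = np.full(grid.shape, 1.0 / grid.size)

for _ in range(200):
    # encoder: transmit 1 iff the message lies above the posterior median
    median = grid[np.searchsorted(np.cumsum(posterior), 0.5)]
    x = int(theta > median)
    # binary symmetric channel: flip the bit with probability p
    y = x ^ int(rng.random() < p)
    # both terminals see y (feedback) and perform the same Bayes update
    x_of = (grid > median).astype(int)   # input each candidate message implies
    posterior *= np.where(x_of == y, 1.0 - p, p)
    posterior /= posterior.sum()

theta_hat = grid[np.argmax(posterior)]   # decoder's estimate of the message
```

After 200 channel uses the posterior concentrates sharply around the transmitted message point, even though roughly 10% of the bits are flipped.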

As time permits, we will show that structure is beneficial for other communication scenarios such as two-way communication channels with common outputs and the energy harvesting model.

The talk is based on joint work with Prof. Henry Pfister (Duke University), Prof. Haim Permuter (BGU) and Prof. Navin Kashyap (IISc).

Date and Time:
Monday, October 8, 2018 - 4:15pm
Venue:
Packard 202

## IT-Forum: Data-Driven Policy Learning: Generalization and Optimization

Topic:
Data-Driven Policy Learning: Generalization and Optimization
Abstract / Description:

The problem of learning good treatment assignment rules from observational data lies at the heart of many challenges in data-driven decision making. While there is a growing body of literature devoted to this problem, most existing results are focused on the binary-action case (i.e., where one action corresponds to assignment to control and the other to assignment to treatment). In this paper, we study the offline multi-action policy learning problem with observational data and, building on the theory of efficient semi-parametric inference, propose and implement a policy learning algorithm that achieves asymptotically minimax-optimal regret. To the best of our knowledge, this is the first result of this type in the multi-action setup, and it provides a substantial performance improvement over existing learning algorithms. We additionally investigate the application aspects of policy learning by working with decision trees, and discuss two different approaches for solving the key step of the learning algorithm to exact optimality: one using a mixed integer program formulation and the other using a tree-search based algorithm.
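To make the key objects concrete, here is a small sketch of multi-action policy learning with doubly-robust (AIPW) scores, under simplifying assumptions not made in the paper: propensities are known and uniform, the outcome-model estimate is a crude zero baseline, the data-generating process is hypothetical, and the policy class is exhaustively searched depth-1 stumps rather than full decision trees.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 2000, 3                            # samples and number of actions

X = rng.normal(size=(n, 2))
# hypothetical truth: each action pays off on its own region of X[:, 0]
mu = np.stack([(X[:, 0] < -0.5).astype(float),
               (np.abs(X[:, 0]) <= 0.5).astype(float),
               (X[:, 0] > 0.5).astype(float)], axis=1)
e = np.full((n, K), 1.0 / K)              # known uniform propensities (assumed)
A = rng.integers(0, K, size=n)            # logged actions
Y = mu[np.arange(n), A] + 0.1 * rng.normal(size=n)

# doubly-robust (AIPW) score matrix: Gamma[i, a] estimates action a's reward
mu_hat = np.zeros((n, K))                 # crude outcome model: zero baseline
Gamma = mu_hat.copy()
Gamma[np.arange(n), A] += (Y - mu_hat[np.arange(n), A]) / e[np.arange(n), A]

# exhaustive search over depth-1 "stump" policies on X[:, 0]
best = None
for thr in np.linspace(-1.0, 1.0, 41):
    for lo in range(K):
        for hi in range(K):
            pi = np.where(X[:, 0] <= thr, lo, hi)
            value = Gamma[np.arange(n), pi].mean()
            if best is None or value > best[0]:
                best = (value, thr, lo, hi)
```

The exhaustive loop is a miniature of the tree-search approach: it evaluates every policy in the class against the same score matrix and keeps the maximizer.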

This is joint work with Susan Athey and Stefan Wager.

The Information Theory Forum (IT-Forum) at Stanford ISL is an interdisciplinary academic forum which focuses on mathematical aspects of information processing. With a primary emphasis on information theory, we also welcome researchers from signal processing, learning and statistical inference, control and optimization to deliver talks at our forum. We also warmly welcome industrial affiliates in the above fields. The forum is typically held in Packard 202 every Friday at 1:15 pm during the academic year.

The Information Theory Forum is organized by graduate students Yanjun Han and Yihui Quek. To suggest speakers, please contact either of them.

Date and Time:
Friday, October 5, 2018 - 1:15pm
Venue:
Packard 202

## IT-Forum presents Representations, fairness, and privacy: information-theoretic tools for machine learning

Topic:
Representations, fairness, and privacy: information-theoretic tools for machine learning
Abstract / Description:

Information theory can shed light on the algorithm-independent limits of learning from data and serve as a design driver for new machine learning algorithms. In this talk, we discuss a set of flexible information-theoretic tools called the principal inertia components (PICs) that can be used to (i) understand fairness and discrimination in machine learning models, (ii) provide an estimation-theoretic view of privacy, and (iii) characterize data representations learned by complex learning models. The PICs enjoy a long history in both statistics and information theory, and provide a fine-grained decomposition of the dependence between two random variables. We illustrate these techniques on both synthetic and real-world datasets, and discuss future research directions.
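For readers unfamiliar with the PICs, a standard way to compute them for finite alphabets is as the squared nontrivial singular values of a normalized joint-distribution matrix; the largest of them is the squared Hirschfeld-Gebelein-Rényi maximal correlation. The sketch below uses a hypothetical 3×3 joint pmf.

```python
import numpy as np

# hypothetical joint pmf of (X, Y) on a 3 x 3 alphabet
P = np.array([[0.20, 0.05, 0.05],
              [0.05, 0.20, 0.05],
              [0.05, 0.05, 0.30]])
px, py = P.sum(axis=1), P.sum(axis=0)    # marginals

# PICs: squared nontrivial singular values of D_X^{-1/2} P D_Y^{-1/2}
Q = P / np.sqrt(np.outer(px, py))
s = np.linalg.svd(Q, compute_uv=False)   # s[0] = 1 is the trivial component
pics = s[1:] ** 2                        # maximal correlation is sqrt(pics[0])
```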

Date and Time:
Friday, September 28, 2018 - 1:15pm
Venue:
Packard 202

## IT-Forum: Reverse hypercontractivity beats measure concentration for information theoretic converses

Topic:
Reverse hypercontractivity beats measure concentration for information theoretic converses
Abstract / Description:

Concentration of measure is a collection of tools and results from analysis and probability theory that have been used in many areas of pure and applied mathematics. Arguably, the first data science application of measure concentration (under the name ''blowing-up lemma'') is the proof of strong converses in multiuser information theory by Ahlswede, Gács and Körner in 1976. Since then, measure concentration has found applications in many other information theoretic problems, most notably the converse (impossibility) results in information theory. Motivated by this, information theorists (e.g. Marton) have also contributed to the mathematical foundations of measure concentration using their information-theoretic techniques.

Now, after 40 years of such progress, we find that, amusingly, measure concentration is not the right hammer for many of these information theoretic applications. We introduce a new machinery based on functional inequalities and reverse hypercontractivity which yields strict improvements in terms of the sharpness of the bounds, the generality of the source/channel distributions, and the simplicity of the proofs. Examples covered in the talk include: 1. an optimal second-order converse for common randomness generation with rate-limited communication; 2. a sharpening of the relay channel converse bounds of Wu and Ozgur, with much simpler proofs.

The work benefited from collaborations with Thomas Courtade, Paul Cuff, Ayfer Ozgur, Ramon van Handel, and Sergio Verdú.

Date and Time:
Friday, August 24, 2018 - 1:15pm
Venue:
Packard 202

## IT-Forum: Hardware-Limited Task-Based Quantization

Topic:
Hardware-Limited Task-Based Quantization
Abstract / Description:

Quantization plays a critical role in digital signal processing systems. Quantizers are typically designed to obtain an accurate digital representation of the input signal, operating independently of the system task, and are commonly implemented using serial scalar analog-to-digital converters (ADCs). This talk is concerned with hardware-limited task-based quantization, where a system utilizing a serial scalar ADC is designed to provide a suitable representation in order to allow the recovery of a parameter vector underlying the input signal. We propose hardware-limited task-based quantization systems for a fixed and finite quantization resolution, and characterize their achievable distortion. Our results illustrate the benefits of properly taking into account the underlying task in the design of the quantization scheme.
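A toy experiment in the spirit of the talk (the linear model, bit budgets, and parameter choices are illustrative, not the talk's system): estimating a scalar parameter s from x = h·s + w, comparing task-ignorant per-sample quantization against quantizing a single task-matched analog combination, which here attains lower distortion with far fewer total bits.

```python
import numpy as np

rng = np.random.default_rng(0)

def midrise_quantize(v, n_bits, vmax):
    """Uniform mid-rise scalar quantizer with support [-vmax, vmax]."""
    levels = 2 ** n_bits
    step = 2.0 * vmax / levels
    idx = np.clip(np.floor(v / step), -(levels // 2), levels // 2 - 1)
    return idx * step + step / 2.0

n, trials = 16, 2000
h = np.ones(n) / np.sqrt(n)          # known linear model: x = h * s + w
err_signal = err_task = 0.0
for _ in range(trials):
    s = rng.normal()
    x = h * s + 0.3 * rng.normal(size=n)
    # task-ignorant: 3 bits on each of the 16 samples (48 bits total)
    xq = midrise_quantize(x, 3, 3.0)
    s_hat1 = h @ xq / (h @ h)
    # task-based: combine in analog first, then 8 bits on one scalar
    z = h @ x / (h @ h)
    s_hat2 = midrise_quantize(np.array([z]), 8, 3.0)[0]
    err_signal += (s - s_hat1) ** 2
    err_task += (s - s_hat2) ** 2

mse_signal, mse_task = err_signal / trials, err_task / trials
```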

Date and Time:
Wednesday, June 13, 2018 - 1:15pm
Venue:
Packard 202

## IT Forum: Phase transitions in generalized linear models

Topic:
Phase transitions in generalized linear models
Abstract / Description:

This is joint work with Jean Barbier, Florent Krzakala, Nicolas Macris and Lenka Zdeborova.

We consider generalized linear models (GLMs) where an unknown $n$-dimensional signal vector is observed through the application of a random matrix and a non-linear (possibly probabilistic) componentwise output function.

We study the models in the high-dimensional limit, where the observations consist of $m$ points, and $m/n \to \alpha > 0$ as $n \to \infty$. This situation is ubiquitous in applications ranging from supervised machine learning to signal processing.

We will analyze the model case in which the observation matrix has i.i.d. elements and the components of the ground-truth signal are drawn independently from some known distribution.
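For concreteness, here is a hypothetical instance of such a GLM: a Rademacher prior on the signal, an i.i.d. Gaussian observation matrix, and a noisy-sign componentwise output. The specific choices are illustrative, not the ones analyzed in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 1000, 2.0
m = int(alpha * n)                        # observations scale as m/n -> alpha

x = rng.choice([-1.0, 1.0], size=n)       # signal drawn iid from a known prior
A = rng.normal(size=(m, n)) / np.sqrt(n)  # iid observation matrix
delta = 0.5                               # noise level
y = np.sign(A @ x + np.sqrt(delta) * rng.normal(size=m))  # noisy-sign output
```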

We will compute the limit of the mutual information between the signal and the observations in the large system limit. This quantity is particularly interesting because it is related to the free energy (i.e. the logarithm of the partition function) of the posterior distribution of the signal given the observations. Therefore, the study of the asymptotic mutual information allows one to deduce the limit of important quantities such as the minimum mean squared error for the estimation of the signal.

We will observe phase transition phenomena. Depending on the noise level, the distribution of the signal, and the non-linear function of the GLM, we may encounter various scenarios where it is impossible, hard (possible only with exponential-time algorithms), or easy (possible with polynomial-time algorithms) to recover the signal.

Date and Time:
Friday, May 18, 2018 - 1:15pm
Venue:
Packard 202

## ISL Colloquium & IT Forum: Random initialization and implicit regularization in nonconvex statistical estimation

Topic:
Random initialization and implicit regularization in nonconvex statistical estimation
Abstract / Description:

Recent years have seen a flurry of activities in designing provably efficient nonconvex procedures for solving statistical estimation / learning problems. Due to the highly nonconvex nature of the empirical loss, state-of-the-art procedures often require suitable initialization and proper regularization (e.g. trimming, regularized cost, projection) in order to guarantee fast convergence. For vanilla procedures such as gradient descent, however, prior theory is often either far from optimal or completely lacks theoretical guarantees.

This talk is concerned with a striking phenomenon arising in two nonconvex problems (i.e. phase retrieval and matrix completion): even in the absence of careful initialization, proper saddle escaping, and/or explicit regularization, gradient descent converges to the optimal solution within a logarithmic number of iterations, thus achieving near-optimal statistical and computational guarantees at once. All of this is achieved by exploiting the statistical models in analyzing optimization algorithms, via a leave-one-out approach that enables the decoupling of certain statistical dependency between the gradient descent iterates and the data. As a byproduct, for noisy matrix completion, we demonstrate that gradient descent achieves near-optimal entrywise error control.
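A minimal numerical sketch of the phase-retrieval half of this phenomenon, with illustrative dimensions and a heuristic step size (not the paper's exact parameters): vanilla gradient descent on the quartic least-squares loss, started from a random point with no spectral initialization and no explicit regularization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 600                       # signal dimension, number of measurements
x = rng.normal(size=n)
x /= np.linalg.norm(x)               # ground truth, normalized
A = rng.normal(size=(m, n))
y = (A @ x) ** 2                     # noiseless phase retrieval measurements

z = rng.normal(size=n)
z /= np.linalg.norm(z)               # random initialization, no spectral step
eta = 0.05                           # heuristic step size

for _ in range(2000):
    Az = A @ z
    # gradient of f(z) = (1/4m) * sum_i ((a_i' z)^2 - y_i)^2
    grad = A.T @ ((Az ** 2 - y) * Az) / m
    z -= eta * grad

dist = min(np.linalg.norm(z - x), np.linalg.norm(z + x))  # up to global sign
```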

Date and Time:
Wednesday, May 23, 2018 - 4:15pm
Venue:
Building 370

## IT-Forum: Tight sample complexity bounds via dualizing LeCam's method

Topic:
Tight sample complexity bounds via dualizing LeCam's method
Abstract / Description:

In this talk we consider the general question of estimating a linear functional of a distribution based on noisy samples from it. We discover that the (two-point) LeCam lower bound is in fact achievable by optimizing the bias-variance tradeoff of an empirical-mean type of estimator. We extend the method to certain symmetric functionals of high-dimensional parametric models.

Next, we apply this general framework to two problems: population recovery and predicting the number of unseen species. In population recovery, the goal is to estimate an unknown high-dimensional distribution (in $L_\infty$-distance) from noisy samples. In the case of *erasure* noise, i.e., when each coordinate is erased with probability $\epsilon$, we discover a curious phase transition in sample complexity at $\epsilon=1/2$. In the second (classical) problem, we observe $n$ iid samples from an unknown distribution on a countable alphabet and the goal is to predict the number of new species that will be observed in the next (unseen) $tn$ samples. Again, we discover a phase transition at $t=1$. In both cases, the complete characterization of sample complexity relies on complex-analytic methods, such as Hadamard's three-lines theorem.
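The unseen-species problem has a classical baseline, the Good-Toulmin estimator $U = -\sum_{j \ge 1} (-t)^j \Phi_j$, where $\Phi_j$ counts species seen exactly $j$ times. Its terms blow up geometrically for $t > 1$, which is one heuristic way to see why $t = 1$ is critical. A small sketch with a hypothetical uniform ground truth (the alphabet size and sample size are illustrative):

```python
import numpy as np
from collections import Counter

def good_toulmin(samples, t):
    """Good-Toulmin estimate of the number of new species in the next t*n draws."""
    phi = Counter(Counter(samples).values())   # phi[j] = # species seen j times
    return -sum(((-t) ** j) * nj for j, nj in phi.items())

rng = np.random.default_rng(0)
support, n, t = 300, 200, 0.5
seen = rng.integers(0, support, size=n)        # n iid draws, uniform ground truth
est = good_toulmin(seen.tolist(), t)           # prediction for the next t*n draws
```

For this uniform source the expected number of new species in the next $tn = 100$ draws is roughly 44, and the estimator lands in that vicinity.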

Joint work with Yihong Wu (Yale).

Date and Time:
Friday, May 4, 2018 - 1:15pm
Venue:
Packard 202