Policy gradient methods

Reinforcement learning (RL) represents an important class of learning algorithms that learn from an agent's experience in an episodic setting. Much like how we learn from experience, an RL algorithm learns by sequentially executing actions, observing the outcomes of those actions, and updating its behavior so as to achieve better outcomes in the future. This approach is distinct from traditional supervised learning, in which models learn from labeled data; in a certain sense, RL algorithms generate their own labeled data through experimentation. This view exposes two important characteristics of RL algorithms: (i) they are extremely sample inefficient, since a single reward signal may only arrive after a long sequence of actions, and (ii) the way rewards and penalties are allocated to those actions introduces noise. For instance, although a sequence of chess moves may have produced a win, the final result says nothing about the relative quality of each individual move, so we cannot avoid occasionally reinforcing bad actions and penalizing good ones. For these reasons, RL algorithms typically require a large number of simulations to converge. Despite the inefficient sampling and noisy credit assignment, RL allows us to directly optimize a model to perform complex tasks, such as playing chess, walking, or playing video games, rather than just mapping inputs to outputs in a dataset. Within RL there are a few different entities that we can try to learn, each of which offers advantages and disadvantages in terms of performance and convergence for different tasks, and many modern RL methods, such as the actor-critic approach, combine them to boost performance. In this post I'll briefly discuss and derive these entities and describe how they relate to each other.
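To make the policy-gradient idea concrete, here is a minimal REINFORCE-style sketch rather than the full derivation from the post: a softmax policy over a toy multi-armed bandit, updated with the score-function estimator and a running baseline. The bandit payoffs, learning rate, and baseline scheme are illustrative assumptions, not anything from the post itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-armed bandit: each arm pays a noisy reward around a fixed mean.
true_means = np.array([0.2, 0.5, 0.9])

theta = np.zeros(3)          # policy parameters (one logit per arm)
baseline, lr = 0.0, 0.05     # running reward baseline and learning rate

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(5000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)                  # sample an action from the policy
    r = true_means[a] + 0.1 * rng.standard_normal()

    # Score-function (REINFORCE) gradient for a softmax policy:
    # grad log pi(a) = one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * (r - baseline) * grad_log_pi  # gradient ascent on expected reward

    baseline += 0.01 * (r - baseline)           # baseline reduces gradient variance

print("learned action probabilities:", softmax(theta).round(3))
# The policy should concentrate most of its mass on the best arm (index 2).
```

The same update rule carries over to sequential settings once single-step rewards are replaced by returns, which is where value functions and the actor-critic combination come in.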

OrbTouch -> using deformation as a medium for human-computer interaction

This mini-post explores the use of deformation as a medium for human-computer interaction. The question being asked is the following: can we use the shape of an object, and how it changes in time, to encode information? Here I present a deep learning approach to this problem and show that we can use a balloon to control a computer.

A comparison of robust regression methods based on iteratively reweighted least squares and maximum likelihood estimation

Time series data often have heavy-tailed noise, which violates the Gaussian error assumption behind ordinary least squares and makes its estimates sensitive to outliers; this kind of noise appears throughout economics, finance, and engineering. In this post I compare a few well-known methods from the robust linear regression literature, based on iteratively reweighted least squares (IRLS) and maximum likelihood estimation, using a toy dataset as a running example.
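As a preview of the IRLS approach, here is a minimal sketch with Huber weights on a synthetic line contaminated by a few large outliers. The Huber threshold, the MAD-based scale estimate, and the synthetic data are illustrative choices, not the dataset or tuning used in the post.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic line y = 2x + 1 with Gaussian noise plus a few large outliers.
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)
y[::17] += 15.0                              # heavy-tailed contamination

X = np.column_stack([np.ones_like(x), x])    # design matrix [1, x]

def irls_huber(X, y, delta=1.0, n_iter=50):
    """Iteratively reweighted least squares with Huber weights."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # start from the OLS fit
    for _ in range(n_iter):
        r = y - X @ beta
        scale = np.median(np.abs(r)) / 0.6745 + 1e-12  # robust scale via MAD
        u = np.abs(r / scale)
        w = np.where(u <= delta, 1.0, delta / u)       # Huber weights
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta

print("OLS   fit:", np.linalg.lstsq(X, y, rcond=None)[0].round(3))
print("Huber fit:", irls_huber(X, y).round(3))
# The Huber fit should stay close to (1, 2) despite the outliers.
```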

Cephalopod-inspired mechanical systems

With the exception of computer automation, the design of mechanical systems has not fundamentally changed since the Industrial Revolution. Even though our philosophy, aesthetics, and tools are far more refined and capable, we still design mechanical systems around bars, linkages, and motors. Similarly, electronic systems have largely been constrained to rigid circuits. These constraints are the central challenge being addressed by two emerging fields: soft robotics and soft electronics. In this post I'll talk briefly about why non-rigid systems are interesting (and potentially useful), and present results from recent work on this topic that I published in Science.

Deep reinforcement learning for checkers -- pretraining a policy

This post discusses an approach to approximating a checkers-playing policy with a neural network trained on a database of human play. It is the first post in a series covering the engine behind checkers.ai, which uses a combination of supervised learning, reinforcement learning, and game tree search to play checkers, an approach modeled on AlphaGo (see this paper). Unlike Go, checkers has been weakly solved: an optimal strategy was found using exhaustive tree search and end-game databases (see this paper). Although Chinook has already solved checkers, the game is still very complex, with over 5 x 10^20 states, making it an interesting application of deep reinforcement learning. To give you an idea, it took Jonathan Schaeffer and his team at the University of Alberta 18 years (1989-2007) to search the game tree end-to-end. Here I will discuss how we can use a database of expert human moves to pretrain a policy, which will be the building block of the engine. I'll show that a pretrained policy can beat intermediate/advanced human players as well as some of the popular online checkers engines. In later posts we'll take this policy and improve it through self-play.
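To illustrate the pretraining step, here is a hedged sketch of supervised policy pretraining in PyTorch. The board encoding (32 playable squares times 4 piece planes), the fixed move vocabulary, the network size, and the randomly generated stand-in data are all assumptions made for illustration; they are not the actual checkers.ai representation or dataset.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Assumed encoding: 32 playable squares x 4 planes (own man/king, opp man/king),
# and moves indexed into a fixed move vocabulary. These are stand-in choices.
N_FEATURES = 32 * 4
N_MOVES = 256

class PolicyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_FEATURES, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, N_MOVES),      # logits over the move vocabulary
        )

    def forward(self, x):
        return self.net(x)

# Random stand-in data in place of the expert-move database.
boards = torch.randint(0, 2, (10_000, N_FEATURES)).float()
moves = torch.randint(0, N_MOVES, (10_000,))
loader = DataLoader(TensorDataset(boards, moves), batch_size=256, shuffle=True)

model = PolicyNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()           # imitate the expert's chosen move

for epoch in range(3):
    total = 0.0
    for xb, yb in loader:
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
        total += loss.item() * xb.size(0)
    print(f"epoch {epoch}: loss {total / len(loader.dataset):.3f}")

# At play time, illegal moves would be masked out before taking the argmax.
```

Cross-entropy against the expert's chosen move is the standard way to bootstrap a policy before reinforcement learning and self-play refine it further.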