Have you ever imagined an AI playing poker against you? Facebook is set to make that a reality with its new general AI framework, Recursive Belief-based Learning (ReBeL), which can outperform humans at poker while relying on far less domain knowledge than previous AI poker systems.
With ReBeL, Facebook is also targeting multi-agent interactions, meaning the general algorithm could eventually be deployed at scale in multi-agent settings. Potential applications include auctions, negotiations, cybersecurity, and the operation of self-driving cars and trucks.
Facebook’s approach of combining reinforcement learning with search during AI model training can lead to some remarkable advances. Reinforcement learning has agents learn to achieve goals by maximizing rewards, while search is the process of planning from a starting state toward a goal state.
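To make the distinction concrete, here is a minimal, hypothetical Python sketch of the two ingredients. The function names, the `value` dictionary, and the `transitions` dictionary are illustrative assumptions for this article, not part of ReBeL’s actual code.

```python
def rl_value_update(value, state, reward, next_state, alpha=0.1, gamma=0.99):
    # Reinforcement learning (temporal-difference style): nudge the value of
    # `state` toward the observed reward plus the discounted value of the next state.
    target = reward + gamma * value[next_state]
    value[state] += alpha * (target - value[state])


def one_step_search(state, transitions, value):
    # Search (one-step lookahead): plan ahead by picking the action whose
    # successor state the current value estimates rate highest.
    return max(transitions[state], key=lambda action: value[transitions[state][action]])
```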
One example is DeepMind’s AlphaZero, which used a similar combination to deliver state-of-the-art performance in board games like chess, shogi, and Go. However, the combination falls short in games like poker, which involve imperfect information: players cannot see how the situation looks from the other side of the table, so actions have to rely on probabilities and playing strategies.
As a solution, Facebook researchers designed ReBeL to expand the notion of a “game state” to include the agent’s belief about which state it is in while playing, based on common knowledge and the policies of the other players.
In operation, ReBeL trains two AI models: a value network and a policy network. It applies reinforcement learning together with search during self-play, which results in a flexible algorithm that holds the potential to beat human players.
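As a rough illustration of what those two models could look like, here is a minimal sketch in PyTorch. The architectures, layer sizes, and class names are assumptions made for this example and are not Facebook’s actual networks.

```python
import torch
import torch.nn as nn


class ValueNetwork(nn.Module):
    # Scores a (belief) state: how good is this situation for the acting player?
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, state):
        return self.net(state)


class PolicyNetwork(nn.Module):
    # Maps a (belief) state to a probability distribution over the available actions.
    def __init__(self, state_dim, num_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_actions)
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)
```

During self-play, the value network would supply estimates for positions the search cannot expand fully, while the policy network proposes actions; both are then refined from the data the search produces.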
At a high level, ReBeL operates on public belief states (PBSs) rather than world states. Public belief states generalize the notion of “state value” to games with imperfect information like poker. A PBS is typically a common-knowledge probability distribution over a finite sequence of possible actions and states, also referred to as a history.
In perfect-information games, PBSs can be distilled down to histories, which in two-player zero-sum games effectively distill down to world states. Intuitively, a PBS in poker captures the decisions a player could make and their possible outcomes given a particular hand.
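For intuition, here is a minimal, hypothetical sketch of how a PBS might be represented: the publicly visible action sequence plus a common-knowledge probability distribution over the possible private hands. The field names and the numbers are illustrative assumptions, not ReBeL’s data structures.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class PublicBeliefState:
    # Actions every player has observed so far (the public history).
    public_actions: Tuple[str, ...]
    # Common-knowledge probability that the opponent holds each private hand
    # (only a few entries of an illustrative distribution are shown).
    hand_probabilities: Dict[str, float]


pbs = PublicBeliefState(
    public_actions=("check", "bet_100"),
    hand_probabilities={"AA": 0.02, "KQ": 0.05, "72": 0.01},
)
```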
At the start of each game, ReBeL generates a “subgame” that is identical to the original game, except that it is rooted at an initial PBS. The algorithm solves it by running iterations of an equilibrium-finding algorithm, using the trained value network to approximate values at every iteration. Through reinforcement learning, the discovered values are added back as training examples for the value network, and the policies in the subgame are added as examples for the policy network. The process then repeats, with a new PBS becoming the subgame root, until accuracy reaches a certain threshold.
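Here is a hypothetical sketch of that loop in Python. The helper callables (`build_subgame`, `solve_iteration`, `sample_next_pbs`) and the attributes on the subgame and PBS objects are placeholders assumed for illustration; they are not ReBeL’s actual API.

```python
def rebel_style_self_play(initial_pbs, build_subgame, solve_iteration, sample_next_pbs,
                          value_net, num_solver_iterations=100):
    # Training examples collected during self-play.
    value_examples, policy_examples = [], []
    pbs = initial_pbs
    while not pbs.is_terminal:
        # Build a subgame identical to the original game but rooted at the current PBS.
        subgame = build_subgame(pbs)
        policy = None
        for _ in range(num_solver_iterations):
            # Each equilibrium-finding iteration refines the subgame policy,
            # using the value network to estimate values at the subgame leaves.
            policy = solve_iteration(subgame, value_net)
        # The solved root value becomes a training example for the value network,
        # and the subgame policy becomes a training example for the policy network.
        value_examples.append((pbs, subgame.root_value))
        policy_examples.append((pbs, policy))
        # A leaf PBS of the solved subgame becomes the root of the next subgame.
        pbs = sample_next_pbs(subgame, policy)
    return value_examples, policy_examples
```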
As part of their experiments, the researchers benchmarked ReBeL on heads-up no-limit Texas hold’em poker, Liar’s Dice, and turn endgame hold’em. They used up to 128 PCs with eight graphics cards each to generate the simulated game data, and randomized the bet and stack sizes (ranging from 5,000 to 25,000 chips) to test its abilities.
ReBeL was also tested in a match against Dong Kim, one of the best heads-up poker players in the world. ReBeL played at a pace of under two seconds per hand across 7,500 hands and never needed more than five seconds for any decision. Overall, ReBeL scored 165 thousandths of a big blind per game, a strong result compared to the 147 thousandths achieved by Libratus, the previous state-of-the-art poker-playing system.
To prevent cheating, Facebook has decided not to release ReBeL’s codebase for poker. The company has only open-sourced its implementation for Liar’s Dice, which the researchers say is easier to understand and adjust.
Photo: Josh Edelson / Agence France-Presse / Getty Images