par N Z Il y a 1 année
120
Plus de détails
Policy-Based Methods
Find the policy directly from searching the policy space, rather than state space.
go in the direction of the gradient to improve policy.
Find gradient of policy.
ie how the policy changes as a function of the actions you pick?
Essentially hill climbing.
Policy Gradient Theorem
Value-Based Methods
To find the optimal Q-function Q*.
To find π* from Q*:
π* = argmax_a (Q*(s, a)).
So at any state s, π* should return the action a that gives the best values of Q*(s,a)
To find Q*:
DQN
A neural network approach to estimate Q*.
Q-Learning algorithm
Approximate Q* (Q_hat) by updating its value via temporal-difference learning.
temporal error is given by the difference of the temporal difference target and the current value of Q_hat(s,a)
given by the reward gained and the max Q_hat at s_t+1, varying the action a.
update Q_hat(s,a) by updating current value with temporal difference error
Q* implies that the policy that is followed, ie the actions that are chosen in each state, is the optimal one.
V(s, a) function
This is evaluated in each state, therefore providing information as to which route to take at any moment to maximize expected, discounted, cumulative return.
Following policy π, V(s) evaluates each state s based on what other valued states it can reach from s. Reward is used to judge the intrinsic value of a state to reach a solution to the MDP.
Q(s, a) function
Hence it evaluates each state-action pair while following π (so not all state-action pairs), based on what the agent could do from that state onwards.
Following policy π, ie a is dictated by π, Q(s, a) is the expected, discounted cumulative reward for taking an action according to π in state s, evaluated at all states s.
Problem Formulation: Markov Decision Process
The goal of the agent is to solve the MDP.
Ideally it will find that policy π* which maximises the expected , discounted, cumulative reward.
To do this, it will find a policy function π: S -> Δ(A).
ie a strategy or a decision or a solution
A function that returns an action to pick based on the state the agent is in.
γ is [0, 1]: the discount factor that diminishes the values that are far away in time.
R: S x A x S -> RealNumber. Function returning a real scalar based on the agent's transition from s to s', as a result of action a.
P: S x A -> Δ(S) transition probability from state s, by taking action a to state s'.
A: the set of possible actions taken by the agent at any state
a is an action
S: the set of possible states in the environment
s is a state
If two players Alice and Bob choose strategies A and B, (A, B) is a Nash equilibrium if Alice has no other strategy available that does better than A at maximizing her payoff in response to Bob choosing B, and Bob has no other strategy available that does better than B at maximizing his payoff in response to Alice choosing A. (Wikipedia)
A situation where no-one has anything to gain from changing only their own strategy.
When all players have no better move than the current, as a response to other player's moves.
multi-agent object tracking, but while moving.
cooperation
for cooperation:
so if the ratio of the cost of helping to the benefit it will impart is smaller than the acquaintanceship.
degree of acquaintanceship > ratio of cost of altruistic help TO the benefit to the recipient
degree of acquaintanceship
ie how acquainted they are by giving and receiving help.
probability that an agent knows another's reputation
reputation
reciprocity improves an agent's reputation.
indirect reciprocity
an agent gives help and so will expect to receive help in the future.
direct reciprocity
A helps B, so A will expect help in the future from B
They show that in this scenario those agents that don't cooperate are worse off.
they apply a stochastic decision-making algo that encourages reciprocity while avoiding being taken advantage of.
they also suggest that the best policy for learners is to minimize their assumptions about other learners' policies.
An alternative: find out who is cooperating, competing or in another form of relationship.
Depending on beliefs, result can be better or worse.
they suggest that resulting behaviours of learning other agents models is highly dependent on agents' initial beliefs.
this produces some form of trust
agents trust each other since they know how they reason?
1-level agents that each other's probability distribution.
Categorisation of agents based on their belief of on the adaptability of other agents.
in general N-level agent models teammates as (N-1)-level agents
1-level agent:
I guess it assumes agents are just slightly learning & adapting.
models its teammates as 0-level.
0-level agent:
thinks none of its teammates as performing any learning activity nor adaptive.
Agent A predicts Agent B which predicts Agent A which predicts....
based on these models, agents modify their behaviour to better cooperate with other agents.
Averaging Rewards over sequences of tasks, rather than discounting.
Cooperation is achieved when agents that are further away also get rewarded.
More immediate reward ensures that agents further away from the solution, due to workflow get less reward.
so I guess:
it rewards important contributions
it rewards participation
reward based on how the team would accomplish the task without the learner.
The agent may use a Kalman filter to separate these two signals, and compute its real reward from participating in completing the global task.
In CV, it is used to try to track an object as it passes behind an occlusion.
A Kalman filter can be used to estimate an unknown quantity based on noise or other variables, by estimating a joint probability distribution on these variables.
The learner assumes the reward is a sum of
Some Markov process that estimates the contribution of teammates.
its own contribution to solving the task
2. Vicarious Reinforcement
this allows to balance between local rewarding and global rewarding.
so, reward is received whenever other learners are rewarded.
Vicarious = feeling through/for others.
1. Observational Reinforcement
obtained when a learner reproduces the behaviour of another learner.
may improve overall behaviour by reproducing rare behaviour.
Local reward increases the homogeneity of the learners.
This might lead to more general solutions, so that any agent in any circumstance might find solutions.
With local rewards, learners tend to converge on similar solutions.
This promotes greediness in learners. They won't cooperate, because they don't have enough incentive to.
Reward each agent based only on their performance. Ignore team efforts.
Problems:
Situations where reward has to be assigned differently because global reward can't be computed.
ie: reward for all agents cannot be gathered together.
Not enough individual feedback to learn a solution.
Some learners end up doing most of the work, others just become lazy.
Split team reward equally among learners
divide the reward so that the whole team benefits when 1 team member is rewarded (or punished).
cyclic behaviours
in competitive learning, agents may get stuck in loops rather than building an arms race.
agents circle about one another due to non-transitive relationships
ie they get stuck in loops.
Red-queen effect
it could be that the rest of the team is performing worse and the single learner is getting locally rewarded for better performance.
as a learner improves with respect to the reward it obtains, is the entire team improving as well?
Loss of gradient
Related to laziness and transfers over to cooperative scenarios.
when one learner dominates the other. There is no further learning, since neither learner can get enough reward to adapt.
"Host-parasite" coevolution
1 target learner adapts to test cases, while a secondary competitive learner adjusts the test cases.
competitive concurrent learning is better at wider problem test cases.
The theory that competition pushes players to innovate.
as agents are pitted against each other, they are forced to improve to win.
WoLF: Win or Learn Fast
to large values when losing.
small when winning
varies the learning rate from:
Correlated-Q algo
based on "correlated algorithm" solution concept
Friend-or-Foe Q-Learning
or an adversarial one for competitive agents.
either find a coordination equilibrium for cooperative agents
EXORL algo
Learns optimal response to adaptive and fixed policies of other agent.
Bowling, Hu and Wellman distributed RL algo
Nagayuki et al present an approach where agents approximate other agent's policies rather than their q-values.
Agents also learn q-values for all other agents.
Q-Learning has been applied to extensive-form games.
Normal form: players act simultaneously.
Extensive form: Players alternate in performing actions
General sum games are non-zero sum games. This means that while players compete for score, the total score doesn't necessarily sum to 0.
At the same time, they are also not fully cooperative in the sense that players gain score for themselves, not everyone in the game.
Cooperative CoEvolutionary Algorithms (CCEAs)
Agents tend to co-adapt to other agents (including poor collaborators), rather than trying to optimise their performance, by choosing the ideal collaborator.
Solution to both split problems into subproblems, and concurrently search for solutions to the subproblems.
Peshkin et al. distributed RL approach to partially observable fully cooperative stochastic games.
Their algo converges to local optima, which are not necessarily nash equilibria.
POMDPs: Partially Observable MDPs are used to extend stochastic games by limiting an agent's ability to observe the state of the environment.
Optimal Adaptive Learning Algo
Guaranteed to converge to Nash Equilibria
This converges to Nash equilibria, it's been proved.
Collaboration may not be achieved, and agents reach a nash equilibrium whose outcome is inferior to collaboration among all agents.
Not necessarily the optimal solution.
Scenarios where global reward is assigned. Equally divide reward among all agents.
Vidal and Durfee
by combining this info you can approximate the error in the agents' decision function during learning.
another system that uses parameters
rate at which other agents are learning
learning and retention rates
rate of behaviour change
Evolutionary Game Theory
has been used to:
study trajectory of concurrent Q-Learning processes.
visualise basins of attraction to Nash Equilibria
study cooperative evolution
How does it work??
One approach: treat the other learners as part of a dynamic enviornment.
BUT: That would mean that as a learner adapts to other learner's adaptations, they'd be changing the environment in a way that makes the original learner's adaptation invalid.
The learners' adaptations would include obsolete assumptions on other agent's behaviour, since the latter would have changed as well.
ie: values of action vary with time, without having taken that action
environment that changes while training due to other agents.
breaking the learning process into smaller chunks can allow for more flexible use of computational resources.
differing techniques, and optimizations.
If the problem can be broken down into sub-problems, then the overall search space can be broken down too, and assigned to different learners, hence reducing the search space and computational complexity.
a grouping technique to automatically discover the optimum number of groups and their compositions.
So one learner decides to use the split up teams of agents as opposed to not.
it seems different teams would end specialising in differing tasks.
homogenous
heterogenous
all resources need be available in single place where training/computation is performed...
applies poorly to space?
Comparing how the formed heterogenous teams perform without the specialised agents
reports of appearances of squads and 'subspecies' that I guess are specialised.
a single population is evolved, but agents are tested by pairing them with individuals chosen at random to form a heterogenous team.
don't really get it.
They suggest this to be due to deadlocks caused by identical behaviour between agents.
heterogeneity can help despite the rather homogenous domain.
all predator could be homogenous
all prey could be homogeneous
Restricted breeding works best in situations where agents are grouped by teams that end up specialising.
Cross breeding between teams, but only agents that are similarly specialised.
I infer that to promote specialisation, this allows the breeding of agents more and more specialised in one direction.
agents can only breed with similar agents from other teams.
make teams of agents as differing candidate solutions to the problem.
CAs have many of the hallmarks of MAS.
and a state-update rule, that dictates agent behaviour.
so all agents have the same behaviour: homogenous.
each with their own internal state
consists of a neighbourhood of agents.
how to parallelise the learning of these sub-behaviours
how multi-agent sub-behaviours might interact
Not much has been done to decompose tasks into sub-behaviours for multiple agent systems
Fitness switching
Adaptively changes the reward to emphasise those behaviours that have made the least progress.
Shaping
gradually changes the reward function to favour more complex behaviour.
Layered Learning
then learn increasingly complex behaviours on top
learn basic behaviours first
ie as opposed to turn right, go straight and turn left to reach location A Go to location A.
rational agents may learn optima if they trust their teammates share a common goal of achieving better team behaviour.
This "rational" movement towards nash equilibria can be a movement away from optima.
Rational because learners prefer to be safe at nash equilibria instead of risking a worse reward by chasing an overall optimal solution.
How can agents learn in an environment where "the goalposts keep moving"?
size of the network of interactions.
number of agents
number and complexity of behaviours
Body, moves or location used to convey meaning between agents.
Can agents learn to emit pheromones in rational locations?
AntFarm
successfully applied to forage for food around obstacles.
a system that combines EC and pheromone communication.
EC approach to tune agent policy involving multiple digital pheromones.
RL algos adopting a fixed pheromone laying procedure and use the amounts to infer sensor information.
Typical for foraging problems
Like insects that emit locally detectable chemicals, which can be used to infer a meaning. Communication can happen in this way.
implicit transfer of information as the environment is modified by agents.
think of leaving footsteps on the ground or arranging objects in the environment intentionally to communicate
communication that occurs inevitably as the agent changes the environment
in this situation, Ackley and Littman have shown that rewarding agents for communicating is not necessary.
I guess communication is just that valuable to increasing reward.
agents learn to communicate anyway.
Steels has reported emergence of coherent lexicons between agents.
A communication channel is provided, but its purpose is not specified. It is up to the agents to define it, through learning.
Berenji and Vengerov suggest that simultaneous update of a central policy reduces convergence to sub-optimal behaviours.
This is communication: learners are teaching each other learned information.
each agent contributes to it, as separate learners.
sharing knowledge related to agents' current policies, for cooperation.
(state, action, utility)
an agent can share information about past experiences in the form of episodes (state, action, reward) that others may not have experienced yet.
an agent can inform others of its current state
Intentional, decided upon communication between agents.
derivative to figure out in which direction to climb the hill?
a neural network can figure it out?
returns a part of an agent's optimal policy at some equilibrium point. ???
gives an agent's expected long-term reward under some equilibrium, while considering other agent's interests.
each agent must evaluate the situation of the game by considering other agents' interests, as represented by their Q-functions.
Actions are taken simultaneously by players
normal-form games are games that can be represented in a matrix. ie player A's actions in columns, player B's in rows, and the outcome is based on the A's and B's actions.
Coevolutionary Algorithms (CEAs)
Cooperative coevolution:
succeed together or fail together.
Competitive coevolution:
benefit at the expense of other individuals
fitness then becomes context-sensitive and subjective.
fitness is based on interaction with other individuals
More than one population of individuals. They interact with individuals from the other populations.
examples
genetic programming
evolving actual computer programs
evolution strategies
genetic algorithms
Generally algos work like this: Generations.
Keep going until a good enough individual is found.
Fittest members are chosen, bred and produce 'children' that are then added to population replacing worse individuals.
maybe some random mutation can be introduced
Population is then evaluated according to a 'fitness' function.
function that evaluates how good they are at solving the problem.
Begin with a random population of 'individuals' (candidate solutions.)
Applies Darwinian models of evolution to refine populations of candidate solutions to a given problem.
Not really clear on pros and cons compared to RL.
search the space of possible solutions without the use of value functions.
directly learn behaviours without using value functions.
RL methods have theoretical proofs of convergence, but these usually don't hold in MAS.
Decision-making
Challenges
Consider other agents' choices in the environment.
Not only consider immediate value in a decision, but also its impact on future choices
Apart from converting data into knowledge, decisions need to be taken based on this knowledge.
2 main methods:
Policy-gradient
Parametrise the policy directly, therefore approximating it directly.
Value-based
Compute the expected return at a state or (state, action) pair and then the policy chooses based on this, at any state.
reinforcement information is provided after actions are taken in environment.
Through these representations, the software can train itself to perform new tasks.
Can find representations in high-dimension data.
ie they transform input data into representations, effectively transforming data into actionable knowledge.
the state depends on stochastic function of the old state and the player's interaction.
Eg: Chess?
At any point the game is in a state
extensions of repeated games.
the game consists in a series of interactions between players.
agents may not know everything about the world that other agents know
these interact with one another
There is more than 1 agent