-
RL with Q-Learning: Using Q-Learning (Reinforcement Learning) to Solve a Maze - Training an Agent to Reach the Treasure in a Simple Maze
2018-10-21 20:49:39
Contents
Output
Design approach
Implementation
from __future__ import print_function
import numpy as np
import time
from env import Env

EPSILON = 0.1
ALPHA = 0.1
GAMMA = 0.9
MAX_STEP = 30

np.random.seed(0)

def epsilon_greedy(Q, state):
    # Epsilon-greedy action selection: explore at random with probability EPSILON,
    # or whenever the row for this state is still all zeros
    if (np.random.uniform() > 1 - EPSILON) or ((Q[state, :] == 0).all()):
        action = np.random.randint(0, 4)  # 0~3
    else:
        action = Q[state, :].argmax()
    return action

e = Env()
Q = np.zeros((e.state_num, 4))

for i in range(200):
    e = Env()
    while (e.is_end is False) and (e.step < MAX_STEP):
        action = epsilon_greedy(Q, e.present_state)
        state = e.present_state
        reward = e.interact(action)
        new_state = e.present_state
        # Q-learning update
        Q[state, action] = (1 - ALPHA) * Q[state, action] + \
            ALPHA * (reward + GAMMA * Q[new_state, :].max())
        e.print_map()
        time.sleep(0.1)
    print('Episode:', i, 'Total Step:', e.step, 'Total Reward:', e.total_reward)
    time.sleep(2)
Full test log
(The console animation redraws the maze after every step ('.' for walls, 'A' for the agent, 'o' for the treasure); only the per-episode summary lines are kept here.)

Episode:0 Total Step:17, Total Reward:100
……
Episode:48 Total Step:6, Total Reward:100
Episode:49 Total Step:6, Total Reward:100
……
Episode:73 Total Step:8, Total Reward:100
……
Episode:95 Total Step:10, Total Reward:100
Episode:96 Total Step:6, Total Reward:100
Episode:97 Total Step:8, Total Reward:100
Episode:98 Total Step:6, Total Reward:100
Episode:99 Total Step:6, Total Reward:100

F:\AI\DL21TF\DL21examples\chapter_18>
-
RL with Q-Learning: Using Q-Learning (Reinforcement Learning) to Solve a Maze - Training an Agent to Reach the Treasure in a Complex Maze
2018-10-21 21:25:17
Contents
Output
Design approach
Implementation
from __future__ import print_function
import numpy as np
import time
from env import Env
from reprint import output

EPSILON = 0.1
ALPHA = 0.1
GAMMA = 0.9
MAX_STEP = 30

np.random.seed(0)

def epsilon_greedy(Q, state):
    # Epsilon-greedy action selection, as in the simple-maze version
    if (np.random.uniform() > 1 - EPSILON) or ((Q[state, :] == 0).all()):
        action = np.random.randint(0, 4)  # 0~3
    else:
        action = Q[state, :].argmax()
    return action

e = Env()
Q = np.zeros((e.state_num, 4))

# reprint's output() lets the map be redrawn in place in the terminal
with output(output_type="list", initial_len=len(e.map), interval=0) as output_list:
    for i in range(100):
        e = Env()
        while (e.is_end is False) and (e.step < MAX_STEP):
            action = epsilon_greedy(Q, e.present_state)
            state = e.present_state
            reward = e.interact(action)
            new_state = e.present_state
            # Q-learning update
            Q[state, action] = (1 - ALPHA) * Q[state, action] + \
                ALPHA * (reward + GAMMA * Q[new_state, :].max())
            e.print_map_with_reprint(output_list)
            time.sleep(0.1)
        for line_num in range(len(e.map)):
            if line_num == 0:
                output_list[0] = 'Episode:{} Total Step:{}, Total Reward:{}'.format(
                    i, e.step, e.total_reward)
            else:
                output_list[line_num] = ''
        time.sleep(2)
Full test log
(Same console animation as before, with 'x' cells added to the map; only the per-episode summary lines are kept here.)

Episode:0 Total Step:17, Total Reward:100
……
Episode:2 Total Step:30, Total Reward:-5
……
Episode:98 Total Step:8, Total Reward:100
Episode:99 Total Step:11, Total Reward:100
-
Hands-On Q-Learning for Maze Solving
2018-02-25 19:12:16

1. Building the Environment
We build a simple maze environment: the red block is the starting position, black cells mean failure, and the yellow cell means success. Through repeated exploration and learning, the red block should gradually learn to walk to the yellow cell.
# Initialize the maze
def _build_maze(self):
    h = self.MAZE_H * self.UNIT
    w = self.MAZE_W * self.UNIT
    # Initialize the canvas
    self.canvas = tk.Canvas(self, bg='white', height=h, width=w)
    # Draw the grid lines
    for c in range(0, w, self.UNIT):
        self.canvas.create_line(c, 0, c, h)
    for r in range(0, h, self.UNIT):
        self.canvas.create_line(0, r, w, r)
    # Traps
    self.hells = [self._draw_rect(3, 2, 'black'),
                  self._draw_rect(3, 3, 'black'),
                  self._draw_rect(3, 4, 'black'),
                  self._draw_rect(3, 5, 'black'),
                  self._draw_rect(4, 1, 'black'),
                  self._draw_rect(4, 5, 'black'),
                  self._draw_rect(1, 0, 'black'),
                  self._draw_rect(1, 1, 'black'),
                  self._draw_rect(1, 2, 'black'),
                  self._draw_rect(1, 3, 'black'),
                  self._draw_rect(1, 4, 'black')]
    self.hell_coords = []
    for hell in self.hells:
        self.hell_coords.append(self.canvas.coords(hell))
    # Reward (treasure)
    self.oval = self._draw_rect(4, 5, 'yellow')
    # Player
    self.rect = self._draw_rect(0, 0, 'red')
    self.canvas.pack()
Next comes the movement logic. Moving "up / down / left / right" leads to different outcomes: stepping on a black cell gives a penalty of -1 and ends the episode, stepping on the yellow cell gives a reward of +1 and ends the episode, and in every case the step returns the reward (or penalty) produced by the chosen move.
def step(self, action):
    s = self.canvas.coords(self.rect)
    base_action = np.array([0, 0])
    if action == 0:    # up
        if s[1] > self.UNIT:
            base_action[1] -= self.UNIT
    elif action == 1:  # down
        if s[1] < (self.MAZE_H - 1) * self.UNIT:
            base_action[1] += self.UNIT
    elif action == 2:  # right
        if s[0] < (self.MAZE_W - 1) * self.UNIT:
            base_action[0] += self.UNIT
    elif action == 3:  # left
        if s[0] > self.UNIT:
            base_action[0] -= self.UNIT
    # Move the red block according to the chosen action
    self.canvas.move(self.rect, base_action[0], base_action[1])
    s_ = self.canvas.coords(self.rect)
    # Check whether a reward or penalty was reached
    done = False
    if s_ == self.canvas.coords(self.oval):
        reward = 1
        done = True
    elif s_ in self.hell_coords:
        reward = -1
        done = True
    # elif base_action.sum() == 0:
    #     reward = -1
    else:
        reward = 0
    self.old_s = s
    return s_, reward, done
2. Implementing Q-Learning
The idea behind Q-Learning is simple: a Q-table records, for every state, a value (weight) for each possible action, and these values are updated continuously from experience, i.e. from the rewards and penalties collected while exploring.
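As a rough sketch of what that table and its update look like (a toy, dictionary-based version with illustrative names, not the pandas implementation used in this article, which follows below):

from collections import defaultdict

ACTIONS = [0, 1, 2, 3]      # up, down, right, left, same coding as the maze env
ALPHA, GAMMA = 0.01, 0.9    # learning rate and discount factor, as used later

# One row of action-values per state, created lazily on first visit
Q = defaultdict(lambda: [0.0] * len(ACTIONS))

def update(s, a, r, s_next):
    # Tabular Q-learning: nudge Q(s, a) toward r + GAMMA * max_a' Q(s', a')
    target = r + GAMMA * max(Q[s_next])
    Q[s][a] += ALPHA * (target - Q[s][a])

The pandas version below does exactly this, just with a DataFrame indexed by the string form of the state.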
Action selection reads the Q-table and picks the highest-valued action for the current state; to keep some exploration, a fixed fraction of the choices is made completely at random.
# Choose an action
def choose_action(self, s):
    self.check_state_exist(s)
    if np.random.uniform() < self.e_greedy:
        state_action = self.q_table.ix[s, :]
        # Shuffle the columns so ties are not always broken in favor of the first column
        state_action = state_action.reindex(
            np.random.permutation(state_action.index))
        action = state_action.argmax()
    else:
        action = np.random.choice(self.actions)
    return action
The other piece records the current state, the action taken, and the reward or penalty it produced, and writes the result back into the Q-table with the core update below.
# Update the Q-table
def rl(self, s, a, r, s_):
    self.check_state_exist(s_)
    q_predict = self.q_table.ix[s, a]  # Q estimate
    if s_ != 'terminal':
        q_target = r + self.reward_decay * self.q_table.ix[s_, :].max()  # Q target
    else:
        q_target = r
    self.q_table.ix[s, a] += self.learning_rate * (q_target - q_predict)
3. Training
Each training step does the following:
1. From the current state, choose the next move.
2. Execute the move and observe the resulting state.
3. Write the value computed by the update rule into the Q-table.
def update():
    for episode in range(100):
        s = env.reset()
        while True:
            env.render()
            # Choose an action
            action = RL.choose_action(str(s))
            # Execute it and get the feedback (next state s_, reward r, done flag)
            s_, r, done = env.step(action)
            # Update the Q-table
            RL.rl(str(s), action, r, str(s_))
            s = s_
            if done:
                print(episode)
                break
In my run the agent collected its first reward around episode 25, and by roughly episode 50 it was essentially following the shortest path every time.
4. Reading the Q-Table
The Q-table here uses the actions as columns and one row per state. Right after initialization it looks like this:
Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []
0 1 2 3
[5.0, 5.0, 35.0, 35.0] 0.0 0.0 0.0 0.0
Here [5.0, 5.0, 35.0, 35.0] is the canvas coordinate of the player rectangle, i.e. our current state (position) in the maze.
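Since each cell is UNIT = 40 pixels wide and the rectangles are drawn with a 5-pixel margin, those canvas coordinates can be mapped back to a grid cell. A small illustrative helper (hypothetical, not part of the original code):

UNIT = 40  # cell size in pixels, matching the Maze class

def coords_to_cell(coords, unit=UNIT):
    # Map a canvas rectangle [x1, y1, x2, y2] back to a (col, row) grid cell
    x1, y1, _, _ = coords
    return int(x1 // unit), int(y1 // unit)

print(coords_to_cell([5.0, 5.0, 35.0, 35.0]))  # (0, 0), the start cell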
Now we take a random step. The outcome (the reward) of that step is written back into the row for the current state. Because the treasure is far away and there are many traps, stepping toward a trap produces a negative entry, for example:
0 1 2 3
[5.0, 5.0, 35.0, 35.0] 0.0 0.0 -0.01 0.0
[5.0, 45.0, 35.0, 75.0] 0.0 0.0 -0.01 0.0
[45.0, 5.0, 75.0, 35.0] 0.0 0.0 0.00 0.0
[5.0, 85.0, 35.0, 115.0] 0.0 0.0 -0.01 0.0
[45.0, 85.0, 75.0, 115.0] 0.0 0.0 0.00 0.0
[45.0, 45.0, 75.0, 75.0] 0.0 0.0 0.00 0.0
[5.0, 125.0, 35.0, 155.0] 0.0 0.0 -0.01 0.0
[45.0, 125.0, 75.0, 155.0] 0.0 0.0 0.00 0.0
[5.0, 165.0, 35.0, 195.0] 0.0 0.0 -0.01 0.0

The negative entries mark actions that run into a trap from that state.
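Those -0.01 values fall straight out of the update rule with the settings used here (learning rate 0.01, discount 0.9, trap reward -1, and an all-zero row for the next state), for example:

alpha, gamma = 0.01, 0.9
q_sa, reward, max_next = 0.0, -1, 0.0   # fresh entry, trap penalty, untouched next state
q_sa += alpha * (reward + gamma * max_next - q_sa)
print(q_sa)   # -0.01; a second visit to the same state-action pair gives -0.0199

As the trial and error continues the table keeps filling in, and eventually the agent collects its first reward (around episode 30):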
('----------------', '[205.0, 205.0, 235.0, 235.0]', '------------------', 3, '---------------', 1, '----------------', '[165.0, 205.0, 195.0, 235.0]')
0 1 2 3
[5.0, 5.0, 35.0, 35.0] 0.00 0.00 -0.039404 0.000000
[5.0, 45.0, 35.0, 75.0] 0.00 0.00 -0.019900 0.000000
[45.0, 45.0, 75.0, 75.0] 0.00 0.00 0.000000 0.000000
[5.0, 85.0, 35.0, 115.0] 0.00 0.00 -0.019900 0.000000
[45.0, 85.0, 75.0, 115.0] 0.00 0.00 0.000000 0.000000
[45.0, 5.0, 75.0, 35.0] 0.00 0.00 0.000000 0.000000
[5.0, 125.0, 35.0, 155.0] 0.00 0.00 -0.010000 0.000000
[45.0, 125.0, 75.0, 155.0] 0.00 0.00 0.000000 0.000000
[5.0, 165.0, 35.0, 195.0] 0.00 0.00 -0.010000 0.000000
[5.0, 205.0, 35.0, 235.0] 0.00 0.00 0.000000 0.000000
[45.0, 205.0, 75.0, 235.0] -0.01 0.00 0.000000 0.000000
[45.0, 165.0, 75.0, 195.0] 0.00 0.00 0.000000 0.000000
[85.0, 205.0, 115.0, 235.0] 0.00 0.00 -0.010000 0.000000
[125.0, 205.0, 155.0, 235.0] 0.00 0.00 0.000000 0.000000
[85.0, 165.0, 115.0, 195.0] 0.00 0.00 -0.010000 -0.010000
[85.0, 125.0, 115.0, 155.0] 0.00 0.00 -0.010000 -0.010000
[125.0, 125.0, 155.0, 155.0] 0.00 0.00 0.000000 0.000000
[85.0, 85.0, 115.0, 115.0] 0.00 0.00 -0.010000 -0.010000
[125.0, 165.0, 155.0, 195.0] 0.00 0.00 0.000000 0.000000
[85.0, 45.0, 115.0, 75.0] 0.00 0.00 0.000000 -0.029701
[85.0, 5.0, 115.0, 35.0] 0.00 0.00 0.000000 -0.010000
[125.0, 85.0, 155.0, 115.0] 0.00 0.00 0.000000 0.000000
[125.0, 45.0, 155.0, 75.0] 0.00 -0.01 -0.010000 0.000000
[125.0, 5.0, 155.0, 35.0] 0.00 0.00 0.000000 0.000000
[165.0, 5.0, 195.0, 35.0] 0.00 -0.01 0.000000 0.000000
[205.0, 5.0, 235.0, 35.0] 0.00 0.00 0.000000 0.000000
[205.0, 45.0, 235.0, 75.0] 0.00 0.00 0.000000 -0.010000
[165.0, 45.0, 195.0, 75.0] 0.00 0.00 0.000000 0.000000
[205.0, 85.0, 235.0, 115.0] 0.00 0.00 0.000000 0.000000
[205.0, 125.0, 235.0, 155.0] 0.00 0.00 0.000000 0.000000
[165.0, 85.0, 195.0, 115.0] -0.01 0.00 0.000000 -0.010000
[165.0, 125.0, 195.0, 155.0] 0.00 0.00 0.000000 -0.010000
[205.0, 165.0, 235.0, 195.0] 0.00 0.00 0.000000 0.000000
[165.0, 165.0, 195.0, 195.0] 0.00 0.00 0.000000 -0.010000
[205.0, 205.0, 235.0, 235.0] 0.00 0.00 0.000000 0.010000
[165.0, 205.0, 195.0, 235.0] 0.00 0.00 0.000000 0.000000

This episode is the watershed: from here on, every additional run propagates that reward backwards, gradually updating the values along the optimal path. Later in training the table looks like the one below, and the agent can essentially reach the goal in a single attempt. This is my Q-table after 99 episodes:
0 1 2 \
[5.0, 5.0, 35.0, 35.0] 0.000000e+00 7.215152e-30 -5.851985e-02
[45.0, 5.0, 75.0, 35.0] 0.000000e+00 0.000000e+00 0.000000e+00
[5.0, 45.0, 35.0, 75.0] 0.000000e+00 4.987220e-28 -3.940399e-02
[45.0, 45.0, 75.0, 75.0] 0.000000e+00 0.000000e+00 0.000000e+00
[5.0, 85.0, 35.0, 115.0] 0.000000e+00 3.331252e-26 -1.000000e-02
[45.0, 85.0, 75.0, 115.0] 0.000000e+00 0.000000e+00 0.000000e+00
[5.0, 125.0, 35.0, 155.0] 0.000000e+00 2.139487e-24 -4.900995e-02
[5.0, 165.0, 35.0, 195.0] 0.000000e+00 1.288811e-22 -4.900995e-02
[45.0, 125.0, 75.0, 155.0] 0.000000e+00 0.000000e+00 0.000000e+00
[45.0, 165.0, 75.0, 195.0] 0.000000e+00 0.000000e+00 0.000000e+00
[5.0, 205.0, 35.0, 235.0] 0.000000e+00 0.000000e+00 7.144125e-21
[45.0, 205.0, 75.0, 235.0] -2.970100e-02 0.000000e+00 3.617735e-19
[85.0, 205.0, 115.0, 235.0] 1.668205e-17 0.000000e+00 -3.940399e-02
[85.0, 165.0, 115.0, 195.0] 6.976738e-16 9.453308e-30 -1.990000e-02
[125.0, 205.0, 155.0, 235.0] 0.000000e+00 0.000000e+00 0.000000e+00
[85.0, 125.0, 115.0, 155.0] 2.633329e-14 0.000000e+00 -2.970100e-02
[85.0, 85.0, 115.0, 115.0] 8.916751e-13 0.000000e+00 -1.000000e-02
[85.0, 45.0, 115.0, 75.0] 0.000000e+00 0.000000e+00 2.689670e-11
[125.0, 165.0, 155.0, 195.0] 0.000000e+00 0.000000e+00 0.000000e+00
[125.0, 125.0, 155.0, 155.0] 0.000000e+00 0.000000e+00 0.000000e+00
[85.0, 5.0, 115.0, 35.0] 0.000000e+00 0.000000e+00 0.000000e+00
[125.0, 85.0, 155.0, 115.0] 0.000000e+00 0.000000e+00 0.000000e+00
[125.0, 45.0, 155.0, 75.0] 7.171044e-10 -1.000000e-02 -1.000000e-02
[125.0, 5.0, 155.0, 35.0] 0.000000e+00 0.000000e+00 1.676346e-08
[165.0, 5.0, 195.0, 35.0] 1.396498e-13 -1.000000e-02 3.409809e-07
[165.0, 45.0, 195.0, 75.0] 0.000000e+00 0.000000e+00 0.000000e+00
[205.0, 5.0, 235.0, 35.0] 0.000000e+00 5.989715e-06 3.319001e-13
[205.0, 45.0, 235.0, 75.0] 0.000000e+00 8.988374e-05 2.131374e-08
[205.0, 85.0, 235.0, 115.0] 0.000000e+00 1.126386e-03 0.000000e+00
[165.0, 85.0, 195.0, 115.0] -1.000000e-02 2.585824e-04 0.000000e+00
[205.0, 125.0, 235.0, 155.0] 0.000000e+00 0.000000e+00 0.000000e+00
[205.0, 165.0, 235.0, 195.0] 1.354722e-07 0.000000e+00 0.000000e+00
[205.0, 205.0, 235.0, 235.0] 0.000000e+00 0.000000e+00 0.000000e+00
[165.0, 165.0, 195.0, 195.0] 4.077995e-04 3.949939e-01 0.000000e+00
[165.0, 125.0, 195.0, 155.0] 0.000000e+00 8.241680e-02 1.012101e-04
[165.0, 205.0, 195.0, 235.0] 0.000000e+00 0.000000e+00 0.000000e+00
3
[5.0, 5.0, 35.0, 35.0] 0.000000e+00
[45.0, 5.0, 75.0, 35.0] 0.000000e+00
[5.0, 45.0, 35.0, 75.0] 0.000000e+00
[45.0, 45.0, 75.0, 75.0] 0.000000e+00
[5.0, 85.0, 35.0, 115.0] 0.000000e+00
[45.0, 85.0, 75.0, 115.0] 0.000000e+00
[5.0, 125.0, 35.0, 155.0] 0.000000e+00
[5.0, 165.0, 35.0, 195.0] 0.000000e+00
[45.0, 125.0, 75.0, 155.0] 0.000000e+00
[45.0, 165.0, 75.0, 195.0] 0.000000e+00
[5.0, 205.0, 35.0, 235.0] 0.000000e+00
[45.0, 205.0, 75.0, 235.0] 0.000000e+00
[85.0, 205.0, 115.0, 235.0] 0.000000e+00
[85.0, 165.0, 115.0, 195.0] -1.990000e-02
[125.0, 205.0, 155.0, 235.0] 0.000000e+00
[85.0, 125.0, 115.0, 155.0] -1.000000e-02
[85.0, 85.0, 115.0, 115.0] -1.000000e-02
[85.0, 45.0, 115.0, 75.0] -1.990000e-02
[125.0, 165.0, 155.0, 195.0] 0.000000e+00
[125.0, 125.0, 155.0, 155.0] 0.000000e+00
[85.0, 5.0, 115.0, 35.0] -1.990000e-02
[125.0, 85.0, 155.0, 115.0] 0.000000e+00
[125.0, 45.0, 155.0, 75.0] 0.000000e+00
[125.0, 5.0, 155.0, 35.0] 0.000000e+00
[165.0, 5.0, 195.0, 35.0] 0.000000e+00
[165.0, 45.0, 195.0, 75.0] 0.000000e+00
[205.0, 5.0, 235.0, 35.0] 0.000000e+00
[205.0, 45.0, 235.0, 75.0] -1.000000e-02
[205.0, 85.0, 235.0, 115.0] 7.290000e-09
[165.0, 85.0, 195.0, 115.0] -1.000000e-02
[205.0, 125.0, 235.0, 155.0] 1.185054e-02
[205.0, 165.0, 235.0, 195.0] 0.000000e+00
[205.0, 205.0, 235.0, 235.0] 0.000000e+00
[165.0, 165.0, 195.0, 195.0] -1.000000e-02
[165.0, 125.0, 195.0, 155.0] -1.000000e-02
[165.0, 205.0, 195.0, 235.0] 0.000000e+00

5. Complete Code
1. The environment
# coding: utf-8
import sys
import time
import numpy as np

if sys.version_info.major == 2:
    import Tkinter as tk
else:
    import tkinter as tk


class Maze(tk.Tk, object):
    UNIT = 40   # pixels per cell
    MAZE_H = 6  # grid height
    MAZE_W = 6  # grid width

    def __init__(self):
        super(Maze, self).__init__()
        self.action_space = ['U', 'D', 'L', 'R']
        self.n_actions = len(self.action_space)
        self.title('Maze')
        # Window size
        self.geometry('{0}x{1}'.format(self.MAZE_H * self.UNIT, self.MAZE_W * self.UNIT))
        self._build_maze()

    # Draw a rectangle
    # x, y: grid coordinates; color: fill color
    def _draw_rect(self, x, y, color):
        center = self.UNIT / 2
        w = center - 5
        x_ = self.UNIT * x + center
        y_ = self.UNIT * y + center
        return self.canvas.create_rectangle(x_ - w, y_ - w, x_ + w, y_ + w, fill=color)

    # Initialize the maze
    def _build_maze(self):
        h = self.MAZE_H * self.UNIT
        w = self.MAZE_W * self.UNIT
        # Initialize the canvas
        self.canvas = tk.Canvas(self, bg='white', height=h, width=w)
        # Draw the grid lines
        for c in range(0, w, self.UNIT):
            self.canvas.create_line(c, 0, c, h)
        for r in range(0, h, self.UNIT):
            self.canvas.create_line(0, r, w, r)
        # Traps
        self.hells = [self._draw_rect(3, 2, 'black'),
                      self._draw_rect(3, 3, 'black'),
                      self._draw_rect(3, 4, 'black'),
                      self._draw_rect(3, 5, 'black'),
                      self._draw_rect(4, 1, 'black'),
                      self._draw_rect(4, 5, 'black'),
                      self._draw_rect(1, 0, 'black'),
                      self._draw_rect(1, 1, 'black'),
                      self._draw_rect(1, 2, 'black'),
                      self._draw_rect(1, 3, 'black'),
                      self._draw_rect(1, 4, 'black')]
        self.hell_coords = []
        for hell in self.hells:
            self.hell_coords.append(self.canvas.coords(hell))
        # Reward (treasure)
        self.oval = self._draw_rect(4, 5, 'yellow')
        # Player
        self.rect = self._draw_rect(0, 0, 'red')
        self.canvas.pack()

    # Reset the environment
    def reset(self):
        self.update()
        time.sleep(0.5)
        self.canvas.delete(self.rect)
        self.rect = self._draw_rect(0, 0, 'red')
        self.old_s = None
        # Return the player's rectangle coordinates, e.g. [5.0, 5.0, 35.0, 35.0]
        return self.canvas.coords(self.rect)

    # Take one step
    def step(self, action):
        s = self.canvas.coords(self.rect)
        base_action = np.array([0, 0])
        if action == 0:    # up
            if s[1] > self.UNIT:
                base_action[1] -= self.UNIT
        elif action == 1:  # down
            if s[1] < (self.MAZE_H - 1) * self.UNIT:
                base_action[1] += self.UNIT
        elif action == 2:  # right
            if s[0] < (self.MAZE_W - 1) * self.UNIT:
                base_action[0] += self.UNIT
        elif action == 3:  # left
            if s[0] > self.UNIT:
                base_action[0] -= self.UNIT
        # Move the red block according to the chosen action
        self.canvas.move(self.rect, base_action[0], base_action[1])
        s_ = self.canvas.coords(self.rect)
        # Check whether a reward or penalty was reached
        done = False
        if s_ == self.canvas.coords(self.oval):
            reward = 1
            done = True
        elif s_ in self.hell_coords:
            reward = -1
            done = True
        # elif base_action.sum() == 0:
        #     reward = -1
        else:
            reward = 0
        self.old_s = s
        return s_, reward, done

    def render(self):
        time.sleep(0.01)
        self.update()
2. Q-Learning
# coding: utf-8
import pandas as pd
import numpy as np


class q_learning_model_maze:
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.99):
        self.actions = actions
        self.learning_rate = learning_rate
        self.reward_decay = reward_decay
        self.e_greedy = e_greedy
        self.q_table = pd.DataFrame(columns=actions, dtype=np.float32)

    # Add the state to the Q-table if it has not been seen yet
    def check_state_exist(self, state):
        if state not in self.q_table.index:
            self.q_table = self.q_table.append(
                pd.Series(
                    [0] * len(self.actions),
                    index=self.q_table.columns,
                    name=state,
                )
            )

    # Choose an action
    def choose_action(self, s):
        self.check_state_exist(s)
        if np.random.uniform() < self.e_greedy:
            state_action = self.q_table.ix[s, :]
            # Shuffle the columns so ties are not always broken in favor of the first column
            state_action = state_action.reindex(
                np.random.permutation(state_action.index))
            action = state_action.argmax()
        else:
            action = np.random.choice(self.actions)
        return action

    # Update the Q-table
    def rl(self, s, a, r, s_):
        self.check_state_exist(s_)
        q_predict = self.q_table.ix[s, a]  # Q estimate
        if s_ != 'terminal':
            q_target = r + self.reward_decay * self.q_table.ix[s_, :].max()  # Q target
        else:
            q_target = r
        self.q_table.ix[s, a] += self.learning_rate * (q_target - q_predict)
3. Training
# coding: utf-8
from maze_env_1 import Maze
from q_learning_model_maze import q_learning_model_maze


def update():
    for episode in range(100):
        s = env.reset()
        while True:
            env.render()
            # Choose an action
            action = RL.choose_action(str(s))
            # Execute it and get the feedback (next state s_, reward r, done flag)
            s_, r, done = env.step(action)
            # Update the Q-table
            RL.rl(str(s), action, r, str(s_))
            s = s_
            if done:
                print(episode)
                break


if __name__ == "__main__":
    env = Maze()
    RL = q_learning_model_maze(actions=list(range(env.n_actions)))
    env.after(10, update)  # run update() after a 10 ms delay
    env.mainloop()
-
Reinforcement Learning: Solving a Maze with Q-Learning
2018-07-19 12:25:12

The Q-Learning Algorithm
The whole algorithm keeps updating the values in the Q-table and then uses the updated values to decide which action to take in each state. Q-learning is an off-policy algorithm: because the update takes the max over next actions, the Q-table can be updated from experience that is not being generated by the current behaviour, for example experience collected long ago or even someone else's experience.
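A quick way to see the off-policy point is to put the two classic update targets side by side (a sketch for contrast only, assuming Q is a NumPy array of shape (n_states, n_actions); SARSA is not part of this article's code):

import numpy as np

GAMMA = 0.9

def q_learning_target(Q, r, s_next):
    # Off-policy: bootstrap from the best next action, regardless of what is actually done next
    return r + GAMMA * Q[s_next].max()

def sarsa_target(Q, r, s_next, a_next):
    # On-policy: bootstrap from the action the behaviour policy actually takes next
    return r + GAMMA * Q[s_next, a_next]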
The Q function in Q-learning
- s: the current state
- a: the action taken from the current state
- s': the new state produced by this action
- a': the next action
- R: the reward for this action
- α: the learning rate, e.g. 0.01
- γ: the discount factor, i.e. how much immediate gain is traded away for long-term gain, e.g. 0.9
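Putting those symbols together, the update rule (the same one the learn() method below implements) is:

Q(s, a) <- Q(s, a) + α * [ R + γ * max_a' Q(s', a') - Q(s, a) ]

In the code this is exactly the line new_q_sa = q_sa + self.alpha * (reward + self.gamma * max_next_q_sa - q_sa).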
The maze to be solved is a 5×6 matrix in which 0 marks a walkable cell and 1 marks an obstacle.
Full implementation
The q_table produced by the code looks like this:
up down left right
(0, 0) -0.550747 -0.533564 -0.644566 -0.410420
(0, 1) -0.811724 -0.344330 -0.362692 -0.354689
(0, 2) -0.510908 -0.571715 -0.354768 -0.354741
(1, 1) -0.297905 -0.247055 -0.478024 -0.537521
(0, 3) -0.599642 -0.512899 -0.354843 -0.354771
(0, 4) -0.546996 -0.470504 -0.354866 -0.354824
(0, 5) -0.370004 -0.361741 -0.354866 -0.397040
(2, 1) -0.259938 -0.109431 -0.464743 -0.526687
(3, 1) -0.176143 -0.403094 -0.368366 0.076880
(3, 2) -0.369096 -0.115697 -0.109689 0.296391
(4, 2) -0.069825 -0.237857 -0.136630 -0.087706
(4, 3) -0.018432 -0.078908 -0.068174 -0.066634
(4, 4) -0.117762 -0.079410 -0.066807 -0.066656
(3, 3) 0.533487 -0.066857 -0.045965 -0.223937
(2, 3) -0.164942 0.020808 -0.152385 0.767553
(4, 5) -0.069677 -0.069658 -0.066724 -0.098813
(2, 4) -0.049835 -0.063313 0.059299 0.993430
(2, 5) 0.000000 0.000000 0.000000 0.000000
The q_table is a DataFrame: its index is the state (corresponding to an index into the maze matrix) and its columns are the actions.
First, run train():
import numpy as np
import pandas as pd
import random
import pickle
from sklearn.utils import shuffle

# The maze matrix
maze = np.array(
    [[0, 0, 0, 0, 0, 0, ],
     [1, 0, 1, 1, 1, 1, ],
     [1, 0, 1, 0, 0, 0, ],
     [1, 0, 0, 0, 1, 1, ],
     [0, 1, 0, 0, 0, 0, ]]
)
print(pd.DataFrame(maze))

# Start cell
start_state = (0, 0)
# Goal cell
target_state = (2, 5)
# Path of the file the trained q_table will be saved to
q_learning_table_path = 'q_learning_table.pkl'


class QLearningTable:
    def __init__(self, alpha=0.01, gamma=0.9):
        # alpha and gamma are the two parameters used by the Q-function update
        self.alpha = alpha
        self.gamma = gamma
        # Reward (and penalty) values
        self.reward_dict = {'reward_0': -1, 'reward_1': -0.1, 'reward_2': 1}
        # Actions
        self.actions = ('up', 'down', 'left', 'right')
        self.q_table = pd.DataFrame(columns=self.actions)

    def get_next_state_reward(self, current_state, action):
        """
        :param current_state: the current state
        :param action: the action to take
        :return: next_state, reward, done (whether the episode is over)
        """
        done = False
        if action == 'up':
            next_state = (current_state[0] - 1, current_state[1])
        elif action == 'down':
            next_state = (current_state[0] + 1, current_state[1])
        elif action == 'left':
            next_state = (current_state[0], current_state[1] - 1)
        else:
            next_state = (current_state[0], current_state[1] + 1)
        if next_state[0] < 0 or next_state[0] >= maze.shape[0] \
                or next_state[1] < 0 or next_state[1] >= maze.shape[1] \
                or maze[next_state[0], next_state[1]] == 1:
            # Out of bounds or an obstacle: stay where we are
            next_state = current_state
            reward = self.reward_dict.get('reward_0')
            # With done=True this would mean falling into a trap and ending the game;
            # with done=False it means wasting a step in place and taking a penalty,
            # but the game continues.
            # done = True
        elif next_state == target_state:
            # Reached the goal
            reward = self.reward_dict.get('reward_2')
            done = True
        else:
            # maze[next_state[0], next_state[1]] == 0
            reward = self.reward_dict.get('reward_1')
        return next_state, reward, done

    # Update q_table from the observed reward and next_state
    def learn(self, current_state, action, reward, next_state):
        self.check_state_exist(next_state)
        q_sa = self.q_table.loc[current_state, action]
        max_next_q_sa = self.q_table.loc[next_state, :].max()
        # Apply the Q-function update
        new_q_sa = q_sa + self.alpha * (reward + self.gamma * max_next_q_sa - q_sa)
        # Write it back into q_table
        self.q_table.loc[current_state, action] = new_q_sa

    # Add the state to q_table if it is not there yet
    def check_state_exist(self, state):
        if state not in self.q_table.index:
            self.q_table.loc[state] = pd.Series(np.zeros(len(self.actions)), index=self.actions)

    # Choose the action to execute
    def choose_action(self, state, random_num=0.8):
        series = pd.Series(self.q_table.loc[state])
        # With probability 0.8 act greedily; the remaining 0.2 picks a random action
        # to keep exploring: always taking the current best choice means paths that
        # were never explored get missed.
        if random.random() > random_num:
            action = random.choice(self.actions)
        else:
            # A pd.Series can have several entries tied at the maximum, and argmax()
            # only returns the first one, so shuffle the series first and pick the
            # max at random among the ties; taking the max-valued action helps the
            # q_table converge quickly.
            ss = shuffle(series)
            action = ss.argmax()
        return action


# Training
def train():
    q_learning_table = QLearningTable()
    # Number of episodes
    iterate_num = 500
    for _ in range(iterate_num):
        # Every episode starts from start_state
        current_state = start_state
        while True:
            # Make sure current_state is in q_table; note that states are stored as strings
            q_learning_table.check_state_exist(str(current_state))
            # Pick the action to take in the current state
            action = q_learning_table.choose_action(str(current_state))
            # From current_state and action, get next_state, reward and done
            next_state, reward, done = q_learning_table.get_next_state_reward(current_state, action)
            # Learn: update q_table
            q_learning_table.learn(str(current_state), action, reward, str(next_state))
            # If the episode is over, break out and start the next one
            if done:
                break
            # Otherwise move on to the next state
            current_state = next_state
    print('game over')
    # Save the q_learning_table object to q_learning_table_path
    with open(q_learning_table_path, 'wb') as pkl_file:
        pickle.dump(q_learning_table, pkl_file)
After train() finishes it writes a file, q_learning_table.pkl, containing the trained QLearningTable object.
Then run predict() below to test the model:
# Prediction
def predict():
    # Load the saved q_table
    with open(q_learning_table_path, 'rb') as pkl_file:
        q_learning_table = pickle.load(pkl_file)
    print('start_state:{}'.format(start_state))
    current_state = start_state
    step = 0
    while True:
        step = step + 1
        action = q_learning_table.choose_action(str(current_state), random_num=1)
        # The reward is not needed during prediction, hence the _
        next_state, _, done = q_learning_table.get_next_state_reward(current_state, action)
        # Print the action and the next state
        print('step:{step}, action: {action}, state: {state}'.format(step=step, action=action, state=next_state))
        # Stop when done or after more than 100 steps
        if done or step > 100:
            if next_state == target_state:
                print('success')
            else:
                print('fail')
            break
        # Otherwise move on to the next state
        else:
            current_state = next_state
Output:
start_state:(0, 0)
step:1, action: right, state: (0, 1)
step:2, action: down, state: (1, 1)
step:3, action: down, state: (2, 1)
step:4, action: down, state: (3, 1)
step:5, action: right, state: (3, 2)
step:6, action: right, state: (3, 3)
step:7, action: up, state: (2, 3)
step:8, action: right, state: (2, 4)
step:9, action: right, state: (2, 5)
success
-
Classic Q-learning maze code
2019-01-21 21:53:57 MATLAB code for the classic Q-learning maze. The scenario: a robot is inside a house and we want it to get out through door 5, starting from room 0.
-
Machine Learning - Q-Learning - gerbil maze video tutorial
2017-09-17 15:49:42 Uses a fun gerbil-in-a-maze game to teach the theory behind the Q-learning algorithm and help students hand-write a program that lets a machine "think", as a way into machine learning.
-
Reinforcement Learning Series (2) - a Q-learning maze example
2020-05-23 18:28:34 Revisits Q-learning through a "zoe walks the maze" example. The idea is the same as in part (1) of the series; the difference is that the program is organized into two classes, the maze environment Maze and zoe's brain QLearningTable, and the run function is broken into steps to make the Q-learning process clearer. Part 1. The maze environment...
-
Reinforcement learning: Q-Learning for a maze-solving robot
2020-02-15 18:05:21 Three parts of code: drawing the map, the Q-Learning logic, and the run script. On the map a yellow circle is the robot, a red square is a bomb [reward = -1], a green square is the treasure [reward = +1], and every other cell is flat ground...
-
A worked example of building an automatic maze-solving robot with Q-learning
2020-09-19 08:26:42 A detailed, code-driven walkthrough of building an automatic maze-solving robot with Q-learning; a useful reference for study or work.
-
An algorithm a day -> reinforcement learning -> Q_learning maze
2018-10-08 23:34:26 Shows how to solve a maze with Q_learning: the red block keeps trying different cells until it lands on a black cell (penalty of 1) or reaches the yellow cell (reward of 1). Every game updates the weights in the Q_table so that next time the red block can...
-
Video course - Machine Learning - Q-Learning - gerbil maze tutorial
2020-05-28 10:24:00 Machine Learning - Q-Learning - gerbil maze video tutorial, by an instructor with many years of IT experience...
-
Solving a maze in Python | Q-Learning | reinforcement learning
2020-06-16 15:31:08 The previous post covered the idea behind Q-Learning; based on that idea many fun features and small demos can be built. This post uses the Q-Learning algorithm to let the computer solve a maze, starting with a brief summary of the principle and a fairly high-level example...
-
Machine Learning - Q-Learning - gerbil maze - alan - video course
2017-09-18 12:29:45 Uses a fun gerbil-in-a-maze game to teach the Q-Learning algorithm and to hand-write a program that lets a machine "think".
-
Building an automatic maze-solving robot with Q-learning
2019-05-23 16:11:31 In this project you use a reinforcement learning algorithm to build an automatic maze-solving robot. The robot is shown in the top-right corner; the maze contains traps (red bombs) and a goal (the blue target point), and the robot has to avoid the traps and reach the goal as fast as possible...
-
[Morvan reinforcement learning] video notes (2) 3. Solving a maze with the Q_Learning algorithm
2020-07-15 23:03:29 Section 6: solving a maze with Q-learning. The visual interface looks like the one in the video: the red explorer has to reach the cell marked by the yellow circle, learning to walk the maze through reinforcement learning...
-
Q-Learning explained in my own words, with how it runs (maze solving) and code
2020-09-11 10:21:16 Too many articles online are translations of the original material or light reworkings of it; the basic theory is all the same and the core idea never gets explained. Here I explain my own understanding of Q-Learning, entirely original; I hope you read it through...