Power Allocation in Multiuser Cellular Networks With Deep Q Learning Approach
Abstract
Model-driven power allocation (PA) algorithms for wireless cellular networks with an interfering multiple-access channel (IMAC) have been investigated for decades. Data-driven, model-free, machine-learning-based approaches are now developing rapidly in this field, and among them deep reinforcement learning (DRL) has shown great promise. Unlike supervised learning, DRL uses exploration and exploitation to maximize the objective function under given constraints. In this paper, we propose a two-step training framework. First, through offline learning in a simulated environment, a deep Q network (DQN) is trained with a deep Q learning (DQL) algorithm that is designed to be consistent with the PA problem. Second, the DQN is further fine-tuned with real data in an online training procedure. Simulation results show that the proposed DQN achieves the highest average sum-rate compared with DQNs trained with existing DQL schemes. Across different user densities, our DQN outperforms the benchmark algorithms, which verifies its good generalization ability.
I Introduction
Data traffic in wireless communication networks has grown explosively in recent decades and will keep rising in the future. User density is increasing greatly, creating a critical demand for more capacity and spectral efficiency. Therefore, both intra-cell and inter-cell interference management are important for improving the overall capacity of a cellular network system. This paper studies the problem of maximizing a generic sum-rate, which is non-convex and NP-hard and cannot be solved efficiently.
Various model-driven algorithms have been proposed in the literature for PA problems, such as fractional programming (FP) [1], weighted MMSE (WMMSE) [2] and others [3, 4]. They perform excellently in theoretical analysis and numerical simulations, but face serious obstacles in practical deployments [5]. First, these techniques rely heavily on tractable mathematical models, which are imperfect in real communication scenarios with specific user distributions, geographical environments, etc. Second, the computational complexity of these algorithms is high.
In recent years, machine learning (ML)-based approaches have developed rapidly in wireless communications [6]. These algorithms are usually model-free and are well suited to optimization in practical communication scenarios. Additionally, with the development of graphics processing units (GPUs) and specialized chips, execution can be both fast and energy-efficient, which lays a solid foundation for large-scale applications.
Two main branches of ML, supervised learning and reinforcement learning (RL) [7], are briefly introduced here. In supervised learning, a deep neural network (DNN) is trained to approximate a given optimal (or suboptimal) target algorithm, and this has been realized in some applications [8, 9, 10]. However, the target algorithm is usually unavailable, and the performance of the DNN is bounded by that of the supervisor. RL has therefore received widespread attention, because it interacts with an unknown environment through exploration and exploitation. Q learning is the most well-studied RL algorithm, and it has been used for power allocation (PA) in [11, 12, 13] and elsewhere [14]. A DNN trained with Q learning is called a deep Q network (DQN), and it has been proposed for the distributed downlink single-user PA problem [15].
In this paper, we extend the work in [15] and investigate the PA problem in cellular networks whose cells serve multiple users. The design of the DQN model is discussed. Simulation results show that our DQN outperforms existing DQNs and the benchmark algorithms. The contributions of this work are summarized as follows:

A model-free two-step training framework is proposed. First, the DQN is trained offline with a DRL algorithm in simulated scenarios. Second, the learned DQN can be further optimized dynamically in real communication scenarios with the aid of transfer learning.

The PA problem using deep Q learning (DQL) is discussed, and a DQN-enabled approach is proposed in which the current sum-rate is used as the reward function, with no future reward included. The input features are designed to help the DQN get closer to the optimal solution.

After centralized training, the proposed DQN is tested with distributed execution. The average sum-rate of the DQN exceeds that of the model-driven algorithms, and the DQN also shows good generalization ability in a series of benchmark simulation tests.
The remainder of this paper is organized as follows. Section II outlines the PA problem in the wireless cellular network with IMAC. In Section III our proposed DQN is introduced in detail. Then, this DQN is tested in distinct scenarios, along with benchmark algorithms, and the simulation results are analyzed in Section IV. Conclusions and discussion are given in Section V.
II System Model
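The problem of PA in a cellular network with an interfering multiple-access channel (IMAC) is considered. In a system with $N$ cells, a base station (BS) at the center of each cell simultaneously serves $K$ users over shared frequency bands. A simple network example is shown in Fig. 1. At time slot $t$, the independent channel gain between the $n$-th BS and user $k$ in cell $j$ is denoted by $g^t_{n,j,k}$, and can be expressed as
$$g^t_{n,j,k} = |h^t_{n,j,k}|^2 \beta_{n,j,k} \tag{1}$$
where $h^t_{n,j,k}$ is the small-scale complex flat fading element, and $\beta_{n,j,k}$ is the large-scale fading component accounting for both geometric attenuation and shadow fading. Therefore, the signal to interference plus noise ratio (SINR) of the link between the $n$-th BS and its user $k$ can be described by
$$\mathrm{sinr}^t_{n,k} = \frac{g^t_{n,n,k}\, p^t_{n,k}}{\sum_{k' \neq k} g^t_{n,n,k}\, p^t_{n,k'} + \sum_{n' \in D_n} g^t_{n',n,k} \sum_{j} p^t_{n',j} + \sigma^2} \tag{2}$$
where $D_n$ is the set of interference cells around the $n$-th cell, $p^t_{n,k}$ is the emitting power of the BS on this link, and $\sigma^2$ denotes the additional noise power. With normalized bandwidth, the downlink rate of this link is given as
$$C^t_{n,k} = \log_2\left(1 + \mathrm{sinr}^t_{n,k}\right) \tag{3}$$
The optimization target is to maximize this generic sum-rate objective function under the maximum power constraint, and it is formulated as
$$\max_{\mathbf{p}^t} \sum_{n} \sum_{k} C^t_{n,k} \quad \text{s.t.} \quad 0 \le p^t_{n,k} \le P_{\max}, \ \forall n, k \tag{4}$$
where $\mathbf{p}^t = \{p^t_{n,k} \mid \forall n, k\}$, and $P_{\max}$ denotes the maximum emitting power. We also define the sum-rate $C^t = \sum_{n} \sum_{k} C^t_{n,k}$ and the channel state information (CSI) $\mathbf{g}^t = \{g^t_{n',n,k} \mid \forall n', n, k\}$. This problem is non-convex and NP-hard, so we propose a data-driven learning algorithm based on the DQN model in the following section.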
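As an illustration of the SINR, rate and sum-rate definitions in this section, the following Python sketch evaluates the sum-rate of one time slot. It is not the authors' simulation code: the function name `sum_rate`, the nested-list layout of `gain` and `power`, and the interference model (a user sees the other users of its own BS through its own channel, plus the total power of each neighboring BS) are assumptions made for this example.

```python
import math

def sum_rate(gain, power, neighbors, noise_power):
    """Sum-rate (bit/s/Hz, normalized bandwidth) of one time slot.

    gain[m][n][k]: channel gain from BS m to user k of cell n (assumed layout).
    power[n][k]:   transmit power of BS n towards its own user k.
    neighbors[n]:  indices of the interfering cells around cell n.
    """
    total = 0.0
    for n, cell_power in enumerate(power):
        for k, p in enumerate(cell_power):
            direct = gain[n][n][k]
            # Intra-cell interference: power sent by the same BS to its other users.
            intra = direct * (sum(cell_power) - p)
            # Inter-cell interference: total power of every neighboring BS.
            inter = sum(gain[m][n][k] * sum(power[m]) for m in neighbors[n])
            sinr = direct * p / (intra + inter + noise_power)
            total += math.log2(1.0 + sinr)
    return total

# Two isolated cells with one user each: every link has SINR 1, i.e. rate 1.
gain = [[[1.0], [0.0]], [[0.0], [1.0]]]
print(sum_rate(gain, power=[[1.0], [1.0]], neighbors=[[1], [0]], noise_power=1.0))
```

Evaluating this function over candidate power vectors is what any PA scheme for this model has to do implicitly; the difficulty is that the coupling through the denominators makes the search non-convex.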
III Deep Q Network
III-A Background
Q learning is one of the most popular RL algorithms for Markov decision process (MDP) problems [16]. At time instant $t$, the agent observes the state $s^t \in \mathcal{S}$, takes action $a^t \in \mathcal{A}$ and interacts with the environment, and then obtains the reward $r^{t+1}$ and the next state $s^{t+1}$. The notations $\mathcal{A}$ and $\mathcal{S}$ are the action set and the state set, respectively. Since $\mathcal{S}$ can be continuous, the DQN combines Q learning with a flexible DNN to handle an infinite state space. The cumulative discounted reward function is given as
$$R^t = \sum_{\tau=0}^{\infty} \gamma^{\tau} r^{t+\tau+1} \tag{5}$$
where $\gamma \in [0, 1]$ is a discount factor that trades off the importance of immediate and future rewards, and $r$ denotes the reward. Under a certain policy $\pi$, the Q function of the agent with an action $a$ in state $s$ is given as
$$Q_{\pi}(s, a; \theta) = \mathbb{E}_{\pi}\left[R^t \mid s^t = s, a^t = a\right] \tag{6}$$
where $\theta$ denotes the DQN parameters, and $\mathbb{E}[\cdot]$ is the expectation operator. Q learning concerns how agents ought to interact with an unknown environment so as to maximize the Q function. The maximization of (6) is equivalent to the Bellman optimality equation [17], which is described as
$$y^t = r^{t+1} + \gamma \max_{a'} Q(s^{t+1}, a'; \theta) \tag{7}$$
where $y^t$ is the optimal Q value. The DQN is trained to approximate the Q function, and the standard Q learning update of the parameters $\theta$ is described as
$$\theta \leftarrow \theta + \alpha \left(y^t - Q(s^t, a^t; \theta)\right) \nabla_{\theta} Q(s^t, a^t; \theta) \tag{8}$$
where $\alpha$ is the learning rate. This update resembles stochastic gradient descent, gradually moving the current value $Q(s^t, a^t; \theta)$ towards the target $y^t$. The experience data of the agent is stored as $(s^t, a^t, r^{t+1}, s^{t+1})$. The DQN is trained with batches of recorded data randomly sampled from the experience replay memory, which is a first-in first-out queue.
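The update rule and the first-in first-out replay memory described above can be sketched as follows. For brevity the sketch uses a tabular Q function stored in a dictionary rather than a DNN; `ReplayMemory` and `q_update` are hypothetical names chosen for this example, not taken from the authors' code.

```python
import random
from collections import deque

class ReplayMemory:
    """First-in first-out experience buffer with uniform random sampling."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest entry once it is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        size = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), size)

    def __len__(self):
        return len(self.buffer)

def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """One tabular Q-learning step towards reward + gamma * max_a' Q(s', a')."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    target = reward + gamma * best_next
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (target - old)
    return q[(state, action)]

memory = ReplayMemory(capacity=1000)
memory.push("s0", 1, 2.5, "s1")
q_table = {}
for s, a, r, s_next in memory.sample(batch_size=32):
    q_update(q_table, s, a, r, s_next, actions=[0, 1], alpha=0.1, gamma=0.9)
```

In a DQN the dictionary lookup is replaced by a forward pass of the network and the assignment by a gradient step on the squared difference between target and prediction.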
III-B Discussion on DRL
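In many applications such as playing video games [16], where the current strategy has a long-term impact on the cumulative reward, the DQN achieves remarkable results and beats humans. However, the discount factor is suggested to be zero in this PA problem. The DQL aims to maximize the Q function. Let $\gamma = 0$; from (6) we have
$$Q_{\pi}(s, a; \theta) = \mathbb{E}_{\pi}\left[r^{t+1} \mid s^t = s, a^t = a\right] \tag{9}$$
For a PA problem, the state is clearly the current CSI, $s^t = \mathbf{g}^t$, and the action is the power allocation, $a^t = \mathbf{p}^t$. Then we let the reward be the current sum-rate, $r^{t+1} = C^t = C(\mathbf{g}^t, \mathbf{p}^t)$, and get that
$$\max_{\mathbf{p}^t} Q_{\pi} = \max_{\mathbf{p}^t} \mathbb{E}_{\pi}\left[C^t \mid \mathbf{g}^t, \mathbf{p}^t\right] \tag{10}$$
In the execution period the policy is deterministic, and thus (10) can be written as
$$\max_{\mathbf{p}^t} Q = \max_{\mathbf{p}^t} C(\mathbf{g}^t, \mathbf{p}^t) \tag{11}$$
which is an equivalent form of (4). This derivation assumes $\gamma = 0$ and $r^{t+1} = C^t$; under these two conditions, the optimal solution to (4) is identical to that of (6).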
As shown in Fig. 2, it is well known that the optimal solution of (4) is determined only by the current CSI, and the sum-rate is calculated from the CSI and the allocated power. Theoretically, the optimal power can be obtained using a DQN whose input is just the current CSI. In practice, the performance of such a DQN is poor, since the problem is non-convex and the optimal point is hard to find. Therefore, we propose to utilize two more auxiliary features: the downlink rate and the transmitting power of the last time slot. Since the channel can be modeled as a first-order Markov process, the solution of the last time period can help the DQN get closer to the optimum, and (11) can be rewritten as
(12) 
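Once the discount factor is zero and the reward is the current sum-rate, the target in (7) reduces to the immediate reward, and each entry of the replay memory reduces to the state, the action and the reward, since the next state is no longer needed. The DQN then works as an estimator that predicts the current sum-rate of each candidate power level under a given CSI. These observations guide the DQN design that follows.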
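To make the effect of a zero discount factor concrete, the sketch below computes the learning target and the greedy action. The helper names `td_target` and `greedy_power` are hypothetical. With `gamma = 0` the target is just the observed reward, so each network output can be read as a prediction of the current sum-rate for one power level, and execution picks the level with the largest prediction.

```python
def td_target(reward, next_q_values, gamma):
    """Learning target: immediate reward plus discounted best future value."""
    return reward + gamma * max(next_q_values)

def greedy_power(predicted_sum_rates, power_levels):
    """Return the power level whose predicted sum-rate is largest."""
    best = max(range(len(power_levels)), key=lambda i: predicted_sum_rates[i])
    return power_levels[best]

# With gamma = 0 the values of the next state have no influence on the target.
assert td_target(3.2, [10.0, 20.0], gamma=0.0) == 3.2

# Hypothetical network outputs for three candidate power levels (in watts).
print(greedy_power([1.0, 2.5, 2.0], [0.0, 0.1, 1.0]))
```

Because the target no longer depends on the next state, training under this setting behaves like a regression of the observed sum-rate on the state-action pair.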
III-C DQN Design in Cellular Network
In our proposed model-free two-step training framework, the DQN is first pre-trained offline with the DRL algorithm in a simulated wireless communication system. This step reduces the burden of online training, since data-driven algorithms by nature require large amounts of data. Second, with the aid of transfer learning, the learned DQN can be further fine-tuned dynamically in real scenarios. Since a practical wireless communication system is dynamic and influenced by unknown factors, the data-driven algorithm is believed to be a promising technique. We only outline the two-step framework here; the rest of this paper focuses on the first training step.
In a given cellular network, each BS-user link is regarded as an agent, and thus a multi-agent system is studied. However, multi-agent training is difficult since it needs much more learning data, training time and DNN parameters. Therefore, centralized training is considered, and only one agent is trained using the experience replay memory of all agents. This agent's learned policy is then shared in the distributed execution period. For our designed DQN, the components of the replay memory are introduced as follows.
III-C1 State
The state design for a given agent is important, since the full environment information is redundant and irrelevant elements must be removed. The agent is assumed to have the corresponding perfect instantaneous CSI in (2), and we define the logarithmic normalized interferer set as
(13) 
The channel amplitudes of the interferers are normalized by that of the desired link, and the logarithmic representation is preferred since channel amplitudes often vary by orders of magnitude. The cardinality of is . To further decrease the input dimension and reduce the computational complexity, the elements in are sorted in decreasing order and only the first elements remain. As discussed in III-B, the downlink rates and transmitting powers at the last time slot of these remaining components and of this link are the two additional parts of the input to our DQN. Therefore, the state is composed of three features: . The cardinality of state, i.e., the input dimension for DQN is .
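The state construction described above can be sketched in Python as follows. Several details are assumptions made for this example rather than the authors' definitions: the normalization is taken as `log2(1 + g_interferer / g_direct)`, missing interferers are zero-padded, and the names `build_state` and `keep` are hypothetical.

```python
import math

def build_state(direct_gain, own_prev_rate, own_prev_power, interferers, keep):
    """Assemble the DQN input for one BS-user link.

    interferers: list of (gain, prev_rate, prev_power), one per interfering link.
    keep:        number of strongest interferers retained in the state.
    """
    # Normalize by the desired link and compress the dynamic range (assumed form).
    scored = [(math.log2(1.0 + g / direct_gain), r, p) for g, r, p in interferers]
    # Strongest interferers first, then truncate and zero-pad to a fixed length.
    scored.sort(key=lambda item: item[0], reverse=True)
    kept = scored[:keep]
    kept += [(0.0, 0.0, 0.0)] * (keep - len(kept))
    gains = [s for s, _, _ in kept]
    rates = [own_prev_rate] + [r for _, r, _ in kept]
    powers = [own_prev_power] + [p for _, _, p in kept]
    return gains + rates + powers

state = build_state(1.0, 0.4, 0.05,
                    [(1.0, 0.5, 0.1), (3.0, 0.7, 0.2), (0.0, 0.9, 0.3)], keep=2)
print(state)
```

Under this reading, keeping `keep` interferers gives `keep` normalized gains plus `keep + 1` previous rates and `keep + 1` previous powers, so the input length is `3 * keep + 2`.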
III-C2 Action
In (4) the downlink power is a continuous variable, and is only constrained by maximum power constraint. Since the action space of DQN must be finite, the possible emitting power is quantized in levels. The allowed power set is given as
(14) 
where is the nonzero minimum emitting power.
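A quantized action set of this kind can be generated as in the sketch below. The geometric (log-uniform) spacing between the minimum and maximum power, with one extra level for zero power, is an assumption made for this example, as are the function names `dbm_to_watt` and `power_levels`; the spacing actually used is the one defined in (14).

```python
def dbm_to_watt(dbm):
    """Convert a power in dBm to watts."""
    return 10.0 ** ((dbm - 30.0) / 10.0)

def power_levels(p_min, p_max, num_levels):
    """Zero power plus geometrically spaced levels from p_min to p_max."""
    if num_levels < 3:
        raise ValueError("need zero, p_min and p_max: at least three levels")
    steps = num_levels - 2          # number of intervals between p_min and p_max
    ratio = p_max / p_min
    return [0.0] + [p_min * ratio ** (i / steps) for i in range(steps + 1)]

# Placeholder limits, not the values used in the simulations.
levels = power_levels(dbm_to_watt(0.0), dbm_to_watt(20.0), num_levels=4)
print(levels)
```

Geometric spacing gives equal steps in dB, which matches the way the power limits are specified in dBm, while the zero level lets an agent switch its link off entirely.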
III-C3 Reward
In some works the reward function is elaborately designed to improve the agent's transmitting rate while also mitigating its interference to others. However, most of these reward functions are suboptimal surrogates of the target function in (4). In this paper, the sum-rate is directly used as the reward function, and it is shared by all agents. In the training simulations with small- or medium-scale cellular networks, this simple method proves to be feasible.
IV Simulation Results
IV-A Simulation Configuration
A cellular network with cells is simulated. At the center of each cell, a BS is deployed to synchronously serve users, which are located uniformly at random within the cell range , where km and km are the inner space and half cell-to-cell distance, respectively. The small-scale fading is simulated as Rayleigh distributed, and the Jakes model is adopted with Doppler frequency Hz and time period ms. According to the LTE standard, the large-scale fading is modeled as dB, where is a log-normal random variable with standard deviation dB, and is the transmitter-to-receiver distance (km). The AWGN power is dBm, and the emitting power constraints and are and dBm, respectively.
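One common way to simulate time-correlated Rayleigh fading with a Jakes-type correlation is a first-order autoregressive process whose coefficient is the zeroth-order Bessel function of the first kind evaluated at 2π times the Doppler frequency times the slot duration. The sketch below follows that approach as an assumption; it is not necessarily the generator used for the reported results, and the Doppler frequency and slot duration in the usage lines are placeholders.

```python
import math
import random

def bessel_j0(x, terms=30):
    """Zeroth-order Bessel function of the first kind via its power series."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= -((x / 2.0) ** 2) / ((k + 1) ** 2)
    return total

def complex_gaussian(rng):
    """Unit-variance circularly symmetric complex Gaussian sample."""
    sigma = math.sqrt(0.5)
    return complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))

def fading_sequence(doppler_hz, slot_s, num_slots, rng):
    """Correlated Rayleigh fading: h[t] = rho * h[t-1] + sqrt(1 - rho^2) * e[t]."""
    rho = bessel_j0(2.0 * math.pi * doppler_hz * slot_s)
    h = complex_gaussian(rng)
    sequence = [h]
    for _ in range(num_slots - 1):
        h = rho * h + math.sqrt(1.0 - rho * rho) * complex_gaussian(rng)
        sequence.append(h)
    return sequence

rng = random.Random(0)
fading = fading_sequence(doppler_hz=5.0, slot_s=0.01, num_slots=100, rng=rng)
gains = [abs(h) ** 2 for h in fading]   # small-scale power gains per slot
```

The slot-to-slot correlation produced by this recursion is what makes the previous slot's rate and power informative for the current decision.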
A four-layer feedforward neural network (FNN) is chosen as the DQN, and the neuron numbers of the two hidden layers are and , respectively. The activation function of the output layer is linear, and the ReLU is adopted in the hidden layers. The cardinality of adjacent cells is , the first interferers remain and power level number . Therefore, the input and output dimensions are and , respectively.
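The structure described above (two ReLU hidden layers followed by a linear output layer, one output per power level) can be sketched as a plain-Python forward pass. The layer sizes in the usage lines are placeholders, and the He-style random initialization and the names `init_layer`, `build_network` and `forward` are assumptions made for this example.

```python
import random

def init_layer(n_in, n_out, rng):
    """Random weights (He-style scaling, an assumption) and zero biases."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_network(sizes, rng):
    """One (weights, bias) pair per pair of consecutive layer sizes."""
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    """Dense forward pass: ReLU on hidden layers, linear output layer."""
    for index, (weights, bias) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

rng = random.Random(0)
dqn = build_network([8, 16, 12, 5], rng)   # placeholder sizes: input, hidden, hidden, output
q_values = forward(dqn, [0.1] * 8)         # one predicted value per power level
```

The linear output layer matters here because the predicted quantity is a sum-rate, which is unbounded above and should not be squashed by a saturating activation.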
In the offline training period, the DQN is first randomly initialized and then trained episode by episode. In the first episodes, the agents only take actions stochastically; they then follow an adaptive $\epsilon$-greedy learning strategy [17] in the subsequent exploration period. In each episode, the large-scale fading is invariant, and thus the number of training episodes must be large enough to overcome the generalization problem. There are time slots per episode, and the DQN is trained with random samples in the experience replay memory every time slots. The Adam algorithm [18] is adopted as the optimizer in our paper, and the learning rate exponentially decays from to . All training hyperparameters are listed in Table I for better illustration. In the following simulations, any change to these default hyperparameters will be stated explicitly.
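The training schedule described above can be sketched as follows: purely random actions during an initial observation phase, then an exploration probability that is annealed, together with an exponentially decaying learning rate. The linear annealing of the exploration probability, the function names and all numeric values in the usage lines are assumptions made for this example, not the settings of Table I.

```python
import random

def exponential_decay(initial, final, step, total_steps):
    """Geometric interpolation from `initial` (step 0) to `final` (last step)."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return initial * (final / initial) ** fraction

def epsilon_for_episode(episode, observe, explore, eps_start, eps_end):
    """Random actions while observing, then linear annealing (assumed form)."""
    if episode < observe:
        return 1.0
    progress = min((episode - observe) / explore, 1.0)
    return eps_start + (eps_end - eps_start) * progress

def select_action(q_values, epsilon, rng):
    """Epsilon-greedy: random index with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

rng = random.Random(0)
for episode in range(20):
    eps = epsilon_for_episode(episode, observe=5, explore=10, eps_start=0.5, eps_end=0.05)
    lr = exponential_decay(1e-3, 1e-5, episode, total_steps=19)
    action = select_action([0.2, 1.3, 0.7], eps, rng)
```

Annealing both quantities serves the same purpose: large early steps and frequent random actions cover the action space, while small late steps and mostly greedy actions let the estimates settle.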
The FP algorithm, the WMMSE algorithm, and the maximum PA and random PA schemes are used as benchmarks to evaluate our proposed DQN-based algorithm. Perfect CSI of the current moment is assumed to be known for all schemes. The simulation code will be available after formal publication.
Parameter  Value  Parameter  Value 

Number of per episode  Initial  
Observe episode number  Final  
Explore episode number  Initial  
Train interval  10  Final  
Memory size  Batch size 
IV-B Discount Factor
In this subsection, the effect of the discount factor is studied. We set , and the average rate over the training period is shown in Fig. 3. At the same time slot, the average rate obtained with a higher discount factor is clearly lower than that obtained with a lower one. The trained DQNs are then tested in three cellular networks with different cell numbers. Fig. 4 shows that the DQN with zero discount factor achieves the highest score, while the lowest score is obtained by the one with the highest discount factor. The simulation results show that a non-zero discount factor has a negative influence on the performance of the DQN, which is consistent with the analysis in III-B. Therefore, a zero or low discount factor is recommended.
IV-C Algorithm Comparison
The DQN trained with zero discount factor is used, and the four benchmark algorithms stated before are tested for comparison. In a real cellular network, the user density changes over time, and the DQN must generalize well against this variation. The user number per cell is assumed to be in set . The averaged simulation results are obtained after repeats. As shown in Fig. 5, the DQN achieves the highest average sum-rate in all testing scenarios. Although it is trained with , the DQN still outperforms the other algorithms in the other cases. We also note that the gap between the random/maximum PA schemes and the optimization algorithms increases as the user number per cell becomes larger. This can be mainly attributed to the intra-cell interference becoming stronger with increased user density, which indicates that the optimization of PA is more significant in cellular networks with denser users.
We also give an example result of one testing episode here (). In contrast to the averaged sum-rate values in Fig. 5, the performance of the three PA algorithms (DQN, FP, WMMSE) in Fig. 6 is not stable and depends strongly on the specific large-scale fading effects. Additionally, in some episodes the DQN is not better than the other algorithms throughout the episode (not shown in this paper), which means that there is still potential to improve the DQN performance.
In terms of computational complexity, the time cost of the DQN scales linearly with the number of layers when a GPU is used. Meanwhile, both FP and WMMSE are iterative algorithms, so their time cost is not constant and depends on the stopping criterion, the initialization and the CSI.
V Conclusions
The PA problem in the cellular network with IMAC has been investigated, and the data-driven model-free DQL has been applied to solve it. To be consistent with the PA optimization target, the current sum-rate is used as the reward function, with no future reward included. With this DQL design, the DQN simply works as an estimator that predicts the current sum-rate under all power levels for a given CSI. Simulation results show that the DQN trained with zero discount factor achieves the highest average sum-rate. In a series of different scenarios, the proposed DQN then outperforms the benchmark algorithms, indicating that the designed DQN has good generalization ability. In our two-step training framework, we have realized offline centralized learning with simulated communication networks, and the learned DQN is tested with distributed execution. In future work, online learning will be further studied to accommodate real scenarios with specific user distributions and geographical environments.
VI Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (Grant No. 61801112, 61471117, 61601281), the Natural Science Foundation of Jiangsu Province (Grant No. BK20180357), and the Open Program of State Key Laboratory of Millimeter Waves (Southeast University, Grant No. Z201804).
References
 [1] K. Shen and W. Yu, “Fractional programming for communication systems—Part I: Power control and beamforming,” IEEE Transactions on Signal Processing, vol. 66, no. 10, pp. 2616–2630, 2018.
 [2] Q. Shi, M. Razaviyayn, Z. Q. Luo, and C. He, “An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2011, pp. 4331–4340.
 [3] M. Chiang, P. Hande, T. Lan, and C. W. Tan, “Power control in wireless cellular networks,” Foundations and Trends in Networking, vol. 2, no. 4, pp. 381–533, 2008.
 [4] H. Zhang, L. Venturino, N. Prasad, P. Li, S. Rangarajan, and X. Wang, “Weighted sum-rate maximization in multi-cell networks via coordinated scheduling and discrete power control,” IEEE Journal on Selected Areas in Communications, vol. 29, no. 6, pp. 1214–1224, June 2011.
 [5] Z. Qin, H. Ye, G. Y. Li, and B. F. Juang, “Deep learning in physical layer communications,” CoRR, vol. abs/1807.11713, 2018. [Online]. Available: http://arxiv.org/abs/1807.11713
 [6] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, Dec 2017.
 [7] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
 [8] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, “Learning to optimize: Training deep neural networks for interference management,” IEEE Transactions on Signal Processing, vol. 66, no. 20, pp. 5438–5453, Oct 2018.
 [9] F. Meng, P. Chen, L. Wu, and X. Wang, “Automatic modulation classification: A deep learning enabled approach,” IEEE Transactions on Vehicular Technology, pp. 1–1, 2018.
 [10] H. Ye, G. Y. Li, and B. Juang, “Power of deep learning for channel estimation and signal detection in OFDM systems,” IEEE Wireless Communications Letters, vol. 7, no. 1, pp. 114–117, Feb 2018.
 [11] R. Amiri, H. Mehrpouyan, L. Fridman, R. K. Mallik, A. Nallanathan, and D. Matolak, “A machine learning approach for power allocation in HetNets considering QoS,” CoRR, vol. abs/1803.06760, 2018. [Online]. Available: http://arxiv.org/abs/1803.06760
 [12] E. Ghadimi, F. D. Calabrese, G. Peters, and P. Soldati, “A reinforcement learning approach to power control and rate adaptation in cellular networks,” in 2017 IEEE International Conference on Communications (ICC), May 2017, pp. 1–7.
 [13] F. D. Calabrese, L. Wang, E. Ghadimi, G. Peters, L. Hanzo, and P. Soldati, “Learning radio resource management in RANs: Framework, opportunities, and challenges,” IEEE Communications Magazine, vol. 56, no. 9, pp. 138–145, Sep 2018.
 [14] L. Xiao, D. Jiang, D. Xu, H. Zhu, Y. Zhang, and V. Poor, “Two-dimensional anti-jamming mobile communication based on reinforcement learning,” IEEE Transactions on Vehicular Technology, pp. 1–1, 2018.
 [15] Y. S. Nasir and D. Guo, “Deep reinforcement learning for distributed dynamic power allocation in wireless networks,” CoRR, vol. abs/1808.00490, 2018. [Online]. Available: http://arxiv.org/abs/1808.00490
 [16] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
 [18] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980