The cross-entropy (CE) method attributed to Reuven Rubinstein is a general Monte Carlo approach to combinatorial and continuous multi-extremal optimization and importance sampling. Monte Carlo methods are a class of Computational Algorithms that rely on repeated Random sampling to compute their results Combinatorial optimization is a branch of optimization. Its domain is optimization problems where the set of Feasible solutions is discrete or can be reduced Continuous optimization is a branch of optimization in Applied mathematics. In Mathematics, the term optimization, or mathematical programming, refers to the study of problems in which one seeks to minimize or maximize a real function In Statistics, importance sampling is a general technique for estimating the properties of a particular distribution, while only having samples generated from a different The method originated from the field of rare event simulation, where very small probabilities need to be accurately estimated, for example in network reliability analysis, queueing models, or performance analysis of telecommunication systems. The CE method can be applied to static and noisy combinatorial optimization problems such as the traveling salesman problem, the quadratic assignment problem, DNA sequence alignment, the max-cut problem and the buffer allocation problem, as well as continuous global optimization problems with many local extrema. The Travelling salesman problem ( TSP) in Operations research is a problem in discrete or Combinatorial optimization. The quadratic assignment problem ( QAP) is one of fundamental Combinatorial optimization problems in the branch of optimization or Operations research In Bioinformatics, a sequence alignment is a way of arranging the Primary sequences of DNA, RNA, or Protein to identify regions of In Graph theory, a cut is a partition of the vertices of a graph into two sets Global optimization is a branch of Applied mathematics and Numerical analysis that deals with the optimization of a function or a set In Mathematics, maxima and minima, known collectively as extrema, are the largest value (maximum or smallest value (minimum that

In a nutshell the CE method consists of two phases:

1. Generate a random data sample (trajectories, vectors, etc. ) according to a specified mechanism.
2. Update the parameters of the random mechanism based on the data to produce a "better" sample in the next iteration. This step involves minimizing the cross-entropy or Kullback-Leibler divergence. In Information theory, the cross entropy between two Probability distributions measures the average number of Bits needed to identify an event from a set In Probability theory and Information theory, the Kullback–Leibler divergence (also information divergence, information gain, or relative

Estimation via importance sampling

Consider the general problem of estimating the quantity $\ell = \mathbb{E}_{\mathbf{u}}[H(\mathbf{X})] = \int H(\mathbf{x})\, f(\mathbf{x}; \mathbf{u})\, \textrm{d}\mathbf{x}$, where H is some performance function and $f(\mathbf{x};\mathbf{u})$ is a member of some parametric family of distributions. Using importance sampling this quantity can be estimated as $\hat{\ell} = \frac{1}{N} \sum_{i=1}^N H(\mathbf{X}_i) \frac{f(\mathbf{X}_i; \mathbf{u})}{g(\mathbf{X}_i)}$, where $\mathbf{X}_1,\dots,\mathbf{X}_N$ is a random sample from $g\,$. In Statistics, importance sampling is a general technique for estimating the properties of a particular distribution, while only having samples generated from a different For positive H, the theoretically optimal importance sampling density (pdf)is given by $g^*(\mathbf{x}) = H(\mathbf{x}) f(\mathbf{x};\mathbf{u})/\ell$. In Mathematics, a probability density function (pdf is a function that represents a Probability distribution in terms of Integrals Formally a probability This, however, depends on the unknown $\ell$. The CE method aims to approximate the optimal pdf by adaptively selecting members of the parametric family that are closest (in the Kullback-Leibler sense) to the optimal pdf g * . In Probability theory and Information theory, the Kullback–Leibler divergence (also information divergence, information gain, or relative

Generic CE algorithm

1. Choose initial parameter vector $\mathbf{v}^{(0)}$; set t = 1.
2. Generate a random sample $\mathbf{X}_1,\dots,\mathbf{X}_N$ from $f(\cdot;\mathbf{v}^{(t-1)})$
3. Solve for $\mathbf{v}^{(t)}$, where
$\mathbf{v}^{(t)} = \mathop{\textrm{argmax}}_{\mathbf{v}} \frac{1}{N} \sum_{i=1}^N H(\mathbf{X}_i)\frac{f(\mathbf{X}_i;\mathbf{u})}{f(\mathbf{X}_i;\mathbf{v}^{(t-1)})} \log f(\mathbf{X}_i;\mathbf{v})$
4. If convergence is reached then stop; otherwise, increase t by 1 and reiterate from step 2.

In several cases, the solution to step 3 can be found analytically. Situations in which this occurs are

• When $f\,$ belongs to the natural exponential family
• When $f\,$ is discrete with finite support
• When $H(\mathbf{X}) = \mathrm{I}_{\{\mathbf{x}\in A\}}$ and $f(\mathbf{X}_i;\mathbf{u}) = f(\mathbf{X}_i;\mathbf{v}^{(t-1)})$, then $\mathbf{v}^{(t)}$ corresponds to the maximum likelihood estimator based on those $\mathbf{X}_k \in A$. In probability and Statistics, an exponential family is a class of Probability distributions sharing a certain form which is specified below In Topology, a discrete space is a particularly simple example of a Topological space or similar structure one in which the points are " isolated " In Mathematics, the support of a function is the set of points where the function is not zero or the closure of that set Maximum likelihood estimation ( MLE) is a popular statistical method used for fitting a mathematical model to some data

Continuous optimization—example

The same CE algorithm can be used for optimization, rather than estimation. Suppose the problem is to maximize some function S(x), for example, $S(x) = \textrm{e}^{-(x-2)^2} + 0.8\,\textrm{e}^{-(x+2)^2}$. To apply CE, one considers first the associated stochastic problem of estimating $\mathbb{P}_{\boldsymbol{\theta}}(S(X)\geq\gamma)$ for a given level $\gamma\,$, and parametric family $\left\{f(\cdot;\boldsymbol{\theta})\right\}$, for example the 1-dimensional Gaussian distribution, parameterized by its mean $\mu_t\,$ and variance $\sigma_t^2$ (so $\boldsymbol{\theta} = (\mu,\sigma^2)$ here). The normal distribution, also called the Gaussian distribution, is an important family of Continuous probability distributions applicable in many fields Hence, for a given $\gamma\,$, the goal is to find $\boldsymbol{\theta}$ so that $D_{\mathrm{KL}}(\textrm{I}_{\{S(x)\geq\gamma\}}\|f_{\boldsymbol{\theta}})$ is minimized. This is done by solving the sample version (stochastic counterpart) of the KL divergence minimization problem, as in step 3 above. It turns out that parameters that minimize the stochastic counterpart for this choice of target distribution and parametric family are the sample mean and sample variance corresponding to the elite samples, which are those samples that have objective function value $\geq\gamma$. The worst of the elite samples is then used as the level parameter for the next iteration. This yields the following randomized algorithm for this problem.

Pseudo-code

1.  mu:=-6; sigma2:=100; t:=0; maxits=100;    // Initialize parameters2.  N:=100; Ne:=10;                           //3.  while t < maxits and sigma2 > epsilon     // While not converged and maxits not exceeded4.   X = SampleGaussian(mu,sigma2,N);         // Obtain N samples from current sampling distribution5.   S = exp(-(X-2)^2) + 0. 8 exp(-(X+2)^2);   // Evaluate objective function at sampled points6.   X = sort(X,S);                           // Sort X by objective function values (in descending order)7.   mu = mean(X(1:Ne)); sigma2=var(X(1:Ne)); // Update parameters of sampling distribution8.   t = t+1;                                 // Increment iteration counter9.  return mu                                 // Return mean of final sampling distribution as solution