Contents
- Optimization and search
- Evolutionary Algorithms
- Representation, Mutation, and Recombination
- Fitness, Selection and Population Management
- Popular Evolutionary Algorithm Variants
- Working with Evolutionary Algorithms
- Hybridisation with Other Techniques
- Multiobjective Evolutionary Algorithms
- Machine learning and single-layer neural networks
- Multi-Layer Network
- Reinforcement Learning
Optimization and search
Finding the best solution (max/min) to a problem on a fitness landscape.
- Continuous Optimization is the mathematical discipline which is concerned with finding the maxima and minima of functions, possibly subject to constraints.
- Discrete Optimization is the activity of looking thoroughly in order to find an item with specified properties among a collection of items.
Optimization Methods
Discrete
Exhaustive search(Brute-force search)
- Test all possible solutions, pick the best
- Guaranteed to find the optimal solution
- Possible implementation:
def exhaustive_search_max(function, domain):
    # Evaluate every candidate in the domain and keep the best one
    best_x, best_f = domain[0], function(domain[0])
    for x in domain:
        if function(x) > best_f:
            best_x, best_f = x, function(x)
    return best_x, best_f
Greedy search
- Only test one solution
- Makes locally optimal choices, taking the first and “best” solution.
- TSP example: always go to the nearest city
Hill climbing
- Pick a solution as the current best(random)
- Compare to neighbor, switch if better
- Only finds local optimum, not global optimum!
Simulated annealing
- Set an initial temperature T
- Pick an initial solution
- Repeat:
- Pick a solution neighboring the current solution
- If the new one is better, keep it
- Otherwise, keep the new one with probability p
- p depends on the difference in quality and the temperature. High temp -> high p (more randomness)
- Reduce T
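A minimal sketch of this loop for a maximisation problem; the neighbour function, cooling factor, and the acceptance probability \(p = e^{\Delta/T}\) for worse moves are illustrative assumptions, not fixed by the description above.
import math, random

def simulated_annealing(fitness, neighbour, initial, T=1.0, cooling=0.95, steps=1000):
    current = initial
    for _ in range(steps):
        candidate = neighbour(current)                      # pick a neighbouring solution
        delta = fitness(candidate) - fitness(current)
        # keep better solutions; keep worse ones with probability p = exp(delta / T)
        if delta > 0 or random.random() < math.exp(delta / T):
            current = candidate
        T *= cooling                                         # reduce T
    return current, fitness(current)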
Continuous
Gradient descent/ascent
- Finding the gradient of the function and move towards the preferred direction.
- \(x^{(k+1)} = x^{(k)} + \gamma \nabla f(x^{(k)})\)
- Algorithm:
- Start with a point (guess)
- Repeat
- Determine a descent direction
- Choose a step
- Update
- Until stopping criterion is satisfied
- Possible Gradient ascent implementation:
def gradient_ascent(function, derivative, gamma, x, precision):
    # Repeatedly step in the direction of the gradient until the step becomes tiny
    dx = gamma * derivative(x)
    while abs(dx) > precision:
        x += dx
        dx = gamma * derivative(x)
    return x, function(x)
Exploitation and Exploration
- Search methods should combine:
- Trying completely new solutions (like in exhaustive search) => Exploration
- Trying to improve the current best solution by local search => Exploitation
- Combining the two makes it more likely to find the global optimum.
Evolutionary Algorithms
General scheme:
Pseudo-code
BEGIN
INITIALIZE population with random candidate solutions;
EVALUATE each candidate;
REPEAT UNTIL(TERMINATION CONDITION is satisfied) DO
1 SELECT parents;
2 RECOMBINE pairs of parents;
3 MUTATE the resulting offspring;
4 EVALUATE new candidates;
5 SELECT individuals for the next generation;
OD
END
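The same scheme as a compact sketch; the operator functions and parameters are placeholders that a concrete EA would supply (recombine is assumed to return a pair of children).
def evolve(init, evaluate, select_parents, recombine, mutate, select_survivors,
           pop_size=50, generations=100):
    population = [init() for _ in range(pop_size)]              # INITIALIZE
    fitness = [evaluate(ind) for ind in population]             # EVALUATE
    for _ in range(generations):                                # REPEAT UNTIL termination
        parents = select_parents(population, fitness)           # 1 SELECT parents
        offspring = [mutate(child)                              # 3 MUTATE
                     for a, b in zip(parents[::2], parents[1::2])
                     for child in recombine(a, b)]              # 2 RECOMBINE
        off_fitness = [evaluate(ind) for ind in offspring]      # 4 EVALUATE
        population, fitness = select_survivors(population, fitness,
                                               offspring, off_fitness)  # 5 SELECT survivors
    return population, fitness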
The two pillars of evolution
Pushing towards novelty
Increasing population diversity by genetic operators
- mutation
- recombination
Pushing towards quality
Decreasing population diversity by selection
- of parents
- of survivors
Terminology
Evaluation(fitness) function
- Represents the task to solve
- Provides basis for comparison of fitness for each phenotype
- Assigns a single real-valued fitness to each phenotype
Population
- The candidate solutions of the problem
- The population is evolving, not the individuals
- Selection operators act on population level
- Variation operators act on individual level
Selection mechanism
- Identifies individuals
- to become parents
- to survive
- Pushes population towards higher fitness
- Parent selection is usually probabilistic
- high quality solutions are more likely to be selected than low quality ones, but this is not guaranteed (selection is stochastic, which helps escape local optima)
Survivor selection
- N old solutions + K new solutions (offspring) -> N individuals (new population)
- Often deterministic:
- Fitness based: e.g., rank old population + offspring and take best
- Age based: make N offspring and delete all old solutions
- Sometimes a combination of stochastic and deterministic (elitism)
Variation operators
- Role: to generate new candidate solutions
- Usually divided into two types according to their arity (number of inputs to the variation operator):
- Arity 1 : mutation operators
- Arity >1 : recombination operators
- Arity = 2 typically called crossover
- Arity > 2 is formally possible, seldom used in EA
Mutation
- Role: cause small, random variance to a genotype
- Element of randomness is essential and differentiates it from other unary heuristic operators
- Importance ascribed depends on representation and historical dialect:
- Binary Genetic Algorithms – background operator responsible for preserving and introducing diversity
- Evolutionary Programming for continuous variables – the only search operator
- Genetic Programming – hardly used
Recombination
- Role: merges information from parents into offspring
- Choice of what information to merge is stochastic
- Hope is that some offspring are better by combining elements of genotypes that lead to good traits
Initialisation
- Initialisation usually done at random
- Need to ensure even spread and mixture of possible allele values
- Can include existing solutions, or use problem-specific heuristics, to “seed” the population
Termination
- Termination condition checked every generation
- Reaching some (known/hoped for) fitness
- Reaching some maximum allowed number of generations
- Reaching some minimum level of diversity
- Reaching some specified number of generations without fitness improvement
Representation, Mutation, and Recombination
- Most common representation of genomes:
- Binary
- Integer
- Real-Valued or Floating-Point
- Permutation
- Tree
Binary Representation
- Genotype consists of a string of binary digits
- Mutation by altering each gene independently with a probability(mutation rate) \(p_m\)
- 1-point crossover splits both parents at one random point and joins the head of one parent with the tail of the other:
- n-point crossover splits along n-points and recombines:
- Uniform crossover assigns a ‘head’ to one parent, ‘tail’ to the other and randomly chooses genes from each
- Crossover is explorative; it makes a big jump to an area somewhere in between the two parents
- Mutation is exploitative; it creates small random deviations, staying close to the parent (and thus to a possible optimum)
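A sketch of the two basic operators on bit-string genotypes; the parameter names follow the text above.
import random

def bit_flip_mutation(genome, p_m):
    # Flip each gene independently with probability p_m
    return [1 - g if random.random() < p_m else g for g in genome]

def one_point_crossover(p1, p2):
    # Split both parents at the same random point and swap the tails
    point = random.randrange(1, len(p1))
    return p1[:point] + p2[point:], p2[:point] + p1[point:]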
Integer Representation
- Could be integer variables or categorical values from a set.
- N-point / uniform crossover operators work
- Extend bit-flipping mutation to:
- “creep” mutation, i.e. more likely to move to a similar value: add a small (positive or negative) value to each gene with probability \(p\)
- Random resetting (esp. categorical variables): with probability \(p_m\) a new value is chosen at random
Real-Valued or Floating-Point
- General scheme of floating point mutations
- \(\overline{x} = \langle x_1, ..,x_l \rangle \rightarrow \overline{x}' = \langle x_1', .., x_l' \rangle\)
- \(x_i, x_i' \in [LB_i, UB_i]\)
- Uniform Mutation: \(x_i'\) drawn randomly from \([LB_i,UB_i]\)
- Analogous to bit-flipping (binary) or random resetting (integers)
- Nonuniform Mutation:
- Most common method is to add a random deviate to each variable separately, drawn from a Gaussian distribution \(N(0,\sigma)\), and then curtail to the allowed range: \(x_i' = x_i + N(0,\sigma)\)
- Standard deviation \(\sigma\), the mutation step size, controls the amount of change (about 2/3 of draws will lie in the range \((-\sigma, +\sigma)\))
- Crossover operators:
- Discrete recombination:
- each allele value in offspring z comes from one of its parents \((x,y)\) with equal probability: \(z_i\) = \(x_i\) or \(y_i\)
- Could use n-point or uniform
- Intermediate recombination:
- exploits idea of creating children “between” parents (hence a.k.a. arithmetic recombination)
- \(z_i = \alpha x_i + (1-\alpha)y_i\) where \(\alpha:0\leq\alpha\leq1\)
- The parameter \(\alpha\) can be:
- constant: \(\alpha = 0.5 \) -> uniform arithmetical crossover
- variable (e.g. depend on the age of the population)
- picked at random every time
- Simple arithmetic crossover
- Parents: \(\langle x_1, ..,x_n \rangle, \langle y_1, ..,y_n \rangle\)
- Pick a random gene \(k\); after this point, mix (average) the parents' values
- Whole arithmetic crossover
- Parents: \(\langle x_1, ..,x_n \rangle, \langle y_1, ..,y_n \rangle\)
- Child\(_1\) is: \(\alpha\overline{x} + (1-\alpha)\overline{y}\)
- reversed for the other child, e.g. with \(\alpha = 0.5\)
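A sketch of nonuniform (Gaussian) mutation and whole arithmetic crossover for real-valued vectors; the bounds, \(\sigma\) and \(\alpha\) are supplied by the caller and the names are illustrative.
import random

def gaussian_mutation(x, sigma, lower, upper):
    # Add N(0, sigma) to each variable, then curtail to [LB_i, UB_i]
    return [min(max(xi + random.gauss(0, sigma), lo), ub)
            for xi, lo, ub in zip(x, lower, upper)]

def whole_arithmetic_crossover(x, y, alpha=0.5):
    # Children lie "between" the parents: alpha*x + (1-alpha)*y, and the reverse
    child1 = [alpha * xi + (1 - alpha) * yi for xi, yi in zip(x, y)]
    child2 = [alpha * yi + (1 - alpha) * xi for xi, yi in zip(x, y)]
    return child1, child2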
Permutation representation
- Useful in ordering/sequencing problems
- Task is (or can be solved by) arranging some objects in a certain order. Examples:
- production scheduling: important thing is which elements are scheduled before others (order)
- Travelling Salesman Problem (TSP) : important thing is which elements occur next to each other (adjacency)
- if there are n variables then the representation is as a list of n integers, each of which occurs exactly once
- Swap mutation
- Pick two alleles at random and swap their positions
- Insert mutation
- Pick two allele values at random
- Move the second to follow the first, shifting the rest along to accommodate
- Note that this preserves most of the order and the adjacency information
- Scramble mutation
- Pick a subset of genes at random
- Randomly rearrange the alleles in those positions
- Inversion mutation
- Pick two alleles at random and then invert the substring between them.
- Preserves most adjacency information (only breaks two links) but disruptive of order information
- Conserving Adjacency
- Important for problems where adjacency between elements decides quality (e.g. TSP)
- Partially Mapped Crossover and Edge Recombination are example operators
- Conserving Order
- Important for problems where the order of elements decides performance (e.g. production scheduling)
- Order Crossover and Cycle Crossover are example operators
Partially Mapped Crossover (PMX)
- Choose random segment and copy it from P1
- Starting from the first crossover point look for elements in that segment of P2 that have not been copied
- For each of these i look in the offspring to see what element j has been copied in its place from P1
- Place i into the position occupied by j in P2, since we know that we will not be putting j there (as it is already in the offspring)
- If the place occupied by j in P2 has already been filled in the offspring by an element k, put i in the position occupied by k in P2
- Having dealt with the elements from the crossover segment, the rest of the offspring can be filled from P2.
Second child is created analogously
def pmx(a, b, start, stop):
    child = [None] * len(a)
    child[start:stop] = a[start:stop]            # copy the segment from parent a
    for ind, x in enumerate(b[start:stop]):      # place segment elements of b not yet copied
        ind += start
        if x not in child:
            while child[ind] is not None:        # follow the mapping until a free position is found
                ind = b.index(a[ind])
            child[ind] = x
    for ind, x in enumerate(child):              # fill the remaining positions from parent b
        if x is None:
            child[ind] = b[ind]
    return child
Edge Recombination
- Works by constructing a table listing which edges are present in the two parents; if an edge is common to both, mark it with a +
- e.g. [1 2 3 4 5 6 7 8 9] and [9 3 7 8 2 6 5 1 4]
- Pick an initial element, entry, at random and put it in the offspring
- Set the variable current element = entry
- Remove all references to current element from the table
- Examine list for current element:
- If there is a common edge, pick that to be next element
- Otherwise pick the entry in the list which itself has the shortest list
- Ties are split at random
- In the case of reaching an empty list:
- a new element is chosen at random
Order crossover
- Idea is to preserve relative order that elements occur
- Choose an arbitrary part from the first parent
- Copy this part to the first child
- Copy the numbers that are not in the first part, to the first child:
- starting right from cut point of the copied part,
- using the order of the second parent
- and wrapping around at the end
- Analogous for the second child, with parent roles reversed
- Example: copy a randomly selected segment from the first parent, then copy the remaining values (e.g. 1, 9, 3, 8, 2) from the second parent in order, wrapping around
def order_xover(a, b, start, stop):
    child = [None] * len(a)
    child[start:stop] = a[start:stop]            # copy the segment from parent a
    b_ind = stop                                 # start reading parent b just after the segment
    c_ind = stop                                 # next position to fill in the child
    size = len(a)
    while None in child:
        if b[b_ind % size] not in child:         # skip values already copied from a
            child[c_ind % size] = b[b_ind % size]
            c_ind += 1
        b_ind += 1
    return child
Cycle crossover
- Each allele comes from one parent together with its position.
- Make a cycle of alleles from P1 in the following way.
- Start with the first allele of P1.
- Look at the allele at the same position in P2
- Go to the position with the same allele in P1.
- Add this allele to the cycle.
- Repeat the previous three steps until you arrive back at the first allele of P1.
- Put the alleles of the cycle in the first child on the positions they have in the first parent.
- Take next cycle from second parent
- Step 1: identify cycles
- Step 2: copy alternate cycles into offspring
def cycle_xover(a, b):
    child = [None] * len(a)
    while None in child:
        ind = child.index(None)                  # start a new cycle at the first empty position
        indices = []
        values = []
        while ind not in indices:                # trace the cycle through both parents
            val = a[ind]
            indices.append(ind)
            values.append(val)
            ind = a.index(b[ind])
        for ind, val in zip(indices, values):    # copy this cycle from the current "first" parent
            child[ind] = val
        a, b = b, a                              # alternate parents for the next cycle
    return child
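For illustration, the three permutation crossovers above can be exercised on two example parents; the tours and cut points below are arbitrary.
p1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
p2 = [9, 3, 7, 8, 2, 6, 5, 1, 4]
print(pmx(p1, p2, 3, 7))          # copies p1[3:7], repairs conflicts via the mapping
print(order_xover(p1, p2, 3, 7))  # copies p1[3:7], fills the rest in p2's order
print(cycle_xover(p1, p2))        # copies alternate cycles from p1 and p2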
Tree Representation
- In Genetic Algorithms, Evolution Strategies and Evolutionary Programming, chromosomes are linear structures (bit strings, integer strings, real-valued vectors, permutations)
- Tree shaped chromosomes are non-linear structures
- In GA, ES, EP the size of the chromosomes is fixed
- Trees in GP (Genetic Programming) may vary in depth and width
Mutation
- Most common mutation:
- replace randomly chosen subtree by randomly generated tree
- Mutation has two parameters:
- Probability \(p_m\) to choose mutation
- Probability to choose an internal point as the root of the subtree to be replaced
- Remarkably, \(p_m\) is advised to be 0 or very small, e.g. 0.05
- The size of the child can exceed the size of the parent
Recombination
- Most common recombination: exchange two randomly chosen subtrees among the parents
- Recombination has two parameters:
- Probability \(p_c\) to choose recombination
- Probability to choose an internal point within each parent as crossover point
- The size of offspring can exceed that of the parents
Fitness, Selection and Population Management
Population Management
- There are two main types:
- Generational model
- Each individual survives for one cycle
- \(\lambda\) offspring are generated
- The entire set of \(\mu\) parents is replaced by the \(\mu\) offspring
- Steady-state model
- \(\lambda(<\mu)\) parents are replaced by \(\lambda\) offspring
- Generation Gap
- The proportion of the population replaced
- Parameter = 1.0 for the Generational Model, \(\lambda / \text{population size}\) for the Steady-state model
Fitness based competition
- Selection can occur in two places:
- Parent selection (selects mating pairs)
- Survivor selection (replaces population)
- Selection works on the population
- representation independent
- Selection pressure: As selection pressure increases, fitter solutions are more likely to survive, or be chosen as parents
- High selection pressure gives exploitation
- Low selection pressure gives exploration
Parent Selection
- Fitness-Proportionate Selection
- Individuals with higher fitness have a higher chance of being selected.
- Stochastic Universal Sampling
- “Select multiple individuals by making one spin of the wheel with a number of equally spaced arms”
Fitness-Proportionate Selection (FPS)
- Probability for individual i to be selected for mating in a population size μ with FPS is: \(P_{FPS}(i)=\frac{f_i}{\sum\limits_{j=1}^{\mu}f_j}\)
- Problems:
- Premature Convergence - one highly fit member can rapidly take over.
- Selection pressure lowers at the end of runs
- Scaling
- Windowing: \(f'(i) = f(i)-\beta^t\) where \(\beta^t\) is the worst fitness in the last n generations
- Increases selection pressure
- Sigma Scaling: \(f'(i) = \max(f(i) - (\overline{f}-c\sigma_f),0)\) where c is a constant, usually 2.0 (cf. the Normal distribution)
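A sketch of roulette-wheel parent selection implementing \(P_{FPS}\); it assumes non-negative fitness values (apply windowing or sigma scaling first if needed), and the function name is illustrative.
import random

def fps_selection(population, fitnesses, n):
    # Each "spin" picks individual i with probability f_i / sum_j f_j (with replacement)
    return random.choices(population, weights=fitnesses, k=n)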
Ranked-based selection
- Attempt to remove problems of FPS by basing selection probabilities on relative rather than absolute fitness
- Rank population according to fitness and then base selection probabilities on rank (fittest has rank μ-1 and worst rank 0)
- This imposes a sorting overhead on the algorithm
Tournament Selection
- All methods above rely on global population statistics
- Could be a bottleneck esp. on parallel machines, very large population
- Relies on the presence of an external fitness function, which might not exist: e.g. evolving game players
- Idea: a procedure using only local fitness information:
- Pick k members at random then select the best of these
- Repeat to select more individuals
- Probability of selecting i will depend on:
- Rank of i
- Size of sample k
- higher k increases selection pressure
- Whether contestants are picked with replacement
- Picking without replacement increases selection pressure
- Whether fittest contestant always wins (deterministic) or this happens with probability p
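A sketch of one tournament; the sample size k, replacement, and the win probability p are exactly the knobs listed above, while the function layout is an assumption.
import random

def tournament_selection(population, fitnesses, k=2, p=1.0, replacement=True):
    # Pick k contestants, rank them by fitness; the fittest wins with probability p
    idx = range(len(population))
    contestants = random.choices(idx, k=k) if replacement else random.sample(idx, k)
    contestants = sorted(contestants, key=lambda i: fitnesses[i], reverse=True)
    if k == 1 or random.random() < p:
        return population[contestants[0]]
    return population[random.choice(contestants[1:])]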
Uniform
\(P_{uniform}(i)=\frac{1}{\mu}\)
- Parents are selected by uniform random distribution whenever an operator needs one/some
- Uniform parent selection is unbiased - every individual has the same probability to be selected
Survivor Selection(Replacement)
- From a set of μ old solutions and λ offspring: select a set of μ individuals forming the next generation
- Survivor selection can be divided into two approaches:
- Age-Based Replacement
- Fitness is not taken into account
- Fitness-Based Replacement
- Usually with deterministic elements
Fitness-based replacement
- Elitism
- Keep one copy, minimum, of the N fittest solution(s)
- Used in both GGA and SSGA
- Delete worst
- Replace the \(\lambda\) worst individuals
- Round-robin tournament
- Pairwise competitions in round-robin format.
- x is evaluated against q other randomly chosen individuals(1v1)
- individuals are assigned “wins”
- solutions with most wins are winners
- Parameter q allows tuning selection pressure
- \((\mu,\lambda)\)-selection (best can be lost)
- based only on the children \((\lambda > \mu)\)
- choose the best \(\mu\) offspring for the next generation
- widely used because it is good at leaving local optima
- \((\mu+\lambda)\)-selection (elitist)
- based on parents and children
- choose the best \(\mu\) of parents and offspring for the next generation
Multimodality
- Many problems have several (local) optima, and different strategies may call for different solutions, so we want to find and preserve several good solutions
Preserving Diversity
- Implicit approaches:
- Impose an equivalent of geographical separation
- Impose an equivalent of speciation
- Explicit approaches
- Make similar individuals compete for resource(fitness)
- Make similar individuals compete with each other for survival
Explicit approaches
- Fitness Sharing
- Restricts the number of individuals within a given niche by “sharing” their fitness
- Need to set the niche size \(\sigma_{share}\), in either genotype or phenotype space
- Run the EA as normal, but after each generation set \(f'(i) = \frac{f(i)}{\sum_j sh(d(i,j))}\) (see the sketch after this list)
- Crowding
- Idea: New individuals replace similar individuals
- Randomly shuffle and pair parents, produce 2 offspring
- Each offspring competes with their nearest parent for survival (using a distance measure)
- Result: Even distribution among niches.
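A sketch of the fitness-sharing adjustment referenced above; the distance function, niche size \(\sigma_{share}\) and exponent \(\alpha\) are problem-specific assumptions.
def shared_fitness(population, fitnesses, distance, sigma_share, alpha=1.0):
    # f'(i) = f(i) / sum_j sh(d(i,j)), with sh(d) = 1 - (d/sigma_share)^alpha if d < sigma_share else 0
    shared = []
    for i, f_i in enumerate(fitnesses):
        niche_count = sum(1 - (distance(population[i], population[j]) / sigma_share) ** alpha
                          for j in range(len(population))
                          if distance(population[i], population[j]) < sigma_share)
        shared.append(f_i / niche_count)   # j = i always contributes 1, so no division by zero
    return shared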
Implicit approaches
- Geographical Separation
- “Island” Model Parallel EA:
- Periodic migration of individual solutions between populations
- Run multiple populations in parallel
- After a (usually fixed) number of generations (an Epoch), exchange individuals with neighbours
- Repeat until ending criteria met
- Partially inspired by parallel/clustered systems
- Adaptively exchange individuals (e.g. end a population's epoch when it has not improved over x generations)
Popular Evolutionary Algorithm Variants
Algorithm | Chromosome Representation | Crossover | Mutation |
---|---|---|---|
Genetic Algorithm (GA) | Array | X | X |
Genetic Programming (GP) | Tree | X | X |
Evolution Strategies (ES) | Array | (X) | X |
Evolutionary Programming (EP) | No Constraints | - | X |
Genetic Algorithms
Simple GA
- Typically applied to:
- discrete function optimization
- benchmark for comparison with other algorithms
- straightforward problems with binary representation
- Features:
- not too fast
- missing newer variants (elitism, SUS)
- often modelled by theorists
- Reproduction cycle:
- Select parents for the mating pool (size of mating pool = population size)
- Shuffle the mating pool
- Apply crossover for each consecutive pair with probability pc, otherwise copy parents
- Apply mutation for each offspring (bit-flip with probability pm independently for each bit)
- Replace the whole population with the resulting offspring
Summary | |
---|---|
Representation | Bit-strings |
Recombination | 1-Point Crossover |
Mutation | Bit flip |
Parent selection | Fitness proportional – implemented by Roulette Wheel |
Survivor selection | Generational |
Genetic Programming GP
- Typically applied to:
- machine learning tasks (prediction, classification…)
- Attributed features:
- “automatic evolution of computer programs”
- needs huge populations (thousands)
- slow
- Special
- non-linear chromosomes: trees
- mutation possible but not necessary
- Tree Representation
- Non-linear
- Variable size
- Mutation
- The trees have a tendency to bloat: the average tree size in the population increases over time
- Countermeasures:
- Maximum tree size
- Parsimony pressure: penalty for being oversized
Summary | |
---|---|
Representation | Tree structures |
Recombination | Exchange of subtrees |
Mutation | Random change in trees |
Parent selection | Fitness proportional |
Survivor selection | Generational replacement |
Evolution Strategies ES
- Attributed features:
- fast
- good optimizer for real-valued optimisation
- relatively much theory
- Special:
- self-adaptation of (mutation) parameters standard
- Representation:
- Chromosomes consist of two parts:
- Object variables: \(x_1,…,x_n\)
- Strategy parameters (mutation rate, etc): \(p_1,…,p_n\)
- Full size: \(\langle x_1,…,x_n,p_1,…,p_n \rangle\) (see the mutation sketch after the summary table below)
- Parent selection
- Parents are selected by uniform random distribution whenever an operator needs one/some
- Thus: ES parent selection is unbiased - every individual has the same probability to be selected \(P_{uniform}(i)=\frac{1}{\mu}\)
- Recombination
- Two parents create one child
- Acts per variable / position by either
- Discrete crossover: (1,1,1)+(2,2,2)=(1,2,1), each position taken from one of the parents
- Intermediary crossover: (1,1,1)+(2,2,2)=(1.5,1.5,1.5), each position averaged
- From two or more parents by either:
- Local recombination: Two parents make a child
- Global recombination: Selecting two parents randomly for each gene
Two fixed parents | Two parents selected for each i |
---|---|
Local intermediary | Global intermediary |
Local discrete | Global discrete |
Summary | |
---|---|
Representation | Real-valued vectors |
Recombination | Discrete or intermediary |
Mutation | Gaussian perturbation |
Parent selection | Uniform random |
Survivor selection | \((\mu,\lambda)\) or \((\mu+\lambda)\) |
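A sketch of the standard self-adaptive Gaussian mutation with one step size per object variable (uncorrelated mutation with n step sizes); the learning rates \(\tau' = 1/\sqrt{2n}\) and \(\tau = 1/\sqrt{2\sqrt{n}}\) are the usual heuristic choices and are assumptions here.
import math, random

def es_mutation(x, sigmas, eps=1e-8):
    # Mutate the strategy parameters first, then use them to perturb the object variables
    n = len(x)
    tau_prime = 1.0 / math.sqrt(2.0 * n)
    tau = 1.0 / math.sqrt(2.0 * math.sqrt(n))
    common = tau_prime * random.gauss(0, 1)                 # shared draw for all step sizes
    new_sigmas = [max(s * math.exp(common + tau * random.gauss(0, 1)), eps) for s in sigmas]
    new_x = [xi + s * random.gauss(0, 1) for xi, s in zip(x, new_sigmas)]
    return new_x, new_sigmas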
Evolutionary Programming EP
- Typically applied to:
- traditional EP: prediction by finite state machines
- contemporary EP: (numerical) optimization
- Attributed features:
- very open framework: any representation and mutation op’s OK
- Contemporary EP has almost merged with ES
- Special:
- no recombination
- self-adaptation of parameters standard (contemporary EP)
- Representation
- For continuous parameter optimisation
- Chromosomes consist of two parts:
- Object variables: \(x_1,…,x_n\)
- Mutation step sizes: \(\sigma_1,…,\sigma_n\)
- Full size: \(\langle x_1,…,x_n, \sigma_1,…,\sigma_n\rangle\)
- Selection
- Each individual creates one child by mutation
- Deterministic
- Not biased by fitness
- Parents and offspring compete for survival in round-robin tournaments.
Summary | |
---|---|
Representation | Real-valued vectors |
Recombination | None |
Mutation | Gaussian perturbation |
Parent selection | Deterministic (each parent one offspring) |
Survivor selection | Probabilistic \((\mu+\lambda)\) |
Working with Evolutionary Algorithms
Typical applications for EAs
- Design Problem
- Optimising spending on improvements to national road network
- Computing costs negligible
- Six months to run the algorithm on hundreds of computers
- Many runs possible
- Must produce very good result just once
- Repetitive Problem
- Optimising Internet shopping delivery route
- Need to run regularly/repetitively
- Different destinations each day
- Limited time to run algorithm each day
- Must always be reasonably good route in limited time
- On-Line Control
- Darpa Robot Competition
- Need to run regularly/repetitively
- Limited time to run algorithm
- Must always deliver reasonably good solution in limited time
Academic Research on EAs
- Show that EC is applicable in a (new) problem domain (real-world applications)
- Show that my_EA is better than benchmark_EA
- Show that EAs outperform traditional algorithms
- Optimize or study impact of parameters on the performance of an EA
- Investigate algorithm behavior (e.g. interaction between selection and variation)
- See how an EA scales-up with problem size
- ..
Algorithm design
- Design a representation
- Design a way of mapping a genotype to a phenotype
- Design a way of evaluating an individual
- Design suitable mutation operator(s)
- Design suitable recombination operator(s)
- Decide how to select individuals to be parents
- Decide how to select individuals for the next generation (how to manage the population)
- Decide how to start: initialization method
- Decide how to stop: termination criterion
Measurements and statistics
Basic rules of experimentation
- EAs are stochastic -> never draw any conclusion from a single run
- perform sufficient number of independent runs
- use statistical measures (averages, standard deviations)
- use statistical tests to assess reliability of conclusions
- EA experimentation is about comparison -> always do a fair competition
- use the same amount of resources for the competitors
- try different comp. limits (to cope with turtle/hare effect)
- use the same performance measures
How to Compare EA Results?
- Success Rate: Proportion of runs within x% of target
- Mean Best Fitness: Average best solution over n runs
- Best result (“Peak performance”) over n runs
- Worst result over n runs
Peak vs Average Performance
- For repetitive tasks, average (or worst) performance is most relevant
- For design tasks, peak performance is most relevant
Measures
- Performance measures(off-line)
- Efficiency (alg. speed)
- Execution time
- Average no. of evaluations to solution (AES, i.e., number of generated points in the search space)
- Effectiveness (solution quality, also called accuracy)
- Success rate (SR): % of runs finding a solution
- Mean best fitness at termination (MBF)
- “Working” measures (on-line)
- Population distribution (genotypic)
- Fitness distribution (phenotypic)
- Improvements per time unit or per genetic operator
Statistical Comparisons and Significance
- If a claim is made “Mutation A is better than mutation B”, need to show statistical significance of comparisons
- Fundamental problem: two series of samples (random drawings) from the SAME distribution may have DIFFERENT averages and standard deviations
- Tests can show if the differences are significant or not
- Standard deviations supply additional info
- T-test (and alike) indicate the chance that the values came from the same underlying distribution (difference is due to random effects).
Summary for experiments
- Be organized
- Decide what you want & define appropriate measures
- Choose test problems carefully
- Make an experiment plan (estimate time when possible)
- Perform sufficient number of runs
- Keep all experimental data (never throw away anything)
- Include in publications all necessary parameters to make others able to repeat your experiments
- Use good statistics (“standard” tools from Web, MS, R)
- Present results well (figures, graphs, tables, …)
- Watch the scope of your claims
- Aim at generalizable results
- Publish code for reproducibility of results (if applicable)
- Publish data for external validation (open science)
Hybridisation with Other Techniques
Memetic Algorithms
- What is it?
- The combination of Evolutionary Algorithms with Local Search Operators that work within the EA loop has been termed “Memetic Algorithms”
- Term also applies to EAs that use instance specific knowledge
- Memetic Algorithms have been shown to be orders of magnitude faster and more accurate than EAs on some problems, and are the “state of the art” on many problems.
Reasons to hybridise
- Might be looking at improving on existing techniques (non-EA)
- Might be looking at improving EA search for good solutions
Local Search
Main Idea simplified
- Make a small, but intelligent (problem-specific), change to an existing solution
- If the change improves it, keep the improved version
- Otherwise, keep trying small, smart changes until it improves, or until we have tried all possible small changes
Def
- Defined by combination of neighbourhood and pivot rule
- N(x) is defined as the set of points that can be reached from x with one application of a move operator
- e.g. bit flipping search on binary problems
Pivot rules
- Is the neighbourhood searched randomly, systematically or exhaustively
- does the search stop as soon as a fitter neighbour is found (Greedy Ascent)
- or is the whole set of neighbours examined and the best chosen (Steepest Ascent)
- of course there is no one best answer, but some are quicker than others to run …
Evolution
- Do offspring inherit what their parents have “learnt” in life?
- Yes - Lamarckian evolution
- Traits acquired in parents lifetimes can be inherited by offspring
- Improved fitness and genotype
- No - Baldwinian evolution
- Improved fitness only
- Yes - Lamarckian evolution
- In practice, most recent Memetic Algorithms use:
- Pure Lamarckian evolution, or
- A stochastic mix of Lamarckian and Baldwinian evolution
Where to hybridise
Summary
- It is common practice to hybridise EA’s when using them in a real world context
- This may involve the use of operators from other algorithms which have already been used on the problem, or the incorporation of domain-specific knowledge
- Memetic algorithms have been shown to be orders of magnitude faster and more accurate than EAs on some problems, and are the “state of the art” on many problems
Multiobjective Evolutionary Algorithms
MOPs - Multi-Objective Problems
- A wide range of problems can be characterised by the presence of a number n of possibly conflicting objectives:
- buying a car: speed vs. price vs. reliability
- engineering design: lightness vs. strength
- Two problems:
- finding set of good solutions
- choice of best for the particular application
Two approaches to multiobjective optimisation
- Weighted sum (scalarisation):
- transform into a single objective optimisation method
- compute a weighted sum of the different objectives
- A set of multi-objective solutions (Pareto front):
- The population-based nature of EAs used to simultaneously search for a set of points approximating Pareto front
Comparing solutions
- Minimization example (points a–e in objective space; figure omitted):
- a is better than b
- a is better than c
- a is worse than e
- a and d are incomparable
Dominance relation
- Solution x dominates solution y, \(x \preceq y\) if:
- x is better than y in at least one objective,
- x is not worse than y in all other objectives
Pareto optimality
- Solution x is non-dominated among a set of solutions Q if no solution from Q dominates x
- A set of non-dominated solutions from the entire feasible solution space is the Pareto set, or Pareto front, its members Pareto-optimal solutions
Goal of multiobjective optimisers
- Find a set of non-dominated solutions (approximation set) following the criteria of:
- convergence (as close as possible to the Pareto optimal front),
- diversity (spread, distribution)
EC approach:
- Requirements:
- Way of assigning fitness and selecting individuals, usually based on dominance
- Preservation of a diverse set of points, similarities to multi-modal problems
- Aim: Evenly distributed population along the Pareto front
- Usually done by niching techniques such as:
- fitness sharing
- adding amount to fitness based on inverse distance to nearest neighbour
- All rely on some distance metric in genotype /phenotype / objective space
- Remembering all the non-dominated points you have seen, usually using elitism or an archive
- Could just use elitist algorithm, e.g. \( \mu + \lambda \) replacement
- Common to maintain an archive of non dominated points
- some algorithms use this as a second population that can be in recombination etc.
Summary
- MO problems occur very frequently
- EAs are very good in solving MO problems
- MOEAs are one of the most successful EC subareas
Machine learning and single-layer neural networks
What ML is
- A program that can learn from experience
- Modify execution based on new information
- Analyzing new data by applying known relevant information
When should ML be used
- Human expertise does not exist (navigating on Mars)
- Humans are unable to explain their expertise (speech recognition)
- Solution changes over time (self-driving vehicles)
- Solution needs to be adapted to particular cases (user preferences)
- Interfacing computers with the real world (noisy data)
- Dealing with large amounts of (complex) data
Types of ML
- Supervised learning: Training data include desired outputs. Based on this training set, the algorithm generalises to respond correctly to all possible inputs.
- Unsupervised learning: Training data does not include desired outputs, instead the algorithm tries to identify similarities between the inputs that have something in common so that similar items are categorised together.
- Reinforcement learning: The algorithm is told when the answer is wrong, but does not get told how to correct it. Algorithm must balance exploration of the unknown environment with exploitation of immediate rewards to maximize long-term rewards.
Supervised learning
- Training data provided in pairs of input x and desired output y
- Goal is to predict the output y for a given input x: \(y = f(x)\)
- Supervision part is output y for each input x.
- Examples:
- Classification
- Regression
Classification
- Training data consists of “inputs”, denoted x, and corresponding output “class labels”, denoted as y
- Goal is to correctly predict for a test data input the corresponding class label.
- Learn a “classifier” f(x) from the input data that outputs the class label or a probability over the class labels.
- Example:
- Input: image
- Output: category label, eg “cat” vs. “no cat”
- Two main phases:
- Training: Learn the classification model from labeled data.
- Prediction: Use the pre-built model to classify new instances.
- Classification creates boundaries in the input space between areas assigned to each class
Regression
- Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).
- Learn a continuous function.
- Example: given a set of observed data points (not shown here), can we find the value of the output when x = 0.44?
- Goal is to predict for input x an output f(x) that is close to the true y.
- It is generally a problem of function approximation, or interpolation, working out the value between values that we know.
Machine Learning Process
- Data Collection and Preparation
- Feature Selection and Extraction
- Algorithm Choice
- Parameters and Model Selection
- Training
- Evaluation
McCulloch and Pitts Neurons
- Sum the weighted inputs
- If total is greater than some threshold, neuron “fires”
- Otherwise does not
\(h = \sum_{i=1}^{m} x_i w_i \qquad o = \begin{cases} 1 & h \geq \theta \\ 0 & h < \theta \end{cases}\)
- The weight \(w_j\) can be positive or negative
- Inhibitory or excitatory.
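A direct reading of this rule as code; the inputs, weights and threshold \(\theta\) are supplied by the caller.
def mcp_neuron(inputs, weights, theta):
    # Fire (output 1) if the weighted sum of the inputs reaches the threshold
    h = sum(x * w for x, w in zip(inputs, weights))
    return 1 if h >= theta else 0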
Neural Networks
- Can put lots of McCulloch & Pitts neurons together.
- Connect them up in any way we like.
- In fact, assemblies of the neurons are capable of universal computation.
- Can perform any computation that a normal computer can.
- Just have to solve for all the weights \(w_{ij}\)
The Perceptron Network
- Training neurons is done by adapting the weights
- Supervised learning with target outputs.
Learning rule:
\(w_{ij} \leftarrow w_{ij} + \Delta w_{ij}\)
- Aim: minimize the error at the output
- If E = t-y, want E to be 0
- Use: \(\Delta w_{ij} = \eta(t_j - y_j)x_i\)
- Learning rate: \(\eta\)
- controls the size of the weight changes
- Big LR might overshoot, small LR learns slower.
- Desired output: \(t_j\)
- Actual output: \(y_j\)
- Error: \(t_j - y_j\)
- Input: \(x_i\)
Bias input
- A fix to make the neuron learn what output to give when all inputs are zero.
- Add to each neuron an extra input with a fixed value (e.g. -1): a bias node
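A sketch of the perceptron learning rule with the bias handled as an extra input fixed at -1; the data layout (NumPy arrays, one row per example), initial weights, learning rate and epoch count are assumptions.
import numpy as np

def train_perceptron(X, targets, eta=0.25, epochs=20):
    X = np.concatenate([X, -np.ones((X.shape[0], 1))], axis=1)   # append the bias input (-1)
    weights = np.random.uniform(-0.05, 0.05, X.shape[1])
    for _ in range(epochs):
        for x, t in zip(X, targets):
            y = 1 if np.dot(x, weights) > 0 else 0               # neuron fires if weighted sum > 0
            weights += eta * (t - y) * x                         # delta w = eta * (t - y) * x
    return weights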
Limitations
- A single layer perceptron can only learn linearly separable problems.
- Boolean AND function is linearly separable, whereas Boolean XOR function is not.
Multi-Layer Network
- A neural network with one or more layers of nodes between the input and the output nodes is called multilayer network
- The multilayer network structure, or architecture, or topology, consists of an input layer, one or more hidden layers, and one output layer.
- The input nodes pass values to the first hidden layer, its nodes pass values to the second, and so on until the outputs are produced.
- A network with a layer of input units, a layer of hidden units and a layer of output units is a two-layer network.
- A network with two layers of hidden units is a three layer network, and so on.
Multi-Layer Perceptron (MLP)
- No connections within a single layer.
- No direct connections between input and output layers
- Number of output units need not equal number of input units.
- Number of hidden units per layer can be more or less than input or output units.
Layers
- The first layer draws linear boundaries
- The second layer combines the boundaries
- The third layer can generate arbitrarily complex boundaries
Backpropagation Algorithm
- Forward step: Propagate activation from input to output layer
- Put the input values in the input layer
- Calculate the activations of the hidden nodes
- Calculate the activations of the output nodes
- Backward step: propagate errors from output to hidden layer
- Calculate the output errors
- Update last layer of weights
- Propagate error backward, update hidden weights
- Until first layer is reached
Error Function
- Single scalar function for entire network
- Parameterized by weights (objects of interest)
- Multiple errors of different signs should not cancel out
- Sum-of-squares error:
- \[E(w) = \frac{1}{2}\sum_k(t_k-y_k)^2=\frac{1}{2}\sum_k(t_k-\sum_iw_{ik}x_i)^2\]
Gradient Descent Learning
- Compute gradient \(\rightarrow\) differentiate sum-of squares error function
- \[\Delta w_{ik} = -\eta\frac{\partial E}{\partial w_{ik}}\]
Error terms
- For the outputs
- \[\delta_k = (y_k-t_k)g'(a_k)\]
- For the hidden nodes
- \[\delta_j = g'(u_j)\sum_k \delta_k w_{jk}\]
Update Rules
- For the weights connected to the outputs:
- \[w_{jk} \leftarrow w_{jk}-\eta\delta_kz_j\]
- For the weights on the hidden nodes:
- \[v_{ij} \leftarrow v_{ij}-\eta\delta_jz_i\]
- The learning rate \(\eta\) depends on the application
Algorithm (sequential)
- Apply an input vector and calculate all activations, a and u
- Evaluate deltas for all output units:
- \[\delta_k = (y_k-t_k)g'(a_k)\]
- Propagate deltas backwards to hidden layer deltas:
- \[\delta_j = g'(u_j)\sum_k \delta_k w_{jk}\]
- Update weights:
- \[w_{jk} \leftarrow w_{jk}-\eta\delta_kz_j\]
- \(v_{ij} \leftarrow v_{ij}-\eta\delta_jz_i\)
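A sketch of one sequential backpropagation update for a two-layer network with sigmoid activations; the matrix shapes (V: inputs x hidden, W: hidden x outputs) and variable names are assumptions, following the error terms and update rules above.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, V, W, eta=0.1):
    # Forward step: input -> hidden activations z -> output activations y
    z = sigmoid(V.T @ x)                         # hidden activations g(u_j)
    y = sigmoid(W.T @ z)                         # output activations g(a_k)
    # Backward step: output deltas, then hidden deltas
    delta_k = (y - t) * y * (1 - y)              # delta_k = (y_k - t_k) g'(a_k)
    delta_j = z * (1 - z) * (W @ delta_k)        # delta_j = g'(u_j) sum_k delta_k w_jk
    # Update rules
    W -= eta * np.outer(z, delta_k)              # w_jk <- w_jk - eta * delta_k * z_j
    V -= eta * np.outer(x, delta_j)              # v_ij <- v_ij - eta * delta_j * x_i
    return V, W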
Activation function
- Defines the output of a node given an input or set of inputs.
- Must be nonlinear, differentiable and have a bounded range
Sigmoidal Function
- Common in MLP
- \[g(a_i) = \frac{1}{1+exp(-ka_i)} = \frac{1}{1+e^{-ka_i}}\]
- k is a positive constant.
- Range from 0 to 1.
- Can use tanh(ka), same shape but range from -1 to 1.
Network Training
Batch Training
- Update weights after all inputs seen
- Need many epochs
- One loop through all the training data is called an epoch
Sequential Training
- Update weights after each input is seen
- Need many epochs
Minibatch
- Split all data into N minibatches
- Feed each minibatch to the network
- Shuffle data and repeat
- May lead to the benefits of both batch-learning and sequential learning:
- Rapid convergence
- Avoiding local minima
Avoiding local minimum/optimum
- Initialize training many times with random weights
- Use momentum:
- \[w_{ij} \leftarrow w_{ij} - \eta\delta_jz_i+\alpha\Delta w_{ij}^{t-1}\]
Practical Issues
Amount of Training
- Use 10 times more data than the number of weights
Generalization
- Aim of neural network learning: generalise from training examples to all possible inputs.
- The objective of learning is to achieve good generalization to new cases; we cannot train on all possible data.
- Underfitting:
- Symptoms:
- High training error
- Training error close to test error
- High bias
- Possible remedies:
- Complexify model
- Add more features
- Train longer
- Just right:
- Symptoms:
- Training error slightly lower than test error
- Overfitting:
- Symptoms:
- Very low training error
- Training error much lower than test error
- High variance
- Possible remedies:
- Perform regularization
- Get more data
- (Figure omitted: approximation of \(y = f(x)\) for an underfitted, well-fitted, and overfitted model.)
Overfitting solution - Cross-Validation
To maximize generalization and avoid overfitting, split data into three sets:
- Training set: Train the model.
- Validation set: Judge the model’s generalization ability during training.
- Test set: Judge the model’s generalization ability after training.
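A minimal way to make such a split; the 60/20/20 proportions and the NumPy-array data layout are arbitrary assumptions.
import numpy as np

def train_val_test_split(X, y, train=0.6, val=0.2, seed=0):
    order = np.random.default_rng(seed).permutation(len(X))   # shuffle before splitting
    n_train, n_val = int(train * len(X)), int(val * len(X))
    tr = order[:n_train]
    va = order[n_train:n_train + n_val]
    te = order[n_train + n_val:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])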
Validation set
- Data unseen by training algorithm – not used for backpropagation.
- Network is not trained on this data, so we can use it to measure generalization ability.
- Goal is to maximize generalization ability, so we should minimize the error on this data set.
Testing set
- Data unseen during training and validation.
- Has no influence on when to stop training.
- With early stopping, we’ve maximized the ability to generalize to the validation set;
- To judge the final result, we should measure its ability to generalize to completely unseen data.
Reinforcement Learning
An agent has a state, and based on its actions it receives rewards/punishments. The goal is to learn actions that maximize reward.
- Learning is guided by the reward
- An infrequent numerical feedback indicating how well we are doing.
- Problems:
- The reward does not tell us what we should have done!
- The reward may be delayed – does not always indicate when we made a mistake.
The reward Function
- Corresponds to the fitness function of an evolutionary algorithm.
- \(r_{t+1}\) is a function of \((s_t, a_t)\)
- The reward is a numeric value, and can be negative(punishment).
- Can be given throughout the learning episode, or only in the end.
- Goal is to maximize total reward.
Maximizing total reward
- Total reward: \(\sum\limits_{t=0}^{N-1} r_{t+1}\)
Future rewards may be uncertain and we might care more about rewards that come soon. Therefore, we discount future rewards:
- \(R = \sum\limits_{t=0}^{\infty} \gamma^t r_{t+1}, \qquad 0 \leq \gamma \leq 1\) or \(R_t = \sum\limits_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \leq \gamma \leq 1\)
- Future reward: \(R = r_1 + r_2 + ... + r_n\), and from time step t: \(R_t = r_t + r_{t+1} + ... + r_n\)
- Discount future rewards
- A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward.
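A small sketch of the discounted return computed from a recorded reward sequence \(r_1, r_2, \ldots\):
def discounted_return(rewards, gamma=0.9):
    # R = r_1 + gamma * r_2 + gamma^2 * r_3 + ...   with 0 <= gamma <= 1
    return sum(gamma ** t * r for t, r in enumerate(rewards))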
Action Selection
- At each learning stage, the RL algorithm looks at the possible actions and calculates the expected average reward.
- Based on \(Q_{s,t}(a)\), an action will be selected using:
- Greedy strategy: pure exploitation
- \(\epsilon\)-Greedy strategy: exploitation with a little exploration
- Soft-Max strategy: \(P(Q_{s,t}(a)) = \frac{e^{Q_{s,t}(a)/\tau}}{\sum_b e^{Q_{s,t}(b)/\tau}}\)
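Sketches of the three strategies over estimated values \(Q_{s,t}(a)\); the dictionary layout (action -> value) and the parameters \(\epsilon\) and \(\tau\) are assumptions.
import math, random

def greedy(q_values):
    # Pure exploitation: always take the action with the highest estimated value
    return max(q_values, key=q_values.get)

def epsilon_greedy(q_values, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return greedy(q_values)

def soft_max(q_values, tau=1.0):
    # P(a) = exp(Q(a)/tau) / sum_b exp(Q(b)/tau)
    actions = list(q_values)
    weights = [math.exp(q_values[a] / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]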
Policy \((\pi)\) and Value (\(V\))
- The set of actions we take defines our policy \((\pi)\).
- The expected rewards we get in return define our value \((V)\).
Markov Decision Process
- If we only need to know the current state, the problem has the Markov property.
- No Markov Property:
- \[P_r ( r_{t+1} = r',s_{t+1} = s'|s_t,a_t,r_t,...,r_1,s_0,a_0 )\]
- Markov property:
- \[P_r(r_{t+1} = r',s_{t+1} = s'|s_t,a_t)\]
Value
- The expected future reward is known as the value.
- Two ways to compute the value:
- The value of a state, \(V(s)\), averaged over all possible actions in that state.
- (state-value function)
- \[V(s) = E(R_t|s_t = s) = E(\sum_{i=0}^{\infty} \gamma^i r_{t+i+1} | s_t = s)\]
- The value of a state/action pair, \(Q(s,a)\).
- (action-value function)
- \[Q(s,a) = E(R_t | s_t = s,a_t = a) = E(\sum_{i=0}^{\infty} \gamma^i r_{t+i+1} | s_t = s, a_t = a)\]
- Q and V are initially unknown, and learned iteratively as we gain experience