Spaces:

MilesCranmer
/

PySR

Sleeping

App Files Files Community

PySR / README.md

MilesCranmer

Calculate length as constant

627c408 about 4 years ago

preview code

raw

history blame

8.42 kB

	# Eureqa.jl

	**Symbolic regression built on Julia, and interfaced by Python.
	Uses regularized evolution and simulated annealing.**

	Backstory: we used the original
	[eureqa](https://www.creativemachineslab.com/eureqa.html)
	in our [paper](https://arxiv.org/abs/2006.11287) to
	convert a graph neural network into
	an analytic equation describing dark matter overdensity. However,
	eureqa is GUI-only, doesn't allow for user-defined
	operators, has no distributed capabilities,
	and has become proprietary. Thus, the goal
	of this package is to have an open-source symbolic regression tool
	as efficient as eureqa, while also exposing a configurable
	python interface.

	The algorithms here implement regularized evolution, as in
	[AutoML-Zero](https://arxiv.org/abs/2003.03384),
	but with additional algorithmic changes such as simulated
	annealing, and classical optimization of constants.


	## Installation

	Install [Julia](https://julialang.org/downloads/). Then, at the command line,
	install the `Optim` package via: `julia -e 'import Pkg; Pkg.add("Optim")'`.
	For python, you need to have Python 3, numpy, and pandas installed.

	## Running:

	### Quickstart

	```python
	import numpy as np
	from eureqa import eureqa

	# Dataset
	X = 2*np.random.randn(100, 5)
	y = 2np.cos(X[:, 3]) + X[:, 0]*2 - 2

	# Learn equations
	equations = eureqa(X, y, niterations=5)

	...

	print(equations)
	```

	which gives:

	```
	Complexity MSE Equation
	0 5 1.947431 plus(-1.7420927, mult(x0, x0))
	1 8 0.486858 plus(-1.8710494, plus(cos(x3), mult(x0, x0)))
	2 11 0.000000 plus(plus(mult(x0, x0), cos(x3)), plus(-2.0, cos(x3)))
	```

	### API

	What follows is the API reference for running the numpy interface.
	Note that most parameters here
	have been tuned with ~1000 trials over several example
	equations, so you likely don't need to tune them yourself.
	However, you should adjust `threads`, `niterations`,
	`binary_operators`, `unary_operators`, and `maxsize` to your requirements.

	The program will output a pandas DataFrame containing the equations,
	mean square error, and complexity. It will also dump to a csv
	at the end of every iteration,
	which is `hall_of_fame.csv` by default. It also prints the
	equations to stdout.

	You can add more operators in `operators.jl`, or use default
	Julia ones. Make sure all operators are defined for scalar `Float32`.
	Then just specify the operator names in your call, as above.
	You can also change the dataset learned on by passing in `X` and `y` as
	numpy arrays to `eureqa(...)`.

	```python
	eureqa(X=None, y=None, threads=4, niterations=20,
	ncyclesperiteration=int(default_ncyclesperiteration),
	binary_operators=["plus", "mult"], unary_operators=["cos", "exp", "sin"],
	alpha=default_alpha, annealing=True, fractionReplaced=default_fractionReplaced,
	fractionReplacedHof=default_fractionReplacedHof, npop=int(default_npop),
	parsimony=default_parsimony, migration=True, hofMigration=True
	shouldOptimizeConstants=True, topn=int(default_topn),
	weightAddNode=default_weightAddNode, weightDeleteNode=default_weightDeleteNode,
	weightDoNothing=default_weightDoNothing,
	weightMutateConstant=default_weightMutateConstant,
	weightMutateOperator=default_weightMutateOperator,
	weightRandomize=default_weightRandomize, weightSimplify=default_weightSimplify,
	timeout=None, equation_file='hall_of_fame.csv', test='simple1', maxsize=20)
	```

	Run symbolic regression to fit f(X[i, :]) ~ y[i] for all i.

	Arguments:

	- `X`: np.ndarray, 2D array. Rows are examples, columns are features.
	- `y`: np.ndarray, 1D array. Rows are examples.
	- `threads`: int, Number of threads (=number of populations running).
	You can have more threads than cores - it actually makes it more
	efficient.
	- `niterations`: int, Number of iterations of the algorithm to run. The best
	equations are printed, and migrate between populations, at the
	end of each.
	- `ncyclesperiteration`: int, Number of total mutations to run, per 10
	samples of the population, per iteration.
	- `binary_operators`: list, List of strings giving the binary operators
	in Julia's Base, or in `operator.jl`.
	- `unary_operators`: list, Same but for operators taking a single `Float32`.
	- `alpha`: float, Initial temperature.
	- `annealing`: bool, Whether to use annealing. You should (and it is default).
	- `fractionReplaced`: float, How much of population to replace with migrating
	equations from other populations.
	- `fractionReplacedHof`: float, How much of population to replace with migrating
	equations from hall of fame.
	- `npop`: int, Number of individuals in each population
	- `parsimony`: float, Multiplicative factor for how much to punish complexity.
	- `migration`: bool, Whether to migrate.
	- `hofMigration`: bool, Whether to have the hall of fame migrate.
	- `shouldOptimizeConstants`: bool, Whether to numerically optimize
	constants (Nelder-Mead/Newton) at the end of each iteration.
	- `topn`: int, How many top individuals migrate from each population.
	- `weightAddNode`: float, Relative likelihood for mutation to add a node
	- `weightDeleteNode`: float, Relative likelihood for mutation to delete a node
	- `weightDoNothing`: float, Relative likelihood for mutation to leave the individual
	- `weightMutateConstant`: float, Relative likelihood for mutation to change
	the constant slightly in a random direction.
	- `weightMutateOperator`: float, Relative likelihood for mutation to swap
	an operator.
	- `weightRandomize`: float, Relative likelihood for mutation to completely
	delete and then randomly generate the equation
	- `weightSimplify`: float, Relative likelihood for mutation to simplify
	constant parts by evaluation
	- `timeout`: float, Time in seconds to timeout search
	- `equation_file`: str, Where to save the files (.csv separated by \|)
	- `test`: str, What test to run, if X,y not passed.
	- `maxsize`: int, Max size of an equation.

	Returns:

	pd.DataFrame, Results dataframe, giving complexity, MSE, and equations
	(as strings).


	# TODO

	- [ ] Make scaling of changes to constant a hyperparameter
	- [ ] Update hall of fame every iteration?
	- [ ] Calculate feature importances of future mutations, by looking at correlation between residual of model, and the features.
	- Store feature importances of future, and periodically update it.
	- [ ] Implement more parts of the original Eureqa algorithms: https://www.creativemachineslab.com/eureqa.html
	- [ ] Sympy printing
	- [ ] Consider adding mutation for constant<->variable
	- [ ] Consider adding mutation to pass an operator in through a new binary operator (e.g., exp(x3)->plus(exp(x3), ...))
	- [ ] Hierarchical model, so can re-use functional forms. Output of one equation goes into second equation?
	- [ ] Use NN to generate weights over all probability distribution conditional on error and existing equation, and train on some randomly-generated equations
	- [ ] Performance:
	- [ ] Use an enum for functions instead of storing them?
	- Current most expensive operations:
	- [ ] Calculating the loss function - there is duplicate calculations happening.
	- [x] Declaration of the weights array every iteration
	- [x] Add a node at the top of a tree
	- [x] Insert a node at the top of a subtree
	- [x] Record very best individual in each population, and return at end.
	- [x] Write our own tree copy operation; deepcopy() is the slowest operation by far.
	- [x] Hyperparameter tune
	- [x] Create a benchmark for accuracy
	- [x] Add interface for either defining an operation to learn, or loading in arbitrary dataset.
	- Could just write out the dataset in julia, or load it.
	- [x] Create a Python interface
	- [x] Explicit constant optimization on hall-of-fame
	- Create method to find and return all constants, from left to right
	- Create method to find and set all constants, in same order
	- Pull up some optimization algorithm and add it. Keep the package small!
	- [x] Create a benchmark for speed
	- [x] Simplify subtrees with only constants beneath them. Or should I? Maybe randomly simplify sometimes?
	- [x] Record hall of fame
	- [x] Optionally (with hyperparameter) migrate the hall of fame, rather than current bests
	- [x] Test performance of reduced precision integers
	- No effect
	- [x] Create struct to pass through all hyperparameters, instead of treating as constants
	- Make sure doesn't affect performance