MilesCranmer committed
Commit e1e2d20
1 Parent(s): 82bd48e

Split up readme into docs

Files changed (4)
  1. README.md +0 -260
  2. TODO.md +123 -0
  3. docs/options.md +123 -0
  4. pydoc-markdown.yml +12 -5
README.md CHANGED
@@ -91,263 +91,3 @@ The newest version of PySR also returns three additional columns:
  - `sympy_format` - sympy equation.
  - `lambda_format` - a lambda function for that equation, that you can pass values through.

- ### Custom operators
-
- One can define custom operators in Julia by passing a string:
- ```python
- equations = pysr.pysr(X, y, niterations=100,
- binary_operators=["mult", "plus", "special(x, y) = x^2 + y"],
- extra_sympy_mappings={'special': lambda x, y: x**2 + y},
- unary_operators=["cos"])
- ```
-
- Now, the symbolic regression code can search using this `special` function
- that squares its left argument and adds it to its right. Make sure
- all passed functions are valid Julia code, and take one (unary)
- or two (binary) float32 scalars as input, and output a float32. This means if you
- write any real constants in your operator, like `2.5`, you have to write them
- instead as `2.5f0`, which defines it as `Float32`.
- Operators are automatically vectorized.
-
- We also define `extra_sympy_mappings`,
- so that the SymPy code can understand the output equation from Julia,
- when constructing a useable function. This step is optional, but
- is necessary for the `lambda_format` to work.
-
- One can also edit `operators.jl`. See below for more options.
-
- ### Weighted data
-
- Here, we assign weights to each row of data
- using inverse uncertainty squared. We also use 10 processes
- instead of the usual 4, which creates more populations
- (one population per thread).
- ```python
- sigma = ...
- weights = 1/sigma**2
-
- equations = pysr.pysr(X, y, weights=weights, procs=10)
- ```
-
-
- # Full API
-
- What follows is the API reference for running the numpy interface.
- You likely don't need to tune the hyperparameters yourself,
- but if you would like, you can use `hyperparamopt.py` as an example.
- However, you should adjust `procs`, `niterations`,
- `binary_operators`, `unary_operators`, and `maxsize`
- to your requirements.
-
- The program will output a pandas DataFrame containing the equations,
- mean square error, and complexity. It will also dump to a csv
- at the end of every iteration,
- which is `hall_of_fame.csv` by default. It also prints the
- equations to stdout.
-
- ```python
- pysr(X=None, y=None, weights=None, procs=4, populations=None, niterations=100, ncyclesperiteration=300, binary_operators=["plus", "mult"], unary_operators=["cos", "exp", "sin"], alpha=0.1, annealing=True, fractionReplaced=0.10, fractionReplacedHof=0.10, npop=1000, parsimony=1e-4, migration=True, hofMigration=True, shouldOptimizeConstants=True, topn=10, weightAddNode=1, weightInsertNode=3, weightDeleteNode=3, weightDoNothing=1, weightMutateConstant=10, weightMutateOperator=1, weightRandomize=1, weightSimplify=0.01, perturbationFactor=1.0, nrestarts=3, timeout=None, extra_sympy_mappings={}, equation_file='hall_of_fame.csv', test='simple1', verbosity=1e9, maxsize=20, fast_cycle=False, maxdepth=None, variable_names=[], batching=False, batchSize=50, select_k_features=None, threads=None, julia_optimization=3)
- ```
-
- Run symbolic regression to fit f(X[i, :]) ~ y[i] for all i.
- Note: most default parameters have been tuned over several example
- equations, but you should adjust `threads`, `niterations`,
- `binary_operators`, `unary_operators` to your requirements.
-
- **Arguments**:
-
- - `X`: np.ndarray or pandas.DataFrame, 2D array. Rows are examples,
- columns are features. If pandas DataFrame, the columns are used
- for variable names (so make sure they don't contain spaces).
- - `y`: np.ndarray, 1D array. Rows are examples.
- - `weights`: np.ndarray, 1D array. Each row is how to weight the
- mean-square-error loss on weights.
- - `procs`: int, Number of processes (=number of populations running).
- - `populations`: int, Number of populations running; by default=procs.
- - `niterations`: int, Number of iterations of the algorithm to run. The best
- equations are printed, and migrate between populations, at the
- end of each.
- - `ncyclesperiteration`: int, Number of total mutations to run, per 10
- samples of the population, per iteration.
- - `binary_operators`: list, List of strings giving the binary operators
- in Julia's Base, or in `operator.jl`.
- - `unary_operators`: list, Same but for operators taking a single `Float32`.
- - `alpha`: float, Initial temperature.
- - `annealing`: bool, Whether to use annealing. You should (and it is default).
- - `fractionReplaced`: float, How much of population to replace with migrating
- equations from other populations.
- - `fractionReplacedHof`: float, How much of population to replace with migrating
- equations from hall of fame.
- - `npop`: int, Number of individuals in each population
- - `parsimony`: float, Multiplicative factor for how much to punish complexity.
- - `migration`: bool, Whether to migrate.
- - `hofMigration`: bool, Whether to have the hall of fame migrate.
- - `shouldOptimizeConstants`: bool, Whether to numerically optimize
- constants (Nelder-Mead/Newton) at the end of each iteration.
- - `topn`: int, How many top individuals migrate from each population.
- - `nrestarts`: int, Number of times to restart the constant optimizer
- - `perturbationFactor`: float, Constants are perturbed by a max
- factor of (perturbationFactor\*T + 1). Either multiplied by this
- or divided by this.
- - `weightAddNode`: float, Relative likelihood for mutation to add a node
- - `weightInsertNode`: float, Relative likelihood for mutation to insert a node
- - `weightDeleteNode`: float, Relative likelihood for mutation to delete a node
- - `weightDoNothing`: float, Relative likelihood for mutation to leave the individual
- - `weightMutateConstant`: float, Relative likelihood for mutation to change
- the constant slightly in a random direction.
- - `weightMutateOperator`: float, Relative likelihood for mutation to swap
- an operator.
- - `weightRandomize`: float, Relative likelihood for mutation to completely
- delete and then randomly generate the equation
- - `weightSimplify`: float, Relative likelihood for mutation to simplify
- constant parts by evaluation
- - `timeout`: float, Time in seconds to timeout search
- - `equation_file`: str, Where to save the files (.csv separated by |)
- - `test`: str, What test to run, if X,y not passed.
- - `maxsize`: int, Max size of an equation.
- - `maxdepth`: int, Max depth of an equation. You can use both maxsize and maxdepth.
- maxdepth is by default set to = maxsize, which means that it is redundant.
- - `fast_cycle`: bool, (experimental) - batch over population subsamples. This
- is a slightly different algorithm than regularized evolution, but does cycles
- 15% faster. May be algorithmically less efficient.
- - `variable_names`: list, a list of names for the variables, other
- than "x0", "x1", etc.
- - `batching`: bool, whether to compare population members on small batches
- during evolution. Still uses full dataset for comparing against
- hall of fame.
- - `batchSize`: int, the amount of data to use if doing batching.
- - `select_k_features`: (None, int), whether to run feature selection in
- Python using random forests, before passing to the symbolic regression
- code. None means no feature selection; an int means select that many
- features.
- - `julia_optimization`: int, Optimization level (0, 1, 2, 3)
-
- **Returns**:
-
- pd.DataFrame, Results dataframe, giving complexity, MSE, and equations
- (as strings).
-
-
- # TODO
-
- - [x] Async threading, and have a server of equations. So that threads aren't waiting for others to finish.
- - [x] Print out speed of equation evaluation over time. Measure time it takes per cycle
- - [x] Add ability to pass an operator as an anonymous function string. E.g., `binary_operators=["g(x, y) = x+y"]`.
- - [x] Add error bar capability (thanks Johannes Buchner for suggestion)
- - [x] Why don't the constants continually change? It should optimize them every time the equation appears.
- - Restart the optimizer to help with this.
- - [x] Add several common unary and binary operators; list these.
- - [x] Try other initial conditions for optimizer
- - [x] Make scaling of changes to constant a hyperparameter
- - [x] Make deletion op join deleted subtree to parent
- - [x] Update hall of fame every iteration?
- - Seems to overfit early if we do this.
- - [x] Consider adding mutation to pass an operator in through a new binary operator (e.g., exp(x3)->plus(exp(x3), ...))
- - (Added full insertion operator
- - [x] Add a node at the top of a tree
- - [x] Insert a node at the top of a subtree
- - [x] Record very best individual in each population, and return at end.
- - [x] Write our own tree copy operation; deepcopy() is the slowest operation by far.
- - [x] Hyperparameter tune
- - [x] Create a benchmark for accuracy
- - [x] Add interface for either defining an operation to learn, or loading in arbitrary dataset.
- - Could just write out the dataset in julia, or load it.
- - [x] Create a Python interface
- - [x] Explicit constant optimization on hall-of-fame
- - Create method to find and return all constants, from left to right
- - Create method to find and set all constants, in same order
- - Pull up some optimization algorithm and add it. Keep the package small!
- - [x] Create a benchmark for speed
- - [x] Simplify subtrees with only constants beneath them. Or should I? Maybe randomly simplify sometimes?
- - [x] Record hall of fame
- - [x] Optionally (with hyperparameter) migrate the hall of fame, rather than current bests
- - [x] Test performance of reduced precision integers
- - No effect
- - [x] Create struct to pass through all hyperparameters, instead of treating as constants
- - Make sure doesn't affect performance
- - [x] Rename package to avoid trademark issues
- - PySR?
- - [x] Put on PyPI
- - [x] Treat baseline as a solution.
- - [x] Print score alongside MSE: \delta \log(MSE)/\delta \log(complexity)
- - [x] Calculating the loss function - there is duplicate calculations happening.
- - [x] Declaration of the weights array every iteration
- - [x] Sympy evaluation
- - [x] Threaded recursion
- - [x] Test suite
- - [x] Performance: - Use an enum for functions instead of storing them?
- - Gets ~40% speedup on small test.
- - [x] Use @fastmath
- - [x] Try @spawn over each sub-population. Do random sort, compute mutation for each, then replace 10% oldest.
- - [x] Control max depth, rather than max number of nodes?
- - [x] Allow user to pass names for variables - use these when printing
- - [x] Check for domain errors in an equation quickly before actually running the entire array over it. (We do this now recursively - every single equation is checked for nans/infs when being computed.)
- - [ ] Sort these todo lists by priority
-
- ## Feature ideas
-
- - [ ] Cross-validation
- - [ ] read the docs page
- - [ ] Sympy printing
- - [ ] Better cleanup of zombie processes after <ctl-c>
- - [ ] Hierarchical model, so can re-use functional forms. Output of one equation goes into second equation?
- - [ ] Call function to read from csv after running, so dont need to run again
- - [ ] Add function to plot equations
- - [ ] Refresh screen rather than dumping to stdout?
- - [ ] Add ability to save state from python
- - [ ] Additional degree operators?
- - [ ] Multi targets (vector ops). Idea 1: Node struct contains argument for which registers it is applied to. Then, can work with multiple components simultaneously. Though this may be tricky to get right. Idea 2: each op is defined by input/output space. Some operators are flexible, and the spaces should be adjusted automatically. Otherwise, only consider ops that make a tree possible. But will need additional ops here to get it to work. Idea 3: define each equation in 2 parts: one part that is shared between all outputs, and one that is different between all outputs. Maybe this could be an array of nodes corresponding to each output. And those nodes would define their functions.
- - [ ] Tree crossover? I.e., can take as input a part of the same equation, so long as it is the same level or below?
- - [ ] Consider printing output sorted by score, not by complexity.
- - [ ] Dump scores alongside MSE to .csv (and return with Pandas).
- - [ ] Create flexible way of providing "simplification recipes." I.e., plus(plus(T, C), C) => plus(T, +(C, C)). The user could pass these.
- - [ ] Consider allowing multi-threading turned off, for faster testing (cache issue on travis). Or could simply fix the caching issue there.
- - [ ] Consider returning only the equation of interest; rather than all equations.
- - [ ] Enable derivative operators. These would differentiate their right argument wrt their left argument, some input variable.
-
- ## Algorithmic performance ideas:
-
- - [ ] Idea: use gradient of equation with respect to each operator (perhaps simply add to each operator) to tell which part is the most "sensitive" to changes. Then, perhaps insert/delete/mutate on that part of the tree?
- - [ ] Start populations staggered; so that there is more frequent printing (and pops that start a bit later get hall of fame already)?
- - [ ] Consider adding mutation for constant<->variable
- - [ ] Implement more parts of the original Eureqa algorithms: https://www.creativemachineslab.com/eureqa.html
- - [ ] Experiment with freezing parts of model; then we only append/delete at end of tree.
- - [ ] Use NN to generate weights over all probability distribution conditional on error and existing equation, and train on some randomly-generated equations
- - [ ] For hierarchical idea: after running some number of iterations, do a search for "most common pattern". Then, turn that subtree into its own operator.
- - [ ] Calculate feature importances based on features we've already seen, then weight those features up in all random generations.
- - [ ] Calculate feature importances of future mutations, by looking at correlation between residual of model, and the features.
- - Store feature importances of future, and periodically update it.
- - [ ] Punish depth rather than size, as depth really hurts during optimization.
-
-
- ## Code performance ideas:
-
- - [ ] Try defining a binary tree as an array, rather than a linked list. See https://stackoverflow.com/a/6384714/2689923
- - [ ] Add true multi-node processing, with MPI, or just file sharing. Multiple populations per core.
- - Ongoing in cluster branch
- - [ ] Performance: try inling things?
- - [ ] Try storing things like number nodes in a tree; then can iterate instead of counting
-
- ```julia
- mutable struct Tree
- degree::Array{Integer, 1}
- val::Array{Float32, 1}
- constant::Array{Bool, 1}
- op::Array{Integer, 1}
- Tree(s::Integer) = new(zeros(Integer, s), zeros(Float32, s), zeros(Bool, s), zeros(Integer, s))
- end
- ```
-
- - Then, we could even work with trees on the GPU, since they are all pre-allocated arrays.
- - A population could be a Tree, but with degree 2 on all the degrees. So a slice of population arrays forms a tree.
- - How many operations can we do via matrix ops? Mutate node=>easy.
- - Can probably batch and do many operations at once across a population.
- - Or, across all populations! Mutate operator: index 2D array and set it to random vector? But the indexing might hurt.
- - The big advantage: can evaluate all new mutated trees at once; as massive matrix operation.
- - Can control depth, rather than maxsize. Then just pretend all trees are full and same depth. Then we really don't need to care about depth.
-
- - [ ] Can we cache calculations, or does the compiler do that? E.g., I should only have to run exp(x0) once; after that it should be read from memory.
- - Done on caching branch. Currently am finding that this is quiet slow (presumably because memory allocation is the main issue).
- - [ ] Add GPU capability?
- - Not sure if possible, as binary trees are the real bottleneck.
- - Could generate on CPU, evaluate score on GPU?
 
TODO.md ADDED
@@ -0,0 +1,123 @@
# TODO

- [x] Async threading, and have a server of equations. So that threads aren't waiting for others to finish.
- [x] Print out speed of equation evaluation over time. Measure time it takes per cycle.
- [x] Add ability to pass an operator as an anonymous function string. E.g., `binary_operators=["g(x, y) = x+y"]`.
- [x] Add error bar capability (thanks Johannes Buchner for suggestion)
- [x] Why don't the constants continually change? It should optimize them every time the equation appears.
  - Restart the optimizer to help with this.
- [x] Add several common unary and binary operators; list these.
- [x] Try other initial conditions for optimizer
- [x] Make scaling of changes to constant a hyperparameter
- [x] Make deletion op join deleted subtree to parent
- [x] Update hall of fame every iteration?
  - Seems to overfit early if we do this.
- [x] Consider adding mutation to pass an operator in through a new binary operator (e.g., exp(x3)->plus(exp(x3), ...))
  - (Added full insertion operator)
- [x] Add a node at the top of a tree
- [x] Insert a node at the top of a subtree
- [x] Record very best individual in each population, and return at end.
- [x] Write our own tree copy operation; deepcopy() is the slowest operation by far.
- [x] Hyperparameter tune
- [x] Create a benchmark for accuracy
- [x] Add interface for either defining an operation to learn, or loading in arbitrary dataset.
  - Could just write out the dataset in julia, or load it.
- [x] Create a Python interface
- [x] Explicit constant optimization on hall-of-fame
  - Create method to find and return all constants, from left to right
  - Create method to find and set all constants, in same order
  - Pull up some optimization algorithm and add it. Keep the package small!
- [x] Create a benchmark for speed
- [x] Simplify subtrees with only constants beneath them. Or should I? Maybe randomly simplify sometimes?
- [x] Record hall of fame
- [x] Optionally (with hyperparameter) migrate the hall of fame, rather than current bests
- [x] Test performance of reduced precision integers
  - No effect
- [x] Create struct to pass through all hyperparameters, instead of treating as constants
  - Make sure doesn't affect performance
- [x] Rename package to avoid trademark issues
  - PySR?
- [x] Put on PyPI
- [x] Treat baseline as a solution.
- [x] Print score alongside MSE: \delta \log(MSE)/\delta \log(complexity)
- [x] Calculating the loss function - there are duplicate calculations happening.
- [x] Declaration of the weights array every iteration
- [x] Sympy evaluation
- [x] Threaded recursion
- [x] Test suite
- [x] Performance: use an enum for functions instead of storing them?
  - Gets ~40% speedup on small test.
- [x] Use @fastmath
- [x] Try @spawn over each sub-population. Do random sort, compute mutation for each, then replace 10% oldest.
- [x] Control max depth, rather than max number of nodes?
- [x] Allow user to pass names for variables - use these when printing
- [x] Check for domain errors in an equation quickly before actually running the entire array over it. (We do this now recursively - every single equation is checked for nans/infs when being computed.)
- [ ] Sort these todo lists by priority

## Feature ideas

- [ ] Cross-validation
- [ ] read the docs page
- [ ] Sympy printing
- [ ] Better cleanup of zombie processes after <ctl-c>
- [ ] Hierarchical model, so can re-use functional forms. Output of one equation goes into second equation?
- [ ] Call function to read from csv after running, so don't need to run again
- [ ] Add function to plot equations
- [ ] Refresh screen rather than dumping to stdout?
- [ ] Add ability to save state from python
- [ ] Additional degree operators?
- [ ] Multi targets (vector ops). Idea 1: Node struct contains argument for which registers it is applied to. Then, can work with multiple components simultaneously. Though this may be tricky to get right. Idea 2: each op is defined by input/output space. Some operators are flexible, and the spaces should be adjusted automatically. Otherwise, only consider ops that make a tree possible. But will need additional ops here to get it to work. Idea 3: define each equation in 2 parts: one part that is shared between all outputs, and one that is different between all outputs. Maybe this could be an array of nodes corresponding to each output. And those nodes would define their functions.
- [ ] Tree crossover? I.e., can take as input a part of the same equation, so long as it is the same level or below?
- [ ] Consider printing output sorted by score, not by complexity.
- [ ] Dump scores alongside MSE to .csv (and return with Pandas).
- [ ] Create flexible way of providing "simplification recipes." I.e., plus(plus(T, C), C) => plus(T, +(C, C)). The user could pass these.
- [ ] Consider allowing multi-threading turned off, for faster testing (cache issue on travis). Or could simply fix the caching issue there.
- [ ] Consider returning only the equation of interest, rather than all equations.
- [ ] Enable derivative operators. These would differentiate their right argument wrt their left argument, some input variable.

## Algorithmic performance ideas

- [ ] Idea: use gradient of equation with respect to each operator (perhaps simply add to each operator) to tell which part is the most "sensitive" to changes. Then, perhaps insert/delete/mutate on that part of the tree?
- [ ] Start populations staggered, so that there is more frequent printing (and pops that start a bit later get hall of fame already)?
- [ ] Consider adding mutation for constant<->variable
- [ ] Implement more parts of the original Eureqa algorithms: https://www.creativemachineslab.com/eureqa.html
- [ ] Experiment with freezing parts of model; then we only append/delete at end of tree.
- [ ] Use NN to generate weights over all probability distribution conditional on error and existing equation, and train on some randomly-generated equations
- [ ] For hierarchical idea: after running some number of iterations, do a search for "most common pattern". Then, turn that subtree into its own operator.
- [ ] Calculate feature importances based on features we've already seen, then weight those features up in all random generations.
- [ ] Calculate feature importances of future mutations, by looking at correlation between residual of model, and the features.
  - Store feature importances of future, and periodically update it.
- [ ] Punish depth rather than size, as depth really hurts during optimization.

## Code performance ideas

- [ ] Try defining a binary tree as an array, rather than a linked list. See https://stackoverflow.com/a/6384714/2689923
- [ ] Add true multi-node processing, with MPI, or just file sharing. Multiple populations per core.
  - Ongoing in cluster branch
- [ ] Performance: try inlining things?
- [ ] Try storing things like number of nodes in a tree; then can iterate instead of counting

```julia
mutable struct Tree
    degree::Array{Integer, 1}
    val::Array{Float32, 1}
    constant::Array{Bool, 1}
    op::Array{Integer, 1}
    Tree(s::Integer) = new(zeros(Integer, s), zeros(Float32, s), zeros(Bool, s), zeros(Integer, s))
end
```

- Then, we could even work with trees on the GPU, since they are all pre-allocated arrays.
- A population could be a Tree, but with degree 2 on all the degrees. So a slice of population arrays forms a tree.
- How many operations can we do via matrix ops? Mutate node=>easy.
- Can probably batch and do many operations at once across a population.
- Or, across all populations! Mutate operator: index 2D array and set it to random vector? But the indexing might hurt.
- The big advantage: can evaluate all new mutated trees at once; as massive matrix operation.
- Can control depth, rather than maxsize. Then just pretend all trees are full and same depth. Then we really don't need to care about depth.

- [ ] Can we cache calculations, or does the compiler do that? E.g., I should only have to run exp(x0) once; after that it should be read from memory.
  - Done on caching branch. Currently am finding that this is quite slow (presumably because memory allocation is the main issue).
- [ ] Add GPU capability?
  - Not sure if possible, as binary trees are the real bottleneck.
  - Could generate on CPU, evaluate score on GPU?
docs/options.md ADDED
@@ -0,0 +1,123 @@
# Common Options

You likely don't need to tune the hyperparameters yourself,
but if you would like, you can use `hyperparamopt.py` as an example.

Common options that you can try include:
- `niterations`
- `procs`
- `populations`
- `binary_operators`, `unary_operators`
- `weights`
- `maxsize`, `maxdepth`
- `batching`, `batchSize`
- `variable_names` (or pandas input)
- SymPy output

These are described below.

The program will output a pandas DataFrame containing the equations,
mean square error, and complexity. It will also dump to a csv
at the end of every iteration,
which is `hall_of_fame.csv` by default. It also prints the
equations to stdout.

## Iterations

This is the total number of generations that `pysr` will run for.
I usually set this to a large number, and exit when I am satisfied
with the equations.

## Processors

One can adjust the number of workers used by Julia with the
`procs` option. You should set this equal to the number of cores
you want `pysr` to use. This will also run `procs` number of
populations simultaneously by default.

## Populations

By default, `populations=procs`, but you can set a different
number of populations with this option. More populations may increase
the diversity of equations discovered, though will take longer to train.
However, it may be more efficient to have `populations>procs`,
as there are multiple populations on each core.

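For example, a rough sketch of a longer run on more cores might look like the following (the numbers here are arbitrary placeholders; pick values that suit your machine and dataset):

```python
import numpy as np
import pysr

# Toy dataset: 100 examples, 5 features.
X = np.random.randn(100, 5)
y = X[:, 0] ** 2 + np.cos(X[:, 1])

# Run for more iterations, on 8 cores, with 16 populations
# (two populations per core).
equations = pysr.pysr(X, y, niterations=500, procs=8, populations=16)
```
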
## Custom operators

A list of operators can be found on the operators page.
One can define custom operators in Julia by passing a string:
```python
equations = pysr.pysr(X, y, niterations=100,
    binary_operators=["mult", "plus", "special(x, y) = x^2 + y"],
    extra_sympy_mappings={'special': lambda x, y: x**2 + y},
    unary_operators=["cos"])
```

Now, the symbolic regression code can search using this `special` function
that squares its left argument and adds it to its right. Make sure
all passed functions are valid Julia code, and take one (unary)
or two (binary) float32 scalars as input, and output a float32. This means if you
write any real constants in your operator, like `2.5`, you have to write them
instead as `2.5f0`, which defines it as `Float32`.
Operators are automatically vectorized.

One should also define `extra_sympy_mappings`,
so that the SymPy code can understand the output equation from Julia,
when constructing a usable function. This step is optional, but
is necessary for the `lambda_format` to work.

One can also edit `operators.jl`. See below for more options.

## Weighted data

Here, we assign weights to each row of data
using inverse uncertainty squared. We also use 10 processes
instead of the usual 4, which creates more populations
(one population per thread).
```python
sigma = ...
weights = 1/sigma**2

equations = pysr.pysr(X, y, weights=weights, procs=10)
```

## Max size

`maxsize` controls the maximum size of an equation (number of operators,
constants, variables). `maxdepth` is by default not used, but can be set
to control the maximum depth of an equation. These will make processing
faster, as longer equations take longer to test.

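As a quick sketch (the limits shown are arbitrary, with `X` and `y` assumed to be defined as in the earlier examples):

```python
# Cap equations at 30 nodes total, and at most 5 levels deep.
equations = pysr.pysr(X, y, maxsize=30, maxdepth=5)
```
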
## Batching

One can turn on mini-batching with the `batching` flag,
and control the batch size with `batchSize`. This will make
evolution faster for large datasets. Equations are still evaluated
on the entire dataset at the end of each iteration to compare to the hall
of fame, but only on a random subset during mutations and annealing.

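For instance (the batch size is chosen arbitrarily, with `X` and `y` as above):

```python
# Evolve on random mini-batches of 100 rows; the full dataset is still
# used when comparing candidates against the hall of fame.
equations = pysr.pysr(X, y, batching=True, batchSize=100)
```
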
## Variable Names

You can pass a list of strings naming each column of `X` with
`variable_names`. Alternatively, you can pass `X` as a pandas dataframe
and the columns will be used as variable names. Make sure only
alphabetical characters and `_` are used in these names.

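As a sketch, with placeholder column names (swap in names that match your own data):

```python
import numpy as np
import pandas as pd
import pysr

X = np.random.randn(100, 2)
y = 2.5 * X[:, 0] + np.sin(X[:, 1])

# Option 1: name the columns explicitly (one name per column of X).
equations = pysr.pysr(X, y, variable_names=["temperature", "pressure"])

# Option 2: pass a pandas DataFrame; its column names are used instead of x0, x1, ...
X_df = pd.DataFrame(X, columns=["temperature", "pressure"])
equations = pysr.pysr(X_df, y)
```
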
## SymPy output

The `pysr` command will return a pandas dataframe. The `sympy_format`
column gives sympy equations. You can use this to get LaTeX format with,
e.g.,

```python
import sympy

simplified = equations.iloc[-1]['sympy_format'].simplify()
print(sympy.latex(simplified))
```

If you have set variable names with `variable_names` or a pandas
dataframe as input for `X`, this will use the same names for each
input column instead of `x0`.
pydoc-markdown.yml CHANGED
@@ -1,6 +1,6 @@
  loaders:
  - type: python
- #search_path: [pysr]
+
  processors:
  - type: filter
  - type: smart
@@ -8,18 +8,25 @@ processors:
  renderer:
  type: hugo
  config:
- title: My Project
+ title: PySR
  theme: {clone_url: "https://github.com/alex-shpak/hugo-book.git"}
  # The "book" theme only renders pages in "content/docs" into the nav.
+
+ build_directory: docs/build
  content_directory: content/docs
+
  default_preamble: {menu: main}
  pages:
  - title: Home
  name: index
  source: README.md
+ directory: '..'
- - title: Operators
- name: operators
- source: docs/operators.md
  - title: API Documentation
  contents:
  - pysr.sr.pysr
+ - title: Operators
+ name: operators
+ source: docs/operators.md
+ - title: Options
+ name: options
+ source: docs/options.md