MilesCranmer committed
Commit: d65676d
2 parents: b9f583e a9212fa

Merge pull request #88 from MilesCranmer/sklearn

Files changed (16):
  1. .gitignore +2 -1
  2. Dockerfile +1 -2
  3. README.md +69 -45
  4. TODO.md +3 -0
  5. docs/examples.md +11 -11
  6. docs/operators.md +2 -4
  7. docs/options.md +69 -48
  8. docs/start.md +83 -32
  9. example.py +13 -14
  10. pydoc-markdown.yml +15 -1
  11. pysr/__init__.py +1 -1
  12. pysr/sr.py +990 -763
  13. setup.py +6 -3
  14. test/test.py +63 -60
  15. test/test_jax.py +10 -8
  16. test/test_torch.py +19 -10
.gitignore CHANGED
@@ -1,6 +1,7 @@
 .dataset*.jl
 .hyperparams*.jl
 *.csv
+*.csv.out*
 *.bkup
 performance*txt
 *.out
@@ -14,4 +15,4 @@ dist
 pysr/.vs/
 pysr.egg-info
 Manifest.toml
-workflow
+docs/
Dockerfile CHANGED
@@ -13,7 +13,7 @@ RUN apt-get update && apt-get upgrade -y && apt-get install -y \
 make build-essential libssl-dev zlib1g-dev \
 libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
 libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev \
-vim git \
+vim git tmux \
 && apt-get clean \
 && rm -rf /var/lib/apt/lists/*

@@ -37,7 +37,6 @@ RUN pip3 install -r /pysr/requirements.txt
 # Install PySR:
 # We do a minimal copy so it doesn't need to rerun at every file change:
 ADD ./setup.py /pysr/setup.py
-ADD ./README.md /pysr/README.md
 ADD ./pysr/ /pysr/pysr/
 RUN pip3 install .
README.md CHANGED
@@ -74,71 +74,95 @@ Most common issues at this stage are solved
 by [tweaking the Julia package server](https://github.com/MilesCranmer/PySR/issues/27)
 to use up-to-date packages.
 
-## Docker
-
-You can also test out PySR in Docker, without
-installing it locally, by running the following command in
-the root directory of this repo:
-```bash
-docker build --pull --rm -f "Dockerfile" -t pysr "."
-```
-This builds an image called `pysr`. You can then run this with:
-```bash
-docker run -it --rm -v "$PWD:/data" pysr ipython
-```
-which will link the current directory to the container's `/data` directory
-and then launch ipython.
-
 # Quickstart
 
-Here is some demo code (also found in `example.py`)
+Let's create a PySR example. First, let's import
+numpy to generate some test data:
 ```python
 import numpy as np
-from pysr import pysr, best
 
-# Dataset
 X = 2 * np.random.randn(100, 5)
-y = 2 * np.cos(X[:, 3]) + X[:, 0] ** 2 - 2
+y = 2.5382 * np.cos(X[:, 3]) + X[:, 0] ** 2 - 0.5
+```
+We have created a dataset with 100 datapoints, with 5 features each.
+The relation we wish to model is $2.5382 \cos(x_3) + x_0^2 - 0.5$.
 
-# Learn equations
-equations = pysr(
-    X,
-    y,
+Now, let's create a PySR model and train it.
+PySR's main interface is in the style of scikit-learn:
+```python
+from pysr import PySRRegressor
+model = PySRRegressor(
     niterations=5,
+    populations=8,
     binary_operators=["+", "*"],
     unary_operators=[
         "cos",
         "exp",
-        "sin",  # Pre-defined library of operators (see docs)
-        "inv(x) = 1/x",  # Define your own operator! (Julia syntax)
+        "sin",
+        "inv(x) = 1/x",  # Custom operator (julia syntax)
     ],
+    model_selection="best",
+    loss="loss(x, y) = (x - y)^2",  # Custom loss function (julia syntax)
 )
-
-...# (you can use ctl-c to exit early)
-
-print(best(equations))
 ```
+This will set up the model for 5 iterations of the search code, which contains hundreds of thousands of mutations and equation evaluations.
 
-which gives:
-
+Let's train this model on our dataset:
 ```python
-x0**2 + 2.000016*cos(x3) - 1.9999845
+model.fit(X, y)
 ```
+Internally, this launches a Julia process which will do a multithreaded search for equations to fit the dataset.
 
-The second and additional calls of `pysr` will be significantly
-faster in startup time, since the first call to Julia will compile
-and cache functions from the symbolic regression backend.
+Equations will be printed during training, and once you are satisfied, you may
+quit early by hitting 'q' and then \<enter\>.
 
-One can also use `best_tex` to get the LaTeX form,
-or `best_callable` to get a function you can call.
-This uses a score which balances complexity and error;
-however, one can see the full list of equations with:
+After the model has been fit, you can run `model.predict(X)`
+to see the predictions on a given dataset.
+
+You may run:
 ```python
-print(equations)
+print(model)
 ```
-This is a pandas table, with additional columns:
+to print the learned equations:
+```python
+PySRRegressor.equations = [
+       pick      score                                           Equation           MSE  Complexity
+    0         0.000000                                          3.5082064  2.710828e+01           1
+    1         0.964260                                          (x0 * x0)  3.940544e+00           3
+    2         0.030096                          (-0.47978288 + (x0 * x0))  3.710349e+00           5
+    3         0.840770                              ((x0 * x0) + cos(x3))  1.600564e+00           6
+    4         0.928380                ((x0 * x0) + (2.5313091 * cos(x3)))  2.499724e-01           8
+    5  >>>> 13.956461  ((-0.49999997 + (x0 * x0)) + (2.5382001 * cos(...  1.885665e-13          10
+]
+```
+The arrow in the `pick` column indicates which equation is currently selected by your
+`model_selection` strategy for prediction.
+(You may change `model_selection` after `.fit(X, y)` as well.)
 
-- `MSE` - the mean square error of the formula
-- `score` - a metric akin to Occam's razor; you should use this to help select the "true" equation.
-- `sympy_format` - sympy equation.
-- `lambda_format` - a lambda function for that equation, that you can pass values through.
+`model.equations` is a pandas DataFrame containing all equations, including callable format
+(`lambda_format`),
+SymPy format (`sympy_format`), and even JAX and PyTorch format
+(both of which are differentiable).
+
+There are several other useful features, such as denoising (e.g., `denoise=True`)
+and feature selection (e.g., `select_k_features=3`).
+For a summary of features and options, see [this docs page](https://pysr.readthedocs.io/en/latest/docs/options/).
+You can see the full API at [this page](https://pysr.readthedocs.io/en/latest/docs/api-documentation/).
+
+# Docker
+
+You can also test out PySR in Docker, without
+installing it locally, by running the following command in
+the root directory of this repo:
+```bash
+docker build --pull --rm -f "Dockerfile" -t pysr "."
+```
+This builds an image called `pysr`. If you have issues building (for example, on Apple Silicon),
+you can emulate an architecture that works by including: `--platform linux/amd64`.
+You can then run this with:
+```bash
+docker run -it --rm -v "$PWD:/data" pysr ipython
+```
+which will link the current directory to the container's `/data` directory
+and then launch ipython.
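
The quickstart above ends at `print(model)`. As a worked follow-on, here is a minimal sketch of the prediction step it describes, using only the `PySRRegressor` methods named in this commit (`fit`, `predict`, `sympy`); the held-out data is illustrative:
```python
import numpy as np
from pysr import PySRRegressor

X = 2 * np.random.randn(100, 5)
y = 2.5382 * np.cos(X[:, 3]) + X[:, 0] ** 2 - 0.5

model = PySRRegressor(niterations=5, binary_operators=["+", "*"], unary_operators=["cos"])
model.fit(X, y)

X_new = 2 * np.random.randn(50, 5)  # illustrative held-out inputs
y_new = model.predict(X_new)        # evaluates the equation chosen by model_selection
print(model.sympy())                # SymPy form of the same equation
```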
TODO.md CHANGED
@@ -65,6 +65,9 @@
 
 - [ ] Automatically convert log, log10, log2, pow to the correct operators.
 - [ ] I think the simplification isn't working correctly (post-merging SymbolicUtils.)
+- [ ] Show demo of PySRRegressor. Fit equations, then show how to view equations.
+- [ ] Add "selected" column string to regular equations dict.
+- [ ] List "Loss" instead of "MSE"
 
 ## Feature ideas
 
docs/examples.md CHANGED
@@ -23,8 +23,9 @@ find the expression `2 cos(x3) + x0^2 - 2`.
 ```python
 X = 2 * np.random.randn(100, 5)
 y = 2 * np.cos(X[:, 3]) + X[:, 0] ** 2 - 2
-expressions = pysr(X, y, binary_operators=["+", "-", "*", "/"], **kwargs)
-print(best(expressions))
+model = PySRRegressor(binary_operators=["+", "-", "*", "/"], **kwargs)
+model.fit(X, y)
+print(model)
 ```
 
 ## 2. Custom operator
@@ -34,14 +35,13 @@ Here, we define a custom operator and use it to find an expression:
 ```python
 X = 2 * np.random.randn(100, 5)
 y = 1 / X[:, 0]
-expressions = pysr(
-    X,
-    y,
+model = PySRRegressor(
     binary_operators=["plus", "mult"],
     unary_operators=["inv(x) = 1/x"],
     **kwargs
 )
-print(best(expressions))
+model.fit(X, y)
+print(model)
 ```
 
 ## 3. Multiple outputs
@@ -51,23 +51,23 @@ each requiring a different feature.
 ```python
 X = 2 * np.random.randn(100, 5)
 y = 1 / X[:, [0, 1, 2]]
-expressions = pysr(
-    X,
-    y,
+model = PySRRegressor(
     binary_operators=["plus", "mult"],
     unary_operators=["inv(x) = 1/x"],
     **kwargs
 )
+model.fit(X, y)
 ```
 
 ## 4. Plotting an expression
 
 Here, let's use the same equations, but get a format we can actually
-use and test. We can add this option after a search via the `get_hof`
+use and test. We can add this option after a search via the `set_params`
 function:
 
 ```python
-expressions = get_hof(extra_sympy_mappings={"inv": lambda x: 1/x})
+model.set_params(extra_sympy_mappings={"inv": lambda x: 1/x})
+model.sympy()
 ```
 If you look at the lists of expressions before and after, you will
 see that the sympy format now has replaced `inv` with `1/`.
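
Example 4 stops at the SymPy conversion; a sketch of the actual plot, assuming matplotlib (not mentioned in the docs above) and the fitted `model` from example 2:
```python
import matplotlib.pyplot as plt

y_pred = model.predict(X)     # numpy evaluation of the selected equation
plt.scatter(y, y_pred, s=10)  # truth vs. prediction
plt.xlabel("truth")
plt.ylabel("prediction")
plt.show()
```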
docs/operators.md CHANGED
@@ -49,7 +49,7 @@ Instead of passing a predefined operator as a string,
 you can define one by passing it to the `pysr` function, with, e.g.,
 
 ```python
-pysr(
+PySRRegressor(
     ...,
     unary_operators=["myfunction(x) = x^2"],
     binary_operators=["myotherfunction(x, y) = x^2*y"]
@@ -57,9 +57,7 @@ you can define one by passing it to the `pysr` function, with, e.g.,
 ```
 
 
-You can also define your own in `julia/operators.jl`,
-and pass the function name as a string. This is suitable
-for more complex functions. Make sure that it works with
+Make sure that it works with
 `Float32` as a datatype. That means you need to write `1.5f3`
 instead of `1.5e3`, if you write any constant numbers.
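As a concrete illustration of the `Float32` rule above, a hedged sketch (the operator name and constant are made up): the Julia definition uses `1.5f3`, while the matching SymPy mapping uses the ordinary Python `1.5e3`:
```python
from pysr import PySRRegressor

# `scaled_sq` is a hypothetical operator; note 1.5f3 (a Float32 literal) on the Julia side.
model = PySRRegressor(
    unary_operators=["scaled_sq(x) = 1.5f3 * x^2"],
    extra_sympy_mappings={"scaled_sq": lambda x: 1.5e3 * x**2},
)
```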
docs/options.md CHANGED
@@ -1,10 +1,8 @@
 # Features and Options
 
-You likely don't need to tune the hyperparameters yourself,
-but if you would like, you can use `hyperparamopt.py` as an example.
-
 Some configurable features and options in `PySR` which you
 may find useful include:
+- `model_selection`
 - `binary_operators`, `unary_operators`
 - `niterations`
 - `ncyclesperiteration`
@@ -21,18 +19,31 @@ may find useful include:
 
 These are described below
 
-The program will output a pandas DataFrame containing the equations,
-mean square error, and complexity. It will also dump to a csv
+The program will output a pandas DataFrame containing the equations
+to `PySRRegressor.equations`, with the loss value
+and complexity.
+
+It will also dump to a csv
 at the end of every iteration,
-which is `hall_of_fame_{date_time}.csv` by default. It also prints the
-equations to stdout.
+which is `hall_of_fame_{date_time}.csv` by default.
+It also prints the equations to stdout.
+
+## Model selection
+
+By default, `PySRRegressor` uses `model_selection='best'`,
+which selects an equation from `PySRRegressor.equations` using
+a combination of accuracy and complexity.
+You can also select `model_selection='accuracy'`.
+
+By printing a model (i.e., `print(model)`), you can see
+the equation selection with the arrow shown in the `pick` column.
 
 ## Operators
 
 A list of operators can be found on the operators page.
 One can define custom operators in Julia by passing a string:
 ```python
-equations = pysr.pysr(X, y, niterations=100,
+PySRRegressor(niterations=100,
     binary_operators=["mult", "plus", "special(x, y) = x^2 + y"],
     extra_sympy_mappings={'special': lambda x, y: x**2 + y},
     unary_operators=["cos"])
@@ -51,8 +62,6 @@ so that the SymPy code can understand the output equation from Julia,
 when constructing a useable function. This step is optional, but
 is necessary for the `lambda_format` to work.
 
-One can also edit `operators.jl`.
-
 ## Iterations
 
 This is the total number of generations that `pysr` will run for.
@@ -78,15 +87,15 @@ each population stay closer to the best current equations.
 
 One can adjust the number of workers used by Julia with the
 `procs` option. You should set this equal to the number of cores
-you want `pysr` to use. This will also run `procs` number of
-populations simultaneously by default.
+you want `pysr` to use.
 
 ## Populations
 
-By default, `populations=procs`, but you can set a different
-number of populations with this option. More populations may increase
+By default, `populations=20`, but you can set a different
+number of populations with this option.
+More populations may increase
 the diversity of equations discovered, though will take longer to train.
-However, it may be more efficient to have `populations>procs`,
+However, it is usually more efficient to have `populations>procs`,
 as there are multiple populations running
 on each core.
@@ -100,7 +109,8 @@ instead of the usual 4, which creates more populations
 sigma = ...
 weights = 1/sigma**2
 
-equations = pysr.pysr(X, y, weights=weights, procs=10)
+model = PySRRegressor(procs=10)
+model.fit(X, y, weights=weights)
 ```
 
 ## Max size
@@ -147,55 +157,63 @@ expressions of complexity 5 (e.g., 5.0 + x2 exp(x3)).
 
 ## LaTeX, SymPy
 
-The `pysr` command will return a pandas dataframe. The `sympy_format`
-column gives sympy equations, and the `lambda_format` gives callable
-functions. These use the variable names you have provided.
+After running `model.fit(...)`, you can look at
+`model.equations`, which is a pandas dataframe.
+The `sympy_format` column gives sympy equations,
+and the `lambda_format` gives callable functions.
+You can optionally pass a pandas dataframe to the callable function,
+if you called `.fit` on a pandas dataframe as well.
 
 There are also some helper functions for doing this quickly.
-You can call `get_hof()` (or pass an equation file explicitly to this)
-to get this pandas dataframe.
-
-You can call the functions `best()` to get the sympy format
-for the best equation, using the `score` column to sort equations.
-`best_latex()` returns the LaTeX form of this, and `best_callable()`
-returns a callable function.
+- `model.latex()` will generate a TeX formatted output of your equation.
+- `model.sympy()` will return the SymPy representation.
+- `model.jax()` will return a callable JAX function combined with parameters (see below).
+- `model.pytorch()` will return a PyTorch model (see below).
 
 
 ## Callable exports: numpy, pytorch, jax
 
 By default, the dataframe of equations will contain columns
-with the identifier `lambda_format`. These are simple functions
-which correspond to the equation, but executed
-with numpy functions. You can pass your `X` matrix to these functions
-just as you did to the `pysr` call. Thus, this allows
+with the identifier `lambda_format`.
+These are simple functions which correspond to the equation, but executed
+with numpy functions.
+You can pass your `X` matrix to these functions
+just as you did to the `model.fit` call. Thus, this allows
 you to numerically evaluate the equations over different output.
 
+Calling `model.predict` will execute the `lambda_format` of
+the best equation, and return the result. If you selected
+`model_selection="best"`, this will use an equation that combines
+accuracy with simplicity. For `model_selection="accuracy"`, this will just
+look at accuracy.
 
 One can do the same thing for PyTorch, which uses code
 from [sympytorch](https://github.com/patrick-kidger/sympytorch),
 and for JAX, which uses code from
 [sympy2jax](https://github.com/MilesCranmer/sympy2jax).
 
-For torch, set the argument `output_torch_format=True`, which
-will generate a column `torch_format`. Each element of this column
-is a PyTorch module which runs the equation, using PyTorch functions,
+Calling `model.pytorch()` will return
+a PyTorch module which runs the equation, using PyTorch functions,
 over `X` (as a PyTorch tensor). This is differentiable, and the
 parameters of this PyTorch module correspond to the learned parameters
 in the equation, and are trainable.
+```python
+torch_model = model.pytorch()
+torch_model(X)
+```
+**Warning: If you are using custom operators, you must define `extra_torch_mappings` or `extra_jax_mappings` (both are `dict` of callables) to provide an equivalent definition of the functions.** (At any time you can set these parameters or any others with `model.set_params`.)
 
-For jax, set the argument `output_jax_format=True`, which
-will generate a column `jax_format`. Each element of this column
-is a dictionary containing a `'callable'` (a JAX function),
+For JAX, you can equivalently call `model.jax()`.
+This will return a dictionary containing a `'callable'` (a JAX function)
 and `'parameters'` (a list of parameters in the equation).
-One can execute this function with: `element['callable'](X, element['parameters'])`.
+You can execute this function with:
+```python
+jax_model = model.jax()
+jax_model['callable'](X, jax_model['parameters'])
+```
 Since the parameter list is a jax array, this therefore lets you also
 train the parameters within JAX (and is differentiable).
 
-If you forget to turn these on when calling the function initially,
-you can re-run `get_hof(output_jax_format=True)`, and it will re-use
-the equations and other state properties, assuming you haven't
-re-run `pysr` in the meantime!
-
 ## `loss`
 
 The default loss is mean-square error, and weighted mean-square error.
@@ -209,26 +227,29 @@ Here are some additional examples:
 
 abs(x-y) loss
 ```python
-pysr(..., loss="f(x, y) = abs(x - y)^1.5")
+PySRRegressor(..., loss="f(x, y) = abs(x - y)^1.5")
 ```
 Note that the function name doesn't matter:
 ```python
-pysr(..., loss="loss(x, y) = abs(x * y)")
+PySRRegressor(..., loss="loss(x, y) = abs(x * y)")
 ```
 With weights:
 ```python
-pysr(..., weights=weights, loss="myloss(x, y, w) = w * abs(x - y)")
+model = PySRRegressor(..., loss="myloss(x, y, w) = w * abs(x - y)")
+model.fit(..., weights=weights)
 ```
 Weights can be used in arbitrary ways:
 ```python
-pysr(..., weights=weights, loss="myloss(x, y, w) = abs(x - y)^2/w^2")
+model = PySRRegressor(..., loss="myloss(x, y, w) = abs(x - y)^2/w^2")
+model.fit(..., weights=weights)
 ```
 Built-in loss (faster) (see [losses](https://astroautomata.com/SymbolicRegression.jl/dev/losses/)).
 This one computes the L3 norm:
 ```python
-pysr(..., loss="LPDistLoss{3}()")
+PySRRegressor(..., loss="LPDistLoss{3}()")
 ```
 Can also use these losses for weighted (weighted-average):
 ```python
-pysr(..., weights=weights, loss="LPDistLoss{3}()")
+model = PySRRegressor(..., loss="LPDistLoss{3}()")
+model.fit(..., weights=weights)
 ```
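
Tying the "Model selection" section above to these examples: the strategy can be changed even after fitting (the README in this commit notes this), e.g. via the sklearn-style `set_params`. A sketch, assuming `X` and `y` are already defined:
```python
model = PySRRegressor(model_selection="best")
model.fit(X, y)

model.set_params(model_selection="accuracy")  # switch strategy post-fit
y_pred = model.predict(X)                     # now uses the most accurate equation
```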
docs/start.md CHANGED
@@ -1,6 +1,4 @@
-# Getting Started
-
-## Installation
+# Installation
 PySR uses both Julia and Python, so you need to have both installed.
 
 Install Julia - see [downloads](https://julialang.org/downloads/), and
@@ -16,47 +14,100 @@ python3 -c 'import pysr; pysr.install()'
 The second line will install and update the required Julia packages, including
 `PyCall.jl`.
 
-## Quickstart
+Most common issues at this stage are solved
+by [tweaking the Julia package server](https://github.com/MilesCranmer/PySR/issues/27)
+to use up-to-date packages.
 
+# Quickstart
+
+Let's create a PySR example. First, let's import
+numpy to generate some test data:
 ```python
 import numpy as np
-from pysr import pysr, best, get_hof
 
-# Dataset
-X = 2*np.random.randn(100, 5)
-y = 2*np.cos(X[:, 3]) + X[:, 0]**2 - 2
+X = 2 * np.random.randn(100, 5)
+y = 2.5382 * np.cos(X[:, 3]) + X[:, 0] ** 2 - 0.5
+```
+We have created a dataset with 100 datapoints, with 5 features each.
+The relation we wish to model is $2.5382 \cos(x_3) + x_0^2 - 0.5$.
 
-# Learn equations
-equations = pysr(X, y, niterations=5,
-    binary_operators=["plus", "mult"],
-    unary_operators=["cos", "exp", "sin"])
-
-...# (you can use ctl-c to exit early)
-
-print(best())
+Now, let's create a PySR model and train it.
+PySR's main interface is in the style of scikit-learn:
+```python
+from pysr import PySRRegressor
+model = PySRRegressor(
+    niterations=5,
+    populations=8,
+    binary_operators=["+", "*"],
+    unary_operators=[
+        "cos",
+        "exp",
+        "sin",
+        "inv(x)=1/x",  # Custom operator (julia syntax)
+    ],
+    model_selection="best",
+    loss="loss(x, y) = (x - y)^2",  # Custom loss function (julia syntax)
+)
 ```
+This will set up the model for 5 iterations of the search code, which contains hundreds of thousands of mutations and equation evaluations.
 
-which gives:
-
+Let's train this model on our dataset:
 ```python
-x0**2 + 2.000016*cos(x3) - 1.9999845
+model.fit(X, y)
 ```
+Internally, this launches a Julia process which will do a multithreaded search for equations to fit the dataset.
 
-The second and additional calls of `pysr` will be significantly
-faster in startup time, since the first call to Julia will compile
-and cache functions from the symbolic regression backend.
+Equations will be printed during training, and once you are satisfied, you may
+quit early by hitting 'q' and then \<enter\>.
 
-One can also use `best_tex` to get the LaTeX form,
-or `best_callable` to get a function you can call.
-This uses a score which balances complexity and error;
-however, one can see the full list of equations with:
+After the model has been fit, you can run `model.predict(X)`
+to see the predictions on a given dataset.
+
+You may run:
 ```python
-print(get_hof())
+print(model)
 ```
-This is a pandas table, with additional columns:
+to print the learned equations:
+```python
+PySRRegressor.equations = [
+       pick      score                                           equation          loss  complexity
+    0         0.000000                                          3.0282464  2.816982e+01           1
+    1         1.008026                                          (x0 * x0)  3.751666e+00           3
+    2         0.015337                          (-0.33649465 + (x0 * x0))  3.638336e+00           5
+    3         0.888050                              ((x0 * x0) + cos(x3))  1.497019e+00           6
+    4         0.898539                ((x0 * x0) + (2.4816332 * cos(x3)))  2.481797e-01           8
+    5  >>>> 10.604434  ((-0.49998775 + (x0 * x0)) + (2.5382009 * cos(...  1.527115e-10          10
+]
+```
+The arrow in the `pick` column indicates which equation is currently selected by your
+`model_selection` strategy for prediction.
+(You may change `model_selection` after `.fit(X, y)` as well.)
 
-- `MSE` - the mean square error of the formula
-- `score` - a metric akin to Occam's razor; you should use this to help select the "true" equation.
-- `sympy_format` - sympy equation.
-- `lambda_format` - a lambda function for that equation, that you can pass values through.
+`model.equations` is a pandas DataFrame containing all equations, including callable format
+(`lambda_format`),
+SymPy format (`sympy_format`), and even JAX and PyTorch format
+(both of which are differentiable).
+
+There are several other useful features, such as denoising (e.g., `denoise=True`),
+feature selection (e.g., `select_k_features=3`), and many others.
+For a summary of features and options, see [this docs page](https://pysr.readthedocs.io/en/latest/docs/options/).
+You can see the full API at [this page](https://pysr.readthedocs.io/en/latest/docs/api-documentation/).
+
+# Docker
+
+You can also test out PySR in Docker, without
+installing it locally, by running the following command in
+the root directory of this repo:
+```bash
+docker build --pull --rm -f "Dockerfile" -t pysr "."
+```
+This builds an image called `pysr`. If you have issues building (for example, on Apple Silicon),
+you can emulate an architecture that works by including: `--platform linux/amd64`.
+You can then run this with:
+```bash
+docker run -it --rm -v "$PWD:/data" pysr ipython
+```
+which will link the current directory to the container's `/data` directory
+and then launch ipython.
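
Beyond `print(model)`, the selected row can also be pulled out programmatically; `get_best` appears in this commit's API listing (the exact return shape is an assumption here):
```python
best_row = model.get_best()  # the row of model.equations chosen by model_selection
print(best_row)
```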
example.py CHANGED
@@ -1,25 +1,24 @@
 import numpy as np
-from pysr import pysr, best
 
-# Dataset
 X = 2 * np.random.randn(100, 5)
-y = 2 * np.cos(X[:, 3]) + X[:, 0] ** 2 - 2
+y = 2.5382 * np.cos(X[:, 3]) + X[:, 0] ** 2 - 0.5
 
-# Learn equations
-equations = pysr(
-    X,
-    y,
+from pysr import PySRRegressor
+
+model = PySRRegressor(
     niterations=5,
-    binary_operators=["plus", "mult"],
+    populations=8,
+    binary_operators=["+", "*"],
     unary_operators=[
         "cos",
         "exp",
-        "sin",  # Pre-defined library of operators (see https://pysr.readthedocs.io/en/latest/docs/operators/)
-        "inv(x) = 1/x",
+        "sin",
+        "inv(x) = 1/x",  # Custom operator (julia syntax)
     ],
-    loss="loss(x, y) = abs(x - y)",  # Custom loss function
-)  # Define your own operator! (Julia syntax)
+    model_selection="best",
+    loss="loss(x, y) = (x - y)^2",  # Custom loss function (julia syntax)
+)
 
-... # (you can use ctl-c to exit early)
+model.fit(X, y)
 
-print(best(equations))
+print(model)
pydoc-markdown.yml CHANGED
@@ -54,5 +54,19 @@ renderer:
       preamble: {weight: 4}
   - title: API Documentation
     contents:
-      - pysr.sr.*
+      - pysr.sr.PySRRegressor.__init__
+      - pysr.sr.PySRRegressor.fit
+      - pysr.sr.PySRRegressor.predict
+      - pysr.sr.PySRRegressor.__repr__
+      - pysr.sr.PySRRegressor.set_params
+      - pysr.sr.PySRRegressor.get_params
+      - pysr.sr.PySRRegressor.get_best
+      - pysr.sr.PySRRegressor.sympy
+      - pysr.sr.PySRRegressor.latex
+      - pysr.sr.PySRRegressor.jax
+      - pysr.sr.PySRRegressor.pytorch
+      - pysr.sr.PySRRegressor.refresh
+      - pysr.sr.__repr__
+      - pysr.sr.install
+      - pysr.sr.silence_julia_warning
     preamble: {weight: 5}
pysr/__init__.py CHANGED
@@ -1,6 +1,6 @@
 from .sr import (
     pysr,
-    get_hof,
+    PySRRegressor,
     best,
     best_tex,
     best_callable,
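
Per the diff above, the legacy functional helpers remain exported alongside the new class, so both import styles work during the transition:
```python
from pysr import pysr, best     # legacy functional interface
from pysr import PySRRegressor  # new sklearn-style interface
```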
pysr/sr.py CHANGED
@@ -11,11 +11,15 @@ from pathlib import Path
11
  from datetime import datetime
12
  import warnings
13
  from multiprocessing import cpu_count
 
14
 
15
  is_julia_warning_silenced = False
16
 
17
 
18
  def install(julia_project=None): # pragma: no cover
 
 
 
19
  import julia
20
 
21
  julia.install()
@@ -36,20 +40,6 @@ def install(julia_project=None): # pragma: no cover
36
 
37
 
38
  Main = None
39
- global_state = dict(
40
- equation_file="hall_of_fame.csv",
41
- n_features=None,
42
- variable_names=[],
43
- extra_sympy_mappings={},
44
- extra_torch_mappings={},
45
- extra_jax_mappings={},
46
- output_jax_format=False,
47
- output_torch_format=False,
48
- multioutput=False,
49
- nout=1,
50
- selection=None,
51
- raw_julia_output=None,
52
- )
53
 
54
  already_ran = False
55
 
@@ -93,533 +83,14 @@ sympy_mappings = {
93
  }
94
 
95
 
96
- def pysr(
97
- X,
98
- y,
99
- weights=None,
100
- binary_operators=None,
101
- unary_operators=None,
102
- procs=cpu_count(),
103
- loss="L2DistLoss()",
104
- populations=20,
105
- niterations=100,
106
- ncyclesperiteration=300,
107
- alpha=0.1,
108
- annealing=False,
109
- fractionReplaced=0.10,
110
- fractionReplacedHof=0.10,
111
- npop=1000,
112
- parsimony=1e-4,
113
- migration=True,
114
- hofMigration=True,
115
- shouldOptimizeConstants=True,
116
- topn=10,
117
- weightAddNode=1,
118
- weightInsertNode=3,
119
- weightDeleteNode=3,
120
- weightDoNothing=1,
121
- weightMutateConstant=10,
122
- weightMutateOperator=1,
123
- weightRandomize=1,
124
- weightSimplify=0.002,
125
- perturbationFactor=1.0,
126
- extra_sympy_mappings=None,
127
- extra_torch_mappings=None,
128
- extra_jax_mappings=None,
129
- equation_file=None,
130
- verbosity=1e9,
131
- progress=None,
132
- maxsize=20,
133
- fast_cycle=False,
134
- maxdepth=None,
135
- variable_names=None,
136
- batching=False,
137
- batchSize=50,
138
- select_k_features=None,
139
- warmupMaxsizeBy=0.0,
140
- constraints=None,
141
- useFrequency=True,
142
- tempdir=None,
143
- delete_tempfiles=True,
144
- julia_project=None,
145
- update=True,
146
- temp_equation_file=False,
147
- output_jax_format=False,
148
- output_torch_format=False,
149
- optimizer_algorithm="BFGS",
150
- optimizer_nrestarts=3,
151
- optimize_probability=1.0,
152
- optimizer_iterations=10,
153
- tournament_selection_n=10,
154
- tournament_selection_p=1.0,
155
- denoise=False,
156
- Xresampled=None,
157
- precision=32,
158
- multithreading=None,
159
- **kwargs,
160
- ):
161
- """Run symbolic regression to fit f(X[i, :]) ~ y[i] for all i.
162
- Note: most default parameters have been tuned over several example
163
- equations, but you should adjust `niterations`,
164
- `binary_operators`, `unary_operators` to your requirements.
165
- You can view more detailed explanations of the options on the
166
- [options page](https://pysr.readthedocs.io/en/latest/docs/options/) of the documentation.
167
-
168
- :param X: 2D array. Rows are examples, columns are features. If pandas DataFrame, the columns are used for variable names (so make sure they don't contain spaces).
169
- :type X: np.ndarray/pandas.DataFrame
170
- :param y: 1D array (rows are examples) or 2D array (rows are examples, columns are outputs). Putting in a 2D array will trigger a search for equations for each feature of y.
171
- :type y: np.ndarray
172
- :param weights: same shape as y. Each element is how to weight the mean-square-error loss for that particular element of y.
173
- :type weights: np.ndarray
174
- :param binary_operators: List of strings giving the binary operators in Julia's Base. Default is ["+", "-", "*", "/",].
175
- :type binary_operators: list
176
- :param unary_operators: Same but for operators taking a single scalar. Default is [].
177
- :type unary_operators: list
178
- :param procs: Number of processes (=number of populations running).
179
- :type procs: int
180
- :param loss: String of Julia code specifying the loss function. Can either be a loss from LossFunctions.jl, or your own loss written as a function. Examples of custom written losses include: `myloss(x, y) = abs(x-y)` for non-weighted, or `myloss(x, y, w) = w*abs(x-y)` for weighted. Among the included losses, these are as follows. Regression: `LPDistLoss{P}()`, `L1DistLoss()`, `L2DistLoss()` (mean square), `LogitDistLoss()`, `HuberLoss(d)`, `L1EpsilonInsLoss(ϵ)`, `L2EpsilonInsLoss(ϵ)`, `PeriodicLoss(c)`, `QuantileLoss(τ)`. Classification: `ZeroOneLoss()`, `PerceptronLoss()`, `L1HingeLoss()`, `SmoothedL1HingeLoss(γ)`, `ModifiedHuberLoss()`, `L2MarginLoss()`, `ExpLoss()`, `SigmoidLoss()`, `DWDMarginLoss(q)`.
181
- :type loss: str
182
- :param populations: Number of populations running.
183
- :type populations: int
184
- :param niterations: Number of iterations of the algorithm to run. The best equations are printed, and migrate between populations, at the end of each.
185
- :type niterations: int
186
- :param ncyclesperiteration: Number of total mutations to run, per 10 samples of the population, per iteration.
187
- :type ncyclesperiteration: int
188
- :param alpha: Initial temperature.
189
- :type alpha: float
190
- :param annealing: Whether to use annealing. You should (and it is default).
191
- :type annealing: bool
192
- :param fractionReplaced: How much of population to replace with migrating equations from other populations.
193
- :type fractionReplaced: float
194
- :param fractionReplacedHof: How much of population to replace with migrating equations from hall of fame.
195
- :type fractionReplacedHof: float
196
- :param npop: Number of individuals in each population
197
- :type npop: int
198
- :param parsimony: Multiplicative factor for how much to punish complexity.
199
- :type parsimony: float
200
- :param migration: Whether to migrate.
201
- :type migration: bool
202
- :param hofMigration: Whether to have the hall of fame migrate.
203
- :type hofMigration: bool
204
- :param shouldOptimizeConstants: Whether to numerically optimize constants (Nelder-Mead/Newton) at the end of each iteration.
205
- :type shouldOptimizeConstants: bool
206
- :param topn: How many top individuals migrate from each population.
207
- :type topn: int
208
- :param perturbationFactor: Constants are perturbed by a max factor of (perturbationFactor*T + 1). Either multiplied by this or divided by this.
209
- :type perturbationFactor: float
210
- :param weightAddNode: Relative likelihood for mutation to add a node
211
- :type weightAddNode: float
212
- :param weightInsertNode: Relative likelihood for mutation to insert a node
213
- :type weightInsertNode: float
214
- :param weightDeleteNode: Relative likelihood for mutation to delete a node
215
- :type weightDeleteNode: float
216
- :param weightDoNothing: Relative likelihood for mutation to leave the individual
217
- :type weightDoNothing: float
218
- :param weightMutateConstant: Relative likelihood for mutation to change the constant slightly in a random direction.
219
- :type weightMutateConstant: float
220
- :param weightMutateOperator: Relative likelihood for mutation to swap an operator.
221
- :type weightMutateOperator: float
222
- :param weightRandomize: Relative likelihood for mutation to completely delete and then randomly generate the equation
223
- :type weightRandomize: float
224
- :param weightSimplify: Relative likelihood for mutation to simplify constant parts by evaluation
225
- :type weightSimplify: float
226
- :param equation_file: Where to save the files (.csv separated by |)
227
- :type equation_file: str
228
- :param verbosity: What verbosity level to use. 0 means minimal print statements.
229
- :type verbosity: int
230
- :param progress: Whether to use a progress bar instead of printing to stdout.
231
- :type progress: bool
232
- :param maxsize: Max size of an equation.
233
- :type maxsize: int
234
- :param maxdepth: Max depth of an equation. You can use both maxsize and maxdepth. maxdepth is by default set to = maxsize, which means that it is redundant.
235
- :type maxdepth: int
236
- :param fast_cycle: (experimental) - batch over population subsamples. This is a slightly different algorithm than regularized evolution, but does cycles 15% faster. May be algorithmically less efficient.
237
- :type fast_cycle: bool
238
- :param variable_names: a list of names for the variables, other than "x0", "x1", etc.
239
- :type variable_names: list
240
- :param batching: whether to compare population members on small batches during evolution. Still uses full dataset for comparing against hall of fame.
241
- :type batching: bool
242
- :param batchSize: the amount of data to use if doing batching.
243
- :type batchSize: int
244
- :param select_k_features: whether to run feature selection in Python using random forests, before passing to the symbolic regression code. None means no feature selection; an int means select that many features.
245
- :type select_k_features: None/int
246
- :param warmupMaxsizeBy: whether to slowly increase max size from a small number up to the maxsize (if greater than 0). If greater than 0, says the fraction of training time at which the current maxsize will reach the user-passed maxsize.
247
- :type warmupMaxsizeBy: float
248
- :param constraints: dictionary of int (unary) or 2-tuples (binary), this enforces maxsize constraints on the individual arguments of operators. E.g., `'pow': (-1, 1)` says that power laws can have any complexity left argument, but only 1 complexity exponent. Use this to force more interpretable solutions.
249
- :type constraints: dict
250
- :param useFrequency: whether to measure the frequency of complexities, and use that instead of parsimony to explore equation space. Will naturally find equations of all complexities.
251
- :type useFrequency: bool
252
- :param tempdir: directory for the temporary files
253
- :type tempdir: str/None
254
- :param delete_tempfiles: whether to delete the temporary files after finishing
255
- :type delete_tempfiles: bool
256
- :param julia_project: a Julia environment location containing a Project.toml (and potentially the source code for SymbolicRegression.jl). Default gives the Python package directory, where a Project.toml file should be present from the install.
257
- :type julia_project: str/None
258
- :param update: Whether to automatically update Julia packages.
259
- :type update: bool
260
- :param temp_equation_file: Whether to put the hall of fame file in the temp directory. Deletion is then controlled with the delete_tempfiles argument.
261
- :type temp_equation_file: bool
262
- :param output_jax_format: Whether to create a 'jax_format' column in the output, containing jax-callable functions and the default parameters in a jax array.
263
- :type output_jax_format: bool
264
- :param output_torch_format: Whether to create a 'torch_format' column in the output, containing a torch module with trainable parameters.
265
- :type output_torch_format: bool
266
- :param tournament_selection_n: Number of expressions to consider in each tournament.
267
- :type tournament_selection_n: int
268
- :param tournament_selection_p: Probability of selecting the best expression in each tournament. The probability will decay as p*(1-p)^n for other expressions, sorted by loss.
269
- :type tournament_selection_p: float
270
- :param denoise: Whether to use a Gaussian Process to denoise the data before inputting to PySR. Can help PySR fit noisy data.
271
- :type denoise: bool
272
- :param precision: What precision to use for the data. By default this is 32 (float32), but you can select 64 or 16 as well.
273
- :type precision: int
274
- :param multithreading: Use multithreading instead of distributed backend. Default is yes. Using procs=0 will turn off both.
275
- :type multithreading: bool
276
- :param **kwargs: Other options passed to SymbolicRegression.Options, for example, if you modify SymbolicRegression.jl to include additional arguments.
277
- :type **kwargs: dict
278
- :returns: Results dataframe, giving complexity, MSE, and equations (as strings), as well as functional forms. If list, each element corresponds to a dataframe of equations for each output.
279
- :type: pd.DataFrame/list
280
- """
281
- global already_ran
282
-
283
- if binary_operators is None:
284
- binary_operators = "+ * - /".split(" ")
285
- if unary_operators is None:
286
- unary_operators = []
287
- if extra_sympy_mappings is None:
288
- extra_sympy_mappings = {}
289
- if variable_names is None:
290
- variable_names = []
291
- if constraints is None:
292
- constraints = {}
293
- if multithreading is None:
294
- # Default is multithreading=True, unless explicitly set,
295
- # or procs is set to 0 (serial mode).
296
- multithreading = procs != 0
297
-
298
- global Main
299
- if Main is None:
300
- if multithreading:
301
- os.environ["JULIA_NUM_THREADS"] = str(procs)
302
-
303
- Main = init_julia()
304
-
305
- buffer_available = "buffer" in sys.stdout.__dir__()
306
-
307
- if progress is not None:
308
- if progress and not buffer_available:
309
- warnings.warn(
310
- "Note: it looks like you are running in Jupyter. The progress bar will be turned off."
311
- )
312
- progress = False
313
- else:
314
- progress = buffer_available
315
-
316
- assert optimizer_algorithm in ["NelderMead", "BFGS"]
317
- assert tournament_selection_n < npop
318
-
319
- if isinstance(X, pd.DataFrame):
320
- variable_names = list(X.columns)
321
- X = np.array(X)
322
-
323
- if len(X.shape) == 1:
324
- X = X[:, None]
325
-
326
- assert not isinstance(y, pd.DataFrame)
327
-
328
- if len(variable_names) == 0:
329
- variable_names = [f"x{i}" for i in range(X.shape[1])]
330
-
331
- if extra_jax_mappings is not None:
332
- for value in extra_jax_mappings.values():
333
- if not isinstance(value, str):
334
- raise NotImplementedError(
335
- "extra_jax_mappings must have keys that are strings! e.g., {sympy.sqrt: 'jnp.sqrt'}."
336
- )
337
-
338
- if extra_torch_mappings is not None:
339
- for value in extra_jax_mappings.values():
340
- if not callable(value):
341
- raise NotImplementedError(
342
- "extra_torch_mappings must be callable functions! e.g., {sympy.sqrt: torch.sqrt}."
343
- )
344
-
345
- use_custom_variable_names = len(variable_names) != 0
346
- # TODO: this is always true.
347
-
348
- _check_assertions(
349
- X,
350
- binary_operators,
351
- unary_operators,
352
- use_custom_variable_names,
353
- variable_names,
354
- weights,
355
- y,
356
- )
357
-
358
- if len(X) > 10000 and not batching:
359
- warnings.warn(
360
- "Note: you are running with more than 10,000 datapoints. You should consider turning on batching (https://pysr.readthedocs.io/en/latest/docs/options/#batching). You should also reconsider if you need that many datapoints. Unless you have a large amount of noise (in which case you should smooth your dataset first), generally < 10,000 datapoints is enough to find a functional form with symbolic regression. More datapoints will lower the search speed."
361
- )
362
-
363
- if maxsize > 40:
364
- warnings.warn(
365
- "Note: Using a large maxsize for the equation search will be exponentially slower and use significant memory. You should consider turning `useFrequency` to False, and perhaps use `warmupMaxsizeBy`."
366
- )
367
- if maxsize < 7:
368
- raise NotImplementedError("PySR requires a maxsize of at least 7")
369
-
370
- X, selection = _handle_feature_selection(X, select_k_features, y, variable_names)
371
-
372
- if maxdepth is None:
373
- maxdepth = maxsize
374
- if isinstance(binary_operators, str):
375
- binary_operators = [binary_operators]
376
- if isinstance(unary_operators, str):
377
- unary_operators = [unary_operators]
378
-
379
- if len(y.shape) == 1 or (len(y.shape) == 2 and y.shape[1] == 1):
380
- multioutput = False
381
- nout = 1
382
- y = y.reshape(-1)
383
- elif len(y.shape) == 2:
384
- multioutput = True
385
- nout = y.shape[1]
386
- else:
387
- raise NotImplementedError("y shape not supported!")
388
-
389
- if denoise:
390
- if weights is not None:
391
- raise NotImplementedError(
392
- "No weights for denoising - the weights are learned."
393
- )
394
- if Xresampled is not None:
395
- # Select among only the selected features:
396
- if isinstance(Xresampled, pd.DataFrame):
397
- # Handle Xresampled is pandas dataframe
398
- if selection is not None:
399
- Xresampled = Xresampled[[variable_names[i] for i in selection]]
400
- else:
401
- Xresampled = Xresampled[variable_names]
402
- Xresampled = np.array(Xresampled)
403
- else:
404
- if selection is not None:
405
- Xresampled = Xresampled[:, selection]
406
- if multioutput:
407
- y = np.stack(
408
- [_denoise(X, y[:, i], Xresampled=Xresampled)[1] for i in range(nout)],
409
- axis=1,
410
- )
411
- if Xresampled is not None:
412
- X = Xresampled
413
- else:
414
- X, y = _denoise(X, y, Xresampled=Xresampled)
415
-
416
- julia_project = _get_julia_project(julia_project)
417
-
418
- tmpdir = Path(tempfile.mkdtemp(dir=tempdir))
419
-
420
- if temp_equation_file:
421
- equation_file = tmpdir / "hall_of_fame.csv"
422
- elif equation_file is None:
423
- date_time = datetime.now().strftime("%Y-%m-%d_%H%M%S.%f")[:-3]
424
- equation_file = "hall_of_fame_" + date_time + ".csv"
425
-
426
- _create_inline_operators(
427
- binary_operators=binary_operators, unary_operators=unary_operators
428
- )
429
- _handle_constraints(
430
- binary_operators=binary_operators,
431
- unary_operators=unary_operators,
432
- constraints=constraints,
433
- )
434
-
435
- una_constraints = [constraints[op] for op in unary_operators]
436
- bin_constraints = [constraints[op] for op in binary_operators]
437
-
438
- try:
439
- # TODO: is this needed since Julia now prints directly to stdout?
440
- term_width = shutil.get_terminal_size().columns
441
- except:
442
- _, term_width = subprocess.check_output(["stty", "size"]).split()
443
-
444
- if not already_ran:
445
- from julia import Pkg
446
-
447
- Pkg.activate(f"{_escape_filename(julia_project)}")
448
- try:
449
- if update:
450
- Pkg.resolve()
451
- Pkg.instantiate()
452
- else:
453
- Pkg.instantiate()
454
- except RuntimeError as e:
455
- raise ImportError(
456
- f"""
457
- Required dependencies are not installed or built. Run the following code in the Python REPL:
458
-
459
- >>> import pysr
460
- >>> pysr.install()
461
-
462
- Tried to activate project {julia_project} but failed."""
463
- ) from e
464
- Main.eval("using SymbolicRegression")
465
-
466
- Main.plus = Main.eval("(+)")
467
- Main.sub = Main.eval("(-)")
468
- Main.mult = Main.eval("(*)")
469
- Main.pow = Main.eval("(^)")
470
- Main.div = Main.eval("(/)")
471
-
472
- Main.custom_loss = Main.eval(loss)
473
-
474
- mutationWeights = [
475
- float(weightMutateConstant),
476
- float(weightMutateOperator),
477
- float(weightAddNode),
478
- float(weightInsertNode),
479
- float(weightDeleteNode),
480
- float(weightSimplify),
481
- float(weightRandomize),
482
- float(weightDoNothing),
483
- ]
484
-
485
- options = Main.Options(
486
- binary_operators=Main.eval(str(tuple(binary_operators)).replace("'", "")),
487
- unary_operators=Main.eval(str(tuple(unary_operators)).replace("'", "")),
488
- bin_constraints=bin_constraints,
489
- una_constraints=una_constraints,
490
- parsimony=float(parsimony),
491
- loss=Main.custom_loss,
492
- alpha=float(alpha),
493
- maxsize=int(maxsize),
494
- maxdepth=int(maxdepth),
495
- fast_cycle=fast_cycle,
496
- migration=migration,
497
- hofMigration=hofMigration,
498
- fractionReplacedHof=float(fractionReplacedHof),
499
- shouldOptimizeConstants=shouldOptimizeConstants,
- hofFile=_escape_filename(equation_file),
- npopulations=int(populations),
- optimizer_algorithm=optimizer_algorithm,
- optimizer_nrestarts=int(optimizer_nrestarts),
- optimize_probability=float(optimize_probability),
- optimizer_iterations=int(optimizer_iterations),
- perturbationFactor=float(perturbationFactor),
- annealing=annealing,
- batching=batching,
- batchSize=int(min([batchSize, len(X)]) if batching else len(X)),
- mutationWeights=mutationWeights,
- warmupMaxsizeBy=float(warmupMaxsizeBy),
- useFrequency=useFrequency,
- npop=int(npop),
- ns=int(tournament_selection_n),
- probPickFirst=float(tournament_selection_p),
- ncyclesperiteration=int(ncyclesperiteration),
- fractionReplaced=float(fractionReplaced),
- topn=int(topn),
- verbosity=int(verbosity),
- progress=progress,
- terminal_width=int(term_width),
- **kwargs,
- )
-
- np_dtype = {16: np.float16, 32: np.float32, 64: np.float64}[precision]
-
- Main.X = np.array(X, dtype=np_dtype).T
- if len(y.shape) == 1:
- Main.y = np.array(y, dtype=np_dtype)
- else:
- Main.y = np.array(y, dtype=np_dtype).T
- if weights is not None:
- if len(weights.shape) == 1:
- Main.weights = np.array(weights, dtype=np_dtype)
- else:
- Main.weights = np.array(weights, dtype=np_dtype).T
- else:
- Main.weights = None
-
- cprocs = 0 if multithreading else procs
-
- raw_julia_output = Main.EquationSearch(
- Main.X,
- Main.y,
- weights=Main.weights,
- niterations=int(niterations),
- varMap=(
- variable_names
- if selection is None
- else [variable_names[i] for i in selection]
- ),
- options=options,
- numprocs=int(cprocs),
- multithreading=bool(multithreading),
- )
-
- _set_globals(
- X=X,
- equation_file=equation_file,
- variable_names=variable_names,
- extra_sympy_mappings=extra_sympy_mappings,
- extra_torch_mappings=extra_torch_mappings,
- extra_jax_mappings=extra_jax_mappings,
- output_jax_format=output_jax_format,
- output_torch_format=output_torch_format,
- multioutput=multioutput,
- nout=nout,
- selection=selection,
- raw_julia_output=raw_julia_output,
- )
-
- equations = get_hof(
- equation_file=equation_file,
- n_features=X.shape[1],
- variable_names=variable_names,
- output_jax_format=output_jax_format,
- output_torch_format=output_torch_format,
- selection=selection,
- extra_sympy_mappings=extra_sympy_mappings,
- extra_jax_mappings=extra_jax_mappings,
- extra_torch_mappings=extra_torch_mappings,
- multioutput=multioutput,
- nout=nout,
  )
-
- if delete_tempfiles:
- shutil.rmtree(tmpdir)
-
- already_ran = True
-
- return equations
-
-
- def _set_globals(
- *,
- X,
- equation_file,
- variable_names,
- extra_sympy_mappings,
- extra_torch_mappings,
- extra_jax_mappings,
- output_jax_format,
- output_torch_format,
- multioutput,
- nout,
- selection,
- raw_julia_output,
- ):
- global global_state
-
- global_state["n_features"] = X.shape[1]
- global_state["equation_file"] = equation_file
- global_state["variable_names"] = variable_names
- global_state["extra_sympy_mappings"] = extra_sympy_mappings
- global_state["extra_torch_mappings"] = extra_torch_mappings
- global_state["extra_jax_mappings"] = extra_jax_mappings
- global_state["output_jax_format"] = output_jax_format
- global_state["output_torch_format"] = output_torch_format
- global_state["multioutput"] = multioutput
- global_state["nout"] = nout
- global_state["selection"] = selection
- global_state["raw_julia_output"] = raw_julia_output


  def _handle_constraints(binary_operators, unary_operators, constraints):
@@ -646,6 +117,7 @@ def _handle_constraints(binary_operators, unary_operators, constraints):


 def _create_inline_operators(binary_operators, unary_operators):
 for op_list in [binary_operators, unary_operators]:
 for i, op in enumerate(op_list):
 is_user_defined_operator = "(" in op
@@ -710,234 +182,35 @@ def run_feature_selection(X, y, select_k_features):
 return selector.get_support(indices=True)


- def get_hof(
- equation_file=None,
- n_features=None,
- variable_names=None,
- output_jax_format=None,
- output_torch_format=None,
- selection=None,
- extra_sympy_mappings=None,
- extra_jax_mappings=None,
- extra_torch_mappings=None,
- multioutput=None,
- nout=None,
- **kwargs,
- ):
- """Get the equations from a hall of fame file. If no arguments
- entered, the ones used previously from a call to PySR will be used."""
-
- global global_state
-
- if equation_file is None:
- equation_file = global_state["equation_file"]
- if n_features is None:
- n_features = global_state["n_features"]
- if variable_names is None:
- variable_names = global_state["variable_names"]
- if extra_sympy_mappings is None:
- extra_sympy_mappings = global_state["extra_sympy_mappings"]
- if extra_jax_mappings is None:
- extra_jax_mappings = global_state["extra_jax_mappings"]
- if extra_torch_mappings is None:
- extra_torch_mappings = global_state["extra_torch_mappings"]
- if output_torch_format is None:
- output_torch_format = global_state["output_torch_format"]
- if output_jax_format is None:
- output_jax_format = global_state["output_jax_format"]
- if multioutput is None:
- multioutput = global_state["multioutput"]
- if nout is None:
- nout = global_state["nout"]
- if selection is None:
- selection = global_state["selection"]
-
- global_state["selection"] = selection
- global_state["equation_file"] = equation_file
- global_state["n_features"] = n_features
- global_state["variable_names"] = variable_names
- global_state["extra_sympy_mappings"] = extra_sympy_mappings
- global_state["extra_jax_mappings"] = extra_jax_mappings
- global_state["extra_torch_mappings"] = extra_torch_mappings
- global_state["output_torch_format"] = output_torch_format
- global_state["output_jax_format"] = output_jax_format
- global_state["multioutput"] = multioutput
- global_state["nout"] = nout
- global_state["selection"] = selection
-
- try:
- if multioutput:
- all_outputs = [
- pd.read_csv(str(equation_file) + f".out{i}" + ".bkup", sep="|")
- for i in range(1, nout + 1)
- ]
- else:
- all_outputs = [pd.read_csv(str(equation_file) + ".bkup", sep="|")]
- except FileNotFoundError:
- raise RuntimeError(
- "Couldn't find equation file! The equation search likely exited before a single iteration completed."
- )
-
- ret_outputs = []
-
- for output in all_outputs:
-
- scores = []
- lastMSE = None
- lastComplexity = 0
- sympy_format = []
- lambda_format = []
- if output_jax_format:
- jax_format = []
- if output_torch_format:
- torch_format = []
- use_custom_variable_names = len(variable_names) != 0
- local_sympy_mappings = {**extra_sympy_mappings, **sympy_mappings}
-
- if use_custom_variable_names:
- sympy_symbols = [sympy.Symbol(variable_names[i]) for i in range(n_features)]
- else:
- sympy_symbols = [sympy.Symbol("x%d" % i) for i in range(n_features)]
-
- for _, eqn_row in output.iterrows():
- eqn = sympify(eqn_row["Equation"], locals=local_sympy_mappings)
- sympy_format.append(eqn)
-
- # Numpy:
- lambda_format.append(
- CallableEquation(sympy_symbols, eqn, selection, variable_names)
- )
-
- # JAX:
- if output_jax_format:
- from .export_jax import sympy2jax
-
- func, params = sympy2jax(
- eqn,
- sympy_symbols,
- selection=selection,
- extra_jax_mappings=extra_jax_mappings,
- )
- jax_format.append({"callable": func, "parameters": params})
-
- # Torch:
- if output_torch_format:
- from .export_torch import sympy2torch
-
- module = sympy2torch(
- eqn,
- sympy_symbols,
- selection=selection,
- extra_torch_mappings=extra_torch_mappings,
- )
- torch_format.append(module)
-
- curMSE = eqn_row["MSE"]
- curComplexity = eqn_row["Complexity"]
-
- if lastMSE is None:
- cur_score = 0.0
- else:
- if curMSE > 0.0:
- cur_score = -np.log(curMSE / lastMSE) / (
- curComplexity - lastComplexity
- )
- else:
- cur_score = np.inf
-
- scores.append(cur_score)
- lastMSE = curMSE
- lastComplexity = curComplexity
-
- output["score"] = np.array(scores)
- output["sympy_format"] = sympy_format
- output["lambda_format"] = lambda_format
- output_cols = [
- "Complexity",
- "MSE",
- "score",
- "Equation",
- "sympy_format",
- "lambda_format",
- ]
- if output_jax_format:
- output_cols += ["jax_format"]
- output["jax_format"] = jax_format
- if output_torch_format:
- output_cols += ["torch_format"]
- output["torch_format"] = torch_format
-
- ret_outputs.append(output[output_cols])
-
- if multioutput:
- return ret_outputs
- return ret_outputs[0]
-
-
- def best_row(equations=None):
- """Return the best row of a hall of fame file using the score column.
- By default this uses the last equation file.
- """
- if equations is None:
- equations = get_hof()
- if isinstance(equations, list):
- return [eq.iloc[np.argmax(eq["score"])] for eq in equations]
- return equations.iloc[np.argmax(equations["score"])]
-
-
- def best_tex(equations=None):
- """Return the equation with the best score, in latex format
- By default this uses the last equation file.
- """
- if equations is None:
- equations = get_hof()
- if isinstance(equations, list):
- return [
- sympy.latex(best_row(eq)["sympy_format"].simplify()) for eq in equations
- ]
- return sympy.latex(best_row(equations)["sympy_format"].simplify())


- def best(equations=None):
- """Return the equation with the best score, in sympy format.
- By default this uses the last equation file.
- """
- if equations is None:
- equations = get_hof()
- if isinstance(equations, list):
- return [best_row(eq)["sympy_format"].simplify() for eq in equations]
- return best_row(equations)["sympy_format"].simplify()


- def best_callable(equations=None):
- """Return the equation with the best score, in callable format.
- By default this uses the last equation file.
- """
- if equations is None:
- equations = get_hof()
- if isinstance(equations, list):
- return [best_row(eq)["lambda_format"] for eq in equations]
- return best_row(equations)["lambda_format"]


- def _escape_filename(filename):
- """Turns a file into a string representation with correctly escaped backslashes"""
- str_repr = str(filename)
- str_repr = str_repr.replace("\\", "\\\\")
- return str_repr


- # https://gist.github.com/garrettdreyfus/8153571
- def _yesno(question):
- """Simple Yes/No Function."""
- prompt = f"{question} (y/n): "
- ans = input(prompt).strip().lower()
- if ans not in ["y", "n"]:
- print(f"{ans} is invalid, please try again...")
- return _yesno(question)
- if ans == "y":
- return True
- return False


 def _denoise(X, y, Xresampled=None):
@@ -969,9 +242,9 @@ class CallableEquation:

 def __call__(self, X):
 if isinstance(X, pd.DataFrame):
- X = np.array(X[self._variable_names])
-
- if self._selection is not None:
 return self._lambda(*X[:, self._selection].T)
 return self._lambda(*X.T)

@@ -1053,3 +326,957 @@ julia = "1.5"

 project_toml_path = tmp_dir / "Project.toml"
 project_toml_path.write_text(project_toml)
 from datetime import datetime
 import warnings
 from multiprocessing import cpu_count
+ from sklearn.base import BaseEstimator, RegressorMixin

 is_julia_warning_silenced = False


 def install(julia_project=None): # pragma: no cover
+ """Install PyCall.jl and all required dependencies for SymbolicRegression.jl.
+
+ Also updates the local Julia registry."""
 import julia

 julia.install()

 Main = None

 already_ran = False

 }


+ def pysr(X, y, weights=None, **kwargs):
+ warnings.warn(
+ "Calling `pysr` is deprecated. Please use `model = PySRRegressor(**params); model.fit(X, y)` going forward.",
+ DeprecationWarning,
+ )
+ model = PySRRegressor(**kwargs)
+ model.fit(X, y, weights=weights)
+ return model.equations
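
The shim above forwards old-style calls to the new class, so existing scripts keep working while emitting a `DeprecationWarning`. A minimal sketch of the migration, with an illustrative dataset and operator list (not taken from this commit):

```python
import numpy as np
from pysr import PySRRegressor

X = np.random.randn(100, 5)
y = X[:, 0] ** 2 - X[:, 1]

# Before: equations = pysr(X, y, binary_operators=["+", "*"])
model = PySRRegressor(binary_operators=["+", "*"])
model.fit(X, y)              # runs the search
equations = model.equations  # the same hall-of-fame DataFrame the shim returns
```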


 def _handle_constraints(binary_operators, unary_operators, constraints):


 def _create_inline_operators(binary_operators, unary_operators):
+ global Main
 for op_list in [binary_operators, unary_operators]:
 for i, op in enumerate(op_list):
 is_user_defined_operator = "(" in op
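
For context on the `global Main` fix above: operator strings containing `(` are treated as inline Julia definitions and evaluated in the `Main` module. A hedged sketch of declaring one, mirroring the `sq` operator used in the test changes further down:

```python
from pysr import PySRRegressor

model = PySRRegressor(
    # Julia-side definition; parsed as user-defined because it contains "(":
    unary_operators=["sq(x) = x^2"],
    binary_operators=["plus"],
    # Python-side equivalent, needed for sympy/export support:
    extra_sympy_mappings={"sq": lambda x: x ** 2},
)
```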
 return selector.get_support(indices=True)


+ def _escape_filename(filename):
+ """Turns a file into a string representation with correctly escaped backslashes"""
+ str_repr = str(filename)
+ str_repr = str_repr.replace("\\", "\\\\")
+ return str_repr
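
The doubling above matters because the escaped string is later spliced into Julia source (for example the `Pkg.activate(f"{_escape_filename(self.julia_project)}")` call in `_run`), where a lone backslash in a Windows path would start an escape sequence. A quick illustration with a hypothetical path:

```python
path = "C:\\Users\\me\\hall_of_fame.csv"  # i.e. C:\Users\me\hall_of_fame.csv
escaped = path.replace("\\", "\\\\")      # what _escape_filename does internally
print(escaped)  # C:\\Users\\me\\hall_of_fame.csv -- read as literal backslashes by Julia
```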


+ def best(*args, **kwargs):
+ raise NotImplementedError(
+ "`best` has been deprecated. Please use the `PySRRegressor` interface. After fitting, you can return `.sympy()` to get the sympy representation of the best equation."
+ )


+ def best_row(*args, **kwargs):
+ raise NotImplementedError(
+ "`best_row` has been deprecated. Please use the `PySRRegressor` interface. After fitting, you can run `print(model)` to view the best equation."
+ )


+ def best_tex(*args, **kwargs):
+ raise NotImplementedError(
+ "`best_tex` has been deprecated. Please use the `PySRRegressor` interface. After fitting, you can return `.latex()` to get the sympy representation of the best equation."
+ )


+ def best_callable(*args, **kwargs):
+ raise NotImplementedError(
+ "`best_callable` has been deprecated. Please use the `PySRRegressor` interface. After fitting, you can use `.predict(X)` to use the best callable."
+ )
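
Taken together, the four stubs spell out where each old helper went. A short sketch of the correspondence, assuming `model` is an already-fitted `PySRRegressor` and `X` is held-out data:

```python
expr = model.sympy()       # was: best(equations)
tex = model.latex()        # was: best_tex(equations)
y_pred = model.predict(X)  # was: best_callable(equations)(X)
print(model)               # was: best_row(equations); ">>>>" marks the selected row
```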


 def _denoise(X, y, Xresampled=None):

 def __call__(self, X):
 if isinstance(X, pd.DataFrame):
+ # The lambda takes each DataFrame column as a keyword argument:
+ return self._lambda(**{k: X[k].values for k in X.columns})
+ elif self._selection is not None:
 return self._lambda(*X[:, self._selection].T)
 return self._lambda(*X.T)
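
The new DataFrame branch above switches from positional to by-name dispatch: each column is passed as a keyword argument to the lambdified equation, so column order no longer matters, but the names must match the training variables. A small sketch (column names illustrative, `eq` assumed to come from a fitted model):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"x0": np.random.randn(10), "x1": np.random.randn(10)})
# eq = model.get_best()["lambda_format"]  # a CallableEquation
# y = eq(X)  # expands to eq._lambda(x0=X["x0"].values, x1=X["x1"].values)
```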


 project_toml_path = tmp_dir / "Project.toml"
 project_toml_path.write_text(project_toml)
+
+
+ class PySRRegressor(BaseEstimator, RegressorMixin):
+ def __init__(
+ self,
+ model_selection="best",
+ weights=None,
+ binary_operators=None,
+ unary_operators=None,
+ procs=cpu_count(),
+ loss="L2DistLoss()",
+ populations=20,
+ niterations=100,
+ ncyclesperiteration=300,
+ alpha=0.1,
+ annealing=False,
+ fractionReplaced=0.10,
+ fractionReplacedHof=0.10,
+ npop=1000,
+ parsimony=1e-4,
+ migration=True,
+ hofMigration=True,
+ shouldOptimizeConstants=True,
+ topn=10,
+ weightAddNode=1,
+ weightInsertNode=3,
+ weightDeleteNode=3,
+ weightDoNothing=1,
+ weightMutateConstant=10,
+ weightMutateOperator=1,
+ weightRandomize=1,
+ weightSimplify=0.002,
+ perturbationFactor=1.0,
+ extra_sympy_mappings=None,
+ extra_torch_mappings=None,
+ extra_jax_mappings=None,
+ equation_file=None,
+ verbosity=1e9,
+ progress=None,
+ maxsize=20,
+ fast_cycle=False,
+ maxdepth=None,
+ variable_names=None,
+ batching=False,
+ batchSize=50,
+ select_k_features=None,
+ warmupMaxsizeBy=0.0,
+ constraints=None,
+ useFrequency=True,
+ tempdir=None,
+ delete_tempfiles=True,
+ julia_project=None,
+ update=True,
+ temp_equation_file=False,
+ output_jax_format=False,
+ output_torch_format=False,
+ optimizer_algorithm="BFGS",
+ optimizer_nrestarts=3,
+ optimize_probability=1.0,
+ optimizer_iterations=10,
+ tournament_selection_n=10,
+ tournament_selection_p=1.0,
+ denoise=False,
+ Xresampled=None,
+ precision=32,
+ multithreading=None,
+ **kwargs,
+ ):
+ """Initialize settings for an equation search in PySR.
+
+ Note: most default parameters have been tuned over several example
+ equations, but you should adjust `niterations`,
+ `binary_operators`, `unary_operators` to your requirements.
+ You can view more detailed explanations of the options on the
+ [options page](https://pysr.readthedocs.io/en/latest/docs/options/) of the documentation.
+
+ :param model_selection: How to select a model. Can be 'accuracy' or 'best'. The default, 'best', will optimize a combination of complexity and accuracy.
+ :type model_selection: str
+ :param binary_operators: List of strings giving the binary operators in Julia's Base. Default is ["+", "-", "*", "/",].
+ :type binary_operators: list
+ :param unary_operators: Same but for operators taking a single scalar. Default is [].
+ :type unary_operators: list
+ :param niterations: Number of iterations of the algorithm to run. The best equations are printed, and migrate between populations, at the end of each.
+ :type niterations: int
+ :param populations: Number of populations running.
+ :type populations: int
+ :param loss: String of Julia code specifying the loss function. Can either be a loss from LossFunctions.jl, or your own loss written as a function. Examples of custom written losses include: `myloss(x, y) = abs(x-y)` for non-weighted, or `myloss(x, y, w) = w*abs(x-y)` for weighted. Among the included losses, these are as follows. Regression: `LPDistLoss{P}()`, `L1DistLoss()`, `L2DistLoss()` (mean square), `LogitDistLoss()`, `HuberLoss(d)`, `L1EpsilonInsLoss(ϵ)`, `L2EpsilonInsLoss(ϵ)`, `PeriodicLoss(c)`, `QuantileLoss(τ)`. Classification: `ZeroOneLoss()`, `PerceptronLoss()`, `L1HingeLoss()`, `SmoothedL1HingeLoss(γ)`, `ModifiedHuberLoss()`, `L2MarginLoss()`, `ExpLoss()`, `SigmoidLoss()`, `DWDMarginLoss(q)`.
+ :type loss: str
+ :param denoise: Whether to use a Gaussian Process to denoise the data before inputting to PySR. Can help PySR fit noisy data.
+ :type denoise: bool
+ :param select_k_features: whether to run feature selection in Python using random forests, before passing to the symbolic regression code. None means no feature selection; an int means select that many features.
+ :type select_k_features: None/int
+ :param procs: Number of processes (=number of populations running).
+ :type procs: int
+ :param multithreading: Use multithreading instead of distributed backend. Default is yes. Using procs=0 will turn off both.
+ :type multithreading: bool
+ :param batching: whether to compare population members on small batches during evolution. Still uses full dataset for comparing against hall of fame.
+ :type batching: bool
+ :param batchSize: the amount of data to use if doing batching.
+ :type batchSize: int
+ :param maxsize: Max size of an equation.
+ :type maxsize: int
+ :param ncyclesperiteration: Number of total mutations to run, per 10 samples of the population, per iteration.
+ :type ncyclesperiteration: int
+ :param alpha: Initial temperature.
+ :type alpha: float
+ :param annealing: Whether to use annealing.
+ :type annealing: bool
+ :param fractionReplaced: How much of population to replace with migrating equations from other populations.
+ :type fractionReplaced: float
+ :param fractionReplacedHof: How much of population to replace with migrating equations from hall of fame.
+ :type fractionReplacedHof: float
+ :param npop: Number of individuals in each population.
+ :type npop: int
+ :param parsimony: Multiplicative factor for how much to punish complexity.
+ :type parsimony: float
+ :param migration: Whether to migrate.
+ :type migration: bool
+ :param hofMigration: Whether to have the hall of fame migrate.
+ :type hofMigration: bool
+ :param shouldOptimizeConstants: Whether to numerically optimize constants (Nelder-Mead/Newton) at the end of each iteration.
+ :type shouldOptimizeConstants: bool
+ :param topn: How many top individuals migrate from each population.
+ :type topn: int
+ :param perturbationFactor: Constants are perturbed by a max factor of (perturbationFactor*T + 1). Either multiplied by this or divided by this.
+ :type perturbationFactor: float
+ :param weightAddNode: Relative likelihood for mutation to add a node.
+ :type weightAddNode: float
+ :param weightInsertNode: Relative likelihood for mutation to insert a node.
+ :type weightInsertNode: float
+ :param weightDeleteNode: Relative likelihood for mutation to delete a node.
+ :type weightDeleteNode: float
+ :param weightDoNothing: Relative likelihood for mutation to leave the individual.
+ :type weightDoNothing: float
+ :param weightMutateConstant: Relative likelihood for mutation to change the constant slightly in a random direction.
+ :type weightMutateConstant: float
+ :param weightMutateOperator: Relative likelihood for mutation to swap an operator.
+ :type weightMutateOperator: float
+ :param weightRandomize: Relative likelihood for mutation to completely delete and then randomly generate the equation.
+ :type weightRandomize: float
+ :param weightSimplify: Relative likelihood for mutation to simplify constant parts by evaluation.
+ :type weightSimplify: float
+ :param equation_file: Where to save the files (.csv separated by |).
+ :type equation_file: str
+ :param verbosity: What verbosity level to use. 0 means minimal print statements.
+ :type verbosity: int
+ :param progress: Whether to use a progress bar instead of printing to stdout.
+ :type progress: bool
+ :param maxdepth: Max depth of an equation. You can use both maxsize and maxdepth. maxdepth is by default set to maxsize, which means that it is redundant.
+ :type maxdepth: int
+ :param fast_cycle: (experimental) - batch over population subsamples. This is a slightly different algorithm than regularized evolution, but does cycles 15% faster. May be algorithmically less efficient.
+ :type fast_cycle: bool
+ :param variable_names: a list of names for the variables, other than "x0", "x1", etc.
+ :type variable_names: list
+ :param warmupMaxsizeBy: whether to slowly increase max size from a small number up to the maxsize (if greater than 0). If greater than 0, says the fraction of training time at which the current maxsize will reach the user-passed maxsize.
+ :type warmupMaxsizeBy: float
+ :param constraints: dictionary of int (unary) or 2-tuples (binary), this enforces maxsize constraints on the individual arguments of operators. E.g., `'pow': (-1, 1)` says that power laws can have any complexity left argument, but only 1 complexity exponent. Use this to force more interpretable solutions.
+ :type constraints: dict
+ :param useFrequency: whether to measure the frequency of complexities, and use that instead of parsimony to explore equation space. Will naturally find equations of all complexities.
+ :type useFrequency: bool
+ :param tempdir: directory for the temporary files.
+ :type tempdir: str/None
+ :param delete_tempfiles: whether to delete the temporary files after finishing.
+ :type delete_tempfiles: bool
+ :param julia_project: a Julia environment location containing a Project.toml (and potentially the source code for SymbolicRegression.jl). Default gives the Python package directory, where a Project.toml file should be present from the install.
+ :type julia_project: str/None
+ :param update: Whether to automatically update Julia packages.
+ :type update: bool
+ :param temp_equation_file: Whether to put the hall of fame file in the temp directory. Deletion is then controlled with the delete_tempfiles argument.
+ :type temp_equation_file: bool
+ :param output_jax_format: Whether to create a 'jax_format' column in the output, containing jax-callable functions and the default parameters in a jax array.
+ :type output_jax_format: bool
+ :param output_torch_format: Whether to create a 'torch_format' column in the output, containing a torch module with trainable parameters.
+ :type output_torch_format: bool
+ :param tournament_selection_n: Number of expressions to consider in each tournament.
+ :type tournament_selection_n: int
+ :param tournament_selection_p: Probability of selecting the best expression in each tournament. The probability will decay as p*(1-p)^n for other expressions, sorted by loss.
+ :type tournament_selection_p: float
+ :param precision: What precision to use for the data. By default this is 32 (float32), but you can select 64 or 16 as well.
+ :type precision: int
+ :param **kwargs: Other options passed to SymbolicRegression.Options, for example, if you modify SymbolicRegression.jl to include additional arguments.
+ :type **kwargs: dict
+ :returns: Results dataframe, giving complexity, MSE, and equations (as strings), as well as functional forms. If list, each element corresponds to a dataframe of equations for each output.
+ :type: pd.DataFrame/list
+ """
+ super().__init__()
+ self.model_selection = model_selection
+
+ if binary_operators is None:
+ binary_operators = "+ * - /".split(" ")
+ if unary_operators is None:
+ unary_operators = []
+ if extra_sympy_mappings is None:
+ extra_sympy_mappings = {}
+ if variable_names is None:
+ variable_names = []
+ if constraints is None:
+ constraints = {}
+ if multithreading is None:
+ # Default is multithreading=True, unless explicitly set,
+ # or procs is set to 0 (serial mode).
+ multithreading = procs != 0
+
+ buffer_available = "buffer" in sys.stdout.__dir__()
+
+ if progress is not None:
+ if progress and not buffer_available:
+ warnings.warn(
+ "Note: it looks like you are running in Jupyter. The progress bar will be turned off."
+ )
+ progress = False
+ else:
+ progress = buffer_available
+
+ assert optimizer_algorithm in ["NelderMead", "BFGS"]
+ assert tournament_selection_n < npop
+
+ if extra_jax_mappings is not None:
+ for value in extra_jax_mappings.values():
+ if not isinstance(value, str):
+ raise NotImplementedError(
+ "extra_jax_mappings must have values that are strings! e.g., {sympy.sqrt: 'jnp.sqrt'}."
+ )
+ else:
+ extra_jax_mappings = {}
+
+ if extra_torch_mappings is not None:
+ for value in extra_torch_mappings.values():
+ if not callable(value):
+ raise NotImplementedError(
+ "extra_torch_mappings must be callable functions! e.g., {sympy.sqrt: torch.sqrt}."
+ )
+ else:
+ extra_torch_mappings = {}
+
+ if maxsize > 40:
+ warnings.warn(
+ "Note: Using a large maxsize for the equation search will be exponentially slower and use significant memory. You should consider turning `useFrequency` to False, and perhaps use `warmupMaxsizeBy`."
+ )
+ elif maxsize < 7:
+ raise NotImplementedError("PySR requires a maxsize of at least 7")
+
+ if maxdepth is None:
+ maxdepth = maxsize
+
+ if isinstance(binary_operators, str):
+ binary_operators = [binary_operators]
+ if isinstance(unary_operators, str):
+ unary_operators = [unary_operators]
+
+ self.params = {
+ **dict(
+ weights=weights,
+ binary_operators=binary_operators,
+ unary_operators=unary_operators,
+ procs=procs,
+ loss=loss,
+ populations=populations,
+ niterations=niterations,
+ ncyclesperiteration=ncyclesperiteration,
+ alpha=alpha,
+ annealing=annealing,
+ fractionReplaced=fractionReplaced,
+ fractionReplacedHof=fractionReplacedHof,
+ npop=npop,
+ parsimony=float(parsimony),
+ migration=migration,
+ hofMigration=hofMigration,
+ shouldOptimizeConstants=shouldOptimizeConstants,
+ topn=topn,
+ weightAddNode=weightAddNode,
+ weightInsertNode=weightInsertNode,
+ weightDeleteNode=weightDeleteNode,
+ weightDoNothing=weightDoNothing,
+ weightMutateConstant=weightMutateConstant,
+ weightMutateOperator=weightMutateOperator,
+ weightRandomize=weightRandomize,
+ weightSimplify=weightSimplify,
+ perturbationFactor=perturbationFactor,
+ verbosity=verbosity,
+ progress=progress,
+ maxsize=maxsize,
+ fast_cycle=fast_cycle,
+ maxdepth=maxdepth,
+ batching=batching,
+ batchSize=batchSize,
+ select_k_features=select_k_features,
+ warmupMaxsizeBy=warmupMaxsizeBy,
+ constraints=constraints,
+ useFrequency=useFrequency,
+ tempdir=tempdir,
+ delete_tempfiles=delete_tempfiles,
+ update=update,
+ temp_equation_file=temp_equation_file,
+ optimizer_algorithm=optimizer_algorithm,
+ optimizer_nrestarts=optimizer_nrestarts,
+ optimize_probability=optimize_probability,
+ optimizer_iterations=optimizer_iterations,
+ tournament_selection_n=tournament_selection_n,
+ tournament_selection_p=tournament_selection_p,
+ denoise=denoise,
+ Xresampled=Xresampled,
+ precision=precision,
+ multithreading=multithreading,
+ ),
+ **kwargs,
+ }
+
+ # Stored equations:
+ self.equations = None
+
+ self.multioutput = None
+ self.raw_julia_output = None
+ self.equation_file = equation_file
+ self.n_features = None
+ self.extra_sympy_mappings = extra_sympy_mappings
+ self.extra_torch_mappings = extra_torch_mappings
+ self.extra_jax_mappings = extra_jax_mappings
+ self.output_jax_format = output_jax_format
+ self.output_torch_format = output_torch_format
+ self.nout = 1
+ self.selection = None
+ self.variable_names = variable_names
+ self.julia_project = julia_project
+
+ self.surface_parameters = [
+ "model_selection",
+ "multioutput",
+ "raw_julia_output",
+ "equation_file",
+ "n_features",
+ "extra_sympy_mappings",
+ "extra_torch_mappings",
+ "extra_jax_mappings",
+ "output_jax_format",
+ "output_torch_format",
+ "nout",
+ "selection",
+ "variable_names",
+ "julia_project",
+ ]
+
+ def __repr__(self):
+ """Prints all current equations fitted by the model.
+
+ The string `>>>>` denotes which equation is selected by the
+ `model_selection`.
+ """
+ if self.equations is None:
+ return "PySRRegressor.equations = None"
+
+ output = "PySRRegressor.equations = [\n"
+
+ equations = self.equations
+ if not isinstance(equations, list):
+ all_equations = [equations]
+ else:
+ all_equations = equations
+
+ for i, equations in enumerate(all_equations):
+ selected = ["" for _ in range(len(equations))]
+ if self.model_selection == "accuracy":
+ chosen_row = -1
+ elif self.model_selection == "best":
+ chosen_row = equations["score"].idxmax()
+ else:
+ raise NotImplementedError
+ selected[chosen_row] = ">>>>"
+ repr_equations = pd.DataFrame(
+ dict(
+ pick=selected,
+ score=equations["score"],
+ equation=equations["equation"],
+ loss=equations["loss"],
+ complexity=equations["complexity"],
+ )
+ )
+
+ if len(all_equations) > 1:
+ output += "[\n"
+
+ for line in repr_equations.__repr__().split("\n"):
+ output += "\t" + line + "\n"
+
+ if len(all_equations) > 1:
+ output += "]"
+
+ if i < len(all_equations) - 1:
+ output += ", "
+
+ output += "]"
+ return output
+
+ def set_params(self, **params):
+ """Set parameters for equation search."""
+ for key, value in params.items():
+ if key in self.surface_parameters:
+ self.__setattr__(key, value)
+ else:
+ self.params[key] = value
+
+ self.refresh()
+ return self
+
+ def get_params(self, deep=True):
+ """Get parameters for equation search."""
+ del deep
+ return {
+ **self.params,
+ **{key: self.__getattribute__(key) for key in self.surface_parameters},
+ }
+
+ def get_best(self):
+ """Get best equation using `model_selection`."""
+ if self.equations is None:
+ raise ValueError("No equations have been generated yet.")
+ if self.model_selection == "accuracy":
+ if isinstance(self.equations, list):
+ return [eq.iloc[-1] for eq in self.equations]
+ return self.equations.iloc[-1]
+ elif self.model_selection == "best":
+ if isinstance(self.equations, list):
+ return [eq.iloc[eq["score"].idxmax()] for eq in self.equations]
+ return self.equations.iloc[self.equations["score"].idxmax()]
+ else:
+ raise NotImplementedError(
+ f"{self.model_selection} is not a valid model selection strategy."
+ )
+
+ def fit(self, X, y, weights=None, variable_names=None):
+ """Search for equations to fit the dataset and store them in `self.equations`.
+
+ :param X: 2D array. Rows are examples, columns are features. If pandas DataFrame, the columns are used for variable names (so make sure they don't contain spaces).
+ :type X: np.ndarray/pandas.DataFrame
+ :param y: 1D array (rows are examples) or 2D array (rows are examples, columns are outputs). Putting in a 2D array will trigger a search for equations for each feature of y.
+ :type y: np.ndarray
+ :param weights: Optional. Same shape as y. Each element is how to weight the mean-square-error loss for that particular element of y.
+ :type weights: np.ndarray
+ :param variable_names: a list of names for the variables, other than "x0", "x1", etc.
+ You can also pass a pandas DataFrame for X.
+ :type variable_names: list
+ """
+ if variable_names is None:
+ variable_names = self.variable_names
+
+ self._run(
+ X=X,
+ y=y,
+ weights=weights,
+ variable_names=variable_names,
+ )
+
+ return self
+
+ def refresh(self):
+ # Updates self.equations with any new options passed,
+ # such as extra_sympy_mappings.
+ self.equations = self.get_hof()
+
+ def predict(self, X):
+ """Predict y from input X using the equation chosen by `model_selection`.
+
+ You may see what equation is used by printing this object. X should have the same
+ columns as the training data.
+
+ :param X: 2D array. Rows are examples, columns are features. If pandas DataFrame, the columns are used for variable names (so make sure they don't contain spaces).
+ :type X: np.ndarray/pandas.DataFrame
+ :return: 1D array (rows are examples) or 2D array (rows are examples, columns are outputs).
+ """
+ self.refresh()
+ best = self.get_best()
+ if self.multioutput:
+ return np.stack([eq["lambda_format"](X) for eq in best], axis=1)
+ return best["lambda_format"](X)
+
+ def sympy(self):
+ """Return sympy representation of the equation(s) chosen by `model_selection`."""
+ self.refresh()
+ best = self.get_best()
+ if self.multioutput:
+ return [eq["sympy_format"] for eq in best]
+ return best["sympy_format"]
+
+ def latex(self):
+ """Return latex representation of the equation(s) chosen by `model_selection`."""
+ self.refresh()
+ sympy_representation = self.sympy()
+ if self.multioutput:
+ return [sympy.latex(s) for s in sympy_representation]
+ return sympy.latex(sympy_representation)
+
+ def jax(self):
+ """Return jax representation of the equation(s) chosen by `model_selection`.
+
+ Each equation (multiple given if there are multiple outputs) is a dictionary
+ containing {"callable": func, "parameters": params}. To call `func`, pass
+ func(X, params). This function is differentiable using `jax.grad`.
+ """
+ if self.using_pandas:
+ warnings.warn(
+ "PySR's JAX modules are not set up to work with a "
+ "model that was trained on pandas dataframes. "
+ "Train on an array instead to ensure everything works as planned."
+ )
+ self.set_params(output_jax_format=True)
+ self.refresh()
+ best = self.get_best()
+ if self.multioutput:
+ return [eq["jax_format"] for eq in best]
+ return best["jax_format"]
+
+ def pytorch(self):
+ """Return pytorch representation of the equation(s) chosen by `model_selection`.
+
+ Each equation (multiple given if there are multiple outputs) is a PyTorch module
+ containing the parameters as trainable attributes. You can use the module like
+ any other PyTorch module: `module(X)`, where `X` is a tensor with the same
+ column ordering as trained with.
+ """
+ if self.using_pandas:
+ warnings.warn(
+ "PySR's PyTorch modules are not set up to work with a "
+ "model that was trained on pandas dataframes. "
+ "Train on an array instead to ensure everything works as planned."
+ )
+ self.set_params(output_torch_format=True)
+ self.refresh()
+ best = self.get_best()
+ if self.multioutput:
+ return [eq["torch_format"] for eq in best]
+ return best["torch_format"]
+
+ def _run(self, X, y, weights, variable_names):
+ global already_ran
+ global Main
+
+ for key in self.surface_parameters:
+ if key in self.params:
+ raise ValueError(
+ f"{key} is a surface parameter, and cannot be in self.params"
+ )
+
+ multithreading = self.params["multithreading"]
+ procs = self.params["procs"]
+ binary_operators = self.params["binary_operators"]
+ unary_operators = self.params["unary_operators"]
+ batching = self.params["batching"]
+ maxsize = self.params["maxsize"]
+ select_k_features = self.params["select_k_features"]
+ Xresampled = self.params["Xresampled"]
+ denoise = self.params["denoise"]
+ constraints = self.params["constraints"]
+ update = self.params["update"]
+ loss = self.params["loss"]
+ weightMutateConstant = self.params["weightMutateConstant"]
+ weightMutateOperator = self.params["weightMutateOperator"]
+ weightAddNode = self.params["weightAddNode"]
+ weightInsertNode = self.params["weightInsertNode"]
+ weightDeleteNode = self.params["weightDeleteNode"]
+ weightSimplify = self.params["weightSimplify"]
+ weightRandomize = self.params["weightRandomize"]
+ weightDoNothing = self.params["weightDoNothing"]
+
+ if Main is None:
+ if multithreading:
+ os.environ["JULIA_NUM_THREADS"] = str(procs)
+
+ Main = init_julia()
+
+ if isinstance(X, pd.DataFrame):
+ if variable_names is not None:
+ warnings.warn("Resetting variable_names from X.columns")
+
+ variable_names = list(X.columns)
+ X = np.array(X)
+ self.using_pandas = True
+ else:
+ self.using_pandas = False
+
+ if len(X.shape) == 1:
+ X = X[:, None]
+
+ assert not isinstance(y, pd.DataFrame)
+
+ if len(variable_names) == 0:
+ variable_names = [f"x{i}" for i in range(X.shape[1])]
+
+ use_custom_variable_names = len(variable_names) != 0
+ # TODO: this is always true.
+
+ _check_assertions(
+ X,
+ binary_operators,
+ unary_operators,
+ use_custom_variable_names,
+ variable_names,
+ weights,
+ y,
+ )
+
+ self.n_features = X.shape[1]
+
+ if len(X) > 10000 and not batching:
+ warnings.warn(
+ "Note: you are running with more than 10,000 datapoints. You should consider turning on batching (https://pysr.readthedocs.io/en/latest/docs/options/#batching). You should also reconsider if you need that many datapoints. Unless you have a large amount of noise (in which case you should smooth your dataset first), generally < 10,000 datapoints is enough to find a functional form with symbolic regression. More datapoints will lower the search speed."
+ )
+
+ X, selection = _handle_feature_selection(
+ X, select_k_features, y, variable_names
+ )
+
+ if len(y.shape) == 1 or (len(y.shape) == 2 and y.shape[1] == 1):
+ self.multioutput = False
+ self.nout = 1
+ y = y.reshape(-1)
+ elif len(y.shape) == 2:
+ self.multioutput = True
+ self.nout = y.shape[1]
+ else:
+ raise NotImplementedError("y shape not supported!")
+
+ if denoise:
+ if weights is not None:
+ raise NotImplementedError(
+ "No weights for denoising - the weights are learned."
+ )
+ if Xresampled is not None:
+ # Select among only the selected features:
+ if isinstance(Xresampled, pd.DataFrame):
+ # Handle the case where Xresampled is a pandas DataFrame:
+ if selection is not None:
+ Xresampled = Xresampled[[variable_names[i] for i in selection]]
+ else:
+ Xresampled = Xresampled[variable_names]
+ Xresampled = np.array(Xresampled)
+ else:
+ if selection is not None:
+ Xresampled = Xresampled[:, selection]
+ if self.multioutput:
+ y = np.stack(
+ [
+ _denoise(X, y[:, i], Xresampled=Xresampled)[1]
+ for i in range(self.nout)
+ ],
+ axis=1,
+ )
+ if Xresampled is not None:
+ X = Xresampled
+ else:
+ X, y = _denoise(X, y, Xresampled=Xresampled)
+
+ self.julia_project = _get_julia_project(self.julia_project)
+
+ tmpdir = Path(tempfile.mkdtemp(dir=self.params["tempdir"]))
+
+ if self.params["temp_equation_file"]:
+ self.equation_file = tmpdir / "hall_of_fame.csv"
+ elif self.equation_file is None:
+ date_time = datetime.now().strftime("%Y-%m-%d_%H%M%S.%f")[:-3]
+ self.equation_file = "hall_of_fame_" + date_time + ".csv"
+
+ _create_inline_operators(
+ binary_operators=binary_operators, unary_operators=unary_operators
+ )
+ _handle_constraints(
+ binary_operators=binary_operators,
+ unary_operators=unary_operators,
+ constraints=constraints,
+ )
+
+ una_constraints = [constraints[op] for op in unary_operators]
+ bin_constraints = [constraints[op] for op in binary_operators]
+
+ try:
+ # TODO: is this needed since Julia now prints directly to stdout?
+ term_width = shutil.get_terminal_size().columns
+ except:
+ _, term_width = subprocess.check_output(["stty", "size"]).split()
+
+ if not already_ran:
+ from julia import Pkg
+
+ Pkg.activate(f"{_escape_filename(self.julia_project)}")
+ try:
+ if update:
+ Pkg.resolve()
+ Pkg.instantiate()
+ else:
+ Pkg.instantiate()
+ except RuntimeError as e:
+ raise ImportError(
+ f"""
+ Required dependencies are not installed or built. Run the following code in the Python REPL:
+
+ >>> import pysr
+ >>> pysr.install()
+
+ Tried to activate project {self.julia_project} but failed."""
+ ) from e
+ Main.eval("using SymbolicRegression")
+
+ Main.plus = Main.eval("(+)")
+ Main.sub = Main.eval("(-)")
+ Main.mult = Main.eval("(*)")
+ Main.pow = Main.eval("(^)")
+ Main.div = Main.eval("(/)")
+
+ Main.custom_loss = Main.eval(loss)
+
+ mutationWeights = [
+ float(weightMutateConstant),
+ float(weightMutateOperator),
+ float(weightAddNode),
+ float(weightInsertNode),
+ float(weightDeleteNode),
+ float(weightSimplify),
+ float(weightRandomize),
+ float(weightDoNothing),
+ ]
+
+ options = Main.Options(
+ binary_operators=Main.eval(str(tuple(binary_operators)).replace("'", "")),
+ unary_operators=Main.eval(str(tuple(unary_operators)).replace("'", "")),
+ bin_constraints=bin_constraints,
+ una_constraints=una_constraints,
+ loss=Main.custom_loss,
+ maxsize=int(maxsize),
+ hofFile=_escape_filename(self.equation_file),
+ npopulations=int(self.params["populations"]),
+ batching=batching,
+ batchSize=int(
+ min([self.params["batchSize"], len(X)]) if batching else len(X)
+ ),
+ mutationWeights=mutationWeights,
+ terminal_width=int(term_width),
+ probPickFirst=self.params["tournament_selection_p"],
+ ns=self.params["tournament_selection_n"],
+ # These have the same name:
+ parsimony=self.params["parsimony"],
+ alpha=self.params["alpha"],
+ maxdepth=self.params["maxdepth"],
+ fast_cycle=self.params["fast_cycle"],
+ migration=self.params["migration"],
+ hofMigration=self.params["hofMigration"],
+ fractionReplacedHof=self.params["fractionReplacedHof"],
+ shouldOptimizeConstants=self.params["shouldOptimizeConstants"],
+ warmupMaxsizeBy=self.params["warmupMaxsizeBy"],
+ useFrequency=self.params["useFrequency"],
+ npop=self.params["npop"],
+ ncyclesperiteration=self.params["ncyclesperiteration"],
+ fractionReplaced=self.params["fractionReplaced"],
+ topn=self.params["topn"],
+ verbosity=self.params["verbosity"],
+ optimizer_algorithm=self.params["optimizer_algorithm"],
+ optimizer_nrestarts=self.params["optimizer_nrestarts"],
+ optimize_probability=self.params["optimize_probability"],
+ optimizer_iterations=self.params["optimizer_iterations"],
+ perturbationFactor=self.params["perturbationFactor"],
+ annealing=self.params["annealing"],
+ )
+
+ np_dtype = {16: np.float16, 32: np.float32, 64: np.float64}[
+ self.params["precision"]
+ ]
+
+ Main.X = np.array(X, dtype=np_dtype).T
+ if len(y.shape) == 1:
+ Main.y = np.array(y, dtype=np_dtype)
+ else:
+ Main.y = np.array(y, dtype=np_dtype).T
+ if weights is not None:
+ if len(weights.shape) == 1:
+ Main.weights = np.array(weights, dtype=np_dtype)
+ else:
+ Main.weights = np.array(weights, dtype=np_dtype).T
+ else:
+ Main.weights = None
+
+ cprocs = 0 if multithreading else procs
+
+ self.raw_julia_output = Main.EquationSearch(
+ Main.X,
+ Main.y,
+ weights=Main.weights,
+ niterations=int(self.params["niterations"]),
+ varMap=(
+ variable_names
+ if selection is None
+ else [variable_names[i] for i in selection]
+ ),
+ options=options,
+ numprocs=int(cprocs),
+ multithreading=bool(multithreading),
+ )
+
+ self.variable_names = variable_names
+ self.selection = selection
+
+ # Not in params:
+ # selection, variable_names, multioutput
+
+ self.equations = self.get_hof()
+
+ if self.params["delete_tempfiles"]:
+ shutil.rmtree(tmpdir)
+
+ already_ran = True
+
+ def get_hof(self):
+ """Get the equations from a hall of fame file."""
+
+ try:
+ if self.multioutput:
+ all_outputs = []
+ for i in range(1, self.nout + 1):
+ df = pd.read_csv(
+ str(self.equation_file) + f".out{i}" + ".bkup",
+ sep="|",
+ )
+ # Rename Complexity column to complexity:
+ df.rename(
+ columns={
+ "Complexity": "complexity",
+ "MSE": "loss",
+ "Equation": "equation",
+ },
+ inplace=True,
+ )
+
+ all_outputs.append(df)
+ else:
+ all_outputs = [pd.read_csv(str(self.equation_file) + ".bkup", sep="|")]
+ all_outputs[-1].rename(
+ columns={
+ "Complexity": "complexity",
+ "MSE": "loss",
+ "Equation": "equation",
+ },
+ inplace=True,
+ )
+ except FileNotFoundError:
+ raise RuntimeError(
+ "Couldn't find equation file! The equation search likely exited before a single iteration completed."
+ )
+
+ ret_outputs = []
+
+ for output in all_outputs:
+
+ scores = []
+ lastMSE = None
+ lastComplexity = 0
+ sympy_format = []
+ lambda_format = []
+ if self.output_jax_format:
+ jax_format = []
+ if self.output_torch_format:
+ torch_format = []
+ use_custom_variable_names = len(self.variable_names) != 0
+ local_sympy_mappings = {
+ **self.extra_sympy_mappings,
+ **sympy_mappings,
+ }
+
+ if use_custom_variable_names:
+ sympy_symbols = [
+ sympy.Symbol(self.variable_names[i]) for i in range(self.n_features)
+ ]
+ else:
+ sympy_symbols = [
+ sympy.Symbol("x%d" % i) for i in range(self.n_features)
+ ]
+
+ for _, eqn_row in output.iterrows():
+ eqn = sympify(eqn_row["equation"], locals=local_sympy_mappings)
+ sympy_format.append(eqn)
+
+ # Numpy:
+ lambda_format.append(
+ CallableEquation(
+ sympy_symbols, eqn, self.selection, self.variable_names
+ )
+ )
+
+ # JAX:
+ if self.output_jax_format:
+ from .export_jax import sympy2jax
+
+ func, params = sympy2jax(
+ eqn,
+ sympy_symbols,
+ selection=self.selection,
+ extra_jax_mappings=self.extra_jax_mappings,
+ )
+ jax_format.append({"callable": func, "parameters": params})
+
+ # Torch:
+ if self.output_torch_format:
+ from .export_torch import sympy2torch
+
+ module = sympy2torch(
+ eqn,
+ sympy_symbols,
+ selection=self.selection,
+ extra_torch_mappings=self.extra_torch_mappings,
+ )
+ torch_format.append(module)
+
+ curMSE = eqn_row["loss"]
+ curComplexity = eqn_row["complexity"]
+
+ if lastMSE is None:
+ cur_score = 0.0
+ else:
+ if curMSE > 0.0:
+ cur_score = -np.log(curMSE / lastMSE) / (
+ curComplexity - lastComplexity
+ )
+ else:
+ cur_score = np.inf
+
+ scores.append(cur_score)
+ lastMSE = curMSE
+ lastComplexity = curComplexity
+
+ output["score"] = np.array(scores)
+ output["sympy_format"] = sympy_format
+ output["lambda_format"] = lambda_format
+ output_cols = [
+ "complexity",
+ "loss",
+ "score",
+ "equation",
+ "sympy_format",
+ "lambda_format",
+ ]
+ if self.output_jax_format:
+ output_cols += ["jax_format"]
+ output["jax_format"] = jax_format
+ if self.output_torch_format:
+ output_cols += ["torch_format"]
+ output["torch_format"] = torch_format
+
+ ret_outputs.append(output[output_cols])
+
+ if self.multioutput:
+ return ret_outputs
+ return ret_outputs[0]
+
+ def score(self, X, y):
+ del X
+ del y
+ raise NotImplementedError
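
With the class above as the new public entry point, the end-to-end workflow reads like any other scikit-learn estimator. A compact sketch of the usage it enables (the data and operator choices here are illustrative only):

```python
import numpy as np
from pysr import PySRRegressor

X = np.random.randn(200, 3)
y = np.sin(X[:, 0]) + X[:, 1] ** 2  # toy target

model = PySRRegressor(
    niterations=5,
    binary_operators=["+", "*"],
    unary_operators=["sin"],
)
model.fit(X, y)            # runs EquationSearch and fills model.equations

print(model)               # equation table; ">>>>" marks the model_selection pick
y_pred = model.predict(X)  # evaluate the selected equation
expr = model.sympy()       # sympy form of the selected equation
tex = model.latex()        # latex form of the selected equation
```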
setup.py CHANGED
@@ -1,7 +1,10 @@
 import setuptools

- with open("README.md", "r") as fh:
- long_description = fh.read()

 setuptools.setup(
 name="pysr",
@@ -12,7 +15,7 @@ setuptools.setup(
 long_description=long_description,
 long_description_content_type="text/markdown",
 url="https://github.com/MilesCranmer/pysr",
- install_requires=["julia", "numpy", "pandas", "sympy"],
 packages=setuptools.find_packages(),
 package_data={"pysr": ["../Project.toml", "../datasets/*"]},
 include_package_data=False,
 import setuptools

+ try:
+ with open("README.md", "r") as fh:
+ long_description = fh.read()
+ except FileNotFoundError:
+ long_description = ""

 setuptools.setup(
 name="pysr",

 long_description=long_description,
 long_description_content_type="text/markdown",
 url="https://github.com/MilesCranmer/pysr",
+ install_requires=["julia", "numpy", "pandas", "sympy", "scikit-learn"],
 packages=setuptools.find_packages(),
 package_data={"pysr": ["../Project.toml", "../datasets/*"]},
 include_package_data=False,
test/test.py CHANGED
@@ -1,8 +1,8 @@
 import unittest
 from unittest.mock import patch
 import numpy as np
- from pysr import pysr, get_hof, best, best_tex, best_callable, best_row
- from pysr.sr import run_feature_selection, _handle_feature_selection, _yesno
 import sympy
 from sympy import lambdify
 import pandas as pd
@@ -21,32 +21,33 @@ class TestPipeline(unittest.TestCase):

 def test_linear_relation(self):
 y = self.X[:, 0]
- equations = pysr(self.X, y, **self.default_test_kwargs)
- print(equations)
- self.assertLessEqual(equations.iloc[-1]["MSE"], 1e-4)

 def test_multiprocessing(self):
 y = self.X[:, 0]
- equations = pysr(
- self.X, y, **self.default_test_kwargs, procs=2, multithreading=False
- )
- print(equations)
- self.assertLessEqual(equations.iloc[-1]["MSE"], 1e-4)

 def test_multioutput_custom_operator(self):
 y = self.X[:, [0, 1]] ** 2
- equations = pysr(
- self.X,
- y,
 unary_operators=["sq(x) = x^2"],
- binary_operators=["plus"],
 extra_sympy_mappings={"sq": lambda x: x ** 2},
 **self.default_test_kwargs,
 procs=0,
 )
 print(equations)
- self.assertLessEqual(equations[0].iloc[-1]["MSE"], 1e-4)
- self.assertLessEqual(equations[1].iloc[-1]["MSE"], 1e-4)

 def test_multioutput_weighted_with_callable_temp_equation(self):
 y = self.X[:, [0, 1]] ** 2
@@ -58,10 +59,7 @@ class TestPipeline(unittest.TestCase):
 y = (2 - w) * y
 # Thus, pysr needs to use the weights to find the right equation!

- pysr(
- self.X,
- y,
- weights=w,
 unary_operators=["sq(x) = x^2"],
 binary_operators=["plus"],
 extra_sympy_mappings={"sq": lambda x: x ** 2},
@@ -70,34 +68,46 @@ class TestPipeline(unittest.TestCase):
 temp_equation_file=True,
 delete_tempfiles=False,
 )

 np.testing.assert_almost_equal(
- best_callable()[0](self.X), self.X[:, 0] ** 2, decimal=4
 )
 np.testing.assert_almost_equal(
- best_callable()[1](self.X), self.X[:, 1] ** 2, decimal=4
 )

- def test_empty_operators_single_input(self):
 X = np.random.randn(100, 1)
 y = X[:, 0] + 3.0
- equations = pysr(
- X,
- y,
 unary_operators=[],
 binary_operators=["plus"],
 **self.default_test_kwargs,
 )

- self.assertLessEqual(equations.iloc[-1]["MSE"], 1e-4)

 def test_noisy(self):

 np.random.seed(1)
 y = self.X[:, [0, 1]] ** 2 + np.random.randn(self.X.shape[0], 1) * 0.05
- equations = pysr(
- self.X,
- y,
 # Test that passing a single operator works:
 unary_operators="sq(x) = x^2",
 binary_operators="plus",
@@ -106,8 +116,9 @@ class TestPipeline(unittest.TestCase):
 procs=0,
 denoise=True,
 )
- self.assertLessEqual(best_row(equations=equations)[0]["MSE"], 1e-2)
- self.assertLessEqual(best_row(equations=equations)[1]["MSE"], 1e-2)

 def test_pandas_resample(self):
 np.random.seed(1)
@@ -130,9 +141,7 @@ class TestPipeline(unittest.TestCase):
 "T": np.random.randn(100),
 }
 )
- equations = pysr(
- X,
- y,
 unary_operators=[],
 binary_operators=["+", "*", "/", "-"],
 **self.default_test_kwargs,
@@ -140,11 +149,12 @@ class TestPipeline(unittest.TestCase):
 denoise=True,
 select_k_features=2,
 )
- self.assertNotIn("unused_feature", best_tex())
- self.assertIn("T", best_tex())
- self.assertIn("x", best_tex())
- self.assertLessEqual(equations.iloc[-1]["MSE"], 1e-2)
- fn = best_callable()
 self.assertListEqual(list(sorted(fn._selection)), [0, 1])
 X2 = pd.DataFrame(
@@ -154,44 +164,45 @@ class TestPipeline(unittest.TestCase):
 }
 )
 self.assertLess(np.average((fn(X2) - true_fn(X2)) ** 2), 1e-2)


 class TestBest(unittest.TestCase):
 def setUp(self):
 equations = pd.DataFrame(
 {
- "Equation": ["1.0", "cos(x0)", "square(cos(x0))"],
- "MSE": [1.0, 0.1, 1e-5],
- "Complexity": [1, 2, 3],
 }
 )

- equations["Complexity MSE Equation".split(" ")].to_csv(
 "equation_file.csv.bkup", sep="|"
 )

- self.equations = get_hof(
- "equation_file.csv",
- n_features=2,
 variables_names="x0 x1".split(" "),
 extra_sympy_mappings={},
 output_jax_format=False,
 multioutput=False,
 nout=1,
 )

 def test_best(self):
- self.assertEqual(best(self.equations), sympy.cos(sympy.Symbol("x0")) ** 2)
- self.assertEqual(best(), sympy.cos(sympy.Symbol("x0")) ** 2)

 def test_best_tex(self):
- self.assertEqual(best_tex(self.equations), "\\cos^{2}{\\left(x_{0} \\right)}")
- self.assertEqual(best_tex(), "\\cos^{2}{\\left(x_{0} \\right)}")

 def test_best_lambda(self):
 X = np.random.randn(10, 2)
 y = np.cos(X[:, 0]) ** 2
- for f in [best_callable(), best_callable(self.equations)]:
 np.testing.assert_almost_equal(f(X), y, decimal=4)

@@ -221,11 +232,3 @@ class TestFeatureSelection(unittest.TestCase):
 np.testing.assert_array_equal(
 np.sort(selected_X, axis=1), np.sort(X[:, [2, 3]], axis=1)
 )
-
-
- class TestHelperFunctions(unittest.TestCase):
- @patch("builtins.input", side_effect=["y", "n"])
- def test_yesno(self, mock_input):
- # Assert that the yes/no function correctly deals with y/n
- self.assertEqual(_yesno("Test"), True)
- self.assertEqual(_yesno("Test"), False)

 
 import unittest
 from unittest.mock import patch
 import numpy as np
+from pysr import PySRRegressor
+from pysr.sr import run_feature_selection, _handle_feature_selection
 import sympy
 from sympy import lambdify
 import pandas as pd
 
     def test_linear_relation(self):
         y = self.X[:, 0]
+        model = PySRRegressor(**self.default_test_kwargs)
+        model.fit(self.X, y)
+        model.set_params(model_selection="accuracy")
+        print(model.equations)
+        self.assertLessEqual(model.get_best()["loss"], 1e-4)
 
     def test_multiprocessing(self):
         y = self.X[:, 0]
+        model = PySRRegressor(**self.default_test_kwargs, procs=2, multithreading=False)
+        model.fit(self.X, y)
+        print(model.equations)
+        self.assertLessEqual(model.equations.iloc[-1]["loss"], 1e-4)
 
     def test_multioutput_custom_operator(self):
         y = self.X[:, [0, 1]] ** 2
+        model = PySRRegressor(
             unary_operators=["sq(x) = x^2"],
             extra_sympy_mappings={"sq": lambda x: x ** 2},
+            binary_operators=["plus"],
             **self.default_test_kwargs,
             procs=0,
         )
+        model.fit(self.X, y)
+        equations = model.equations
         print(equations)
+        self.assertLessEqual(equations[0].iloc[-1]["loss"], 1e-4)
+        self.assertLessEqual(equations[1].iloc[-1]["loss"], 1e-4)
 
     def test_multioutput_weighted_with_callable_temp_equation(self):
         y = self.X[:, [0, 1]] ** 2
 
         y = (2 - w) * y
         # Thus, pysr needs to use the weights to find the right equation!
 
+        model = PySRRegressor(
             unary_operators=["sq(x) = x^2"],
             binary_operators=["plus"],
             extra_sympy_mappings={"sq": lambda x: x ** 2},
 
             temp_equation_file=True,
             delete_tempfiles=False,
         )
+        model.fit(self.X, y, weights=w)
 
         np.testing.assert_almost_equal(
+            model.predict(self.X)[:, 0], self.X[:, 0] ** 2, decimal=4
         )
         np.testing.assert_almost_equal(
+            model.predict(self.X)[:, 1], self.X[:, 1] ** 2, decimal=4
         )
 
+    def test_empty_operators_single_input_sklearn(self):
         X = np.random.randn(100, 1)
         y = X[:, 0] + 3.0
+        regressor = PySRRegressor(
+            model_selection="accuracy",
             unary_operators=[],
             binary_operators=["plus"],
             **self.default_test_kwargs,
         )
+        self.assertTrue("None" in regressor.__repr__())
+        regressor.fit(X, y)
+        self.assertTrue("None" not in regressor.__repr__())
+        self.assertTrue(">>>>" in regressor.__repr__())
+
+        self.assertLessEqual(regressor.equations.iloc[-1]["loss"], 1e-4)
+        np.testing.assert_almost_equal(regressor.predict(X), y, decimal=1)
+
+        # Tweak model selection:
+        regressor.set_params(model_selection="best")
+        self.assertEqual(regressor.get_params()["model_selection"], "best")
+        self.assertTrue("None" not in regressor.__repr__())
+        self.assertTrue(">>>>" in regressor.__repr__())
+
+        # "best" model_selection should also give a decent loss:
+        np.testing.assert_almost_equal(regressor.predict(X), y, decimal=1)
 
     def test_noisy(self):
         np.random.seed(1)
         y = self.X[:, [0, 1]] ** 2 + np.random.randn(self.X.shape[0], 1) * 0.05
+        model = PySRRegressor(
             # Test that passing a single operator works:
             unary_operators="sq(x) = x^2",
             binary_operators="plus",
 
             procs=0,
             denoise=True,
         )
+        model.fit(self.X, y)
+        self.assertLessEqual(model.get_best()[0]["loss"], 1e-2)
+        self.assertLessEqual(model.get_best()[1]["loss"], 1e-2)
 
     def test_pandas_resample(self):
         np.random.seed(1)
 
                 "T": np.random.randn(100),
             }
         )
+        model = PySRRegressor(
             unary_operators=[],
             binary_operators=["+", "*", "/", "-"],
             **self.default_test_kwargs,
 
             denoise=True,
             select_k_features=2,
         )
+        model.fit(X, y)
+        self.assertNotIn("unused_feature", model.latex())
+        self.assertIn("T", model.latex())
+        self.assertIn("x", model.latex())
+        self.assertLessEqual(model.get_best()["loss"], 1e-2)
+        fn = model.get_best()["lambda_format"]
         self.assertListEqual(list(sorted(fn._selection)), [0, 1])
         X2 = pd.DataFrame(
             {
 
             }
         )
         self.assertLess(np.average((fn(X2) - true_fn(X2)) ** 2), 1e-2)
+        self.assertLess(np.average((model.predict(X2) - true_fn(X2)) ** 2), 1e-2)
 
 
 class TestBest(unittest.TestCase):
     def setUp(self):
         equations = pd.DataFrame(
             {
+                "equation": ["1.0", "cos(x0)", "square(cos(x0))"],
+                "loss": [1.0, 0.1, 1e-5],
+                "complexity": [1, 2, 3],
             }
         )
 
+        equations["complexity loss equation".split(" ")].to_csv(
             "equation_file.csv.bkup", sep="|"
         )
 
+        self.model = PySRRegressor(
+            equation_file="equation_file.csv",
             variables_names="x0 x1".split(" "),
             extra_sympy_mappings={},
             output_jax_format=False,
             multioutput=False,
             nout=1,
         )
+        self.model.n_features = 2
+        self.model.refresh()
+        self.equations = self.model.equations
 
     def test_best(self):
+        self.assertEqual(self.model.sympy(), sympy.cos(sympy.Symbol("x0")) ** 2)
 
     def test_best_tex(self):
+        self.assertEqual(self.model.latex(), "\\cos^{2}{\\left(x_{0} \\right)}")
 
     def test_best_lambda(self):
         X = np.random.randn(10, 2)
         y = np.cos(X[:, 0]) ** 2
+        for f in [self.model.predict, self.equations.iloc[-1]["lambda_format"]]:
             np.testing.assert_almost_equal(f(X), y, decimal=4)
 
 
         np.testing.assert_array_equal(
             np.sort(selected_X, axis=1), np.sort(X[:, [2, 3]], axis=1)
         )
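Taken together, the hunks above are the heart of this PR: the functional `pysr(X, y, ...)` entry point and the `best`/`best_tex`/`best_callable`/`best_row`/`get_hof` helpers are replaced by a single scikit-learn-style `PySRRegressor`, and the hall-of-fame columns move from `Equation`/`MSE`/`Complexity` to `equation`/`loss`/`complexity`. A minimal sketch of the new call pattern, restricted to methods exercised by the tests above (the data and operator choices here are illustrative, not taken from the diff):

```python
import numpy as np
from pysr import PySRRegressor

# Illustrative data, in the spirit of the tests:
X = np.random.randn(100, 5)
y = X[:, 0] ** 2

# Configure once; then fit/predict like any sklearn estimator:
model = PySRRegressor(
    model_selection="accuracy",  # or "best", toggled via set_params()
    unary_operators=["sq(x) = x^2"],
    binary_operators=["plus"],
    extra_sympy_mappings={"sq": lambda x: x ** 2},
)
model.fit(X, y)  # per-sample weights may be passed as fit(X, y, weights=w)

print(model.equations)           # hall of fame, with "complexity"/"loss"/"equation"
print(model.get_best()["loss"])  # row chosen according to model_selection
print(model.latex())             # LaTeX for the chosen equation
y_pred = model.predict(X)        # evaluates the chosen equation
```

Note that, as the `test_empty_operators_single_input_sklearn` test shows, `set_params(model_selection=...)` changes which equation `predict` uses without refitting.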
test/test_jax.py CHANGED
@@ -1,6 +1,6 @@
 import unittest
 import numpy as np
-from pysr import sympy2jax, get_hof
 import pandas as pd
 from jax import numpy as jnp
 from jax import random
@@ -25,7 +25,7 @@ class TestJAX(unittest.TestCase):
         X = np.random.randn(100, 10)
         equations = pd.DataFrame(
             {
-                "Equation": ["1.0", "cos(x0)", "square(cos(x0))"],
                 "MSE": [1.0, 0.1, 1e-5],
                 "Complexity": [1, 2, 3],
             }
@@ -35,18 +35,20 @@ class TestJAX(unittest.TestCase):
             "equation_file.csv.bkup", sep="|"
         )
 
-        equations = get_hof(
-            "equation_file.csv",
-            n_features=2,
-            variables_names="x1 x2 x3".split(" "),
-            extra_sympy_mappings={},
             output_jax_format=True,
             multioutput=False,
             nout=1,
             selection=[1, 2, 3],
         )
 
-        jformat = equations.iloc[-1].jax_format
         np.testing.assert_almost_equal(
             np.array(jformat["callable"](jnp.array(X), jformat["parameters"])),
             np.square(np.cos(X[:, 1])),  # Select feature 1
 
 import unittest
 import numpy as np
+from pysr import sympy2jax, PySRRegressor
 import pandas as pd
 from jax import numpy as jnp
 from jax import random
 
         X = np.random.randn(100, 10)
         equations = pd.DataFrame(
             {
+                "Equation": ["1.0", "cos(x1)", "square(cos(x1))"],
                 "MSE": [1.0, 0.1, 1e-5],
                 "Complexity": [1, 2, 3],
             }
 
             "equation_file.csv.bkup", sep="|"
         )
 
+        model = PySRRegressor(
+            equation_file="equation_file.csv",
             output_jax_format=True,
+            variables_names="x1 x2 x3".split(" "),
             multioutput=False,
             nout=1,
             selection=[1, 2, 3],
         )
 
+        model.n_features = 2
+        model.using_pandas = False
+        model.refresh()
+        jformat = model.jax()
+
         np.testing.assert_almost_equal(
             np.array(jformat["callable"](jnp.array(X), jformat["parameters"])),
             np.square(np.cos(X[:, 1])),  # Select feature 1
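The JAX test shows the same migration for exports: instead of reading a `jax_format` column off the DataFrame returned by `get_hof`, it rebuilds a `PySRRegressor` around an existing equation file and calls `model.jax()`. A sketch of that flow; note that `n_features`, `using_pandas`, and `refresh()` are internal state that the test sets by hand, not a documented public API:

```python
import numpy as np
from jax import numpy as jnp
from pysr import PySRRegressor

# Rebuild a fitted state from a previously written hall-of-fame file:
model = PySRRegressor(
    equation_file="equation_file.csv",
    output_jax_format=True,
)
model.n_features = 2        # internal attributes, set manually as the test does
model.using_pandas = False
model.refresh()             # re-parses equation_file into model.equations

jformat = model.jax()       # {"callable": ..., "parameters": ...} for the best row
X = np.random.randn(100, 10)
y = jformat["callable"](jnp.array(X), jformat["parameters"])
```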
test/test_torch.py CHANGED
@@ -1,7 +1,7 @@
 import unittest
 import numpy as np
 import pandas as pd
-from pysr import sympy2torch, get_hof
 import torch
 import sympy
 
@@ -24,7 +24,7 @@ class TestTorch(unittest.TestCase):
         X = np.random.randn(100, 10)
         equations = pd.DataFrame(
             {
-                "Equation": ["1.0", "cos(x0)", "square(cos(x0))"],
                 "MSE": [1.0, 0.1, 1e-5],
                 "Complexity": [1, 2, 3],
             }
@@ -34,9 +34,9 @@ class TestTorch(unittest.TestCase):
             "equation_file.csv.bkup", sep="|"
         )
 
-        equations = get_hof(
-            "equation_file.csv",
-            n_features=2,  # TODO: Why is this 2 and not 3?
             variables_names="x1 x2 x3".split(" "),
             extra_sympy_mappings={},
             output_torch_format=True,
@@ -44,8 +44,12 @@ class TestTorch(unittest.TestCase):
             nout=1,
             selection=[1, 2, 3],
         )
 
-        tformat = equations.iloc[-1].torch_format
         np.testing.assert_almost_equal(
             tformat(torch.tensor(X)).detach().numpy(),
             np.square(np.cos(X[:, 1])),  # Selection 1st feature
@@ -84,9 +88,9 @@ class TestTorch(unittest.TestCase):
             "equation_file_custom_operator.csv.bkup", sep="|"
         )
 
-        equations = get_hof(
-            "equation_file_custom_operator.csv",
-            n_features=3,
             variables_names="x1 x2 x3".split(" "),
             extra_sympy_mappings={"mycustomoperator": sympy.sin},
             extra_torch_mappings={"mycustomoperator": torch.sin},
@@ -95,8 +99,13 @@ class TestTorch(unittest.TestCase):
             nout=1,
             selection=[0, 1, 2],
         )
 
-        tformat = equations.iloc[-1].torch_format
         np.testing.assert_almost_equal(
             tformat(torch.tensor(X)).detach().numpy(),
             np.sin(X[:, 0]),  # Selection 1st feature
 
 import unittest
 import numpy as np
 import pandas as pd
+from pysr import sympy2torch, PySRRegressor
 import torch
 import sympy
 
         X = np.random.randn(100, 10)
         equations = pd.DataFrame(
             {
+                "Equation": ["1.0", "cos(x1)", "square(cos(x1))"],
                 "MSE": [1.0, 0.1, 1e-5],
                 "Complexity": [1, 2, 3],
             }
 
             "equation_file.csv.bkup", sep="|"
         )
 
+        model = PySRRegressor(
+            model_selection="accuracy",
+            equation_file="equation_file.csv",
             variables_names="x1 x2 x3".split(" "),
             extra_sympy_mappings={},
             output_torch_format=True,
 
             nout=1,
             selection=[1, 2, 3],
         )
+        model.n_features = 2  # TODO: Why is this 2 and not 3?
+        model.using_pandas = False
+        model.refresh()
 
+        tformat = model.pytorch()
+        self.assertEqual(str(tformat), "_SingleSymPyModule(expression=cos(x1)**2)")
         np.testing.assert_almost_equal(
             tformat(torch.tensor(X)).detach().numpy(),
             np.square(np.cos(X[:, 1])),  # Selection 1st feature
 
             "equation_file_custom_operator.csv.bkup", sep="|"
         )
 
+        model = PySRRegressor(
+            model_selection="accuracy",
+            equation_file="equation_file_custom_operator.csv",
             variables_names="x1 x2 x3".split(" "),
             extra_sympy_mappings={"mycustomoperator": sympy.sin},
             extra_torch_mappings={"mycustomoperator": torch.sin},
 
             nout=1,
             selection=[0, 1, 2],
         )
+        model.n_features = 3
+        model.using_pandas = False
+        model.refresh()
+        # Will automatically use the set global state from get_hof.
+        tformat = model.pytorch()
+        self.assertEqual(str(tformat), "_SingleSymPyModule(expression=sin(x0))")
 
         np.testing.assert_almost_equal(
             tformat(torch.tensor(X)).detach().numpy(),
             np.sin(X[:, 0]),  # Selection 1st feature
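`model.pytorch()` is the torch-side counterpart: it replaces the old `torch_format` column and returns a torch module (a `_SingleSymPyModule`) that evaluates the selected expression; for custom operators, the test supplies both `extra_sympy_mappings` and `extra_torch_mappings` so the sympy expression can be lowered to torch calls. A sketch under the same caveats as the JAX example above:

```python
import torch
from pysr import PySRRegressor

# Rebuild a fitted state from a previously written hall-of-fame file:
model = PySRRegressor(
    model_selection="accuracy",
    equation_file="equation_file.csv",
    output_torch_format=True,
)
model.n_features = 2        # internal attributes, set manually as the test does
model.using_pandas = False
model.refresh()

expr_module = model.pytorch()  # torch module wrapping the selected equation
X = torch.randn(100, 10)
y = expr_module(X)             # evaluated like any torch module
```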