Commit d94ce53 by MilesCranmer (1 parent: 9060684): Many more examples in docs

Files changed:
- docs/_sidebar.md (+2 -2)
- docs/examples.md (+86 -1)
docs/_sidebar.md CHANGED

```diff
@@ -1,9 +1,9 @@
 - Using PySR
 
   - [Getting Started](/)
-  - [Options](options.md)
-  - [Operators](operators.md)
   - [Examples](examples.md)
+  - [More Options](options.md)
+  - [Operators](operators.md)
 
 - API Reference
 
```
docs/examples.md CHANGED

@@ -96,7 +96,92 @@ Which gives us:

![](https://github.com/MilesCranmer/PySR/raw/master/docs/images/example_plot.png)

## 5. Feature selection

PySR, and evolution-based symbolic regression in general, performs
very poorly when the number of features is large.
Even, say, 10 features might be too many for a typical equation search.

If you are dealing with high-dimensional data with a particular type of structure,
you might consider using deep learning to break the problem into
smaller "chunks" which can then be solved by PySR, as explained in the paper
[2006.11287](https://arxiv.org/abs/2006.11287).

For tabular datasets, this is a bit trickier. Luckily, PySR has a built-in feature
selection mechanism. Simply declare the parameter `select_k_features=5` to select
the 5 most important features before the equation search begins.

Here is an example. Let's say we have 30 input features and 300 data points, but only 2
of those features are actually used:
```python
X = np.random.randn(300, 30)
y = X[:, 3]**2 - X[:, 19]**2 + 1.5  # only columns 3 and 19 influence y
```

Let's create a model with the feature selection argument set up:
```python
model = PySRRegressor(
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["exp"],
    select_k_features=5,
    **kwargs  # placeholder for any other PySRRegressor options you want to set
)
```
Now let's fit this:
```python
model.fit(X, y)
```

Before the Julia backend is launched, you can see the string:
```
Using features ['x3', 'x5', 'x7', 'x19', 'x21']
```
which indicates that the feature selection (powered by a gradient-boosting tree)
has successfully kept the two relevant features, `x3` and `x19`, among the five it selected.

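PySR does this selection for you when `select_k_features` is set, but the underlying idea is simple: rank the features by the importances assigned by a boosted-tree model and keep the top `k`. Here is a minimal sketch of that idea using scikit-learn; it is an illustration only (the helper `top_k_features` is hypothetical), not PySR's exact implementation:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def top_k_features(X, y, k=5):
    """Rank columns of X by boosted-tree feature importance and return the top-k indices."""
    reg = GradientBoostingRegressor(n_estimators=100).fit(X, y)
    return np.argsort(reg.feature_importances_)[::-1][:k]

X = np.random.randn(300, 30)
y = X[:, 3]**2 - X[:, 19]**2 + 1.5
print(top_k_features(X, y))  # columns 3 and 19 should appear near the front
```
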
The PySR fit should then find the solution quickly, whereas with all 30 features
it would have struggled.

This simple preprocessing step is enough to simplify our tabular dataset,
but again, for more structured datasets, you should try the deep learning
approach mentioned above.

## 6. Denoising

Many datasets, especially in the observational sciences,
contain intrinsic noise. PySR itself is fairly robust to noise, as it is simply optimizing a loss function,
but there are still some additional steps you can take to reduce the effect of noise.

One thing you could do, which we won't detail here, is to create a custom log-likelihood
given some assumed noise model. By passing weights to the fit function, and
defining a custom loss function such as `loss="myloss(x, y, w) = w * (x - y)^2"`,
you can define any sort of log-likelihood you wish. (Note, however, that the loss must be bounded below by zero.)

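As a rough sketch of how those two pieces fit together, here is one way to express a Gaussian log-likelihood with per-point uncertainties; the `sigma` array and the inverse-variance weights below are assumptions made for this illustration:

```python
import numpy as np
from pysr import PySRRegressor

X = np.random.randn(100, 5)
sigma = 0.1 * np.ones(100)  # assumed per-point noise level
y = np.exp(X[:, 0]) + np.random.randn(100) * sigma

model = PySRRegressor(
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["exp"],
    # Weighted squared error, proportional to a Gaussian negative log-likelihood:
    loss="myloss(x, y, w) = w * (x - y)^2",
)
model.fit(X, y, weights=1 / sigma**2)  # inverse-variance weights
```

Points with a larger assumed noise level then contribute less to the loss.
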
However, the simplest thing to do is preprocessing, just like for feature selection. To do this,
set the parameter `denoise=True`. This will fit a Gaussian process (containing a white noise kernel)
to the input dataset, and predict new targets (which are assumed to be denoised) from that Gaussian process.

For example:
```python
X = np.random.randn(100, 5)
noise = np.random.randn(100) * 0.1
y = np.exp(X[:, 0]) + X[:, 1] + X[:, 2] + noise
```

Let's create and fit a model with the denoising argument set up:
```python
model = PySRRegressor(
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["exp"],
    denoise=True,
    **kwargs  # placeholder for any other PySRRegressor options you want to set
)
model.fit(X, y)
print(model)
```
If all goes well, you should find that it predicts the correct input equation, without the noise term!

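If you are curious what `denoise=True` does before the search starts, the preprocessing is roughly equivalent to the following sketch using scikit-learn (an illustration of the idea, not PySR's exact internals):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

X = np.random.randn(100, 5)
y = np.exp(X[:, 0]) + X[:, 1] + X[:, 2] + np.random.randn(100) * 0.1

# Fit a GP whose kernel includes a white-noise term, then replace y by the
# GP's smoothed prediction; the equation search runs on these smoothed targets.
kernel = RBF(length_scale=np.ones(X.shape[1])) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
y_denoised = gpr.predict(X)
```
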
## 7. Additional features

For the many other features available in PySR, please
read the [Options section](options.md).