{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "DS4E1PagbDgL" }, "source": [ "# Setup" ] }, { "cell_type": "markdown", "metadata": { "id": "tQ1r1bbb0yBv" }, "source": [ "\n", "## Instructions\n", "1. Work on a copy of this notebook: _File_ > _Save a copy in Drive_ (you will need a Google account).\n", "2. (Optional) If you would like to do the deep learning component of this tutorial, turn on the GPU with Edit->Notebook settings->Hardware accelerator->GPU\n", "3. Execute the following cell (click on it and press Ctrl+Enter) to install Julia. This may take a minute or so.\n", "4. Continue to the next section.\n", "\n", "_Notes_:\n", "* If your Colab Runtime gets reset (e.g., due to inactivity), repeat steps 3, 4.\n", "* After installation, if you want to change the Julia version or activate/deactivate the GPU, you will need to reset the Runtime: _Runtime_ > _Delete and disconnect runtime_ and repeat steps 2-4." ] }, { "cell_type": "markdown", "metadata": { "id": "COndi88gbDgO" }, "source": [ "**Run the following code to install Julia**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "GIeFXS0F0zww", "outputId": "5399ed75-f77f-47c5-e53b-4b2f231f2839" }, "outputs": [], "source": [ "!curl -fsSL https://install.julialang.org | sh -s -- -y --default-channel 1.10" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Iu9X-Y-YNmwM", "outputId": "ee14af65-043a-4ad6-efa0-3cdcc48a4eb8" }, "outputs": [], "source": [ "# Make julia available on PATH:\n", "!ln -s $HOME/.juliaup/bin/julia /usr/local/bin/julia\n", "\n", "# Test it works:\n", "!julia --version" ] }, { "cell_type": "markdown", "metadata": { "id": "ORv1c6xvbDgV" }, "source": [ "Install PySR" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "EhMRSZEYFPLz", "outputId": 
"e3aad3cb-d921-473e-b77b-8fa6a3a9e2e8" }, "outputs": [], "source": [ "!pip install pysr && python -m pysr install" ] }, { "cell_type": "markdown", "metadata": { "id": "etTMEV0wDqld" }, "source": [ "Colab's printing is non-standard, so we need to manually initialize Julia and redirect its printing. Normally, however, this is not required, and PySR will automatically start Julia during the first call to `.fit`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "j666aOI8xWF_" }, "outputs": [], "source": [ "def init_colab_printing():\n", " from pysr.julia_helpers import init_julia\n", " from julia.tools import redirect_output_streams\n", "\n", " julia_kwargs = dict(optimize=3, threads=\"auto\", compiled_modules=False)\n", " init_julia(julia_kwargs=julia_kwargs)\n", " redirect_output_streams()\n", "\n", "\n", "init_colab_printing()" ] }, { "cell_type": "markdown", "metadata": { "id": "qeCPKd9wldEK" }, "source": [ "Now, let's import all of our libraries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vFpyRxmhFqeH" }, "outputs": [], "source": [ "import sympy\n", "import numpy as np\n", "from matplotlib import pyplot as plt\n", "from pysr import PySRRegressor\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": { "id": "gsRMQ7grbDga" }, "source": [ "# Simple PySR example:\n" ] }, { "cell_type": "markdown", "metadata": { "id": "myTEwdiUFiGL" }, "source": [ "First, let's learn a simple function\n", "\n", "$$2.5382 \\cos(x3) + x0^2 - 2$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Cb1eb2XuFQh8" }, "outputs": [], "source": [ "# Dataset\n", "np.random.seed(0)\n", "X = 2 * np.random.randn(100, 5)\n", "y = 2.5382 * np.cos(X[:, 3]) + X[:, 0] ** 2 - 2" ] }, { "cell_type": "markdown", "metadata": { "id": "cturCkaVjzLs" }, "source": [ "By default, we will set up 30 populations of expressions (which evolve independently except for migrations), use 4 
threads, and use `\"best\"` for our model selection strategy:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4nDAAnisdhTc" }, "outputs": [], "source": [ "default_pysr_params = dict(\n", " populations=30,\n", " model_selection=\"best\",\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "N4gANfkaj8ie" }, "source": [ "PySR can run for arbitrarily long, and continue to find more and more accurate expressions. You can set the total number of cycles of evolution with `niterations`, although there are also a [few more ways](https://github.com/MilesCranmer/PySR/pull/134) to stop execution.\n", "\n", "**This first execution will take a bit longer to startup, as the library is JIT-compiled. The next execution will be much faster.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "p4PSrO-NK1Wa", "outputId": "55910ab3-895d-400b-e9ce-c75aef639c68" }, "outputs": [], "source": [ "# Learn equations\n", "model = PySRRegressor(\n", " niterations=30,\n", " binary_operators=[\"+\", \"*\"],\n", " unary_operators=[\"cos\", \"exp\", \"sin\"],\n", " **default_pysr_params\n", ")\n", "\n", "model.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": { "id": "-bsAECbdkQsQ" }, "source": [ "We can print the model, which will print out all the discovered expressions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 252 }, "id": "4HR8gknlZz4W", "outputId": "496283bd-a743-4cc6-a2f9-9619ba91d870" }, "outputs": [], "source": [ "model" ] }, { "cell_type": "markdown", "metadata": { "id": "ME3ddPxXkWQg" }, "source": [ "We can also view the SymPy format of the best expression:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 38 }, "id": "IQKOohdpztS7", "outputId": "0e7d058a-cce1-45ae-db94-6625f7e53a06" }, "outputs": 
[], "source": [ "model.sympy()" ] }, { "cell_type": "markdown", "metadata": { "id": "EHIIPlmClltn" }, "source": [ "We can also view the SymPy form of any other expression in the list, using its index in `model.equations_`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 39 }, "id": "GRcxq-TTlpRX", "outputId": "50bda367-1ed1-4860-8fcf-c940f2e4d935" }, "outputs": [], "source": [ "model.sympy(2)" ] }, { "cell_type": "markdown", "metadata": { "id": "YMugcGX4tbqj" }, "source": [ "## Output" ] }, { "cell_type": "markdown", "metadata": { "id": "gIWt5wz5cjXE" }, "source": [ "`model.equations_` is a Pandas DataFrame. We can export the results in various ways:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "id": "HFGaNL6tbDgi", "outputId": "0f364da5-e18d-4e31-cadf-087d641a3aed" }, "outputs": [], "source": [ "model.latex()" ] }, { "cell_type": "markdown", "metadata": { "id": "4hS8kqutcmPQ" }, "source": [ "There are also `model.sympy()`, `model.jax()`, and `model.pytorch()`. 
All of these can take an index as input, to get the result for an arbitrary equation in the list.\n", "\n", "We can also use `model.predict` for arbitrary equations, with the default equation being the one chosen by `model_selection`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Vbz4IMsk2NYH", "outputId": "361d4b6e-ac23-479d-b511-5001af05ca43" }, "outputs": [], "source": [ "ypredict = model.predict(X)\n", "ypredict_simpler = model.predict(X, 2)\n", "\n", "print(\"Default selection MSE:\", np.power(ypredict - y, 2).mean())\n", "print(\"Manual selection MSE for index 2:\", np.power(ypredict_simpler - y, 2).mean())" ] }, { "cell_type": "markdown", "metadata": { "id": "SQDUScGebDgr" }, "source": [ "# Custom operators" ] }, { "cell_type": "markdown", "metadata": { "id": "qvgVbOoSFtQY" }, "source": [ "A full list of operators is given here: https://astroautomata.com/PySR/operators,\n", "but we can also use any binary or unary operator in `julia`, or define our own as arbitrary functions.\n", "\n", "Say that we want a command to do quartic powers:\n", "\n", "$$ y = x_0^4 - 2 $$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JvXOVqSyFsdr" }, "outputs": [], "source": [ "y = X[:, 0] ** 4 - 2" ] }, { "cell_type": "markdown", "metadata": { "id": "-zoqaL8KGSK5" }, "source": [ "We can do this by passing a string in Julia syntax.\n", "\n", "We also define the operator in sympy, with `extra_sympy_mappings`, to enable its use in `predict`, and other export functions." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 339 }, "id": "PoEkpvYuGUdy", "outputId": "02834373-a054-400b-8247-2bf33a5c5beb" }, "outputs": [], "source": [ "model = PySRRegressor(\n", " niterations=5,\n", " populations=40,\n", " binary_operators=[\"+\", \"*\"],\n", " unary_operators=[\"cos\", \"exp\", \"sin\", \"quart(x) = x^4\"],\n", " extra_sympy_mappings={\"quart\": lambda x: x**4},\n", ")\n", "model.fit(X, y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 38 }, "id": "emn2IajKbDgy", "outputId": "11d5d3cf-de43-4f2b-f653-30016e09bdd0" }, "outputs": [], "source": [ "model.sympy()" ] }, { "cell_type": "markdown", "metadata": { "id": "wbWHyOjl2_kX" }, "source": [ "Since `quart` is arguably more complex than the other operators, you can also give it a different complexity, using, e.g., `complexity_of_operators={\"quart\": 2}` to give it a complexity of 2 (instead of the default 1). You can also define custom complexities for variables and constants (`complexity_of_variables` and `complexity_of_constants`, respectively; both take a single number).\n", "\n", "\n", "One can also add a binary operator, with, e.g., `\"myoperator(x, y) = x^2 * y\"`. All Julia operators that work on scalar 32-bit floating point values are available.\n", "\n", "Make sure that any operator you add is valid over the real line. So, e.g., you will need to define `\"mysqrt(x) = sqrt(abs(x))\"` to enable it for negative numbers,\n", "or simply have it return a very large number for bad inputs (to prevent negative input in a soft way):\n", "`\"mysqrt(x::T) where {T} = (x >= 0) ? sqrt(x) : T(-1e9)\"` (Julia syntax for a template function of input type `T`), which will make `mysqrt(x)` return -10^9 for negative x, hurting the loss of the equation." 
] }, { "cell_type": "markdown", "metadata": { "id": "pEXT4xskbDg0" }, "source": [ "## Scoring" ] }, { "cell_type": "markdown", "metadata": { "id": "IyeYbVVOG60w" }, "source": [ "Using `model_selection=\"best\"` selects the equation with the highest score and prints it. But in practice it is best to look through all the equations manually, keep those whose loss is below some MSE threshold, and then use the score to select among the remaining equations.\n", "\n", "Here, \"score\" is defined by:\n", "$$ \\text{score} = - \\log(\\text{loss}_i/\\text{loss}_{i-1})/\n", "(\\text{complexity}_i - \\text{complexity}_{i-1})$$" ] }, { "cell_type": "markdown", "metadata": { "id": "I3IxmvSQrhfw" }, "source": [ "This scoring is motivated by the common strategy of looking for drops in the loss-complexity curve.\n", "\n", "From Schmidt & Lipson (2009):" ] }, { "cell_type": "markdown", "metadata": { "id": "eUeXyoLxrd8o" }, "source": [ "![F4.large.jpg]()" ] }, { "cell_type": "markdown", "metadata": { "id": "gDZyxsA7bDg9" }, "source": [ "# Noise example" ] }, { "cell_type": "markdown", "metadata": { "id": "cJCHdDt6IOou" }, "source": [ "Here is an example with noise: known Gaussian noise with $\\sigma$ between 0.1 and 5.0. We record samples of $y$:\n", "\n", "$$ \\sigma \\sim U(0.1, 5.0) $$\n", "$$ \\epsilon \\sim \\mathcal{N}(0, \\sigma^2)$$\n", "$$ y = 5\\;\\cos(3.5 x_0) - 1.3 + \\epsilon.$$\n", "We have 5 features, say. 
The weights change the loss function to be:\n", "$$MSE = \\sum [(y - f(x))^2*w],$$\n", "\n", "so in this example, we can set:\n", "$$w = 1/\\sigma^2.$$" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "up1RvmwyOdal" }, "outputs": [], "source": [ "np.random.seed(0)\n", "N = 3000\n", "upper_sigma = 5\n", "X = 2 * np.random.rand(N, 5)\n", "sigma = np.random.rand(N) * (upper_sigma - 0.1) + 0.1\n", "eps = sigma * np.random.randn(N)\n", "y = 5 * np.cos(3.5 * X[:, 0]) - 1.3 + eps" ] }, { "cell_type": "markdown", "metadata": { "id": "-EJPDZbP5YEZ" }, "source": [ "Let's look at this dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 467 }, "id": "sqMqb4nJ5ZR5", "outputId": "aa24922b-2395-4e00-dce3-268fc8e603dc" }, "outputs": [], "source": [ "plt.scatter(X[:, 0], y, alpha=0.2)\n", "plt.xlabel(\"$x_0$\")\n", "plt.ylabel(\"$y$\")" ] }, { "cell_type": "markdown", "metadata": { "id": "kaddasbBuDDv" }, "source": [ "Define some weights to use:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3wqz9_sIbDhA" }, "outputs": [], "source": [ "weights = 1 / sigma**2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "v8WBYtcZbDhC", "outputId": "37d4002f-e9d6-40c0-9a24-c671d9c384e6" }, "outputs": [], "source": [ "weights[:5]" ] }, { "cell_type": "markdown", "metadata": { "id": "NXWdQSCFuAzV" }, "source": [ "Let's run PySR again:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "a07K3KUjOxcp", "outputId": "41d11915-78b7-4446-c153-b92a5e2abd4c" }, "outputs": [], "source": [ "model = PySRRegressor(\n", " loss=\"myloss(x, y, w) = w * abs(x - y)\", # Custom elementwise loss: weighted absolute error.\n", " niterations=20,\n", " populations=20, # Use more populations\n", " binary_operators=[\"+\", \"*\"],\n", " 
unary_operators=[\"cos\"],\n", ")\n", "model.fit(X, y, weights=weights)" ] }, { "cell_type": "markdown", "metadata": { "id": "CHCMO9CouFLP" }, "source": [ "Let's see if we get similar results to the true equation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "oHyUbcg6ggmx" }, "outputs": [], "source": [ "model" ] }, { "cell_type": "markdown", "metadata": { "id": "OchlZZQP8Ums" }, "source": [ "We can also filter all equations up to 2x the most accurate equation, then select the best score from that list:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PB67POLr8b_L" }, "outputs": [], "source": [ "best_idx = model.equations_.query(\n", " f\"loss < {2 * model.equations_.loss.min()}\"\n", ").score.idxmax()\n", "model.sympy(best_idx)" ] }, { "cell_type": "markdown", "metadata": { "id": "SRHTP4x55roh" }, "source": [ "We can also use `denoise=True`, which will run the input through a Gaussian process to denoise the dataset, before fitting on it." 
] }, { "cell_type": "markdown", "metadata": { "id": "eTGQ4NA78yAw" }, "source": [ "Let's look at the fit:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ezCC0IkS8zFf" }, "outputs": [], "source": [ "plt.scatter(X[:, 0], y, alpha=0.1)\n", "y_prediction = model.predict(X, index=best_idx)\n", "plt.scatter(X[:, 0], y_prediction)" ] }, { "cell_type": "markdown", "metadata": { "id": "2x-8M8W4G-KM" }, "source": [ "# Multiple outputs" ] }, { "cell_type": "markdown", "metadata": { "id": "LIJcWqBQG-KM" }, "source": [ "For multiple outputs, multiple equations are returned:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_Aar1ZJwG-KM" }, "outputs": [], "source": [ "X = 2 * np.random.randn(100, 5)\n", "y = 1 / X[:, [0, 1, 2]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9Znwq40PG-KM" }, "outputs": [], "source": [ "model = PySRRegressor(\n", " binary_operators=[\"+\", \"*\"],\n", " unary_operators=[\"inv(x) = 1/x\"],\n", " extra_sympy_mappings={\"inv\": lambda x: 1 / x},\n", ")\n", "model.fit(X, y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0Y_vy0sqG-KM" }, "outputs": [], "source": [ "model" ] }, { "cell_type": "markdown", "metadata": { "id": "-UP49CsGG-KN" }, "source": [ "# Julia packages and types" ] }, { "cell_type": "markdown", "metadata": { "id": "tOdNHheUG-KN" }, "source": [ "PySR uses [SymbolicRegression.jl](https://github.com/MilesCranmer/SymbolicRegression.jl)\n", "as its search backend. 
This is a pure Julia package, and so can interface easily with any other\n", "Julia package.\n", "For some tasks, it may be necessary to load such a package.\n", "\n", "For example, let's say we wish to discover the following relationship:\n", "\n", "$$ y = p_{3x + 1} - 5, $$\n", "\n", "where $p_i$ is the $i$th prime number, and $x$ is the input feature.\n", "\n", "Let's see if we can discover this using\n", "the [Primes.jl](https://github.com/JuliaMath/Primes.jl) package.\n", "\n", "First, let's get the Julia backend.\n", "Here, we might choose to manually specify `threads=\"auto\"`, `-O3`,\n", "and `compiled_modules=False`, although these settings only take effect if Julia has not yet started:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yUC4BMuHG-KN" }, "outputs": [], "source": [ "import pysr\n", "\n", "jl = pysr.julia_helpers.init_julia(\n", " julia_kwargs=dict(optimize=3, threads=\"auto\", compiled_modules=False)\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "af07m4uBG-KN" }, "source": [ "\n", "\n", "`jl` stores the Julia runtime.\n", "\n", "Now, let's run some Julia code to add the Primes.jl\n", "package to the PySR environment:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xBlMY-s4G-KN" }, "outputs": [], "source": [ "jl.eval(\n", " \"\"\"\n", "import Pkg\n", "Pkg.add(\"Primes\")\n", "\"\"\"\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "1rJFukD6G-KN" }, "source": [ "This imports the Julia package manager, and uses it to install\n", "`Primes.jl`. 
Now let's import `Primes.jl`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1PQl1rIaG-KN" }, "outputs": [], "source": [ "jl.eval(\"import Primes\")" ] }, { "cell_type": "markdown", "metadata": { "id": "edGdMxKnG-KN" }, "source": [ "\n", "Now, we define a custom operator:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9Ut3HcW3G-KN" }, "outputs": [], "source": [ "jl.eval(\n", " \"\"\"\n", "function p(i::T) where T\n", " if 0.5 < i < 1000\n", " return T(Primes.prime(round(Int, i)))\n", " else\n", " return T(NaN)\n", " end\n", "end\n", "\"\"\"\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "_wcV8889G-KN" }, "source": [ "\n", "We have created a function `p`, which takes a number `i` of type `T` (e.g., `T=Float64`).\n", "`p` first checks whether the input is between 0.5 and 1000.\n", "If out-of-bounds, it returns `NaN`.\n", "If in-bounds, it rounds it to the nearest integer, computes the corresponding prime number, and then\n", "converts it to the same type as input.\n", "\n", "The equivalent function in Python would be:\n", "\n", "```python\n", "import sympy\n", "\n", "def p(i):\n", " if 0.5 < i < 1000:\n", " return float(sympy.prime(int(round(i))))\n", " else:\n", " return float(\"nan\")\n", "```\n", "\n", "(However, note that this version assumes 64-bit float input, rather than any input type `T`)\n", "\n", "Next, let's generate a list of primes for our test dataset.\n", "Since we are using PyJulia, we can just call `p` directly to do this:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "giqwisEPG-KN" }, "outputs": [], "source": [ "primes = {i: jl.p(i * 1.0) for i in range(1, 999)}" ] }, { "cell_type": "markdown", "metadata": { "id": "MPAqARj6G-KO" }, "source": [ "Next, let's use this list of primes to create a dataset of $x, y$ pairs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jab4tRRRG-KO" }, "outputs": [], "source": [ "import numpy as 
np\n", "\n", "X = np.random.randint(0, 100, 100)[:, None]\n", "y = [primes[3 * X[i, 0] + 1] - 5 + np.random.randn() * 0.001 for i in range(100)]" ] }, { "cell_type": "markdown", "metadata": { "id": "3eFgWrjcG-KO" }, "source": [ "Note that we have also added a tiny bit of noise to the dataset.\n", "\n", "Finally, let's create a PySR model, and pass the custom operator. We also need to define the sympy equivalent, which we can leave as a placeholder for now:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pEYskM2_G-KO" }, "outputs": [], "source": [ "from pysr import PySRRegressor\n", "import sympy\n", "\n", "\n", "class sympy_p(sympy.Function):\n", " pass\n", "\n", "\n", "model = PySRRegressor(\n", " binary_operators=[\"+\", \"-\", \"*\", \"/\"],\n", " unary_operators=[\"p\"],\n", " niterations=20,\n", " extra_sympy_mappings={\"p\": sympy_p},\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "ee30bd41" }, "source": [ "We are all set to go! Let's see if we can find the true relation:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "li-TB19iG-KO" }, "outputs": [], "source": [ "model.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": { "id": "jwhTWZryG-KO" }, "source": [ "If all works out, you should be able to see the true relation (note that the constant offset might not be exactly 1, since `p` rounds its input to the nearest integer).\n", "\n", "You can get the sympy version of the best equation with:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bSlpX9xAG-KO" }, "outputs": [], "source": [ "model.sympy()" ] }, { "cell_type": "markdown", "metadata": { "id": "cPc1EDvRbDhL" }, "source": [ "# High-dimensional input: Neural Nets + Symbolic Regression" ] }, { "cell_type": "markdown", "metadata": { "id": "3hS2kTAbbDhL" }, "source": [ "In this example, let's learn a high-dimensional problem. 
**This will use the method proposed in our NeurIPS paper: https://arxiv.org/abs/2006.11287.**\n", "\n", "Let's consider a time series problem:\n", "\n", "$$ z = y^2,\\quad y = \\frac{1}{10} \\sum(y_i),\\quad y_i = x_{i0}^2 + 6 \\cos(2*x_{i2})$$\n", "\n", "Imagine our time series is 10 timesteps. That is very hard for symbolic regression, even if we impose the inductive bias of $$z=f(\\sum g(x_i))$$ - it is the square of the number of possible equations!\n", "\n", "But, as in our paper, **we can break this problem down into parts with a neural network. Then approximate the neural network with the symbolic regression!**\n", "\n", "Then, instead of, say, $(10^9)^2=10^{18}$ equations, we only have to consider $2\\times 10^9$ equations." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SXJGXySlbDhL" }, "outputs": [], "source": [ "import numpy as np\n", "\n", "rstate = np.random.RandomState(0)\n", "\n", "N = 100000\n", "Nt = 10\n", "X = 6 * rstate.rand(N, Nt, 5) - 3\n", "y_i = X[..., 0] ** 2 + 6 * np.cos(2 * X[..., 2])\n", "y = np.sum(y_i, axis=1) / y_i.shape[1]\n", "z = y**2\n", "X.shape, y.shape" ] }, { "cell_type": "markdown", "metadata": { "id": "8ZqGupq_uSgp" }, "source": [ "## Neural Network definition" ] }, { "cell_type": "markdown", "metadata": { "id": "r2NR0h8-bDhN" }, "source": [ "So, as described above, let's first use a neural network with the sum inductive bias to solve this problem.\n", "\n", "Essentially, we will learn two neural networks:\n", "- `f`\n", "- `g`\n", "\n", "each defined as a multi-layer perceptron. We will sum over `g` the same way as in our equation, but we won't define the summed part beforehand.\n", "\n", "Then, we will fit `g` and `f` **separately** using symbolic regression." ] }, { "cell_type": "markdown", "metadata": { "id": "aca54ffa" }, "source": [ "> **Warning**\n", ">\n", "> We import torch *after* already starting PyJulia. This is required due to interference between their C bindings. 
If you use torch, and then run PyJulia, you will likely hit a segfault. So keep this in mind for mixed deep learning + PyJulia/PySR workflows." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "k-Od8b9DlkHK" }, "outputs": [], "source": [ "!pip install pytorch_lightning" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nWVfkV_YbDhO" }, "outputs": [], "source": [ "import torch\n", "from torch import nn, optim\n", "from torch.nn import functional as F\n", "from torch.utils.data import DataLoader, TensorDataset\n", "import pytorch_lightning as pl\n", "\n", "hidden = 128\n", "total_steps = 100_000\n", "\n", "\n", "def mlp(size_in, size_out, act=nn.ReLU):\n", " return nn.Sequential(\n", " nn.Linear(size_in, hidden),\n", " act(),\n", " nn.Linear(hidden, hidden),\n", " act(),\n", " nn.Linear(hidden, hidden),\n", " act(),\n", " nn.Linear(hidden, size_out),\n", " )\n", "\n", "\n", "class SumNet(pl.LightningModule):\n", " def __init__(self):\n", " super().__init__()\n", "\n", " ########################################################\n", " # The same inductive bias as above!\n", " self.g = mlp(5, 1)\n", " self.f = mlp(1, 1)\n", "\n", " def forward(self, x):\n", " y_i = self.g(x)[:, :, 0]\n", " y = torch.sum(y_i, dim=1, keepdim=True) / y_i.shape[1]\n", " z = self.f(y)\n", " return z[:, 0]\n", "\n", " ########################################################\n", "\n", " # PyTorch Lightning bookkeeping:\n", " def training_step(self, batch, batch_idx):\n", " x, z = batch\n", " predicted_z = self(x)\n", " loss = F.mse_loss(predicted_z, z)\n", " return loss\n", "\n", " def validation_step(self, batch, batch_idx):\n", " return self.training_step(batch, batch_idx)\n", "\n", " def configure_optimizers(self):\n", " optimizer = torch.optim.Adam(self.parameters(), lr=self.max_lr)\n", " scheduler = {\n", " \"scheduler\": torch.optim.lr_scheduler.OneCycleLR(\n", " optimizer,\n", " max_lr=self.max_lr,\n", " 
total_steps=self.trainer.estimated_stepping_batches,\n", " final_div_factor=1e4,\n", " ),\n", " \"interval\": \"step\",\n", " }\n", " return [optimizer], [scheduler]" ] }, { "cell_type": "markdown", "metadata": { "id": "kK725aSEuUvG" }, "source": [ "## Data bookkeeping" ] }, { "cell_type": "markdown", "metadata": { "id": "KdWVtWUcbDhQ" }, "source": [ "Put everything into PyTorch and do a train/test split:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0ym19abgbDhR" }, "outputs": [], "source": [ "from multiprocessing import cpu_count\n", "\n", "Xt = torch.tensor(X).float()\n", "zt = torch.tensor(z).float()\n", "X_train, X_test, z_train, z_test = train_test_split(Xt, zt, random_state=0)\n", "train_set = TensorDataset(X_train, z_train)\n", "train = DataLoader(\n", " train_set, batch_size=128, num_workers=cpu_count(), shuffle=True, pin_memory=True\n", ")\n", "test_set = TensorDataset(X_test, z_test)\n", "test = DataLoader(test_set, batch_size=256, num_workers=cpu_count(), pin_memory=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "3dw_NefuudIq" }, "source": [ "## Train the model with PyTorch Lightning on GPUs:" ] }, { "cell_type": "markdown", "metadata": { "id": "hhlhLQUBbDhT" }, "source": [ "Start the model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1ldN0999bDhU" }, "outputs": [], "source": [ "pl.seed_everything(0)\n", "model = SumNet()\n", "model.total_steps = total_steps\n", "model.max_lr = 1e-2" ] }, { "cell_type": "markdown", "metadata": { "id": "WWRsu5A9bDhW" }, "source": [ "PyTorch Lightning trainer object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "33R2nrv-b62w" }, "outputs": [], "source": [ "trainer = pl.Trainer(max_steps=total_steps, accelerator=\"gpu\", devices=1)" ] }, { "cell_type": "markdown", "metadata": { "id": "jh91CukM5CkI" }, "source": [ "Here, we fit the neural network:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": 
"TXZdF8k1bDhY" }, "outputs": [], "source": [ "trainer.fit(model, train_dataloaders=train, val_dataloaders=test)" ] }, { "cell_type": "markdown", "metadata": { "id": "uYzk0yU4ulfH" }, "source": [ "## Latent vectors of network\n", "\n", "Let's get the input and output of the learned `g` function from the network over some random data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "s2sQLla5bDhb" }, "outputs": [], "source": [ "np.random.seed(0)\n", "idx = np.random.randint(0, 10000, size=1000)\n", "\n", "X_for_pysr = Xt[idx]\n", "y_i_for_pysr = model.g(X_for_pysr)[:, :, 0]\n", "y_for_pysr = torch.sum(y_i_for_pysr, dim=1) / y_i_for_pysr.shape[1]\n", "z_for_pysr = zt[idx] # Use true values.\n", "\n", "X_for_pysr.shape, y_i_for_pysr.shape" ] }, { "cell_type": "markdown", "metadata": { "id": "nCCIvvAGuyFi" }, "source": [ "## Learning over the network:\n", "\n", "Now, let's fit `g` using PySR.\n", "\n", "> **Warning**\n", ">\n", "> First, let's save the data, because sometimes PyTorch and PyJulia's C bindings interfere and cause the colab kernel to crash. 
If we need to restart, we can just load the data without having to retrain the network:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UX7Am6mZG-KT" }, "outputs": [], "source": [ "nnet_recordings = {\n", " \"g_input\": X_for_pysr.detach().cpu().numpy().reshape(-1, 5),\n", " \"g_output\": y_i_for_pysr.detach().cpu().numpy().reshape(-1),\n", " \"f_input\": y_for_pysr.detach().cpu().numpy().reshape(-1, 1),\n", " \"f_output\": z_for_pysr.detach().cpu().numpy().reshape(-1),\n", "}\n", "\n", "# Save the data for later use:\n", "import pickle as pkl\n", "\n", "with open(\"nnet_recordings.pkl\", \"wb\") as f:\n", " pkl.dump(nnet_recordings, f)" ] }, { "cell_type": "markdown", "metadata": { "id": "krhaNlwFG-KT" }, "source": [ "We can now load the data, including after a crash (be sure to re-run the import cells at the top of this notebook, including the one that starts PyJulia)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NF9aSFXHG-KT" }, "outputs": [], "source": [ "import pickle as pkl\n", "\n", "# Use a context manager so the file handle is closed after `pkl.load`:\n", "with open(\"nnet_recordings.pkl\", \"rb\") as f:\n", " nnet_recordings = pkl.load(f)\n", "f_input = nnet_recordings[\"f_input\"]\n", "f_output = nnet_recordings[\"f_output\"]\n", "g_input = nnet_recordings[\"g_input\"]\n", "g_output = nnet_recordings[\"g_output\"]" ] }, { "cell_type": "markdown", "metadata": { "id": "_hTYHhDGG-KT" }, "source": [ "And now fit using a subsample of the data (symbolic regression only needs a small sample to find the best equation):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "51QdHVSkbDhc" }, "outputs": [], "source": [ "rstate = np.random.RandomState(0)\n", "# Sample rows of the recorded `g` inputs/outputs (not `f`), since we fit `g` here:\n", "g_sample_idx = rstate.choice(g_input.shape[0], size=500, replace=False)\n", "\n", "model = PySRRegressor(\n", " niterations=50,\n", " binary_operators=[\"+\", \"-\", \"*\"],\n", " unary_operators=[\"cos\", \"square\"],\n", ")\n", "model.fit(g_input[g_sample_idx], g_output[g_sample_idx])" ] }, { "cell_type": "markdown", 
"metadata": { "id": "1a738a33" }, "source": [ "If this segfaults, restart the notebook, and run the initial imports and PyJulia part, but skip the PyTorch training. This is because PyTorch's C bindings tend to interfere with PyJulia. You can then re-run the `pkl.load` cell to import the data." ] }, { "cell_type": "markdown", "metadata": { "id": "xginVMmTu3MZ" }, "source": [ "## Validation" ] }, { "cell_type": "markdown", "metadata": { "id": "6WuaeqyqbDhe" }, "source": [ "Recall we are searching for $f$ and $g$ such that:\n", "$$z=f(\\sum g(x_i))$$\n", "which approximates the true relation:\n", "$$ z = y^2,\\quad y = \\frac{1}{10} \\sum(y_i),\\quad y_i = x_{i0}^2 + 6 \\cos(2 x_{i2})$$\n", "\n", "Let's see how well we did in recovering $g$:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "E1_VWQ45bDhf" }, "outputs": [], "source": [ "model.equations_[[\"complexity\", \"loss\", \"equation\"]]" ] }, { "cell_type": "markdown", "metadata": { "id": "mlU1hidZkgCY" }, "source": [ "A neural network can easily undo a linear transform (which commutes with the summation), so any affine transform in $g$ is to be expected. The network for $f$ has learned to undo the linear transform.\n", "\n", "This likely won't find the exact result, but it should find something similar. You may wish to try again with more `total_steps` for the neural network.\n", "\n", "Then, we can learn another analytic equation for $f$." ] }, { "cell_type": "markdown", "metadata": { "id": "TntGlQEwbDhk" }, "source": [ "**Now, we can compose these together to get the time series model!**\n", "\n", "Think about what we just did: we found an analytical equation for $z$ in terms of $500$ datapoints, under the assumption that $z$ is a function of a sum of another function over an axis:\n", "\n", "$$ z = f(\\sum_i g(x_i)) $$\n", "\n", "And we pulled out analytical copies for $g$ using symbolic regression." 
] }, { "cell_type": "markdown", "metadata": { "id": "1QsHVjAVbDhk" }, "source": [ "# Other PySR Options" ] }, { "cell_type": "markdown", "metadata": { "id": "S5dO61g1bDhk" }, "source": [ "The full list of PySR parameters can be found here: https://astroautomata.com/PySR/api" ] } ], "metadata": { "accelerator": "GPU", "colab": { "provenance": [] }, "gpuClass": "standard", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }