# Lecture 3: Troubleshooting & Testing

<div align="center">
<iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/RLemHNAO5Lw?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>

Lecture by [Charles Frye](https://twitter.com/charles_irl).<br />
Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
Published August 22, 2022.

[Download slides](https://fsdl.me/2022-lecture-03-slides).
## 1 - Testing Software

1. The general approach is that tests will help us ship faster with fewer bugs, but they won't catch all of our bugs.
2. That means we will use testing tools but won't try to achieve 100% coverage.
3. Similarly, we will use linting tools to improve the development experience but leave escape valves rather than pedantically following our style guides.
4. Finally, we'll discuss tools for automating these workflows.

### 1.1 - Tests Help Us Ship Faster. They Don't Catch All Bugs
**Tests are code we write that is designed to fail intelligibly when our other code has bugs**. These tests can help catch some bugs before they are merged into the main product, but they can't catch all bugs. The main reason is that test suites are not certificates of correctness. In some formal systems, tests can serve as proofs of code correctness. But we are writing Python (a loosey-goosey language), so all bets are off in terms of code correctness.

[Nelson Elhage](https://twitter.com/nelhage?lang=en) framed test suites more like classifiers. The classification problem is: does this commit have a bug, or is it okay? The classifier output is whether the tests pass or fail. We can then **treat test suites as a "prediction" of whether there is a bug**, which suggests a different way of designing our test suites.

When designing classifiers, we need to trade off detection and false alarms. **If we try to catch all possible bugs, we can inadvertently introduce false alarms**. The classic signature of a false alarm is a failed test followed by a commit that fixes the test rather than the code.

To avoid introducing too many false alarms, it's useful to ask yourself two questions before adding a test:

1. Which real bugs will this test catch?
2. Which false alarms will this test raise?

If you can think of more examples for the second question than the first, maybe you should reconsider whether you need this test.

One caveat: **in some settings, correctness is important**. Examples include medical diagnostics/intervention, self-driving vehicles, and banking/finance. A pattern immediately arises here: if you are operating in a high-stakes situation where errors have consequences for people's lives and livelihoods, even if it's not regulated yet, it might be regulated soon. These are examples of the **low-feasibility, high-impact ML projects** discussed in the first lecture.
### 1.2 - Use Testing Tools, But Don't Chase Coverage

- *[Pytest](https://docs.pytest.org/)* is the standard tool for testing Python code. It has a Pythonic implementation and powerful features such as creating separate suites, sharing resources across tests, and running parametrized variations of tests (see the sketch after this list).
- Pure text docs can't be checked for correctness automatically, so they are hard to maintain or trust. Python has a nice module, *[doctest](https://docs.python.org/3/library/doctest.html)*, for checking code in the documentation and preventing rot.
- Notebooks help connect rich media (charts, images, and web pages) with code execution. A cheap and dirty way to test notebooks is to add some *asserts* and use *nbformat* to run the notebooks end to end (also sketched below).
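For concreteness, here is a minimal sketch of what this looks like in practice: a parametrized *pytest* test plus a doctest-checked docstring. The `accuracy` function and its values are made up for illustration.

```python
# test_metrics.py -- hypothetical example combining pytest and doctest-style checks.
import math

import pytest


def accuracy(correct, total):
    """Fraction of correct predictions.

    >>> accuracy(3, 4)
    0.75
    >>> accuracy(0, 10)
    0.0
    """
    return correct / total


@pytest.mark.parametrize(
    "correct,total,expected",
    [(3, 4, 0.75), (0, 10, 0.0), (10, 10, 1.0)],
)
def test_accuracy(correct, total, expected):
    # One test function, several parametrized variations.
    assert math.isclose(accuracy(correct, total), expected)
```

Running `pytest --doctest-modules` executes both the parametrized cases and the docstring examples.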
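And a minimal sketch of the cheap-and-dirty notebook check described above, assuming *nbformat* and *nbclient* are installed; the notebook path is hypothetical.

```python
# Hypothetical smoke test: execute a notebook end-to-end and fail on any error.
import nbformat
from nbclient import NotebookClient


def test_notebook_runs_cleanly():
    nb = nbformat.read("notebooks/evaluate_model.ipynb", as_version=4)
    # Raises CellExecutionError if any cell (including its asserts) fails.
    NotebookClient(nb, timeout=600).execute()
```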
Once you start adding different types of tests and your codebase grows, you will want coverage tools for recording which code is checked or "covered" by tests. Typically, this is done in lines of code, but some tools can be more fine-grained. We recommend [Codecov](https://about.codecov.io/), which generates nice visualizations you can use to drill down or get a high-level overview of the current state of your testing. Codecov helps you understand your tests and can be incorporated into your testing workflow: for example, you can reject commits not only when tests fail, but also when test coverage drops below a certain threshold.

However, we recommend against doing that. Personal experience, interviews, and published research suggest that only a small fraction of the tests you write will generate most of your value. **The right tactic, engineering-wise, is to expend the limited engineering effort we have on the highest-impact tests and ensure that those are super high quality**. If you set a coverage target, you will instead write tests in order to meet that target (regardless of their quality), and you end up spending more effort writing tests and dealing with their low quality.
### 1.3 - Use Linting Tools, But Leave Escape Valves

**Clean code is of uniform and standard style**.

1. Uniform style helps avoid spending engineering time on arguments over style in pull requests and code review. It also helps improve the utility of our version control by cutting down on noisy components of diffs and reducing their size. Both benefits make it easier for humans to visually parse the diffs in our version control system and make it easier to build automation around them.
2. Standard style makes it easier to accept contributions for an open-source repository and to onboard new team members for a closed-source system.
One aspect of consistent style is consistent code formatting (with things like whitespace). The standard tool for that in Python is the *[black](https://github.com/psf/black)* formatter. It's a very opinionated tool with a fairly narrow scope in terms of style. It focuses on things that can be fully automated and can be nicely integrated into your editor and automated workflows.

For non-automatable aspects of style (like missing docstrings), we recommend *[flake8](https://flake8.pycqa.org/)*. It comes with many extensions and plugins that check for things like docstring completeness, type hinting, security issues, and common bugs.

ML codebases often contain both Python code and shell scripts. Shell scripts are powerful, but they also have a lot of sharp edges. *[shellcheck](https://www.shellcheck.net/)* knows all the weird behaviors of bash that often cause errors and issues that aren't immediately obvious, and it provides explanations for why it raises each warning or error. It's very fast to run and can be easily incorporated into your editor.
One caveat: **pedantic enforcement of style is obnoxious.** To avoid frustration with code style and linting, we recommend:

1. Filtering rules down to the minimal set that achieves the goals we set out (sticking with standards, avoiding arguments, keeping version control history clean, etc.).
2. Applying rules on an "opt-in" basis and gradually growing coverage over time, which is especially important for existing codebases that may have thousands of lines of code that would need to be fixed.
### 1.4 - Always Be Automating

**To make the best use of testing and linting practices, you want to automate these tasks and connect them to your cloud version control system (VCS)**. Connecting to the VCS state reduces friction when trying to reproduce or understand errors. Furthermore, running things outside of developer environments means that you can run tests automatically, in parallel to other development work.

Popular open-source repositories are the best place to learn about automation best practices. For instance, the PyTorch GitHub repository has tons of automated workflows built in, such as workflows that automatically run on every push and pull request.
The tool that PyTorch uses (and that we recommend) is [GitHub Actions](https://docs.github.com/en/actions), which ties automation directly to the VCS. It is powerful, flexible, performant, and easy to use. It has great documentation, is configured with simple YAML files, and is embraced by the open-source community. There are other options, such as [pre-commit.ci](https://pre-commit.ci/), [CircleCI](https://circleci.com/), and [Jenkins](https://www.jenkins.io/), but GitHub Actions seems to have won the hearts and minds of the open-source community in the last few years.

To keep your version control history as clean as possible, you want to be able to run tests and linters locally before committing. We recommend *[pre-commit](https://github.com/pre-commit/pre-commit)* to enforce hygiene checks. You can use it to run formatting, linting, etc. on every commit while keeping the total runtime to a few seconds. *pre-commit* is easy to run locally and easy to automate with GitHub Actions.
**Automation to ensure the quality and integrity of our software is a productivity enhancer.** That's broader than just CI/CD. Automation helps you avoid context switching, surfaces issues early, acts as a force multiplier for small teams, and is better documented by default.

One caveat: **automation requires really knowing your tools.** Knowing Docker well enough to use it is not the same as knowing Docker well enough to automate it. Bad automation, like bad tests, takes more time than it saves. Organizationally, that makes automation a good task for senior engineers who know these tools, have ownership over the code, and can make the decisions around automation.
### Summary

1. Automate tasks with GitHub Actions to reduce friction.
2. Use the standard Python toolkit for testing and cleaning your projects.
3. Choose testing and linting practices with the 80/20 principle, shipping velocity, and usability/developer experience in mind.
## 2 - Testing ML Systems

1. Testing ML is hard, but not impossible.
2. We should stick with the low-hanging fruit to start.
3. Test your code in production, but don't release bad code.

### 2.1 - Testing ML Is Hard, But Not Impossible

Software engineering is where many testing practices have been developed. In software engineering, we compile source code into programs. In machine learning, training compiles data into a model. These components are harder to test:

1. Data is heavier and more inscrutable than source code.
2. Training is more complex and less well-defined.
3. Models have worse tools for debugging and inspection than compiled programs.

In this section, we will focus primarily on "smoke" tests. These tests are easy to implement and still effective. They are among the 20% of tests that get us 80% of the value.
### 2.2 - Use Expectation Testing on Data

**We test our data by checking basic properties**. We express our expectations about the data, which might be things like "there are no nulls in this column" or "the completion date comes after the start date." With expectation testing, you start small with only a few properties and grow them slowly. You only want to test things that are worth raising an alarm and sending notifications to others about.
We recommend *[great_expectations](https://greatexpectations.io/)* for data testing. It automatically generates documentation and quality reports for your data, in addition to built-in logging and alerting designed for expectation testing. To get started, check out [this MadeWithML tutorial on great_expectations](https://github.com/GokuMohandas/testing-ml).
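As a bare-bones illustration of the idea (not of the great_expectations API itself), the same expectations can be written as a plain pandas + pytest check; the file and column names here are hypothetical, and great_expectations layers documentation, logging, and alerting on top of exactly this kind of assertion.

```python
# A minimal sketch of expectation testing with plain pandas + pytest.
import pandas as pd


def test_orders_data_expectations():
    df = pd.read_csv("data/orders.csv", parse_dates=["start_date", "completion_date"])

    # Expectation: no nulls in the identifier column.
    assert df["order_id"].notnull().all()

    # Expectation: the completion date is never before the start date.
    assert (df["completion_date"] >= df["start_date"]).all()
```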
To move forward, you want to stay as close to the data as possible:

1. A common pattern is that there's a benchmark dataset with annotations (in academia) or an external annotation team (in industry). A lot of detailed information about that data can be extracted by simply looking at it.
2. One way for data knowledge to get internalized in the organization is for model developers to annotate data ad hoc at the start of the project (especially if you don't have the budget for an external annotation team).
3. However, if the model developers from the start of the project move on and more developers get onboarded, that knowledge is diluted. A better solution is an internal annotation team that has a regular information flow with the model developers.
4. The best practice ([recommended by Shreya Shankar](https://twitter.com/sh_reya/status/1521903046392877056)) is **to have a regular on-call rotation where model developers annotate data themselves**. Ideally, this is fresh data, so that all members of the team who are developing models know about the data and build intuition/expertise in it.
### 2.3 - Use Memorization Testing on Training

**Memorization is the simplest form of learning**. Deep neural networks are very good at memorizing data, so checking whether your model can memorize a very small fraction of the full dataset is a great smoke test for training. If a model can't memorize, then something is clearly very wrong!

Only really gross issues with training will show up in this test: for example, gradients that aren't calculated correctly, a numerical issue, or shuffled labels. Subtle bugs in your model or your data are not going to show up.
A way to catch smaller bugs is to include the training run time in what your tests check. It's a good way to detect whether smaller issues are making it harder for your model to learn. If the number of epochs it takes to reach an expected performance suddenly goes up, it may be due to a training bug. PyTorch Lightning has an *overfit_batches* feature that can help with this.
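Below is a rough sketch of such a memorization test using Lightning's *overfit_batches*; `MyModel` and `MyDataModule` stand in for your own LightningModule and DataModule, the test assumes the model logs a `train_loss` metric via `self.log`, and the loss threshold and epoch budget are arbitrary.

```python
# Sketch of a memorization ("overfit one batch") smoke test with PyTorch Lightning.
import pytorch_lightning as pl


def test_model_can_memorize_one_batch():
    model = MyModel()                    # placeholder LightningModule
    data = MyDataModule(batch_size=16)   # placeholder DataModule
    trainer = pl.Trainer(
        overfit_batches=1,               # repeatedly train on a single batch
        max_epochs=100,
        enable_checkpointing=False,
        logger=False,
    )
    trainer.fit(model, datamodule=data)
    final_loss = trainer.callback_metrics["train_loss"].item()
    assert final_loss < 0.01, "model failed to memorize a single batch"
```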
**Make sure to tune memorization tests to run quickly, so you can run them regularly**. If they are under 10 minutes or some similarly short threshold, they can be run on every PR or code change to better catch breaking changes. There are a couple of ways to speed these tests up.
Overall, these ideas lead to memorization tests that exercise model training on different time scales and allow you to mock out scenarios.
A solid, if expensive, idea for testing training is to **rerun old training jobs with new code**. It's not something that can be run frequently, but doing so can reveal unexpected changes in your training pipeline. The main drawback is the potential expense of running these tests. CI platforms like [CircleCI](https://circleci.com/) charge a great deal for GPUs, while others like GitHub Actions don't offer easy access to the relevant machines.

The best option for testing training is to **regularly run training with new data that's coming in from production**. This is still expensive, but it is directly tied to improvements in model development, not just testing for breakages. Setting this up requires **a data flywheel** similar to what we talked about in Lecture 1. The further tooling needed to achieve this will be discussed later in the course.
### 2.4 - Adapt Regression Testing for Models

**Models are effectively functions**. They have inputs and produce outputs like any other function in code. So, why not test them like functions with regression testing? For specific inputs, we can check whether the model consistently returns the same outputs. This is easiest with simpler models like classification models; it's harder to maintain such tests for more complex models. However, even in a more complex model scenario, regression testing can be useful for comparing changes from training to production.
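A minimal sketch of what such a regression test might look like for a classifier; the golden-examples file and the `load_model`/`predict` helpers are placeholders for your own code.

```python
# Sketch of a regression test: pin predictions on a small set of "golden" inputs
# and fail if they change. Paths and the load_model/predict helpers are placeholders.
import json

import pytest

GOLDEN_EXAMPLES = "tests/golden_examples.json"  # [{"input": ..., "expected_label": ...}, ...]


@pytest.fixture(scope="module")
def model():
    return load_model("artifacts/model.ckpt")  # placeholder loader


def test_golden_predictions_unchanged(model):
    with open(GOLDEN_EXAMPLES) as f:
        examples = json.load(f)
    for example in examples:
        predicted = predict(model, example["input"])  # placeholder inference helper
        assert predicted == example["expected_label"]
```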
A more sophisticated approach to testing ML models is to **use loss values and model metrics to build documented test suites out of your data**. Consider this similar to the [test-driven development](https://en.wikipedia.org/wiki/Test-driven_development) (TDD) paradigm. The test that is written before your code in TDD is akin to your model's loss: both represent the gap between where your code needs to be and where it is. Over time, as we improve the loss metric, our model gets closer to passing "the test" we've imposed on it. The gradient descent we use to improve the model can be considered a TDD approach to machine learning models!
While gradient descent is somewhat like TDD, it's not *exactly* the same, because simply reviewing metrics doesn't tell us how to resolve model failures (the way traditional software tests do).

To fill in this gap, **start by [looking at the data points that have the highest loss](https://arxiv.org/abs/1912.05283)**. Flag them for a test suite composed of "hard" examples. Doing this provides two advantages: it helps find where the model can be improved, and it can also help find errors in the data itself (i.e., poor labels).

As you examine these failures, you can aggregate types of failures into named suites. For example, in a self-driving car use case, you could have a "night time" suite and a "reflection" suite. **Building these test suites can be considered the machine learning version of regression testing**, where you take bugs that you've observed in production and add them to your test suite to make sure that they don't come up again.
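A rough sketch of the first step, surfacing the highest-loss validation examples for review, assuming a PyTorch classifier and a validation dataloader; everything here is illustrative.

```python
# Sketch: surface the highest-loss validation examples so they can be reviewed
# and folded into named "hard example" test suites.
import torch
import torch.nn.functional as F


@torch.no_grad()
def hardest_examples(model, dataloader, k=50):
    model.eval()
    losses, indices = [], []
    offset = 0
    for inputs, labels in dataloader:
        logits = model(inputs)
        # Per-example loss, not the batch mean.
        batch_losses = F.cross_entropy(logits, labels, reduction="none")
        losses.append(batch_losses)
        indices.append(torch.arange(offset, offset + len(labels)))
        offset += len(labels)
    losses = torch.cat(losses)
    indices = torch.cat(indices)
    top = torch.topk(losses, k=min(k, len(losses)))
    # Dataset indices of the k highest-loss examples, for manual review.
    return indices[top.indices].tolist()
```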
The method can be quite manual, but there are some options for speeding it up. Partnering with the annotation team at your company can help make developing these tests a lot faster. Another approach is to use a method called [Domino](https://arxiv.org/abs/2203.14960) that uses foundation models to find errors. Additionally, for testing NLP models, use the [CheckList](https://arxiv.org/abs/2005.04118) approach.
### 2.5 - Test in Production, But Don't YOLO

It's crucial to test in true production settings. This is especially true for machine learning models, because data is an important component of both the production and the development environments, and it's difficult to ensure the two are very close to one another. **The best way to deal with the difference between development and production is to test in production**.

Testing in production isn't sufficient on its own. Rather, testing in production allows us to develop the tooling and infrastructure needed to resolve production errors quickly (errors which are often quite expensive). It reduces pressure on other kinds of testing, but does not replace them.
We will cover the tooling needed for production monitoring and continual learning of ML systems in detail in a future lecture.

### 2.6 - ML Test Score

So far, we have discussed writing "smoke" tests for ML: expectation tests for data, memorization tests for training, and regression tests for models.

**As your codebase and team mature, adopt a more full-fledged approach to testing ML systems, like the one identified in the [ML Test Score](https://research.google/pubs/pub46555/) paper**. The ML Test Score is a rubric that evolved out of machine learning efforts at Google. It's a strict rubric for ML test quality that covers data, models, training, infrastructure, and production monitoring. It overlaps with, but goes beyond, some of the recommendations we've offered.
It's rather expensive, but worth it for high-stakes use cases that need to be really well-engineered! To be clear, this rubric is *really* strict: even the Text Recognizer system we've designed so far misses a few categories. Use the ML Test Score as inspiration to develop the right testing approach for your team's resources and needs.
## 3 - Troubleshooting Models

**Tests help us figure out that something is wrong, but troubleshooting is required to actually fix broken ML systems**. Models often require the most troubleshooting, and in this section we'll cover a three-step approach to troubleshooting them:

1. "Make it run" by avoiding common errors.
2. "Make it fast" by profiling and removing bottlenecks.
3. "Make it right" by scaling the model/data and sticking with proven architectures.
### 3.1 - Make It Run

This is the easiest step for models; only a small portion of bugs cause the kind of loud failures that prevent a model from running at all. Watch out for these bugs in advance and save yourself the trouble of models that don't run.

The first type of bug that prevents models from running at all is **shape errors.** When tensor shapes don't match the operations run on them, models can't be trained or run. Prevent these errors by keeping notes on the expected sizes of tensors, annotating the sizes in the code, and even stepping through your model code with a debugger to check tensor sizes as you go.
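A small illustrative example of shape annotations plus assertions in a forward pass; the module and its dimensions are made up.

```python
# Sketch: annotate and assert expected tensor shapes in a forward pass.
import torch
from torch import nn


class TinyClassifier(nn.Module):
    def __init__(self, in_dim=784, hidden=128, n_classes=10):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, n_classes)

    def forward(self, x):
        # x: (batch, in_dim)
        assert x.ndim == 2 and x.shape[1] == self.fc1.in_features, x.shape
        h = torch.relu(self.fc1(x))   # (batch, hidden)
        logits = self.fc2(h)          # (batch, n_classes)
        assert logits.shape == (x.shape[0], self.fc2.out_features), logits.shape
        return logits
```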
The second type is **out-of-memory errors**. These occur when you try to push a tensor to the GPU that is too large to fit. PyTorch Lightning has good tools to prevent this. Make sure you're using the lowest precision your training can tolerate; a good default is 16-bit precision. Another common cause is trying to run a model on too much data or with too large a batch size. Use the automatic batch size scaling feature in PyTorch Lightning to pick the right batch size, and use gradient accumulation if that batch size gets too small. If neither of these options works, you can look into manual techniques like tensor parallelism and gradient checkpointing.
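A sketch of those knobs in PyTorch Lightning; argument names have shifted across Lightning versions, so treat this as illustrative rather than exact, and `MyModel`/`MyDataModule` are placeholders.

```python
# Sketch of the memory-saving knobs mentioned above, using PyTorch Lightning.
import pytorch_lightning as pl

model = MyModel()                    # placeholder LightningModule
data = MyDataModule(batch_size=64)   # placeholder DataModule

trainer = pl.Trainer(
    precision=16,                        # lowest precision training tolerates
    accumulate_grad_batches=4,           # simulate a larger batch if memory forces a small one
    auto_scale_batch_size="binsearch",   # search for the largest batch size that fits
)
trainer.tune(model, datamodule=data)     # runs the batch size finder
trainer.fit(model, datamodule=data)
```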
**Numerical errors** also cause machine learning failures. This is when NaNs or infinite values show up in tensors. These issues most commonly appear first in the gradient and then cascade through the model. PyTorch Lightning has a good tool for tracking and logging gradient norms. A good way to check whether these issues are caused by precision problems is to switch to 64-bit (double-precision) floats and see if they go away. Normalization layers tend to cause these issues, generally speaking, so watch out for how you do normalization!
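A sketch of cheap numerical-health checks for a training step: PyTorch's anomaly detection plus explicit finiteness checks on the loss and gradients; `model`, `optimizer`, `loss_fn`, and the data are placeholders.

```python
# Sketch of numerical-health checks in a plain PyTorch training step.
import torch

torch.autograd.set_detect_anomaly(True)  # pinpoints the op that produced a NaN/inf


def training_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    assert torch.isfinite(loss), "loss is NaN or inf"
    loss.backward()
    # Check every gradient for NaNs/infs before taking a step.
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            raise RuntimeError(f"non-finite gradient in {name}")
    optimizer.step()
    return loss.item()
```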
### 3.2 - Make It Fast
Once you can run a model, you'll want it to run fast. This can be tricky because the performance of DNN training code is very counterintuitive. For example, transformers can actually spend more time in the MLP layers than in the attention layers. Similarly, seemingly trivial components like data loading can soak up a surprising amount of time.

To solve these issues, the primary approach is to **roll up your sleeves and profile your code**. You can often find pretty easy Python changes that yield big results. Read these two tutorials by [Charles](https://wandb.ai/wandb/trace/reports/A-Public-Dissection-of-a-PyTorch-Training-Step--Vmlldzo5MDE3NjU?galleryTag=&utm_source=fully_connected&utm_medium=blog&utm_campaign=using+the+pytorch+profiler+with+w%26b) and [Horace](https://horace.io/brrr_intro.html) for more details.
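As a starting point, here is a sketch of profiling a handful of training steps with `torch.profiler`; `model`, `optimizer`, `loss_fn`, and `dataloader` are placeholders.

```python
# Sketch: profile a few training steps to see where time actually goes
# (data loading, CPU ops, GPU kernels).
import torch
from torch.profiler import ProfilerActivity, profile


def profile_training(model, optimizer, loss_fn, dataloader, steps=10):
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof:
        for step, (inputs, targets) in enumerate(dataloader):
            if step >= steps:
                break
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()
    sort_key = "cuda_time_total" if torch.cuda.is_available() else "cpu_time_total"
    print(prof.key_averages().table(sort_by=sort_key, row_limit=20))
```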
### 3.3 - Make It Right

After you make the model run fast, make it right. Unlike traditional software, machine learning models are never truly perfect; production performance is never perfect. As such, it might be more appropriate to say "make it as right as needed."

Knowing this, making the model run, and run fast, allows us to make the model right by applying **scale**. To achieve performance benefits, scaling up the model or its data is generally a fruitful and achievable route, and it's a lot easier to scale a fast model. [Research from OpenAI and other institutions](https://arxiv.org/abs/2001.08361) shows that the benefits of scale can be rigorously measured and predicted across compute budget, dataset size, and parameter count.
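The scaling laws in that line of work take roughly the following power-law form, where N is parameter count, D is dataset size, C is compute, and the constants and exponents are empirical fits (the exact values depend on the setup, so none are given here):

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```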
If you can't afford to scale yourself, consider finetuning a model trained at scale for your task.

So far, all of the advice given has been model- and task-agnostic. Anything more detailed has to be specific to the model and the relevant task. Stick close to working architectures and hyperparameters from places like HuggingFace, and try not to reinvent the wheel!
## 4 - Resources

Here are some helpful resources that discuss this topic.

### Tweeters

1. [Julia Evans](https://twitter.com/b0rk)
2. [Charity Majors](https://twitter.com/mipsytipsy)
3. [Nelson Elhage](https://twitter.com/nelhage)
4. [kipply](https://twitter.com/kipperrii)
5. [Horace He](https://twitter.com/cHHillee)
6. [Andrej Karpathy](https://twitter.com/karpathy)
7. [Chip Huyen](https://twitter.com/chipro)
8. [Jeremy Howard](https://twitter.com/jeremyphoward)
9. [Ross Wightman](https://twitter.com/wightmanr)

### Templates

1. [Lightning Hydra Template](https://github.com/ashleve/lightning-hydra-template)
2. [NN Template](https://github.com/grok-ai/nn-template)
3. [Generic Deep Learning Project Template](https://github.com/sudomaze/deep-learning-project-template)

### Texts

1. [Reliable ML Systems talk](https://www.usenix.org/conference/opml20/presentation/papasian)
2. ["ML Test Score" paper](https://research.google/pubs/pub46555/)
3. ["Attack of the Cosmic Rays!"](https://blogs.oracle.com/linux/post/attack-of-the-cosmic-rays)
4. ["Computers can be understood"](https://blog.nelhage.com/post/computers-can-be-understood/)
5. ["Systems that defy detailed understanding"](https://blog.nelhage.com/post/systems-that-defy-understanding/)
6. [Testing section from MadeWithML course on MLOps](https://madewithml.com/courses/mlops/testing/)