00ber committed on
Commit
9d79f8e
·
1 Parent(s): f61d311

fixed readme

  1. README.md +404 -1
README.md CHANGED
@@ -1 +1,404 @@
1
- Weather forecasting using machine learning
1
+
2
+ # Using deep learning to predict the temperature of the next 24 hours at the Ronald Reagan National Airport
3
+
4
+ ## **Abstract**
5
+
6
+ Weather forecasts are an integral part of our day-to-day lives. They
7
+ help us plan ahead and be prepared for the upcoming hours, days and even
8
+ weeks. We use the weather apps on our phones to check tomorrow’s
9
+ temperature or the chances of rain in order to dress appropriately or to
10
+ make sure we take our umbrellas with us. These are all weather forecasts
11
+ that we use regularly without giving a second thought.
12
+ As they are such an important part of our lives, the accuracy of these
13
+ forecasts is very important. Not all weather forecasts are made equal
14
+ as meteorologists use a variety of approaches and a wide range of data
15
+ to make predictions.
16
+ One emerging approach in the field of weather forecasting is the use of
17
+ machine learning to make predictions. The abundance of data being
18
+ collected these days and the increasing advancements in machine learning
19
+ algorithms make this a task that machine learning is well suited
20
+ for.  
21
+ This project is an attempt at using machine learning (deep learning in
22
+ particular) to make hourly weather forecasts for each day in Washington,
23
+ D.C.
24
+
25
+ ## **Problem Definition and Algorithm**
26
+
27
+ ### **Task Definition:**
28
+
29
+ Using historical hourly weather data of the previous 7 days, predict the
30
+ temperature for the next 24 hours (12am to next day’s 12am).
31
+
32
+ ### **Source of Data**
33
+
34
+ **URL:**
35
+ [Wunderground](https://www.wunderground.com/history/daily/us/va/arlington/KDCA/date/2022-12-12)
36
+ Under the hood, Wunderground sends an HTTP request to
37
+ **https://api.weather.com** which returns historical hourly data for a
38
+ particular date and location.
39
+ The format of the data is as follows:
40
+
41
+ | **Field** | **Description** | **Example** |
42
+ | :-------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :--------------------------------------------- |
43
+ | key | Observation weather station ID | KDCA |
44
+ | class | Type of data | observation |
45
+ | expire\_time\_gmt | Expiration time in UNIX seconds | 1669881120 |
46
+ | obs\_id | Observation weather station ID | KDCA |
47
+ | obs\_name | | Washington/Natl |
48
+ | valid\_time\_gmt | Valid time in UNIX seconds. This is the date and time that the observation was made | 1669873920 |
49
+ | day\_ind | Time of day of the observation. D = Day N = Night | N |
50
+ | temp | The observed temperature | 42 |
51
+ | wx\_icon | The two-digit number to represent the observed weather conditions. | 33 |
52
+ | icon\_extd | Code representing explicit full set sensible weather | 3300 |
53
+ | wx\_phrase | A text description of the observed weather conditions at the reporting station | |
54
+ | pressure\_tend              | The change in the barometric pressure reading over the last hour expressed as an integer. 0 = Steady, 1 = Rising or Rapidly Rising, 2 = Falling or Rapidly Falling | 0 |
55
+ | pressure\_desc | A phrase describing the change in the barometric pressure reading over the last hour. (Steady, Rising, Rapidly Rising, Falling, Rapidly Falling) | Steady |
56
+ | dewPt | The temperature which air must be cooled at constant pressure to reach saturation. The Dew Point is also an indirect measure of the humidity of the air. The Dew Point will never exceed the Temperature. When the Dew Point and Temperature are equal, clouds or fog will typically form. The closer the values of Temperature and Dew Point, the higher the relative humidity. | 60 |
57
+ | heat\_index | An apparent temperature. It represents what the air temperature “feels like” on exposed human skin due to the combined effect of warm temperatures and high humidity. When the temperature is 70°F or higher, the Feels Like value represents the computed Heat Index. For temperatures between 40°F and 70°F, the Feels Like value and Temperature are the same, regardless of wind speed and humidity, so use the Temperature value. | 70 |
58
+ | rh | The relative humidity of the air, which is defined as the ratio of the amount of water vapor in the air to the amount of vapor required to bring the air to saturation at a constant temperature. Relative humidity is always expressed as a percentage. | 91 |
59
+ | pressure | Barometric pressure is the pressure exerted by the atmosphere at the earth’s surface, due to the weight of the air. This value is read directly from an instrument called a mercury barometer and its units are expressed in millibars (equivalent to HectoPascals). | 30.06 |
60
+ | vis | The horizontal visibility at the observation point. Visibilities can be reported as fractional values particularly when visibility is less than 2 miles. Visibilities greater than 10 statute miles(16.1 kilometers) which are considered “unlimited” are reported as “999” in your feed. You can also find visibility values that equal zero. This occurrence is not wrong. Dense fogs and heavy snows can produce values near zero. Fog, smoke, heavy rain and other weather phenomena can reduce visibility to near zero miles or kilometers. | 10 |
61
+ | wc | An apparent temperature. It represents what the air temperature “feels like” on exposed human skin due to the combined effect of the cold temperatures and wind speed. When the temperature is 61°F or lower the Feels Like value represents the computed Wind Chill so display the Wind Chill value. For temperatures between 61°F and 75°F, the Feels Like value and Temperature are the same, regardless of wind speed and humidity, so display the Temperature value. | \-25 |
62
+ | wdir | The direction from which the wind blows expressed in degrees. The magnetic direction varies from 1 to 360 degrees, where 360° indicates the North, 90° the East, 180° the South, 270° the West, and so forth. A ‘null’ value represents no determinable wind direction. | 45 |
63
+ | wdir\_cardinal | This field contains the cardinal direction from which the wind blows in an abbreviated form. Wind directions are always expressed as “from whence the wind blows” meaning that a North wind blows from North to South. If you face North in a North wind, the wind is at your face. Face southward and the North wind is at your back. (N , NNE , NE, ENE, E, ESE, SE, SSE, S, SSW, SW, WSW, W, WNW, NW, NNW, CALM, VAR) | WSW |
64
+ | gust | Wind gust speed. This data field contains information about sudden and temporary variations of the average Wind Speed. The report always shows the maximum wind gust speed recorded during the observation period. It is a required display field if Wind Speed is shown. The speed of the gust can be expressed in miles per hour or kilometers per hour. | 35 |
65
+ | wspd | Wind Speed. The wind is treated as a vector; hence, winds must have direction and magnitude (speed). The wind information reported in the hourly current conditions corresponds to a 10-minute average called the sustained wind speed. Sudden or brief variations in the wind speed are known as “wind gusts” and are reported in a separate data field. Wind directions are always expressed as "from whence the wind blows" meaning that a North wind blows from North to South. If you face North in a North wind the wind is at your face. Face southward and the North wind is at your back. | 15 |
66
+ | max\_temp | High temperature in the last 24 hours | 81 |
67
+ | min\_temp | Low temperature in the last 24 hours | 48 |
68
+ | precip\_total | Precipitation amount in the last rolling 24 hour period | 0.3 |
69
+ | precip\_hourly | Precipitation for the last hour | 0.5 |
70
+ | snow\_hourly | Snow increasing rapidly in inches or centimeters per hour depending on whether the snowfall is reported by METAR or TECCI (synthetic observations). METAR snow accumulation is in inches and TECCI is in centimeters | 1 |
71
+ | uv\_desc | Ultraviolet index description (Extreme, High, Low, Minimal, Moderate, No Report, Not Available) | High |
72
+ | feels\_like | An apparent temperature. It represents what the air temperature “feels like” on exposed human skin due to the combined effect of the wind chill or heat index. | 60 |
73
+ | uv\_index | Ultraviolet index (0 to 11 and 999) | 7 |
74
+ | qualifier | Weather description qualifier code | QQ0063 |
75
+ | qualifier\_svrty | Weather description qualifier severity (1 to 6) | 1 |
76
+ | blunt\_phrase | Weather description qualifier short phrase | Warmer than yesterday. |
77
+ | terse\_phrase | Weather description qualifier terse phrase | Dangerous wind chills. Limit outdoor exposure. |
78
+ | clds | Cloud cover description code (SKC, CLR, SCT, FEW, BKN, OVC) | SKC |
79
+ | water\_temp | Water temperature | 80 |
80
+ | primary\_wave\_period | Primary wave period | 13 |
81
+ | primary\_wave\_height | Primary wave height | 3.28 |
82
+ | primary\_swell\_period | Primary swell period | 13 |
83
+ | primary\_swell\_height | Primary swell height | 1.64 |
84
+ | primary\_swell\_direction | Primary swell direction | 190 |
85
+ | secondary\_swell\_period | Secondary swell period | null |
86
+ | secondary\_swell\_height | Secondary swell height | null |
87
+ | secondary\_swell\_direction | Secondary swell direction | null |
88
+
89
+ ### **Choice of algorithm**
90
+
91
+ Since this is a time-series forecasting problem, a Long Short-Term
92
+ Memory (LSTM) neural network was used to build the model. For the
93
+ look-back period, 7 days (168 hours) was chosen. Since predictions
94
+ need to be made for the next 24 hours, a multi-step (24
95
+ steps) model was trained.  
96
+ A vanilla Recurrent Neural Network (RNN) only has a short-term memory
97
+ because it suffers from the vanishing gradient problem. During
98
+ backpropagation through time in a vanilla RNN, the gradients flowing back
99
+ to earlier timesteps shrink exponentially, so early inputs contribute
100
+ very little to learning. An LSTM is much
101
+ more robust to vanishing gradients and can remember information from
102
+ earlier inputs much better than a vanilla RNN.  
103
+ My assumption for the prediction problem is that temperature for a
104
+ particular day not only depends on the day before, but on the weather
105
+ conditions for a longer timespan (the entire past week). For this
106
+ reason, LSTM was chosen as it remembers hidden states from the earlier
107
+ timesteps better.
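To make the vanishing-gradient argument concrete, here is a minimal sketch (not taken from the project code) of a hypothetical one-unit vanilla RNN, `h_t = tanh(w * h_{t-1} + x_t)`. The recurrent weight `0.9` and the constant input `0.1` are illustrative assumptions. The gradient of the last hidden state with respect to the initial one is a product of per-step factors, each smaller than 1 in magnitude, so it collapses toward zero over a one-week look-back.

```python
import math


def gradient_through_time(w, inputs):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = 0.0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t ** 2)
        grad *= w * (1.0 - h * h)
    return grad


# Illustrative values only: recurrent weight 0.9, constant input 0.1.
grad_day = gradient_through_time(0.9, [0.1] * 24)    # one day of hourly steps
grad_week = gradient_through_time(0.9, [0.1] * 168)  # one week of hourly steps
print(f"24 steps: {grad_day:.3e}, 168 steps: {grad_week:.3e}")
```

An LSTM mitigates this with a gated cell state that is updated additively, so the signal from early timesteps is not forced through the same shrinking product.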
108
+
109
+ ![image](./docs/lstm.png)
110
+ Fig: A simplistic LSTM representation of the modeling task
111
+
112
+ ## **Exploratory Data Analysis**
113
+
114
+ 22 years' worth of historical hourly weather data from 2000-01-01 to
115
+ 2022-12-06 for the Ronald Reagan National Airport was used for training
116
+ the model. This amounted to a total of **247349** rows of weather data
117
+ records.
118
+ The first step was to figure out which columns were actually usable in
119
+ my dataset. For that, I first loaded the data into a pandas DataFrame
120
+ and checked the NaN counts for each column. I found that almost half of
121
+ the columns were mostly NaN.
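The NaN check described above can be sketched roughly as follows. The sample values and the 50% cut-off are assumptions made for illustration; the README does not state the threshold that was actually used.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; the real dataset has far more rows and columns.
df = pd.DataFrame(
    {
        "temp": [42, 43, np.nan, 45],
        "pressure": [30.06, 30.05, 30.04, 30.02],
        "water_temp": [np.nan, np.nan, np.nan, np.nan],
        "gust": [np.nan, 35, np.nan, np.nan],
    }
)

# Number of missing values in each column.
nan_counts = df.isna().sum()

# Keep columns whose missing fraction is at or below the assumed 50% cut-off.
max_missing_fraction = 0.5
usable_columns = [
    column for column in df.columns
    if df[column].isna().mean() <= max_missing_fraction
]
print(nan_counts)
print(usable_columns)
```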
122
+
123
+ ![image](./docs/nan-counts.png)
124
+
125
+ Using this information, I decided to keep only the following columns
126
+ (ones that are almost always recorded during observations):
127
+
128
+
129
+ - temp
130
+
131
+ - valid\_time\_gmt
132
+
133
+ - pressure
134
+
135
+ - wspd
136
+
137
+ - heat\_index
138
+
139
+ - dewPt
140
+
141
+ - rh
142
+
143
+ - vis
144
+
145
+ - wc
146
+
147
+ - clds
148
+
149
+ - wdir\_cardinal
150
+
151
+ The next step was to check the columns and see how they were correlated.
152
+ Just by the column definitions, I knew that heat\_index and wc would be
153
+ highly correlated to the temperature, but had no idea about how the
154
+ other columns were related. Using seaborn to plot the correlations, I
155
+ got the following graph:
156
+
157
+ ![image](./docs/correlations.png)
158
+ Fig: Correlation among the columns
159
+
160
+ My main takeaways were that dewPt is highly correlated with
161
+ the temperature, relative humidity (rh) is highly correlated with the
162
+ visibility (vis), and there is also a slight correlation between the
163
+ pressure and the temperature.
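A correlation matrix like the one plotted above can be computed with pandas before handing it to seaborn. This is only a sketch on made-up values, not the project's notebook code.

```python
import pandas as pd

# Made-up sample values, chosen only to illustrate the call.
df = pd.DataFrame(
    {
        "temp": [30, 40, 50, 60, 70],
        "dewPt": [20, 31, 39, 52, 60],
        "rh": [66, 70, 66, 75, 71],
        "clds": ["CLR", "FEW", "SCT", "BKN", "OVC"],
    }
)

# Pearson correlation between every pair of numeric columns;
# the categorical clds column is left out.
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))

# With seaborn installed, a heatmap of this matrix could be drawn with
# seaborn.heatmap(corr, annot=True).
```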
164
+
165
+ ## **Data Preprocessing**
166
+
167
+ Since I wanted to convert the data into evenly spaced hourly records, the
168
+ first step was to convert the UNIX timestamps to human-readable dates.
169
+ This was easily done with Python's datetime module.
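As a small example of this conversion, the `valid_time_gmt` sample from the field table can be turned into a timezone-aware datetime. Interpreting the value in UTC is an assumption based on the `_gmt` suffix of the field name.

```python
from datetime import datetime, timezone

valid_time_gmt = 1669873920  # example value from the field table

# Interpret the UNIX seconds as a UTC timestamp.
observed_at = datetime.fromtimestamp(valid_time_gmt, tz=timezone.utc)
print(observed_at.isoformat())  # 2022-12-01T05:52:00+00:00
```

The sample observation falls on the 52nd minute of the hour.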
170
+
171
+ After converting the UNIX timestamps into datetime objects, I found that
172
+ the observations were not made at uniform times. As can be
173
+ seen from the following graph, the observations are not all taken
174
+ at the same minute of the hour.
175
+
176
+ ![image](./docs/time-of-hour.png)
177
+ Fig: Graph showing number of observations made at specific minutes of an
178
+ hour
179
+
180
+ On further analysis, it became clear that while the observations in the
181
+ first few years of the timespan (2000–2010) were not
182
+ made at regular intervals, those in the more recent years were.
183
+ In particular, as evident in the graph above, almost all hourly records had
184
+ data for the 52<sup>nd</sup> minute. Thus, I decided to use the
185
+ 52<sup>nd</sup> minute observations as the data for modeling.
186
+
187
+ Despite there being a lot of 52<sup>nd</sup> minute data in the dataset,
188
+ I had many missing observations (missing a few hours in some days).
189
+ Since my LSTM model uses each of the previous 168 hours of data as input to
190
+ make predictions, missing data would cause inaccuracies in my model. To
191
+ fix this, I used the following interpolations and backfill/forwardfill
192
+ to fill in the missing rows and created a uniformly spaced timeseries
193
+ dataframe for training:
194
+
195
+ | **Field** | **Fill Type** | **Parameters** |
196
+ | :------------- | :------------ | :---------------------------------------------------------------------- |
197
+ | temp | Interpolation | Polynomial, order=2 |
198
+ | heat\_index | Interpolation | Polynomial, order=2 |
199
+ | pressure | Interpolation | Polynomial, order=2 |
200
+ | wspd | Interpolation | Polynomial, order=2 |
201
+ | dewPt | Interpolation | Polynomial, order=2 |
202
+ | rh | Interpolation | Polynomial, order=2 |
203
+ | wc | Interpolation | Polynomial, order=2 |
204
+ | wdir\_cardinal | Backfill | |
205
+ | vis | Backfill | |
206
+ | clds | Interpolation | Linear (Transformed categorical to ordinal and performed interpolation) |
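The filtering and filling steps can be sketched with pandas as below. The sample rows are made up, and only two of the columns are shown. Second-order polynomial interpolation in pandas relies on SciPy, so the linear fallback here is a stand-in for environments without it, not the method listed in the table.

```python
import numpy as np
import pandas as pd

# Made-up observations: one off-schedule row (01:15) and no row for 02:52.
index = pd.to_datetime(
    [
        "2022-12-01 00:52",
        "2022-12-01 01:15",
        "2022-12-01 01:52",
        "2022-12-01 03:52",
        "2022-12-01 04:52",
    ]
)
df = pd.DataFrame(
    {
        "temp": [40.0, 41.0, 42.0, 46.0, 48.0],
        "wdir_cardinal": ["N", "N", "N", "NE", "NE"],
    },
    index=index,
)

# Keep only the observations taken at the 52nd minute of each hour.
df = df[df.index.minute == 52]

# Reindex onto a gap-free hourly grid; missing hours become NaN rows.
hourly_index = pd.date_range(
    df.index.min(), df.index.max(), freq=pd.Timedelta(hours=1)
)
df = df.reindex(hourly_index)

# Numeric column: second-order polynomial interpolation (needs SciPy).
try:
    df["temp"] = df["temp"].interpolate(method="polynomial", order=2)
except ImportError:
    df["temp"] = df["temp"].interpolate(method="linear")

# Categorical column: copy the next available observation backwards.
df["wdir_cardinal"] = df["wdir_cardinal"].bfill()
print(df)
```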
207
+
208
+ ![image](./docs/interpolation.png)
209
+ Fig: An example of interpolation performed to fill missing
210
+ 52<sup>nd</sup> minute temperature values
211
+
212
+ An important part of predicting weather is recognizing that weather conditions
213
+ are cyclical in nature, i.e. weather conditions in January of one year
214
+ are similar to weather conditions in January of the next year, and weather
215
+ conditions at 1 am today are close to weather conditions at
216
+ 1 am on other days. Directly incorporating the raw timestamp as a feature
217
+ would lose this cyclical information. So, I transformed the timestamps
218
+ (hour of day, day of year) into sine and cosine waves to preserve the
219
+ cyclical information.
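A minimal sketch of this encoding, using hypothetical helper names: each periodic value is mapped to a point on the unit circle, so that 23:00 and 00:00 end up next to each other instead of 23 units apart. The same helper would apply to day of year with a period of roughly 365 days.

```python
import math


def encode_cyclical(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# 23:00 and 00:00 are 23 apart as raw hours but adjacent on the circle.
late = encode_cyclical(23, 24)
midnight = encode_cyclical(0, 24)
noon = encode_cyclical(12, 24)
print(distance(late, midnight), distance(noon, midnight))
```

Both the sine and the cosine are needed: with only one of them, two different hours of the day would map to the same value.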
220
+
221
+ ![image](./docs/time-of-day.png)
222
+ Fig: Hour of day encoded as sine/cosine wave
223
+
224
+ ![image](./docs/day-of-year.png)
225
+ Fig: Day of year encoded as sine/cosine wave
226
+
227
+ With this, the dataset was ready and I moved on to the training portion
228
+ of the project.
229
+
230
+ ## **Training**
231
+
232
+ The dataset was split into train, validation and test sets in a 70:20:10 ratio. A
233
+ StandardScaler was fit on the training set, and the train,
234
+ validation and test sets were all standardized with this scaler.
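The split and scaling can be sketched with NumPy as follows. Splitting chronologically (rather than shuffling) is an assumption, since the README does not say how the rows were assigned. The mean and population standard deviation computed here are the same statistics that scikit-learn's `StandardScaler` uses by default.

```python
import numpy as np


def split_and_standardize(data, train_frac=0.7, val_frac=0.2):
    """Split rows chronologically, then scale with training statistics only."""
    n = len(data)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    train, val, test = data[:train_end], data[train_end:val_end], data[val_end:]

    # Per-column mean and population standard deviation of the training rows.
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    return (train - mean) / std, (val - mean) / std, (test - mean) / std


# Made-up data: 100 hourly rows of 2 features.
rng = np.random.default_rng(0)
data = rng.normal(loc=[60.0, 30.0], scale=[15.0, 0.2], size=(100, 2))
train, val, test = split_and_standardize(data)
print(train.shape, val.shape, test.shape)
```

Fitting the statistics on the training rows alone keeps information from the validation and test periods out of the model's inputs.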
235
+
236
+ The most crucial step for my model was to generate the sequence of
237
+ inputs to feed into my LSTM layer. Since my approach was to use the last
238
+ 168 hours of weather data as the lookback for the model, generating the
239
+ sequence of inputs was a little confusing at first. But thanks to [this
240
+ excellent TensorFlow
241
+ tutorial](https://www.tensorflow.org/tutorials/structured_data/time_series#data_windowing),
242
+ I was able to set it up just right.
243
+ After generating the sequences, my training data looked like the
244
+ following:
245
+
246
+ ![image](./docs/training-data.png)
247
+ Fig: Training Data after sequencing
248
+
249
+ From the above figure, we can see that for each sample in our training
250
+ data, the past 168 hours of data have been correctly set up as input
251
+ features and the next 24 hours have been set aside as labels for
252
+ training.
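The windowing can be illustrated with a plain NumPy sketch. This is a simplified stand-in for the window generator in the TensorFlow tutorial, and `make_windows` is a hypothetical helper name. Each sample pairs 168 consecutive hourly rows with the 24 temperature values that follow them.

```python
import numpy as np


def make_windows(features, target, lookback=168, horizon=24):
    """Slice a time series into (input window, label window) pairs.

    features: array of shape (time, n_features)
    target:   array of shape (time,), e.g. the temperature column
    """
    inputs, labels = [], []
    last_start = len(features) - lookback - horizon
    for start in range(last_start + 1):
        split = start + lookback
        inputs.append(features[start:split])
        labels.append(target[split:split + horizon])
    return np.array(inputs), np.array(labels)


# Made-up series: 200 hourly rows with 3 features; column 0 plays "temp".
features = np.arange(200 * 3, dtype=float).reshape(200, 3)
X, y = make_windows(features, features[:, 0])
print(X.shape, y.shape)  # (9, 168, 3) (9, 24)
```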
253
+
254
+ I experimented with stacked LSTM, adding more dense layers, varying LSTM
255
+ unit counts, and so on. But considering the training time constraints
256
+ and performance, I found the following model to give the best results
257
+ for me:
258
+
259
+ | **Layer (type)** | **Output Shape** | **Param \#** |
260
+ | :------------------- | :--------------- | :----------- |
261
+ | lstm\_2 (LSTM) | (None, 12) | 1248 |
262
+ | dense\_2 (Dense) | (None, 24) | 312 |
263
+ | reshape\_2 (Reshape) | (None, 24, 1) | 0 |
264
+
265
+ | |
266
+ | :-------------------------- |
267
+ | **Total params**: 1,560 |
268
+ | **Trainable params**: 1,560 |
269
+ | **Non-trainable params**: 0 |
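The parameter counts in the summary can be checked by hand. An LSTM layer with \(u\) units and \(d\) input features has four gates, each with a \(u \times d\) input kernel, a \(u \times u\) recurrent kernel and a bias of length \(u\). The feature count \(d\) is not stated in this README; \(d = 13\) is inferred here as the value consistent with the reported 1248 parameters for \(u = 12\).

```latex
\begin{aligned}
P_{\text{LSTM}}  &= 4\,\bigl(u\,(d + u) + u\bigr) = 4\,\bigl(12\,(13 + 12) + 12\bigr) = 1248 \\
P_{\text{Dense}} &= 12 \cdot 24 + 24 = 312 \\
P_{\text{total}} &= 1248 + 312 = 1560
\end{aligned}
```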
270
+
271
+ **Framework:** TensorFlow, Keras  
272
+ **Choice of optimizer:** Adam with learning rate of 1e-3, default
273
+ decay
274
+ **Choice of loss function:** Mean Absolute Error
275
+ I chose MAE as my loss because it is easy to interpret and the model
276
+ converged well with it for my project.
277
+
278
+ **Max Epochs:** 100 (I used Early Stopping with a patience of 5, so
279
+ that if the validation loss didn’t go down for 5 straight epochs, the
280
+ model would stop training further)
281
+
282
+ The model was configured to save a checkpoint after every epoch in which the validation loss
283
+ improved on the previous best.
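A Keras setup matching the summary table and the training settings above might look roughly like this. It is a sketch, not the project's actual code: the input feature count of 13 is inferred from the reported parameter count, and the checkpoint file name, `train_ds` and `val_ds` are hypothetical.

```python
import tensorflow as tf

LOOKBACK, HORIZON = 168, 24
N_FEATURES = 13  # assumption inferred from the 1248 LSTM parameters

model = tf.keras.Sequential(
    [
        tf.keras.Input(shape=(LOOKBACK, N_FEATURES)),
        tf.keras.layers.LSTM(12),
        tf.keras.layers.Dense(HORIZON),
        tf.keras.layers.Reshape((HORIZON, 1)),
    ]
)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="mae",
)

callbacks = [
    # Stop once validation loss has not improved for 5 consecutive epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5),
    # Overwrite the checkpoint only when validation loss reaches a new best.
    tf.keras.callbacks.ModelCheckpoint(
        "best_model.keras", monitor="val_loss", save_best_only=True
    ),
]

# Hypothetical training call; train_ds and val_ds are not defined here.
# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=callbacks)
print(model.count_params())
```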
284
+
285
+ The best model had a validation loss of 3.5005.
286
+ Although the Mean Absolute Error gave an idea of how accurate my model
287
+ was, I found it useful to compare it to a baseline model. Thus, for
288
+ evaluation of my model, I created the following baseline models and
289
+ trained them on the same dataset:
290
+
291
+ ![image](./docs/base-1.png)
292
+ Fig: Baseline model that predicts a constant temperature no matter what
293
+ the input.
294
+
295
+ ![image](./docs/base-2.png)
296
+ Fig: Baseline model that just repeats the previous day’s temperatures as
297
+ predictions
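The two baselines can be approximated without any training framework. In this sketch the constant baseline uses the median of the training labels, which is the constant that minimizes MAE; that choice is an assumption, since the README does not say which constant the trained baseline converged to. The data is made up.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


def constant_baseline(train_labels, n_samples, horizon=24):
    """Predict one fixed temperature everywhere (assumed: the training median)."""
    return np.full((n_samples, horizon), np.median(train_labels))


def repeat_baseline(input_temps, horizon=24):
    """Predict that the next day repeats the last `horizon` observed hours."""
    return np.asarray(input_temps)[:, -horizon:]


# Made-up data: 2 samples, 48 observed hours each, 24 label hours each.
observed = np.array([np.arange(48.0), np.arange(48.0) + 10.0])
labels = observed[:, -24:] + 1.0  # next day is 1 degree warmer than the last

# For illustration only, the same labels double as "training" labels here.
repeat_mae = mean_absolute_error(labels, repeat_baseline(observed))
constant_mae = mean_absolute_error(labels, constant_baseline(labels, n_samples=2))
print(repeat_mae, constant_mae)
```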
298
+
299
+ Looking at the following graph of the LSTM’s performance on test data,
300
+ it seems that the LSTM model is indeed making more accurate
301
+ predictions.
302
+
303
+ ![image](./docs/model-eval.png)
304
+ Fig: Samples of LSTM model’s performance on test dataset
305
+
306
+ Comparing the average Mean Absolute Error for all three models, the
307
+ following results were seen:
308
+
309
+ ![image](./docs/comparision.png)
310
+ Fig: Comparison of the LSTM model’s performance vs the two baseline
311
+ models
312
+
313
+ ## **Making Predictions / User Interface**
314
+
315
+ After building my model, I built a website that uses this model and
316
+ allows a user to check the model’s forecasts for tomorrow’s temperature
317
+ for Washington D.C. The website also shows its past predictions and
318
+ compares them with the actual temperatures for that day to give a sense
319
+ of how accurate the model actually is.
320
+
321
+ The website is available at
322
+ [**http://3.235.0.237/\#/dashboard**](http://3.235.0.237/#/dashboard).
323
+
324
+ ### **Technology Stack**
325
+
326
+ | **Backend**  | Flask, TensorFlow |
327
+ | :----------- | :---------------- |
328
+ | **Frontend** | Angular |
329
+ | **Infra** | AWS EC2, Docker |
330
+
331
+ ### **Architecture**
332
+
333
+ ![image](./docs/weather-requests.drawio.png)
334
+ Fig: Request/Response life cycle
335
+
336
+ ### **User Interface**
337
+
338
+ The user interface consists of a horizontal scrollable date picker that
339
+ allows the user to select a date (from the current date to 30 days
340
+ before the current date) to look at predictions for. The rightmost date
341
+ is the current date. For the current date, since the full day’s
342
+ actual data is not known yet, the predictions from the model are shown along with
343
+ the actual temperatures up to the time the site is accessed. The
344
+ actual temperatures for later hours are shown once observation data
345
+ is available from the weather API.  
346
+ Selecting a date plots the predicted temperature vs the actual
347
+ temperature for that date in a line graph. The red dotted line indicates
348
+ the predicted temperature, and the solid blue line shows the actual temperature
349
+ for that day. The shaded gray area indicates the absolute error in the
350
+ prediction. Apart from the graph, the page also contains a tabular view
351
+ of the same data at the lower half of the page.
352
+
353
+ ![image](./docs/graph-view.png)
354
+ Fig: Graph View
355
+
356
+ ![image](./docs/table-view.png)
357
+ Fig: Table View
358
+
359
+ ## **Conclusion**
360
+
361
+ The final model had a Mean Absolute Error of about 3.5, i.e. the
362
+ model’s predicted temperatures deviate by about \(3.5\) on average from the actual
363
+ temperature. Although the accuracy is not as high as I would have liked,
364
+ looking at the predicted vs the actual temperatures, it seems that the
365
+ model is capturing the trend (increasing or decreasing) relatively
366
+ accurately.
367
+ To improve the model further, I would like to add more important
368
+ features (perhaps from a completely new dataset) and see how it
369
+ performs. Although I experimented with the architecture of the neural
370
+ network, due to time constraints, I wasn’t able to experiment as much as
371
+ I would have liked. So, I definitely want to try different LSTM
372
+ architectures in the near future and see how they compare to my current
373
+ model.
374
+
375
+ ## **References**
376
+
377
+ - Educational Resources
378
+
379
+ - [Time series forecasting -
380
+ Tensorflow](https://www.tensorflow.org/tutorials/structured_data/time_series#multi-step_models)
381
+
382
+ - [Timeseries forecasting for weather prediction, Prabhanshu Attri,
383
+ Yashika Sharma, Kristi Takach, Falak
384
+ Shah](https://keras.io/examples/timeseries/timeseries_weather_forecasting/)
385
+
386
+ - [Single and MultiStep Temperature Time Series Forecasting for
387
+ Vilnius Using LSTM Deep Learning Model, Eligijus
388
+ Bujokas](https://towardsdatascience.com/single-and-multi-step-temperature-time-series-forecasting-for-vilnius-using-lstm-deep-learning-b9719a0009de)
389
+
390
+ - Dataset
391
+
392
+ - [API: Weather Underground](https://www.wunderground.com/)
393
+
394
+ - [Dataset
395
+ descriptions](https://www.worldcommunitygrid.org/lt/images/climate/The_Weather_Company_APIs.pdf)
396
+
397
+ - Project Artifacts
398
+
399
+ - [Website URL](http://3.235.0.237/#/dashboard)
400
+
401
+ - [Colab used for
402
+ training](https://colab.research.google.com/drive/1G6E8fT-viMPYw1fnWnBho2ONFfIyAXbK#scrollTo=ykqO9SfV9t7x)
403
+
404
+ - [Source code](https://github.com/00ber/ml-weather-prediction)