vancauwe committed
Commit aba41f2 · 1 Parent(s): e40a0fa

chore: documentation of refactor
docs/{hotdog.md → classifier_hotdog.md} RENAMED
File without changes
docs/dataset_cleaner.md ADDED
@@ -0,0 +1,3 @@
+ This module provides basic cleaning checks for the downloaded dataset; any row that does not have the expected types is discarded.
+
+ ::: src.dataset.cleaner
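For reference, a minimal sketch of this kind of type check (the column names lat, lon and date are taken from the rest of the repo; the actual checks live in `src/dataset/cleaner.py`):

```python
import pandas as pd

def drop_rows_with_bad_types(df: pd.DataFrame) -> pd.DataFrame:
    # Coerce columns to the expected dtypes; unparsable values become NaN/NaT.
    df = df.copy()
    df["lat"] = pd.to_numeric(df["lat"], errors="coerce")
    df["lon"] = pd.to_numeric(df["lon"], errors="coerce")
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    # Discard any row where coercion failed in at least one column.
    return df.dropna(subset=["lat", "lon", "date"]).reset_index(drop=True)
```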
docs/dataset_download.md ADDED
@@ -0,0 +1,3 @@
+ This module provides a download function for accessing the Hugging Face dataset.
+
+ ::: src.dataset.download
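A minimal sketch of such a download helper, assuming the standard `datasets` library; `dataset_id` and `data_files` stand in for the values configured in the repo:

```python
import pandas as pd
from datasets import load_dataset

def download_observations(dataset_id: str, data_files: str) -> pd.DataFrame:
    # load_dataset fetches (and caches) the configured files from the Hugging Face Hub.
    ds = load_dataset(dataset_id, data_files=data_files, split="train")
    return ds.to_pandas()
```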
docs/dataset_fake_data.md ADDED
@@ -0,0 +1,3 @@
+ This module generates fake data for the dataset.
+
+ ::: src.dataset.fake_data
docs/{hf_push_observations.md → dataset_hf_push_observations.md} RENAMED
@@ -1,3 +1,3 @@
  This module writes an observation into a temporary JSON file, in order to add this JSON file to the Saving-Willy Dataset in the Saving-Willy Hugging Face Community.
 
- ::: src.hf_push_observations
+ ::: src.dataset.hf_push_observations
docs/dataset_requests.md ADDED
@@ -0,0 +1,3 @@
+ This module provides functions for filtering the data by location and time, and for rendering the search options as well as the search results.
+
+ ::: src.dataset.requests
docs/{main.md → home.md} RENAMED
File without changes
docs/pages.md ADDED
@@ -0,0 +1 @@
+ The pages documented here are those with functional code. Some pages, such as About, Benchmarking, and Challenges, currently contain only text, markdown, and images, and do not require further documentation.
docs/pages_classifiers.md ADDED
@@ -0,0 +1,3 @@
+ This page displays the input mechanism for your images as well as all the classifiers that can run inference on them.
+
+ ::: src.pages.4_🔥_classifiers
docs/pages_gallery.md ADDED
@@ -0,0 +1,3 @@
+ This page displays all the cetacean species that can be identified by the classifiers.
+
+ ::: src.pages.7_🌊_gallery
docs/pages_logs.md ADDED
@@ -0,0 +1,3 @@
+ This page displays all the logs coming from user interactions with the platform and from back-end queries to the Hugging Face server.
+
+ ::: src.pages.📊_logs
docs/pages_map.md ADDED
@@ -0,0 +1,3 @@
+ This page displays the recorded observations of the dataset on a map.
+
+ ::: src.pages.2_🌍_map
docs/pages_requests.md ADDED
@@ -0,0 +1,3 @@
+ This page displays the data that can be requested. The default view covers all the data in the dataset. The filters in the sidebar allow narrowing down to different geographical zones as well as different time frames.
+
+ ::: src.pages.3_🤝_data requests
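As an illustration only, sidebar filters of this kind can be built with standard Streamlit widgets; the labels and session-state keys below are hypothetical, the real page defines its own:

```python
import datetime
import streamlit as st

# Hypothetical sidebar widgets; the real page stores its selections in st.session_state.
st.session_state.lat_range = st.sidebar.slider("Latitude", -90.0, 90.0, (-90.0, 90.0))
st.session_state.lon_range = st.sidebar.slider("Longitude", -180.0, 180.0, (-180.0, 180.0))
st.session_state.date_range = st.sidebar.date_input(
    "Time frame", value=(datetime.date(2020, 1, 1), datetime.date.today())
)
```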
docs/release_protocol.md ADDED
@@ -0,0 +1,13 @@
+ # Release Protocol
+
+ We use two Spaces on Hugging Face: one for development of the interface, and the main Space for showcasing the most recent stable release. The main branch is protected and deploys to the main Space when a PR is accepted.
+
+ We want to enforce a strict procedure for commits from the dev branch to the main branch when a PR is made to create a new release.
+
+ Dev-to-main PR checklist:
+
+ 1. Open a PR from the dev branch to the main branch
+ 2. Commit: change the dataset configuration to point to the main dataset
+ 3. Commit: change the naming in the README to avoid merge conflicts
+ 4. Ask for review
+ 5. Merge and make a new release of the code
docs/{fix_tabrender.md → utils_fix_tabrender.md} RENAMED
File without changes
docs/{grid_maker.md → utils_grid_maker.md} RENAMED
File without changes
docs/{metadata_handler.md → utils_metadata_handler.md} RENAMED
File without changes
mkdocs.yaml CHANGED
@@ -22,32 +22,40 @@ plugins:
 
  nav:
    - README: index.md
-   #- Quickstart:
-   #- Installation: installation.md
-   #- Usage: usage.md
-   - API:
-     - Main app: main.md
+   - Release Protocol: release_protocol.md
+   - How to contribute:
+     - Dev Notes: dev_notes.md
+   - App:
+     - Main App & Home Page: home.md
+     - Pages:
+       - Overall Notes: pages.md
+       - Map Page: pages_map.md
+       - Requests Page: pages_requests.md
+       - Classifiers Page: pages_classifiers.md
+       - Gallery Page: pages_gallery.md
+       - Logs: pages_logs.md
    - Modules:
-     - Data entry handling:
-       - Data input: input_handling.md
-       - Data extraction and validation: input_validator.md
+     - Data Entry Handling:
+       - Data Input: input_handling.md
+       - Data Extraction & Validation: input_validator.md
        - Data Object Class: input_observation.md
-     - Classifiers:
+     - Hugging Face Dataset:
+       - Download: dataset_download.md
+       - Cleaning: dataset_cleaner.md
+       - Push Observations to Dataset: dataset_hf_push_observations.md
+       - Data Requests: dataset_requests.md
+       - Fake data: dataset_fake_data.md
+     - Hugging Face Classifiers:
        - Cetacean Fluke & Fin Recognition: classifier_image.md
-       - (temporary) Hotdog Classifier: hotdog.md
-     - Hugging Face Integration:
-       - Push Observations to Dataset: hf_push_observations.md
+       - (temporary) Hotdog Classifier: classifier_hotdog.md
      - Map of observations: obs_map.md
      - Whale gallery: whale_gallery.md
      - Whale viewer: whale_viewer.md
      - Logging: st_logs.md
      - Utils:
-       - Tab-rendering fix (js): fix_tabrender.md
-       - Metadata handling: metadata_handler.md
-       - Grid maker: grid_maker.md
+       - Tab-rendering fix (js): utils_fix_tabrender.md
+       - Metadata handling: utils_metadata_handler.md
+       - Grid maker: utils_grid_maker.md
 
    - Development clutter:
      - Demo app: app.md
-
-   - How to contribute:
-     - Dev Notes: dev_notes.md
src/classifier/classifier_image.py CHANGED
@@ -7,7 +7,7 @@ g_logger = logging.getLogger(__name__)
  g_logger.setLevel(LOG_LEVEL)
 
  import whale_viewer as viewer
- from hf_push_observations import push_observations
+ from dataset.hf_push_observations import push_observations
  from utils.grid_maker import gridder
  from utils.metadata_handler import metadata2md
  from input.input_observation import InputObservation
src/dataset/cleaner.py CHANGED
@@ -1,6 +1,13 @@
  import pandas as pd
 
  def clean_lat_long(df): # Ensure lat and lon are numeric, coerce errors to NaN
+     """
+     Clean latitude and longitude columns in the DataFrame.
+     Args:
+         df (pd.DataFrame): DataFrame containing latitude and longitude columns.
+     Returns:
+         pd.DataFrame: DataFrame with cleaned latitude and longitude columns.
+     """
      df['lat'] = pd.to_numeric(df['lat'], errors='coerce')
      df['lon'] = pd.to_numeric(df['lon'], errors='coerce')
 
@@ -9,6 +16,13 @@ def clean_lat_long(df): # Ensure lat and lon are numeric, coerce errors to NaN
      return df
 
  def clean_date(df): # Ensure lat and lon are numeric, coerce errors to NaN
+     """
+     Clean the date column in the DataFrame.
+     Args:
+         df (pd.DataFrame): DataFrame containing a date column.
+     Returns:
+         pd.DataFrame: DataFrame with a cleaned date column.
+     """
      df['date'] = pd.to_datetime(df['date'], errors='coerce')
      # Drop rows with NaN in lat or lon
      df = df.dropna(subset=['date']).reset_index(drop=True)
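A hedged usage example of the two cleaners; the sample rows are invented, and it assumes, as the docstrings describe, that rows failing coercion are dropped:

```python
import pandas as pd
from dataset.cleaner import clean_lat_long, clean_date

raw = pd.DataFrame({
    "lat": ["12.5", "not-a-number"],
    "lon": ["-3.1", "8.0"],
    "date": ["2024-06-01", "2024-07-15"],
})
df = clean_date(clean_lat_long(raw))  # the row with an unparsable latitude is discarded
```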
src/dataset/download.py CHANGED
@@ -63,6 +63,12 @@ def try_download_dataset(dataset_id:str, data_files:str) -> dict:
      return metadata
 
  def get_dataset():
+     """
+     Downloads the dataset from Hugging Face and prepares it for use.
+     If the dataset is not available, it creates an empty DataFrame with the specified schema.
+     Returns:
+         pd.DataFrame: A DataFrame containing the dataset, or an empty DataFrame if the dataset is not available.
+     """
      # load/download data from huggingface dataset
      metadata = try_download_dataset(dataset_id, data_files)
 
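The empty-DataFrame fallback mentioned in the docstring might look like the sketch below; the column names are an assumption based on the schema used in `fake_data.py`:

```python
import pandas as pd

# Assumed observation schema: lat, lon, species, author_email, date.
EMPTY_SCHEMA = {
    "lat": "float64",
    "lon": "float64",
    "species": "object",
    "author_email": "object",
    "date": "datetime64[ns]",
}

def empty_observations() -> pd.DataFrame:
    # One typed, empty column per field so downstream cleaning still works.
    return pd.DataFrame({col: pd.Series(dtype=dt) for col, dt in EMPTY_SCHEMA.items()})
```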
src/dataset/fake_data.py CHANGED
@@ -4,6 +4,14 @@ import random
  from datetime import datetime, timedelta
 
  def generate_fake_data(df, num_fake):
+     """
+     Generate fake data for the dataset.
+     Args:
+         df (pd.DataFrame): Original DataFrame to append fake data to.
+         num_fake (int): Number of fake observations to generate.
+     Returns:
+         pd.DataFrame: DataFrame with the original and fake data.
+     """
 
      # Options for random generation
      species_options = [
@@ -51,7 +59,6 @@ def generate_fake_data(df, num_fake):
          end = datetime(end_year, 1, 1)
          return start + timedelta(days=random.randint(0, (end - start).days))
 
-     # Generate 20 new observations
      new_data = []
      for _ in range(num_fake):
          lat, lon = random_ocean_coord()
@@ -60,7 +67,6 @@ def generate_fake_data(df, num_fake):
          date = random_date()
          new_data.append([lat, lon, species, email, date])
 
-     # Create a DataFrame and append
      new_df = pd.DataFrame(new_data, columns=['lat', 'lon', 'species', 'author_email', 'date'])
      df = pd.concat([df, new_df], ignore_index=True)
      return df
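A small usage example; the column names match those used by `generate_fake_data`, and starting from an empty frame is just for illustration:

```python
import pandas as pd
from dataset.fake_data import generate_fake_data

df = pd.DataFrame(columns=["lat", "lon", "species", "author_email", "date"])
df = generate_fake_data(df, num_fake=100)  # appends 100 synthetic observations
```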
src/{hf_push_observations.py → dataset/hf_push_observations.py} RENAMED
File without changes
src/dataset/requests.py CHANGED
@@ -5,14 +5,27 @@ from dataset.download import get_dataset
  from dataset.fake_data import generate_fake_data
 
  def data_prep():
-     "Doing data prep"
+     """
+     Prepares the dataset for use in the application.
+     Downloads the dataset and cleans the data (and generates fake data if needed).
+     Returns:
+         pd.DataFrame: A DataFrame containing the cleaned dataset.
+     """
      df = get_dataset()
+     # uncomment to generate some fake data
      # df = generate_fake_data(df, 100)
      df = clean_lat_long(df)
      df = clean_date(df)
      return df
 
  def filter_data(df):
+     """
+     Filter the DataFrame based on user-selected ranges for latitude, longitude, and date.
+     Args:
+         df (pd.DataFrame): DataFrame to filter.
+     Returns:
+         pd.DataFrame: Filtered DataFrame.
+     """
      df_filtered = df[
          (df['date'] >= pd.to_datetime(st.session_state.date_range[0])) &
          (df['date'] <= pd.to_datetime(st.session_state.date_range[1])) &
@@ -24,7 +37,11 @@ def filter_data(df):
      return df_filtered
 
  def show_specie_author(df):
-     print(df)
+     """
+     Display a list of species and their corresponding authors with checkboxes.
+     Args:
+         df (pd.DataFrame): DataFrame containing species and author information.
+     """
      df = df.groupby(['species', 'author_email']).size().reset_index(name='counts')
      for specie in df["species"].unique():
          st.subheader(f"Species: {specie}")
@@ -35,6 +52,15 @@ def show_specie_author(df):
          st.session_state.checkbox_states[key] = st.checkbox(label, key=key)
 
  def show_new_data_view(df):
+     """
+     Filter the dataframe based on the localisation sliders and the time frame selected by the user.
+     Then show the results of the filtering, grouped by species and then by author.
+     Each author is matched to a checkbox component so the user can select the authors from whom they wish to request data.
+     Args:
+         df (pd.DataFrame): DataFrame to filter and display.
+     Returns:
+         pd.DataFrame: Filtered and grouped DataFrame.
+     """
      df = filter_data(df)
      df_ordered = show_specie_author(df)
      return df_ordered
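A sketch of how a page might wire these functions together; it assumes the sidebar has already populated `st.session_state.date_range`, the latitude/longitude ranges, and `checkbox_states`, as `filter_data` and `show_specie_author` expect:

```python
import streamlit as st
from dataset.requests import data_prep, show_new_data_view

if "checkbox_states" not in st.session_state:
    st.session_state.checkbox_states = {}

df = data_prep()        # download + clean the dataset
show_new_data_view(df)  # filter by the sidebar state and render species/author checkboxes
```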
src/pages/4_🔥_classifiers.py CHANGED
@@ -19,7 +19,7 @@ from input.input_handling import init_input_container_states, add_input_UI_elements
  from input.input_handling import dbg_show_observation_hashes
 
  from utils.workflow_ui import refresh_progress_display, init_workflow_viz, init_workflow_session_states
- from hf_push_observations import push_all_observations
+ from dataset.hf_push_observations import push_all_observations
 
  from classifier.classifier_image import cetacean_just_classify, cetacean_show_results_and_review, cetacean_show_results, init_classifier_session_states
  from classifier.classifier_hotdog import hotdog_classify
@@ -84,7 +84,6 @@ with tab_inference:
      g_logger.info(f"{st.session_state.observations}")
 
      df = pd.DataFrame([obs.to_dict() for obs in st.session_state.observations.values()])
-     #df = pd.DataFrame(st.session_state.observations, index=[0])
      # with tab_coords:
      #     st.table(df)
      # there doesn't seem to be any actual validation here?? TODO: find validator function (each element is validated by the input box, but is there something at the whole image level?)
  # there doesn't seem to be any actual validation here?? TODO: find validator function (each element is validated by the input box, but is there something at the whole image level?)